
Notebooks: Speedup MAST PanSTARRS light curve search at scale #165

Closed
jkrick opened this issue Nov 1, 2023 · 23 comments

Comments

@jkrick
Contributor

jkrick commented Nov 1, 2023

I have asked the MAST helpdesk for help with how to run this search efficiently at scale.

@jkrick jkrick self-assigned this Nov 1, 2023
@jkrick jkrick changed the title from "Speedup MAST PanSTARRS light curve search at scale" to "Notebooks: Speedup MAST PanSTARRS light curve search at scale" on Dec 4, 2023
@jkrick
Contributor Author

jkrick commented Jul 3, 2024

This should now be possible with a HiPSCat version of PanSTARRS in S3: s3://stpubdata/panstarrs/ps1/public/hipscat/

The goal of this task is to make a standalone function that can query PanSTARRS for a large number of targets (500k) in a reasonable amount of time (< 1 day).
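
For reference, something like the following should open that HiPSCat catalog lazily with lsdb. This is an untested sketch: the exact layout under the bucket prefix and the storage_options handling are assumptions.

```python
# Untested sketch: lazily open the PanSTARRS HiPSCat catalog from S3 with lsdb.
# The sub-path under the prefix and the storage_options handling are assumptions.
import lsdb

panstarrs = lsdb.read_hipscat(
    "s3://stpubdata/panstarrs/ps1/public/hipscat/",  # may need the object/detection table sub-path
    storage_options={"anon": True},  # public bucket, so anonymous access should work
)
print(panstarrs)  # lazy catalog; nothing is read until a compute/crossmatch call
```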

@troyraen
Contributor

troyraen commented Jul 4, 2024

This lsdb (hipscat) tutorial looks directly relevant: https://lsdb.readthedocs.io/en/stable/tutorials/pre_executed/ztf_bts-ngc.html

@troyraen
Contributor

If we don't want to use Dask here (which lsdb requires), we could still use the hipscat files with the same techniques we're using for the unWISE catalog in this notebook.
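
Roughly, that Dask-free route would look something like the sketch below: read individual HiPSCat parquet partitions straight from S3 with pyarrow. The partition path layout, the region, and the column names are placeholders, not checked against the actual catalog.

```python
# Placeholder sketch of the Dask-free route: read individual HiPSCat parquet
# partitions from S3 with pyarrow, as the unWISE section of the notebook does.
import pyarrow.dataset as ds
import pyarrow.fs as pafs

s3 = pafs.S3FileSystem(anonymous=True, region="us-east-1")  # region is a guess
partition = ds.dataset(
    "stpubdata/panstarrs/ps1/public/hipscat/",  # append the Norder/Npix path for the HEALPix pixel of interest
    filesystem=s3,
    format="parquet",
)
table = partition.to_table(columns=["objID", "raMean", "decMean"])  # illustrative columns
```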

@jkrick
Contributor Author

jkrick commented Sep 16, 2024

@troyraen Is there a reason not to use Dask?

@troyraen
Contributor

I don't know exactly. We should probably try it and see what happens. It may not work well with the parallelized notebook cell.

@jkrick
Contributor Author

jkrick commented Sep 19, 2024

I can confirm that running dask.distributed.Client inside of multiprocessing crashes the kernel. I can see how this would cause problems if multiprocessing tries to make many distributed clients. Instead of troubleshooting that part of the problem, I am going to cheat a tiny bit and move panstarrs_get_lightcurves outside of the multiprocessing.Pool call. The Dask client is already doing the multiprocessing after all, so the cell is still parallelizing the work.
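
The restructured cell looks roughly like this; query_one_archive, other_archives, and sample_table are stand-ins for the notebook's real names and signatures.

```python
# Simplified version of the restructured cell; names are stand-ins.
import multiprocessing as mp

# The other archive queries keep using the multiprocessing pool...
with mp.Pool(processes=6) as pool:
    results = pool.starmap(
        query_one_archive,
        [(archive, sample_table) for archive in other_archives],
    )

# ...while the PanSTARRS query runs after the pool, on its own, because it
# already parallelizes internally via a dask.distributed Client.
panstarrs_df = panstarrs_get_lightcurves(sample_table)
```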

@troyraen
Contributor

troyraen commented Sep 19, 2024

I can confirm that running dask.distributed.Client inside of multiprocessing crashes the kernel.

Not unexpected but good to know for sure, thanks for trying it.

I am going to cheat a tiny bit and move panstarrs_get_lightcurves outside of the multiprocessing.Pool call. The Dask client is already doing the multiprocessing after all, so the cell is still parallelizing the work.

Sounds like a good thing to try. If PanSTARRS+Dask works well this way, we might want to consider switching the other ones from multiprocessing to Dask so that they can all run at the same time again.

@troyraen
Contributor

WIP #344

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Some comments and questions on benchmarking the PanSTARRS query with lsdb on a HiPSCat catalog in S3 vs. the PanSTARRS query with a for loop over all sources on the API. The tests I ran used 300, 3,000, and 30,000 sources.

On the ISP, the tests performed as I would have expected: lsdb was slightly slower on the 300 sources, about the same on the 3,000 sources, and much faster on the 30,000 sources. Running the same tests on Fornax, I had to go to the "approval only" server to keep the kernel from crashing. Once there, lsdb and the API were about the same speed for the small and medium samples, but lsdb was much slower on the largest sample. An incomplete table of speeds is below. On both the ISP and Fornax I didn't actually run the 30,000-source sample on the API; instead I took the average speed per iteration over the first few minutes.

The question is, do we switch the PanSTARRS query function to lsdb, or keep the API function? I am leaning towards lsdb because data in the cloud is the direction we are moving, but I'd like to hear other opinions on this, or explanations of why Fornax didn't show the expected speed improvement on the largest sample. We can also run more tests; I just thought I would gather some thoughts before heading further into this rabbit hole that I do not understand. @troyraen @bsipocz @zoghbi-a

[Image: incomplete table of lsdb vs. API query speeds]

@zoghbi-a
Contributor

zoghbi-a commented Oct 8, 2024

Thanks @jkrick for running the tests. I want to investigate these more. I have two questions:

  • Where can I get the code you used? Is it in the panstarrs_hipscat branch?
  • Are the resources (CPU, memory) used on the ISP and Fornax the same?

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Thanks for looking into it @zoghbi-a

  1. Yes, that is the branch. It's not fully cleaned up, but I think you will figure it out without too much trouble. You need to run the pip install cell, the import cell, and cell 1.4 to set up the dataframe before getting to the PanSTARRS functions.
  2. The ISP has 20 CPUs and 240 GiB of memory (so less than the 128-core Fornax server but more than the 16-core one). When I watched the top output from the lsdb run on the Fornax "approval only" server, the CPU was pegged at 100% (or greater), but my limited understanding of Dask is that it does that regardless of how many CPUs are available.

@troyraen
Contributor

troyraen commented Oct 8, 2024

My guess is that Dask is only using one worker by default, but if we start a Dask cluster first it will use more. Checking now...

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

My guess is that Dask is only using one worker by default, but if we start a Dask cluster first it will use more.

I do start a Dask cluster in panstarrs_get_lightcurves_lsdb. At least, that is my understanding of using with Client() from dask.distributed.
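
Schematically, the function does something like this; the crossmatch keyword arguments and the catalog-path argument are simplified from the notebook.

```python
# Schematic of panstarrs_get_lightcurves_lsdb; simplified from the notebook.
from dask.distributed import Client
import lsdb

def panstarrs_get_lightcurves_lsdb(sample_lsdb, catalog_path):
    # `with Client()` starts a local Dask cluster (default worker/thread counts)
    # and shuts it down when the block exits.
    with Client() as client:
        panstarrs_object = lsdb.read_hipscat(catalog_path)
        matched = panstarrs_object.crossmatch(sample_lsdb, radius_arcsec=1.0)
        return matched.compute()  # the actual work happens here, on the cluster
```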

@troyraen
Contributor

troyraen commented Oct 8, 2024

Yes, sorry, I see that now and am also running it and seeing multiple workers. My next guess is that it will be faster to switch the catalog order in the crossmatch call: panstarrs_object.crossmatch(sample_lsdb,...) -> sample_lsdb.crossmatch(panstarrs_object,...). I will try that next. (I need to read their docs to see if they mention it, but I've noticed in some of LINCC's demos that they put the smaller catalog on the left. I experimented with this a tiny bit last night and that did seem to be faster.)
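
In code, the proposed swap is just the following (keyword arguments illustrative):

```python
# before: big catalog on the left
matched = panstarrs_object.crossmatch(sample_lsdb, radius_arcsec=1.0)
# after: small sample on the left, as in LINCC's demos
matched = sample_lsdb.crossmatch(panstarrs_object, radius_arcsec=1.0)
```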

@troyraen
Contributor

troyraen commented Oct 8, 2024

Side note: if you start the Dask cluster from the console first, you can see some nice charts like the one below. I can show you how if you want.

[Screenshot: Dask dashboard charts arranged in the console]

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

I always thought the rule of thumb was to put the smaller catalog on the right, so that would definitely be good to know/document for lsdb. But regardless of left/right, I think we would still expect the queries to have the same relative speed on Fornax as on the ISP; i.e., I don't think switching their order is going to help explain my problem, although it might be an overall speedup.

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

And I don't know what "start a Dask cluster from the console" means, so yes please, I would like to know that.

@troyraen
Contributor

troyraen commented Oct 8, 2024

I always thought the rule of thumb was to put the smaller catalog on the right, so that would definitely be good to know/document for lsdb.

That's what I thought too, and was thus surprised to see it the opposite way in LINCC's notebook. I tried switching them here... it was not obviously faster, but I still haven't done anything like proper testing. I'll educate myself more about it at some point, but not today.

OTOH, I doubled the number of workers and that did seem to cut the runtime roughly in half (again though, it was nothing like a proper test). My hunch is that it would be better to use n_workers = n_cpus and then increase the number of threads per worker (again, I need to educate myself more about this). But either way, this seems promising. It's not using very much of the total available CPU and RAM with the default number of workers and threads, but it does seem to respond well to increasing beyond the default. We ought to be able to figure out what factor to increase the number of workers (or threads) by so that it uses most of the available CPU even on different server sizes.
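
Something like this for sizing the cluster to the machine instead of relying on Dask's defaults (untested; threads_per_worker=2 in particular is a guess to tune after profiling):

```python
# Untested sizing sketch: one worker per CPU instead of Dask's defaults.
import os
from dask.distributed import Client

n_cpus = os.cpu_count() or 4
client = Client(
    n_workers=n_cpus,
    threads_per_worker=2,
    memory_limit="auto",  # split the available RAM evenly across workers
)
print(client)
```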

Bottom line, I agree that this LSDB/Dask call is probably the way to go both because it's the way we're headed in general and because I think we can figure out how to make this call faster than the original one at scale.

@troyraen
Contributor

troyraen commented Oct 8, 2024

and I don't know what "start a dask cluster from console" means, so yes please, I would like to know that.

I don't know a ton, but I happened to mess around with it last night following nasa-fornax/fornax-images#6 to get ready for a demo this morning. Steps I took to get it going are:

  • On the left-hand side of the console, click the icon shown here: [screenshot of the console sidebar icon]
  • Scroll down and click "+NEW" next to "CLUSTERS".
  • Right above that, you'll see a bunch of yellow boxes/buttons appear after the cluster starts. Click on any of them and a new tab will open showing the selected chart. You can drag and drop the tab to move it somewhere else on the screen. In the screenshot above, I have several open and arranged them next to each other at the bottom.
  • Start or restart the notebook kernel so that it recognizes the cluster.
  • Then you don't need to explicitly call Client() in your code. The issue linked above actually says that you shouldn't, though I left it in your code and it seems fine.

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Bottom line, I agree that this LSDB/Dask call is probably the way to go both because it's the way we're headed in general and because I think we can figure out how to make this call faster than the original one at scale.

Thanks for this comment. What I suggest is that I clean up the current PR with the lsdb function and then you can either suggest some of the above speedups in a review, or do a second PR with those speedups after mine goes through.

As for the steps above to get plots from a Dask client, can you please add those to fornax-documentation? Even if they are not complete, it's a good starting point.

@troyraen
Contributor

troyraen commented Oct 8, 2024

Yes, sounds good. And I'm still interested to hear from @zoghbi-a and/or @bsipocz, who may have better-informed ideas than I do at this point.

@troyraen
Contributor

troyraen commented Oct 8, 2024

As for the steps above to get plotting on a dask client, can you please add those to fornax-documentation?

nasa-fornax/fornax-documentation#52

@jkrick
Contributor Author

jkrick commented Oct 24, 2024

Closed with #344.

@jkrick jkrick closed this as completed Oct 24, 2024