Notebooks: Speedup MAST PanSTARRS light curve search at scale #165
Comments
This should now be possible with a hipscat version of PanSTARRS in S3: s3://stpubdata/panstarrs/ps1/public/hipscat/. The goal of this task is to make a standalone function that can query PanSTARRS for a large number of targets (500k) in a reasonable amount of time (< 1 day).
This lsdb (hipscat) tutorial looks directly relevant: https://lsdb.readthedocs.io/en/stable/tutorials/pre_executed/ztf_bts-ngc.html
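For context, here is a minimal sketch (untested) of that approach against the S3 path above. The target column names, the 1-arcsec radius, and the need for a sub-catalog path are assumptions, and lsdb's function and parameter names have shifted between versions, so check the current docs:

```python
import lsdb
import pandas as pd

# Target list: a small DataFrame with RA/Dec in degrees (column names assumed).
targets_df = pd.DataFrame({"ra": [150.1, 210.5], "dec": [2.2, -11.3]})
targets = lsdb.from_dataframe(targets_df, ra_column="ra", dec_column="dec")

# Lazily open the PanSTARRS hipscat catalog in S3. A sub-catalog path under
# this prefix (object vs. detection table) and anonymous-access storage
# options may be needed.
ps1 = lsdb.read_hipscat("s3://stpubdata/panstarrs/ps1/public/hipscat/")

# Cross-match the target list against PanSTARRS. Nothing is read from S3
# until compute() is called, at which point Dask does the work in parallel.
matched = targets.crossmatch(ps1, radius_arcsec=1.0)
lightcurve_sources = matched.compute()   # pandas DataFrame of matched rows
```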
If we don't want to use Dask here (which lsdb requires), we could still use the hipscat files with the same techniques we're using for the unWISE catalog in this notebook.
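And a very rough sketch (untested) of the no-Dask route, in the spirit of the unWISE approach: list the hipscat parquet partitions in S3 and read only the ones whose HEALPix pixel overlaps the targets. The region, the file layout details, and the pixel selection are all assumptions to verify against the actual catalog:

```python
import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem(anonymous=True, region="us-east-1")   # region is a guess
prefix = "stpubdata/panstarrs/ps1/public/hipscat"

# Partition files follow the hipscat layout, e.g.
# .../Norder=5/Dir=50000/Npix=55123.parquet
files = [
    info.path
    for info in s3.get_file_info(fs.FileSelector(prefix, recursive=True))
    if info.path.endswith(".parquet")
]

# In a real run, keep only the files whose Npix matches the HEALPix pixels of
# the targets (computed with healpy at the matching order); placeholder here.
wanted = files[:1]
tables = [pq.read_table(path, filesystem=s3) for path in wanted]
```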
@troyraen Is there a reason not to use Dask?
I don't know exactly. We should probably try it and see what happens. It may not work well with the parallelized notebook cell.
I can confirm that running dask.distributed.Client inside of multiprocessing crashes the kernel. I could see how this would cause problems if multiprocessing tries to make many distributed clients. Instead of troubleshooting this part of the problem, I am going to cheat a tiny bit and move panstarrs_get_lightcurves outside of the multiprocessing.Pool call. The Dask client is already doing the multiprocessing, after all, so the cell is still parallelizing the work.
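In sketch form, the restructuring described above looks something like the following. panstarrs_get_lightcurves is the function named in this thread; the other helper names, the pool size, and sample_table are placeholders standing in for the notebook's real code.

```python
import multiprocessing as mp

def ztf_get_lightcurves(sample_table):        # stand-in for a notebook helper
    return "ztf lightcurves"

def wise_get_lightcurves(sample_table):       # stand-in for a notebook helper
    return "wise lightcurves"

def panstarrs_get_lightcurves(sample_table):  # the lsdb/Dask version
    # Starts its own dask.distributed.Client internally (see later comments).
    return "panstarrs lightcurves"

if __name__ == "__main__":
    sample_table = None   # placeholder for the notebook's target table

    # Archives that use plain multiprocessing stay inside the pool.
    with mp.Pool(processes=2) as pool:
        async_results = [
            pool.apply_async(func, (sample_table,))
            for func in (ztf_get_lightcurves, wise_get_lightcurves)
        ]
        lightcurves = [res.get() for res in async_results]

    # The PanSTARRS call moves outside the pool: creating a
    # dask.distributed.Client inside a multiprocessing worker crashes the
    # kernel, and Dask already parallelizes this call on its own.
    lightcurves.append(panstarrs_get_lightcurves(sample_table))
```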
Not unexpected but good to know for sure, thanks for trying it.
Sounds like a good thing to try. If PanSTARRS+Dask works well this way, we might want to consider switching the other ones from multiprocessing to Dask so that they can all run at the same time again.
WIP #344
Some comments and questions on benchmarking the PanSTARRS query with lsdb on the hipscat catalog in S3 vs. the PanSTARRS query that loops over all sources with the API. The tests used 300, 3,000, and 30,000 sources.

On the ISP, the tests went as I would have expected: lsdb was slightly slower on 300 sources, about the same on 3,000, and much faster on 30,000. Running the same tests on Fornax, I had to go to the "approval only" server to keep the kernel from crashing. Once there, lsdb and the API were about the same speed for the small and medium samples, but lsdb was much slower on the largest sample. Incomplete table of speeds below. On both the ISP and Fornax I didn't actually run the 30,000 sample on the API; instead I took the average speed per iteration over the first few minutes.

The question is: do we switch the PanSTARRS query function to lsdb, or keep the API function? I am leaning towards lsdb because data in the cloud is the future we are moving towards, but I'd like to hear other opinions, or explanations of why Fornax didn't show the expected speed improvement on the largest sample. We can also run more tests; I just thought I would gather some thoughts before heading further into this rabbit hole that I do not understand. @troyraen @bsipocz @zoghbi-a
Thanks @jkrick for running the tests. I want to investigate these more. I have two questions:
Thanks for looking into it @zoghbi-a
My guess is that Dask is only using one worker by default, but if we start a Dask cluster first it will use more. Checking now... |
I do start a dask cluster in panstarrs_get_lightcurves_lsdb. At least that is my understanding of using
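For reference, a minimal sketch of starting a Dask cluster inside panstarrs_get_lightcurves_lsdb; everything here beyond the function name is an assumption:

```python
from dask.distributed import Client

def panstarrs_get_lightcurves_lsdb(sample_table, radius_arcsec=1.0):
    # Client() with no arguments spins up a LocalCluster with default worker
    # and thread counts -- which is why "how many workers is Dask actually
    # using?" is worth checking on the dashboard.
    client = Client()
    try:
        # ... lsdb.read_hipscat(...), crossmatch(...), .compute() go here ...
        results = None
    finally:
        client.close()
    return results
```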
Yes, sorry, I see that now and am also running it and seeing multiple workers. My next guess is that it will be faster to switch the catalog order in the cross match call |
I always thought the rule of thumb was to put the smaller catalog on the right, so that would definitely be good to know/document for lsdb. But regardless of left/right, I think we still expect the queries to have the same relative speed on Fornax as on the ISP; i.e., I don't think switching their order is going to help explain my problem, although it might be an overall speedup.
And I don't know what "start a dask cluster from console" means, so yes please, I would like to know that.
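To make the left-vs-right question concrete, the two orderings look like this in lsdb (reusing the `targets` and `ps1` catalogs from the sketch earlier in the thread, which are themselves assumptions). Note the calls are not strictly symmetric, since matching is done per row of the left catalog:

```python
# Small catalog on the left:
matched_a = targets.crossmatch(ps1, radius_arcsec=1.0, suffixes=("_tgt", "_ps1"))

# Small catalog on the right:
matched_b = ps1.crossmatch(targets, radius_arcsec=1.0, suffixes=("_ps1", "_tgt"))

# Nothing is read from S3 until compute(); time these separately to compare.
result_a = matched_a.compute()
```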
That's what I thought too, and was thus surprised to see it the opposite way in LINCC's notebook. I tried switching them here... it was not obviously faster, but I still haven't done anything like proper testing. I'll educate myself more about it at some point, but not today.

OTOH, I doubled the number of workers and that did seem to cut the runtime roughly in half (again, nothing like a proper test). My hunch is that it would be better to use n_workers = n_cpus and then increase the number of threads per worker (again, I need to educate myself more about this). But either way, this seems promising. It's not using very much of the total available CPU and RAM with the default number of workers and threads, but it does seem to respond well to increasing beyond the default. We ought to be able to figure out what factor to increase the number of workers (or threads) by so that it uses most of the available CPU even with different server sizes.

Bottom line: I agree that this LSDB/Dask call is probably the way to go, both because it's the direction we're headed in general and because I think we can make this call faster than the original one at scale.
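For reference, the kind of tuning being described, as a sketch using dask.distributed; the specific numbers are assumptions to benchmark, not recommendations:

```python
import os
from dask.distributed import Client

n_cpus = os.cpu_count()

# One worker process per CPU, a small number of threads each; memory split
# automatically across workers. All values here are starting points to test.
client = Client(
    n_workers=n_cpus,
    threads_per_worker=2,
    memory_limit="auto",
)
print(client.dashboard_link)   # watch CPU/RAM utilization during a test run
# ... run the lsdb crossmatch / compute() here ...
client.close()
```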
I don't know a ton, but I happened to mess around with it last night following nasa-fornax/fornax-images#6 to get ready for a demo this morning. Steps I took to get it going are:
Thanks for this comment. What I suggest is that I clean up the current PR with the lsdb function, and then you can either suggest some of the above speedups in a review or do a second PR with those speedups after mine goes through. As for the steps above for getting plotting on a dask client, can you please add those to fornax-documentation? Even if they are not complete, it's a good starting point.
closed with #344 |
I have asked the MAST helpdesk for help with how to run this search efficiently at scale.