
Notebooks: Speedup MAST PanSTARRS light curve search at scale #165

Closed
jkrick opened this issue Nov 1, 2023 · 23 comments

Comments

@jkrick
Contributor

jkrick commented Nov 1, 2023

I have asked the MAST helpdesk for help with how to run this search efficiently at scale.

@jkrick jkrick self-assigned this Nov 1, 2023
@jkrick jkrick changed the title from "Speedup MAST PanSTARRS light curve search at scale" to "Notebooks: Speedup MAST PanSTARRS light curve search at scale" on Dec 4, 2023
@jkrick
Contributor Author

jkrick commented Jul 3, 2024

This should now be possible with a HiPSCat version of PanSTARRS in S3: s3://stpubdata/panstarrs/ps1/public/hipscat/

The goal of this task is to make a standalone function that can query PanSTARRS for a large number of targets (500k) in a reasonable amount of time (< 1 day).
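
For reference, something like the following should open that HiPSCat catalog lazily with lsdb. This is an untested sketch: the exact layout under the bucket prefix and the storage_options handling are assumptions.

```python
# Untested sketch: lazily open the PanSTARRS HiPSCat catalog from S3 with lsdb.
# The sub-path under the prefix and the storage_options handling are assumptions.
import lsdb

panstarrs = lsdb.read_hipscat(
    "s3://stpubdata/panstarrs/ps1/public/hipscat/",  # may need the object/detection table sub-path
    storage_options={"anon": True},  # public bucket, so anonymous access should work
)
print(panstarrs)  # lazy catalog; nothing is read until a compute/crossmatch call
```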

@troyraen
Contributor

troyraen commented Jul 4, 2024

This lsdb (hipscat) tutorial looks directly relevant: https://lsdb.readthedocs.io/en/stable/tutorials/pre_executed/ztf_bts-ngc.html

@troyraen
Contributor

If we don't want to use Dask here (which lsdb requires), we could still use the hipscat files with the same techniques we're using for the unWISE catalog in this notebook.
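
Roughly, that Dask-free route would look something like the sketch below: read individual HiPSCat parquet partitions straight from S3 with pyarrow. The partition path layout, the region, and the column names are placeholders, not checked against the actual catalog.

```python
# Placeholder sketch of the Dask-free route: read individual HiPSCat parquet
# partitions from S3 with pyarrow, as the unWISE section of the notebook does.
import pyarrow.dataset as ds
import pyarrow.fs as pafs

s3 = pafs.S3FileSystem(anonymous=True, region="us-east-1")  # region is a guess
partition = ds.dataset(
    "stpubdata/panstarrs/ps1/public/hipscat/",  # append the Norder/Npix path for the HEALPix pixel of interest
    filesystem=s3,
    format="parquet",
)
table = partition.to_table(columns=["objID", "raMean", "decMean"])  # illustrative columns
```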

@jkrick
Contributor Author

jkrick commented Sep 16, 2024

@troyraen Is there a reason not to use Dask?

@troyraen
Contributor

I don't know exactly. We should probably try it and see what happens. It may not work well with the parallelized notebook cell.

@jkrick
Contributor Author

jkrick commented Sep 19, 2024

I can confirm that running dask.distributed.Client inside of multiprocessing crashes the kernel. I can see how this would cause problems if multiprocessing tries to make many distributed clients. Instead of troubleshooting that part of the problem, I am going to cheat a tiny bit and move panstarrs_get_lightcurves outside of the multiprocessing.Pool call. The Dask client is already doing the multiprocessing after all, so the cell is still parallelizing the work.
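
The restructured cell looks roughly like this; query_one_archive, other_archives, and sample_table are stand-ins for the notebook's real names and signatures.

```python
# Simplified version of the restructured cell; names are stand-ins.
import multiprocessing as mp

# The other archive queries keep using the multiprocessing pool...
with mp.Pool(processes=6) as pool:
    results = pool.starmap(
        query_one_archive,
        [(archive, sample_table) for archive in other_archives],
    )

# ...while the PanSTARRS query runs after the pool, on its own, because it
# already parallelizes internally via a dask.distributed Client.
panstarrs_df = panstarrs_get_lightcurves(sample_table)
```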

@troyraen
Contributor

troyraen commented Sep 19, 2024

I can confirm that running dask.distributed.Client inside of multiprocessing crashes the kernel.

Not unexpected but good to know for sure, thanks for trying it.

I am going to cheat a tiny bit and move panstarrs_get_lightcurves outside of the multiprocessing.Pool call. The Dask client is already doing the multiprocessing after all, so the cell is still parallelizing the work.

Sounds like a good thing to try. If PanSTARRS+Dask works well this way, we might want to consider switching the other ones from multiprocessing to Dask so that they can all run at the same time again.

@troyraen
Contributor

WIP #344

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Some comments and questions on benchmarking the PanSTARRS query with lsdb on a HiPSCat catalog in S3 vs. the PanSTARRS query with a for loop over all sources on the API. The tests I ran used 300, 3,000, and 30,000 sources.

On the ISP, the tests performed as I would have expected: lsdb was slightly slower on the 300 sources, about the same on the 3,000 sources, and much faster on the 30,000 sources. Running the same tests on Fornax, I had to go to the "approval only" server to keep the kernel from crashing. Once there, lsdb and the API were about the same speed for the small and medium samples, but lsdb was much slower on the largest sample. An incomplete table of speeds is below. On both the ISP and Fornax I didn't actually run the 30,000-source sample on the API; instead I took the average speed per iteration over the first few minutes.

The question is, do we switch the PanSTARRS query function to lsdb, or keep the API function? I am leaning towards lsdb because data in the cloud is the direction we are moving, but I'd like to hear other opinions on this, or explanations of why Fornax didn't show the expected speed improvement on the largest sample. We can also run more tests; I just thought I would gather some thoughts before heading further into this rabbit hole that I do not understand. @troyraen @bsipocz @zoghbi-a

[Image: incomplete table of lsdb vs. API query speeds]

@zoghbi-a
Contributor

zoghbi-a commented Oct 8, 2024

Thanks @jkrick for running the tests. I want to investigate these more. I have two questions:

  • Where can I get the code you used? Is it in the panstarrs_hipscat branch?
  • Are the resources (CPU, memory) used on the ISP and Fornax the same?

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Thanks for looking into it @zoghbi-a

  1. Yes, that is the branch. It's not fully cleaned up, but I think you will figure it out without too much trouble. You need to run the pip install cell, the import cell, and cell 1.4 to set up the dataframe before getting to the PanSTARRS functions.
  2. The ISP has 20 CPUs and 240 GiB of memory (so less than the 128-core Fornax server but more than the 16-core one). When I watched the top output from the lsdb run on the Fornax "approval only" server, the CPU was pegged at 100% (or greater), but my limited understanding of Dask is that it does that regardless of how many CPUs are available.

@troyraen
Contributor

troyraen commented Oct 8, 2024

My guess is that Dask is only using one worker by default, but if we start a Dask cluster first it will use more. Checking now...

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

My guess is that Dask is only using one worker by default, but if we start a Dask cluster first it will use more.

I do start a Dask cluster in panstarrs_get_lightcurves_lsdb. At least, that is my understanding of using with Client() from dask.distributed.
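
Schematically, the function does something like this; the crossmatch keyword arguments and the catalog-path argument are simplified from the notebook.

```python
# Schematic of panstarrs_get_lightcurves_lsdb; simplified from the notebook.
from dask.distributed import Client
import lsdb

def panstarrs_get_lightcurves_lsdb(sample_lsdb, catalog_path):
    # `with Client()` starts a local Dask cluster (default worker/thread counts)
    # and shuts it down when the block exits.
    with Client() as client:
        panstarrs_object = lsdb.read_hipscat(catalog_path)
        matched = panstarrs_object.crossmatch(sample_lsdb, radius_arcsec=1.0)
        return matched.compute()  # the actual work happens here, on the cluster
```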

@troyraen
Contributor

troyraen commented Oct 8, 2024

Yes, sorry, I see that now and am also running it and seeing multiple workers. My next guess is that it will be faster to switch the catalog order in the crossmatch call: panstarrs_object.crossmatch(sample_lsdb,...) -> sample_lsdb.crossmatch(panstarrs_object,...). I will try that next. (I need to read their docs to see if they mention it, but I've noticed in some of LINCC's demos that they put the smaller catalog on the left. I experimented with this a tiny bit last night and that did seem to be faster.)
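
In code, the proposed swap is just the following (keyword arguments illustrative):

```python
# before: big catalog on the left
matched = panstarrs_object.crossmatch(sample_lsdb, radius_arcsec=1.0)
# after: small sample on the left, as in LINCC's demos
matched = sample_lsdb.crossmatch(panstarrs_object, radius_arcsec=1.0)
```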

@troyraen
Contributor

troyraen commented Oct 8, 2024

Side note: if you start the Dask cluster from the console first, you can see some nice charts like the one below. I can show you how if you want.

[Screenshot: Dask dashboard charts arranged in the console]

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

I always thought the rule of thumb was to put the smaller catalog on the right, so that would definitely be good to know/document for lsdb. But regardless of left/right, I think we would still expect the queries to have the same relative speed on Fornax as on the ISP; i.e., I don't think switching their order is going to help explain my problem, although it might be an overall speedup.

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

And I don't know what "start a Dask cluster from the console" means, so yes please, I would like to know that.

@troyraen
Contributor

troyraen commented Oct 8, 2024

I always thought the rule of thumb was to put the smaller catalog on the right, so that would definitely be good to know/document for lsdb.

That's what I thought too, and was thus surprised to see it the opposite way in LINCC's notebook. I tried switching them here... it was not obviously faster, but I still haven't done anything like proper testing. I'll educate myself more about it at some point, but not today.

OTOH, I doubled the number of workers and that did seem to cut the runtime roughly in half (again though, it was nothing like a proper test). My hunch is that it would be better to use n_workers = n_cpus and then increase the number of threads per worker (again, I need to educate myself more about this). But either way, this seems promising. It's not using very much of the total available CPU and RAM with the default number of workers and threads, but it does seem to respond well to increasing beyond the default. We ought to be able to figure out what factor to increase the number of workers (or threads) by so that it uses most of the available CPU even on different server sizes.
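
Something like this for sizing the cluster to the machine instead of relying on Dask's defaults (untested; threads_per_worker=2 in particular is a guess to tune after profiling):

```python
# Untested sizing sketch: one worker per CPU instead of Dask's defaults.
import os
from dask.distributed import Client

n_cpus = os.cpu_count() or 4
client = Client(
    n_workers=n_cpus,
    threads_per_worker=2,
    memory_limit="auto",  # split the available RAM evenly across workers
)
print(client)
```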

Bottom line, I agree that this LSDB/Dask call is probably the way to go both because it's the way we're headed in general and because I think we can figure out how to make this call faster than the original one at scale.

@troyraen
Contributor

troyraen commented Oct 8, 2024

and I don't know what "start a dask cluster from console" means, so yes please, I would like to know that.

I don't know a ton, but I happened to mess around with it last night following nasa-fornax/fornax-images#6 to get ready for a demo this morning. Steps I took to get it going are:

  • On the left-hand side of the console, click the icon shown here: [screenshot of the console sidebar icon]
  • Scroll down and click "+NEW" next to "CLUSTERS".
  • Right above that, you'll see a bunch of yellow boxes/buttons appear after the cluster starts. Click on any of them and a new tab will open showing the selected chart. You can drag and drop the tab to move it somewhere else on the screen. In the screenshot above, I have several open and arranged them next to each other at the bottom.
  • Start or restart the notebook kernel so that it recognizes the cluster.
  • Then you don't need to explicitly call Client() in your code. The issue linked above actually says that you shouldn't, though I left it in your code and it seems fine.

@jkrick
Contributor Author

jkrick commented Oct 8, 2024

Bottom line, I agree that this LSDB/Dask call is probably the way to go both because it's the way we're headed in general and because I think we can figure out how to make this call faster than the original one at scale.

Thanks for this comment. What I suggest is that I clean up the current PR with the lsdb function and then you can either suggest some of the above speedups in a review, or do a second PR with those speedups after mine goes through.

As for the steps above to get plots from a Dask client, can you please add those to fornax-documentation? Even if they are not complete, it's a good starting point.

@troyraen
Contributor

troyraen commented Oct 8, 2024

Yes, sounds good. And I'm still interested to hear from @zoghbi-a and/or @bsipocz, who may have better-informed ideas than I do at this point.

@troyraen
Contributor

troyraen commented Oct 8, 2024

As for the steps above to get plotting on a dask client, can you please add those to fornax-documentation?

nasa-fornax/fornax-documentation#52

@jkrick
Contributor Author

jkrick commented Oct 24, 2024

Closed with #344.

@jkrick jkrick closed this as completed Oct 24, 2024