panstarrs_get_lightcurve using lsdb #344

jkrick · 2024-09-23T22:02:37Z

I'm having some trouble catching exceptions for FileNotFound in the call to lsdb's crossmatch with the Panstarrs catalog. It would be nice if we could catch this error so that the crossmatch doesn't fail when a file is missing. Ideally, if one of the objects fails the crossmatch, the crossmatch would still finish for the other objects, and the final result would just look like no match was found for the trouble-causing object/file.
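The behavior asked for above can be sketched in plain Python. This is only an illustration of the "treat a missing file as no match" idea, not lsdb's API: `crossmatch_tolerant`, `crossmatch_one`, and `fake_crossmatch` are hypothetical names, and it assumes the crossmatch can be called one object at a time, which may not match how lsdb schedules work across partitions.

```python
import pandas as pd


def crossmatch_tolerant(object_ids, crossmatch_one):
    """Call crossmatch_one(objectid) per object; a missing file counts as no match.

    crossmatch_one is a hypothetical callable that returns a DataFrame of
    matches for a single object, or raises FileNotFoundError.
    """
    frames, failed = [], []
    for objectid in object_ids:
        try:
            frames.append(crossmatch_one(objectid))
        except FileNotFoundError:
            # Record the failure and move on, as if no match was found.
            failed.append(objectid)
    matches = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
    return matches, failed


def fake_crossmatch(objectid):
    """Stand-in for a real crossmatch; object 2 simulates a missing file."""
    if objectid == 2:
        raise FileNotFoundError("missing parquet file")
    return pd.DataFrame({"objectid": [objectid], "flux": [1.0 * objectid]})


matches, failed = crossmatch_tolerant([1, 2, 3], fake_crossmatch)
print(matches)
print("objects with missing files:", failed)
```

Returning the list of failed objects alongside the matches keeps the missing-file cases visible instead of silently dropping them.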

Helpdesk request submitted to MAST 9/19 about the missing file.

jkrick · 2024-09-24T19:52:28Z

MAST helpdesk responded that they will fix the file problem on S3 before Oct 7, so this PR will wait for that fix to move forward, so that testing can be completed.

jkrick · 2024-10-11T18:56:24Z

Now fully functional!
This PR includes:

new panstarrs function which uses lsdb to access a hipscat parquet in S3
removes the 'old' functions which use the panstarrs API
changes name in plotting to be in line with actual name of the dataset "Pan-STARRS" (but is too hard for me to type every time)
added runtime info to the intro cell (800s to run the full notebook) @bsipocz is this too long?

Notes

Along the way I discovered missing files in the panstarrs S3 parquets. MAST fixed those, and Troy assures me that he has checks for these things when he pushes data to S3, so I think that is one problem that got fixed and that I hope we will not see again.
This function is not actually faster for the sample size of 30 that we use in the notebook, but will be faster than the API for larger samples, and in general accessing data in parquet format in the cloud is the direction we want to be going.
These might be the first calls to lsdb in our notebooks, but probably not the last.

This will close #165
This uses help from: astronomy-commons/lsdb#416

bsipocz · 2024-10-11T19:19:41Z

800s to run the full notebook

It's a bit long, but for Fornax it's totally acceptable unless we hit a VM limit. We currently skip this notebook due to the ZTF private bucket access.

(and I don't know why we see the error with the classifier notebook, that should not be affected by this PR)

bsipocz

Some comments. Mostly minor, but I picked up on some things to make the API nicer (e.g. quantity inputs, or being explicit with parameters)

light_curves/requirements_light_curve_generator.txt

bsipocz · 2024-10-11T19:26:05Z

light_curves/light_curve_generator.md

-# num_normal_QSO = 5000
-# zmin, zmax = 0, 10
-# randomize_z = False
+#num_normal_QSO = 30
+#zmin, zmax = 0, 10
+#randomize_z = False


are these changes intended?

From my perspective, it doesn't matter what the values of these are since they are commented out. I do change them when I am working on the code and testing different sample sizes, and then most often forget what they originally were and leave them at whatever the last value I tried. Does that cause a problem? Or is it the spaces between the # and the commands that is concerning?

It's the adding of unrelated and unnecessary changes. I see why it happens, but it would be nice to figure out a convenient way to change the workflow so that only intended and necessary changes get added.
I added this topic to my list for the IPAC visit.

yes! I don't know how to do this, so perfect to discuss later this month.

I've seen a new (to me anyway) "git" tab on the LHS of the Fornax console that gives GUI access to changed and staged files, etc. Maybe there's a way to select and commit individual lines/changes from there? I haven't played around with it at all but seems worth checking out.

Yep, I put it on the list for my visit to experiment with something.

light_curves/light_curve_generator.md

light_curves/code_src/panstarrs_functions.py

light_curves/light_curve_generator.md

jkrick

Thanks for looking it over and the comments, I think I got them all.

light_curves/code_src/panstarrs_functions.py

light_curves/light_curve_generator.md

light_curves/requirements_light_curve_generator.txt

bsipocz

cleanup changes to be committed and then this is good to go.

light_curves/requirements_light_curve_generator.txt

light_curves/light_curve_generator.md

light_curves/code_src/panstarrs_functions.py

light_curves/requirements_light_curve_generator.txt

troyraen

Thanks for doing this, glad it's working! I left a mixture of comments below ranging from minor naming suggestions to code structure questions that will take a little more work to figure out. Anything you can't or don't want to address here can go into an issue and be assigned to me.

light_curves/code_src/panstarrs_functions.py

light_curves/light_curve_generator.md

Co-authored-by: Troy Raen <[email protected]>

jkrick

I think I have addressed all the comments and made all the requested changes. Let me know if I missed anything.

light_curves/code_src/panstarrs_functions.py

jkrick · 2024-10-15T01:22:04Z

light_curves/code_src/panstarrs_functions.py

+        dict(flux=pd.to_numeric(flux_panstarrs, errors='coerce').astype(np.float64), 
+             err=pd.to_numeric(err_panstarrs, errors='coerce').astype(np.float64), 
+             time=pd.to_numeric(t_panstarrs, errors='coerce').astype(np.float64), 
+             objectid=pd.to_numeric(objectid, errors='coerce').astype(np.int64), 


This represents a day of my life that I will never get back. It turns out that lsdb returns some interesting data types (e.g., double[pyarrow]), which pandas seamlessly handles, and hides, unless you think to ask about data types. Unfortunately, other code does not handle them well, namely our plotting functions. This is the way I found to convert the columns in the data frames into data types that plotting can handle. Maybe there is another, simpler way of doing it, but this is functional, and not all combinations of these things are functional, i.e., I believe .astype alone doesn't work. The coerce, I believe, was an attempt to handle NaN.

light_curves/light_curve_generator.md

light_curves/code_src/panstarrs_functions.py

bsipocz · 2024-10-16T19:59:37Z

Your comments and experience around working with the dtypes would be super useful feedback to the lincc team, I think.

troyraen · 2024-10-16T21:48:39Z

LGTM. Thanks!

jkrick · 2024-10-16T23:48:05Z

Your comments and experience around working with the dtypes would be super useful feedback to the lincc team, I think.

message received, I put this in as feedback to lsdb in issue #441 over at astronomy-commons/lsdb

jkrick · 2024-10-16T23:49:27Z

@bsipocz when you get a chance, can you please merge this PR? I don't know why it is failing ci.

bsipocz · 2024-10-16T23:52:35Z

It seems to be something unrelated, but I'm not sure I can dive into the details before I get back home.

jkrick · 2024-10-16T23:53:19Z

I'm not in a hurry, thanks.

bsipocz · 2024-10-17T20:46:31Z

OK, my latest commit should fix the failure. Basically, adding any dask would trigger sktime to check whether dask[dataframe] is installed, and it wasn't.

bsipocz · 2024-10-17T21:19:56Z

Hmm, apparently this is in fact an sktime issue: sktime/sktime#7250. Someone reported that downgrading sktime solves the issue, so I was trying that. If it doesn't work, I'll look into more hackery.

bsipocz · 2024-10-18T18:40:21Z

I'm not sure why the classifier notebook runs into VM limits now. I'm doing some more debugging to see if it is still OK with GHA. If yes, then I'll bring over the configs from the IRSA notebooks so we can easily opt in to the GHA build here, too.

jkrick · 2024-10-23T20:10:19Z

see if it is still OK with GHA
@bsipocz what is "GHA"?

bsipocz · 2024-10-23T21:29:09Z

GHA: GitHub Actions. We use GHA for doing the HTML rendering and hosting the pages with GitHub. But for the PRs we use a different system, and their limitations differ a bit. It was already close to the limit, just below it; I suspect adding the new dependencies may have made it a bit worse, but we can't do much about it.

TL;DR, I may need to tweak the configs but I don't think there is much here to do about it. So I would merge this now and figure out issues with CircleCI if/when they arise on main.

panstarrs_get_lightcurve using lsdb ee9f4f1

functional but not fully tested pastarrs_get_lightcurve_lsdb

f80e71a

troyraen mentioned this pull request Sep 27, 2024

Notebooks: Speedup MAST PanSTARRS light curve search at scale #165

Closed

almost got the functions all working nicely together

73bbc93

jkrick self-assigned this Oct 11, 2024

jkrick added the use case: light curves label Oct 11, 2024

jkrick added 2 commits October 11, 2024 18:49

fully functional

0fc8d3a

fully functional

c31a2d8

jkrick changed the title ~~functional but not fully tested pastarrs_get_lightcurve_lsdb~~ pastarrs_get_lightcurve using lsdb Oct 11, 2024

jkrick requested review from troyraen and bsipocz October 11, 2024 18:56

jkrick marked this pull request as ready for review October 11, 2024 18:57

bsipocz reviewed Oct 11, 2024

jkrick changed the title ~~pastarrs_get_lightcurve using lsdb~~ panstarrs_get_lightcurve using lsdb Oct 11, 2024

jkrick mentioned this pull request Oct 11, 2024

ENH: turn query radii for the different archives into astropy quantities #350

Open

response to comments on PR 344

0acf2ee

jkrick commented Oct 11, 2024

bsipocz approved these changes Oct 11, 2024

Minor cleanups before merging [skip ci]

0350d1b

troyraen approved these changes Oct 12, 2024

jkrick and others added 7 commits October 14, 2024 18:06

Apply suggestions from code review

381baad

Co-authored-by: Troy Raen <[email protected]>

removing the name yang from the sample

2d82df4

removing extra saving cell that was included by mistake

58f0df7

renaming parquet filesnames for saving

a92a4c5

returning multiindex instead of regular df

75b7337

handle empty matches better

5a32024

trying to make building the dataframe more efficient

ebb9c75

jkrick commented Oct 16, 2024

MAINT: fix dependencies

3638adc

We only need distributed extras here, but having any dask triggers the need for dataframe in the classifier notebook

DOC: add description why we need dask[dataframe]

8b45d9b

bsipocz mentioned this pull request Oct 17, 2024

[BUG] AttributeError: module 'dask' has no attribute 'dataframe' when using ARIMA sktime/sktime#7250

Closed

TST: try pinning sktime version (to fix its failure with dask)

fedbd7e

bsipocz force-pushed the panstarrs_hipscat branch from 26dd3be to fedbd7e Compare October 18, 2024 18:32

bsipocz merged commit ee9f4f1 into main Oct 23, 2024
2 of 4 checks passed

bsipocz deleted the panstarrs_hipscat branch October 23, 2024 21:29

github-actions bot pushed a commit that referenced this pull request Oct 23, 2024

Merge pull request #344 from nasa-fornax/panstarrs_hipscat

f487b2f
