Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

panstarrs_get_lightcurve using lsdb #344

Merged
merged 16 commits into from
Oct 23, 2024
Merged

panstarrs_get_lightcurve using lsdb #344

merged 16 commits into from
Oct 23, 2024

Conversation

jkrick
Copy link
Contributor

@jkrick jkrick commented Sep 23, 2024

having some trouble catching exceptions for FileNotFound in the call to lsdb's crossmatch with the Panstarrs catalog. It would be nice if we could somehow catch this error so that the crossmatch doesn't fail if a file is missing. Instead, it would be ideal if one of the objects fails the crossmatch, that the crossmatch still finishes with the other objects complete, and the final result just looks like no match was found for that trouble-causing object/file.

Helpdesk request submitted to MAST 9/19 about the missing file.

@jkrick
Copy link
Contributor Author

jkrick commented Sep 24, 2024

MAST helpdesk responded that they will fix the file problem on S3 before Oct 7, so this PR will wait for that fix to move forward, so that testing can be completed.

@jkrick
Copy link
Contributor Author

jkrick commented Oct 11, 2024

Now fully functional!
This PR includes:

  • new panstarrs function which uses lsdb to access a hipscat parquet in S3
  • removes the 'old' functions which use the panstarrs API
  • changes name in plotting to be in line with actual name of the dataset "Pan-STARRS" (but is too hard for me to type every time)
  • added runtime info to the intro cell (800s to run the full notebook) @bsipocz is this too long?

Notes

  • along the way I discovered missing files in the panstarrs S3 parquets, MAST fixed those, and Troy assures me that he has checks for these things when he pushes data to S3. So I think that is one problem that got fixed, that I hope we will not have to see in the future.
  • This function is not actually faster for the sample size of 30 that we use in the notebook, but will be faster than the API for larger samples, and in general accessing data in parquet format in the cloud is the direction we want to be going.
  • this might be the first calls to lsdb in our notebooks? but probably not the last.

This will close #165
This uses help from: astronomy-commons/lsdb#416

@jkrick jkrick changed the title functional but not fully tested pastarrs_get_lightcurve_lsdb pastarrs_get_lightcurve using lsdb Oct 11, 2024
@jkrick jkrick requested review from troyraen and bsipocz October 11, 2024 18:56
@jkrick jkrick marked this pull request as ready for review October 11, 2024 18:57
@bsipocz
Copy link
Member

bsipocz commented Oct 11, 2024

800s to run the full notebook

It's a bit long, but for fornax it's totally acceptable unless we hit a VM limit. We currently skip this notebook due to the ZTF private bucket access.

(and I don't know why we see the error with the classifier notebook, that should not be affected by this PR)

Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments. Mostly minor, but I picked up on some things to make the API nicer (e.g. quantity inputs, or being explicit with parameters)

light_curves/requirements_light_curve_generator.txt Outdated Show resolved Hide resolved
light_curves/requirements_light_curve_generator.txt Outdated Show resolved Hide resolved
Comment on lines 134 to 140
# num_normal_QSO = 5000
# zmin, zmax = 0, 10
# randomize_z = False
#num_normal_QSO = 30
#zmin, zmax = 0, 10
#randomize_z = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these changes intended?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my perspective, it doesn't matter what the values of these are since they are commented out. I do change them when I am working on the code and testing different sample sizes, and then most often forget what they originally were and leave them at whatever the last value I tried. Does that cause a problem? Or is it the spaces between the # and the commands that is concerning?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the adding unrelated and unnecessary changes. I see why it happens, but it would be nice to figure out a convenient way to change the workflow so only intended and necessary changes get added.
I added this topic to my list for the ipac visit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I don't know how to do this, so perfect to discuss later this month.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen a new (to me anyway) "git" tab on the LHS of the Fornax console that gives GUI access to changed and staged files, etc. Maybe there's a way to select and commit individual lines/changes from there? I haven't played around with it at all but seems worth checking out.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeap, I put on the list for my visit to experiment something out.

light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Show resolved Hide resolved
light_curves/light_curve_generator.md Show resolved Hide resolved
@jkrick jkrick changed the title pastarrs_get_lightcurve using lsdb panstarrs_get_lightcurve using lsdb Oct 11, 2024
Copy link
Contributor Author

@jkrick jkrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking it over and the comments, I think I got them all.

light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Show resolved Hide resolved
light_curves/light_curve_generator.md Show resolved Hide resolved
light_curves/requirements_light_curve_generator.txt Outdated Show resolved Hide resolved
Copy link
Member

@bsipocz bsipocz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cleanup changes to be committed and then this is good to go.

light_curves/requirements_light_curve_generator.txt Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Show resolved Hide resolved
light_curves/requirements_light_curve_generator.txt Outdated Show resolved Hide resolved
Copy link
Contributor

@troyraen troyraen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, glad it's working! I left a mixture of comments below ranging from minor naming suggestions to code structure questions that will take a little more work to figure out. Anything you can't or don't want to address here can go into an issue and be assigned to me.

light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@jkrick jkrick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have addressed all the comments and made all the requested changes. Let me know if I missed anything.

light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
Comment on lines 122 to 125
dict(flux=pd.to_numeric(flux_panstarrs, errors='coerce').astype(np.float64),
err=pd.to_numeric(err_panstarrs, errors='coerce').astype(np.float64),
time=pd.to_numeric(t_panstarrs, errors='coerce').astype(np.float64),
objectid=pd.to_numeric(objectid, errors='coerce').astype(np.int64),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This represents a day of my life that I will never get back. It turns out that lsdb returns some interesting data types (eg. double[pyarrow]), which pandas seamlessly handles, and hides, unless you think to ask about data types. Unfortunately, other codes do not handle them well, namely our plotting functions.... This is the way I found of getting the data types converted into data types in the data frames that can be handled by plotting. Maybe there is another, simpler, way of doing it, but this is functional, and not all combinations of these things are functional, ie., I believe .astype alone doesn't work. The coerce I believe was an attempt to handle nan.

light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/light_curve_generator.md Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
light_curves/code_src/panstarrs_functions.py Outdated Show resolved Hide resolved
@bsipocz
Copy link
Member

bsipocz commented Oct 16, 2024

Your comments and experience around working with the dtypes would be a super useful feedback to the lincc team I think.

@troyraen
Copy link
Contributor

LGTM. Thanks!

@jkrick
Copy link
Contributor Author

jkrick commented Oct 16, 2024

Your comments and experience around working with the dtypes would be a super useful feedback to the lincc team I think.

message received, I put this in as feedback to lsdb in issue #441 over at astronomy-commons/lsdb

@jkrick
Copy link
Contributor Author

jkrick commented Oct 16, 2024

@bsipocz when you get a chance, can you please merge this PR? I don't know why it is failing ci.

@bsipocz
Copy link
Member

bsipocz commented Oct 16, 2024

It seems to be something unrelated, but I'm not sure I can dive into the details before I get back home.

@jkrick
Copy link
Contributor Author

jkrick commented Oct 16, 2024

I'm not in a hurry, thanks.

We only need distributed extras here, but having any dask triggers the need for dataframe in the classifier notebook
@bsipocz
Copy link
Member

bsipocz commented Oct 17, 2024

OK, my latest commit should fix the failure. Basically adding some dask would trigger sktime to check if dask[dataframe] is all installed, and it wasn't.

@bsipocz
Copy link
Member

bsipocz commented Oct 17, 2024

Hmm, apparently this is in fact an sktime issue: sktime/sktime#7250. Someone reported that downgrading sktime solves the issue, so I was trying that. If it doesn't work, I'll look into more hackery.

@bsipocz
Copy link
Member

bsipocz commented Oct 18, 2024

I'm not sure why the classifier notebook run into VM limits now, I'm doing some more debugs and see if it is still OK with GHA. If yes, then I'll bring over the configs from the IRSA notebooks to we can easily opt in GHA build here, too.

@jkrick
Copy link
Contributor Author

jkrick commented Oct 23, 2024

see if it is still OK with GHA
@bsipocz what is "GHA"?

@bsipocz
Copy link
Member

bsipocz commented Oct 23, 2024

GHA: GitHub Actions. We use GHA for doing the HTML rendering and hosting the pages with github. But for the PRs we use a different system, and their limitations differ a bit. It was already close to the limit but from below, I suspect adding the new dependencies may changed it worse the worse a bit, but we can't do much about it.

TL;DR, I may need to tweak the configs but I don't think there is much here to do about it. So I would merge this now and figure out issues with circleCI if/when they arise on main.

@bsipocz bsipocz merged commit ee9f4f1 into main Oct 23, 2024
2 of 4 checks passed
@bsipocz bsipocz deleted the panstarrs_hipscat branch October 23, 2024 21:29
github-actions bot pushed a commit that referenced this pull request Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants