
Refactor metrics in the IOOS_BTN.ipynb notebook #53

Merged
merged 13 commits into from
Feb 13, 2024

Conversation

@ocefpaf (Member) commented Feb 7, 2024

@MathewBiddle these are the first functions I extracted from IOOS_BTN.ipynb. I'll do the rest soon but I'd like to get your input before moving forward with this idea. For now you can run it with:

```python
from ioos_metrics.ioos_metrics import update_metrics

df = update_metrics()
df.dropna(axis=1)
```

When we have each metric as its own function we can:

  • have a main script that collects them all and creates the same output as the notebook
  • test them individually in the CIs with a cronjob so we catch breakages earlier
  • run automated code quality checks/lints on the individual scripts

This will also be easier to maintain and review as individual pieces. Here, for example, I removed one dependency (bs4), two extra steps, and fixed a pandas deprecation when creating the table to make it future-proof.

Conditions on our calculations:
* drops all datasets with `datasetID` containing `delayed`;
* duration is calculated from the metadata ERDDAP generates (time_coverage), which usually over-estimates a bit b/c it includes empty data (NaN). Note that data with NaN can be real glider days with lost data, which is OK for this metric.
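The two conditions above can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code; the column names come from the warning output quoted later in this thread, and the function name is hypothetical.

```python
# Hedged sketch: estimate "glider days" from the time_coverage metadata
# of an ERDDAP allDatasets-style table, applying the two conditions above.
import pandas as pd


def glider_days(df: pd.DataFrame) -> float:
    """df is assumed to have 'datasetID', 'minTime (UTC)', 'maxTime (UTC)' columns."""
    # Condition 1: drop delayed-mode datasets.
    df = df[~df["datasetID"].str.contains("delayed")]
    # Condition 2: duration from time_coverage metadata; NaN gaps inside
    # the deployment window still count toward the total.
    start = pd.to_datetime(df["minTime (UTC)"], errors="coerce", utc=True)
    end = pd.to_datetime(df["maxTime (UTC)"], errors="coerce", utc=True)
    return (end - start).dt.total_seconds().sum() / 86400
```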
@ocefpaf (Member, Author) commented:

@MathewBiddle I captured our Slack convo here. Let me know if this is sufficient for future understanding of what we are doing.

@kbailey-noaa this script can compute glider days much faster than my notebook or gdutils b/c it uses only the time_coverage metadata from the allDatasets entry in ERDDAP. The main difference is that it will "over-estimate" the days b/c it counts the time with NaN data as part of the glider day.

If we see this as "the glider was in the water anyway," this metric is "more correct" than my notebook or gdutils. If we want to compute glider days strictly based on data collected, then this one overestimates by 621 days (75929 against 75308 for this time period).

@kbailey-noaa commented Feb 8, 2024

@ocefpaf Editing this comment:
Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0.
And, overestimating by > 1 year's worth of days feels like too much.
From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

A Contributor commented:

@MathewBiddle Thanks haha. It might be moot though. Not sure if you saw I had updated my comment above to specify glider days = # days the instrument was in the water collecting data...

@ocefpaf (Member, Author) commented Feb 9, 2024

@kbailey-noaa kudos to finding the only dataset with no minTime 👏

@MathewBiddle I'm not sure how Kathy's eyes work, but you should have seen how many flaws in my code (and the data) she found when we were doing the hurricane animation script, just by looking at the final animation! I guess we bury our heads in the code and forget the big picture; she doesn't. That is why we need a diverse group.

But now I'm more curious why our code did not break! It should have. Investigating that... Anyway, we are at a point where we can choose the best of 3 imperfect metrics. IMO it is Matt's code with the fixed rounded days, when glider days is the only question we want to answer. There is no silver bullet for such a big dataset and so many small metadata problems.


Edit: converting an empty string (not a NaN) to datetime silently returns an empty value:

```python
pd.to_datetime(
    df.loc[df["datasetID"] == "Nemesis-20170512T0000"]["minTime (UTC)"],
    errors="raise",
)
```

That is a bad silent error :-/
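The silent failure described above can be reproduced without the dataset: pandas treats an empty string as missing and returns NaT even when `errors="raise"` is requested, which is why the bad `minTime` slipped through unnoticed.

```python
# Hedged sketch: an empty string is coerced to NaT by pd.to_datetime
# even with errors="raise" -- no exception is raised.
import pandas as pd

result = pd.to_datetime("", errors="raise")
print(result)  # NaT
```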

@ocefpaf (Member, Author) commented:

> @ocefpaf Editing this comment: Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0. And, overestimating by > 1 year's worth of days feels like too much. From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

@kbailey-noaa happy to change to "data days" however, when the machines take over the world and demand credit for their labor, I'm not answering for it ;-p

Jokes aside, the overestimation is 2 orders of magnitude smaller than both estimates. I don't think it is that bad.

A Contributor commented:

Looks like you added some logic in there. Nice addition.

```text
UserWarning: The following rows have missing data:
    minTime (UTC)         maxTime (UTC)              datasetID
752           NaN  2017-08-02T19:16:37Z  Nemesis-20170512T0000
  warnings.warn(f"The following rows have missing data:\n{rows}")
```
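A check of the kind that produces the warning above could look like this. This is an illustrative sketch, not the PR's actual code; only the warning message and column names are taken from the output quoted above, and the function name is hypothetical.

```python
# Hedged sketch: flag rows whose time_coverage metadata is incomplete
# before computing glider days, warning instead of failing silently.
import warnings

import pandas as pd


def warn_on_missing_times(df: pd.DataFrame) -> pd.DataFrame:
    cols = ["minTime (UTC)", "maxTime (UTC)", "datasetID"]
    rows = df.loc[df[cols].isna().any(axis=1), cols]
    if not rows.empty:
        warnings.warn(f"The following rows have missing data:\n{rows}")
    return rows
```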

@ocefpaf ocefpaf marked this pull request as ready for review February 8, 2024 13:22
@ocefpaf (Member, Author) commented Feb 8, 2024

@MathewBiddle this one is ready for review. I don't want to add more metrics, to avoid making it too hard to review. I'll add 3-4 metrics per PR to keep the context short. Is that OK?

@MathewBiddle (Contributor) commented:
I like the direction this is going. Thanks @ocefpaf. I'd like to test this locally before I say anything else, however.

@ocefpaf ocefpaf changed the title add first metric, federal_partners Refactor metrics in the IOOS_BTN.ipynb notebook Feb 9, 2024
@MathewBiddle (Contributor) commented:
@ocefpaf this is looking good. Did you want to add any more to this PR, or should we merge and move to the next batch?

@ocefpaf (Member, Author) commented Feb 13, 2024

> @ocefpaf this is looking good. Did you want to add any more to this PR, or should we merge and move to the next batch?

Let's merge and move to the next batch. I have a few extra commits here that could use a rebase and a new PR to avoid confusion.

@MathewBiddle MathewBiddle merged commit b5c5c36 into ioos:main Feb 13, 2024
1 check passed
@ocefpaf ocefpaf deleted the ioos_btn_metrics branch February 13, 2024 16:42