
Refactor metrics in the IOOS_BTN.ipynb notebook #53

Merged
merged 13 commits into from
Feb 13, 2024

Conversation

@ocefpaf (Member) commented Feb 7, 2024

@MathewBiddle these are the first functions I extracted from IOOS_BTN.ipynb. I'll do the rest soon but I'd like to get your input before moving forward with this idea. For now you can run it with:

```python
from ioos_metrics.ioos_metrics import update_metrics

df = update_metrics()
df.dropna(axis=1)
```

When we have each metric as its own function we can:

  • have a main script that collects them all and creates the same output as the notebook
  • test them individually in the CIs with a cronjob so we catch breakages earlier
  • run automated code quality checks/lints on the individual scripts

This will also be easier to maintain and review as individual pieces. Here, for example, I removed one dependency (bs4), two extra steps, and fixed a pandas deprecation when creating the table to make it future-proof.

Conditions on our calculations:
* drops all datasets with `datasetID` containing `delayed`;
* duration is calculated from the metadata ERDDAP generates (time_coverage), which usually over-estimates a bit b/c it includes empty data (NaN). Note that data with NaN can be real glider days with lost data, which is OK for this metric.
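The two conditions above can be sketched roughly as follows. This is a minimal illustration, not the PR's exact code; the column names come from the warning output quoted later in this thread, and the function name is hypothetical.

```python
# Hedged sketch: estimate "glider days" from the time_coverage metadata
# of an ERDDAP allDatasets-style table, applying the two conditions above.
import pandas as pd


def glider_days(df: pd.DataFrame) -> float:
    """df is assumed to have 'datasetID', 'minTime (UTC)', 'maxTime (UTC)' columns."""
    # Condition 1: drop delayed-mode datasets.
    df = df[~df["datasetID"].str.contains("delayed")]
    # Condition 2: duration from time_coverage metadata; NaN gaps inside
    # the deployment window still count toward the total.
    start = pd.to_datetime(df["minTime (UTC)"], errors="coerce", utc=True)
    end = pd.to_datetime(df["maxTime (UTC)"], errors="coerce", utc=True)
    return (end - start).dt.total_seconds().sum() / 86400
```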
@ocefpaf (Member, Author) commented:

@MathewBiddle I captured our Slack convo here. Let me know if this is sufficient for future understanding of what we are doing.

@kbailey-noaa this script can compute glider days much faster than my notebook or gdutils b/c it uses only the time_coverage metadata from the allDatasets entry in ERDDAP. The main difference is that it will "over-estimate" the days b/c it counts the time with NaN data as part of the glider day.

If we see this as "the glider was in the water anyway," this metric is "more correct" than my notebook or gdutils. If we want to compute glider days strictly based on data collected, then this one overestimates by 621 days (75929 against 75308 for this time period).

@kbailey-noaa commented Feb 8, 2024

@ocefpaf Editing this comment:
Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0.
And, overestimating by > 1 year's worth of days feels like too much.
From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

A Contributor commented:

@MathewBiddle Thanks haha. It might be moot though. Not sure if you saw I had updated my comment above to specify glider days = # days the instrument was in the water collecting data...

@ocefpaf (Member, Author) commented Feb 9, 2024

@kbailey-noaa kudos to finding the only dataset with no minTime 👏

@MathewBiddle I'm not sure how Kathy's eyes work, but you should have seen how many flaws in my code (and the data) she found when we were doing the hurricane animation script, just by looking at the final animation! I guess we bury our heads in the code and forget the big picture; she doesn't. That is why we need a diverse group.

But now I'm more curious why our code did not break! It should have. Investigating that... Anyway, we are at a point where we can choose the best of 3 imperfect metrics. IMO it is Matt's code with the fixed rounded days, when glider days is the only question we want to answer. There is no silver bullet for such a big dataset and so many small metadata problems.


Edit: converting an empty string (not a NaN) to datetime silently returns an empty value:

```python
pd.to_datetime(
    df.loc[df["datasetID"] == "Nemesis-20170512T0000"]["minTime (UTC)"],
    errors="raise",
)
```

That is a bad silent error :-/
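The silent failure described above can be reproduced without the dataset: pandas treats an empty string as missing and returns NaT even when `errors="raise"` is requested, which is why the bad `minTime` slipped through unnoticed.

```python
# Hedged sketch: an empty string is coerced to NaT by pd.to_datetime
# even with errors="raise" -- no exception is raised.
import pandas as pd

result = pd.to_datetime("", errors="raise")
print(result)  # NaT
```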

@ocefpaf (Member, Author) commented:

> @ocefpaf Editing this comment: Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0. And, overestimating by > 1 year's worth of days feels like too much. From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

@kbailey-noaa happy to change to "data days" however, when the machines take over the world and demand credit for their labor, I'm not answering for it ;-p

Jokes aside, the overestimation is 2 orders of magnitude smaller than both estimates. I don't think it is that bad.

A Contributor commented:

Looks like you added some logic in there. Nice addition.

```text
UserWarning: The following rows have missing data:
    minTime (UTC)         maxTime (UTC)              datasetID
752           NaN  2017-08-02T19:16:37Z  Nemesis-20170512T0000
  warnings.warn(f"The following rows have missing data:\n{rows}")
```
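A check of the kind that produces the warning above could look like this. This is an illustrative sketch, not the PR's actual code; only the warning message and column names are taken from the output quoted above, and the function name is hypothetical.

```python
# Hedged sketch: flag rows whose time_coverage metadata is incomplete
# before computing glider days, warning instead of failing silently.
import warnings

import pandas as pd


def warn_on_missing_times(df: pd.DataFrame) -> pd.DataFrame:
    cols = ["minTime (UTC)", "maxTime (UTC)", "datasetID"]
    rows = df.loc[df[cols].isna().any(axis=1), cols]
    if not rows.empty:
        warnings.warn(f"The following rows have missing data:\n{rows}")
    return rows
```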

@ocefpaf ocefpaf marked this pull request as ready for review February 8, 2024 13:22
@ocefpaf (Member, Author) commented Feb 8, 2024

@MathewBiddle this one is ready for review. I don't want to add more metrics, to avoid making it too hard to review. I'll add 3-4 metrics per PR to keep the context short. Is that OK?

@MathewBiddle (Contributor) commented:
I like the direction this is going. Thanks @ocefpaf. I'd like to test this locally before I say anything else, however.

@ocefpaf ocefpaf changed the title add first metric, federal_partners Refactor metrics in the IOOS_BTN.ipynb notebook Feb 9, 2024
@MathewBiddle (Contributor) commented:
@ocefpaf this is looking good. Did you want to add any more to this PR, or should we merge and move to the next batch?

@ocefpaf (Member, Author) commented Feb 13, 2024

> @ocefpaf this is looking good. Did you want to add any more to this PR, or should we merge and move to the next batch?

Let's merge and move to the next batch. I have a few extra commits here that could use a rebase and a new PR to avoid confusion.

@MathewBiddle MathewBiddle merged commit b5c5c36 into ioos:main Feb 13, 2024
1 check passed
@ocefpaf ocefpaf deleted the ioos_btn_metrics branch February 13, 2024 16:42