-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor metrics in the IOOS_BTN.ipynb notebook #53
Conversation
1409881
to
2b8a954
Compare
Conditions on our calculations: | ||
* drops all datasets with `datasetID` containing `delayed`. | ||
* duration is calculated based on the metadata ERDDAP generates (time_coverage) which usually over-estimate a bit b/c it includes empty data (NaN). | ||
Note that data with NaN can be real glider day with lost data. Which is OK for this metric. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MathewBiddle I capture our slack convo here. Let me know if this is sufficient for future understanding of what we are doing.
@kbailey-noaa this script can compute gliders days much faster than my notebook or gdutils b/c it uses only the time_coverage metadata from the allDatasets entry in ERDDAP. The main difference is that it will "over-estimate" the days b/c it will count the time with NaNs data as part of the glider day.
If we see this as "the glider was in the water anyway" this metric is "more correct" than my notebook or gdutils. If we want to compute glider days strictly based on data collected, then this one overestimate by 621 days (75929 against 75308 for this time period).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ocefpaf Editing this comment:
Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0.
And, overestimating by > 1 year's worth of days feels like too much.
From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ERDDAP calculates those two attributes. I assumed (bad on my part) that those would be correct based on the time variable in each dataset. You raise a great point and something we will have to consider in the current code.
Here is that entry in the allDatasets response (notice minTime is empty):
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kbailey-noaa kudos to finding the only dataset with no minTime 👏 https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?datasetID%2CminTime&distinct()
Now I'm curious why there are fill values in the time variable for that dataset?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MathewBiddle Thanks haha. It might be moot though. Not sure if you saw I had updated my comment above to specify glider days = # days the instrument was in the water collecting data...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kbailey-noaa kudos to finding the only dataset with no minTime 👏
@MathewBiddle I'm not sure how Kathy eyes work but you had to see how many flaws in my code (and the data) she found when we were doing the hurricane animation script by just looking at the final animation! I guess that we bury our heads in the code and forget the big picture, she doesn't. That is why we need a diverse group.
But now I'm more curious why our code did not break! It should have. Investigating that... Anyway, we are at a point were we can choose the best of 3 imperfect metrics. IMO it is Matt's code with the fixed rounded days, when only glider days is the question we want to answer. There is no silver bullet for such a big dataset and so many small metadata problems.
Edit: The conversion of empty, not a NaN, to datetime, returns an empty value:
pd.to_datetime(df.loc[df["datasetID"] == "Nemesis-20170512T0000"]["minTime (UTC)"], errors="raise")
That is a bad silent error :-/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ocefpaf Editing this comment: Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0. And, overestimating by > 1 year's worth of days feels like too much. From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.
@kbailey-noaa happy to change to "data days" however, when the machines take over the world and demand credit for their labor, I'm not answering for it ;-p
Jokes aside, the overestimation is 2 orders os magnitude smaller than both estimates. I don't think it is that bad.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like you added some logic in there. Nice addition.
UserWarning: The following rows have missing data:
minTime (UTC) maxTime (UTC) datasetID
752 NaN 2017-08-02T19:16:37Z Nemesis-20170512T0000
warnings.warn(f"The following rows have missing data:\n{rows}")
@MathewBiddle this one is ready for review. I don't want to add more metrics to avoid making it to hard for reviewing. I'll add 3-4 metrics per PR to keep the context short. Is that OK? |
I like the direction this is going. Thanks @ocefpaf. I'd like to test this locally before I say anything else, however. |
c391bcd
to
008c934
Compare
@ocefpaf this is looking good. Did you want to add any more to this PR, or should we merge and move to the next batch? |
Let's merge and move to the next batch. I have a few extra commits here that could use a rebase and a new PR to avoid confusion. |
@MathewBiddle these are the first functions I extracted from IOOS_BTN.ipynb. I'll do the rest soon but I'd like to get your input before moving forward with this idea. For now you can run it with:
When we have all the metrics as their own function we can:
This will also be easier to maintain and review as individual pieces. Here, for example, I removed on dependency (
bs4
) , two extra steps, and fixed a pandas deprecation when creating the table to make it future proof.