Refactor metrics in the IOOS_BTN.ipynb notebook #53
Merged
+12,825
−11,176
13 commits
8aa704e  add first metric, federal_partners  (ocefpaf)
0a741e0  add tests  (ocefpaf)
2b8a954  add GHA  (ocefpaf)
e28a786  uncomment branch  (ocefpaf)
e2b966a  add previous metrics  (ocefpaf)
220784a  add glider metric  (ocefpaf)
38b636c  add formatting and linting  (ocefpaf)
88cdeec  add update metrics  (ocefpaf)
e21ddc3  cast counts to integer for safety  (ocefpaf)
a1eca05  add glider days >= than previous tests  (ocefpaf)
008c934  warn for missing data in ngdac_gliders  (ocefpaf)
f0cc690  add comt  (ocefpaf)
8079f67  add regional_associations  (ocefpaf)
@@ -0,0 +1,25 @@
name: Full Tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  run:
    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3

    - name: Setup Micromamba
      uses: mamba-org/setup-micromamba@v1
      with:
        init-shell: bash
        environment-file: conda-lock.yml
        environment-name: TEST

    - name: Tests
      shell: bash -l {0}
      run: |
        python -m pytest -rxs tests
@@ -0,0 +1,41 @@
repos:
- repo: https://github.com/pre-commit/pre-commit-hooks
  rev: v4.5.0
  hooks:
    - id: trailing-whitespace
    - id: check-ast
    - id: debug-statements
    - id: end-of-file-fixer
    - id: check-added-large-files

- repo: https://github.com/psf/black
  rev: 24.1.1
  hooks:
    - id: black
      language_version: python3

- repo: https://github.com/codespell-project/codespell
  rev: v2.2.6
  hooks:
    - id: codespell

- repo: https://github.com/asottile/add-trailing-comma
  rev: v3.1.0
  hooks:
    - id: add-trailing-comma

- repo: https://github.com/astral-sh/ruff-pre-commit
  rev: v0.2.1
  hooks:
    - id: ruff

ci:
  autofix_commit_msg: |
    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
  autofix_prs: false
  autoupdate_commit_msg: '[pre-commit.ci] pre-commit autoupdate'
  autoupdate_schedule: monthly
  skip: []
  submodules: false
"""
Code extracted from IOOS_BTN.ipynb
"""

import io
import warnings

import pandas as pd
import requests
from bs4 import BeautifulSoup

_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
}


def previous_metrics():
    """
    Loads the previous metrics as a DataFrame for updating.
    """
    df = pd.read_csv(
        "https://github.com/ioos/ioos_metrics/raw/main/ioos_btn_metrics.csv",
    )

    number_cols = [
        "Federal Partners",
        "Regional Associations",
        "HF Radar Stations",
        "NGDAC Glider Days",
        "National Platforms",
        "Regional Platforms",
        "ATN Deployments",
        "MBON Projects",
        "OTT Projects",
        "HAB Pilot Projects",
        "QARTOD Manuals",
        "IOOS Core Variables",
        "Metadata Records",
        "IOOS",
        "COMT Projects",
    ]
    # Nullable integer dtype keeps counts as integers even with missing values.
    df[number_cols] = df[number_cols].astype("Int64")
    return df


def federal_partners():
    """
    ICOOS Act/COORA

    Typically 17, from https://ioos.noaa.gov/community/national#federal.
    """

    url = "https://ioos.noaa.gov/community/national#federal"

    html = requests.get(url, headers=_HEADERS).text

    df = pd.read_html(io.StringIO(html))
    # The partners are listed in two columns of the second table on the page.
    df_clean = df[1].drop(columns=[0, 2])
    df_fed_partners = pd.concat([df_clean[1], df_clean[3]]).dropna().reset_index()
    return df_fed_partners.shape[0]


def ngdac_gliders(start_date="2000-01-01", end_date="2023-12-31"):
    """
    NGDAC Glider Days

    Gliders monitor water currents, temperature, and conditions that reveal effects from storms,
    impacts on fisheries, and the quality of our water.
    This information creates a more complete picture of what is happening in the ocean,
    as well as trends scientists might be able to detect.
    U.S. IOOS began counting "Glider days" in 2008 with the intent to better coordinate across
    U.S. glider operations and to increase the data sharing and data management of this technology.
    One "Glider Day" is defined as 1 glider in the water collecting data for 1 day.

    From https://gliders.ioos.us/erddap/info/index.html?page=1&itemsPerPage=1000

    Cumulative from 2008 - present

    Conditions on our calculations:
    * drops all datasets with `datasetID` containing `delayed`.
    * duration is calculated based on the metadata ERDDAP generates (time_coverage), which usually overestimates a bit because it includes empty data (NaN).
      Note that data with NaN can be a real glider day with lost data, which is OK for this metric.
    """
    # NOTE: start_date and end_date are not used yet; every dataset is counted.
    df = pd.read_csv(
        "https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime,maxTime,datasetID",
    )

    # We don't want allDatasets in our numbers.
    df = df.loc[df["datasetID"] != "allDatasets"]

    # Check if any value is NaN and report it.
    if df.isnull().sum().sum():
        rows = df.loc[df.isnull().any(axis=1)]
        warnings.warn(
            f"The following rows have missing data:\n{rows}",
            stacklevel=2,
        )

    # Reassign instead of dropping in place on a filtered frame.
    df = df.dropna(axis=0)

    # drop delayed datasets
    df = df.loc[~df["datasetID"].str.contains("delayed")]

    min_time = pd.to_datetime(df["minTime (UTC)"])
    max_time = pd.to_datetime(df["maxTime (UTC)"])

    # Round each deployment outwards to whole days before summing.
    days = max_time.dt.ceil("D") - min_time.dt.floor("D")
    return days.sum().days


def comt():
    """
    The COMT serves as a conduit between the federal operational and research communities and allows sharing of numerical models,
    observations and software tools.
    The COMT supports integration, comparison, scientific analyses and archiving of data and model output needed to elucidate,
    prioritize, and resolve federal and regional operational coastal ocean issues associated with a range of existing and emerging coastal oceanic,
    hydrologic, and ecological models.
    The Testbed has enabled significant community building (within the modeling community as well as enhancing academic and federal operational relations) which has dramatically improved model development.

    Number of Active Projects via personal communication from COMT program manager.
    """

    url = "https://ioos.noaa.gov/project/comt/"

    html = requests.get(url, headers=_HEADERS).text

    soup = BeautifulSoup(html, "html.parser")

    # Count the list items that follow the "Current Projects" heading.
    for tag in soup.find_all("h2"):
        if tag.text == "Current Projects":
            return len(tag.next_sibling.find_all("li"))

    # Fail with a clear message instead of an unbound local variable.
    raise ValueError(f'No "Current Projects" heading found at {url}')


def regional_associations():
    """
    Counts the links with a bold label on the regions-at-a-glance page.
    """
    ras = 0
    url = "https://ioos.noaa.gov/regions/regions-at-a-glance/"

    html = requests.get(url, headers=_HEADERS).text
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all("a"):
        if tag.find("strong") is not None:
            ra = tag.find("strong").text
            # TODO: change to log
            # print(f"Found RA {ra}")
            ras += 1

    return ras


def update_metrics():
    """
    Load previous metrics and update the spreadsheet.
    """
    df = previous_metrics()

    federal_partners_number = federal_partners()
    glider_days = ngdac_gliders()
    comt_number = comt()
    ras = regional_associations()

    _TODO = [
        # "NGDAC Glider Days", (TODO: change to data days)
        "HF Radar Stations",  # It is a hardcoded number at the moment
        "National Platforms",
        "Regional Platforms",
        "ATN Deployments",
        "MBON Projects",
        "OTT Projects",
        "HAB Pilot Projects",
        "QARTOD Manuals",
        "IOOS Core Variables",
        "Metadata Records",
        "IOOS",
    ]

    today = pd.Timestamp.now(tz="UTC").strftime("%Y-%m-%d")
    new_metric_row = pd.DataFrame(
        [today, federal_partners_number, glider_days, comt_number, ras],
        index=[
            "date_UTC",
            "Federal Partners",
            "NGDAC Glider Days",
            "COMT Projects",
            "Regional Associations",
        ],
    ).T
    # only update numbers if it's a new day
    if today not in df["date_UTC"].to_list():
        df = pd.concat(
            [df, new_metric_row],
            ignore_index=True,
            axis=0,
        )

    return df
@@ -0,0 +1,12 @@
[tool.ruff]
select = [
    "A",  # flake8-builtins
    "B",  # flake8-bugbear
    "C4",  # flake8-comprehensions
    "F",  # flakes
    "I",  # import sorting
    "T20",  # flake8-print
    "UP",  # upgrade
]
target-version = "py38"
line-length = 79
@@ -0,0 +1,44 @@
import sys

import pandas as pd
import pytest

sys.path.append("..")

from ioos_metrics import ioos_metrics


@pytest.fixture
def df_previous_metrics():
    return ioos_metrics.previous_metrics()


def test_previous_metrics(df_previous_metrics):
    assert isinstance(df_previous_metrics, pd.DataFrame)
    assert not df_previous_metrics.empty


def test_federal_partners():
    num = ioos_metrics.federal_partners()
    # must be an integer and cannot be less than 0
    assert isinstance(num, int)
    assert num >= 0


def test_ngdac_gliders(df_previous_metrics):
    num = ioos_metrics.ngdac_gliders()
    assert isinstance(num, int)
    # New count should always be >= the previous one.
    assert num >= df_previous_metrics["NGDAC Glider Days"].iloc[-1]


def test_comt():
    num = ioos_metrics.comt()
    assert isinstance(num, int)
    assert num >= 0


def test_regional_associations():
    num = ioos_metrics.regional_associations()
    assert isinstance(num, int)
    assert num >= 0
@MathewBiddle I captured our Slack conversation here. Let me know if this is sufficient for future understanding of what we are doing.
@kbailey-noaa this script can compute glider days much faster than my notebook or gdutils b/c it uses only the time_coverage metadata from the allDatasets entry in ERDDAP. The main difference is that it will "over-estimate" the days b/c it will count the time with NaN data as part of the glider day.
If we see this as "the glider was in the water anyway", this metric is "more correct" than my notebook or gdutils. If we want to compute glider days strictly based on data collected, then this one overestimates by 621 days (75929 against 75308 for this time period).
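To make the time_coverage approach concrete, here is a minimal standard-library sketch of the rounding described above: the start of a deployment is floored to midnight and the end is ceiled to the next midnight, so any partial day counts as a whole glider day. The function name `coverage_days` and the deployment times are hypothetical and only for illustration; the script in this PR does the equivalent with pandas.

```python
from datetime import datetime, timedelta, timezone


def coverage_days(min_time: datetime, max_time: datetime) -> int:
    """Whole days spanned after flooring the start and ceiling the end."""
    start = min_time.replace(hour=0, minute=0, second=0, microsecond=0)
    end = max_time.replace(hour=0, minute=0, second=0, microsecond=0)
    if end < max_time:
        # Any time past midnight pushes the end to the next day (ceil).
        end += timedelta(days=1)
    return (end - start).days


# Hypothetical deployment: in the water from 06:00 on May 12 to 18:00 on May 14.
deployed = datetime(2017, 5, 12, 6, 0, tzinfo=timezone.utc)
recovered = datetime(2017, 5, 14, 18, 0, tzinfo=timezone.utc)
print(coverage_days(deployed, recovered))
```

The deployment lasts about two and a half days but is counted as three, which is one source of the overestimate mentioned above.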
@ocefpaf Editing this comment:
Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0.
And, overestimating by > 1 year's worth of days feels like too much.
From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.
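A minimal sketch of the stricter "data days" definition, assuming the metric is the number of distinct calendar days with at least one non-NaN measurement. The function name `data_days` and the sample values are hypothetical; this is not how the notebook or gdutils are implemented, only an illustration of the definition.

```python
import math
from datetime import datetime


def data_days(samples):
    """Count distinct calendar days with at least one non-NaN measurement."""
    days_with_data = {
        timestamp.date()
        for timestamp, value in samples
        if not math.isnan(value)
    }
    return len(days_with_data)


# Hypothetical samples from one deployment: (time, temperature).
samples = [
    (datetime(2017, 5, 12, 6), 11.2),
    (datetime(2017, 5, 12, 18), 11.4),
    (datetime(2017, 5, 13, 6), float("nan")),  # sensor returned no data
    (datetime(2017, 5, 14, 6), 11.0),
]
print(data_days(samples))
```

Under this definition May 13 does not count, whereas a time-coverage calculation would include it because the glider was still in the water.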
ERDDAP calculates those two attributes. I assumed (bad on my part) that those would be correct based on the time variable in each dataset. You raise a great point and something we will have to consider in the current code.
Here is that entry in the allDatasets response (notice minTime is empty):
https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?datasetID%2Caccessible%2Cinstitution%2CdataStructure%2Ccdm_data_type%2Cclass%2Ctitle%2CminLongitude%2CmaxLongitude%2ClongitudeSpacing%2CminLatitude%2CmaxLatitude%2ClatitudeSpacing%2CminAltitude%2CmaxAltitude%2CminTime%2CmaxTime%2CtimeSpacing%2Cgriddap%2Csubset%2Ctabledap%2CMakeAGraph%2Csos%2Cwcs%2Cwms%2Cfiles%2Cfgdc%2Ciso19115%2Cmetadata%2CsourceUrl%2CinfoUrl%2Crss%2Cemail%2CtestOutOfDate%2CoutOfDate%2Csummary&datasetID=%22Nemesis-20170512T0000%22
@kbailey-noaa kudos for finding the only dataset with no minTime 👏 https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?datasetID%2CminTime&distinct()
Now I'm curious why there are fill values in the time variable for that dataset?
https://gliders.ioos.us/erddap/tabledap/Nemesis-20170512T0000.htmlTable?trajectory%2Cwmo_id%2Cprofile_id%2Ctime%2Clatitude%2Clongitude%2Cdepth%2Ccdom%2Ccdom_qc%2Cconductivity%2Cconductivity_qc%2Cdensity%2Cdensity_qc%2Cdepth_qc%2Cfluorescence%2Cfluorescence_qc%2Cinstrument_ctd%2Clat_qc%2Clat_uv%2Clat_uv_qc%2Clatitude_qc%2Clon_qc%2Clon_uv%2Clon_uv_qc%2Clongitude_qc%2Copbs%2Copbs_qc%2Coxygen%2Coxygen_qc%2Cplatform_meta%2Cprecise_lat%2Cprecise_lat_qc%2Cprecise_lon%2Cprecise_lon_qc%2Cprecise_time%2Cprecise_time_qc%2Cpressure%2Cpressure_qc%2Csalinity%2Csalinity_qc%2Ctemperature%2Ctemperature_qc%2Ctime_qc%2Ctime_uv%2Ctime_uv_qc%2Cu%2Cu_qc%2Cv%2Cv_qc&time%3C=2017-05-16T18%3A13%3A46Z
@MathewBiddle Thanks haha. It might be moot though. Not sure if you saw I had updated my comment above to specify glider days = # days the instrument was in the water collecting data...
@MathewBiddle I'm not sure how Kathy's eyes work, but you should have seen how many flaws in my code (and the data) she found when we were doing the hurricane animation script by just looking at the final animation! I guess that we bury our heads in the code and forget the big picture; she doesn't. That is why we need a diverse group.
But now I'm more curious why our code did not break! It should have. Investigating that... Anyway, we are at a point where we can choose the best of 3 imperfect metrics. IMO it is Matt's code with the fixed rounded days, when only glider days is the question we want to answer. There is no silver bullet for such a big dataset and so many small metadata problems.
Edit: The conversion of an empty value, not a NaN, to datetime returns an empty value.
That is a bad silent error :-/
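The silent behavior can be reproduced with a small sketch. The two rows below are made up, with the second mimicking a dataset whose minTime is empty; the column names follow the ones used in the script. It assumes pandas converts an empty string to `NaT`, which then propagates through the subtraction and is skipped by `sum()`.

```python
import pandas as pd

# Hypothetical rows; the second one mimics a dataset with an empty minTime.
df = pd.DataFrame(
    {
        "minTime (UTC)": ["2017-05-12T06:00:00Z", ""],
        "maxTime (UTC)": ["2017-05-14T18:00:00Z", "2017-05-16T18:13:46Z"],
    },
)

min_time = pd.to_datetime(df["minTime (UTC)"])
max_time = pd.to_datetime(df["maxTime (UTC)"])

# The empty string becomes NaT without any warning or exception.
print(min_time.isna().tolist())

# NaT propagates through the subtraction and is then skipped by sum().
duration = max_time.dt.ceil("D") - min_time.dt.floor("D")
print(duration.sum().days)
```

Only the first row contributes to the total, so the dataset with the missing minTime is dropped from the count without any signal. An explicit check for missing values before the conversion makes the problem visible.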
@kbailey-noaa happy to change to "data days"; however, when the machines take over the world and demand credit for their labor, I'm not answering for it ;-p
Jokes aside, the overestimation is 2 orders of magnitude smaller than both estimates. I don't think it is that bad.
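For reference, the size of the overestimate can be worked out from the two totals quoted earlier in this thread (75929 from the time_coverage metadata, 75308 from the data-based count). The variable names are only for illustration.

```python
metadata_days = 75929  # total from ERDDAP time_coverage metadata
strict_days = 75308  # total based strictly on data collected

overestimate = metadata_days - strict_days
relative = overestimate / strict_days

print(overestimate)  # 621
print(f"{relative:.2%}")  # 0.82%
```

A difference of 621 days on a total of roughly 75,000 is under one percent, which is the "2 orders of magnitude smaller" referred to above.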
Looks like you added some logic in there. Nice addition.