
Refactor metrics in the IOOS_BTN.ipynb notebook #53

Merged · 13 commits · Feb 13, 2024
2 changes: 1 addition & 1 deletion .github/workflows/metrics.yml
@@ -2,7 +2,7 @@ name: Collect quarterly metrics

on:
  push:
    branches:
      - main
    paths:
      - '.github/workflows/metrics.yml'
25 changes: 25 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,25 @@
name: Full Tests

on:
  pull_request:
  push:
    branches: [main]

jobs:
  run:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - name: Setup Micromamba
        uses: mamba-org/setup-micromamba@v1
        with:
          init-shell: bash
          environment-file: conda-lock.yml
          environment-name: TEST

      - name: Tests
        shell: bash -l {0}
        run: |
          python -m pytest -rxs tests
41 changes: 41 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,41 @@
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: check-ast
      - id: debug-statements
      - id: end-of-file-fixer
      - id: check-added-large-files

  - repo: https://github.com/psf/black
    rev: 24.1.1
    hooks:
      - id: black
        language_version: python3

  - repo: https://github.com/codespell-project/codespell
    rev: v2.2.6
    hooks:
      - id: codespell

  - repo: https://github.com/asottile/add-trailing-comma
    rev: v3.1.0
    hooks:
      - id: add-trailing-comma

  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.2.1
    hooks:
      - id: ruff

ci:
  autofix_commit_msg: |
    [pre-commit.ci] auto fixes from pre-commit.com hooks

    for more information, see https://pre-commit.ci
  autofix_prs: false
  autoupdate_commit_msg: '[pre-commit.ci] pre-commit autoupdate'
  autoupdate_schedule: monthly
  skip: []
  submodules: false
7 changes: 3 additions & 4 deletions README.md
@@ -7,9 +7,9 @@ Requirements:
## Website
Leveraged existing resources from https://github.com/noaa-fisheries-integrated-toolbox/toolbox_web_templating.

The webpages are built from the `website/` directory.

| File(s) | Description
|---------------------------------------|---------------------------------------------------------------
| `*_config.json` | configuration for what resources to present on the webpages.
| `create_asset_inventory_page.py` | script to create https://ioos.github.io/ioos_metrics/asset_inventory.html
@@ -33,7 +33,6 @@ All the webpages will be saved to `website/deploy`. You can view the local html

## Deployment

The website is generated using GitHub Actions and GitHub Pages. The python scripts, referenced above, are run and the
directory `website/deploy` is then uploaded as an artifact for GitHub Pages to serve as a website.
This process runs automatically with every push to the `main` branch. See [here](https://github.com/ioos/ioos_metrics/blob/main/.github/workflows/website_create_and_deploy.yml).

23,659 changes: 12,489 additions & 11,170 deletions conda-lock.yml

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion environment.yml
@@ -2,7 +2,7 @@ name: ioos-metrics
channels:
- conda-forge
dependencies:
- python=3.11
- python=3.12
- bs4
- eccodes
- fiscalyear
@@ -13,5 +13,7 @@ dependencies:
- owslib
- pandas
- pdbufr
- pytest
- requests
- suds
- pyarrow
Empty file added ioos_metrics/__init__.py
207 changes: 207 additions & 0 deletions ioos_metrics/ioos_metrics.py
@@ -0,0 +1,207 @@
"""
Code extracted from IOOS_BTN.ipynb
"""

import io
import warnings

import pandas as pd
import requests
from bs4 import BeautifulSoup

_HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36",
}


def previous_metrics():
    """
    Loads the previous metrics as a DataFrame for updating.
    """
    df = pd.read_csv(
        "https://github.com/ioos/ioos_metrics/raw/main/ioos_btn_metrics.csv",
    )

    number_cols = [
        "Federal Partners",
        "Regional Associations",
        "HF Radar Stations",
        "NGDAC Glider Days",
        "National Platforms",
        "Regional Platforms",
        "ATN Deployments",
        "MBON Projects",
        "OTT Projects",
        "HAB Pilot Projects",
        "QARTOD Manuals",
        "IOOS Core Variables",
        "Metadata Records",
        "IOOS",
        "COMT Projects",
    ]
    df[number_cols] = df[number_cols].astype("Int64")
    return df
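The nullable `"Int64"` dtype matters here: a plain numpy `int64` column cannot hold missing values, so a metrics CSV with gaps would otherwise be coerced to floats. A minimal illustration with made-up values:

```python
import pandas as pd

# Pandas' nullable integer dtype keeps missing entries as <NA>
# instead of silently converting the whole column to float.
s = pd.Series([17, None], dtype="Int64")
print(s.dtype)         # Int64
print(s.isna().sum())  # 1
```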


def federal_partners():
    """
    ICOOS Act/COORA
    Typically 17, from https://ioos.noaa.gov/community/national#federal.
    """
    url = "https://ioos.noaa.gov/community/national#federal"

    html = requests.get(url, headers=_HEADERS).text

    df = pd.read_html(io.StringIO(html))
    df_clean = df[1].drop(columns=[0, 2])
    df_fed_partners = pd.concat([df_clean[1], df_clean[3]]).dropna().reset_index()
    return df_fed_partners.shape[0]


def ngdac_gliders(start_date="2000-01-01", end_date="2023-12-31"):
    """
    NGDAC Glider Days

    Gliders monitor water currents, temperature, and conditions that reveal effects from storms,
    impacts on fisheries, and the quality of our water.
    This information creates a more complete picture of what is happening in the ocean,
    as well as trends scientists might be able to detect.

    U.S. IOOS began counting "Glider days" in 2008 with the intent to better coordinate across
    U.S. glider operations and to increase the data sharing and data management of this technology.
    One "Glider Day" is defined as 1 glider in the water collecting data for 1 day.

    From https://gliders.ioos.us/erddap/info/index.html?page=1&itemsPerPage=1000
    Cumulative from 2008 - present.

    Conditions on our calculations:
    * drops all datasets with `datasetID` containing `delayed`.
    * duration is calculated based on the metadata ERDDAP generates (time_coverage), which usually
      over-estimates a bit b/c it includes empty data (NaN). Note that data with NaN can be a real
      glider day with lost data, which is OK for this metric.
Member Author
@MathewBiddle I capture our slack convo here. Let me know if this is sufficient for future understanding of what we are doing.

@kbailey-noaa this script can compute glider days much faster than my notebook or gdutils b/c it uses only the time_coverage metadata from the allDatasets entry in ERDDAP. The main difference is that it will "over-estimate" the days b/c it will count the time with NaNs data as part of the glider day.

If we see this as "the glider was in the water anyway" this metric is "more correct" than my notebook or gdutils. If we want to compute glider days strictly based on data collected, then this one overestimates by 621 days (75929 against 75308 for this time period).

kbailey-noaa, Feb 8, 2024

@ocefpaf Editing this comment:
Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0.
And, overestimating by > 1 year's worth of days feels like too much.
From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

Contributor

@MathewBiddle Thanks haha. It might be moot though. Not sure if you saw, but I updated my comment above to specify glider days = # days the instrument was in the water collecting data...

Member Author @ocefpaf, Feb 9, 2024

@kbailey-noaa kudos to finding the only dataset with no minTime 👏

@MathewBiddle I'm not sure how Kathy's eyes work, but you should have seen how many flaws in my code (and the data) she found when we were doing the hurricane animation script, just by looking at the final animation! I guess we bury our heads in the code and forget the big picture; she doesn't. That is why we need a diverse group.

But now I'm more curious why our code did not break! It should have. Investigating that... Anyway, we are at a point where we can choose the best of 3 imperfect metrics. IMO it is Matt's code with the fixed rounded days, when glider days is the only question we want to answer. There is no silver bullet for such a big dataset and so many small metadata problems.


Edit: converting an empty string (not a NaN) to datetime silently returns an empty value:

pd.to_datetime(df.loc[df["datasetID"] == "Nemesis-20170512T0000"]["minTime (UTC)"], errors="raise")

That is a bad silent error :-/
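A minimal offline reproduction of that silent failure, using made-up dates: pandas converts an empty string to NaT without raising, even with `errors="raise"`.

```python
import pandas as pd

# The empty string slips through errors="raise" and becomes NaT --
# exactly the silent error described above.
s = pd.Series(["2017-05-12", ""])
converted = pd.to_datetime(s, errors="raise")
print(converted.isna().sum())  # 1 -- no exception was raised
```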

Member Author

> @ocefpaf Editing this comment: Glider days = # days the instrument was in the water collecting data. Is that how your notebook and gdutils compute days (hopefully)? If a glider is deployed and in the water for 1 week but doesn't collect any data then it's worthless and days = 0. And, overestimating by > 1 year's worth of days feels like too much. From a metrics gathering standpoint, we're more interested in the data the gliders collect rather than the duration of the platform itself. Glider days is a measure of success related to data collection.

@kbailey-noaa happy to change to "data days" however, when the machines take over the world and demand credit for their labor, I'm not answering for it ;-p

Jokes aside, the overestimation is 2 orders of magnitude smaller than both estimates. I don't think it is that bad.

Contributor

Looks like you added some logic in there. Nice addition.

UserWarning: The following rows have missing data:
    minTime (UTC)         maxTime (UTC)              datasetID
752           NaN  2017-08-02T19:16:37Z  Nemesis-20170512T0000
  warnings.warn(f"The following rows have missing data:\n{rows}")

    """
    df = pd.read_csv(
        "https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime,maxTime,datasetID",
    )

    # We don't want allDatasets in our numbers.
    df = df.loc[~(df["datasetID"] == "allDatasets")]

    # Check if any value is NaN and report it.
    if df.isnull().sum().sum():
        rows = df.loc[df.isnull().sum(axis=1).astype(bool)]
        warnings.warn(f"The following rows have missing data:\n{rows}")

    df = df.dropna(axis=0)

    # Drop delayed datasets.
    df = df.loc[~df["datasetID"].str.contains("delayed")]

    df[["minTime (UTC)", "maxTime (UTC)"]] = df[
        ["minTime (UTC)", "maxTime (UTC)"]
    ].apply(pd.to_datetime)

    # Round maxTime up and minTime down to whole days before summing.
    df = df["maxTime (UTC)"].apply(lambda x: x.ceil("D")) - df["minTime (UTC)"].apply(
        lambda x: x.floor("D"),
    )
    return df.sum().days
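The ceil/floor rounding above is what produces the slight over-count discussed in the review thread; a toy example with made-up timestamps:

```python
import pandas as pd

# A glider in the water for ~1.75 days is counted as 3 glider days,
# because maxTime is rounded up and minTime rounded down to whole days.
min_time = pd.Timestamp("2020-01-01 12:00")
max_time = pd.Timestamp("2020-01-03 06:00")
days = (max_time.ceil("D") - min_time.floor("D")).days
print(days)  # 3
```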


def comt():
    """
    The COMT serves as a conduit between the federal operational and research communities and
    allows sharing of numerical models, observations, and software tools.
    The COMT supports integration, comparison, scientific analyses, and archiving of data and
    model output needed to elucidate, prioritize, and resolve federal and regional operational
    coastal ocean issues associated with a range of existing and emerging coastal oceanic,
    hydrologic, and ecological models.
    The Testbed has enabled significant community building (within the modeling community as well
    as enhancing academic and federal operational relations), which has dramatically improved
    model development.

    Number of Active Projects via personal communication from COMT program manager.
    """
    comt = 0  # avoid NameError if the heading is not found
    url = "https://ioos.noaa.gov/project/comt/"

    html = requests.get(url, headers=_HEADERS).text
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all("h2"):
        if tag.text == "Current Projects":
            comt = len(tag.next_sibling.find_all("li"))

    return comt
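An offline sketch of the scraping pattern in `comt()`, run against a hypothetical HTML snippet since the real count depends on the live page:

```python
from bs4 import BeautifulSoup

# The <ul> immediately following the "Current Projects" heading is
# tag.next_sibling; count its <li> entries.
html = "<h2>Current Projects</h2><ul><li>Project A</li><li>Project B</li></ul>"
soup = BeautifulSoup(html, "html.parser")
count = 0
for tag in soup.find_all("h2"):
    if tag.text == "Current Projects":
        count = len(tag.next_sibling.find_all("li"))
print(count)  # 2
```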


def regional_associations():
    ras = 0
    url = "https://ioos.noaa.gov/regions/regions-at-a-glance/"

    html = requests.get(url, headers=_HEADERS).text
    soup = BeautifulSoup(html, "html.parser")

    for tag in soup.find_all("a"):
        if tag.find("strong") is not None:
            ra = tag.find("strong").text
            # TODO: change to log
            # print(f"Found RA {ra}")
            ras += 1

    return ras


def update_metrics():
    """
    Load previous metrics and update the spreadsheet.
    """
    df = previous_metrics()

    federal_partners_number = federal_partners()
    glider_days = ngdac_gliders()
    comt_number = comt()
    ras = regional_associations()

    _TODO = [
        # "NGDAC Glider Days", (TODO: change to data days)
        "HF Radar Stations",  # It is a hardcoded number at the moment
        "National Platforms",
        "Regional Platforms",
        "ATN Deployments",
        "MBON Projects",
        "OTT Projects",
        "HAB Pilot Projects",
        "QARTOD Manuals",
        "IOOS Core Variables",
        "Metadata Records",
        "IOOS",
    ]

    today = pd.Timestamp.strftime(pd.Timestamp.today(tz="UTC"), "%Y-%m-%d")
    new_metric_row = pd.DataFrame(
        [today, federal_partners_number, glider_days, comt_number, ras],
        index=[
            "date_UTC",
            "Federal Partners",
            "NGDAC Glider Days",
            "COMT Projects",
            "Regional Associations",
        ],
    ).T
    # Only update numbers if it's a new day.
    if today not in df["date_UTC"].to_list():
        df = pd.concat(
            [df, new_metric_row],
            ignore_index=True,
            axis=0,
        )

    return df
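The new-day guard at the end of `update_metrics()` can be exercised offline with a toy frame (all values made up for the example):

```python
import pandas as pd

# Appending is skipped when today's date is already present,
# so re-running the job on the same day is idempotent.
df = pd.DataFrame({"date_UTC": ["2024-02-12"], "Federal Partners": [17]})
today = "2024-02-13"  # hypothetical "today" for the example
if today not in df["date_UTC"].to_list():
    new_row = pd.DataFrame([[today, 17]], columns=["date_UTC", "Federal Partners"])
    df = pd.concat([df, new_row], ignore_index=True)
print(len(df))  # 2
```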
12 changes: 12 additions & 0 deletions pyproject.toml
@@ -0,0 +1,12 @@
[tool.ruff]
select = [
    "A",    # flake8-builtins
    "B",    # flake8-bugbear
    "C4",   # flake8-comprehensions
    "F",    # flakes
    "I",    # import sorting
    "T20",  # flake8-print
    "UP",   # upgrade
]
target-version = "py38"
line-length = 79
44 changes: 44 additions & 0 deletions tests/test_metrics.py
@@ -0,0 +1,44 @@
import sys

import pandas as pd
import pytest

sys.path.append("..")

from ioos_metrics import ioos_metrics


@pytest.fixture
def df_previous_metrics():
    return ioos_metrics.previous_metrics()


def test_previous_metrics(df_previous_metrics):
    assert isinstance(df_previous_metrics, pd.DataFrame)
    assert not df_previous_metrics.empty


def test_federal_partners():
    num = ioos_metrics.federal_partners()
    # must be an integer and cannot be less than 0
    assert isinstance(num, int)
    assert num >= 0


def test_ngdac_gliders(df_previous_metrics):
    num = ioos_metrics.ngdac_gliders()
    assert isinstance(num, int)
    # New count should always be >= the previous one.
    assert num >= df_previous_metrics["NGDAC Glider Days"].iloc[-1]


def test_comt():
    num = ioos_metrics.comt()
    assert isinstance(num, int)
    assert num >= 0


def test_regional_associations():
    num = ioos_metrics.regional_associations()
    assert isinstance(num, int)
    assert num >= 0