As a tester, I want to test the event based evaluation solution implemented in #130 #382

Open
13 of 15 tasks
HankHerr-NOAA opened this issue Jan 21, 2025 · 132 comments

@HankHerr-NOAA
Contributor

HankHerr-NOAA commented Jan 21, 2025

See #130. In this ticket, I'll track testing of the new capability. I'll start by building the latest code and running a very basic example. I'll then work up a test plan with different tests to perform, tracked via checkboxes in the description of this ticket (i.e., below). Note that testing will make use of the standalone, both in-memory and using a database. COWRES testing will come later.

Let me pull the code and make sure I can build it.

Thanks,

Hank

==========

Tests to be performed are below. As I work down the list, in some cases, tests higher in the list will be updated to include the mentioned capability being tested. I'll essentially be throwing the kitchen sink at the capability to see if/when something "breaks". Other tests may be added as I progress through this list.

@HankHerr-NOAA HankHerr-NOAA added this to the v6.29 milestone Jan 21, 2025
@HankHerr-NOAA HankHerr-NOAA self-assigned this Jan 21, 2025
@HankHerr-NOAA HankHerr-NOAA added the testing Testing of new capabiltiies label Jan 21, 2025
@HankHerr-NOAA
Contributor Author

As my initial evaluation, I used observations for ABRN1 streamflow (part of the HEFS Test A evaluations), and simulations for its NWM feature id (acquired from WRDS), and came up with this:

label: Testing Event Based
observed:
  label: OBS Streamflow
  sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
  variable: QME
  feature_authority: nws lid
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours
predicted:
  label: "19161749 RetroSim CSVs"
  sources:
  - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
  variable: streamflow
  feature_authority: nwm feature id
  type: simulations
features:
  - {observed: ABRN1, predicted: '19161749'}
time_scale:
  function: mean
  period: 24
  unit: hours

event_detection: observed

It appears as though 64 events were identified with standard statistics output; here is a sampling of the last few pools listed:

2025-01-21T18:05:54.537+0000  [Pool Thread 5] INFO PoolReporter - [60/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1985-09-25T06:00:00Z, Latest valid time: 1985-09-28T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.538+0000  [Pool Thread 2] INFO PoolReporter - [61/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1996-11-18T06:00:00Z, Latest valid time: 1996-12-22T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.552+0000  [Pool Thread 1] INFO PoolReporter - [62/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.562+0000  [Pool Thread 6] INFO PoolReporter - [63/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1998-11-04T06:00:00Z, Latest valid time: 1998-12-09T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )
2025-01-21T18:05:54.563+0000  [Pool Thread 4] INFO PoolReporter - [64/64] Completed statistics for a pool in feature group 'ABRN1-19161749'. The time window was: ( Earliest reference time: -1000000000-01-01T00:00:00Z, Latest reference time: +1000000000-12-31T23:59:59.999999999Z, Earliest valid time: 1998-03-21T06:00:00Z, Latest valid time: 1998-09-07T06:00:00Z, Earliest lead duration: PT-2562047788015215H-30M-8S, Latest lead duration: PT2562047788015215H30M7.999999999S )

I don't have a good way to view them graphically at the moment. Let me see if I can spin up a quick-and-dirty spreadsheet to support viewing the XML observations and CSV simulations.

Hank

@HankHerr-NOAA
Contributor Author

That 7-month event in 1987 is kind of odd: "Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z". Again, I need to visualize the time series so that I can understand where the events are coming from.

Hank
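
Until the events can be plotted, one quick check is to pull the valid-time bounds out of the PoolReporter log lines and compute each event's duration. The following is a minimal Python sketch, assuming the log wording shown above; the function names are illustrative and not part of the WRES.

```python
import re
from datetime import datetime, timezone

# Pattern for the valid-time bounds as they appear in the PoolReporter lines.
VALID_TIMES = re.compile(r"Earliest valid time: (\S+?), Latest valid time: (\S+?),")


def parse_instant(text):
    """Parse an ISO-8601 instant with a trailing 'Z' into an aware datetime."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def event_durations(log_lines):
    """Return (start, end, duration) for every line that reports a time window."""
    events = []
    for line in log_lines:
        match = VALID_TIMES.search(line)
        if match:
            start, end = (parse_instant(group) for group in match.groups())
            events.append((start, end, end - start))
    return events


# Two of the time windows reported above, abbreviated to the relevant part.
log = [
    "Earliest valid time: 1985-09-25T06:00:00Z, Latest valid time: 1985-09-28T06:00:00Z, Earliest lead duration: ...",
    "Earliest valid time: 1987-03-22T06:00:00Z, Latest valid time: 1987-10-03T06:00:00Z, Earliest lead duration: ...",
]
for start, end, duration in event_durations(log):
    print(start.date(), end.date(), duration.days, "days")
```

For the two windows shown, the durations are 3 days and 195 days, so sorting by duration makes the long events easy to spot.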

@HankHerr-NOAA
Contributor Author

HankHerr-NOAA commented Jan 21, 2025

Here is a plot of the observations and simulation for ABRN1 streamflow, with the NWM retrosim averaged to 24 hours ending at the times of the 24-hour observations (I believe that's how WRES would rescale it by default; observations are blue):

Image

It's crude, but the spreadsheet should allow me to focus in on individual events identified by the WRES to see if it makes sense. I'll start by examining the data for Mar 22, 1987, through Oct 3, 1987, which the WRES identified as one, long event.

Hank
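
The 24-hour averaging used for the spreadsheet can be sketched in a few lines of Python. This is only an illustration: it assumes each 24-hour value is the mean of the simulated values in the window that ends at the observation time, excluding the start and including the end, and the data below are made up.

```python
from datetime import datetime, timedelta


def mean_ending_at(series, end_time, period=timedelta(hours=24)):
    """Average the values whose valid times fall in (end_time - period, end_time].

    Returns None when no values fall inside the window.
    """
    start = end_time - period
    window = [value for time, value in series if start < time <= end_time]
    return sum(window) / len(window) if window else None


# Hypothetical hourly simulation: 1.0 for the first 24 hours, 3.0 for the next 24.
t0 = datetime(1987, 3, 21, 6)
hourly = [(t0 + timedelta(hours=h), 1.0 if h <= 24 else 3.0) for h in range(1, 49)]

# Hypothetical 24-hour observation times.
for obs_time in (datetime(1987, 3, 22, 6), datetime(1987, 3, 23, 6)):
    print(obs_time, mean_ending_at(hourly, obs_time))
```

Whether the WRES treats the window bounds in exactly this way is an assumption here, so check the rescaling documentation before relying on it.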

@HankHerr-NOAA
Contributor Author

Here is an image zoomed into about 3/22/1987 - 10/3/1987 (plus some buffer around it):

Image

I can't really look at that and understand why the WRES would see this as a single, long event, based on the observations. I'm now going to look at a short event to see if I can make sense of the output.

Hank

@james-d-brown
Collaborator

For what it's worth...

The events themselves won't always make a ton of sense and you will find that they are rather sensitive to the parameters.

It's the algorithm we have for now, but I am pretty certain it's an accurate implementation and it has a decent set of unit tests, including those ported across from python. There is probably no algorithm that produces completely satisfactory results, though.

Spending time looking at the events may lead to a different/better set of default parameter values, but it probably won't.

On the whole, it produces vaguely sensible looking events for synthetic time-series with strong peak signals. It starts to look more questionable for (some) real time-series.

I don't want to sway your UAT but, TBH, I am personally more concerned about the range of interactions between event detection and various other features and whether it produces any/sensible results across all of those various possibilities - there's only so much that can be covered with integration tests.

@HankHerr-NOAA
Contributor Author

James:

Thanks. I was about to post the same conclusion that the parameters are probably just not optimal for this location, given the number of multi-week/month events I'm seeing, and I'm not going to spend time trying to optimize them.

Next step is to generate outputs that make some sense to me. I'm going to add graphics to the evaluation to help visualize what the WRES produces.

Hank

@HankHerr-NOAA
Contributor Author

Oh, and understood about wanting me to look at the interactions between features. I'll start working on that once I can make sense of a "simple" case using real data.

Hank

@james-d-brown
Collaborator

On the visualization, as I noted in #130, if there were a quick win for visualizing these events, I would've implemented it, but there really isn't. The quickest way would be a sidecar in the event detection package that generated visualizations, but that is pretty yucky as it bypasses our graphics clients. The best way would be to add event detection as a metric as well as a declaration option in itself. That way, you could calculate and visualize the detected events alone using the normal workflow, but that was going to be a lot of work as it would require a new metric/statistics class with output composed of two instants (start and end times). Will have to wait but, honestly, it would probably be better for event detection to be a separate web service.

@HankHerr-NOAA
Contributor Author

Understood. That's why I'm trying to come up with something, myself, that I can do quickly enough to perform some basic checks of the output.

Here are the Pearson correlation coefficients for the various events:

Image

Okay. I think I need to just test out some of the different options provided in the wiki to see what happens.

Hank

@HankHerr-NOAA
Contributor Author

Oh, and if there are other metrics I should be looking at, let me know. I was thinking the time-to-peak metric probably won't be meaningful until I add forecasts to the evaluation, which I do plan to do. For now, since I'm evaluating simulations, I just looked at a couple of single-valued metrics and shared only the correlation.

Hank

@james-d-brown
Collaborator

The example in the wiki uses simulations. In general, event detection won't work for forecast datasets because they are too short and will only capture partial events, unless it's a long-range forecast, perhaps. Anyway, see the example in the wiki, which is supposed to be exemplary as much as illustrative. I think that is the best/most reasonable application of event detection as it stands, i.e., detecting events for both observations and predictions simultaneously and then comparing the detected events in terms of peak timing error (or possibly other hydrologic signatures that we could add in future). Anything with forecasts is going to be much more dubious, unless you use a non-forecast dataset for detection. Anything with traditional, pool-based, statistics is going to be somewhat dubious too, IMHO.

@HankHerr-NOAA
Contributor Author

Can I compute the average correlation across the identified events? I don't know how useful that would be; I just figured I'd give it a shot. Looking at the wiki, I don't think I can. I know you can summarize across feature pools, but I don't think we can summarize across reference date/time pools, right? I'll check the wiki.

For now, I just started an evaluation using the more complex declaration in,

https://github.com/NOAA-OWP/wres/wiki/Event-detection#how-do-i-declare-event-detection

just to see what happens.

Hank

@james-d-brown
Collaborator

No, that would need to be added as an aggregation option for summary statistics. I think I speculated in #130 about this, so probably worth a ticket for a future release. But again, I am personally a bit doubtful about traditional statistics generated for event periods. In most cases, this sort of analysis probably makes more sense with thresholds, like an analysis for flows above a flood threshold.

@james-d-brown
Collaborator

Anyway, I am going to mostly leave you alone now unless you have questions as I have probably already swayed you too much. I just wanted to emphasize that the events are quite sensitive to the parameters and the timing error type analysis probably makes most sense for this new feature, but users will do what they do and you are probably representing their thought process too...

@HankHerr-NOAA
Contributor Author

The run using the parameters in the aforementioned section of wiki does yield significantly different events:

Image

So, yeah, sensitive to parameters.

I'm going to try to work up a checklist of things to look at as part of this testing now. I still have questions, but they should be answered as I work through the tests.

Hank

@HankHerr-NOAA
Contributor Author

I have a checklist as a starting point. I'm sure that some of the items are nonsensical, but I want to see what happens when I combine different options. As I work through the list, I will likely use previously successful evaluations to add the new, specified feature, in order to see how the results are impacted. I'm probably overlooking tests to perform; I'll add those when I discover the oversight.

Thanks,

Hank

@HankHerr-NOAA
Contributor Author

FYI... all test declarations will be kept in the standard .../wresTestData/github382 directory. I've also already created a GitHub_382 folder in the WRES Redmine Ticket Large Data Sets folder in Google Drive for sharing data. It has the observed and simulated data sets I've used for testing so far (though they aren't particularly "large").

Hank

@HankHerr-NOAA
Contributor Author

I'm not sure I'm going to get to this today, except perhaps during the office hours (if no one is on). I've been working on the QPR slide deck and dealing with the proxy outage.

Hank

@HankHerr-NOAA
Contributor Author

Just talked with James some during the office hours. I ran an evaluation of single valued forecasts from WRDS against observations used for HEFS, with default event-based evaluation parameters, and obtained results for the pearson correlation coefficient, time to peak error, and sample size. First, we noticed that the evaluation.csv did not properly convey the reference_dates from the declaration; I need to write a ticket for that.

Second, James explained that each event period will yield time-to-peak errors for each single-valued forecast overlapping that period, and that each such error will be stored in the evaluation.csv with both the value and the forecast's corresponding issued time. It was hard for me to see this when looking at the CSV directly, but it became clearer when I looked at the CSV through Excel; here is a screenshot:

Image

Each time to peak error is presented as an issued time and value pair on two rows. The output image would then look like:

Image

Since I'm zoomed out so far, it appears that the points line up vertically, but that is actually not the case. If you pay close attention to the 0-line at the top, you'll see that they do not exactly overlap.

I think this is reasonable output given the declaration I employed. I'll do a bit more testing, though, before checking the single-valued forecast box.

Hank

@HankHerr-NOAA
Contributor Author

I reported #385 for the CSV issue. I'll pick up testing again when I can tomorrow. I'm not making as much progress as I had hoped.

Hank

@HankHerr-NOAA
Contributor Author

My single valued example run used ABRN1 RFC forecasts from WRDS. I opened the evaluation.csv in Excel, counted the time to peak errors computed for each identified event, and compared that with what WRDS returns for time series when I constrain the request to be for the time period of the event. They are identical, meaning there is one time to peak error per time series overlapping the event. That is what I expected. Good.

As an aside, ABRN1 appears to be one of those forecast points (presumably in ABRFC) where forecasts are only generated when needed. So, for example, WRES identified an event spanning Jun 5 - Aug 5, 2014. The forecasts for that point that overlap all have issued times of Jun 4, meaning that the RFC generated the forecast only when the event was on the horizon. As for the event being two months long, that was likely due to the parameter options as discussed before.

I believe there are summary statistic options for time series metrics. Let me see if I can find those and give it a shot.

Hank

@HankHerr-NOAA
Contributor Author

The first thing I found gave me one overall set of summary statistics for the time to peak error instead of one set per event. Let me revisit the event based wiki to see how I'm supposed to do it.

Hank

@HankHerr-NOAA
Contributor Author

No, I did it right. That number is intended to report the average time to peak error across all events. In other words, it answers the question: when an event occurs, what is the average time to peak error I can expect for forecasts of those events? If I want statistics for a single event, then I guess I would modify the declaration to focus on the single event of interest and run it again.

James: If that sounds wrong, please let me know.

I'm checking the single valued forecast evaluation. It works in a simple evaluation, which is the point of the checkbox, 'Evaluating single value forecasts for events (including time series metrics)'. More complicated stuff comes later.

Hank

@james-d-brown
Collaborator

Yeah, there is one "time to peak" for one "peak" aka one "event", so the "raw" numbers of the time-to-peak are the "per event" values and the summary statistics aggregate across all events.
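
To make the distinction between the raw values and the summary statistics concrete, here is a minimal Python sketch. The two event windows come from this thread, but the error values are made up, and pooling every raw error into one sample is an assumption about the aggregation, not a description of the WRES implementation.

```python
from datetime import timedelta
from statistics import mean, median

# Hypothetical time-to-peak errors keyed by event window (start, end).
# An event can contribute several raw errors, one per overlapping forecast.
errors_by_event = {
    ("2013-04-19T06:00:00Z", "2013-08-22T06:00:00Z"): [timedelta(hours=-6), timedelta(hours=12)],
    ("2014-06-05T06:00:00Z", "2014-08-05T06:00:00Z"): [timedelta(hours=18)],
}


def summarize(errors):
    """Pool the raw errors from all events and summarize them in hours."""
    hours = [e.total_seconds() / 3600 for values in errors.values() for e in values]
    return {"count": len(hours), "mean": mean(hours), "median": median(hours)}


print(summarize(errors_by_event))
```

Statistics for a single event would come from restricting the input to that event, which mirrors the idea of narrowing the declaration to the event of interest.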

@HankHerr-NOAA
Contributor Author

Thanks, James!

The next checkbox is for a basic ensemble forecast test. So I guess I'll point the declaration to the HEFS data for ABRN1 and see what happens.

Hank

@HankHerr-NOAA
Contributor Author

Ensemble forecasts don't allow for time to peak error:

- The declared or inferred data 'type' for the 'predicted' dataset is ensemble forecasts, but the following metrics are not currently supported for this data 'type': [TIME TO PEAK ERROR]. Please remove these metrics or change the data 'type'.

Makes sense: there is one different peak per member. Anyway, I'll use more traditional metrics, even if they aren't really as interesting in this case.

Hank

@HankHerr-NOAA
Contributor Author

So time_pools is not allowed:

Caused by: wres.config.yaml.DeclarationException: Encountered 1 error(s) in the declared evaluation, which must be fixed:
    - Event detection was declared alongside explicit time pools, which is not allowed because event detection also generates time pools. Please remove the declaration of either 'event_detection' or 'time_pools' and try again.

That's reasonable behavior, but I don't see it documented in the event detection or declaration language wiki. The latter is very large, so I could easily be overlooking it. Checking the box since it can't be tested further,

Hank

@HankHerr-NOAA
Contributor Author

As for thresholds, I've already tested probability_thresholds with HEFS event-detection evaluations. I'm going to try WRDS thresholds with retrospective simulation evaluations. Not sure if it makes sense; just testing the capability.

Hank

@james-d-brown
Collaborator

So time_pools is not allowed:

Caused by: wres.config.yaml.DeclarationException: Encountered 1 error(s) in the declared evaluation, which must be fixed:
    - Event detection was declared alongside explicit time pools, which is not allowed because event detection also generates time pools. Please remove the declaration of either 'event_detection' or 'time_pools' and try again.

That's reasonable behavior, but I don't see it documented in the event detection or declaration language wiki. The latter is very large, so I could easily be overlooking it. Checking the box since it can't be tested further,

Hank

Let me take another look at that. For example, let's say that event detection fails to identify a historically important event and a user wants to add it in alongside the detected events. Sure, that indicates a limitation with event detection, but I guess a user should be able to mitigate this. Assuming it isn't additional work, I lean towards demoting this to a warning rather than an error. I don't think it makes sense to allow in combination with a regular sequence of valid_date_pools, however, and this should produce an error.

@HankHerr-NOAA
Contributor Author

James:

Agree. New ticket, right?

Hank

@james-d-brown
Collaborator

Yeah, may as well, just for clarity.

@HankHerr-NOAA
Contributor Author

Created #406. I view that as an enhancement, so no need to hold up deployment whenever event detection goes out.

Hank

@HankHerr-NOAA
Contributor Author

The evaluation of NWM retrospective simulations against HEFS observations using WRDS thresholds did succeed for ABRN1. However, I'm having difficulty interpreting the results. Still examining,

Hank

@HankHerr-NOAA
Contributor Author

I think the results make sense, but are very limited by the weekly maximum time scale. Floods will typically be reduced to one or two points at that time scale, so it leads to lots of minimal sample sizes, which I do see in the evaluation CSV, but which are hard to see on the plots. I'm going to remove the weekly scale and see if the results are more interesting.

Hank

@HankHerr-NOAA
Contributor Author

Finally got something interesting... Here is the time to peak error plot using WRDS thresholds for ABRN1 and a 24-hour mean time scale to identify events:

Image

Obviously, there are a lot of events without values for some thresholds, which is why some lines are jumping across events, but it all seems plausible. Worth noting that, if a higher threshold includes a point for a specific event, then all of the lower thresholds will, as well, which also makes sense.

I'm not sure how far to go with this check, but this seems like it's working to me. I'll go ahead and check the box,

Hank
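
The nesting of thresholds noted above follows directly from the "greater" operator: any flow that exceeds a higher threshold also exceeds every lower one. A minimal Python sketch, using made-up threshold names and values rather than the real WRDS thresholds for ABRN1:

```python
def thresholds_exceeded(peak_flow, thresholds):
    """Return the names of all 'greater than' thresholds that a peak flow exceeds."""
    return [name for name, value in thresholds.items() if peak_flow > value]


# Hypothetical flow thresholds in ascending order (not real values for ABRN1).
thresholds = {"action": 100.0, "minor": 200.0, "moderate": 400.0, "major": 800.0}

# Hypothetical observed peak flows for three events.
for peak in (150.0, 450.0, 50.0):
    print(peak, thresholds_exceeded(peak, thresholds))
```

An event whose peak stays below the lowest threshold returns an empty list, which corresponds to the events that have no point on any threshold line in the plot.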

@HankHerr-NOAA
Contributor Author

Here are two checkboxes I have not tested yet and my plans for each:

"Evaluating with other temporal pools (what is the interaction?)": I've already done various combinations of temporal pools, within the constraints of the current capability (e.g., no time_pools allowed), but I'll check things out again to see if there is anything more to test.

"Evaluating events for multiple features pooled together (if that is even possible; check the wiki)": I honestly have no idea what to expect with this test, and it's worth noting that a problem with multiple features treated independently resulted in a bug, #400.

Should happen this afternoon,

Hank

@HankHerr-NOAA
Contributor Author

Have to remind myself how to declare feature_groups, because I don't see that topic covered in the declaration language wiki. Perhaps it was an intentional oversight to keep the documentation simple. Based on the schema, I think I just wrap feature_groups around features and optionally name it. Giving that a shot using my (failing) 3 feature evaluation to see what happens,

Hank

@HankHerr-NOAA
Contributor Author

Looks like I got my answer:

Caused by: wres.config.yaml.DeclarationException: Encountered 1 error(s) in the declared evaluation, which must be fixed:
    - Event detection was declared alongside feature groups, which is not currently supported. Please remove the declaration of 'feature_groups' or a 'feature_service' with a 'group' whose features will be pooled together (i.e., 'pool: true'), as applicable, and try again. Alternatively, please remove 'event_detection'. Hint: summary statistics are supported alongside event detection if your goal is to compute statistics across events for multiple geographic features.

It's not supported. I was wondering how we could logically group multiple features when doing event detection, and I guess we can't.

What I'll try again is what is hinted at at the end: computing summary statistics for a group of features.

Hank

@HankHerr-NOAA
Contributor Author

Oh, actually the 3 feature evaluation with independent handling of features is needed for that, and that is currently resulting in unexpected events. I'll check the box associated with feature groups and just make sure to look at summary statistics when I test the independent features checkbox.

Hank

@HankHerr-NOAA
Contributor Author

Looking at temporal pools...

My HEFS evaluations included lead time pools, and I was able to generate graphics with lead time on the domain axis. That's one test.

James said this above:

I don't think it makes sense to allow in combination with a regular sequence of valid_date_pools, however, and this should produce an error.

Let me confirm I can produce that error,

Hank

@HankHerr-NOAA
Contributor Author

Error confirmed:

Caused by: wres.config.yaml.DeclarationException: Encountered 1 error(s) in the declared evaluation, which must be fixed:
    - Event detection was declared alongside valid date pools, which is not allowed because event detection also generates valid date pools. Please remove the declaration of either 'event_detection' or 'valid_date_pools' and try again.

Testing reference_date_pools next. Expecting a failure there, as well,

Hank

@HankHerr-NOAA
Contributor Author

I was wrong. Reference date pooling is allowed. Examining the output to see if I can make sense of it (I might struggle since I put next to no thought into how I declared it),

Hank

@james-d-brown
Collaborator

Yes, lead times and reference date pools are allowed. There is an entire wiki on pooling features:

https://github.com/NOAA-OWP/wres/wiki/Pooling-geographic-features

I guess we need to reference it in the main wiki, though.

@james-d-brown
Collaborator

james-d-brown commented Feb 6, 2025

@HankHerr-NOAA
Contributor Author

The pools:

reference_dates:
  minimum: 2000-08-07T23:00:00Z
  maximum: 2015-08-08T23:00:00Z

reference_date_pools:
  period: 365
  frequency: 180
  unit: days

When those pools are included, I get this warning:

Event detection was declared alongside reference date pools, which is allowed, but may not be intended. A separate pool will be generated for each combination of event and reference date pool. If this is not intended, please remove either 'event_detection' or 'reference_date_pools' and try again.

Here is the plot of Pearson correlation coefficients when NO reference_date_pools are declared:

Image

Here are the reported Pearson correlation coefficients when reference_date_pools are declared:

Image

So the domain axis is the issued time window center while the valid time windows (the events) appear in the legend. Okay. The events can overlap multiple windows, which is why some event series have 2 or even 3 points (the 3 point one is an event from 3/8/2010 - 10/2/2010 which is 208 days wide).
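
The overlap between events and reference date pools can be checked with a short Python sketch. It assumes that the first pool starts at the reference_dates minimum, that each pool is 365 days wide, that a new pool starts every 180 days, and that pools ending after the maximum are dropped. It also uses the event's valid-time span as a rough proxy for the issued times of the forecasts that cover it. None of this is taken from the WRES code.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc


def reference_date_pools(minimum, maximum, period, frequency):
    """Generate (start, end) windows of width 'period', one every 'frequency'.

    Assumption: the first window starts at 'minimum' and windows that would
    end after 'maximum' are dropped.
    """
    pools = []
    start = minimum
    while start + period <= maximum:
        pools.append((start, start + period))
        start += frequency
    return pools


def overlapping(pools, event_start, event_end):
    """Return the windows that intersect the period from event_start to event_end."""
    return [(s, e) for s, e in pools if s < event_end and e > event_start]


pools = reference_date_pools(
    minimum=datetime(2000, 8, 7, 23, tzinfo=UTC),
    maximum=datetime(2015, 8, 8, 23, tzinfo=UTC),
    period=timedelta(days=365),
    frequency=timedelta(days=180),
)

# Event from the legend: 2010-03-08T06:00:00Z to 2010-10-02T06:00:00Z.
hits = overlapping(
    pools,
    datetime(2010, 3, 8, 6, tzinfo=UTC),
    datetime(2010, 10, 2, 6, tzinfo=UTC),
)
for start, end in hits:
    print(start.isoformat(), end.isoformat())
```

Under these assumptions the 2010 event intersects three windows, which is consistent with its three points on the plot, and the generated windows include the one that appears in the log (2011-06-11T23:00:00Z to 2012-06-10T23:00:00Z).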

Here are the unique events identified when no reference_date_pools are declared:

[hank.herr@owpal-d-ised01 single_valued_ahps_declaration.yml.output]$ awk -F "\"*,\"*" '{print $22,$23}' evaluation.csv | sort | uniq
2007-10-15T06:00:00Z 2007-12-10T06:00:00Z
2008-04-14T06:00:00Z 2008-09-22T06:00:00Z
2009-04-28T06:00:00Z 2009-08-24T06:00:00Z
2010-03-08T06:00:00Z 2010-10-02T06:00:00Z
2013-04-19T06:00:00Z 2013-08-22T06:00:00Z
2014-06-05T06:00:00Z 2014-08-05T06:00:00Z
2015-05-08T06:00:00Z 2015-11-02T06:00:00Z

Here are the unique events when reference_date_pools are declared, as shown above:

[hank.herr@owpal-d-ised01 single_valued_ahps_declaration.temporal_pools.yml.output]$ awk -F "\"*,\"*" '{print $22,$23}' evaluation.csv | sort | uniq
2007-10-15T06:00:00Z 2007-12-10T06:00:00Z
2008-04-14T06:00:00Z 2008-09-22T06:00:00Z
2009-04-28T06:00:00Z 2009-08-24T06:00:00Z
2010-03-08T06:00:00Z 2010-10-02T06:00:00Z
2013-04-19T06:00:00Z 2013-08-22T06:00:00Z
2014-06-05T06:00:00Z 2014-08-05T06:00:00Z
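
Rather than comparing the two listings by eye, a set difference does the job. A minimal Python sketch that parses the "start end" lines exactly as printed by the awk command above:

```python
def parse_events(listing):
    """Parse 'start end' lines, as printed by the awk command, into a set of tuples."""
    return {tuple(line.split()) for line in listing.strip().splitlines() if line.strip()}


without_pools = parse_events("""
2007-10-15T06:00:00Z 2007-12-10T06:00:00Z
2008-04-14T06:00:00Z 2008-09-22T06:00:00Z
2009-04-28T06:00:00Z 2009-08-24T06:00:00Z
2010-03-08T06:00:00Z 2010-10-02T06:00:00Z
2013-04-19T06:00:00Z 2013-08-22T06:00:00Z
2014-06-05T06:00:00Z 2014-08-05T06:00:00Z
2015-05-08T06:00:00Z 2015-11-02T06:00:00Z
""")

with_pools = parse_events("""
2007-10-15T06:00:00Z 2007-12-10T06:00:00Z
2008-04-14T06:00:00Z 2008-09-22T06:00:00Z
2009-04-28T06:00:00Z 2009-08-24T06:00:00Z
2010-03-08T06:00:00Z 2010-10-02T06:00:00Z
2013-04-19T06:00:00Z 2013-08-22T06:00:00Z
2014-06-05T06:00:00Z 2014-08-05T06:00:00Z
""")

# Events present without reference_date_pools but absent with them.
print(sorted(without_pools - with_pools))
```

The only difference is the event from 2015-05-08T06:00:00Z to 2015-11-02T06:00:00Z.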

It appears as though the last pool in 2015 is lost, which may be because the reference_dates maximum is about midway through that event. Looking at the log, I do see mention of the date "2015-05-08" as an earliestValidTime. But then later, I see that pools did not result in any pairs; for example:

2025-02-06T18:41:33.128+0000 [Pool Thread 5] WARN PoolSupplier - When evaluating a pool for time window TimeWindowOuter[earliestReferenceTime=2011-06-11T23:00:00Z,latestReferenceTime=2012-06-10T23:00:00Z,earliestValidTime=2015-05-08T06:00:00Z,latestValidTime=2015-11-02T06:00:00Z,earliestLeadDuration=PT-2562047788015215H-30M-8S,latestLeadDuration=PT2562047788015215H30M7.999999999S], failed to identify any pairs for feature: Feature[name=ABRN1,description=Auburn,srid=0,wkt=]. Pairs were available for these features: [].

I believe that including reference date pools, and thereby capping the maximum reference date for the WRDS RFC forecasts, left no data to evaluate that overlaps with that last event. Remember that this is ABRN1, which I think is a dry point in Nebraska, and ABRFC only prepares forecasts when it thinks it's necessary for some of its points.

Bottom line, I'm seeing one point per event and per pool, unless there are no pairs, so I'm seeing what I should see. Awesome.

Checking the box,

Hank
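
For reference, the interactions between event detection and the pooling options tested so far in this thread can be summarized in a small lookup. This is a Python sketch of the observed outcomes only, not the WRES validation code, and the key `lead_time_pools` is an assumed spelling of the lead time pooling option.

```python
# Outcomes observed in this thread when each option is declared with event_detection.
# This summarizes test results; it is not the WRES validation code.
OBSERVED = {
    "time_pools": "error",
    "valid_date_pools": "error",
    "feature_groups": "error",
    "reference_date_pools": "warning",
    "lead_time_pools": "allowed",  # assumed key name for lead time pooling
}


def check(declaration):
    """Return (errors, warnings) for options declared alongside event_detection."""
    errors, warnings = [], []
    if "event_detection" not in declaration:
        return errors, warnings
    for option, outcome in OBSERVED.items():
        if option in declaration:
            if outcome == "error":
                errors.append(option)
            elif outcome == "warning":
                warnings.append(option)
    return errors, warnings


# Hypothetical declaration, reduced to the keys of interest.
declaration = {
    "event_detection": "observed",
    "valid_date_pools": {},
    "reference_date_pools": {"period": 365, "frequency": 180, "unit": "days"},
}
print(check(declaration))
```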

@HankHerr-NOAA
Contributor Author

For today, I think I'm done with testing. I do plan to do additional tests, mixing and matching features to see if I encounter unexpected exceptions or results, but that will happen tomorrow.

We still have two checkboxes unchecked waiting for tickets to be fixed. See the list of checkboxes in the ticket description for the blocking issues.

That's it for today,

Hank

@HankHerr-NOAA
Contributor Author

I didn't even look for the pooling features wiki. Oops. With it referenced now, it should be easier to spot.

Thanks,

Hank

@HankHerr-NOAA
Contributor Author

This is going to be tracked with 6.30.

I moved the development ticket to 6.30 for tracking purposes, as well.

Hank

@HankHerr-NOAA HankHerr-NOAA modified the milestones: v6.29, v6.30 Feb 6, 2025
@HankHerr-NOAA
Contributor Author

Next week, I'll continue testing some combinations of features as time allows. When the tickets associated with unchecked checkboxes are addressed, I'll test them as well.

Thanks,

Hank

@HankHerr-NOAA
Contributor Author

I just ran a few of the evaluations I ran toward the end of last week to double check that they execute using a database correctly. In general, the standard scores look okay.

But, when I ran an evaluation of NWM retrospective simulations using the time-to-peak error, I noted that there were differences between in-memory and database for the graphic, ABRN1_19161749_19161749_RetroSim_CSVs_TIME_TO_PEAK_ERROR.png. In fact, I noticed that the output was not deterministic: that graphic would change with every run whether or not it used a database.

I'm going to post a ticket,

Hank
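
A simple way to spot non-deterministic outputs is to hash every file in two output directories and list the names whose contents differ. The following is a minimal Python sketch; the directory names in the usage comment are hypothetical. A byte-level comparison can also flag harmless differences, such as embedded timestamps or row ordering, so a reported difference still needs a manual look.

```python
import hashlib
from pathlib import Path


def digests(directory):
    """Map each file name in a directory to the SHA-256 digest of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


def differing_files(run_a, run_b):
    """Return the names of files that are missing from one run or differ in content."""
    a, b = digests(run_a), digests(run_b)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Usage with two hypothetical output directories from repeated runs:
# print(differing_files("run1.output", "run2.output"))
```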

@HankHerr-NOAA
Contributor Author

I think the problem I'm spotting is already covered in #399. The declaration, shared below, uses sampling_uncertainty, and the diagram I'm looking at is that for time to peak error. I think, when combined with thresholds, the issue becomes more obvious. Fix #399 and the outputs using thresholds will likely also be fixed.

Nothing new here,

Hank

==========

label: Testing Event Based
observed:
  label: OBS Streamflow
  sources: /home/ISED/wres/wresTestData/issue92087/inputs/ABRN1_QME.xml
  variable: QME
  feature_authority: nws lid
  type: observations
  time_scale:
    function: mean
    period: 24
    unit: hours

predicted:
  label: "19161749 RetroSim CSVs"
  sources: 
  - /home/ISED/wres/nwm_3_0_retro_simulations/wfo/OAX/19161749_nwm_3_0_retro_wres.csv.gz
  variable: streamflow
  feature_authority: nwm feature id
  type: simulations

features:
  - {observed: ABRN1, predicted: '19161749'}

# 24-hour mean time scale
time_scale:
  function: mean
  period: 24
  unit: hours

event_detection: observed

sampling_uncertainty:
  sample_size: 1000
  quantiles: [0.05,0.95]

metrics:
  - time to peak error
  - sample size
  - pearson correlation coefficient

threshold_sources:
- uri: https://WRDS/api/location/v3.0/nws_threshold
  operator: greater
  apply_to: observed
  type: value
  parameter: flow
  provider: NWS-NRLDB
  rating_provider: NRLDB
  feature_name_from: observed

output_formats:
  - csv2
  - png
  - pairs

@HankHerr-NOAA
Contributor Author

Other than the issue with time to peak error, #399, in-memory and database runs appear identical.

Hank

@james-d-brown
Collaborator

Please see my recent comments in #397.
