
Timeseries reconciliation #50

Open · intarga opened this issue Jan 7, 2025 · 42 comments

intarga (Member) commented Jan 7, 2025

Make a script to help content managers find and combine duplicate timeseries from different sources

intarga added the enhancement label on Jan 7, 2025
intarga added this to the Beta Release milestone on Jan 7, 2025
ketilt (Collaborator) commented Jan 7, 2025

I could probably do this. Would this make more sense as an integral part of LARD, or located somewhere else in the architecture?

intarga (Member, Author) commented Jan 8, 2025

I think it makes sense to do this in the ingestor (though isolated enough that we can break it out later), since that already runs an HTTP server and has a database connection pool.

I suggest breaking down the task like so:

  • Informally set out the algorithm for reconciliation on paper, defining a priority order for the data sources
  • Write a SQL query that finds reconciliation candidates, and another that resolves them (a sketch follows after this list)
  • Design a web UI
  • Write some plumbing handler functions in the ingestor to connect the web UI to the queries
  • Write some test cases to ensure this works as expected
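For task 2, a rough sketch of the shape the candidate-finding query could take (the labels.met column names here, station_id/param_id/type_id/lvl/sensor, are placeholders rather than the actual schema):

-- Find met labels that occur for more than one timeseries, i.e. the same
-- canonical series ingested from different sources.
SELECT station_id, param_id, type_id, lvl, sensor,
       array_agg(timeseries) AS candidate_timeseries
FROM labels.met
GROUP BY station_id, param_id, type_id, lvl, sensor
HAVING count(*) > 1;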

Please reach out for help at any point; I imagine tasks 4 and 5 especially will go a lot smoother with some help from Manuel or me. We can also probably offer useful insight into the structure of the tables for task 2.

ketilt (Collaborator) commented Feb 25, 2025

I'll start sketching out the algorithm first. At some point it would be useful to have SQL access to something to run tests. Where should I start towards that goal?

Lun4m (Collaborator) commented Feb 25, 2025

We have this guide (setup 0 and 3 are the only relevant steps), but I'll send you a message.

intarga (Member, Author) commented Feb 27, 2025

Rather than running queries on the prod postgres, I think it would be better to use the integration tests for this. They set up a local postgres db with the schema, and you load it up with some test case data (in this case, I guess, some stripped-down timeseries that should be reconciled and some that shouldn't). I'm happy to help you write these tests, so just book some time with me (or Manuel/Louise if she's back) when you're ready to do it.

ketilt (Collaborator) commented Feb 28, 2025

I'll focus first on reconciliation of KDVH data. I'd also like to understand better the need for reconciliation between obsinn and kvalobs data. Is it needed for edge cases, or generally for all timeseries?

I know some incoming data never passes through kvalobs. So there is no MET label equivalent and there is no need for reconciliation.

Some incoming data is tagged with external station identifiers like wmoid, wigosid, call sign, etc. This could be reconciled, and we never did that in ODA.

For the rest, the mapping appears trivial. Except I vaguely remember some discussions about sensor=0. Is this what breaks the 1-to-1 mapping between obsinn and MET/kvalobs labels, making reconciliation necessary?

intarga (Member, Author) commented Feb 28, 2025

Is it needed for edge cases, or generally for all timeseries?

Generally for all. At the moment we don't do any reconciliation, so any canonical timeseries that straddles the starting point of obsinn ingestion and the stopping point of migration will currently be split across two database timeseries.

Some incoming data is tagged with external station identifiers like wmoid, wigosid, call sign, etc. This could be reconciled, and we never did that in ODA.

It's up to the content managers how far to take this, and I guess you're their representative here 🙂. I would be happy to mark this issue closed with just KDVH/Kvalobs/Obsinn reconciliation though, and leave further reconciliation for a follow-on issue.

Except I vaguely remember some discussions about sensor=0. Is this what breaks the 1-to-1 mapping between obsinn and MET/kvalobs labels, making reconciliation necessary?

Here's my recollection of the issue:

When Obsinn sends data, it specifies parameters in the header like PARAMNAME(sensor, level), e.g. TX(1,1). However, Obsinn can and does regularly omit the part in parentheses, in which case they are to be treated as default. In our database when we receive a message like this, we leave the sensor and level as NULL. This is different from Kvalobs' apparent behaviour, where it fills in all these missing values, causing a discrepancy between what is migrated from Kvalobs vs what we ingest from Obsinn. This leaves some lingering questions that we should probably answer in resolving this issue:

  1. Does Kvalobs apply a consistent default value (I think we've noticed it being 0) or can it vary?
  2. Can Obsinn send messages with explicit sensor/level that match the Kvalobs defaults? If so, should this be treated identically to when they aren't specified, or is it meaningfully different?
  3. What is the ideal canonical representation to use going forward? Our NULLs or Kvalobs' defaults?

ketilt (Collaborator) commented Feb 28, 2025

My hypothesis is that NULL = 0 always, but I will check with Søren and Børge to make sure.

ketilt (Collaborator) commented Feb 28, 2025

Outline of reconciliation analysis for KDVH data:

  1. Predefine a relevant typeid, or loop over all typeids that were created to match KDVH data tables
  2. Predefine a set of stationids, or loop over all unique stationids in timeseries labels matching a given typeid.
  3. Predefine a set of paramids, or loop over all unique paramids in timeseries labels matching this typeid and station.
  4. Define a set timeseriesKDVH containing all the timeseries labels matching the given typeid, stationid and paramid
  5. Define a set timeseriesMET containing all the timeseries labels matching the same stationid and paramid, but any other typeid (a query sketch for steps 4–5 follows at the end of this comment)
  6. For each timeseriesKDVH, stream chronologically through its data and all the timeseriesMET data series, and check for matches
    • Keep track of start/end times of matches and mismatches.
    • If values in timeseriesKDVH exactly match multiple candidates, prefer the lowest level and sensor values, and the typeid of the latest match.
    • After checking all data values, do one of the following:
      • If 100% match with single MET label: mark KDVH timeseries for deletion (and add alias?)
      • If 100% match but not always with the same MET label: mark KDVH timeseries for deletion (and store the match summary)
      • If 100% match with single MET label only from a given time until today: write the remainder of the historical data into the MET timeseries, then mark KDVH timeseries for deletion (and add alias?)
      • If 100% match with multiple MET labels and only from a given time: log and request a user decision through a config file. When relevant, pre-fill typeid suggestions based on the expected valid times of the different typeids.
      • If 0% match: add to log of timeseries that need relabeling
      • Other cases: store the match summary (maximum 100 entries to avoid bloat from edge cases) and do nothing.

This will ignore cases of data existing in timeseriesMET that does not exist in timeseriesKDVH. This can be intentional and due to param filter settings, and I do not think it will be necessary to document such cases with this algorithm.
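A rough query sketch for steps 4–5, assuming a labels.kdvh table keyed on timeseries and placeholder column names (station_id, param_id, type_id) on labels.met:

-- For one station/param, pair each KDVH-sourced timeseries with every
-- candidate timeseries carrying any other typeid.
SELECT k.timeseries AS kdvh_ts,
       k.type_id    AS kdvh_typeid,
       m.timeseries AS met_ts,
       m.type_id    AS met_typeid
FROM labels.met k
JOIN labels.kdvh lk ON lk.timeseries = k.timeseries
JOIN labels.met m
  ON m.station_id = k.station_id
 AND m.param_id   = k.param_id
 AND m.type_id   <> k.type_id
WHERE k.station_id = $1
  AND k.param_id   = $2;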

Lun4m (Collaborator) commented Feb 28, 2025

To track how the migrations are going, I have defined separate labels for KDVH and Kvalobs timeseries (in a separate branch for now). During the migration we insert into both labels.met and whichever of the other two tables applies, depending on what we are migrating. I think they will also be helpful for steps 1–5 here, but let me know if you have any suggestions on how to improve them for your purposes.

ketilt (Collaborator) commented Mar 3, 2025

@Lun4m Great, that will be nice to have. Are labels.met entries always mandatory, or could a timeseries have only a labels.kdvh entry?

intarga (Member, Author) commented Mar 3, 2025

I guess not technically mandatory, but we want every timeseries to have one, else it won't be visible to APIs like Frost. We set looser unique constraints on labels.met compared to the source labels so that we could handle duplicates from different sources that haven't been reconciled yet.

ketilt (Collaborator) commented Mar 3, 2025

My hypothesis is that NULL = 0 always, but I will check with Søren and Børge to make sure.

According to Søren, level=0 and sensor=0 are always implied when omitted. So the canonical representation, if you want to stick to that, would be to drop the NULLs and have 0 be the default.
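If we go that way, a one-off backfill on labels.met could look roughly like this (lvl/sensor are placeholder column names; any unique constraint on labels.met would either have to tolerate the resulting collisions, or those collisions simply become reconciliation candidates):

UPDATE labels.met
SET lvl    = COALESCE(lvl, 0),
    sensor = COALESCE(sensor, 0)
WHERE lvl IS NULL
   OR sensor IS NULL;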

intarga (Member, Author) commented Mar 4, 2025

Excellent, let's roll that into the changes we make resolving this issue

ketilt (Collaborator) commented Mar 4, 2025

Now that we know that, does that enable ingesting obsinn data into labels.met from the start?

I know there used to be data going through obsinn that (intentionally) did not have paramids, such as status parameters. I'm not sure if these data have paramids today, but I wouldn't be surprised if obsinn contains data that cannot be given a met label. These data are only intended to be used internally anyway.

For the rest, if they can be and are given a met label, I suspect we'd remove the need for large-scale reconciliation of those sources?

intarga (Member, Author) commented Mar 4, 2025

Obsinn data does already receive a met label from the start, though this should still make reconciliation easier, as reconciliation candidates will have identical met labels.

I know there used to be data going through obsinn that (intentionally) did not have paramids, such as status parameters. I'm not sure if these data have paramids today, but I wouldn't be surprised if obsinn contains data that cannot be given a met label. These data are only intended to be used internally anyway.

I don't think we've seen this in the ingestor @Lun4m?

For the rest, if they can be and are given a met label, I suspect we'd remove the need for large-scale reconciliation of those sources?

As stated above, we will still need to reconcile them (though the reconciliation will be more straightforward), as they will still have separate timeseries IDs just with identical met labels

Lun4m (Collaborator) commented Mar 4, 2025

Yes, there is some metadata coming in as parameters (whose names start with ##). I guess we haven't decided yet how to treat them; it should be easy to filter them out during ingestion, but maybe it's worth keeping them?

Anyway there might be cases where a param code does not have a corresponding param ID, but that would only lead to a NULL param ID in the met label, until a param ID is assigned in Stinfosys (?).

intarga (Member, Author) commented Mar 4, 2025

I don't know the answer to this, but I agree keeping them seems like the safest option

ketilt (Collaborator) commented Mar 4, 2025

Ah yes, I had forgotten that the reconciliation was only theoretically limited to the level/sensor omission cases. That's great.

So how is this duplication inevitable? Is it a relic of migration? Or is there some way that obsinn and kvalobs ingestors keep track of labels they created themselves, but not labels that the other ingestor created?

intarga (Member, Author) commented Mar 4, 2025

Whenever we get data from a source, we check its source specific label for a match, if none exists, we create a new source specific label, along with a timeseries entry and met label.

In the above process, we could check for and reuse any matching met label and timeseries entry, and just overwrite any conflicting data. However, different data sources can have conflicting opinions on the data (for example, KDVH only keeps "corrected" data, when we would prefer the original), and doing this would result in the last data source to write getting priority. I don't think this is ideal because:

  • We should decide an explicit priority order between data sources, rather than just giving the last write priority
  • We should have a record of which time ranges in a timeseries are from which data source
  • I think this reconciliation process warrants at least some oversight from a content manager

ketilt (Collaborator) commented Mar 4, 2025

I fully agree for KDVH, which handled data in a very different way. That's why I once added the column CorrKDVH in ODA, to indicate that that is all we know. My reconciliation plan for KDVH is to untangle its lossiness and deduce the de facto time ranges that were chosen for some reason. In this way, I consider every typeid as a separate source.

For kvalobs, initially I would consider it to simply supply corrected to obsinn's original. That is kvalobs' purpose. But I know kvalobs can also decide to replace original with one of its missing data codes. I would agree this is unfortunate for provenance, while at the same time important to know and register.

And yet, if we are going to reconcile all of the data from the two sources anyway, I would hope we can do that preemptively and consistently rather than requiring continual oversight and management. Either by simply ignoring kvalobs' rewrites of original, or by having a separate column for kvalobs' original rewrites. The latter would leave further reconciliation for downstream, but we could use the priority order then as well. I would not be surprised if you have attempted to bring this up, and that people haven't been able to agree on the correct way?

Or maybe this is just a deeper design philosophy of LARD? Will ROVE also be considered a separate source with separate yet identical labels?

intarga (Member, Author) commented Mar 4, 2025

rather than requiring continual oversight

We'll only migrate once, so in theory this should only have to be done once.

I would hope we can do that preemptively and consistently

I would hope so too, in which case the experience of a content manager using the reconciliation tool would be a simple matter of repeatedly clicking "Ok" on unambiguous reconciliation candidates.

But I know kvalobs can also decide to replace original with one of its missing data codes

This is actually news to me; I did not realise kvalobs did this. @Lun4m, is this taken into account in the migrations?

simply ignoring kvalobs' rewrites of original

The fact I didn't know about this illustrates my point. It's easy to litigate the conflicts we expect; I'm more concerned about the ones we don't. Wouldn't you like to have a human in the loop in case of that?

ROVE also be considered a separate source

Rove will not mutate data at all; its flags and corrections will be strictly separate. I suppose you could consider that a design philosophy

Lun4m (Collaborator) commented Mar 4, 2025

If we are talking about the -32767 and -32766 values (are there more?), yes, they are replaced with NULLs when importing the kvalobs dumps. But I also thought they were used in place of missing observations, not that they were replacing actual observations! Now that I'm checking, I see I'm not doing the same for KDVH. @ketilt, can these values also end up in KDVH?
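For illustration, the substitution amounts to something like this (it actually happens in the migration code while reading the dumps, and the staging table name here is made up):

-- Map the kvalobs missing-data codes to NULL before inserting into LARD.
SELECT NULLIF(NULLIF(original, -32767), -32766) AS original
FROM kvalobs_dump_staging;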

ketilt (Collaborator) commented Mar 4, 2025

@intarga I see, so this is mainly a migration thing. Will LARD not consume kvalobs data after migration, then?

And I might be mistaken about the rewriting of original; I will check with Pål Sannes. The wording "original verdi er forkastet" ("the original value is discarded"), which is used a lot in the flag documentation, could simply mean that a flag is set, not that the value is actually replaced.

@Lun4m KDVH converts these codes into NULLs

intarga (Member, Author) commented Mar 4, 2025

Will LARD not consume kvalobs data after migration, then?

Not to the main data table at least, which is where we're doing reconciliation.

Since Vegar wants us to go into beta serving kvalobs QC, we will ingest the Kafka queue for that, but only into a separate table, which will eventually be dropped when confident goes into production. In that table whatever kvalobs says is considered authoritative

ketilt (Collaborator) commented Mar 4, 2025

And you can scratch what I said about kvalobs replacing original. That is designed to never happen, and any "forkastet" (discarded) value is simply indicated with a flag.

Given this, can we falsify the following hypothesis?

  1. Obsinn will only set original
  2. Kvalobs will only set corrected, controlinfo, useinfo and cfailed
  3. Thus there will never be collisions between the sources

I know very little about obsinn, so if it actually manages corrected, controlinfo or useinfo, it would make more sense to me to keep data separate.

If the hypothesis holds, I think I would rather keep humans out of the loop, to avoid introducing human errors into the mix. We should also document somehow the arbitrary choices the human makes, and I don't feel that would be as transparent.

If we do end up with real duplication, I'm curious whether we aim to reconcile and delete the duplicate, or reconcile and store the reconciled version separately and persistently. Depending on our aim, this might be relevant to combine with the climate filter in DROPS.

intarga (Member, Author) commented Mar 4, 2025

The original value is the only thing we're trying to reconcile

Given this, can we falsify the following hypothesis?

This hypothesis is equivalent to saying the priority order for data sources is always obsinn > kvalobs > kdvh, which seems reasonable, but I still think a content manager should at least look at the conflicts between the data sources to see if there's something we missed?

We should also document somehow the arbitrary choices the human makes, and I don't feel that would be as transparent.

If done via an API, it's fairly trivial to log all the choices and the information relevant to them.

ketilt (Collaborator) commented Mar 5, 2025

The original value is the only thing we're trying to reconcile

That simplifies things.

I would say it's trivially always obsinn > kvalobs for original, as there should be no conflicts there.

For KDVH it's a whole different matter and it definitely needs some human interaction for unambiguous cases, just like we need for the climate filter.

If done via an API, it's fairly trivial to log all the choices and the information relevant to them.

We can require a description of why the user chose as they did, but I expect that people will put rubbish in that text field and that the choice will remain a mystery for posterity, which is a frustrating thought to me because we cannot reproduce the conclusion.

intarga (Member, Author) commented Mar 5, 2025

We can require a description of why the user chose as they did, but I expect that people will put rubbish in that text field and that the choice will remain a mystery for posterity, which is a frustrating thought to me because we cannot reproduce the conclusion.

"people" here is I guess you and Elinah, so I hope you can be disciplined about it. Ultimately responsibility for the correctness of the content lies with the content managers, so it would be you two on the hook if you make a wrong or poorly documented call. If you're really that worried though I guess you could implement a review system so you need sign off from 2 content managers to make a change.

Anyway, do you feel that you have what you need to get started on this?

ketilt (Collaborator) commented Mar 5, 2025

Yes, you can leave the responsibility for the content integrity to us.

I think I have what I need. I will plan a reconciliation for KDVH as specified, and for kvalobs/obsinn I would use the priority order and only ever write original if obsinn does not supply one. When and if I see colliding cases, I'll iterate on the plan and algorithm.

As for timing, I've timeboxed some hours next week to work towards a code draft of the KDVH reconciliator. Does it need to be in Rust? Getting on board with that will of course take some time.

intarga (Member, Author) commented Mar 5, 2025

It should be in Rust, yes, but don't worry, we can help with that. To start with, feel free to write pseudo-code, as I think the queries are the most important thing. Sometime this week or next I'll set up a PR with an integration test for you.

intarga (Member, Author) commented Mar 6, 2025

@ketilt I've set up the test on this branch: #76

You can find the test at the bottom of integration_tests/tests/end_to_end.rs, with examples of how to load fake data, make queries, assert, and print things. You can run the test with just test_reconciliation from the project root.

Also useful information would be to look at the schema files in the db directory, to see how the relevant tables fit together.

Dependencies you'll need to get going should be:

ketilt (Collaborator) commented Mar 14, 2025

I realized one issue concerning KDVH reconciliation:

  • The (relevant) KDVH data already fits into the met label, through the typeids given to each of its tables.
  • This has then been used in the climate filters to ensure that KDVH data tables get priority over kvalobs before 2006.
  • So if I reconcile the duplicates in favor of the kvalobs typeids, this will impact the results of the climate filter, potentially giving users a lower quality excerpt of the data.

Therefore, I will plan to output only diagnostics from this reconciliation. A reconciliation of KDVH/kvalobs must be coordinated with improvements in the climate filter. The new climate filter is not yet as effective as the old one, and I think more work needs to be done to export information to the new filter, before we are ready to consider the effects of this reconciliation. Having these diagnostics will be helpful.

Then there is obsinn/kvalobs migration reconciliation. Looking at the LARD code, I see only a single obsvalue column. If there is no distinction between original and corrected in LARD, then I am not sure there is any need to persistently reconcile these sources. I suppose only a few users will be interested in selecting both of these versions. For the downstream climate filter, we'd need to do some kind of reconciliation (on the fly?) based on selecting the kvalobs corrected version whenever available. So we can potentially push all of the reconciliation needs to the climate filter. I am intuitively worried about the overhead, though it won't be worse than what ODA has been doing.

Lun4m (Collaborator) commented Mar 17, 2025

I set up the migrations so that all data before 2006 is from KDVH. After 2006 it's mostly from kvalobs, plus data from T_EDATA and T_METARDATA (same as your migration script, which I just blindly followed).

Regarding original/corrected, I don't think we have landed on a solution for confident yet, but for the time being we are still going to be using kvalobs, so you can find the corrected values in flags.kvadata (soon to be legacy.data)

ketilt (Collaborator) commented Mar 20, 2025

I can imagine a situation where users could retrieve the full reconciled data in this manner:

SELECT d.obstime, d.obsvalue, d.qc_usable, f.corrected, f.controlinfo, f.useinfo, f.cfailed
FROM public.data d
LEFT JOIN legacy.data f
  ON d.timeseries = f.timeseries
 AND d.obstime = f.obstime
WHERE d.timeseries = <timeseriesid>

Here, the tsid from the kvalobs-migrated legacy.data is the same as that of the obsinn-ingested public.data. So the reconciliation starts from this situation of matching labels.met entries:

labels.met & labels.obsinn --> public.timeseries --> public.data (from obsinn)
labels.met & labels.kvalobs --> public.timeseries --> legacy.data (from kvalobs)

And ended up with this situation

labels.met & labels.obsinn & labels.kvalobs --> public.timeseries --> public.data & legacy.data

Normally, legacy.data.original will be redundant. However, if cases exist where kvalobs has data missing in obsinn, it would be good to keep the column. It can be written to public.data.obsvalue during reconciliation. After a full reconciliation, legacy.data.original could be removed.
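A sketch of that write, assuming the timeseries ids already line up as described above (column names taken from the query earlier in this comment):

-- Copy originals that exist in legacy.data but have no counterpart row in
-- public.data. Sketch only; it would run under whatever oversight we agree on.
INSERT INTO public.data (timeseries, obstime, obsvalue)
SELECT f.timeseries, f.obstime, f.original
FROM legacy.data f
WHERE f.original IS NOT NULL
  AND NOT EXISTS (
    SELECT 1
    FROM public.data d
    WHERE d.timeseries = f.timeseries
      AND d.obstime = f.obstime
  );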

Overlap between KDVH and other sources will not occur, due to the unique typeids chosen for KDVH tables. Reconciling these (not involving the 6-step algorithm I wrote above) should be limited to asserting this fact and reporting any anomalies. The only anomaly I would expect is a double migration of the same data, which can be resolved by a simple deletion of the duplicate.

I'll write out some queries to find and resolve candidates. Let me know if you see a flaw in the logic above

intarga (Member, Author) commented Mar 20, 2025

You should not consider joins between legacy.data and public.data in end-user queries as they are not intended to coexist.

For now we are using legacy.data because we don’t have confident QC. When confident is production ready, we will entirely drop legacy.data.

ketilt (Collaborator) commented Mar 20, 2025

If all corrected values will be dropped, then it sounds to me like the only thing you really need from kvalobs is these theoretical original values that do not exist in obsinn. I do not see a way to combine corrected and original without destroying the integrity of the raw data.

intarga (Member, Author) commented Mar 20, 2025

Yes, as stated previously, the plan is to drop kvalobs’ corrected entirely. Once confident is in production, all QC (including corrections) should come from confident

intarga (Member, Author) commented Mar 20, 2025

In the short term, you want to reconcile between KDVH and kvalobs in legacy.data, and in the longer term, between the originals in legacy.data and Obsinn (I think we’ve established that in this case Obsinn takes total priority, which is pretty straightforward)

ketilt (Collaborator) commented Mar 20, 2025

I understand. This should be very straightforward, then.

I would start by asserting that the migrated data from KDVH and kvalobs have no conflicts, because of the non-overlapping typeid ranges they are given. A conflict would indicate a stranger issue than content management, but would likely be easy to resolve as a one-off.
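As a sketch, that assertion could be a query along these lines, which should return no rows if the typeid separation holds (the labels.met column names are placeholders on my part):

-- Look for the same station/param/obstime carrying data from both a
-- KDVH-labelled and a kvalobs-labelled timeseries; any hit is an anomaly.
SELECT mk.station_id, mk.param_id, dk.obstime
FROM labels.kdvh lk
JOIN labels.met mk ON mk.timeseries = lk.timeseries
JOIN public.data dk ON dk.timeseries = lk.timeseries
JOIN labels.met mq ON mq.station_id = mk.station_id
                  AND mq.param_id = mk.param_id
JOIN labels.kvalobs lq ON lq.timeseries = mq.timeseries
JOIN public.data dq ON dq.timeseries = lq.timeseries
                   AND dq.obstime = dk.obstime;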

Later, there is the issue of transferring unique data from KDVH from legacy.data to public.data. Rumor has it that old, historical data was digitized directly into KDVH. This data may or may not be corrected using histkvalobs.

Will there be a place at all in LARD for imputed values?

intarga (Member, Author) commented Mar 20, 2025

Later, there is the issue of transferring unique data from KDVH from legacy.data to public.data. Rumor has it that old, historical data was digitized directly into KDVH. This data may or may not be corrected using histkvalobs.

Will there be a place at all in LARD for imputed values?

I'm not sure I understand what you're saying here. My naive plan is to just take the original column from legacy.data which I assume covers this digitised data? Are you saying there are some cases where we should consider taking the corrected instead? I can see an argument if there are particular corrections Confident can't reproduce. I trust your judgment on that.

We will try and migrate HQC corrections we can't reproduce as part of Confident, but I think that can be considered a separate issue. And they won't be migrated to public.data.obsvalue, but a different column, perhaps even in a different table.

ketilt (Collaborator) commented Mar 21, 2025

Since KDVH does not distinguish between original and corrected in the data tables that are actively used, we cannot tell what is what. It uses only corrected from kvalobs, while before the kvalobs era the data may be regarded as original unless modified.

I don't know if that matters at all though. Since the KDVH data lives separately from the rest of the data, we will always be able to identify it. And it is eventually handled in the climate filter, so we get the intended result.

I would only really expect corrections on the product timeseries based on digitized data. Since we retain the raw data, reproducibility should be possible (by using the bespoke equations and coefficients involved in deducing ancient diurnal values). So when I think more about it, I don't see anything very worrying about taking the value from KDVH into public.data.obsvalue.
