Timeseries reconciliation #50
I could probably do this. Would this make more sense as an integral part of LARD, or located somewhere else in the architecture?
I think it makes sense to do this in the ingestor (though isolated enough that we can break it out later), since that already runs an HTTP server and has a database connection pool. I suggest breaking down the task like so:
Please reach out for help at any point; I imagine that tasks 4 and 5 in particular will go a lot more smoothly with some help from Manuel or me. We can also probably offer useful insight on the structure of the tables for task 2.
I'll start sketching out the algorithm first. At some point it would be useful to have SQL access to something to run tests. Where should I start towards that goal?
We have this guide (setup steps 0 and 3 are the only relevant ones), but I'll send you a message.
Rather than running queries on the prod postgres, I think it would be better to use the integration tests for this. They set up a local postgres DB with the schema, and you load it up with some test-case data (in this case, I guess, some stripped-down timeseries that should be reconciled and some that shouldn't). I'm happy to help you write these tests, so just book some time with me (or Manuel/Louise if she's back) when you're ready to do it.
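For illustration, a minimal sketch of what such test-case fixtures might look like, assuming hypothetical table and column names (the real schema lives in the repository's schema files):

```sql
-- Hypothetical fixture data for a reconciliation integration test:
-- two timeseries sharing the same met label (a reconciliation candidate)
-- and one unrelated series that should be left alone.
INSERT INTO public.timeseries (id, fromtime) VALUES
    (1, '2000-01-01T00:00:00Z'),  -- migrated from KDVH
    (2, '2006-01-01T00:00:00Z'),  -- ingested from Obsinn
    (3, '2006-01-01T00:00:00Z');  -- unrelated series

INSERT INTO labels.met (timeseries, station_id, param_id, type_id, lvl, sensor) VALUES
    (1, 18700, 211, 330, 0, 0),
    (2, 18700, 211, 330, 0, 0),  -- identical label => should be reconciled
    (3, 18700, 212, 330, 0, 0);  -- different param => should not
```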
I'll focus first on reconciliation of KDVH data. I'd also like to understand better the need for reconciliation between obsinn and kvalobs data. Is it needed for edge cases, or generally for all timeseries? I know some incoming data never passes through kvalobs, so there is no MET label equivalent and no need for reconciliation. Some incoming data is tagged with external station identifiers like wmoid, wigosid, call sign, etc. This could be reconciled, and we never did that in ODA. For the rest, the mapping appears trivial, except I vaguely remember some discussions about sensor=0. Is this what breaks the 1-to-1 mapping between obsinn and MET/kvalobs labels, making reconciliation necessary?
Generally for all. At the moment we don't do any reconciliation, so any canonical timeseries that straddles the starting point of obsinn ingestion and the stopping point of migration will currently be split across two database timeseries.
It's up to the content managers how far to take this, and I guess you're their representative here 🙂. I would be happy to mark this issue closed with just KDVH/Kvalobs/Obsinn reconciliation though, and leave further reconciliation for a follow-on issue.
Here's my recollection of the issue: when Obsinn sends data, it specifies parameters in the header, where level and sensor can be omitted; when they are omitted, we currently end up storing NULL for them.
My hypothesis is that NULL = 0 always. But I will check with Søren and Børge until I am sure of this.
Outline of reconciliation analysis for KDVH data:
This will ignore cases of data existing in timeseriesMET that do not exist in timeseriesKDVH. Such cases can be intentional, due to param filter settings, and I do not think it will be necessary to document them with this algorithm.
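As a rough illustration of the kind of diagnostic query such an analysis might boil down to (table and column names here are assumptions, not the real schema):

```sql
-- Hypothetical: for each KDVH-labelled timeseries, count observations that
-- have no counterpart at the same obstime in a matching MET-labelled series.
SELECT k.timeseries      AS kdvh_ts,
       m.timeseries      AS met_ts,
       count(dk.obstime) AS rows_missing_in_met
FROM labels.kdvh k
JOIN labels.met m
  ON m.station_id = k.station_id
 AND m.param_id   = k.param_id
JOIN public.data dk
  ON dk.timeseries = k.timeseries
LEFT JOIN public.data dm
  ON dm.timeseries = m.timeseries
 AND dm.obstime    = dk.obstime
WHERE dm.obstime IS NULL
GROUP BY k.timeseries, m.timeseries;
```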
To track how the migrations are going, I have defined separate labels for KDVH and Kvalobs timeseries (in a separate branch for now). During the migration we insert into both the source-specific label and the met label.
@Lun4m Great, that will be nice to have. Are the labels.met always mandatory, or could timeseries have only labels.kdvh?
I guess it's not technically mandatory, but we want every timeseries to have it, otherwise it won't be visible to APIs like Frost. We set looser unique constraints on it.
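Purely to illustrate what "looser" could mean here, a sketch with hypothetical column names (the actual constraints are defined in the repository's schema files):

```sql
-- Hypothetical: a source-specific label maps to exactly one timeseries ...
ALTER TABLE labels.obsinn
    ADD CONSTRAINT unique_obsinn_label
    UNIQUE (station_id, type_id, param_code, lvl, sensor);

-- ... while the met label has no such uniqueness requirement, so two
-- timeseries (e.g. one migrated, one ingested) can carry identical met labels.
```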
According to Søren, level=0 and sensor=0 are always implied when omitted. So the canonical representation, if you want to stick to that, would be to drop NULL and have 0 be the default.
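A minimal sketch of that normalization, assuming hypothetical column names for the met label table:

```sql
-- Hypothetical: treat omitted level/sensor as 0 rather than NULL ...
UPDATE labels.met SET lvl    = 0 WHERE lvl    IS NULL;
UPDATE labels.met SET sensor = 0 WHERE sensor IS NULL;

-- ... and default new labels to 0 going forward.
ALTER TABLE labels.met ALTER COLUMN lvl    SET DEFAULT 0;
ALTER TABLE labels.met ALTER COLUMN sensor SET DEFAULT 0;
```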
Excellent, let's roll that into the changes we make to resolve this issue.
Now that we know that, does that enable ingesting obsinn data straight into canonical met labels? I know there used to be data going through obsinn that (intentionally) did not have paramids, such as status parameters. I'm not sure if these data have paramids today, but I wouldn't be surprised if obsinn contains data that cannot be given a met label. These data are only intended to be used internally anyway. For the rest, if they can be and are given a met label, I suspect we'd remove the need for large-scale reconciliation of those sources?
Obsinn data does already receive a met label.
I don't think we've seen this in the ingestor, @Lun4m?
As stated above, we will still need to reconcile them (though the reconciliation will be more straightforward), as they will still have separate timeseries IDs, just with identical met labels.
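For illustration, a minimal query that would surface such duplicates, assuming hypothetical column names for the met label table:

```sql
-- Hypothetical: find met labels attached to more than one timeseries.
SELECT station_id, param_id, type_id, lvl, sensor,
       array_agg(timeseries) AS duplicate_timeseries
FROM labels.met
GROUP BY station_id, param_id, type_id, lvl, sensor
HAVING count(*) > 1;
```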
Yes, there is some metadata coming in as parameters (whose names start with a special prefix). Anyway, there might be cases where a param code does not have a corresponding param ID, but that would only lead to a NULL param ID in the met label, until a param ID is assigned in Stinfosys (?).
I don't know the answer to this, but I agree keeping them seems like the safest option.
Ah yes, I had forgotten that the reconciliation was only theoretically limited to the level/sensor omission cases. That's great. So how is this duplication inevitable? Is it a relic of migration? Or is there some way that obsinn and kvalobs ingestors keep track of labels they created themselves, but not labels that the other ingestor created?
Whenever we get data from a source, we check its source-specific label for a match; if none exists, we create a new source-specific label, along with a timeseries entry and met label. In that process, we could check for and reuse any matching met label and timeseries entry, and just overwrite any conflicting data. However, different data sources can have conflicting opinions on the data (for one example, KDVH only keeps "corrected" data, when we would prefer the original), and doing this would result in the last data source to write getting priority. I don't think this is ideal, for a few reasons.
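For context, a rough sketch of that lookup-or-create flow, with hypothetical table and column names (the real ingestor does this in Rust against its own schema):

```sql
-- Hypothetical: look for an existing source-specific (Obsinn) label ...
SELECT timeseries
FROM labels.obsinn
WHERE station_id = $1 AND type_id = $2 AND param_code = $3
  AND lvl = $4 AND sensor = $5;

-- ... and if nothing matched, create a timeseries plus the labels.
WITH ts AS (
    INSERT INTO public.timeseries (fromtime) VALUES ($6) RETURNING id
)
INSERT INTO labels.obsinn (timeseries, station_id, type_id, param_code, lvl, sensor)
SELECT id, $1, $2, $3, $4, $5 FROM ts;
-- (a corresponding row would also be inserted into labels.met)
```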
I fully agree for KDVH, which handled data in a very different way. That's why I once added the column CorrKDVH in ODA, to indicate that that is all we know. My reconciliation plan for KDVH is to untangle its lossiness and deduce the de facto time ranges that were chosen for some reason. In this way, I consider every typeid as a separate source. For kvalobs, initially I would consider it to simply supply corrected values to obsinn's originals. That is kvalobs' purpose. But I know kvalobs can also decide to replace original with one of its missing-data codes. I would agree this is unfortunate for provenance, while at the same time important to know and register. And yet, if we are going to reconcile all of the data from the two sources anyway, I would hope we can do that preemptively and consistently rather than requiring continual oversight and management, either by simply ignoring kvalobs' rewrites of original, or by having a separate column for kvalobs' rewrites of original. The latter would leave further reconciliation for downstream, but we could use the priority order then as well. I would not be surprised if you have attempted to bring this up, and that people haven't been able to agree on the correct way? Or maybe this is just a deeper design philosophy of LARD? Will ROVE also be considered a separate source with separate yet identical labels?
We'll only migrate once, so in theory this should only have to be done once.
I would hope so too, in which case the experience of a content manager using the reconciliation tool would be a simple matter of repeatedly clicking "Ok" on unambiguous reconciliation candidates.
This is actually news to me; I did not realise kvalobs did this. @Lun4m, is this taken into account in the migrations?
The fact I didn't know about this illustrates my point. It's easy to litigate the conflicts we expect; I'm more concerned about the ones we don't. Wouldn't you like to have a human in the loop in case of that?
Rove will not mutate data at all; its flags and corrections will be strictly separate. I suppose you could consider that a design philosophy.
If we are talking about the -32767 and -32766 values (are there more?), yes, they are replaced with NULLs when importing the kvalobs dumps. But I also thought they were used in place of missing observations, not that they were replacing actual observations! Now that I'm checking, I see that I'm not doing the same for KDVH. @ketilt, can these values also end up in KDVH?
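For reference, a minimal sketch of that kind of substitution during import, with the dump table and column names as assumptions:

```sql
-- Hypothetical: map kvalobs' special missing-value codes to NULL on import.
INSERT INTO legacy.data (timeseries, obstime, obsvalue)
SELECT $1,
       obstime,
       CASE WHEN original IN (-32767, -32766) THEN NULL ELSE original END
FROM kvalobs_dump;
```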
@intarga I see, so this is mainly a migration thing. Will LARD not consume kvalobs data after migration, then? And I might be mistaken about the rewriting of original; I will check with Pål Sannes. The wording "original verdi er forkastet" ("original value is discarded"), which is used a lot in the flag documentation, could simply mean that a flag is set, not that the value is actually replaced. @Lun4m KDVH converts these codes into NULLs.
Not to the main data table at least, which is where we're doing reconciliation. Since Vegar wants us to go into beta serving kvalobs QC, we will ingest the Kafka queue for that, but only into a separate table, which will eventually be dropped when confident goes into production. In that table, whatever kvalobs says is considered authoritative.
And you can scratch what I said about kvalobs replacing original. That is designed to never happen, and any "forkastet" (discarded) value is simply indicated with a flag. Given this, can we falsify the following hypothesis?
I know very little about obsinn, so if it actually manages corrected, controlinfo or useinfo, it would make more sense to me to keep the data separate. If the hypothesis holds, I think I would rather keep humans out of the loop, to avoid introducing human errors into the mix. We would also have to somehow document the arbitrary choices the human makes, and I don't feel that would be as transparent. If we do end up with a real duplicate, I'm curious whether we aim to reconcile and delete the duplicate, or reconcile and store the reconciled version separately and persistently. Depending on our aim, this might be relevant to combine with the climate filter in DROPS.
The original value is the only thing we're trying to reconcile
This hypothesis is equivalent to saying the priority order for data sources is always obsinn over kvalobs.
If done via an API, it's fairly trivial to log all the choices and the information relevant to them.
I would say it's trivially always that order. For KDVH it's a whole different matter, and it definitely needs some human interaction even for unambiguous cases, just like we need for the climate filter.
We can require a description of why the user chose as they did, but I expect that people will put rubbish in that text field and that the choice will remain a mystery for posterity, which is a frustrating thought to me because we cannot reproduce the conclusion.
"people" here is I guess you and Elinah, so I hope you can be disciplined about it. Ultimately responsibility for the correctness of the content lies with the content managers, so it would be you two on the hook if you make a wrong or poorly documented call. If you're really that worried though I guess you could implement a review system so you need sign off from 2 content managers to make a change. Anyway, do you feel that you have what you need to get started on this? |
Yes, you can leave the responsibility for the content integrity to us. I think I have what I need. I will plan a reconciliation for KDVH as specified, and for kvalobs/obsinn I would use the priority order and only ever write original if obsinn does not supply one. When and if I see colliding cases, I'll iterate on the plan and algorithm. As for timing, I've timeboxed some hours next week to work towards a code draft of the KDVH reconciliator. Does it need to be in Rust? That will of course take some time to get on board with.
It should be in Rust, yes, but don't worry, we can help with that. To start with, feel free to write pseudo-code, as I think the queries are the most important thing; sometime this week or next I'll set up a PR with an integration test for you.
@ketilt I've set up the test on this branch: #76. You can find the test at the bottom of the integration test file. It is also useful to look at the schema files in the repository to see the structure of the tables. You'll also need a few dependencies to get going.
I realized one issue concerning KDVH reconciliation:
Therefore, I will plan to output only diagnostics from this reconciliation. A reconciliation of KDVH/kvalobs must be coordinated with improvements in the climate filter. The new climate filter is not yet as effective as the old one, and I think more work needs to be done to export information to the new filter before we are ready to consider the effects of this reconciliation. Having these diagnostics will be helpful. Then for obsinn/kvalobs migration reconciliation: looking at the LARD code, I see only a single value column per observation.
I set up the migrations so that all data before 2006 is from KDVH. After 2006 it's mostly from kvalobs, plus data from T_EDATA and T_METARDATA (same as your migration script, which I just blindly followed).
I can imagine a situation where users could retrieve the full reconciled data by querying both public.data and legacy.data for the same timeseries. Here, the tsid from the kvalobs-migrated labels would be re-pointed at the timeseries created by obsinn ingestion. So we would have started with

labels.met & labels.obsinn --> public.timeseries --> public.data (from obsinn)

and ended up with this situation:

labels.met & labels.obsinn & labels.kvalobs --> public.timeseries --> public.data & legacy.data

Overlap between KDVH and other sources will not occur, due to the unique typeids chosen for the KDVH tables. Reconciling these (not involving the 6-step algorithm I wrote above) should be limited to asserting this fact and reporting any anomalies. The only anomaly I would expect is a double migration of the same data, which can be resolved by a simple deletion of the duplicate. I'll write out some queries to find and resolve candidates; a first sketch is below. Let me know if you see a flaw in the logic above.
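As a first pass at those queries, a rough sketch (schema and column names are assumptions):

```sql
-- Hypothetical: find data that was migrated twice, i.e. two timeseries sharing
-- the same met label and holding values at the same obstime.
SELECT m1.timeseries AS ts_a,
       m2.timeseries AS ts_b,
       count(*)      AS overlapping_rows
FROM labels.met m1
JOIN labels.met m2
  ON (m2.station_id, m2.param_id, m2.type_id, m2.lvl, m2.sensor)
   = (m1.station_id, m1.param_id, m1.type_id, m1.lvl, m1.sensor)
 AND m2.timeseries > m1.timeseries
JOIN legacy.data d1 ON d1.timeseries = m1.timeseries
JOIN legacy.data d2 ON d2.timeseries = m2.timeseries
                   AND d2.obstime    = d1.obstime
GROUP BY m1.timeseries, m2.timeseries;

-- Resolving a confirmed duplicate could then be a simple deletion, e.g.:
-- DELETE FROM legacy.data WHERE timeseries = <duplicate tsid>;
```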
You should not consider joins between legacy.data and public.data in end-user queries, as they are not intended to coexist. For now we are using legacy.data because we don't have confident QC. When confident is production ready, we will entirely drop legacy.data.
If all QC eventually comes from confident, does that mean kvalobs' corrected values will be dropped entirely?
Yes, as stated previously, the plan is to drop kvalobs' corrected entirely. Once confident is in production, all QC (including corrections) should come from confident.
In the short term, you want to reconcile between KDVH and kvalobs in legacy.data, and in the longer term, between the originals in legacy.data and Obsinn (I think we've established that in this case Obsinn takes total priority, which is pretty straightforward).
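As a sketch of what "Obsinn takes total priority" could mean in practice, assuming hypothetical table and column names: copy legacy originals into the main data table only for obstimes Obsinn never covered.

```sql
-- Hypothetical merge: keep the Obsinn-ingested value wherever both sources
-- have data, and backfill legacy originals only where Obsinn has nothing.
INSERT INTO public.data (timeseries, obstime, obsvalue)
SELECT l.timeseries, l.obstime, l.obsvalue
FROM legacy.data l
WHERE NOT EXISTS (
    SELECT 1
    FROM public.data p
    WHERE p.timeseries = l.timeseries
      AND p.obstime    = l.obstime
);
```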
I understand. This should be very straightforward then. I would start by asserting that the migrated data from KDVH and kvalobs have no conflicts, because of the non-overlapping typeid ranges they are given. A conflict would indicate a stranger issue than content management, but would likely be easy to resolve as a one-off. Later, there is the issue of transferring unique data from certain other KDVH tables. Will there be a place at all in LARD for imputed values?
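A rough sketch of that assertion, with the label tables and column names as placeholder assumptions:

```sql
-- Hypothetical check 1: no single timeseries should carry both a KDVH and a
-- kvalobs source label.
SELECT k.timeseries
FROM labels.kdvh k
JOIN labels.kvalobs q ON q.timeseries = k.timeseries;

-- Hypothetical check 2: no met label should be claimed by both a KDVH-migrated
-- and a kvalobs-migrated timeseries (the non-overlapping typeid ranges should
-- guarantee this). Zero rows from both queries means the assertion holds.
SELECT mk.station_id, mk.param_id, mk.type_id, mk.lvl, mk.sensor
FROM labels.met mk
JOIN labels.kdvh k ON k.timeseries = mk.timeseries
JOIN labels.met mq
  ON (mq.station_id, mq.param_id, mq.type_id, mq.lvl, mq.sensor)
   = (mk.station_id, mk.param_id, mk.type_id, mk.lvl, mk.sensor)
JOIN labels.kvalobs q ON q.timeseries = mq.timeseries;
```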
I'm not sure I understand what you're saying here. My naive plan is to just take the original column from kvalobs. We will try to migrate HQC corrections we can't reproduce as part of Confident, but I think that can be considered a separate issue. And they won't be migrated to public.data.obsvalue, but to a different column, perhaps even in a different table.
Since KDVH does not distinguish between original and corrected in the data tables that are actively used, we cannot tell what is what. It uses only corrected from kvalobs, while before the kvalobs era the data may be regarded as original unless modified. I don't know if that matters at all, though. Since the KDVH data lives separately from the rest of the data, we will always be able to identify it. And it is eventually handled in the climate filter, so we get the intended result. I would only really expect corrections on the product timeseries based on digitized data. Since we retain the raw data, reproducibility should be possible (by using the bespoke equations and coefficients involved in deducing ancient diurnal values). So when I think more about it, I don't see anything very worrying about taking the value from KDVH into obsvalue.
Make a script to help content managers find and combine duplicate timeseries from different sources