Our decision not to persist features limits the flexibility and reproducibility of the system. Triage is designed for batch processing, which means we could follow functional data engineering principles and store batch-partitioned feature and label data in partitioned Postgres tables, Redshift, or HDFS. This would make flexible re-testing of models on different label time periods much easier, since matrices could be constructed on the fly at evaluation time without rebuilding features and labels, and it would also make Rayid's preferred solution for #378 simpler to implement.
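As a rough sketch of what on-the-fly matrix construction could look like, assuming hypothetical partitioned tables named `features_<batch_id>` and `labels_<batch_id>` with an `outcome` column (none of this matches triage's current schema):

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///triage")  # placeholder DSN

def build_matrix(batch_id: str, as_of_date: str, label_timespan: str) -> pd.DataFrame:
    """Assemble a design matrix at evaluation time by joining persisted,
    batch-partitioned feature and label tables, instead of regenerating
    features and labels from scratch."""
    query = text(f"""
        SELECT f.*, l.outcome
        FROM features_{batch_id} f
        JOIN labels_{batch_id} l USING (entity_id, as_of_date)
        WHERE f.as_of_date = :as_of
          AND l.label_timespan = :timespan
    """)
    with engine.connect() as conn:
        return pd.read_sql(
            query, conn,
            params={"as_of": as_of_date, "timespan": label_timespan},
        )
```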
Connecting this to #368, if we versioned features on the hash of query logic, aggregation function, aggregation time period, imputation method, etc., we would be able to track how changes in feature definitions between experiments shifted the distributions of features as well as monitor how feature distributions for the same feature definitions change over time (and throw warnings or errors if, e.g., variance on a feature dropped dramatically between batches). Currently, from_obj logic changes are hidden because they affect the experiment hash but not the feature names.
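A minimal sketch of hashing a feature definition and monitoring it across batches; the definition fields and the `check_variance` helper are made up for illustration, not triage's actual config schema:

```python
import hashlib
import json
import warnings

def feature_version(definition: dict) -> str:
    """Hash every component that changes a feature's semantics (including
    the from_obj) so that from_obj edits show up in feature versioning
    rather than hiding in the experiment hash."""
    canonical = json.dumps(definition, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

version = feature_version({
    "from_obj": "semantic.events",
    "aggregate": "count",
    "interval": "6 months",
    "imputation": "zero_noflag",
})

def check_variance(version: str, prev_var: float, curr_var: float, tol: float = 0.5):
    """Warn when variance for the same feature version drops sharply
    between batches (0.5 is an arbitrary illustrative threshold)."""
    if curr_var < prev_var * tol:
        warnings.warn(
            f"variance for feature {version} dropped from "
            f"{prev_var:.3g} to {curr_var:.3g} between batches"
        )

check_variance(version, prev_var=1.2, curr_var=0.3)  # would warn
```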
There are some complications to this approach based on how the group + triage typically operate. Data are received and processed in batches from partners, but the definition of a batch in triage is more closely tied to the experiment and experiment run. Storing all experiments or experiment runs as new batches is likely overly redundant. If you change the label definition, do you really need to create a new batch for all of the features? No, but if you rerun the same experiment on new source data, you will. We could consider the hash of the experiment components (e.g., label definition) in a batch definition, but triage has no good way of knowing what batch the source data are at, so it would not have a good basis for knowing when to create a new batch for the same configuration.
A couple of alternatives for this:
- Triage gains some way of reading the batch version of the source data and smartly updates its batches when cohort, label, etc. definitions change (only ever adding to existing batches). We currently do this for at least one project with the record linkage timestamp `user_metadata` key, but figuring out how to generalize that to different methods of versioning source batches is harder.
- A triage "batch" incorporates everything but the learner grid (including the run time), accepting that some batches will presumably be redundant.
In either case, `batch_id` is added as metadata to `experiment_runs`, and a `batch_metadata` table is introduced, potentially subsuming some of the concepts from `experiment_runs` and/or `experiments`; one possible schema is sketched below.
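For concreteness, here is one possible shape for that table, expressed with SQLAlchemy; the column set is an assumption about what a batch would need to record, not a worked-out design:

```python
from sqlalchemy import JSON, Column, DateTime, MetaData, Table, Text

metadata = MetaData(schema="triage_metadata")

batch_metadata = Table(
    "batch_metadata", metadata,
    Column("batch_id", Text, primary_key=True),   # hash of the components below
    Column("source_data_version", Text),          # e.g. the record linkage timestamp
    Column("cohort_hash", Text),
    Column("label_hash", Text),
    Column("feature_hash", Text),
    Column("created_at", DateTime),
    Column("config", JSON),                       # full resolved config, for audit
)
# experiment_runs would then carry a batch_id column referencing this table.
```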
What happens to the `replace` flag under this paradigm? `replace` indicates that there was an upstream error in the batch process (e.g., an error in cleaning, or PII leakage) and that the entire batch (features, labels) and all of its dependencies (models, evaluations) should be replaced. This is the only time that data should be dropped or updated.
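A minimal sketch of what a batch-scoped `replace` could look like, assuming hypothetical tables that all carry a `batch_id` column:

```python
from sqlalchemy import create_engine, text

engine = create_engine("postgresql:///triage")  # placeholder DSN

def replace_batch(batch_id: str) -> None:
    """The one sanctioned path for dropping data: wipe the corrupted batch
    (features, labels) and everything downstream of it (models, evaluations)
    in a single transaction, then rebuild from the corrected source."""
    downstream_first = ("evaluations", "models", "labels", "features")
    with engine.begin() as conn:
        for table in downstream_first:
            conn.execute(
                text(f"DELETE FROM {table} WHERE batch_id = :b"),
                {"b": batch_id},
            )
    # ...then re-run feature/label generation and model training for this batch.
```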
Chiming in from beyond the DSSG alum past, but one thing I have done to solve this pattern in my work is to rely on [Ibis](https://ibis-project.org/), which is basically: what if SQLAlchemy, but pandas/data focused? It supports all manner of backends (including in-memory pandas DataFrames) and scales to Redshift, Postgres, BigQuery, etc.
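For anyone unfamiliar, a tiny sketch of the portability being described; exact connection APIs vary by Ibis version, but the expression itself stays backend-agnostic:

```python
import ibis
import pandas as pd

events_df = pd.DataFrame({"entity_id": [1, 1, 2], "amount": [10.0, 5.0, 7.0]})

# Connect to an in-memory pandas backend; swapping this line for, e.g.,
# ibis.postgres.connect(...) leaves the expression below unchanged.
con = ibis.pandas.connect({"events": events_df})
events = con.table("events")

expr = events.group_by("entity_id").aggregate(total_amount=events.amount.sum())
print(expr.execute())  # a pandas DataFrame, computed on whichever backend
```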