For a complicated set of reasons, it's currently quite annoying to rerun comps for an existing model, something that we've wanted to do for each of the past two final models. Let me explain the reasons in detail:
Currently, users can only compute comps in the `interpret` stage as part of a model pipeline run. It is also extremely resource-intensive to compute comps, meaning we can generally only run it on remote infrastructure like AWS Batch unless we're running on a small subset of data. These two constraints on the comps calculation process combine to create a situation in which we have to run an entirely new model, generating new values and a new run ID in the process, if we want to regenerate comps for a final model.

This situation is particularly problematic given that model values are not fully deterministic (#373), so we can't just use a comps run in place of a final model run when publishing comps; instead, we need to take care to use a final model run for publishing values, while using a subsequent comps run for publishing comps. This creates unnecessary complexity in the process of publishing comps, and it also feels counterintuitive, since we're not actually changing anything related to the model structure itself, just rerunning comps.
I propose that we refactor our comps process and data model in order to make it possible for users to run remote jobs to recalculate comps for existing models.
High-level steps for this refactor include:
- Tweak the comps data model to allow for multiple runs using the same model run, since `model.comp` is currently unique by `(run_id, pin, card)` but the new approach will allow multiple runs per run ID
  - I think the easiest solution is probably adding an incrementing `version` column, such that subsequent runs using the same run ID increment the `version` field, and the row with the highest `version` is the published comp; if we do this we should migrate existing comps data to add `version = 1` for all existing comps (see the first SQL sketch after this list)
- Update the `interpret` pipeline stage to save comps using the new data model
- Update code that consumes comps (like PINVAL and the 2-PIN comparison doc) to handle the new data model, which we could do two ways:
  - Tweak the consuming code to query the most recent `version` for all comps
  - Switch consuming code to point to a new view called something like `model.vw_comp` that performs the version filtering logic and thereby maintains uniqueness by `(run_id, pin, card)` (see the second SQL sketch after this list)
    - This is probably the better path forward
- Make a new GitHub workflow to run comps based on an existing model run ID
  - Overall workflow logic: Pull model artifacts for the supplied run ID, then run `dvc repro interpret` to run the `interpret` stage with `comp_enable=True`, and write a set of comps with an incremented `version` (see the third SQL sketch after this list)
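To make the data model change concrete, here's a minimal SQL sketch of the migration step, assuming a warehouse that supports `ALTER TABLE` and `UPDATE` (engines without in-place updates would rewrite the table instead); everything here other than `model.comp` and `version` is illustrative:

```sql
-- Minimal migration sketch: add an incrementing version column and
-- backfill version = 1 for all existing comps, so that rows become
-- unique by (run_id, pin, card, version)
ALTER TABLE model.comp ADD COLUMN version INTEGER;

-- Engines without UPDATE support would instead rewrite the table
-- (e.g. via a CTAS that selects 1 AS version for existing rows)
UPDATE model.comp
SET version = 1
WHERE version IS NULL;
```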
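Similarly, the hypothetical `model.vw_comp` view could implement the version filtering with a window function, restoring uniqueness by `(run_id, pin, card)` for consumers. A sketch:

```sql
-- Sketch of a view that exposes only the highest-version comps per
-- (run_id, pin, card), i.e. the published comps for each run
CREATE VIEW model.vw_comp AS
SELECT *
FROM (
    SELECT
        comp.*,
        ROW_NUMBER() OVER (
            PARTITION BY run_id, pin, card
            ORDER BY version DESC
        ) AS version_rank
    FROM model.comp AS comp
) AS ranked
WHERE version_rank = 1;
```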
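Finally, the new workflow's write step would need to determine the incremented `version` for the supplied run ID. Whether this happens in SQL or in pipeline code is an open question, but one way to derive it, where `:run_id` is a placeholder for the workflow's run ID input:

```sql
-- Derive the next comps version for a rerun of an existing model;
-- :run_id is a placeholder for the run ID supplied to the workflow
SELECT COALESCE(MAX(version), 0) + 1 AS next_version
FROM model.comp
WHERE run_id = :run_id;
```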