Allow users to compute comps using existing model object #383

@jeancochrane

Description

For a complicated set of reasons, it's currently quite annoying to rerun comps for an existing model, something that we've wanted to do for the past two final models. Let me explain the reasons in detail:

Currently, users can only compute comps in the interpret stage as part of a full model pipeline run. Computing comps is also extremely resource-intensive, so we can generally only run it on remote infrastructure like AWS Batch unless we're working with a small subset of data. Together, these two constraints mean that if we want to regenerate comps for a final model, we have to run an entirely new model, generating new values and a new run ID in the process.

This is particularly problematic given that model values are not fully deterministic (#373), so we can't simply use a comps run in place of a final model run when publishing comps; instead, we need to take care to use the final model run for publishing values while using a subsequent comps run for publishing comps. This adds unnecessary complexity to the comps publishing process, and it also feels counterintuitive, since we're not actually changing anything about the model structure itself, just rerunning comps.

I propose that we refactor our comps process and data model in order to make it possible for users to run remote jobs to recalculate comps for existing models.

High-level steps for this refactor include:

  • Tweak the comps data model to allow multiple comps runs per model run: model.comp is currently unique by (run_id, pin, card), but the new approach will allow multiple comps runs to share the same run ID
    • I think the easiest solution is probably adding an incrementing version column, such that subsequent runs using the same run ID increment the version field, and the row with the highest version is the published comp; if we do this we should migrate existing comps data to add version = 1 for all existing comps
  • Update the interpret pipeline stage to save comps using the new data model
  • Update code that consumes comps (like PINVAL and the 2-PIN comparison doc) to handle the new data model, which we could do two ways:
    1. Tweak the consuming code to query the most recent version for all comps
    2. Switch consuming code to point to a new view called something like model.vw_comp that performs the version filtering logic and thereby maintains uniqueness by (run_id, pin, card)
      • This is probably the better path forward
  • Make a new GitHub workflow to run comps based on an existing model run ID
    • Overall workflow logic: Pull model artifacts for the supplied run ID, then run dvc repro interpret to run the interpret stage with comp_enable=True and write a set of comps with an incremented version
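The version-filtering logic that a view like model.vw_comp would encapsulate can be sketched in Python (the view name follows the proposal above; the helper function and sample rows are hypothetical):

```python
# Sketch of the version-filtering logic for the proposed model.vw_comp view:
# for each (run_id, pin, card) key, keep only the row with the highest
# version, restoring uniqueness by (run_id, pin, card) for consumers.

def latest_comps(rows):
    """Return the highest-version comp row for each (run_id, pin, card)."""
    latest = {}
    for row in rows:
        key = (row["run_id"], row["pin"], row["card"])
        if key not in latest or row["version"] > latest[key]["version"]:
            latest[key] = row
    return list(latest.values())

# Two comps runs against the same model run ID: version 1 from the original
# pipeline run, version 2 from a later standalone comps rerun.
comps = [
    {"run_id": "2024-02-16-abc", "pin": "0101", "card": 1,
     "version": 1, "comp_pin": "0202"},
    {"run_id": "2024-02-16-abc", "pin": "0101", "card": 1,
     "version": 2, "comp_pin": "0303"},
]

# Only the version=2 row survives, so consumers like PINVAL see one comp
# set per (run_id, pin, card) without knowing about versions.
print(latest_comps(comps))
```

In SQL, the same filter would typically be a window function (e.g. ROW_NUMBER() OVER (PARTITION BY run_id, pin, card ORDER BY version DESC)) keeping only the first row per partition, which is why pushing it into a view is likely cleaner than repeating it in each consumer.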
