Adding CLI for `prepare_data` and `run_models` #50

marlinfiggins · 2025-03-14T23:28:12Z

Following up on #48 and #49, I'm adding a CLI for preparing data and running models.

Overview

This PR introduces a standardized command-line interface (CLI) for running evofr models, eliminating the need for per-project run-model.py scripts. The new CLI allows users to specify model configurations, data inputs, and inference settings via a YAML configuration file while also supporting command-line file overrides for input data.

Challenge

The biggest challenge for this CLI is that the required inputs vary across data classes, models, and inference methods. Some models require priors (e.g., InnovationMLR), others need different types of sequence count data (e.g., VariantFrequencies vs. InnovationSequenceCounts), and inference methods have different parameter requirements (e.g., InferNUTS vs. InferMAP). The CLI dynamically parses the configuration, identifies the appropriate components, and instantiates them with the correct arguments while allowing users to override file-based inputs at runtime.

Solution

The CLI is implemented within evofr/commands/run_model.py and follows these steps:

Configuration Loading: Reads a YAML file containing "model", "data", and "inference" sections.
Data File Handling: Paths to data files (e.g., raw_seq_path) in the YAML can be overridden via CLI arguments (--raw-seq-path). The CLI automatically loads these files into pandas DataFrames before passing them to data classes.
Component Instantiation: Uses a recursive function to identify and instantiate models, data classes, priors, and inference methods based on the YAML specification.
Model Execution: Runs the specified inference method on the model and data.
Results Export: Saves the fitted model results in JSON format to the specified output directory.
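
As a rough illustration of the component instantiation step, here is a minimal sketch of a recursive config walker. It is not the code in evofr/commands/run_model.py: the `REGISTRY` dict, the `instantiate` helper, the `ExamplePrior` class, and the fields on the stand-in classes are all assumptions made so the example runs without evofr installed. The idea is that any mapping with a `type` key becomes an object, and nested mappings (for example a prior inside a model) are built first.

```python
from dataclasses import dataclass
from typing import Any

# Stand-in classes so the sketch runs on its own.
# Names echo components mentioned above; the fields are hypothetical.
@dataclass
class ExamplePrior:  # hypothetical prior, not an evofr class
    scale: float

@dataclass
class InnovationMLR:
    tau: float
    prior: Any

@dataclass
class InferNUTS:
    num_warmup: int
    num_samples: int

# Hypothetical registry mapping a config "type" string to a class.
REGISTRY = {cls.__name__: cls for cls in (ExamplePrior, InnovationMLR, InferNUTS)}

def instantiate(spec):
    """Recursively turn a parsed config node into Python objects."""
    if isinstance(spec, list):
        return [instantiate(item) for item in spec]
    if not isinstance(spec, dict):
        return spec  # scalars pass through unchanged
    # Build nested components before the component that owns them.
    kwargs = {key: instantiate(value) for key, value in spec.items() if key != "type"}
    if "type" not in spec:
        return kwargs  # plain mapping, not a component
    if spec["type"] not in REGISTRY:
        raise ValueError(f"Unknown component type: {spec['type']!r}")
    return REGISTRY[spec["type"]](**kwargs)

# A dict standing in for the parsed YAML file.
config = {
    "model": {
        "type": "InnovationMLR",
        "tau": 4.2,
        "prior": {"type": "ExamplePrior", "scale": 0.1},
    },
    "inference": {"type": "InferNUTS", "num_warmup": 500, "num_samples": 1500},
}
components = {section: instantiate(spec) for section, spec in config.items()}
```

Failing loudly on an unknown `type` keeps configuration typos from surfacing later as confusing constructor errors.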

Remaining challenges

Adding forecasting logic
We still need to filter the data set to the locations of interest before passing it to the model config or CLI.
We need to fully scope out what we want the prepare-data command to do.

Usage instructions

For testing purposes, you can run poetry shell and then use the commands below.

To use a config file, just run:

evofr run-model --config config.yaml

If you want to override input files at runtime:

evofr run-model --config config.yaml \
    --raw-seq-path data/new_seq.tsv \
    --export-path results/
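
One way the override behaviour could work is sketched below with `argparse`. The flag names come from the commands above; the `build_parser` and `apply_overrides` helpers, and the choice to write overrides into the `data` and `export` sections, are assumptions for illustration rather than the actual implementation.

```python
import argparse

def build_parser():
    """Parser for the run-model options shown above."""
    parser = argparse.ArgumentParser(prog="evofr run-model")
    parser.add_argument("--config", required=True, help="Path to the YAML config")
    parser.add_argument("--raw-seq-path", default=None, help="Override data.raw_seq_path")
    parser.add_argument("--export-path", default=None, help="Override export.export_path")
    return parser

def apply_overrides(config, args):
    """Return a copy of config where CLI values replace file values."""
    merged = {section: dict(values) for section, values in config.items()}
    if args.raw_seq_path is not None:
        merged.setdefault("data", {})["raw_seq_path"] = args.raw_seq_path
    if args.export_path is not None:
        merged.setdefault("export", {})["export_path"] = args.export_path
    return merged

# The parsed YAML would normally come from the file named by --config;
# a literal dict stands in for it here.
config = {
    "data": {
        "type": "VariantFrequencies",
        "raw_seq_path": "test/testing_data/mlr-variant-counts.tsv",
    },
    "export": {"export_path": "results/"},
}
args = build_parser().parse_args(
    ["--config", "config.yaml", "--raw-seq-path", "data/new_seq.tsv"]
)
merged = apply_overrides(config, args)
```

Defaulting each override to `None` lets the merge step distinguish "flag not given" from any real path, so values from the config file survive unless a flag is passed explicitly.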

### Example config

```yaml
model:
  type: "MultinomialLogisticRegression"
  tau: 4.2
data:
  type: "VariantFrequencies"
  raw_seq_path: "test/testing_data/mlr-variant-counts.tsv"
  pivot: "C"
inference:
  type: "InferNUTS"
  num_warmup: 500
  num_samples: 1500
export:
  export_path: "results/"
  sites: ["freq", "ga"]
  dated: [True, False]
  forecasts: [False, False]
```

marlinfiggins · 2025-03-17T20:17:51Z

Adding `prepare-data` command

Overview

I've added a prepare-data command. It ensures that input case counts and sequence counts are filtered, pruned, collapsed (rare variants are grouped into an "other" category), and formatted before analysis.

Previously, each evofr-based analysis handled data preprocessing through separate scripts that were copied and modified as needed. By centralizing this, we reduce duplicated code, improve maintainability, and simplify the preprocessing workflow across evofr projects.

Future concerns

Different analyses require different preprocessing steps. The current implementation starts with overall sequence counts per lineage and then collapses rare variants into an “other” category, but in the future, we may want to collapse variants into their parent lineages instead, as in collapse-lineage-counts.py in forecasts-ncov.

Generalizing this script may be challenging since different evofr models require different types of data processing. For example:

ef.MLRNowcast requires specialized sequence processing to get both submission and collection dates (ef.DelaySequenceCounts).
ef.InnovationMLR needs additional input data (ef.InnovationSequenceCounts) including a parent-variant mapping and relies on collapsing small lineages.
Future models may require new preprocessing steps, making long-term maintenance a consideration.

I think it might be useful to break these into individual scripts and commands in the future, since case counts are not always needed. To reflect this and allow easy expansion of commands later, we might consider splitting the command into prepare-sequence-counts and prepare-cases to better separate these processes.

Approach

The CLI is implemented in evofr/cli.py and evofr/commands/prepare_data.py and follows these steps:

Configuration Handling

Reads a YAML configuration file specifying data paths and preprocessing settings.
Allows command-line overrides for data file paths.

Data Preprocessing

Filtering by Date: Restricts input data to a specified date range.
Pruning Recent Data: Excludes recent sequences to mitigate reporting biases.
Location-Based Filtering: Includes or excludes locations based on sequence counts.
Collapsing Clades: Groups variants below a sequence threshold into an "other" category.

Data Output

Saves processed sequence counts and case counts as tab-separated (.tsv) files for use in evofr models.
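
To make the date filtering, pruning, and collapsing steps concrete, here is a minimal sketch in plain Python. It is not the code in evofr/commands/prepare_data.py. The row layout (`location`, `variant`, `date`, `sequences`), the helper names, and the reading of `included_days` and `prune_seq_days` as offsets from `max_date` are all assumptions. For brevity it collapses on totals over the whole retained window rather than a separate `clade_min_seq_days` window, and it omits location filtering.

```python
from collections import Counter
from datetime import date, timedelta

def date_window(max_date, included_days, prune_seq_days):
    """Return the (start, end) dates to keep, both inclusive (assumed semantics)."""
    start = max_date - timedelta(days=included_days)
    end = max_date - timedelta(days=prune_seq_days)  # drop the most recent days
    return start, end

def filter_by_date(rows, start, end):
    """Keep only rows whose date falls inside the window."""
    return [row for row in rows if start <= row["date"] <= end]

def collapse_rare_variants(rows, min_seq, keep=()):
    """Relabel variants with fewer than min_seq sequences as "other"."""
    totals = Counter()
    for row in rows:
        totals[row["variant"]] += row["sequences"]
    collapsed = Counter()
    for row in rows:
        variant = row["variant"]
        if totals[variant] < min_seq and variant not in keep:
            variant = "other"
        collapsed[(row["location"], variant, row["date"])] += row["sequences"]
    return [
        {"location": loc, "variant": var, "date": day, "sequences": count}
        for (loc, var, day), count in sorted(collapsed.items())
    ]

# Made-up rows for illustration only.
rows = [
    {"location": "USA", "variant": "BA.1", "date": date(2024, 2, 1), "sequences": 2},
    {"location": "USA", "variant": "BA.2", "date": date(2024, 2, 1), "sequences": 40},
    {"location": "USA", "variant": "XBB", "date": date(2024, 2, 1), "sequences": 3},
    {"location": "USA", "variant": "JN.1", "date": date(2024, 2, 1), "sequences": 4},
    {"location": "USA", "variant": "XBB", "date": date(2024, 3, 8), "sequences": 50},
    {"location": "USA", "variant": "BA.2", "date": date(2023, 12, 1), "sequences": 9},
]

window = date_window(date(2024, 3, 10), included_days=60, prune_seq_days=7)
kept = filter_by_date(rows, *window)
processed = collapse_rare_variants(kept, min_seq=10, keep={"BA.1", "BA.2"})
```

With these made-up rows, the two out-of-window rows are dropped, BA.1 survives despite its low count because it is force-included, and XBB and JN.1 are merged into a single "other" row.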

Usage instructions

To prepare data using a configuration file:

evofr prepare-data --config config.yaml

To override input file paths at runtime:

evofr prepare-data --config config.yaml \
    --seq-counts data/new_sequences.tsv \
    --cases data/new_cases.tsv

Example YAML config.yaml

prepare_data:
  seq_counts: "data/raw_sequences.tsv"
  cases: "data/raw_cases.tsv"  # Optional: Can be omitted if not needed
  output_seq_counts: "data/processed_sequences.tsv"
  output_cases: "data/processed_cases.tsv"  # Optional: Can be omitted
  max_date: "2024-03-10"
  included_days: 60
  prune_seq_days: 7
  location_min_seq: 5
  location_min_seq_days: 30
  excluded_locations: "data/excluded_locations.txt"
  clade_min_seq: 10
  clade_min_seq_days: 14
  force_include_clades:
    - "BA.1"
    - "BA.2"

marlinfiggins added 5 commits March 14, 2025 16:05

First pass at model-run script

cae189b

Adding init_to_MAP NUTS sampler

ab2b204

Registering priors for InnovationModel

086f478

Adding proper exporting

7242870

First pass at prepare-data

0ee22a4

marlinfiggins added 3 commits March 17, 2025 14:30

Adding tests for CLI

0f18efe

Switching to MAP for tests

a2e157f

More test fixes

be15b7b
