Adding CLI for prepare_data and run_models #50

Open
wants to merge 8 commits into
base: main

marlinfiggins
Collaborator

@marlinfiggins marlinfiggins commented Mar 14, 2025

Following up on #48 and #49, this PR adds a CLI for preparing data and running models.

Overview

This PR introduces a standardized command-line interface (CLI) for running evofr models, eliminating the need for per-project run-model.py scripts. The new CLI allows users to specify model configurations, data inputs, and inference settings via a YAML configuration file while also supporting command-line file overrides for input data.

Challenge

The biggest challenge for this CLI is that the required inputs vary across data classes, models, and inference methods. Some models require priors (e.g., InnovationMLR), others need different types of sequence count data (e.g., VariantFrequencies vs. InnovationSequenceCounts), and inference methods have different parameter requirements (e.g., InferNUTS vs. InferMAP). The CLI dynamically parses the configuration, identifies the appropriate components, and instantiates them with the correct arguments, while allowing users to override file-based inputs at runtime.
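
One way to surface mismatched inputs early is to compare the config keys against each constructor's signature before instantiating anything. This is a minimal sketch under that assumption, not the code in this PR; `missing_arguments` and `ToyInnovationModel` are hypothetical names used only for illustration.

```python
import inspect


def missing_arguments(cls, kwargs):
    """List required constructor parameters that kwargs does not supply."""
    missing = []
    for name, param in inspect.signature(cls).parameters.items():
        # *args / **kwargs never count as required.
        if param.kind in (param.VAR_POSITIONAL, param.VAR_KEYWORD):
            continue
        if param.default is param.empty and name not in kwargs:
            missing.append(name)
    return missing


class ToyInnovationModel:  # hypothetical stand-in, not an evofr class
    def __init__(self, tau, prior, pool_scale=None):
        self.tau = tau
        self.prior = prior
        self.pool_scale = pool_scale


# A config that forgets the prior is reported before any model is built.
problems = missing_arguments(ToyInnovationModel, {"tau": 4.2})
```

A check like this would let the CLI report "model requires a prior" instead of failing with a bare TypeError deep inside instantiation.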

Solution

The CLI is implemented within evofr/commands/run_model.py and follows these steps:

  • Configuration Loading: Reads a YAML file containing "model", "data", and "inference" sections.
  • Data File Handling: Paths to data files (e.g., raw_seq_path) in the YAML can be overridden via CLI arguments (--raw-seq-path). The CLI automatically loads these files into pandas DataFrames before passing them to data classes.
  • Component Instantiation: Uses a recursive function to identify and instantiate models, data classes, priors, and inference methods based on the YAML specification.
  • Model Execution: Runs the specified inference method on the model and data.
  • Results Export: Saves the fitted model results in JSON format to the specified output directory.
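
The component-instantiation step above can be sketched roughly as follows. This is an illustration of the general technique (a registry plus a recursive builder keyed on "type"), not the actual contents of run_model.py; the `Toy*` classes, `REGISTRY`, and `instantiate` are hypothetical, and the config is written as a plain dict standing in for the parsed YAML.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyPrior:  # hypothetical stand-in for a prior class
    scale: float = 1.0


@dataclass
class ToyModel:  # hypothetical stand-in for a model class
    tau: float
    prior: Any = None


@dataclass
class ToyInference:  # hypothetical stand-in for an inference method
    num_warmup: int = 100
    num_samples: int = 100


# Hypothetical registry mapping the config's "type" string to a class.
REGISTRY = {cls.__name__: cls for cls in (ToyPrior, ToyModel, ToyInference)}


def instantiate(spec: Any) -> Any:
    """Recursively build objects from dicts that carry a "type" key."""
    if isinstance(spec, list):
        return [instantiate(item) for item in spec]
    if not isinstance(spec, dict):
        return spec  # plain scalar: pass through unchanged
    # Build nested components first so they arrive as ready-made objects.
    kwargs = {k: instantiate(v) for k, v in spec.items() if k != "type"}
    if "type" not in spec:
        return kwargs  # ordinary mapping, not a component
    try:
        cls = REGISTRY[spec["type"]]
    except KeyError:
        raise ValueError(f"Unknown component type: {spec['type']!r}") from None
    return cls(**kwargs)


config = {
    "model": {
        "type": "ToyModel",
        "tau": 4.2,
        "prior": {"type": "ToyPrior", "scale": 0.5},
    },
    "inference": {"type": "ToyInference", "num_warmup": 500, "num_samples": 1500},
}
model = instantiate(config["model"])
inference = instantiate(config["inference"])
```

Because nested dicts are resolved before their parent, a prior declared inside the model section reaches the model constructor as an object rather than as raw config.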

Remaining challenges

  • Adding forecasting logic
  • We still need to filter the data set to locations of interest before passing to the model config or CLI.
  • We need to fully scope out what we want the prepare-data command to do.

Usage instructions

For testing purposes, you can run poetry shell and then use the commands below.

To use a config file, just run:

evofr run-model --config config.yaml

If you want to override input files at runtime:

evofr run-model --config config.yaml \
    --raw-seq-path data/new_seq.tsv \
    --export-path results/
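
The override behavior can be sketched with argparse as below. The flag names come from the commands above; mapping them onto `data.raw_seq_path` and `export.export_path` is an assumption based on the example config, and `build_parser` and `apply_overrides` are hypothetical helper names. The config dict stands in for the parsed YAML file.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="evofr run-model")
    parser.add_argument("--config", required=True)
    # Optional overrides default to None so "not given" is distinguishable.
    parser.add_argument("--raw-seq-path", default=None)
    parser.add_argument("--export-path", default=None)
    return parser


def apply_overrides(config: dict, args: argparse.Namespace) -> dict:
    """Return a copy of config with CLI-provided paths taking precedence."""
    merged = {section: dict(values) for section, values in config.items()}
    if args.raw_seq_path is not None:
        merged.setdefault("data", {})["raw_seq_path"] = args.raw_seq_path
    if args.export_path is not None:
        merged.setdefault("export", {})["export_path"] = args.export_path
    return merged


config = {
    "data": {
        "type": "VariantFrequencies",
        "raw_seq_path": "test/testing_data/mlr-variant-counts.tsv",
    },
    "export": {"export_path": "results/"},
}
args = build_parser().parse_args(
    ["--config", "config.yaml", "--raw-seq-path", "data/new_seq.tsv"]
)
merged = apply_overrides(config, args)
```

Only flags that were actually passed replace config values, so the YAML file remains the default source for everything else.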

Example config

model:
  type: "MultinomialLogisticRegression"
  tau: 4.2
data:
  type: "VariantFrequencies"
  raw_seq_path: "test/testing_data/mlr-variant-counts.tsv"
  pivot: "C"
inference:
  type: "InferNUTS"
  num_warmup: 500
  num_samples: 1500
export:
  export_path: "results/"
  sites: ["freq", "ga"]
  dated: [True, False]
  forecasts: [False, False]

@marlinfiggins
Collaborator Author

marlinfiggins commented Mar 17, 2025

Adding prepare-data command

Overview

I've added a prepare-data command. It filters, prunes, and formats the input case counts and sequence counts before analysis, and collapses rare variants into an "other" category.

Previously, each evofr-based analysis handled data preprocessing through separate scripts that were copied and modified as needed. By centralizing this, we reduce duplicated code, improve maintainability, and simplify the preprocessing workflow across evofr projects.

Future concerns

Different analyses require different preprocessing steps. The current implementation starts from overall lineage sequence counts and then collapses rare variants into an "other" category. In the future, we may instead want to collapse variants into their parent lineages, as collapse-lineage-counts.py does in forecasts-ncov.
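
For reference, collapsing into "other" amounts to something like the sketch below. It is a simplified illustration, not the code in prepare_data.py: it counts sequences over all rows rather than over a recent window, uses plain tuples instead of a DataFrame to stay dependency-free, and `collapse_rare_clades` is a hypothetical name.

```python
from collections import Counter, defaultdict


def collapse_rare_clades(rows, clade_min_seq, force_include=()):
    """Fold clades with fewer than clade_min_seq total sequences into "other".

    rows: list of (date, location, clade, sequences) tuples.
    """
    totals = Counter()
    for _date, _location, clade, sequences in rows:
        totals[clade] += sequences
    # Forced clades are kept even when they fall below the threshold.
    keep = {c for c, n in totals.items() if n >= clade_min_seq} | set(force_include)

    collapsed = defaultdict(int)
    for date, location, clade, sequences in rows:
        label = clade if clade in keep else "other"
        collapsed[(date, location, label)] += sequences
    return [(d, loc, c, n) for (d, loc, c), n in sorted(collapsed.items())]


rows = [
    ("2024-03-01", "USA", "BA.1", 1),
    ("2024-03-01", "USA", "JN.1", 40),
    ("2024-03-01", "USA", "XBB", 3),
    ("2024-03-01", "USA", "EG.5", 2),
    ("2024-03-02", "USA", "JN.1", 35),
    ("2024-03-02", "USA", "XBB", 4),
]
result = collapse_rare_clades(rows, clade_min_seq=10, force_include=["BA.1", "BA.2"])
```

Collapsing into parent lineages instead would replace the fixed "other" label with a lookup of each rare clade's parent, which is why it may deserve its own option or command.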

Generalizing this script may be challenging since different evofr models require different types of data processing. For example:

I think it might be useful to break these into individual scripts and commands in the future, since case counts are not always needed. To reflect this and allow easy expansion later, we might consider splitting the command into prepare-sequence-counts and prepare-cases to better separate these processes.

Approach

The CLI is implemented in evofr/cli.py and evofr/commands/prepare_data.py and follows these steps:

  1. Configuration Handling:
     • Reads a YAML configuration file specifying data paths and preprocessing settings.
     • Allows command-line overrides for data file paths.
  2. Data Preprocessing:
     • Filtering by Date: Restricts input data to a specified date range.
     • Pruning Recent Data: Excludes recent sequences to mitigate reporting biases.
     • Location-Based Filtering: Includes or excludes locations based on sequence counts.
     • Collapsing Clades: Groups variants below a sequence threshold into an "other" category.
  3. Data Output:
     • Saves processed sequence counts and case counts as tab-separated (.tsv) files for use in evofr models.
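
The date filtering and pruning steps can be sketched as below. The semantics are an assumption on my part: the window ends at max_date and spans included_days days counting max_date itself, and sequences newer than max_date minus prune_seq_days are dropped. `date_window` and `filter_sequences` are hypothetical names, and plain tuples stand in for the DataFrames the command actually works with.

```python
from datetime import date, timedelta


def date_window(max_date, included_days=None, prune_seq_days=0):
    """Return (min_date, max_seq_date) for the analysis window.

    min_date is None when included_days is not set (no lower bound).
    """
    min_date = None
    if included_days is not None:
        min_date = max_date - timedelta(days=included_days - 1)
    max_seq_date = max_date - timedelta(days=prune_seq_days)
    return min_date, max_seq_date


def filter_sequences(rows, min_date, max_seq_date):
    """Keep (date, location, clade, sequences) rows inside the window."""
    return [
        row
        for row in rows
        if (min_date is None or row[0] >= min_date) and row[0] <= max_seq_date
    ]


min_date, max_seq_date = date_window(
    date(2024, 3, 10), included_days=60, prune_seq_days=7
)
rows = [
    (date(2024, 1, 5), "USA", "JN.1", 12),   # before the window
    (date(2024, 2, 20), "USA", "JN.1", 30),  # inside the window
    (date(2024, 3, 8), "USA", "JN.1", 4),    # too recent, pruned
]
kept = filter_sequences(rows, min_date, max_seq_date)
```

Keeping the window calculation separate from the row filter makes it easy to apply the same lower bound to case counts while pruning only the sequence data.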

Usage instructions

To prepare data using a configuration file:

evofr prepare-data --config config.yaml

To override input file paths at runtime:

evofr prepare-data --config config.yaml \
    --seq-counts data/new_sequences.tsv \
    --cases data/new_cases.tsv

Example YAML config.yaml

prepare_data:
  seq_counts: "data/raw_sequences.tsv"
  cases: "data/raw_cases.tsv"  # Optional: Can be omitted if not needed
  output_seq_counts: "data/processed_sequences.tsv"
  output_cases: "data/processed_cases.tsv"  # Optional: Can be omitted
  max_date: "2024-03-10"
  included_days: 60
  prune_seq_days: 7
  location_min_seq: 5
  location_min_seq_days: 30
  excluded_locations: "data/excluded_locations.txt"
  clade_min_seq: 10
  clade_min_seq_days: 14
  force_include_clades:
    - "BA.1"
    - "BA.2"
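
Since several of these keys are optional, it may help to validate the section once up front. The sketch below is one possible shape, not what the PR implements: the field names mirror the example above, while the defaults, the `PrepareDataConfig` class, and `from_dict` are assumptions for illustration. The input dict stands in for the parsed YAML.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class PrepareDataConfig:
    # Required paths.
    seq_counts: str
    output_seq_counts: str
    # Case counts are optional; defaults below are assumed, not evofr's.
    cases: Optional[str] = None
    output_cases: Optional[str] = None
    max_date: Optional[str] = None
    included_days: Optional[int] = None
    prune_seq_days: int = 0
    location_min_seq: int = 0
    location_min_seq_days: Optional[int] = None
    excluded_locations: Optional[str] = None
    clade_min_seq: int = 0
    clade_min_seq_days: Optional[int] = None
    force_include_clades: List[str] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict) -> "PrepareDataConfig":
        section = raw.get("prepare_data") or {}
        # Reject misspelled keys instead of silently ignoring them.
        unknown = set(section) - {f.name for f in fields(cls)}
        if unknown:
            raise ValueError(f"Unknown prepare_data keys: {sorted(unknown)}")
        return cls(**section)


cfg = PrepareDataConfig.from_dict(
    {
        "prepare_data": {
            "seq_counts": "data/raw_sequences.tsv",
            "output_seq_counts": "data/processed_sequences.tsv",
            "clade_min_seq": 10,
            "force_include_clades": ["BA.1", "BA.2"],
        }
    }
)
```

With this shape, leaving out `cases` simply yields `None`, which the command could use to skip case-count processing entirely.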
