scripts/eval_datasets.py measures how many annotated evidence spans survive pruning across a configuration of context-relevance datasets. This document explains how to run the CLI, what each configuration does, and how to interpret the generated artefacts. It intentionally omits score tables so the instructions remain evergreen.
For each dataset in a config file, the script:
- Loads the dataset from Hugging Face (e.g.,
hotchpotch/msmarco-context-relevance). - Runs
model.process()to prune each passage. - Compares the pruned spans against the labelled evidence annotations.
- Computes span-level precision, recall, and a β = 2 F2 score (recall-weighted) plus mean compression.
Dropping relevant spans (false negatives) is more damaging than keeping surplus context, so F2 is the headline metric.
- Some datasets contain very long passages (>60 k characters). If you hit memory errors, temporarily limit evaluation via
--limit, use the nano configs, or regenerate the dataset with shorter spans. - A small number of queries in the multilingual sets are malformed or language-mismatched. The published configs already omit the worst offenders. If you uncover new issues, send a PR to update the source dataset rather than editing the evaluation script.
- Compression percentages are per-dataset averages; for heterogeneous corpora (e.g., GooAQ vs. JA-focused Wikipedia), expect different baseline compression even at the same threshold.
All configs live under configs/eval_datasets/:
| File | Purpose |
|---|---|
ja.yaml, en.yaml |
Full evaluation suites (all datasets, full sample counts). |
ja_nano.yaml, en_nano.yaml |
“Nano” subsets with per-dataset n_samples overrides for quick smoke tests. Use these when iterating on code or verifying regressions; they run 10–20× faster. |
Each entry in a config looks like:
- dataset_name: hotchpotch/msmarco-context-relevance
subset: default
n_samples: 100 # only in *_nano.yamlThe script reads each row sequentially. You can clone a file and add/remove datasets for ad‑hoc scenarios. Optional keys:
split: override the global split from the YAML header.n_samples: cap the number of records loaded from that dataset (only present in*_nano.yaml).
uv run python scripts/eval_datasets.py \
--config CONFIG_PATH \
--model MODEL_DIR \
--threshold 0.1 \
--batch-size 256 \
--timing-details \
--output-json tmp/eval_<label>.json \
--output-file tmp/eval_<label>.mdReplace:
CONFIG_PATHwith one of the YAML files described above.MODEL_DIRwith the local run directory or exported checkpoint (the script auto-detectsfinal_model/when present).--thresholdwith the pruning threshold you want to sweep.--thresholds/--th(optional) to evaluate multiple thresholds in one run.
Useful flags:
| Flag | Description |
|---|---|
--thresholds/--th |
Comma-separated list of extra thresholds. Repeat the flag to add more. |
--split |
Override the split for every dataset (default: YAML split). |
--limit |
Evaluate only the first N examples per dataset (applied after any n_samples cap). |
--target |
Restrict evaluation to specific datasets (dataset_name:subset). Repeatable. |
--device |
Force a specific device (e.g., cuda, cuda:1, cpu). |
--batch-size |
Controls the number of examples passed to model.process per call (default 512). |
--no-progress / --silent |
Suppress progress bars or all intermediate logs. |
--timing-details |
Print per-stage timing and include them in the summaries. |
--use-automodel |
Load checkpoints via transformers.AutoModel (useful for remote-code models). |
The Markdown (.md) and JSON (.json) summaries land wherever you point --output-file / --output-json. Many workflows use tmp/ during iteration, then copy artefacts under output/release_models/<model>/eval_results/ when ready to publish.
To run a fast smoke test:
uv run python scripts/eval_datasets.py \
--config configs/eval_datasets/ja_nano.yaml \
--model output/release_models/open-provence-reranker-v1 \
--threshold 0.1 \
--batch-size 128 \
--output-json tmp/eval_v1_ja_nano.jsonThe *_nano.yaml configs cap each dataset at n_samples=100, so the whole run finishes in a few minutes while still hitting every dataset.
To sweep thresholds without editing the YAML, call the script in a loop:
for th in 0.05 0.1 0.3 0.5; do
uv run python scripts/eval_datasets.py \
--config configs/eval_datasets/en.yaml \
--model output/release_models/open-provence-reranker-v1-gte-modernbert-base \
--threshold "$th" \
--batch-size 256 \
--output-json tmp/eval_en_th_${th//./_}.json \
--output-file tmp/eval_en_th_${th//./_}.md
doneCombine this with the nano configs for rapid iteration:
for th in 0.05 0.1; do
uv run python scripts/eval_datasets.py \
--config configs/eval_datasets/en_nano.yaml \
--model output/release_models/open-provence-reranker-v1-gte-modernbert-base \
--threshold "$th" \
--batch-size 128 \
--output-json tmp/eval_en_nano_th_${th//./_}.json
doneOnce the numbers look good, re-run with the full configs to generate publishable artefacts.
- Inspect
tmp/eval_<label>.jsonand check that:macrometrics (macro F2/recall/precision) are within expected ranges.datasetsentries are present for every config row.
- Copy the JSON and Markdown into the release tree, e.g.,
cp tmp/eval_en_th_0_1.{json,md} \ output/release_models/open-provence-reranker-v1-gte-modernbert-base/eval_results/eval_datasets_en.* - Update any dated reports (e.g.,
docs/eval_reports/<date>.md) if the numbers correspond to a release milestone.
- Config matches the intended language (no
enmodel onja.yamlwithout overrides). - Ignore file includes newly discovered bad records when necessary.
- Markdown/JSON summaries stored under
tmp/have been archived or copied to the release directory. - Threshold sweeps include both full and nano configs during development.
Following this playbook keeps dataset-level pruning evaluations reproducible and manageable, whether you are iterating on new pruning heads or validating regression fixes.