Current focus: paper-ready evaluation infrastructure and baseline comparison.
Split evaluation into three stages so baseline training (expensive) doesn't re-run for every new model checkpoint.
Stage 1 (harness.py): runs once per test dataset. Extracts ground truth, trains all baselines, caches predictions. Re-run only when test data, baseline hyperparams, or extraction logic change.
Inputs: test.arrayrecord, baseline config
Outputs: outputs/eval_harness/{harness_id}/
- observations.parquet — one row per RTT prediction position: (seq_idx, meas_idx, src_key, dst_key, actual_rtt_ms, timestamp, global_median_pred, ema_pred, vivaldi_pred, mf_pred, trmf_pred)
- baselines_meta.json — hyperparams, training time, IP vocab size, etc.
- trmf_model.npz — saved TRMF matrices (F, X, W) for inspection
What it does:
- Load test sequences, extract (src_key, dst_key, rtt_ms, timestamp) at every RTT position
- Compute simple baselines inline (global median, EMA, last-seen, window-mean)
- Train Vivaldi (4D+height, ~5 epochs, seconds)
- Train static biased MF (r=16, 10 epochs, seconds)
- Train TRMF (restructure into time-indexed matrix, alternating GD, minutes)
- Record each baseline's prediction at every position
- Save to parquet
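A minimal sketch of the inline simple baselines, assuming observations arrive as a DataFrame with the column names used above; the window size and EMA alpha here are illustrative defaults, not the harness's actual hyperparameters:

```python
import pandas as pd

def add_simple_baselines(obs: pd.DataFrame, window: int = 32, ema_alpha: float = 0.3) -> pd.DataFrame:
    """obs: one row per RTT position with seq_idx, meas_idx, src_key, dst_key, rtt_ms."""
    obs = obs.sort_values(["seq_idx", "meas_idx"]).copy()
    # Global median: a single constant prediction for every position
    obs["global_median_pred"] = obs["rtt_ms"].median()
    # Per-(src, dst) baselines; shift(1) so the current RTT never predicts itself
    pair = obs.groupby(["src_key", "dst_key"])["rtt_ms"]
    obs["last_seen_pred"] = pair.shift(1)
    obs["ema_pred"] = pair.transform(lambda s: s.shift(1).ewm(alpha=ema_alpha).mean())
    obs["window_mean_pred"] = pair.transform(lambda s: s.shift(1).rolling(window, min_periods=1).mean())
    return obs
```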
Stage 2 (model_eval.py): runs once per checkpoint. Forward pass on test sequences, extract model predictions.
Inputs: checkpoint .pt, test.arrayrecord (or harness observations.parquet for position alignment)
Outputs: outputs/eval_harness/{harness_id}/model_preds/{run_name}.parquet
(seq_idx, meas_idx, model_top1_pred, rtt_byte1_logprobs, rtt_byte2_logprobs)
What it does:
- Load model checkpoint
- Forward pass on each test sequence
- At each RTT position (keyed by seq_idx, meas_idx to match harness), extract model's top-1 RTT prediction
- Save to parquet
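A rough sketch of the per-checkpoint loop, assuming the model returns raw next-token logits and that the positions recorded by the harness index the logit predicting RTT byte 1; decode_rtt_bytes is a hypothetical placeholder for the project's byte-to-milliseconds decoding:

```python
from collections import defaultdict
import pandas as pd
import torch

@torch.no_grad()
def eval_checkpoint(model, test_sequences, rtt_positions, decode_rtt_bytes):
    """rtt_positions: list of (seq_idx, meas_idx, pos) taken from the harness parquet
    so rows line up exactly with observations.parquet."""
    by_seq = defaultdict(list)
    for seq_idx, meas_idx, pos in rtt_positions:
        by_seq[seq_idx].append((meas_idx, pos))
    rows = []
    for seq_idx, positions in by_seq.items():
        logprobs = model(test_sequences[seq_idx].unsqueeze(0))[0].log_softmax(-1)
        for meas_idx, pos in positions:
            b1 = logprobs[pos].argmax().item()        # byte 1 (exponent), greedy top-1
            b2 = logprobs[pos + 1].argmax().item()    # byte 2 (mantissa), greedy top-1
            rows.append({
                "seq_idx": seq_idx, "meas_idx": meas_idx,
                "model_top1_pred": decode_rtt_bytes(b1, b2),
                "rtt_byte1_logprobs": logprobs[pos].cpu().numpy(),
                "rtt_byte2_logprobs": logprobs[pos + 1].cpu().numpy(),
            })
    return pd.DataFrame(rows)  # caller writes model_preds/{run_name}.parquet
```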
For transformer w/ vs w/o timestamps: run model_eval twice — once on standard test sequences, once on sequences tokenized without timestamps. Both produce separate parquet files under the same harness.
Stage 3 (analysis.py): runs locally, no GPU, fast. Joins harness + model predictions, generates figures and tables.
Inputs: harness parquet + one or more model_preds parquets
Outputs: outputs/figures/ and outputs/tables/
- cdf_comparison.pdf — main CDF figure, all methods
- cdf_comparison_log.pdf — log-scale x-axis variant
- percentile_table.csv — p50, p75, p90, p95 relative error per method
- loss_breakdown.csv — per-token-type CE and accuracy
- Additional figures as needed for paper
What it does:
- Load harness observations + model predictions, join on (seq_idx, meas_idx)
- Compute relative error per method per observation
- Generate CDF figure (one curve per method)
- Generate percentile table
- Optionally stratify by pair volatility, time-of-day, IP version, etc.
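The percentile table is a short aggregation over the same join; a sketch using the parquet column conventions above:

```python
import pandas as pd

def percentile_table(joined: pd.DataFrame) -> pd.DataFrame:
    """joined: harness observations merged with model predictions on (seq_idx, meas_idx)."""
    rows = {}
    for col in [c for c in joined.columns if c.endswith("_pred")]:
        rel_err = (joined[col] - joined["actual_rtt_ms"]).abs() / joined["actual_rtt_ms"]
        rows[col.removesuffix("_pred")] = rel_err.quantile([0.50, 0.75, 0.90, 0.95]).values
    return pd.DataFrame(rows, index=["p50", "p75", "p90", "p95"]).T  # -> percentile_table.csv
```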
src/ping_llm/eval/
harness.py # stage 1: extract ground truth + train/run all baselines
model_eval.py # stage 2: per-checkpoint forward pass
analysis.py # stage 3: join data, compute metrics, generate figures
vivaldi.py # Vivaldi NCS implementation
trmf.py # TRMF implementation
mf_baseline.py # biased MF (exists)
baselines.py # simple baselines: global median, EMA, etc. (refactor from existing)
token_classify.py # token type classification (exists)
loss_breakdown.py # per-type loss analysis (exists)
history_ping.py # live ping eval (exists)
run_all.py # legacy unified runner (keep working, wraps new stages)
outputs/
eval_harness/
default/ # harness_id = hash of test data + config
observations.parquet # ground truth + baseline predictions
baselines_meta.json
trmf_model.npz
model_preds/
deep60-60k.parquet # model predictions per checkpoint
deep60-was-60k.parquet
680m-131k.parquet
figures/ # generated by analysis.py
tables/
paper/
analysis.typ # typst, includes figures/ and tables/
After implementation, write docs/EVAL_GUIDE.md documenting:
- How the three-stage pipeline works
- When to re-run each stage (test data changes → harness; new checkpoint → model_eval; any change → analysis)
- How to add a new baseline method
- How to add a new analysis figure
- File formats and column schemas for parquet files
- How the typst report pulls in generated figures
Spring-force coordinate system. Each IP gets a coordinate vector (R^d) + height scalar.
- Predicted RTT: predicted_rtt(i, j) = ||coord_i - coord_j|| + height_i + height_j
- Adaptive timestep: delta = c_c * e_i / (e_i + e_j), c_c = 0.25
- Error estimate: EMA of relative prediction error
- Symmetric update: on each (src, dst, rtt), update both nodes
- Parameters: dim=4, height=yes, n_epochs=5
- Interface:
fit_vivaldi(measurements) -> dict[ip_key -> (coord, height, error)]
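A minimal sketch of fit_vivaldi using the standard Vivaldi update; only dim=4, n_epochs=5, and c_c=0.25 come from the notes above — the error-EMA constant c_e and the initialization scales are assumptions:

```python
import numpy as np
from collections import defaultdict

C_C = 0.25   # adaptive-timestep constant (from the notes)
C_E = 0.25   # error-EMA constant (assumed; not specified above)

def fit_vivaldi(measurements, dim=4, n_epochs=5, seed=0):
    """measurements: list of (src_key, dst_key, rtt_ms).
    Returns {ip_key: (coord, height, error)}."""
    rng = np.random.default_rng(seed)
    coord = defaultdict(lambda: rng.normal(scale=1e-3, size=dim))
    height = defaultdict(lambda: 1e-3)
    error = defaultdict(lambda: 1.0)   # relative-error estimate, starts maximally uncertain
    for _ in range(n_epochs):
        for src, dst, rtt in measurements:
            for i, j in ((src, dst), (dst, src)):   # symmetric update: both endpoints move
                diff = coord[i] - coord[j]
                norm = np.linalg.norm(diff)
                pred = norm + height[i] + height[j]
                force = rtt - pred                               # positive -> push apart
                w = error[i] / (error[i] + error[j])             # confidence weight
                rel_err = abs(force) / max(rtt, 1e-6)
                error[i] = rel_err * C_E * w + error[i] * (1 - C_E * w)  # EMA of relative error
                delta = C_C * w                                  # adaptive timestep
                unit = diff / norm if norm > 1e-9 else rng.normal(size=dim) * 1e-3
                coord[i] = coord[i] + delta * force * unit
                height[i] = max(1e-6, height[i] + delta * force)
    return {ip: (coord[ip], height[ip], error[ip]) for ip in coord}
```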
Temporal Regularized Matrix Factorization. Y = F @ X with AR regularization on X.
Data restructuring (the real complexity):
- Convert (src_key, dst_key, rtt_ms, timestamp) → sparse matrix Y of shape (n_pairs, n_timebins)
- Bin width: 15-minute bins (RIPE Atlas ~4 min interval → ~3-4 obs/bin)
- Observation mask: 1 where measured, 0 where missing
- Z-score normalize each row before fitting
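A sketch of the restructuring step under the column names used by the harness; per-cell averaging and the exact normalization handling are illustrative choices:

```python
import numpy as np
import pandas as pd

BIN_SEC = 15 * 60  # 15-minute bins

def build_trmf_matrix(obs: pd.DataFrame):
    """obs columns: src_key, dst_key, rtt_ms, timestamp (Unix seconds).
    Returns (Y, mask, pair_index) with Y of shape (n_pairs, n_timebins)."""
    obs = obs.copy()
    obs["pair"] = list(zip(obs.src_key, obs.dst_key))
    obs["tbin"] = ((obs.timestamp - obs.timestamp.min()) // BIN_SEC).astype(int)
    pair_index = {p: i for i, p in enumerate(obs["pair"].unique())}
    Y = np.zeros((len(pair_index), obs.tbin.max() + 1))
    mask = np.zeros_like(Y)
    # Average the ~3-4 observations that land in each (pair, time-bin) cell
    for (p, t), v in obs.groupby(["pair", "tbin"]).rtt_ms.mean().items():
        Y[pair_index[p], t] = v
        mask[pair_index[p], t] = 1.0
    # Z-score normalize each row over its observed entries only
    n = np.maximum(mask.sum(1), 1)
    mu = (Y * mask).sum(1) / n
    sd = np.sqrt((((Y - mu[:, None]) * mask) ** 2).sum(1) / n)
    Y = np.where(mask > 0, (Y - mu[:, None]) / np.maximum(sd[:, None], 1e-6), 0.0)
    return Y, mask, pair_index
```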
Core algorithm (alternating gradient descent):
Objective:
||mask * (Y - F @ X)||² + λ_f||F||² + λ_x * AR_penalty(X, W) + η||X||² + λ_w||W||² + α * sum_to_one(W)
AR_penalty: Σ_t ||x_t - Σ_l W_l · x_{t-l}||²
Updates per iteration:
F -= lr * grad_F
X -= lr * grad_X (must accumulate AR contributions across all lags — known bug in reference impl; see the sketch after this block)
W -= lr * grad_W
- Lag set for 15-min bins: {1, 2, 4, 96, 672} (15m, 30m, 1h, 1day, 1week)
- K=20, λ_f=1.0, λ_x=100, η=0.5, α=500, lr=1e-4, 10k iters
- Interface:
fit_trmf(measurements_with_timestamps) -> predict(src_key, dst_key, timestamp)
- Cold-start fallback: if pair unseen, defer to static MF prediction
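To make the grad_X lag accumulation concrete, a sketch of one F/X step, assuming elementwise (per-latent-dimension) AR coefficients as in standard TRMF; the W update and the sum-to-one term are omitted for brevity:

```python
import numpy as np

def trmf_fx_step(Y, mask, F, X, W, lags, lam_f=1.0, lam_x=100.0, eta=0.5, lr=1e-4):
    """F: (n_pairs, K), X: (K, T), W: (K, n_lags). One gradient step on F and X."""
    K, T = X.shape
    max_lag = max(lags)
    R = mask * (F @ X - Y)                           # masked reconstruction residual
    grad_F = 2 * R @ X.T + 2 * lam_f * F

    # AR residual e_t = x_t - sum_l W[:, l] * x_{t - lag_l}, defined for t >= max_lag
    E = np.zeros_like(X)
    for t in range(max_lag, T):
        E[:, t] = X[:, t] - sum(W[:, li] * X[:, t - lag] for li, lag in enumerate(lags))

    grad_X = 2 * F.T @ R + 2 * eta * X
    grad_X[:, max_lag:] += 2 * lam_x * E[:, max_lag:]        # d/dx_t of its own residual
    # The part the reference impl drops: x_{t-lag} also appears in every residual e_t,
    # so its gradient must accumulate a -W_l * e_t term for each lag.
    for li, lag in enumerate(lags):
        grad_X[:, max_lag - lag:T - lag] -= 2 * lam_x * W[:, li, None] * E[:, max_lag:]

    return F - lr * grad_F, X - lr * grad_X
```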
Decode TIMESTAMP_ABS/DELTA1/DELTA4 tokens to recover Unix timestamps per measurement.
Currently skipped in baselines.py — falls through to ROLE_BYTE_COUNTS skip. Need:
- current_time_sec: Optional[int] accumulator before the loop
- TIMESTAMP_ABS: struct.unpack(">Q", ...) → 8-byte uint64 (absolute seconds)
- TIMESTAMP_DELTA1: token_to_byte(data[0]) → 1-byte delta added to previous
- TIMESTAMP_DELTA4: struct.unpack(">I", ...) → 4-byte delta added to previous
- No timestamp: leave as None (30% of training data has no timestamps)
- Add "timestamp": current_time_sec to each position dict
Field shuffling caveat: encode_measurement randomizes field order within each
measurement, so the timestamp block may appear before or after RTT_START. Buffer
per-measurement: store timestamp when seen, attach to RTT entry at next
MEASUREMENT_START boundary (reconcile). ~5 extra lines for the buffering.
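A sketch of the decode plus per-measurement buffering; token-type names and the (token_type, payload) block shape are placeholders for whatever token_classify.py / baselines.py actually emit:

```python
import struct
from typing import Optional

def decode_timestamp_block(token_type: str, payload: bytes,
                           current_time_sec: Optional[int]) -> Optional[int]:
    """Decode one timestamp block into absolute Unix seconds."""
    if token_type == "TIMESTAMP_ABS":
        return struct.unpack(">Q", payload)[0]                 # 8-byte uint64, absolute
    if token_type == "TIMESTAMP_DELTA1":                       # 1-byte delta on previous
        return None if current_time_sec is None else current_time_sec + payload[0]
    if token_type == "TIMESTAMP_DELTA4":                       # 4-byte delta on previous
        return None if current_time_sec is None else current_time_sec + struct.unpack(">I", payload)[0]
    return current_time_sec                                    # not a timestamp block

def attach_timestamps(classified_blocks):
    """classified_blocks: iterable of (token_type, payload) in sequence order (placeholder shape).
    Buffer per measurement, since field shuffling means the timestamp block may come
    before or after RTT_START within a measurement."""
    current_time_sec: Optional[int] = None   # accumulator before the loop
    pending_ts: Optional[int] = None         # timestamp seen within the current measurement
    for token_type, payload in classified_blocks:
        if token_type == "MEASUREMENT_START":
            if pending_ts is not None:
                current_time_sec = pending_ts       # reconcile at the boundary
            # here: add "timestamp": current_time_sec to the previous measurement's position dict
            pending_ts = None
        elif token_type.startswith("TIMESTAMP"):
            pending_ts = decode_timestamp_block(token_type, payload, current_time_sec)
```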
def plot_error_cdf(observations_df, model_preds_df, output_dir):
    # Join harness ground truth + baselines with model predictions on (seq_idx, meas_idx)
    df = observations_df.merge(model_preds_df, on=["seq_idx", "meas_idx"])
    fig, ax = plt.subplots()
    for col in [c for c in df.columns if c.endswith("_pred")]:
        # relative_error = |pred - actual| / actual, plotted as an empirical CDF per method
        err = ((df[col] - df["actual_rtt_ms"]).abs() / df["actual_rtt_ms"]).sort_values()
        ax.plot(err.values, np.linspace(0, 1, len(err)), label=col.removesuffix("_pred"))
    ax.set_xlabel("relative error"); ax.set_ylabel("CDF"); ax.legend()
    fig.savefig(f"{output_dir}/cdf_comparison.pdf")  # also: log-scale x-axis variant, percentile table

deep60 architecture: 60L/384E/6H/64HD = ~106M params. Deep-narrow variant at roughly the same param budget as the default 95M (20L/640E/10H/64HD). Defined in scripts/train/slurm_deep60_was.sh.
Both jobs submitted to Unity cluster (same architecture, same data, same steps):
- deep60-60k — baseline, plain cross-entropy (COMPLETED)
- deep60-was-60k — CE + λ₁=0.5 byte1 WAS + λ₂=0.1 byte2 WAS (SUBMITTED, job 56254030)
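For reference, a sketch of one standard way to implement the 1D Wasserstein term over the 256-way byte distribution (L1 distance between CDFs against the one-hot target); whether the training scripts compute it exactly this way is an assumption:

```python
import torch
import torch.nn.functional as F

def wasserstein_1d(byte_logits: torch.Tensor, target_byte: torch.Tensor) -> torch.Tensor:
    """byte_logits: (batch, 256) logits over byte values; target_byte: (batch,) ints.
    W1 between two distributions on an ordered 1D support is the L1 distance between
    their CDFs, so a prediction two byte-values off costs more than one off."""
    probs = byte_logits.softmax(dim=-1)
    target = F.one_hot(target_byte, num_classes=byte_logits.size(-1)).to(probs.dtype)
    return (probs.cumsum(-1) - target.cumsum(-1)).abs().sum(-1).mean()

# Combined objective as described above (λ values from the deep60-was-60k job):
# loss = ce_loss + 0.5 * wasserstein_1d(byte1_logits, byte1_target) \
#                + 0.1 * wasserstein_1d(byte2_logits, byte2_target)
```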
Compare via analysis pipeline:
- Overall CE loss and per-token-type CE
- RTT byte accuracy: byte 1 (exponent) vs byte 2 (mantissa) separately
- RTT prediction MAE/median vs all baselines
- Does Wasserstein improve RTT accuracy without hurting other token types?
Test λ₁ ∈ {0.1, 0.3, 0.5, 1.0} and λ₂ ∈ {0.0, 0.05, 0.1, 0.3} via autoresearch.
Train 680M (24L/1536E/12H/128HD) with winning loss config for 200k steps. Current 680m-200k (plain CE, 131k steps): MAE 48.2ms, log₂ median 0.435.
Not blocking — preemptible single A100 has worked.
- Figure 1: CDF of relative prediction error — all methods on one plot
- Figure 2: Loss breakdown by token type across model sizes
- Figure 3: RTT accuracy vs training steps (learning curve)
- Figure 4: Wasserstein loss effect on RTT byte 1 vs byte 2
- Figure 5: Live ping — model predicted RTT distribution vs actual
- Table 1: Baseline comparison (MAE, median AE, p50/p75/p90 relative error)
- Table 2: Per-token-type accuracy across models (95M, 106M deep60, 680M)
- Table 3: Wasserstein hyperparameter ablation
All generated by analysis.py, output to outputs/figures/ and outputs/tables/,
included by typst report at paper/analysis.typ.
- Eval pipeline skeleton — harness.py, model_eval.py, analysis.py with parquet I/O
- Vivaldi — easy, ~1 hour
- Timestamp extraction — needed by TRMF and transformer w/o timestamps eval
- TRMF — ~1-2 days, bug-fixed gradient computation
- CDF plotting + analysis — ~30 min
- Wire into harness — integrate all baselines into stage 1
- Transformer no-timestamp eval — tokenize test data without timestamps, run model_eval
- Wasserstein A/B analysis (when cluster jobs complete)
- Paper figures and tables via analysis.py
- Write docs/EVAL_GUIDE.md — document the three-stage pipeline, file formats, how to add new baselines/figures, when to re-run each stage