| 1 | +# Development Guide |
| 2 | + |
| 3 | +Notes for contributors extending the exporter (new loaders/targets, schema tweaks, etc.). User-facing usage lives in `README.md`. |
| 4 | + |
| 5 | +## Local environment |
| 6 | +- Install deps with `uv sync --dev`. |
| 7 | +- Run checks with `uv run pre-commit run --all-files`. |
| 8 | +- Add pre-commit to git hooks with `uv run pre-commit install`. |
| 9 | +- Run tests with `uv run pytest -v`. Integration-style tests need: |
| 10 | + - `NEPTUNE2_E2E_API_TOKEN`, `NEPTUNE2_E2E_PROJECT` |
| 11 | + - `NEPTUNE3_E2E_API_TOKEN`, `NEPTUNE3_E2E_PROJECT` |
| 12 | + |
| 13 | +## Code structure (src/neptune_exporter) |
| 14 | +- `main.py`: Click CLI wiring for `export`, `load`, `summary`. |
| 15 | +- `exporters/`: `Neptune2Exporter` (neptune-client) and `Neptune3Exporter` (neptune-query); both yield `pyarrow.RecordBatch` objects matching `model.SCHEMA`. |
| 16 | +- `export_manager.py`: Orchestrates export per project/run, fans out batches per run, and skips runs already on disk. |
| 17 | +- `storage/`: `ParquetWriter` (streaming parts per run, temp file cleanup) and `ParquetReader` (per-project/run streaming, metadata extraction). |
| 18 | +- `loaders/`: Common `DataLoader` interface plus `MLflowLoader` and `WandBLoader` implementations. |
| 19 | +- `loader_manager.py`: Topologically sorts runs (parents before forks), resumes runs if the target already has them, and streams parts to loaders. |
| 20 | +- `summary_manager.py` & `validation/report_formatter.py`: Lightweight data introspection/printing for already-exported parquet. |
| 21 | +- `model.py`: Central PyArrow schema. |
| 22 | +- `utils.py`: Shared helpers (`sanitize_path_part` adds a digest to keep paths safe/unique). |
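The "parents before forks" ordering mentioned for `loader_manager.py` can be illustrated with the standard library's `graphlib`. This is a minimal sketch, not the project's implementation: the `order_runs` helper and the `parent_of` mapping (run id to parent run id, or `None` for a root run) are hypothetical names chosen for the example.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional


def order_runs(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return run ids so that every parent precedes its forks.

    `parent_of` is a hypothetical mapping: run id -> parent run id (or None).
    Parents that are not part of the mapping are ignored.
    """
    graph = {
        run: ({parent} if parent is not None and parent in parent_of else set())
        for run, parent in parent_of.items()
    }
    # static_order() yields each node only after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    print(order_runs({"fork-b": "root", "root": None, "fork-c": "fork-b"}))
```

`TopologicalSorter` raises `graphlib.CycleError` on cyclic input, which should not occur for fork lineages but is worth handling if the metadata can be malformed.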
| 23 | + |
| 24 | +## Data flow overview |
| 25 | +1. Export (primary): exporter → `ExportManager` → `ParquetWriter` (+ file downloads). A run is considered complete when `*_part_0.parquet` exists; runs without it are rewritable. |
| 26 | +2. Summary: `ParquetReader` → `SummaryManager` → `ReportFormatter`. |
| 27 | +3. Load (optional): `ParquetReader` → `LoaderManager` → selected `DataLoader`. |
| 28 | + |
| 29 | +Exports are resumable but not incremental: reruns skip completed runs, so new data added to an already-exported run will be missed unless you re-export to a fresh location. |
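The completion rule above (a run counts as exported once its `*_part_0.parquet` file exists) could be checked roughly as follows. This is a sketch under assumptions: the helper name `is_run_complete` and the idea that a run's parts sit together in one directory are illustrative and are not taken from the exporter's code.

```python
from pathlib import Path


def is_run_complete(run_dir: Path) -> bool:
    """Hypothetical check: True if any `*_part_0.parquet` file exists in run_dir.

    Assumes (for illustration only) that all parts of a run live in one directory.
    """
    return any(run_dir.glob("*_part_0.parquet"))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        run_dir = Path(tmp)
        print(is_run_complete(run_dir))  # no parts written yet
        (run_dir / "RUN-1_part_0.parquet").touch()
        print(is_run_complete(run_dir))  # first part now exists
```

Because the check looks only at file presence, it cannot detect data added to a run after it was exported, which is why reruns skip such runs.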
| 30 | + |
| 31 | +## Adding or changing components |
| 32 | +- **New loader** (e.g., another tracking backend): |
| 33 | + - Implement `DataLoader` methods (`create_experiment`, `find_run`, `create_run`, `upload_run_data`). |
| 34 | + - Handle attribute name sanitization and step conversion internally; `loader_manager` provides `step_multiplier` (keep it consistent when Neptune steps are floats). |
| 35 | + - Extend CLI choices in `main.py` and plumb target-specific options. |
| 36 | +- **Schema changes**: |
| 37 | + - Update `model.SCHEMA`. |
| 38 | + - Ensure exporters populate the new columns and loaders ignore/handle them gracefully. |
| 39 | + - Add coverage in tests and, if necessary, update the parquet reader/writer logic. |
| 40 | +- **Exporter tweaks**: |
| 41 | + - Keep yielding `pyarrow.RecordBatch` objects that match `model.SCHEMA`. |
| 42 | + - Continue batching to avoid large in-memory frames; follow the `download_*` generator pattern. |
| 43 | +- **File handling**: |
| 44 | + - Artifacts are stored under `--files-path/<sanitized_project_id>/...`; keep the relative paths in `file_value.path` stable so loaders can find the payloads. |
| 45 | +- **Forking**: |
| 46 | + - Fork metadata exists only in Neptune 3.x exports. W&B supports forks only in a limited/preview fashion, so avoid relying on strict fidelity. MLflow does not support forking and records the parent as a tag instead. |
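As a rough picture of the new-loader checklist, the toy class below mirrors the four `DataLoader` method names with an in-memory target. Every signature, argument name, and the `to_int_step` helper are assumptions made for illustration; check the real `DataLoader` interface in `loaders/` before copying anything. It does not subclass the real base class so that it stays self-contained.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Optional, Tuple


def to_int_step(step: float, step_multiplier: int) -> int:
    """Hypothetical helper: scale a possibly fractional step to an integer."""
    return int(round(step * step_multiplier))


@dataclass
class InMemoryLoader:
    """Toy target; method names follow the guide, signatures are assumed."""

    experiments: Dict[str, str] = field(default_factory=dict)
    runs: Dict[Tuple[str, str], List[Tuple[str, int, float]]] = field(
        default_factory=dict
    )

    def create_experiment(self, project_id: str, experiment_name: str) -> str:
        experiment_id = f"{project_id}/{experiment_name}"
        self.experiments[experiment_id] = experiment_name
        return experiment_id

    def find_run(self, experiment_id: str, run_name: str) -> Optional[str]:
        # Returning an id for an existing run is what allows a load to resume.
        return run_name if (experiment_id, run_name) in self.runs else None

    def create_run(self, experiment_id: str, run_name: str) -> str:
        self.runs.setdefault((experiment_id, run_name), [])
        return run_name

    def upload_run_data(
        self,
        experiment_id: str,
        run_id: str,
        points: Iterable[Tuple[str, float, float]],
        step_multiplier: int,
    ) -> None:
        target = self.runs[(experiment_id, run_id)]
        for name, step, value in points:
            # Apply the same multiplier to every point of the run.
            target.append((name, to_int_step(step, step_multiplier), value))
```

Applying one multiplier to all points of a run keeps the relative spacing of fractional steps intact after conversion to integers.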
| 47 | + |
| 48 | +## Testing notes |
| 49 | +- Prefer function-style pytest tests (no classes) and `unittest.mock.Mock` for doubles. |
| 50 | +- Look at `tests/test_storage.py` and `tests/test_summary_manager.py` for patterns. |
| 51 | +- When adding loader/exporter behavior, add small, focused tests around boundary cases (empty batches, missing metadata, bad attribute names). |
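A function-style test with a `Mock` double, covering the empty-batch boundary case, might look like the sketch below. `forward_batches` is a hypothetical stand-in for the code under test and is not part of the package; only `unittest.mock.Mock` and its `assert_called_once_with` method are real APIs here.

```python
from unittest.mock import Mock


def forward_batches(batches, loader):
    """Hypothetical code under test: forward only non-empty batches."""
    for batch in batches:
        if len(batch) > 0:
            loader.upload_run_data(batch)


def test_empty_batches_are_skipped():
    loader = Mock()  # stands in for a DataLoader implementation

    forward_batches([[], [1, 2], []], loader)

    # Only the single non-empty batch should reach the loader.
    loader.upload_run_data.assert_called_once_with([1, 2])


def test_no_upload_when_all_batches_are_empty():
    loader = Mock()

    forward_batches([[], []], loader)

    loader.upload_run_data.assert_not_called()
```

Keeping each test to one behavior makes a failure point directly at the boundary case that regressed.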
| 52 | + |
| 53 | +## CI |
| 54 | +GitHub Actions runs linting (ruff, mypy, license headers) and tests on Python 3.13 using uv. The workflow lives in `.github/workflows/ci.yml`. |