Commit d83070c

Documentation improvements (#12)
1 parent 884619b commit d83070c

7 files changed: +200 -427 lines changed

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@ wheels/
 
 # VSCode
 .vscode/
+.idea/
 
 .env
 .neptune/

DEVELOPMENT.md

Lines changed: 54 additions & 0 deletions
@@ -0,0 +1,54 @@
# Development Guide

Notes for contributors extending the exporter (new loaders/targets, schema tweaks, etc.). User-facing usage lives in `README.md`.

## Local environment
- Install dependencies with `uv sync --dev`.
- Run checks with `uv run pre-commit run --all-files`.
- Install the pre-commit git hooks with `uv run pre-commit install`.
- Run tests with `uv run pytest -v`. Integration-style tests need these environment variables:
  - `NEPTUNE2_E2E_API_TOKEN`, `NEPTUNE2_E2E_PROJECT`
  - `NEPTUNE3_E2E_API_TOKEN`, `NEPTUNE3_E2E_PROJECT`

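A quick way to see which of the integration-test variables are still unset before running the suite is sketched below. The helper name `missing_e2e_vars` is hypothetical and not part of the repository; only the four variable names come from this guide.

```python
import os

# Variable names taken from the list above.
REQUIRED_E2E_VARS = (
    "NEPTUNE2_E2E_API_TOKEN",
    "NEPTUNE2_E2E_PROJECT",
    "NEPTUNE3_E2E_API_TOKEN",
    "NEPTUNE3_E2E_PROJECT",
)


def missing_e2e_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_E2E_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_e2e_vars()
    if missing:
        print("Integration tests need: " + ", ".join(missing))
```

Passing a plain dict instead of `os.environ` keeps the helper easy to exercise in a unit test.
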
## Code structure (src/neptune_exporter)
- `main.py`: Click CLI wiring for `export`, `load`, `summary`.
- `exporters/`: `Neptune2Exporter` (neptune-client) and `Neptune3Exporter` (neptune-query); both yield `pyarrow.RecordBatch` objects matching `model.SCHEMA`.
- `export_manager.py`: Orchestrates export per project/run, fans out batches per run, and skips runs already on disk.
- `storage/`: `ParquetWriter` (streaming parts per run, temp file cleanup) and `ParquetReader` (per-project/run streaming, metadata extraction).
- `loaders/`: Common `DataLoader` interface plus `MLflowLoader` and `WandBLoader` implementations.
- `loader_manager.py`: Topologically sorts runs (parents before forks), resumes runs if the target already has them, and streams parts to loaders.
- `summary_manager.py` & `validation/report_formatter.py`: Lightweight data introspection/printing for already-exported parquet.
- `model.py`: Central PyArrow schema.
- `utils.py`: Shared helpers (`sanitize_path_part` adds a digest to keep paths safe/unique).

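The "parents before forks" ordering mentioned for `loader_manager.py` can be illustrated with the standard library's `graphlib`. This is a minimal sketch under assumptions, not the repository's implementation: the function name `order_runs` and the `parent_of` mapping (run id to parent run id, or `None`) are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Optional


def order_runs(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return run ids so that every parent comes before its forks.

    `parent_of` maps a run id to its parent run id (None for root runs).
    Parents that are not part of the export are ignored as dependencies.
    """
    graph = {
        run_id: ({parent} if parent is not None and parent in parent_of else set())
        for run_id, parent in parent_of.items()
    }
    # static_order() yields each node only after all of its predecessors.
    return list(TopologicalSorter(graph).static_order())
```

For a chain such as `{"fork-b": "root", "root": None, "fork-c": "fork-b"}`, the result places `root` first, then `fork-b`, then `fork-c`. `TopologicalSorter` raises `graphlib.CycleError` if the parent links form a cycle.
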
## Data flow overview
1. Export (primary): exporter → `ExportManager` → `ParquetWriter` (+ file downloads). A run is considered complete when `*_part_0.parquet` exists; runs without it are rewritable.
2. Summary: `ParquetReader` → `SummaryManager` → `ReportFormatter`.
3. Load (optional): `ParquetReader` → `LoaderManager` → selected `DataLoader`.

Exports are resumable but not incremental: reruns skip completed runs, so new data added to an already-exported run will be missed unless you re-export to a fresh location.

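The completeness rule from step 1 can be sketched as a small check. The function name `is_run_complete` and the assumption that all parts of a run sit in one directory are illustrative only; the actual on-disk layout is defined by `ParquetWriter`.

```python
from pathlib import Path


def is_run_complete(run_dir: Path) -> bool:
    """Treat a run as complete once a `*_part_0.parquet` file exists.

    Assumes (hypothetically) that all parts of a run live directly in `run_dir`.
    """
    return any(run_dir.glob("*_part_0.parquet"))
```

Because the check looks only at the presence of that file, a rerun cannot tell whether the source run has gained new data since the export, which is why exports are resumable but not incremental.
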
## Adding or changing components
- **New loader** (e.g., another tracking backend):
  - Implement `DataLoader` methods (`create_experiment`, `find_run`, `create_run`, `upload_run_data`).
  - Handle attribute name sanitization and step conversion internally; `loader_manager` provides `step_multiplier` (keep it consistent when Neptune steps are floats).
  - Extend CLI choices in `main.py` and plumb target-specific options.
- **Schema changes**:
  - Update `model.SCHEMA`.
  - Ensure exporters populate the new columns and loaders ignore/handle them gracefully.
  - Add coverage in tests and, if necessary, update the parquet reader/writer logic.
- **Exporter tweaks**:
  - Keep outputs as `pyarrow.RecordBatch` objects matching `model.SCHEMA`.
  - Continue batching to avoid large in-memory frames; follow the `download_*` generator pattern.
- **File handling**:
  - Artifacts are stored under `--files-path/<sanitized_project_id>/...`; keep the relative paths in `file_value.path` stable so loaders can find the payloads.
- **Forking**:
  - Fork metadata exists only in Neptune 3.x exports. W&B supports forks only in a limited/preview fashion, so avoid relying on strict fidelity. MLflow does not support forking and saves parents as tags instead.

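One plausible reading of the step conversion mentioned for new loaders is to scale float steps by `step_multiplier` and round to an integer. The sketch below assumes exactly that; the function name `to_target_step` is hypothetical, and the existing loaders may convert steps differently.

```python
def to_target_step(neptune_step: float, step_multiplier: int) -> int:
    """Scale a (possibly fractional) Neptune step to an integer target step.

    Rounds rather than truncates, because float products such as
    0.29 * 100 land slightly below the intended integer.
    """
    return int(round(neptune_step * step_multiplier))
```

Using the same `step_multiplier` for every series in a run keeps the relative order and spacing of points intact on the target side.
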
## Testing notes
- Prefer function-style pytest tests (no classes) and `unittest.mock.Mock` for doubles.
- Look at `tests/test_storage.py` and `tests/test_summary_manager.py` for patterns.
- When adding loader/exporter behavior, add small, focused tests around boundary cases (empty batches, missing metadata, bad attribute names).

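A function-style test with a `Mock` double might look like the sketch below. The helper `upload_non_empty` is invented purely to have something to test, and the single-argument call to `upload_run_data` is an assumption; the real `DataLoader` signature may differ.

```python
from unittest.mock import Mock


def upload_non_empty(loader, batches):
    """Hypothetical helper: forward only non-empty batches to the loader."""
    sent = 0
    for batch in batches:
        if len(batch) == 0:
            continue  # boundary case: skip empty batches
        loader.upload_run_data(batch)
        sent += 1
    return sent


def test_upload_skips_empty_batches():
    loader = Mock()  # stands in for a DataLoader implementation

    sent = upload_non_empty(loader, [[], [1, 2], []])

    assert sent == 1
    loader.upload_run_data.assert_called_once_with([1, 2])
```

The test is a plain function, so pytest collects it by its `test_` prefix without any class scaffolding.
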
## CI
GitHub Actions runs linting (ruff, mypy, license headers) and tests on Python 3.13 using uv. Workflows live in `.github/workflows/ci.yml`.
