
feat(download): new v2 download service replacing sda-download #2187

Open
jhagberg wants to merge 28 commits into main from feature/sda-download-v2

Conversation

@jhagberg
Contributor

@jhagberg jhagberg commented Jan 16, 2026

Summary

New download service at sda/cmd/download/ replacing sda-download/. Complete implementation with a v2 REST API, GA4GH visa support, keyset pagination, audit logging, and roughly 3.5x the throughput of the old service (see benchmarks below).

Related issues

What's included

Core service

  • Config via internal/config/v2 (pflag/viper framework), 50+ flags
  • gRPC health server for K8s probes + HTTP ready/live endpoints
  • Range header parsing (RFC 9110), If-Range ETag support
  • Crypt4gh header re-encryption via gRPC with lazy connection + TLS
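The Range parsing mentioned above could be sketched roughly like this (names such as `parseRange` and the exact rules are illustrative, not the PR's actual code in streaming/; the real parser covers more of RFC 9110):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Compiled once at package level rather than per call, matching the
// hoisting a later commit in this PR applies to the hot download path.
// Matches a single byte range like "bytes=0-1023"; multi-range requests
// are rejected elsewhere.
var rangeRe = regexp.MustCompile(`^bytes=(\d*)-(\d*)$`)

// parseRange returns the inclusive (start, end) byte offsets for a
// single-range header value against a file of the given size.
func parseRange(h string, size int64) (start, end int64, ok bool) {
	m := rangeRe.FindStringSubmatch(h)
	if m == nil || (m[1] == "" && m[2] == "") {
		return 0, 0, false
	}
	if m[1] == "" { // suffix range "bytes=-500": the last N bytes
		n, _ := strconv.ParseInt(m[2], 10, 64)
		if n > size {
			n = size
		}
		return size - n, size - 1, true
	}
	start, _ = strconv.ParseInt(m[1], 10, 64)
	end = int64(-1)
	if m[2] != "" {
		end, _ = strconv.ParseInt(m[2], 10, 64)
	}
	if end == -1 || end >= size { // open-ended "bytes=100-": to EOF
		end = size - 1
	}
	if start > end || start >= size {
		return 0, 0, false
	}
	return start, end, true
}

func main() {
	s, e, ok := parseRange("bytes=0-1023", 4096)
	fmt.Println(s, e, ok) // 0 1023 true
}
```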

Database

  • PostgreSQL with prepared statements, keyset cursor pagination
  • LATERAL json_agg for checksum aggregation (no row multiplication)
  • Ristretto cache layer (lock-free) for file lookups and permissions
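The keyset cursor works on a composite ordering (a later commit names it as `(submission_file_path, stable_id)`). An in-memory sketch of the comparison the SQL `WHERE (path, id) > ($1, $2)` expresses, with illustrative names:

```go
package main

import (
	"fmt"
	"sort"
)

type fileRow struct {
	Path, StableID string
}

// afterCursor reports whether a row sorts strictly after the cursor under
// the composite ordering the paginated SQL uses:
//   WHERE (submission_file_path, stable_id) > ($1, $2)
//   ORDER BY submission_file_path, stable_id
func afterCursor(r fileRow, path, id string) bool {
	if r.Path != path {
		return r.Path > path
	}
	return r.StableID > id
}

// page returns up to limit rows after the cursor plus the next cursor,
// mimicking what the database does server-side.
func page(rows []fileRow, path, id string, limit int) ([]fileRow, fileRow) {
	sort.Slice(rows, func(i, j int) bool {
		if rows[i].Path != rows[j].Path {
			return rows[i].Path < rows[j].Path
		}
		return rows[i].StableID < rows[j].StableID
	})
	var out []fileRow
	for _, r := range rows {
		if afterCursor(r, path, id) {
			out = append(out, r)
			if len(out) == limit {
				break
			}
		}
	}
	var next fileRow
	if len(out) > 0 {
		next = out[len(out)-1] // last row of the page becomes the cursor
	}
	return out, next
}

func main() {
	rows := []fileRow{{"a/1", "F1"}, {"a/1", "F2"}, {"b/2", "F3"}}
	p, next := page(rows, "", "", 2)
	fmt.Println(len(p), next.StableID) // 2 F2
}
```

Unlike OFFSET pagination, resuming from the last-seen row stays O(page size) regardless of how deep into the listing the client is.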

Authentication

  • Structure-based JWT detection (not just dot-counting)
  • Opaque token support via userinfo endpoint
  • Session cookie cache with TTL bounded by min(token.exp, visa.exp)
  • Permission models: combined, visa-only, ownership

GA4GH visa validation

  • (iss, jku) allowlist enforcement against trusted issuers
  • JWKS caching with per-request fetch limits
  • ControlledAccessGrants validation (by, value, source, asserted)
  • Conditions claim rejection per GA4GH spec (reject if present, not supported)
  • Identity binding modes: broker-bound, strict-sub, strict-iss-sub
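The (iss, jku) allowlist is essentially a pairwise lookup. A minimal sketch (the URLs are illustrative; the real service loads trusted issuers from a JSON trust file):

```go
package main

import "fmt"

// trustedIssuers maps an issuer URL to its registered JWKS (jku) URLs.
var trustedIssuers = map[string]map[string]bool{
	"https://issuer.example": {
		"https://issuer.example/jwks.json": true,
	},
}

// visaAllowed enforces the (iss, jku) pair: a visa is only considered
// when its issuer is trusted AND its jku header points at that issuer's
// registered JWKS endpoint, so a signature from an attacker-controlled
// key set is rejected before any JWKS fetch happens.
func visaAllowed(iss, jku string) bool {
	jkus, ok := trustedIssuers[iss]
	return ok && jkus[jku]
}

func main() {
	fmt.Println(visaAllowed("https://issuer.example", "https://issuer.example/jwks.json")) // true
	fmt.Println(visaAllowed("https://issuer.example", "https://evil.example/jwks.json"))   // false
}
```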

v2 REST API

  • GET /datasets — paginated dataset list with HMAC-signed page tokens
  • GET /datasets/:datasetId — dataset metadata
  • GET /datasets/:datasetId/files — keyset-paginated file list with filePath/pathPrefix filters
  • HEAD/GET /files/:fileId — combined download with re-encrypted header
  • HEAD/GET /files/:fileId/header — re-encrypted header only
  • HEAD/GET /files/:fileId/content — raw archive body (no pubkey needed)
  • GET /service-info — GA4GH service-info
  • RFC 9457 Problem Details for all errors
  • No existence leakage (403 for both "not found" and "no access")
  • UseRawPath=true for URL-encoded slash dataset IDs
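An HMAC-signed page token makes the keyset cursor opaque and tamper-evident. A sketch of the idea (the token layout and names here are illustrative, not the service's actual format):

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// signPageToken wraps a cursor payload with an HMAC-SHA256 tag, both
// base64url-encoded, so clients can hold but not forge cursors.
func signPageToken(secret []byte, cursor string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(cursor))
	return base64.RawURLEncoding.EncodeToString([]byte(cursor)) + "." +
		base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

// verifyPageToken returns the cursor only if the tag matches, rejecting
// forged or truncated tokens with a constant-time comparison.
func verifyPageToken(secret []byte, token string) (string, bool) {
	parts := strings.SplitN(token, ".", 2)
	if len(parts) != 2 {
		return "", false
	}
	cursor, err := base64.RawURLEncoding.DecodeString(parts[0])
	if err != nil {
		return "", false
	}
	want, err := base64.RawURLEncoding.DecodeString(parts[1])
	if err != nil {
		return "", false
	}
	mac := hmac.New(sha256.New, secret)
	mac.Write(cursor)
	if !hmac.Equal(mac.Sum(nil), want) {
		return "", false
	}
	return string(cursor), true
}

func main() {
	secret := []byte("dev-only-secret") // production fails startup if unset
	tok := signPageToken(secret, "a/1|F2")
	got, ok := verifyPageToken(secret, tok)
	fmt.Println(got, ok) // a/1|F2 true
}
```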

Security

  • Production guards: fail startup if allow-all-data, missing HMAC secret, or missing gRPC TLS in production
  • Audit logging: download.completed, download.denied (401/403), download.failed (storage/streaming errors)
  • SameSite=Lax on session cookies
  • Input length cap on filter params (4096 chars)
  • os.OpenRoot scoped file reads in key loading (gosec G122)
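The production guards (a commit below extracts a `validateProductionConfig` for unit testing) amount to a fail-fast check at startup. A sketch with illustrative field names and messages:

```go
package main

import (
	"errors"
	"fmt"
)

// cfg mirrors only the fields the guards inspect; names are illustrative.
type cfg struct {
	Environment  string
	AllowAllData bool
	HMACSecret   string
	GRPCClientCA string
}

// validateProductionConfig refuses to start a production deployment with
// allow-all-data enabled, an empty pagination HMAC secret, or missing
// gRPC client TLS material, as the PR description requires.
func validateProductionConfig(c cfg) error {
	if c.Environment != "production" {
		return nil // guards apply only to production
	}
	switch {
	case c.AllowAllData:
		return errors.New("jwt.allow-all-data must be disabled in production")
	case c.HMACSecret == "":
		return errors.New("pagination.hmac-secret is required in production")
	case c.GRPCClientCA == "":
		return errors.New("gRPC client TLS certs are required in production")
	}
	return nil
}

func main() {
	err := validateProductionConfig(cfg{Environment: "production"})
	fmt.Println(err != nil) // true: HMAC secret and TLS certs are missing
}
```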

Testing

  • Unit tests across all packages (241 tests, 12 packages)
  • 33 integration tests with Docker Compose (postgres, minio, mock OIDC, reencrypt)
  • Capability probes: tests skip on missing prerequisites, hard-fail on regressions
  • Benchmark with paired iterations and validated-payload mode

Benchmark results

Fair comparison with paired iteration ordering (alternating old→new / new→old), session cookies enabled for both services, and query cache disabled on the new service.

endpoint-e2e mode (100 req × 5 iterations × 10 concurrency)

Metric         OLD        NEW         Change
Requests/sec   34.5       120.9       +250%
Mean latency   284 ms     88 ms       -69%
Throughput     501 MB/s   1757 MB/s   +250%
Success rate   100%       100%        unchanged

validated-payload mode (plaintext SHA256 verification)

Both services decrypt to identical plaintext (SHA256 match confirmed), with the new service ~3.8x faster.

Metric         OLD        NEW         Change
Requests/sec   45.5       172.8       +280%
Mean latency   218 ms     59 ms       -73%
Throughput     661 MB/s   2512 MB/s   +280%

How to test

# Unit tests
make test-sda

# Visa-tagged tests
cd sda && go test -tags visas -count=1 ./cmd/download/...

# Integration tests
make integrationtest-sda-cmd-download

# Benchmark (old vs new, endpoint-e2e)
make benchmark-download

# Benchmark (with plaintext validation)
make benchmark-download-validated

@jhagberg jhagberg requested a review from a team as a code owner January 16, 2026 09:19
@jhagberg jhagberg marked this pull request as draft January 16, 2026 09:19
@codecov

codecov bot commented Jan 16, 2026

Codecov Report

❌ Patch coverage is 33.71488% with 1911 lines in your changes missing coverage. Please review.
✅ Project coverage is 40.14%. Comparing base (82c83bb) to head (a10757f).
⚠️ Report is 35 commits behind head on main.

Files with missing lines Patch % Lines
sda/cmd/download/benchmark/benchmark.go 3.77% 662 Missing ⚠️
sda/cmd/download/middleware/auth.go 18.31% 324 Missing and 6 partials ⚠️
sda/cmd/download/visa/validator.go 0.00% 233 Missing ⚠️
sda/cmd/download/handlers/files.go 35.76% 164 Missing and 3 partials ⚠️
sda/cmd/download/database/database.go 68.04% 80 Missing and 13 partials ⚠️
sda/cmd/download/visa/userinfo.go 0.00% 81 Missing ⚠️
sda/cmd/download/reencrypt/reencrypt.go 20.40% 77 Missing and 1 partial ⚠️
sda/cmd/download/visa/jwks_cache.go 0.00% 59 Missing ⚠️
sda/cmd/download/streaming/streaming.go 74.82% 21 Missing and 15 partials ⚠️
sda/cmd/download/database/cache.go 78.66% 24 Missing and 8 partials ⚠️
... and 10 more
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2187      +/-   ##
==========================================
- Coverage   40.92%   40.14%   -0.79%     
==========================================
  Files          96      118      +22     
  Lines        9925    12535    +2610     
==========================================
+ Hits         4062     5032     +970     
- Misses       5264     6837    +1573     
- Partials      599      666      +67     
Flag Coverage Δ
unittests 40.14% <33.71%> (-0.79%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
sda/cmd/download/audit/audit.go 100.00% <100.00%> (ø)
sda/cmd/download/handlers/problem.go 100.00% <100.00%> (ø)
sda/cmd/download/handlers/publickey.go 100.00% <100.00%> (ø)
sda/cmd/download/handlers/serviceinfo.go 100.00% <100.00%> (ø)
sda/cmd/download/streaming/ifrange.go 100.00% <100.00%> (ø)
sda/cmd/download/audit/stdout_logger.go 85.71% <85.71%> (ø)
sda/cmd/download/handlers/pagination.go 92.15% <92.15%> (ø)
sda/cmd/reencrypt/reencrypt.go 19.01% <44.44%> (+0.76%) ⬆️
sda/cmd/download/handlers/options.go 57.14% <57.14%> (ø)
sda/cmd/download/health/health.go 40.00% <40.00%> (ø)
... and 15 more

... and 1 file with indirect coverage changes


@kostas-kou
Contributor

I know that it is not in swagger, but in my opinion we should probably add an endpoint for downloading a whole dataset. As far as I remember, the only way to download a whole dataset today is via the sda-cli tool. What do you think?

@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch from fbf7ec2 to 633bc20 Compare January 19, 2026 07:56
@jbygdell jbygdell changed the base branch from main to feature/multiple-backends-wip January 19, 2026 07:59
@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch 4 times, most recently from 766b982 to 1ea1b63 Compare January 19, 2026 16:37
@pontus
Contributor

pontus commented Jan 20, 2026

Would it be possible to get built-in caching of database query responses already at this stage? There are several good options that make this fairly easy (the old sda-download used ristretto for a session cache, and it's also used in sdafs to handle data caching).

(For the streaming use case, even if the query and its response are cached in the database, having to do several round trips for each fetch is not nice, even ignoring the unneeded load it causes.)

@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch 2 times, most recently from 63a8660 to 4ebb2be Compare January 21, 2026 11:00
@jhagberg
Contributor Author

@pontus Yes, I made a suggestion in 4ebb2be

@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch from 4ebb2be to d1a1a68 Compare January 21, 2026 11:05
@pontus
Contributor

pontus commented Jan 21, 2026

@pontus Yes I made a suggestion in 4ebb2be

Thanks, looks great.

@KarlG-nbis KarlG-nbis force-pushed the feature/multiple-backends-wip branch from dccff14 to 63b5ef8 Compare January 23, 2026 08:42
@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch from d2599df to c1f426b Compare January 23, 2026 09:07
@KarlG-nbis KarlG-nbis force-pushed the feature/multiple-backends-wip branch 2 times, most recently from cd20c7a to d1d8f7d Compare January 23, 2026 11:58
Base automatically changed from feature/multiple-backends-wip to main January 23, 2026 12:19
@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch from c1f426b to 181ff87 Compare January 23, 2026 12:39
@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch 2 times, most recently from 55ab531 to 51ebac3 Compare February 23, 2026 10:27
@jhagberg jhagberg changed the title from Feature/sda download v2 WIP to feat(download): new v2 download service replacing sda-download Feb 23, 2026
@jhagberg jhagberg requested a review from pontus February 23, 2026 10:30
jhagberg added 21 commits March 13, 2026 13:31
…eencrypt

New download service at sda/cmd/download/ replacing sda-download/.

Core components:
- main.go: entry point with production safety guards, TLS, graceful shutdown
- config/: 50+ config flags via internal/config/v2 (pflag/viper framework)
- health/: gRPC health server for K8s probes + HTTP ready/live endpoints
- streaming/: Range header parsing (RFC 9110), If-Range ETag support,
  combined header+body and body-only streaming with seek-based positioning
- reencrypt/: gRPC client for crypt4gh header re-encryption with lazy
  connection and TLS support
- internal/config/v2/: shared config registration framework
- swagger_v2.yml: OpenAPI spec for the v2 REST API
Database components:
- database.go: PostgreSQL interface with prepared statements, keyset
  cursor pagination via (submission_file_path, stable_id) composite
  cursor, LATERAL json_agg for checksum aggregation (prevents row
  multiplication), LIKE prefix escaping, and CheckDatasetExists for
  no-existence-leakage pattern
- cache.go: Ristretto-based cache wrapper (lock-free) for file lookups,
  permission checks, and dataset queries. Paginated queries bypass
  cache due to cursor variability.
Authentication middleware:
- Structure-based JWT detection via looksLikeJWT() (3 dot-segments
  with base64url JSON header+payload)
- JWT path: validate locally via loaded keyset, optional issuer match
- Opaque path: call UserinfoClient.FetchUserinfo for subject resolution
- Session cookie cache (sda_session + legacy sda_session_key)
- Token-keyed cache (sha256(token)) with TTL bounded by
  min(token.exp, min(visa.exp), configTTL)
- Permission model: combined/visa/ownership dataset population
- SameSite=Lax on session cookies
- Audit denial events (download.denied) on all 401 paths
GA4GH visa support:
- validator.go: GetVisaDatasets() extracts datasets from visa JWTs,
  enforces (iss, jku) allowlist, verifies signatures via cached JWKS,
  validates ControlledAccessGrants (by, value, source, conditions,
  asserted), supports broker-bound/strict-sub/strict-iss-sub identity
  binding modes, detects multi-identity scenarios
- trust.go: LoadTrustedIssuers from JSON with conditional HTTPS
  enforcement for JKU URLs
- userinfo.go: UserinfoClient with HTTP cache, io.LimitReader safety,
  GA4GH passport v1 claim extraction
- jwks_cache.go: JWK cache with per-request fetch limits and
  (iss, jku) allowlist enforcement
- types.go: Identity, TrustedIssuer, VisaClaim, UserinfoResponse
- Pre-validation limits: max-visas (200), max-visa-size (16KB),
  max-jwks-per-request (10)
…-info

REST API endpoints:
- GET /datasets: paginated dataset list with HMAC-signed page tokens
- GET /datasets/:datasetId: dataset metadata (date, files, size)
- GET /datasets/:datasetId/files: keyset-paginated file list with
  filePath/pathPrefix filters (mutually exclusive, 4096 char limit)
- HEAD/GET /files/:fileId: combined download with re-encrypted header
- HEAD/GET /files/:fileId/header: re-encrypted header only
- HEAD/GET /files/:fileId/content: raw archive body (no pubkey needed)
- GET /service-info: GA4GH service-info metadata
- GET /health/ready, /health/live: health probes

Cross-cutting:
- RFC 9457 Problem Details for all errors (application/problem+json)
- No existence leakage: 403 for both "not found" and "no access"
- Content-Disposition with .c4gh extension
- Cache-Control headers on all data endpoints
- UseRawPath=true for URL-encoded slash dataset IDs
- Audit denied/failed events on 403 and server errors
- Correlation ID middleware (X-Correlation-ID)
Audit logging:
- audit.Logger interface with StdoutLogger (JSON lines) and NoopLogger
- download.complete/content/header events on successful operations
- download.denied events on 401/403 (middleware + handlers)
- download.failed events on storage/streaming errors with ErrorReason
- Correlation ID propagated to all audit events

Production guards (app.environment=production):
- Fail startup if jwt.allow-all-data is enabled
- Fail startup if pagination.hmac-secret is empty
- Fail startup if gRPC client TLS certs are missing
- validateProductionConfig extracted for unit testing
Docker Compose integration test environment:
- postgres, minio (S3), mock-aai, mockoidc (OIDC + visa JWTs),
  reencrypt (gRPC), download service under test
- database_seed with test dataset + file
- make_download_credentials.sh: RSA keypair, JWT token, trusted issuers
- mockoidc.py: OIDC discovery, JWKS, userinfo with visa datasets

33 integration tests covering:
- Health, auth (JWT + opaque), session cookies, service-info
- Dataset listing, file listing, pagination, invalid pageSize/pageToken
- Encoded-slash dataset IDs (UseRawPath verification)
- Range requests, multi-range rejection, If-Range ETag contract
- Content-Disposition, pathPrefix filter with SQL wildcard escaping
- Expired token rejection, long-transfer resume scenario
- Problem Details format, access control (no existence leakage)

Environment capability probes (SetupSuite):
- REQUIRES_REENCRYPT, REQUIRES_STORAGE_FILE, REQUIRES_SESSION_CACHE
- Tests skip on missing prerequisites, hard-fail on regressions
Benchmark infrastructure:
- benchmark.go: concurrent load tester comparing old (sda-download)
  vs new (sda/cmd/download) services with auto-discovery, JWT auth,
  configurable iterations/concurrency, percentile stats (p50/p95/p99)
- sda-benchmark.yml: Docker Compose with both services, shared
  pipeline, and benchmark runner
- seed_benchmark_data.sh: uploads crypt4gh files through real ingest
  pipeline for realistic test data
- Makefile targets: benchmark-download-{up,seed,run,down}

Result: NEW service is ~255% faster than OLD (67 vs 19 req/s).
…rflow hint

- 55_download_test.sh: update from v1 API paths (/info/datasets,
  /file/{id}, public_key header) to v2 (/datasets, /files/{id},
  X-C4GH-Public-Key). Fixes CI failure in sda (s3) integration job.
- middleware/auth.go: use len(a) instead of len(a)+len(b) as map
  capacity hint in mergeDatasets to silence CodeQL overflow warning.
- benchmark.go: fmt.Errorf → errors.New where no format verbs,
  nlreturn blank lines, ifElseChain → switch, nolint for gosec
  G402 (TLS skip in benchmark tool) and revive deep-exit
- handlers/files.go: remove duplicate ArchivePath check
- middleware/auth.go: extract tokenExpiry() to reduce nestif
- visa/: nolint:gosec for SSRF (URLs validated against allowlist)
Rewrites the service documentation to match the actual v2 API routes,
authentication model, GA4GH visa support, and full configuration
reference. The previous version documented draft/old routes that did
not match the implementation.
Update curl examples to use v2 API routes (/datasets, /files) instead
of old /info/ and /file/ routes. Fix public key header name from
public_key to X-C4GH-Public-Key.
…ep with health poll

- Update download-config-old.yaml to use storage v2 config format
  (sda-download on main migrated to storage/v2)
- Replace sleep 60 with health endpoint polling for both services
- Keep shared volume between runs to avoid re-downloading binaries
… expiry

- Create standalone userinfo client when auth.allow-opaque=true but
  visa.enabled=false, so opaque tokens work without visa support
- Return config file read/parse errors when --config-file is
  explicitly set, instead of silently falling through to defaults
- Preserve visa expiry on validation-cache hits so session TTL
  remains bounded by the visa exp claim
Move the audit logger selection to main.go: StdoutLogger when audit is
required, NoopLogger otherwise. This removes the IsNoop type-check
method from the interface and clarifies that NoopLogger is the
"logging disabled" implementation, not a test helper.
- Replace sort.Slice/sort.Strings with slices.SortFunc/slices.Sort
- Replace wg.Add/go/Done pattern with wg.Go
- Use os.OpenRoot to scope file reads in filepath.Walk (gosec G122)
- Extract helpers to reduce nesting complexity (nestif)
Check that len(a)+len(b) won't overflow before using it as a map
capacity hint. Falls back to zero hint (runtime-grown) otherwise.
path.Clean("/") returns "/" not ".", so https://issuer.example/ did
not match https://issuer.example. Treat u.Path == "/" the same as "."
by clearing it to empty string. Adds test coverage.
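The normalization described in this commit might look roughly like this (a sketch; `normalizeIssuer` is an illustrative name, not necessarily the function in the PR):

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// normalizeIssuer canonicalizes an issuer URL for allowlist comparison.
// path.Clean("/") returns "/", not ".", so a bare trailing slash used to
// survive and "https://issuer.example/" failed to match
// "https://issuer.example". Treat "/" the same as "." by clearing it.
func normalizeIssuer(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Path = path.Clean(u.Path)
	if u.Path == "." || u.Path == "/" {
		u.Path = ""
	}
	return u.String(), nil
}

func main() {
	a, _ := normalizeIssuer("https://issuer.example/")
	b, _ := normalizeIssuer("https://issuer.example")
	fmt.Println(a == b) // true
}
```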
Move regexp.MustCompile from per-call in ParseRangeHeader to a
package-level var, avoiding repeated compilation on the hot download
path.
- Use filepath.Clean instead of TrimLeft+path.Join to preserve
  absolute config paths like /etc/sda
- Surface ReadInConfig errors (only ignore ConfigFileNotFoundError)
- Replace BindPFlags panic with returned error
- CONFIGFILE -> CONFIG_FILE in TESTING.md and dev_config.yaml
- Go 1.20+ -> Go 1.25+ to match go.mod
- Update stale config error troubleshooting text
- Document build-all prerequisite for integration test target
@jhagberg jhagberg force-pushed the feature/sda-download-v2 branch from 259d2d4 to 1798ca6 Compare March 13, 2026 12:32
func mergeDatasets(a, b []string) []string {
	hint := 0
	if len(a) <= math.MaxInt-len(b) {
		hint = len(a) + len(b)
Check failure

Code scanning / CodeQL

Size computation for allocation may overflow High

This operation, which is used in an allocation, involves a potentially large value and might overflow.

Copilot Autofix

AI about 22 hours ago

General approach: Avoid performing arithmetic (math.MaxInt - len(b)) that can itself overflow when working with values derived from untrusted or potentially large data. For combining slice lengths, we can instead guard the addition by comparing each operand against math.MaxInt/2 (or similar) so that their sum cannot exceed math.MaxInt without ever performing a risky subtraction.

Best concrete fix: Change the mergeDatasets hint calculation to use a safe check that does not subtract len(b) from math.MaxInt. For example:

hint := 0
if len(a) <= math.MaxInt/2 && len(b) <= math.MaxInt/2 {
    hint = len(a) + len(b)
}

This guarantees len(a) + len(b) <= math.MaxInt without any intermediate overflow: both lengths are individually bounded by half the maximum integer, so their sum cannot exceed the maximum. If either is larger, hint remains 0, which simply falls back to the default map growth strategy; no functional change in correctness, just losing the optimization in extreme edge cases.

Files/regions to change: Only sda/cmd/download/middleware/auth.go, inside mergeDatasets (lines 698–702 in your snippet). No new imports or types are needed; math is already imported and still used.


Suggested changeset 1
sda/cmd/download/middleware/auth.go

Autofix patch
Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/sda/cmd/download/middleware/auth.go b/sda/cmd/download/middleware/auth.go
--- a/sda/cmd/download/middleware/auth.go
+++ b/sda/cmd/download/middleware/auth.go
@@ -697,7 +697,9 @@
 // mergeDatasets returns a deduplicated union of two string slices.
 func mergeDatasets(a, b []string) []string {
 	hint := 0
-	if len(a) <= math.MaxInt-len(b) {
+	// Avoid potential overflow in size computation by ensuring both lengths
+	// are small enough that their sum cannot exceed math.MaxInt.
+	if len(a) <= math.MaxInt/2 && len(b) <= math.MaxInt/2 {
 		hint = len(a) + len(b)
 	}
 
EOF
…main

- Disable query cache in benchmark config so the new service doesn't
  get an unfair caching advantage over the old service
- Fix old service session domain from "localhost" to "download-old"
  so cookies are actually sent back in the Docker network
Wait for RabbitMQ verified queue to have messages before consuming,
preventing a race where the DB checksums appear before the verify
worker has pushed messages.
… mode

- Alternate old→new / new→old order between iterations to cancel
  warm-up bias
- Add cookie jar to HTTP client so session cookies work properly
- Add validated-payload mode that downloads from both endpoints,
  decrypts with crypt4gh, and compares plaintext SHA256
- Extract buildRequest helper shared between benchmark and validation
- Add percentChange helper to handle division-by-zero safely
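The percentChange helper mentioned above might look like this (a sketch; the actual signature and zero-baseline behavior in benchmark.go may differ):

```go
package main

import "fmt"

// percentChange reports the relative change from oldVal to newVal as a
// percentage, guarding the oldVal == 0 case instead of dividing by zero.
func percentChange(oldVal, newVal float64) float64 {
	if oldVal == 0 {
		return 0 // no meaningful baseline; the real helper may report this differently
	}
	return (newVal - oldVal) / oldVal * 100
}

func main() {
	// Using the endpoint-e2e requests/sec figures from the PR summary.
	fmt.Println(percentChange(34.5, 120.9)) // prints a value near 250
}
```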
…ed target

- Mount full sda/ module instead of benchmark dir alone so go.mod
  is available for crypt4gh dependency resolution
- Bump benchmark image to golang:1.25-alpine to match go.mod
- Add BENCHMARK_MODE and VERIFY_PRIVATE_KEY env vars
- Add benchmark-download-validated Makefile target
Tests for parseBenchmarkMode, pairedIterationTargets,
comparePayloadDigests, and percentChange.
- Replace int(a-b) duration sort with cmp.Compare to prevent overflow
  on 32-bit platforms
- Remove unused makeRequest function
- Remove redundant idx := i and tt := tt copies (Go 1.25 has
  per-iteration loop variables)
The handler emitted "download.complete" but the audit package
documents and tests "download.completed". Align with the documented
contract.


Development

Successfully merging this pull request may close these issues.

[download] Implement v2 download API endpoints (US1-US7, US10, US13-US15)

5 participants