Feature
A new storage backend that supports APIs where files are addressed by opaque
IDs rather than direct paths. The backend performs a resolution step —
translating the htsget request path into a file ID — before constructing
ticket URLs.
Motivation
htsget and DRS (Data Repository Service)
are both GA4GH standards, but htsget-rs currently has no way to serve data
from DRS-style repositories where file access requires an ID lookup.
The existing backends all assume a direct mapping from the htsget request ID
to a storage location:
- FileStorage: ID maps to filesystem path
- S3Storage: ID maps to S3 key
- UrlStorage: ID maps to {base_url}/{key}
This works when the data backend uses the same addressing scheme as htsget.
But a growing number of genomic data repositories use ID-based APIs where the
relationship between a human-readable path and the download endpoint requires
a lookup step:
htsget request: GET /reads/{dataset}/{filepath}
↓
resolution step: dataset + filepath → fileId (via API call)
↓
ticket URLs: GET /files/{fileId}/content (with Range header)
UrlStorage cannot do this — it constructs URLs by concatenating a base URL
with the key, with no intermediate resolution.
Note: the ID-based data endpoint (/files/{fileId}/content) already provides
streaming and Range support. The missing piece is the {dataset}/{filepath}
→ fileId resolution step before URL construction.
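The flow above can be sketched roughly as follows. This is only an illustration under stated assumptions, not existing htsget-rs code: the `FileIdResolver` trait, the `StaticResolver` stand-in, and the `split_key` and `ticket_url` helpers are hypothetical names, and the in-memory map takes the place of the real API call.

```rust
use std::collections::HashMap;

/// Hypothetical interface: maps (dataset, filepath) to an opaque file ID.
/// A real backend would do this with an HTTP call to the repository API.
pub trait FileIdResolver {
    fn resolve(&self, dataset: &str, filepath: &str) -> Option<String>;
}

/// In-memory stand-in for the API lookup, used only for illustration.
pub struct StaticResolver {
    pub ids: HashMap<(String, String), String>,
}

impl FileIdResolver for StaticResolver {
    fn resolve(&self, dataset: &str, filepath: &str) -> Option<String> {
        self.ids
            .get(&(dataset.to_string(), filepath.to_string()))
            .cloned()
    }
}

/// Split an htsget key of the form `{dataset}/{filepath}` at the first slash.
/// Returns `None` if either part is missing.
pub fn split_key(key: &str) -> Option<(&str, &str)> {
    let (dataset, filepath) = key.split_once('/')?;
    if dataset.is_empty() || filepath.is_empty() {
        return None;
    }
    Some((dataset, filepath))
}

/// Resolve the key to a file ID, then build the ticket URL from it.
pub fn ticket_url(
    resolver: &dyn FileIdResolver,
    response_url: &str,
    key: &str,
) -> Option<String> {
    let (dataset, filepath) = split_key(key)?;
    let file_id = resolver.resolve(dataset, filepath)?;
    Some(format!(
        "{}/files/{}/content",
        response_url.trim_end_matches('/'),
        file_id
    ))
}
```

Keeping the lookup behind a trait would let the same URL-construction code run against a test double or a real HTTP client; whether that matches htsget-rs's existing storage abstractions is an open question.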
Concrete use case: The NeIC Sensitive Data Archive
(SDA) is a federated genomic archive used by Nordic research institutions.
We are building a new download API (v2 spec)
with a DRS-inspired design — ID-based file access, split header/content
endpoints, and GA4GH service-info. The API is not a standalone DRS service
today but is designed to be easily separable into one in the future.
The v2 API endpoints relevant for htsget:
- GET /datasets/{datasetId}/files → list files (returns fileId per file)
- GET /files/{fileId}/content → encrypted data segments (Range-capable)
- GET /files/{fileId}/header → Crypt4GH header
htsget-rs is already used with SDA via UrlStorage pointed at an internal
path-based endpoint (/s3-encrypted/{dataset}/{filepath}). This path-based
endpoint is being retired in the v2 API. An ID-resolving backend would let
htsget-rs work with the new API without maintaining a legacy endpoint.
Proposed approach
A new backend (working name: DrsStorage or ResolverStorage) that:
- Resolves the key to a file ID via a configurable API call (e.g. GET /datasets/{datasetId}/files?filePath={filepath} → returns fileId)
- Caches the resolution (genomic archives typically treat files as immutable)
- Constructs ticket URLs using the resolved ID (e.g. {response_url}/files/{fileId}/content)
- Implements get()/head() by proxying to the ID-based data endpoint
It would reuse the existing response_url pattern and the forward_headers
mechanism from UrlStorage (which forwards request headers to the backend and
ticket fetch paths, as configured), and could be feature-gated like the S3
and URL backends.
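As a starting point for discussing the caching layer, here is a minimal sketch that assumes resolved IDs never change. `CachingResolver` and its `lookup` closure are hypothetical names; the closure stands in for the HTTP resolution call. Only successful lookups are cached, so a file added later is picked up on the next request.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Wraps a lookup function (hypothetical stand-in for the API call) and
/// remembers every successful resolution. Failed lookups are not cached.
pub struct CachingResolver<F>
where
    F: Fn(&str, &str) -> Option<String>,
{
    lookup: F,
    cache: Mutex<HashMap<(String, String), String>>,
}

impl<F> CachingResolver<F>
where
    F: Fn(&str, &str) -> Option<String>,
{
    pub fn new(lookup: F) -> Self {
        Self {
            lookup,
            cache: Mutex::new(HashMap::new()),
        }
    }

    pub fn resolve(&self, dataset: &str, filepath: &str) -> Option<String> {
        let key = (dataset.to_string(), filepath.to_string());

        // Check the cache first; the lock is released at the end of this statement.
        let cached = self.cache.lock().unwrap().get(&key).cloned();
        if let Some(id) = cached {
            return Some(id);
        }

        // Cache miss: perform the lookup without holding the lock.
        let id = (self.lookup)(dataset, filepath)?;
        self.cache.lock().unwrap().insert(key, id.clone());
        Some(id)
    }
}
```

This version is unbounded and has no expiry. A real implementation would likely need a size limit or TTL, and would have to decide whether cache entries are scoped per user when access permissions differ.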
Example config
[[locations]]
regex = "^(?P<dataset>[^/]+)/(?P<filepath>.+)$"
substitution_string = "$dataset/$filepath"
backend.kind = "Drs"
backend.api_url = "http://download-internal:8080"
backend.response_url = "https://download.example.org"
backend.resolve_endpoint = "/datasets/{dataset}/files?filePath={filepath}"
backend.content_endpoint = "/files/{fileId}/content"
backend.forward_headers = true
backend.header_blacklist = ["Host"]
Alternatives considered
- Keep a path-based endpoint in the data repository: works, but forces repositories to maintain legacy endpoints just for htsget compatibility
- External sidecar that pre-resolves paths: operationally complex, and a static mapping breaks when files are added
- Regex/substitution in UrlStorage: cannot do HTTP lookups, only string transformations
Scope
Happy to contribute an implementation if there's interest. The SDA team has
experience with htsget-rs (we maintain a deployment using UrlStorage + C4GH)
and can provide integration testing against a real archive.
Would like to hear your thoughts on:
- Whether this fits htsget-rs scope or is better as an external plugin
- Naming: DrsStorage vs ResolverStorage vs something else
- Any architectural preferences for the resolution/caching layer