Feature: Storage backend for ID-based APIs (DRS-style resolution) #356

@jhagberg

Feature

A new storage backend that supports APIs where files are addressed by opaque
IDs rather than direct paths. The backend performs a resolution step —
translating the htsget request path into a file ID — before constructing
ticket URLs.

Motivation

htsget and DRS (Data Repository Service)
are both GA4GH standards, but htsget-rs currently has no way to serve data
from DRS-style repositories where file access requires an ID lookup.

The existing backends all assume a direct mapping from the htsget request ID
to a storage location:

  • FileStorage — ID maps to filesystem path
  • S3Storage — ID maps to S3 key
  • UrlStorage — ID maps to {base_url}/{key}

This works when the data backend uses the same addressing scheme as htsget.
But a growing number of genomic data repositories use ID-based APIs where the
relationship between a human-readable path and the download endpoint requires
a lookup step:

htsget request:    GET /reads/{dataset}/{filepath}
                         ↓
resolution step:   dataset + filepath  →  fileId     (via API call)
                         ↓
ticket URLs:       GET /files/{fileId}/content  (with Range header)

UrlStorage cannot do this — it constructs URLs by concatenating a base URL
with the key, with no intermediate resolution.

Note: the ID-based data endpoint (/files/{fileId}/content) already provides
streaming and Range support. The missing piece is the {dataset}/{filepath} →
fileId resolution step before URL construction.
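
To make the contrast with UrlStorage concrete, here is a minimal Rust sketch of the two behaviours. It is not htsget-rs code: `split_key`, `url_storage_url`, and `resolved_ticket_url` are hypothetical names, and the `lookup` closure stands in for the API call that maps a dataset and file path to a fileId.

```rust
/// Splits an htsget key of the form `{dataset}/{filepath}` at the first slash.
/// Returns `None` when either part is missing.
fn split_key(key: &str) -> Option<(&str, &str)> {
    let (dataset, filepath) = key.split_once('/')?;
    if dataset.is_empty() || filepath.is_empty() {
        return None;
    }
    Some((dataset, filepath))
}

/// Path-based behaviour: the URL is the base URL concatenated with the key.
fn url_storage_url(base_url: &str, key: &str) -> String {
    format!("{}/{}", base_url.trim_end_matches('/'), key)
}

/// ID-based behaviour: a lookup sits between the key and the URL.
/// `lookup` is a stand-in (assumption) for the remote resolution call.
fn resolved_ticket_url<F>(response_url: &str, key: &str, lookup: F) -> Option<String>
where
    F: Fn(&str, &str) -> Option<String>,
{
    let (dataset, filepath) = split_key(key)?;
    let file_id = lookup(dataset, filepath)?;
    Some(format!(
        "{}/files/{}/content",
        response_url.trim_end_matches('/'),
        file_id
    ))
}
```

The second function cannot be expressed as a string transformation of the key alone, which is why a regex or substitution rule in UrlStorage is not enough.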

Concrete use case: The NeIC Sensitive Data Archive
(SDA) is a federated genomic archive used by Nordic research institutions.
We are building a new download API (v2 spec)
with a DRS-inspired design — ID-based file access, split header/content
endpoints, and GA4GH service-info. The API is not a standalone DRS service
today but is designed to be easily separable into one in the future.

The v2 API endpoints relevant for htsget:

  • GET /datasets/{datasetId}/files → list files (returns fileId per file)
  • GET /files/{fileId}/content → encrypted data segments (Range-capable)
  • GET /files/{fileId}/header → Crypt4GH header

htsget-rs is already used with SDA via UrlStorage pointed at an internal
path-based endpoint (/s3-encrypted/{dataset}/{filepath}). This path-based
endpoint is being retired in the v2 API. An ID-resolving backend would let
htsget-rs work with the new API without maintaining a legacy endpoint.

Proposed approach

A new backend (working name: DrsStorage or ResolverStorage) that:

  1. Resolves the key to a file ID via a configurable API call
    (e.g. GET /datasets/{datasetId}/files?filePath={filepath} → returns fileId)
  2. Caches the resolution (genomic archives typically treat files as immutable)
  3. Constructs ticket URLs using the resolved ID
    (e.g. {response_url}/files/{fileId}/content)
  4. Implements get()/head() by proxying to the ID-based data endpoint

It would reuse the existing response_url pattern and forward_headers
mechanism (forwarding request headers to the backend and ticket fetch paths,
as configured) from UrlStorage, and could be feature-gated like the S3 and URL
backends.
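
One possible shape for the resolution and caching steps is sketched below. Everything here is an assumption for discussion: the `Resolver` trait, `CachingResolver`, and `StaticResolver` are hypothetical names, the code is synchronous for brevity (a real backend would presumably be async and need a cache that is safe to share across requests), and `StaticResolver` is an in-memory stand-in for the HTTP call.

```rust
use std::collections::HashMap;

/// Hypothetical abstraction over the (dataset, filepath) -> fileId lookup.
trait Resolver {
    fn resolve(&mut self, dataset: &str, filepath: &str) -> Option<String>;
}

/// Wraps any resolver and remembers successful lookups.
struct CachingResolver<R: Resolver> {
    inner: R,
    cache: HashMap<(String, String), String>,
}

impl<R: Resolver> CachingResolver<R> {
    fn new(inner: R) -> Self {
        Self { inner, cache: HashMap::new() }
    }
}

impl<R: Resolver> Resolver for CachingResolver<R> {
    fn resolve(&mut self, dataset: &str, filepath: &str) -> Option<String> {
        let key = (dataset.to_string(), filepath.to_string());
        if let Some(id) = self.cache.get(&key) {
            return Some(id.clone());
        }
        // Only successful resolutions are cached, so a file that is added
        // to the archive later can still be found on a subsequent request.
        let id = self.inner.resolve(dataset, filepath)?;
        self.cache.insert(key, id.clone());
        Some(id)
    }
}

/// In-memory stand-in for the remote API; counts how often it is queried.
struct StaticResolver {
    files: HashMap<(String, String), String>,
    calls: usize,
}

impl Resolver for StaticResolver {
    fn resolve(&mut self, dataset: &str, filepath: &str) -> Option<String> {
        self.calls += 1;
        self.files
            .get(&(dataset.to_string(), filepath.to_string()))
            .cloned()
    }
}
```

Caching only positive results relies on the immutability assumption above: a resolved fileId is taken to stay valid, while a miss is retried on the next request. Whether entries should also expire is an open design question.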

Example config

[[locations]]
regex = "^(?P<dataset>[^/]+)/(?P<filepath>.+)$"
substitution_string = "$dataset/$filepath"

backend.kind = "Drs"
backend.api_url = "http://download-internal:8080"
backend.response_url = "https://download.example.org"
backend.resolve_endpoint = "/datasets/{dataset}/files?filePath={filepath}"
backend.content_endpoint = "/files/{fileId}/content"
backend.forward_headers = true
backend.header_blacklist = ["Host"]
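
As a rough illustration of how the `resolve_endpoint` and `content_endpoint` templates from this config might be expanded, here is a small Rust sketch. The helper names (`fill_template`, `resolve_url`, `content_url`) are hypothetical, and the sketch deliberately skips percent-encoding of the substituted values, which a real implementation would likely need for the `filePath` query parameter.

```rust
/// Replaces each `{name}` placeholder in `template` with its value.
/// Placeholders without a matching entry are left untouched.
fn fill_template(template: &str, values: &[(&str, &str)]) -> String {
    let mut out = template.to_string();
    for (name, value) in values {
        // format!("{{{}}}", name) yields the literal text `{name}`.
        out = out.replace(&format!("{{{}}}", name), value);
    }
    out
}

/// Builds the internal lookup URL from `api_url` and `resolve_endpoint`.
fn resolve_url(api_url: &str, resolve_endpoint: &str, dataset: &str, filepath: &str) -> String {
    let path = fill_template(
        resolve_endpoint,
        &[("dataset", dataset), ("filepath", filepath)],
    );
    format!("{}{}", api_url.trim_end_matches('/'), path)
}

/// Builds the public ticket URL from `response_url` and `content_endpoint`.
fn content_url(response_url: &str, content_endpoint: &str, file_id: &str) -> String {
    let path = fill_template(content_endpoint, &[("fileId", file_id)]);
    format!("{}{}", response_url.trim_end_matches('/'), path)
}
```

Keeping `api_url` and `response_url` separate mirrors the config above: the lookup goes to the internal address, while the ticket URL handed to the client uses the public one.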

Alternatives considered

  1. Keep a path-based endpoint in the data repository — works but forces
    repositories to maintain legacy endpoints just for htsget compatibility
  2. External sidecar that pre-resolves paths — operationally complex,
    static mapping breaks when files are added
  3. Regex/substitution in UrlStorage — cannot do HTTP lookups, only
    string transformations

Scope

Happy to contribute an implementation if there's interest. The SDA team has
experience with htsget-rs (we maintain a deployment using UrlStorage + C4GH)
and can provide integration testing against a real archive.

Would like to hear your thoughts on:

  • Whether this fits htsget-rs scope or is better as an external plugin
  • Naming: DrsStorage vs ResolverStorage vs something else
  • Any architectural preferences for the resolution/caching layer
