Datahub File Service

Data Hub File Service - a service enabling file inspection and re-encryption at Data Hubs

Description

JWK Configuration

The DHFS communicates with the GHGA Central API and with the Data Hub's own S3 instance in order to re-encrypt files that were initially encrypted and uploaded by submitters. Within GHGA Central services, file upload activity is tracked and broadcast through Kafka events. However, DHFS is not connected to GHGA Central's Kafka network. Instead, a dedicated service at GHGA Central liaises between DHFS instances and the rest of GHGA Central, ensuring that each DHFS instance only receives information about files uploaded to its own S3 instance.

Because this communication occurs through a public-facing HTTP API, DHFS requests are authenticated with JSON Web Tokens (JWTs) signed by the DHFS. Each Data Hub must therefore generate a key pair in JWK format and share the public key with GHGA Central. The key pair must be of type EC P-256, which provides strong security at compact key sizes. The generated private key must be set as the data_hub_signing_key configuration value.

Example for JWK generation with Python and the jwcrypto library:

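from jwcrypto.jwk import JWK

# Generate an EC P-256 key pair; the kid shown here is only an example identifier
jwk = JWK.generate(kid="HD01-DHFS-2026-04", kty="EC", crv="P-256")
with open("jwk.pub", "w") as pk, open("jwk.sec", "w") as sk:
    pk.write(jwk.export_public())   # public key: share this file with GHGA Central
    sk.write(jwk.export_private())  # private key: keep secret, use as data_hub_signing_key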
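
Before sharing a key file with GHGA Central, it can be worth confirming that it really is the public half. The following is a minimal sketch using only the standard library. It relies on the fact that the private component of an EC JWK is stored in the `d` member. The helper name `describe_jwk` is hypothetical and not part of DHFS or jwcrypto.

```python
import json

# In an EC JWK, the private scalar is stored under "d" (RFC 7518)
PRIVATE_EC_FIELDS = {"d"}


def describe_jwk(jwk_json: str) -> dict:
    """Summarize an exported JWK: key type/curve and whether it is private."""
    jwk = json.loads(jwk_json)
    return {
        "is_ec_p256": jwk.get("kty") == "EC" and jwk.get("crv") == "P-256",
        "has_private_part": bool(PRIVATE_EC_FIELDS & jwk.keys()),
    }


# Placeholder values; a real key carries base64url-encoded coordinates
public_jwk = '{"kty": "EC", "crv": "P-256", "x": "...", "y": "..."}'
print(describe_jwk(public_jwk))
```

A file intended for GHGA Central should report `has_private_part` as false, while the value used for data_hub_signing_key should report it as true.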

Crypt4GH Key Configuration

Crypt4GH keys are used in two different capacities. The first is for decrypting the symmetric file encryption secrets generated by file submitters. During the upload process, the GHGA Connector encrypts the file content with a randomly generated key that conforms to the Crypt4GH standard (32 bytes). The GHGA Connector then encrypts this key with the recipient Data Hub's Crypt4GH public key and places it at the beginning of the uploaded file, before any of the actual file content. When DHFS starts the re-encryption process for a file, it first fetches the encrypted file secret and decrypts it using the Data Hub's matching Crypt4GH private key. Since Crypt4GH supports passphrase protection for private keys, the DHFS configuration requires the path to the key file instead of the key itself. The key path is set as data_hub_crypt4gh_private_key_path, and the passphrase, if applicable, is set as data_hub_crypt4gh_private_key_passphrase.

The second use for Crypt4GH is to securely transfer the new file encryption secret to GHGA Central after file re-encryption. The public key that the Data Hub should use for this purpose must be provided in the configuration setting central_api_crypt4gh_public_key.
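
To illustrate what "at the beginning of the uploaded file" means in a standard Crypt4GH container, the sketch below reads the fixed 16-byte preamble that precedes the encrypted header packets: an 8-byte magic string, a 4-byte little-endian version, and a 4-byte little-endian header packet count. It assumes that uploads follow the standard container layout, which this README does not state explicitly. The function name is hypothetical, and DHFS itself may rely on the crypt4gh library rather than on code like this.

```python
import struct

MAGIC = b"crypt4gh"  # magic string at the start of a Crypt4GH container


def parse_crypt4gh_preamble(data: bytes) -> tuple[int, int]:
    """Return (version, header_packet_count) from the first 16 bytes."""
    if len(data) < 16:
        raise ValueError("need at least 16 bytes")
    # 8-byte magic, then two unsigned 32-bit little-endian integers
    magic, version, packet_count = struct.unpack("<8sII", data[:16])
    if magic != MAGIC:
        raise ValueError("not a Crypt4GH container")
    return version, packet_count


# Synthetic example: version 1 with a single header packet
sample = MAGIC + struct.pack("<II", 1, 1) + b"(header packets and data follow)"
print(parse_crypt4gh_preamble(sample))  # → (1, 1)
```

The encrypted file secret lives inside the header packets that follow this preamble, which is why it can be fetched without downloading the file content itself.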

Installation

We recommend using the provided Docker container.

A pre-built version is available on Docker Hub:

docker pull ghga/datahub-file-service:3.0.0

Or you can build the container yourself from the ./Dockerfile:

# Execute in the repo's root dir:
docker build -t ghga/datahub-file-service:3.0.0 .

For production-ready deployment, we recommend using Kubernetes. However, for simple use cases, you could run the service using Docker on a single server:

# The entrypoint is pre-configured:
docker run -p 8080:8080 ghga/datahub-file-service:3.0.0 --help

If you prefer not to use containers, you may install the service from source:

# Execute in the repo's root dir:
pip install .

# To run the service:
dhfs --help

Configuration

Parameters

The service requires the following configuration parameters:

client_cache_capacity (integer): Maximum number of entries to store in the cache. Older entries are evicted once this limit is reached. Exclusive minimum: 0. Default: 128.
client_cache_ttl (integer): Number of seconds after which a stored response is considered stale. Minimum: 0. Default: 60.
client_cacheable_methods (array): HTTP methods for which responses are allowed to be cached. Default: ["POST", "GET"].
- Items (string)
client_exponential_backoff_max (integer): Maximum number of seconds to wait between retries when using exponential backoff retry strategies. The client timeout might need to be adjusted accordingly. Minimum: 0. Default: 60.
client_num_retries (integer): Number of times to retry failed API calls. Minimum: 0. Default: 3.
client_retry_status_codes (array): List of status codes that should trigger retrying a request. Default: [408, 429, 500, 502, 503, 504].
- Items (integer): Minimum: 0.
client_reraise_from_retry_error (boolean): Specifies if the exception wrapped in the final RetryError is reraised or the RetryError is returned as is. Default: true.
per_request_jitter (number): Max amount of jitter (in seconds) to add to each request. Minimum: 0. Default: 0.0.
retry_after_applicable_for_num_requests (integer): Number of requests after which the stored delay from a 429 response is ignored again. Adjusting this can be useful if concurrent requests are fired in quick succession. Exclusive minimum: 0. Default: 1.
http_request_timeout_seconds (number): Request timeout setting in seconds. Default: 60.0.
data_hub_crypt4gh_private_key_path (string, format: path, required): Path to the Data Hub's Crypt4GH private key file.

Examples:
```
"./key.sec"
```
data_hub_crypt4gh_private_key_passphrase: Passphrase needed to read the content of the private key file. Only needed if the private key is encrypted. Default: null.
- Any of
  - string
  - null
central_api_crypt4gh_public_key (string, required): The Crypt4GH public key used by the Central API. This is used to encrypt new file encryption secrets.
central_api_url (string, format: uri, required): The base URL used to connect to the GHGA Central API. Length must be between 1 and 2083 (inclusive).
data_hub_signing_key (string, format: password, required and write-only): The Data Hub's private JWK for signing JWT auth tokens.

Examples:
```
"{\"crv\": \"P-256\", \"kty\": \"EC\", \"x\": \"...\", \"y\": \"...\", \"d\": \"...\"}"
```
storage_alias (string, required): An alias identifying the Data Hub at which this instance of DHFS is running. This value should be set in coordination with GHGA Central.

Examples:
```
"HD01"
```
```
"TUE01"
```
```
"B01"
```
s3_endpoint_url (string, required): URL to the S3 API.

Examples:
```
"http://localhost:4566"
```
s3_access_key_id (string, required): Part of credentials for login into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html.

Examples:
```
"my-access-key-id"
```
s3_secret_access_key (string, format: password, required and write-only): Part of credentials for login into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html.

Examples:
```
"my-secret-access-key"
```
s3_session_token: Part of credentials for login into the S3 service. See: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/credentials.html. Default: null.
- Any of
  - string, format: password
  - null
Examples:
```
"my-session-token"
```
aws_config_ini: Path to a config file for specifying more advanced S3 parameters. This should follow the format described here: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-a-configuration-file. Default: null.
- Any of
  - string, format: path
  - null
Examples:
```
"~/.aws/config"
```
log_level (string): The minimum log level to capture. Must be one of: "CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", or "TRACE". Default: "INFO".
service_name (string): Short name of this service. Default: "dhfs".
service_instance_id (string, required): A string that uniquely identifies this instance across all instances of this service. This is included in log messages.

Examples:
```
"germany-bw-instance-001"
```
log_format: If set, will replace JSON formatting with the specified string format. If not set, has no effect. In addition to the standard attributes, the following can also be specified: timestamp, service, instance, level, correlation_id, and details. Default: null.
- Any of
  - string
  - null
Examples:
```
"%(timestamp)s - %(service)s - %(level)s - %(message)s"
```
```
"%(asctime)s - Severity: %(levelno)s - %(msg)s"
```
log_traceback (boolean): Whether to include exception tracebacks in log messages. Default: true.
min_run_interval_seconds (integer): The minimum number of seconds to wait before asking the CentralAPI about new files for interrogation. Default: 60.
interrogation_bucket_id (string, required): The name for the S3 'interrogation' bucket, which houses re-encrypted files until they are copied to permanent storage by IFRS.
library_log_level (string): The log level to use for libraries. This option can be used in tandem with log_level to view DEBUG logs from DHFS without the noise of third-party libraries. Will be overridden by log_level if log_level is higher. By default, this is set to CRITICAL, which will suppress all logs with a log level lower than CRITICAL. Must be one of: "CRITICAL", "ERROR", "WARNING", "INFO", "DEBUG", or "TRACE". Default: "CRITICAL".
library_logger_names (array): The list of logger names to target with library_log_level. Default: ["httpx", "crypt4gh", "hexkit", "ghga_service_commons", "boto3", "botocore", "httpcore", "urllib3"].
- Items (string)
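
The retry-related client parameters above interact with each other. The following sketch shows one plausible reading of them, using the documented defaults. The base-2 growth and the 1-second starting delay are assumptions made for illustration. The HTTP client used by DHFS may compute its delays differently, and it additionally applies jitter and `Retry-After` handling that are omitted here.

```python
RETRY_STATUS_CODES = {408, 429, 500, 502, 503, 504}  # client_retry_status_codes
NUM_RETRIES = 3  # client_num_retries
BACKOFF_MAX = 60  # client_exponential_backoff_max (seconds)


def backoff_delay(attempt: int, backoff_max: int = BACKOFF_MAX) -> int:
    """Seconds to wait before retry number `attempt` (1-based), capped."""
    # Assumed schedule: 1, 2, 4, ... seconds, never more than backoff_max
    return min(2 ** (attempt - 1), backoff_max)


def should_retry(status_code: int, attempt: int) -> bool:
    """Retry only listed status codes, and only while retries remain."""
    return status_code in RETRY_STATUS_CODES and attempt <= NUM_RETRIES


print([backoff_delay(n) for n in range(1, 9)])  # → [1, 2, 4, 8, 16, 32, 60, 60]
print(should_retry(503, 1), should_retry(404, 1), should_retry(503, 4))
```

Under these assumptions, the worst-case wait grows with both client_num_retries and client_exponential_backoff_max, which is why http_request_timeout_seconds might need to be adjusted together with them.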

Usage:

A template YAML file for configuring the service can be found at ./example_config.yaml. Please adapt it, rename it to .dhfs.yaml, and place it in one of the following locations:

- in the current working directory where you execute the service (on Linux: ./.dhfs.yaml)
- in your home directory (on Linux: ~/.dhfs.yaml)

The config YAML file will be automatically parsed by the service.
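
To check which file would be picked up on a given machine, the lookup can be reproduced with a few lines of Python. This is only a sketch: it assumes that the two locations are checked in the order listed above, and the function name `find_config_file` is hypothetical. The actual lookup is performed by the service's configuration machinery.

```python
from pathlib import Path
from typing import Optional

CONFIG_FILE_NAME = ".dhfs.yaml"


def find_config_file(cwd: Path, home: Path) -> Optional[Path]:
    """Return the first existing config file, checking cwd before home."""
    for directory in (cwd, home):
        candidate = directory / CONFIG_FILE_NAME
        if candidate.is_file():
            return candidate
    return None  # no config file found in either location


print(find_config_file(Path.cwd(), Path.home()))
```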

Important: If you are using containers, the locations refer to paths within the container.

All parameters mentioned in the ./example_config.yaml can also be set using environment variables or file secrets.

To name the environment variables, prefix the parameter name with dhfs_. For example, to set storage_alias, define an environment variable named dhfs_storage_alias. Both upper and lower case are accepted, but it is conventional to define environment variables in upper case (DHFS_STORAGE_ALIAS).
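
The naming rule can be expressed as a one-line helper, which is handy when generating deployment manifests from a list of parameter names. The function name `env_var_name` is hypothetical and not part of the service.

```python
ENV_PREFIX = "dhfs_"


def env_var_name(parameter: str) -> str:
    """Map a config parameter name to its conventional upper-case env variable."""
    return (ENV_PREFIX + parameter).upper()


print(env_var_name("log_level"))  # → DHFS_LOG_LEVEL
print(env_var_name("s3_endpoint_url"))  # → DHFS_S3_ENDPOINT_URL
```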

To use file secrets, please refer to the corresponding section of the pydantic documentation.

HTTP API

An OpenAPI specification for this service can be found in ./openapi.yaml.

Architecture and Design:

This is a Python-based service following the Triple Hexagonal Architecture pattern. It uses protocol/provider pairs and dependency injection mechanisms provided by the hexkit library.
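
For readers unfamiliar with the pattern, the sketch below shows the general idea of a protocol/provider pair with constructor-based dependency injection. It uses only the standard library's `abc` module rather than hexkit's own base classes, and all class and method names are hypothetical. It does not reproduce the actual DHFS code.

```python
from abc import ABC, abstractmethod


class ObjectStoragePort(ABC):
    """Hypothetical protocol (port): what the core needs from object storage."""

    @abstractmethod
    def exists(self, bucket_id: str, object_id: str) -> bool: ...


class InMemoryStorage(ObjectStoragePort):
    """Hypothetical provider (adapter), standing in for a real S3 provider."""

    def __init__(self) -> None:
        self._objects = set()

    def add(self, bucket_id: str, object_id: str) -> None:
        self._objects.add((bucket_id, object_id))

    def exists(self, bucket_id: str, object_id: str) -> bool:
        return (bucket_id, object_id) in self._objects


class Interrogator:
    """Hypothetical core component; it depends only on the port."""

    def __init__(self, storage: ObjectStoragePort, bucket_id: str) -> None:
        self._storage = storage  # injected dependency
        self._bucket_id = bucket_id

    def is_ready(self, object_id: str) -> bool:
        return self._storage.exists(self._bucket_id, object_id)


storage = InMemoryStorage()
storage.add("interrogation", "file-1")
core = Interrogator(storage, bucket_id="interrogation")
print(core.is_ready("file-1"), core.is_ready("file-2"))  # → True False
```

Because the core class only knows the abstract port, a test can inject an in-memory provider while a deployment injects one backed by S3.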

Development

For setting up the development environment, we rely on the devcontainer feature of VS Code in combination with Docker Compose.

To use it, you need Docker Compose as well as VS Code with its "Dev Containers" extension (formerly "Remote - Containers", ms-vscode-remote.remote-containers) installed. Then open this repository in VS Code and run the command Dev Containers: Reopen in Container (Remote-Containers: Reopen in Container in older versions) from the VS Code "Command Palette".

This will give you a full-fledged, pre-configured development environment including:

- infrastructural dependencies of the service (databases, etc.)
- all relevant VS Code extensions pre-installed
- pre-configured linting and auto-formatting
- a pre-configured debugger
- automatic license-header insertion

Inside the devcontainer, a command dev_install is available for convenience. It installs the service with all development dependencies, and it installs pre-commit.

The installation is performed automatically when you build the devcontainer. However, if you update dependencies in the ./pyproject.toml or the lock/requirements-dev.txt, run it again.

License

This repository is free to use and modify according to the Apache 2.0 License.

README Generation

This README file is auto-generated. Please see .readme_generation/README.md for details.
