Input data from misconfigured public S3 buckets when using AWS credentials #4732

@ramdayan

New feature

Usage scenario

Many commonly used public datasets (e.g. AWS iGenomes) are stored in public S3 buckets that are configured so that they can only be accessed anonymously, i.e. with unsigned requests. The AWS CLI supports anonymous access to S3 through the --no-sign-request flag, which skips loading credentials and sends the requests unsigned.

On the other hand, running on AWS Batch requires AWS credentials in order to access the AWS resources and services that a private AWS Batch cluster depends on.

Nextflow allows configuring only a single set of AWS credentials (or a role) for a run, and it accesses every S3 URL provided to a channel with those credentials. As a result, you cannot use public datasets like the ones mentioned above when running on your own private AWS Batch cluster, nor can you combine them with datasets stored in private S3 buckets.

Suggest implementation

My suggestion is to add a new boolean option to the fromPath channel factory (and maybe to others as well). When the option is set to true and the given path is an S3 URL, Nextflow would add the --no-sign-request flag when generating the AWS CLI command that pulls the data from that S3 bucket.

This way the user will have the granularity required in order to access both public & private S3 buckets during the same run, while running on a private AWS Batch cluster as well, regardless of how the public S3 buckets are configured.
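
To make the suggestion concrete, here is a minimal Python sketch of the per-path behaviour described above. It is not Nextflow code and does not reflect how Nextflow builds its download commands internally. The function name `build_s3_download_command`, the `anonymous` parameter, and the bucket names are all hypothetical and used only for illustration. The only real piece is the AWS CLI `--no-sign-request` flag.

```python
from typing import List


def build_s3_download_command(source: str, target: str, anonymous: bool = False) -> List[str]:
    """Build an `aws s3 cp` argument list for a single input path.

    `anonymous` stands in for the proposed per-path option (hypothetical name).
    """
    if not source.startswith("s3://"):
        raise ValueError(f"not an S3 URL: {source}")

    command = ["aws", "s3", "cp"]
    if anonymous:
        # Real AWS CLI flag: do not load credentials, send unsigned requests.
        command.append("--no-sign-request")
    command += [source, target]
    return command


# Hypothetical bucket names: one public input and one private input in the same run.
public_cmd = build_s3_download_command(
    "s3://example-public-bucket/genome.fa", "genome.fa", anonymous=True
)
private_cmd = build_s3_download_command(
    "s3://example-private-bucket/reads.fastq.gz", "reads.fastq.gz"
)

print(" ".join(public_cmd))
print(" ".join(private_cmd))
```

Only the path marked as anonymous gets the flag, so the private bucket and the AWS Batch resources would still be accessed with the configured credentials.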
