edmundmiller commented Oct 27, 2025

Summary

Add automatic upload of workflow output index files (CSV/TSV) to Seqera Platform datasets when a workflow completes, so that outputs declared in a Nextflow workflow become available through Platform's dataset management features.

Features

1. Automatic Dataset Creation & Upload

  • Auto-create datasets for workflow outputs with index files
  • Upload CSV/TSV index files to Platform datasets
  • Configurable naming pattern with workflow metadata variables
  • Per-output configuration for granular control

2. Manual HTTP Implementation

  • Uses manual HTTP requests with multipart/form-data encoding
  • Leverages existing TowerClient infrastructure
  • No external SDK dependencies
  • Works in CI without GitHub authentication requirements

3. Configuration Options

tower {
    datasets {
        enabled = true
        createMode = 'auto'  // or 'existing'
        namePattern = '${workflow.runName}-outputs'
        
        perOutput {
            'my_output' {
                enabled = true
                datasetId = 'existing-dataset-id'  // optional
            }
        }
    }
}
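
The `namePattern` value is single-quoted, so Groovy does not interpolate it when the config is parsed and the placeholder has to be resolved at run time. A minimal sketch of that substitution, assuming a plain `${name}` placeholder syntax; the class `NamePatternSketch` and the variable map are hypothetical and not taken from the plugin:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: resolves ${...} placeholders in a dataset name pattern.
final class NamePatternSketch {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{([^}]+)\\}");

    static String resolve(String pattern, Map<String, String> vars) {
        Matcher m = PLACEHOLDER.matcher(pattern);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            // Unknown placeholders are left untouched (an assumption of this sketch)
            String value = vars.getOrDefault(m.group(1), m.group(0));
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> vars = Map.of("workflow.runName", "happy_curie");
        System.out.println(resolve("${workflow.runName}-outputs", vars)); // prints "happy_curie-outputs"
    }
}
```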

Implementation Details

Dataset Upload Flow:

  1. Listen for WorkflowOutputEvent events via TraceObserverV2
  2. Collect outputs with index files (CSV/TSV)
  3. On workflow completion, create dataset(s) as configured
  4. Upload index files using multipart HTTP POST
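
The four steps above can be sketched roughly as follows. This is an illustration only: the `Uploader` interface, the class name, and the `runName-outputName` naming are hypothetical stand-ins, and the real observer works with Nextflow's own event types rather than plain strings and paths.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class DatasetUploadFlowSketch {
    /** Hypothetical stand-in for the Platform calls (create dataset, upload file). */
    interface Uploader {
        String createDataset(String name);
        void uploadIndex(String datasetId, Path indexFile);
    }

    // Outputs that carry a CSV/TSV index file, in arrival order
    private final Map<String, Path> pending = new LinkedHashMap<>();

    static boolean isIndexFile(Path file) {
        if (file == null || file.getFileName() == null) return false;
        String name = file.getFileName().toString().toLowerCase(Locale.ROOT);
        return name.endsWith(".csv") || name.endsWith(".tsv");
    }

    // Steps 1 and 2: record each published output that has an index file
    void onWorkflowOutput(String outputName, Path indexFile) {
        if (isIndexFile(indexFile)) pending.put(outputName, indexFile);
    }

    // Steps 3 and 4: on completion, create one dataset per output and upload its index
    List<String> onFlowComplete(String runName, Uploader uploader) {
        List<String> datasetIds = new ArrayList<>();
        for (Map.Entry<String, Path> e : pending.entrySet()) {
            String id = uploader.createDataset(runName + "-" + e.getKey()); // naming is an assumption
            uploader.uploadIndex(id, e.getValue());
            datasetIds.add(id);
        }
        return datasetIds;
    }
}
```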

HTTP Implementation:

  • createDataset() - POST JSON payload to datasets API
  • uploadFile() - Multipart/form-data file upload per RFC 2388 (since obsoleted by RFC 7578)
  • createMultipartBody() - Manual multipart encoding
  • Uses existing HttpClient from TowerClient infrastructure
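
For readers unfamiliar with manual multipart encoding, here is a minimal sketch of what a single-file body looks like on the wire. It is not the PR's `createMultipartBody()`; the class name and parameter list are assumptions, and only the `file` field name comes from the commit notes.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MultipartSketch {
    /**
     * Builds a multipart/form-data body containing one file part.
     * The request must also send the header
     *   Content-Type: multipart/form-data; boundary=<boundary>
     */
    static byte[] createMultipartBody(String boundary, String fieldName, String fileName,
                                      String contentType, byte[] content) {
        String head = "--" + boundary + "\r\n"
            + "Content-Disposition: form-data; name=\"" + fieldName
            + "\"; filename=\"" + fileName + "\"\r\n"
            + "Content-Type: " + contentType + "\r\n\r\n";
        String tail = "\r\n--" + boundary + "--\r\n"; // closing delimiter

        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] h = head.getBytes(StandardCharsets.UTF_8);
        byte[] t = tail.getBytes(StandardCharsets.UTF_8);
        out.write(h, 0, h.length);
        out.write(content, 0, content.length); // file bytes are copied verbatim
        out.write(t, 0, t.length);
        return out.toByteArray();
    }
}
```

The boundary must not occur inside the file content; a random token generated per request is the usual choice.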

Configuration Classes:

  • DatasetConfig - Main configuration with validation
  • Support for auto/existing modes and per-output settings
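
The kind of validation and per-output lookup described here might look like the sketch below. The class `DatasetConfigSketch`, its fields, and the rule that an explicit `datasetId` suppresses auto-creation are assumptions made for illustration; the actual `DatasetConfig` may behave differently.

```java
import java.util.Map;
import java.util.Set;

final class DatasetConfigSketch {
    private static final Set<String> MODES = Set.of("auto", "existing");

    final boolean enabled;
    final String createMode;
    final String namePattern;
    final Map<String, String> datasetIdByOutput; // per-output overrides

    DatasetConfigSketch(boolean enabled, String createMode, String namePattern,
                        Map<String, String> datasetIdByOutput) {
        // Reject anything other than the two documented modes
        if (createMode == null || !MODES.contains(createMode)) {
            throw new IllegalArgumentException(
                "createMode must be 'auto' or 'existing', got: " + createMode);
        }
        this.enabled = enabled;
        this.createMode = createMode;
        this.namePattern = namePattern;
        this.datasetIdByOutput = Map.copyOf(datasetIdByOutput);
    }

    /** Explicit dataset ID configured for this output, or null when none is set. */
    String datasetIdFor(String outputName) {
        return datasetIdByOutput.get(outputName);
    }

    /** Assumed rule: create a dataset only in 'auto' mode and when no ID is configured. */
    boolean shouldCreate(String outputName) {
        return enabled && "auto".equals(createMode) && datasetIdFor(outputName) == null;
    }
}
```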

Testing

Unit Tests (3 tests)

  • Workflow output event collection
  • Index file detection
  • Configuration validation

Integration Test

  • Real API upload (conditional on TOWER_ACCESS_TOKEN)
  • End-to-end validation with actual Platform

Validation Workflow

  • Manual test workflow in validation/dataset-upload/
  • Demonstrates complete dataset upload flow
  • Includes comprehensive testing guide

Documentation

  • Configuration reference in DatasetConfig
  • Validation workflow README with troubleshooting
  • Prerequisites: Nextflow 25.10.0+ for output {} block support

Architecture Decision

Why Manual HTTP vs tower-java-sdk?

Initially refactored to use tower-java-sdk, but reverted because:

  • SDK is hosted on GitHub Packages which requires authentication
  • CI builds cannot access GitHub Packages without credentials
  • Manual HTTP implementation is simpler and has no external dependencies
  • Uses existing TowerClient HTTP infrastructure

The manual implementation maintains identical functionality while ensuring CI compatibility.

Breaking Changes

None - feature is opt-in via configuration.

Related Issues

Checklist

  • Implemented manual HTTP multipart upload
  • Added comprehensive unit tests
  • Added integration test
  • Created validation workflow
  • Fixed workflow output DSL syntax
  • Updated documentation
  • All tests passing
  • Works in CI without authentication

netlify bot commented Oct 27, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: 687b0c3
🔍 Latest deploy log: https://app.netlify.com/projects/nextflow-docs-staging/deploys/69009f52888d130008c6fe31

Implement automatic upload of Nextflow workflow output index files to
Seqera Platform datasets when workflows complete, enabling seamless
integration between Nextflow's output syntax and Platform's dataset
management.

Changes:
- Add DatasetConfig class for dataset upload configuration
  - Support auto-create or use existing datasets
  - Customizable dataset name patterns with variable substitution
  - Per-output configuration overrides
- Update TowerConfig to include datasets configuration scope
- Implement dataset upload in TowerClient:
  - Collect workflow outputs via onWorkflowOutput() callback
  - Upload index files on workflow completion (onFlowComplete)
  - Create datasets via Platform API with proper workspace URLs
  - Use multipart/form-data for file uploads (matches tower-cli)
  - Add URL builders for dataset API endpoints
- Add comprehensive unit tests for DatasetConfig

API Implementation:
- Create dataset: POST /workspaces/{workspaceId}/datasets/
- Upload file: POST /workspaces/{workspaceId}/datasets/{datasetId}/upload
- Proper multipart/form-data format with file field
- Workspace ID in URL path (not query param)
- Header detection via ?header=true query parameter
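
A rough sketch of URL builders for the two endpoints listed above. The class and method names are hypothetical (the commit's own helpers are `getUrlDatasets()` and `getUrlDatasetUpload()`, whose signatures are not shown), and the endpoint value in the usage line is a placeholder.

```java
final class DatasetUrlSketch {
    // Drop a trailing slash so paths can be appended uniformly
    private static String base(String endpoint) {
        return endpoint.endsWith("/") ? endpoint.substring(0, endpoint.length() - 1) : endpoint;
    }

    /** POST target for creating a dataset; the workspace ID is part of the path. */
    static String datasetsUrl(String endpoint, String workspaceId) {
        return base(endpoint) + "/workspaces/" + workspaceId + "/datasets/";
    }

    /** POST target for uploading a file; 'header' tells Platform the first row is a header. */
    static String uploadUrl(String endpoint, String workspaceId, String datasetId, boolean header) {
        return base(endpoint) + "/workspaces/" + workspaceId
            + "/datasets/" + datasetId + "/upload?header=" + header;
    }

    public static void main(String[] args) {
        System.out.println(uploadUrl("https://platform.example.com/api/", "123", "abc", true));
    }
}
```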

Configuration example:
  tower {
    datasets {
      enabled = true
      createMode = 'auto'
      namePattern = '${workflow.runName}-outputs'
      perOutput {
        'results' { datasetId = 'existing-id' }
      }
    }
  }

Based on research of tower-cli (v0.15.0) and Seqera Platform API
documentation to ensure correct endpoint structure and payload format.

Signed-off-by: Edmund Miller <[email protected]>
…et upload

Refactor dataset upload implementation to use the official tower-java-sdk
instead of manual HTTP multipart encoding, significantly simplifying the
code and improving maintainability.

Changes:
- Add tower-java-sdk dependency (1.43.1) with GitHub Packages repository
- Replace manual HTTP implementation with DatasetsApi SDK methods:
  - createDataset() now uses datasetsApi.createDataset(wspId, request)
  - uploadIndexToDataset() now uses datasetsApi.uploadDataset(wspId, id, header, file)
- Remove ~120 lines of manual HTTP code:
  - Deleted getUrlDatasets() and getUrlDatasetUpload() URL builders
  - Deleted uploadFile() multipart HTTP request construction
  - Deleted createMultipartBody() RFC 2388 multipart encoding
- Add comprehensive test coverage:
  - 7 unit tests with mocked DatasetsApi (initialization, event collection, 
    dataset creation, file upload, exception handling)
  - 1 integration test with real Platform API (conditional on TOWER_ACCESS_TOKEN)
  - Manual test workflow in test-dataset-upload/ directory with documentation

Testing:
- All unit tests passing (BUILD SUCCESSFUL)
- Integration test ready (runs when TOWER_ACCESS_TOKEN available)
- Test workflow provides end-to-end validation guide

Benefits:
- Uses official Seqera SDK (same as tower-cli)
- Easier to test with mocked API
- SDK handles all HTTP/multipart details automatically
- Bug fixes in SDK benefit us automatically
- Code reduced from ~300 lines to ~100 lines

Note: Requires GitHub credentials for tower-java-sdk dependency.
Configure github_username and github_access_token in gradle.properties
or set GITHUB_USERNAME and GITHUB_TOKEN environment variables.

Signed-off-by: Edmund Miller <[email protected]>
The tower-java-sdk dependency from GitHub Packages requires authentication
even for public packages, causing CI build failures. This reverts the SDK
refactoring and restores the manual HTTP implementation.

Changes:
- Removed tower-java-sdk dependency from build.gradle
- Restored manual HTTP methods in TowerClient.groovy:
  - getUrlDatasets() and getUrlDatasetUpload() URL helpers
  - createDataset() with JSON payload and sendHttpMessage()
  - uploadFile() multipart HTTP implementation
  - createMultipartBody() RFC 2388 implementation (~120 lines total)
- Simplified TowerClientTest.groovy to remove SDK-specific tests
- Kept core functionality tests and integration test

Functionality remains identical - only the implementation approach changed
from SDK calls to direct HTTP requests. This allows the plugin to build
successfully in CI without requiring GitHub Package authentication.

Signed-off-by: Edmund Miller <[email protected]>
edmundmiller pushed a commit to edmundmiller/nextflow that referenced this pull request Nov 12, 2025
Implement a new Channel.fromDataset() factory method that downloads
datasets from Seqera Platform, so workflows can fetch datasets directly
from the Platform API.

Changes:
- Add DatasetExplorer class for handling dataset download logic
- Add Channel.fromDataset() factory method with support for dataset ID,
  version, and fileName parameters
- Add comprehensive unit tests for DatasetExplorer
- Use Tower access token authentication (TOWER_ACCESS_TOKEN env var or
  tower.accessToken config)
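
A minimal sketch of resolving the access token from the two sources named above. The precedence (config value first, then environment variable) is an assumption of this sketch, not something the commit message states, and `AccessTokenSketch` is a hypothetical name.

```java
import java.util.Map;
import java.util.Optional;

final class AccessTokenSketch {
    /**
     * Looks up the token in the 'tower' config scope, then in the environment.
     * Both maps are passed in explicitly so the lookup is easy to test.
     */
    static Optional<String> resolve(Map<String, String> towerConfig, Map<String, String> env) {
        String fromConfig = towerConfig.get("accessToken");
        if (fromConfig != null && !fromConfig.isBlank()) {
            return Optional.of(fromConfig);
        }
        String fromEnv = env.get("TOWER_ACCESS_TOKEN");
        if (fromEnv != null && !fromEnv.isBlank()) {
            return Optional.of(fromEnv);
        }
        return Optional.empty(); // caller decides how to report the missing token
    }

    public static void main(String[] args) {
        System.out.println(resolve(Map.of(), System.getenv()).isPresent());
    }
}
```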

Usage example:
  ch_input = Channel.fromList(
      samplesheetToList(
          fromDataset([fileName: 'samplesheet.csv'], params.input),
          "assets/schema_input.json"
      )
  )

TODOs for future enhancements:
- Support querying multiple datasets using list-datasets API
- Support automatic version detection/latest version
- Auto-detect fileName from dataset metadata

Related to PR nextflow-io#6515 which added dataset upload functionality.
edmundmiller pushed a commit to edmundmiller/nextflow that referenced this pull request Nov 12, 2025
Add a `fromDataset()` channel extension method to the nf-tower plugin
that downloads datasets from Seqera Platform.

Key Features:
- Downloads dataset files from Seqera Platform via the API
- Supports version specification (defaults to version 1)
- Supports custom file names (defaults to data.csv)
- Returns dataset content as a String for further processing
- Integrates seamlessly with nf-schema and other tools
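
The defaults listed above (version 1, file name data.csv) could be applied along these lines. This is a sketch only: `FromDatasetOptionsSketch` is a hypothetical class, and the option keys mirror the named arguments used in the commit's usage examples.

```java
import java.util.Map;

final class FromDatasetOptionsSketch {
    final String version;
    final String fileName;

    private FromDatasetOptionsSketch(String version, String fileName) {
        this.version = version;
        this.fileName = fileName;
    }

    /** Fills in the documented defaults for any option the caller left out. */
    static FromDatasetOptionsSketch of(Map<String, String> opts) {
        return new FromDatasetOptionsSketch(
            opts.getOrDefault("version", "1"),
            opts.getOrDefault("fileName", "data.csv"));
    }

    public static void main(String[] args) {
        FromDatasetOptionsSketch o = of(Map.of("fileName", "samples.csv"));
        System.out.println(o.version + " " + o.fileName); // prints "1 samples.csv"
    }
}
```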

Usage Examples:
```groovy
// Basic usage - download default file from dataset
def content = Channel.fromDataset('my-dataset-id')

// With nf-schema integration
ch_input = Channel.fromList(
    samplesheetToList(Channel.fromDataset(params.input), "assets/schema_input.json")
)

// Specify version and filename
def dataset = Channel.fromDataset(
    datasetId: 'my-dataset-id',
    version: '2',
    fileName: 'samples.csv'
)
```

Implementation Details:
- DatasetHelper: Handles API communication with Seqera Platform
- TowerChannelExtension: Provides the Channel extension method
- Uses Groovy extension module mechanism for seamless integration
- Properly handles authentication via TOWER_ACCESS_TOKEN
- Comprehensive error handling for HTTP errors (404, 403, 500, etc.)
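
As an illustration of the HTTP error handling mentioned in the last item, a status code could be mapped to a readable message as below. The class name and message wording are hypothetical; only the status codes 404, 403, and 500 come from the commit message.

```java
final class DatasetHttpErrorSketch {
    /** Returns null for 2xx responses, otherwise a human-readable error message. */
    static String describe(int status, String datasetId) {
        if (status >= 200 && status < 300) {
            return null;
        }
        switch (status) {
            case 403:
                return "Access denied to dataset '" + datasetId + "'";
            case 404:
                return "Dataset '" + datasetId + "' not found";
            default:
                // Group all 5xx codes together; anything else is reported as-is
                return status >= 500
                    ? "Platform server error (HTTP " + status + ")"
                    : "Unexpected HTTP status " + status;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(404, "my-dataset-id")); // prints "Dataset 'my-dataset-id' not found"
    }
}
```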

TODOs for future enhancements:
- Add support for listing datasets (using /datasets API endpoint)
- Auto-detect latest version when not specified
- Query dataset metadata to determine actual filename

Related to PR nextflow-io#6515 (dataset upload functionality)

Signed-off-by: Edmund Miller <[email protected]>