PoC for CID store annotations and workflow outputs structure #5885

Merged
pditommaso merged 21 commits into cid-store from cid-store-annotations
Apr 2, 2025

Conversation

jorgee
Contributor

@jorgee jorgee commented Mar 13, 2025

This PR is a PoC that adds annotations to metadata entries in the CID store and restructures the workflow outputs to follow the structure defined in the output DSL.

Annotations are currently taken from tags.

Tested with a variant of the e2e test, modified slightly to include tags:

output {
  samples {
    path { sample ->
      ...
    }
    tags (experimentId: "value", experimentDate: "date")
    index {
      ...
    }
  }
}

An example of the generated WorkflowResults follows. publishedData lists every published file, and outputs holds the records as they are defined in the output DSL.

{
    "type": "WorkflowResults",
    "run": "cid://835f10672ae237225406de48672493a0",
    "outputs": {
        "samples": [
            {
                "id": "delta",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/delta"
            },
            {
                "id": "beta",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/beta.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/beta"
            },
            {
                "id": "alpha",
                "fastqc": "cid://835f10672ae237225406de48672493a0/fastqc/alpha.fastqc.log",
                "quant": "cid://835f10672ae237225406de48672493a0/quant/alpha"
            }
        ]
    },
    "publishedData": [
        "cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/delta",
        "cid://835f10672ae237225406de48672493a0/fastqc/beta.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/beta",
        "cid://835f10672ae237225406de48672493a0/fastqc/alpha.fastqc.log",
        "cid://835f10672ae237225406de48672493a0/quant/alpha",
        "cid://835f10672ae237225406de48672493a0/samples.csv"
    ]
}

Workflow output files are annotated with the provided tags:

$ nextflow cid show cid://835f10672ae237225406de48672493a0/fastqc/delta.fastqc.log
{
    "type": "WorkflowOutput",
    "path": "/home/jorgee/nextflow_tests/provenance-test/results/fastqc/delta.fastqc.log",
    "checksum": {
        "value": "f8406b93427367bd50bc2bdf34659aa3",
        "algorithm": "nextflow",
        "mode": "standard"
    },
    "source": "cid://a786558405845a775fc6218fa6aa7b03/delta.fastqc.log",
    "size": 6,
    "createdAt": 1741870059179,
    "modifiedAt": 1741870059179,
    "annotations": {
        "experimentId": "value",
        "experimentDate": "date"
    }
}
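
The createdAt and modifiedAt values above are epoch milliseconds; a later update in this PR switches dates to ISO 8601 strings. A minimal sketch of that conversion, written in Python rather than the project's Groovy. The helper name millis_to_iso8601 is hypothetical, and the exact string format Nextflow emits is not shown in this thread, so treat the output format as an assumption:

```python
from datetime import datetime, timezone


def millis_to_iso8601(millis: int) -> str:
    """Convert epoch milliseconds to a UTC ISO 8601 string (hypothetical helper)."""
    # Integer arithmetic avoids float rounding on the millisecond part.
    seconds, ms = divmod(millis, 1000)
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    dt = dt.replace(microsecond=ms * 1000)
    return dt.isoformat(timespec="milliseconds")


# createdAt value taken from the WorkflowOutput example above
print(millis_to_iso8601(1741870059179))
```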


netlify bot commented Mar 13, 2025

Deploy Preview for nextflow-docs-staging ready!

Name Link
🔨 Latest commit 853a9c7
🔍 Latest deploy log https://app.netlify.com/sites/nextflow-docs-staging/deploys/67d2dbe06d34d10008e6b5cb
😎 Deploy Preview https://deploy-preview-5885--nextflow-docs-staging.netlify.app
@jorgee jorgee changed the base branch from master to cid-store March 13, 2025 13:21
@bentsherman bentsherman self-requested a review March 13, 2025 13:27
Member

@bentsherman bentsherman left a comment

Looks great so far. Left a few minor suggestions. I will try to play with it when I have time

@jorgee
Contributor Author

jorgee commented Mar 14, 2025

I have updated the code to add an annotations field to the output. It can be a Map or a closure. The closure is evaluated per sample, so we can support adding sample information, such as the sampleId, as an annotation:

output {
  samples {
    path { sample -> ... }

    annotations { sample ->
      return [experimentId: params.experimentId, sampleId: sample.id]
    }
    index {
      path 'samples.csv'
      header true
      sep ','
    }
  }
}
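
The Map-or-closure behaviour described above can be sketched as follows. This is an illustration in Python, not the PR's Groovy implementation: resolve_annotations is a hypothetical name, and the sample records and the experiment ID value are made up for the example.

```python
from typing import Any, Callable, Mapping, Union

# An annotations spec is either a fixed mapping or a function of the sample.
AnnotationSpec = Union[Mapping[str, Any], Callable[[Any], Mapping[str, Any]]]


def resolve_annotations(spec: AnnotationSpec, sample: Any) -> dict:
    """Return the annotations for one sample (hypothetical helper)."""
    if callable(spec):
        # Closure form: evaluated once per sample.
        return dict(spec(sample))
    # Map form: the same annotations apply to every sample.
    return dict(spec)


samples = [{"id": "alpha"}, {"id": "beta"}]  # made-up sample records
per_sample = lambda s: {"experimentId": "exp-1", "sampleId": s["id"]}

for s in samples:
    print(resolve_annotations(per_sample, s))
```

With the Map form every sample receives identical annotations; the closure form is what allows per-sample values such as sampleId.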

Comment on lines +174 to +180
void annotations(Map value) {
    setOption('annotations', value)
}

void annotations(Closure value) {
    setOption('annotations', value)
}
Member

TODO: decide whether to use tags or phase it out in favor of annotations

@jorgee
Contributor Author

jorgee commented Mar 18, 2025

Last changes:

  • Implemented support for retrieving output descriptions via Channel.fromPath("cid://<workflow_run_cid>/outputs").
  • Workflow output CID files are now added under <workflow_run_cid>/outputs, to avoid ambiguity between an outputs entry and outputs from the publish dir.
  • The outputs file is handled by an extension of CidPath that creates an InputStream or ByteChannel from the metadata description.
  • When passed to a task, it is treated as a foreign file because it keeps the cid scheme, and it is serialized to a real file when staged.
  • Added a new creationTime field to WorkflowResults to avoid repeated serializations when staging it from different tasks.

@pditommaso
Member

@jorgee also some conflicts to solve here 🙏

@jorgee jorgee mentioned this pull request Mar 24, 2025
@jorgee jorgee force-pushed the cid-store-annotations branch from c656fda to 4548915 Compare March 28, 2025 07:43
@jorgee
Contributor Author

jorgee commented Mar 31, 2025

Last changes:

  • Use ISO 8601 strings for dates instead of millis.
  • Unified the query and show commands.
  • Removed publishedFiles.
  • Added a publishedBy field to WorkflowOutput to enable searching for all outputs published by a workflow.
  • Pseudo fs: changed the path used to render metadata elements:
    old: cid://<hash>/outputs
    new: cid://<hash>/#outputs (I couldn't use ? in fromPath because of the glob pattern, so I used the URI fragment instead, which also fits this purpose better.)
  • Query params are still used for general queries, e.g. nextflow cid show "cid:///?type=WorkflowOutputs&publishedBy=...."
  • One open question remains: should fromPath("cid://<hash>/") return the description of the workflow run, or should that happen only when a fragment is requested?
  • I have restored the results CID because of the change to the render paths. Since we are not really doing content addressing, I could instead reuse the run CID, update its description with the outputs, and remove the results CID. What do you think?
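
To illustrate the fragment versus query distinction above, here is a small sketch using Python's standard urllib.parse. It only shows how the two URI forms split into components; parse_cid_uri is a hypothetical helper and says nothing about how the CID pseudo file system resolves them. The hash and query values are placeholders.

```python
from urllib.parse import parse_qs, urlsplit


def parse_cid_uri(uri: str) -> dict:
    """Split a cid:// URI into hash, path, fragment and query (hypothetical helper)."""
    parts = urlsplit(uri)
    if parts.scheme != "cid":
        raise ValueError(f"not a cid URI: {uri}")
    return {
        "hash": parts.netloc,        # empty for general queries (cid:///?...)
        "path": parts.path,
        "fragment": parts.fragment,  # e.g. "outputs" selects a metadata element
        "query": {k: v[0] for k, v in parse_qs(parts.query).items()},
    }


# Fragment form: render one element of a run's metadata
print(parse_cid_uri("cid://835f10672ae237225406de48672493a0/#outputs"))
# Query form: general search with no specific hash (placeholder values)
print(parse_cid_uri("cid:///?type=WorkflowOutput&publishedBy=abc"))
```

Because ? is also a glob wildcard, the query form is awkward inside fromPath, whereas the fragment form passes through unchanged.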

@jorgee
Contributor Author

jorgee commented Apr 1, 2025

Also added TaskResults descriptions. Each one includes all the task outputs (files and values) as well as references to the taskRun and the workflowRun.

So, we can look for the task results of a certain task run with nextflow cid show cid://?taskRun=<taskRunCid> or all the tasks executed by a workflow with nextflow cid show cid://?runBy=<workflowRunCid>
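
The kind of lookup these queries perform can be sketched as a key/value filter over metadata records. This Python example is an in-memory illustration only, not how the CID store executes queries: the query function and the record contents are hypothetical, while the taskRun and runBy field names come from the commands above.

```python
from typing import Dict, List


def query(records: List[dict], params: Dict[str, str]) -> List[dict]:
    """Return records whose fields equal every query parameter (hypothetical helper)."""
    return [
        r for r in records
        if all(k in r and str(r[k]) == v for k, v in params.items())
    ]


# Made-up TaskResults records with placeholder CIDs
records = [
    {"type": "TaskResults", "taskRun": "cid://t1", "runBy": "cid://w1"},
    {"type": "TaskResults", "taskRun": "cid://t2", "runBy": "cid://w1"},
    {"type": "TaskResults", "taskRun": "cid://t3", "runBy": "cid://w2"},
]

print(query(records, {"taskRun": "cid://t1"}))   # results of one task run
print(len(query(records, {"runBy": "cid://w1"})))  # tasks executed by one workflow
```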

@pditommaso
Member

pditommaso commented Apr 1, 2025

Changed the constructor of GsonEncoder because, according to GPT, RuntimeTypeAdapterFactory is not thread-safe

@pditommaso
Member

Added the serialization of nulls

Comment on lines +195 to 196
return files
}
Member

What should this return if it's not a Path or a collection?

@pditommaso
Member

Merging this; let's continue the discussion of the open points in the baseline PR

@pditommaso pditommaso merged commit 6b3293b into cid-store Apr 2, 2025
5 checks passed
@pditommaso pditommaso deleted the cid-store-annotations branch April 2, 2025 09:10