Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using custom objects with paths #2085

Open
nh13 opened this issue May 4, 2021 · 27 comments · May be fixed by #4553
Open

Using custom objects with paths #2085

nh13 opened this issue May 4, 2021 · 27 comments · May be fixed by #4553

Comments

@nh13
Copy link

nh13 commented May 4, 2021

Now that we can pass around custom (value/metadata) classes and have them cached, it would be wonderful if Nextflow could recognize Path members/properties so they are localized and available for the process that uses the class. For example:

import nextflow.io.ValueObject

@ValueObject
class SampleData {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    Path r1
    /** The path to the read two FASTQ */
    Path r2
}

There could be a way annotate r1 and r2 (or return the list of paths etc) for nextflow to localize them.

@mjhipp
Copy link

mjhipp commented May 4, 2021

I wonder if it would be possible by using a different type as a member of the ValueObject? I am not very familiar with the source code, so I assume this wouldn't actually work, but something like:

import nextflow.io.ValueObject
import nextflow.processor.TaskPath

@ValueObject
class SampleData {
    /** The sample name */
    String name
    /** The path to the read one FASTQ */
    TaskPath r1
    /** The path to the read two FASTQ */
    TaskPath r2
}

Or maybe nextflow.file.FileHolder?

@pditommaso
Copy link
Member

Would just String be enough for paths? internal classes are not meant to be used in pipeline scripts.

@nh13
Copy link
Author

nh13 commented May 4, 2021

I'd love to have things properly typed (eg. Path).

@pditommaso
Copy link
Member

In principle Path should work as well. What's the problem with it?

@nh13
Copy link
Author

nh13 commented May 4, 2021

Does nextflow inspect the object for members of type Path and localize already?

@pditommaso
Copy link
Member

What do you mean for localize?

@nh13
Copy link
Author

nh13 commented May 4, 2021

If I pass in a SampleData class in the script, the r1

process foo {
  input:
     val sample // Type: SampleData
     path fasta
  output:
    ???
  script:
    """
    bwa mem ${fasta} ${sample.r1} ${sample.r2}
    """
}

@mjhipp
Copy link

mjhipp commented May 4, 2021

Files being localized from the working directory to the analysis node (mainly for cloud workflows)

@pditommaso
Copy link
Member

we are entering in a minefield .. 😄 no, the file staging (ie localisation) is determined by the path declaration. The pattern you are suggesting is interesting but not supported.

@ryankelley-lm
Copy link

There's a variety of situations where this capability would make it less error prone to deal with collections of genomic files, including VCF (+ index), BAM (+ index), FASTA (+ index). In the above example, you could imagine additional behaviors such as "is_paired" to allow scripts to handle single- or paired-end data in a more expressive way. Is this something that would completely break the process handling design or is there an opportunity for a user contribution in this area?

@stale

This comment was marked as outdated.

@stale stale bot added the stale label Jan 30, 2022
@nh13
Copy link
Author

nh13 commented Jan 30, 2022

Still relevant, here’s what I do:

import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper

@ValueObject
@Immutable(copyWith=true, knownImmutables = ['path'])
class Metadata {
    static { 
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channles. This allows for caching when this object is used.
        KryoHelper.register(Metadata)
    }

    Path path
}

@edmundmiller
Copy link
Contributor

@nh13 Do you have an example of a pipeline you've used that in?

@nh13
Copy link
Author

nh13 commented May 5, 2022

@emiller88 not in one I can share publicly, happy to help with a small hello world example.

@edmundmiller
Copy link
Contributor

@emiller88 not in one I can share publicly, happy to help with a small hello world example.

I figured 😞. Not sure if that was you offering to write it or answer questions.

I'm guessing you toss the KryoHelper class in a groovy file in lib/. I guess I'm stuck on the output: ??? piece as well?

@nh13
Copy link
Author

nh13 commented May 5, 2022

I was offering, but it’s not high on my list.

@ryankelley-lm
Copy link

@nh13 , I was hoping that a fully worked example wouldn't be necessarily be needed if it is just a question of how to structure the "output" line in the process definition such that the custom class will be appropriately instantiated by the framework, but perhaps that is wishful thinking.

Also, one additional request for clarification, are you saying in the approach that you use above that process results are appropriately cached and any suitably annotated paths in the custom class are also localized?

@nh13
Copy link
Author

nh13 commented May 5, 2022

  1. KryoHelper is part of nextflow, so no need to include it other than the import at the top of the custom class groovy file (yes, put the custom groovy file in /lib)
  2. Something like this should work:
    input:
    tuple val(meta), path(in_bam)

    output:
    tuple(val(meta), path("${meta.name()}.out.txt")         , emit: txt)

@edmundmiller
Copy link
Contributor

@nh13 Okay I threw together an example

N E X T F L O W  ~  version 22.04.0
Launching `main.nf` [elated_bell] DSL2 - revision: 4a32a3cac6
BUG! exception in phase 'semantic analysis' in source unit 'Script_097d04d3' The lookup for CustomObject caused a failed compilation. There should not have been any compilation from this call.

@ryankelley-lm
Copy link

@emiller88 , I suspect compilation of your custom class is failing on resolution of the "Path" type (needs to be imported?). Maybe can strip down example to use native types as a first pass.

@edmundmiller
Copy link
Contributor

@emiller88 , I suspect compilation of your custom class is failing on resolution of the "Path" type (needs to be imported?). Maybe can strip down example to use native types as a first pass.

This was it. I've pushed up a working test, but the class just replaces the map, so it's not really doing much.

@ryankelley-lm
Copy link

@nh13 , I'm assuming that in your previous example, the described workaround only allows for definition of the custom class, but not any associated localization of the files? If that is the case, I tried an alternate approach that would require some modification to the underlying Nextflow framework. Briefly, the custom class must be treated as a path input and provides a collection of associated paths that must be staged. While this addresses the file staging need, none of the custom attributes are accessible. For that, there is an additional interface generated, CustromFileAttributesProvider that the custom class can implement. If available, this interface method is used to define additional properties on the object exposed during script generation.

The changes to Nextflow are summarized in a PR here, ryankelley-lm#1, with an associated example leveraging the interface at https://github.com/ryankelley-lm/hello-custom-objects-1/tree/custom-object-download

but perhaps you were able to achieve this in a simpler way?

@pditommaso , the PR above is a rough first pass, but do you think the general concept is something that could be reworked into an acceptable PR?

CC @emiller88

@nh13
Copy link
Author

nh13 commented May 7, 2022

@ryankelley-lm if you use a path it will work just fine for local paths. I’m not sure how or if remote files will work.

@pditommaso
Copy link
Member

pditommaso commented May 8, 2022

I think to handle this properly it's required to introduce a new input/output qualifier record as an alternative to tuple.

A tentative syntax could be as shown below:

@ValueObject
class SampleData {
    String name
    Path r1
    Path r2
}


process foo {
  output: 
  record SampleData(name: 'some_name', r1: '*_1.bam', r2: '*_2.bam')

  '''
  some_command 
  '''
}

process bar {
  input: 
  record data: SampleData()

  """
  other_command --sample $data.name --reads $data.r1 $data.r2
  """  
}

workflow {
  foo | bar
}

The idea is to continue to have tuple-like semantics both for the input and output declaration to have the ability to use custom record types

@ryankelley-lm
Copy link

@pditommaso ,
Thanks for the feedback. In the above, would "record" be an alternative to "tuple" or "val"? When you say that "record" would be an alternative to "tuple", do you imagine that you would be packaging multiple different custom object instances into a single "record" line, in the same way that multiple instances are packaged into a "tuple"? Or is a each "record" a single object instance, like with a "val"?

At some point, would you still require some contract to be implemented by the custom class, such that the framework knows what paths the custom object is dealing with?

@ryankelley-lm
Copy link

@pditommaso , or is the idea that a "record" is essentially a named tuple, such that the framework knows each of the data members and for path, can handle them appropriately?

@ryankelley-lm
Copy link

ryankelley-lm commented Jun 1, 2022

FWIW, I believe most of the desired behavior here is achievable with the current release of Nextflow if you

  • Implement the tuple interface in your custom object (for example, by extending ArrayBag)
  • Wrapping your process in a subworkflow to capture the process output into your standard object

The idea is that Nextflow will see your custom object as a tuple and treat it accordingly. In your own scripts, you can use the specific behaviors of your custom object. Since Nextflow knows how to stage tuple appropriately, all of the staging will be handled correctly. This idea is captured in the updated example here, https://github.com/ryankelley-lm/hello-custom-objects-1/tree/custom-object-download

nextflow.enable.dsl = 2

process STEPONE_PROCESS {
    input:
    tuple val(myname), path(r1), path(r2)

    output:
    tuple val("${myname}"), path("output1.txt"), path("output2.txt"), emit: paths

    script:
    """
    cat ${r1} ${r2} >> output1.txt
    cat ${r2} ${r1} >> output2.txt
    """
}


workflow STEPONE_WORKFLOW {
    take: workflowInput
    main:
        processResult = STEPONE_PROCESS(workflowInput).map {
            name, path1, path2 ->
                return new CustomObject(name, path1, path2) 
        }
    emit:
        processResult
}
import groovy.transform.Immutable
import nextflow.io.ValueObject
import nextflow.util.KryoHelper
import java.nio.file.Path
import nextflow.util.ArrayBag
import java.util.List
import nextflow.file.FileHolder
import java.util.Map

class CustomObject extends ArrayBag {
    static {
        // Register this class with the Kryo framework that serializes and deserializes objects
        // that pass through channles. This allows for caching when this object is used.
        KryoHelper.register(CustomObject)
    }

    String name

    Path r1

    Path r2

    CustomObject(String name, Path r1, Path r2)
    {
        super()
        this.name = name
        this.r1 = r1
        this.r2 = r2
        this.target.add(name)
        this.target.add(r1)
        this.target.add(r2)
    }

    public String toString()
    {
        return this.name + " " + this.r1.toString() + " " + this.r2.toString()
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants