
Two possible solutions for addressing UW-Madison's HTCondor Constraints #6524

@nrminor


Hello!

Previously, in issue #3705, there was some discussion of adjusting Nextflow's experimental HTCondor executor to support the University of Wisconsin-Madison's particular constraints, which are nontrivial. As far as I can tell, that work remains unfinished. Below is an overview of the problem and a couple of solutions (including a Nextflow fork with working adjustments to the condor executor that I've dogfooded at my workplace for the past two weeks).

Overview of the problems

As stated in the Nextflow HTCondor docs, Nextflow makes the simplifying assumption that it's running on a shared filesystem. That assumption works for many, if not most, HTCondor users. Unfortunately, it only partially holds at UW-Madison, where HTCondor was created. Our main HPC cluster introduces three wrinkles:

  1. Only part of the filesystem is shared (the part called "staging").
  2. You cannot submit jobs from that shared filesystem.
  3. Nodes do not have access to staging by default; they must be given access to it with an additional condor option: CHTCHasStaging == True. Even then, nodes can never access arbitrary files outside of staging.

Note

For the remainder of this issue, I'm going to refer to that shared filesystem as "staging", which is UW-Madison specific jargon and does not refer to the concept of "staging" files onto nodes before task execution.
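
One way to express the CHTCHasStaging requirement from the third wrinkle above is through clusterOptions. Here is a minimal sketch, assuming clusterOptions lines are passed through verbatim to the generated .condor submit file and that the requirement is written as a standard ClassAd requirements expression (the exact wording should follow CHTC's documentation):

process {
    // hedged example: ask for execute nodes that can see staging
    clusterOptions = 'requirements = (CHTCHasStaging == True)'
}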

These wrinkles present a variety of challenges for getting Nextflow's condor executor (and indeed, Nextflow generally) working there:

  1. All data inputs from any arbitrary path mentioned in any .nf file must be on staging. This includes all projectDir and launchDir files.
  2. Work directories must be on staging.
  3. .condor submit files, which would normally sit alongside .command.run and the other command files in the work directory on staging, must not be on staging, as they cannot be submitted to the scheduler from there.
  4. bin/ scripts must be made available to each task from staging, not wherever the pipeline code is.
  5. Nextflow itself must not be running on staging, as the process that runs condor_submit will fail if it tries to submit from staging.
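
To make constraints 2 and 5 concrete, here is a minimal nextflow.config sketch with hypothetical paths (not our real ones):

// constraint 2: work directories must live on the shared staging filesystem
workDir = '/staging/groups/mylab/work'

// constraint 5: Nextflow itself is launched from a non-staging location,
// e.g. /home/mylab/pipelines, so condor_submit is never invoked from /staging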

A Solution in the Condor Executor

Before I had fully grasped the scope of these issues, my first effort was to get condor working in my own fork of Nextflow, which I have done successfully.

Full disclosure: the above constraints have forced us to build a fair amount of UW-Madison-specific behavior, with new Nextflow syntax, that will not apply to other HTCondor users. Most of these changes introduce additional overhead, e.g., copying of files, that is not normally performed upon pipeline launch. The changes also encode a fair amount of domain-specific knowledge about our cluster that the Nextflow maintainers will not have. As such, if I were to submit a PR (of course with appropriate adjustments to the documentation), I'd be interested in feedback on whether the maintainers would be open to merging it, or whether they would prefer/recommend my second proposed solution: a Nextflow plugin for these UW-Madison-specific adjustments to the executor.

That caveat aside, my fork first and foremost prioritizes backwards compatibility: all changes are opt-in, and I expect most condor users will not need to opt in. If they do, though, my fork adds additional configuration syntax for the condor executor specifically. Here's what my current working version looks like:

executor {
    '$condor' {
        sharedFilesystem = false
        submitFileDir = '.nextflow/condor'
        pathMappings = ['/<REDACTED>': '/staging']
        autoStageDirectories = ['/home/<REDACTED>/taxtriage/assets']
    }
}

(The portions labeled <REDACTED> are real paths on our cluster)

  • The configuration starts with sharedFilesystem = false, which tells Nextflow that it will not be able to proceed normally and will instead need to a) place bin scripts and potentially other files inside of the work directory and b) place condor submit files within the current launchDir.
  • submitFileDir = '.nextflow/condor' tells Nextflow to place condor submit files in a subdirectory alongside the other files under .nextflow/.
  • pathMappings = ['/<REDACTED>': '/staging'] is another unfortunate concession to CHTC. When we use CHTCHasStaging == True, nodes are only given access to a symlink of staging, not staging's actual, canonicalized path. pathMappings gives Nextflow a mapping of which path prefixes must be replaced in .command.run; by default the mapping is empty, so no replacements are performed (a small sketch of the intended substitution follows this list).
    • I'll note that this is where I began feeling like Nextflow maintainers may understandably not be interested in merging and maintaining these changes, though again, I'd be interested in any feedback!
  • autoStageDirectories = ['/home/<REDACTED>/taxtriage/assets'] explicitly tells Nextflow which directories must be made available to nodes in the work directory on staging. When sharedFilesystem = false, this defaults to every non-hidden directory in ${projectDir}, which is a hack and is not efficient, but it is a more lightweight alternative to fully supporting Nextflow's ability to reference arbitrary data dependencies anywhere in .nf files.
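
To illustrate what pathMappings is meant to do, here is a simplified Groovy sketch of the kind of prefix substitution involved; the canonical path is a made-up stand-in, and this is not the fork's actual implementation:

// hypothetical mapping: canonicalized staging path on the left,
// the /staging symlink that execute nodes actually see on the right
def pathMappings = ['/mnt/canonical/staging': '/staging']

// rewrite every mapped prefix found in a line of .command.run
def remap = { String line ->
    pathMappings.inject(line) { acc, entry -> acc.replace(entry.key, entry.value) }
}

assert remap('/mnt/canonical/staging/mylab/reads.fastq.gz') == '/staging/mylab/reads.fastq.gz'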

I use these options in my fork alongside some clusterOptions specific to my lab and process configuration to ensure all resource requirements are set, and it works beautifully. Nextflow submits jobs successfully, manages .nextflow.log correctly, maintains resumability, etc. It has been a massive improvement over our previous solution to using Nextflow without the condor executor.
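
For reference, the process-side configuration referred to above looks roughly like the sketch below; the process name and numbers are placeholders rather than our real settings:

process {
    executor = 'condor'
    // the condor executor turns cpus/memory/disk into request_cpus,
    // request_memory, and request_disk in the generated submit file
    withName: 'ALIGN_READS' {
        cpus   = 8
        memory = '32 GB'
        disk   = '64 GB'
    }
}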

That said, the implementation is not what I would call a light touch.

Is a Nextflow Plugin an alternative?

Given the scope of the changes above and that they involve new configuration syntax, a question: do Nextflow plugins expose enough of the Nextflow internals to make the above concessions to UW-Madison's cluster possible? For context, my fork modifies the following items:

  • CondorExecutor.register()
  • CondorExecutor.createBashWrapperBuilder()
  • CondorWrapperBuilder.shouldUnstageOutputs()
  • CondorWrapperBuilder.createContainerBuilder()

It also introduces some new helper methods, replaces some pre-existing methods in CondorExecutor, and adds a CondorFileCopyStrategy class, all to handle logic that conditionally applies when sharedFilesystem = false.

I'm not experienced with Nextflow plugins, but would a plugin allow similar modifications to Nextflow's behavior? Or does getting around UW-Madison's constraints require fundamental enough changes that they would have to live in Nextflow itself? I'd prefer the plugin path but could see it going either way.
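
For what it's worth, if this did end up as a plugin, I'd expect it to be enabled like any other Nextflow plugin in nextflow.config; the plugin id below is purely hypothetical:

plugins {
    id 'nf-chtc'   // hypothetical id for the UW-Madison/CHTC adjustments
}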


Thanks for reading! Hopefully we can find the right solution to get the executor working for UW-Madison condor users. In the interim, anyone interested is welcome to clone and build my fork for their own purposes!
