
Two possible solutions for addressing UW-Madison's HTCondor Constraints #6524

@nrminor


Hello!

Previously, in issue #3705, there was some discussion of adjusting Nextflow's experimental HTCondor executor to support the University of Wisconsin-Madison's particular constraints, which are nontrivial. As far as I can tell, that work remains unfinished. Below is an overview of the problem and a couple of solutions (including a Nextflow fork with working adjustments to the condor executor that I've dogfooded at my workplace for the past two weeks).

Overview of the problems

As stated in the Nextflow HTCondor docs, Nextflow makes the simplifying assumption that it's running on a shared filesystem. That assumption works for many, if not most, HTCondor users. Unfortunately, it only partially holds at UW-Madison, where HTCondor was created. Our main HPC cluster introduces three wrinkles:

  1. Only part of the filesystem is shared (the part called "staging").
  2. You cannot submit jobs from that shared filesystem.
  3. Nodes do not have access to staging by default; they must be given access to it with an additional condor option: CHTCHasStaging == True. Even then, nodes can never access arbitrary files outside of staging.

Note

For the remainder of this issue, I'm going to refer to that shared filesystem as "staging", which is UW-Madison specific jargon and does not refer to the concept of "staging" files onto nodes before task execution.
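
One way to express the CHTCHasStaging requirement from the third wrinkle above is through clusterOptions. Here is a minimal sketch, assuming clusterOptions lines are passed through verbatim to the generated .condor submit file and that the requirement is written as a standard ClassAd requirements expression (the exact wording should follow CHTC's documentation):

process {
    // hedged example: ask for execute nodes that can see staging
    clusterOptions = 'requirements = (CHTCHasStaging == True)'
}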

These wrinkles present a variety of challenges for getting Nextflow's condor executor (and indeed, Nextflow generally) working there:

  1. All data inputs from any arbitrary path mentioned in any .nf file must be on staging. This includes all projectDir and launchDir files.
  2. Work directories must be on staging.
  3. .condor submit files, which would normally sit alongside .command.run and the other command files in the work directory on staging, must not be on staging, as they cannot be submitted to the scheduler from there.
  4. bin/ scripts must be made available to each task from staging, not wherever the pipeline code is.
  5. Nextflow itself must not be running on staging, as the process that runs condor_submit will fail if it tries to submit from staging.
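
To make constraints 2 and 5 concrete, here is a minimal nextflow.config sketch with hypothetical paths (not our real ones):

// constraint 2: work directories must live on the shared staging filesystem
workDir = '/staging/groups/mylab/work'

// constraint 5: Nextflow itself is launched from a non-staging location,
// e.g. /home/mylab/pipelines, so condor_submit is never invoked from /staging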

A Solution in the Condor Executor

Before I had fully grasped the scope of these issues, my first effort was to get condor working in my own fork of Nextflow, which I have done successfully.

Full disclosure: the above constraints have forced us to build a fair amount of UW-Madison-specific behavior, with new Nextflow syntax, that will not apply to other HTCondor users. Most of these changes introduce additional overhead, e.g., copying of files, that is not normally performed upon pipeline launch. The changes also encode a fair amount of domain-specific knowledge about our cluster that the Nextflow maintainers will not have. As such, if I were to submit a PR (of course with appropriate adjustments to the documentation), I'd be interested in feedback on whether the maintainers would be open to merging it, or whether they would prefer/recommend my second proposed solution: a Nextflow plugin for these UW-Madison-specific adjustments to the executor.

That caveat aside, my fork first and foremost prioritizes backwards compatibility: all changes are opt-in, and I expect most condor users will not need to opt in. If they do, though, my fork adds additional configuration syntax for the condor executor specifically. Here's what my current working version looks like:

executor {
    '$condor' {
        sharedFilesystem = false
        submitFileDir = '.nextflow/condor'
        pathMappings = ['/<REDACTED>': '/staging']
        autoStageDirectories = ['/home/<REDACTED>/taxtriage/assets']
    }
}

(The portions labeled <REDACTED> are real paths on our cluster)

  • The configuration starts with sharedFilesystem = false, which tells Nextflow that it will not be able to proceed normally and will instead need to a) place bin scripts and potentially other files inside of the work directory and b) place condor submit files within the current launchDir.
  • submitFileDir = '.nextflow/condor' tells Nextflow to place condor submit files in a subdirectory alongside the other files under .nextflow/.
  • pathMappings = ['/<REDACTED>': '/staging'] is another unfortunate concession to CHTC. When we use CHTCHasStaging == True, nodes are only given access to a symlink of staging, not staging's actual, canonicalized path. pathMappings gives Nextflow a mapping of which path prefixes must be replaced in .command.run; by default the mapping is empty, so no replacements are performed (a small sketch of the intended substitution follows this list).
    • I'll note that this is where I began feeling like Nextflow maintainers may understandably not be interested in merging and maintaining these changes, though again, I'd be interested in any feedback!
  • autoStageDirectories = ['/home/<REDACTED>/taxtriage/assets'] explicitly tells Nextflow which directories must be made available to nodes in the work directory on staging. When sharedFilesystem = false, this defaults to every non-hidden directory in ${projectDir}, which is a hack and is not efficient, but it is a more lightweight alternative to fully supporting Nextflow's ability to reference arbitrary data dependencies anywhere in .nf files.
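
To illustrate what pathMappings is meant to do, here is a simplified Groovy sketch of the kind of prefix substitution involved; the canonical path is a made-up stand-in, and this is not the fork's actual implementation:

// hypothetical mapping: canonicalized staging path on the left,
// the /staging symlink that execute nodes actually see on the right
def pathMappings = ['/mnt/canonical/staging': '/staging']

// rewrite every mapped prefix found in a line of .command.run
def remap = { String line ->
    pathMappings.inject(line) { acc, entry -> acc.replace(entry.key, entry.value) }
}

assert remap('/mnt/canonical/staging/mylab/reads.fastq.gz') == '/staging/mylab/reads.fastq.gz'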

I use these options in my fork alongside some clusterOptions specific to my lab and process configuration to ensure all resource requirements are set, and it works beautifully. Nextflow submits jobs successfully, manages .nextflow.log correctly, maintains resumability, etc. It has been a massive improvement over our previous solution to using Nextflow without the condor executor.
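
For reference, the process-side configuration referred to above looks roughly like the sketch below; the process name and numbers are placeholders rather than our real settings:

process {
    executor = 'condor'
    // the condor executor turns cpus/memory/disk into request_cpus,
    // request_memory, and request_disk in the generated submit file
    withName: 'ALIGN_READS' {
        cpus   = 8
        memory = '32 GB'
        disk   = '64 GB'
    }
}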

That said, the implementation is not what I would call a light touch.

Is a Nextflow Plugin an alternative?

Given the scope of the changes above and that they involve new configuration syntax, a question: do Nextflow plugins expose enough of the Nextflow internals to make the above concessions to UW-Madison's cluster possible? For context, my fork modifies the following items:

  • CondorExecutor.register()
  • CondorExecutor.createBashWrapperBuilder()
  • CondorWrapperBuilder.shouldUnstageOutputs()
  • CondorWrapperBuilder.createContainerBuilder()

It also introduces some new helper methods, replaces some pre-existing methods in CondorExecutor, and adds a CondorFileCopyStrategy class, all to handle logic that conditionally applies when sharedFilesystem = false.

I'm not experienced with Nextflow plugins, but would a plugin allow similar modifications to Nextflow's behavior? Or does getting around UW-Madison's constraints require fundamental enough changes that they would have to live in Nextflow itself? I'd prefer the plugin path but could see it going either way.
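
For what it's worth, if this did end up as a plugin, I'd expect it to be enabled like any other Nextflow plugin in nextflow.config; the plugin id below is purely hypothetical:

plugins {
    id 'nf-chtc'   // hypothetical id for the UW-Madison/CHTC adjustments
}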


Thanks for reading! Hopefully we can find the right solution to get the executor working for UW-Madison condor users. In the interim, anyone interested is welcome to clone and build my fork for their own purposes!
