Issue 4 #5 — merged (9 commits, Oct 1, 2024)
15 changes: 14 additions & 1 deletion book/_config.yml
@@ -53,4 +53,17 @@ html:

parse:
myst_substitutions:
miniconda_url: "[Miniconda](https://conda.io/miniconda.html)"
release_epoch: "2024.5"
tutorial_environment_block: |
````{admonition} Reminder
:class: tip

These examples assume that you have a QIIME 2 deployment that includes the [q2-dwq2](https://github.com/caporaso-lab/q2-dwq2) educational plugin.
Follow the instructions in [](tutorial-setup) if you'd like to follow along with this tutorial.
If you've already followed those instructions, before following this tutorial be sure to activate your conda environment as follows:

```shell
conda activate using-qiime2
```
````
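Pages elsewhere in the book can then interpolate these values with MyST's substitution syntax. For example (an illustrative usage, not a page from this PR):

```markdown
First install {{ miniconda_url }}.

{{ tutorial_environment_block }}
```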
4 changes: 4 additions & 0 deletions book/_toc.yml
@@ -4,18 +4,22 @@ parts:
- caption: Tutorials
chapters:
- file: tutorials/intro
- file: tutorials/parallel-pipeline
- caption: How-tos
chapters:
- file: how-to-guides/merge-metadata
- file: how-to-guides/validate-metadata
- file: how-to-guides/artifacts-as-metadata
- file: how-to-guides/view-visualizations
- file: how-to-guides/pipeline-resumption
- caption: Explanations
chapters:
- file: explanations/metadata
- file: explanations/types-of-parallelization
- caption: References
chapters:
- file: references/metadata
- file: references/parallel-configuration
- caption: Back matter
chapters:
- file: back-matter/glossary
Expand Down
11 changes: 11 additions & 0 deletions book/back-matter/glossary.md
@@ -12,6 +12,12 @@ artifact
When written to file, artifacts typically have the extension {term}`qza`.
Artifacts can be provided as input to QIIME 2 {term}`actions <action>` or exported from QIIME 2 for use with other software.

breaking change
A *breaking change* is a change to how a program works (for example, a QIIME 2 plugin or interface) that introduces an incompatibility with earlier versions of the program.
This generally requires users to modify how they use some aspect of the system.
For example, if a plugin method added a new required input in version 2, that would be a breaking change with respect to version 1: calling the method without that new input would fail in version 2, but would have succeeded with version 1.
This may also be called a *backward-incompatible change* or an API change.

DRY
An acronym of *Don't Repeat Yourself*: a critical principle of software engineering that is equally applicable in research data management.
For more information on DRY and software engineering in general, see {cite:t}`pragprog20`.
@@ -33,6 +39,11 @@ plugin
As of this writing, a collection of plugins that are installed together is referred to as a distribution.
Additional plugins can be installed, and the primary resource enabling discovery of additional plugins is the [QIIME 2 Library](https://library.qiime2.org).

Python 3 API
QIIME 2's Application Programming Interface for Python 3.
This allows advanced users to access all QIIME 2 analytic functionality directly in Python.
This can be very convenient for developing tools that use QIIME 2 as a component, or for performing data analysis without writing intermediate data artifacts to disk unless you specifically want to.

q2cli
[q2cli](https://github.com/qiime2/q2cli) is the original (and still primary, as of March 2024) command line interface for QIIME 2.

2 changes: 1 addition & 1 deletion book/explanations/metadata.md
@@ -1,5 +1,5 @@
(metadata-explanation)=
# Metadata in QIIME 2
# Sample and feature metadata

Metadata provides the key to gaining biological insight from your data.
In QIIME 2, **sample metadata** may include technical details, such as the DNA barcodes that were used for each sample in a multiplexed sequencing run, or descriptions of the samples, such as which subject, time point, and body site each sample came from in a human microbiome time series.
14 changes: 14 additions & 0 deletions book/explanations/types-of-parallelization.md
@@ -0,0 +1,14 @@
(types-of-parallel-support)=
# Types of parallel computing support

## Parallel Pipeline execution

QIIME 2's formal parallel computing support uses [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html), and enables parallel execution of QIIME 2 {term}`Pipeline` actions.
All QIIME 2 `Pipelines` expose parallel computing options, notably the `--parallel` parameter in {term}`q2cli`, though whether these actually result in parallel computation is up to the implementation of each `Pipeline`.
Actions using this formal parallel computing support can make use of high-performance computing hardware that doesn't necessarily have shared memory.

## Informal parallel support

Some {term}`Method` actions (e.g., `qiime dada2 denoise-*`) wrap multi-threaded applications and may define a parameter (like `--p-n`) that gives the user control over the thread count.
The QIIME 2 parameter type associated with these parameters should always be `NTHREADS` or `NJOBS` (if you observe a parameter where this isn't the case, it was probably an error on the developer's part; reach out on the forum to let us know).
Actions using this informal parallel computing support are generally restricted to running on systems with shared memory.
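To make the shared-memory model concrete, here is a minimal, purely illustrative Python sketch (not QIIME 2 or dada2 code) of multi-threaded work, where a thread-count argument plays the role of an `NTHREADS`-style parameter:

```python
from concurrent.futures import ThreadPoolExecutor

def process(chunk):
    # Stand-in for a CPU- or I/O-bound unit of work.
    return sum(chunk)

chunks = [[1, 2], [3, 4], [5, 6]]

# All threads share the process's memory, which is why this style of
# parallelism is restricted to shared-memory systems.
with ThreadPoolExecutor(max_workers=2) as executor:
    results = list(executor.map(process, chunks))

print(results)  # [3, 7, 11]
```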
2 changes: 1 addition & 1 deletion book/how-to-guides/artifacts-as-metadata.md
@@ -1,5 +1,5 @@
(view-artifacts-as-metadata)=
# How to use QIIME 2 Artifacts as Metadata
# How to use Artifacts as Metadata

In addition to TSV metadata files, QIIME 2 also supports viewing some kinds of artifacts as metadata.
An example of this is artifacts of type `SampleData[AlphaDiversity]`.
35 changes: 35 additions & 0 deletions book/how-to-guides/pipeline-resumption.md
@@ -0,0 +1,35 @@
(pipeline-resumption)=
# How to resume failed Pipeline runs

If a {term}`Pipeline` fails at some point during its execution, and you rerun it, QIIME 2 can attempt to reuse the results that were calculated by the `Pipeline` before it failed.

## Pipeline resumption through the command line interface (CLI)

By default, when you run a {term}`Pipeline` on the CLI, QIIME 2 will create a pool in its cache (either the default cache, or the cache specified using the `--use-cache` parameter).
This pool will be named based on the scheme: `recycle_<plugin>_<action>_<sha1('plugin_action')>`.
This pool will store all intermediate {term}`Results <result>` created by the {term}`Pipeline`.
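That naming scheme can be sketched as follows (only the `recycle_<plugin>_<action>_<sha1(...)>` shape is taken from the scheme above; the exact string QIIME 2 hashes is an assumption here):

```python
import hashlib

def recycle_pool_name(plugin: str, action: str) -> str:
    # Assumed input to sha1: "<plugin>_<action>"; the real implementation
    # may hash a slightly different string.
    digest = hashlib.sha1(f"{plugin}_{action}".encode()).hexdigest()
    return f"recycle_{plugin}_{action}_{digest}"

print(recycle_pool_name("diversity", "core_metrics"))
```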

Should the `Pipeline` run succeed, this pool will be removed.
However, should the `Pipeline` run fail, you can rerun the `Pipeline` using the same command you ran the first time, and the intermediate {term}`Results <result>` stored in the pool will be reused to avoid redoing steps in the Pipeline that had already completed.

If you wish to specify the pool that QIIME 2 should use, either on a `Pipeline`'s first run or on a resumption, you can do so with the `--recycle-pool` option, followed by the name of the pool you wish to use.
This pool will be created in the cache if it does not already exist.
The `--no-recycle` flag may be passed if you do not want QIIME 2 to attempt to recycle any past {term}`Results <result>` or to save its {term}`Results <result>` from this run for future reuse.

It may not be possible to reuse prior {term}`Results <result>` if the inputs you provide to the `Pipeline` on resumption differ from those provided on the initial run.
In this situation, QIIME 2 will still try to reuse any {term}`Results <result>` that are not dependent on the inputs that changed, but there is no guarantee any will be usable.

## Pipeline resumption through the Python 3 API

When using the Python API, pools are specified using context managers (i.e., using Python's `with` statement).
If you don't want to enable resumption, don't use the context manager.

```python
from qiime2.core.cache import Cache

cache = Cache('cache_path')
pool = cache.create_pool('pool', reuse=True)

with pool:
    # run your pipeline here, for example (hypothetical call):
    # results = my_pipeline(my_artifact)
    ...
```
2 changes: 1 addition & 1 deletion book/how-to-guides/view-visualizations.md
@@ -1,5 +1,5 @@
(view-visualizations)=
# How to view QIIME 2 Visualizations
# How to view Visualizations

## QIIME 2 View

183 changes: 183 additions & 0 deletions book/references/parallel-configuration.md
@@ -0,0 +1,183 @@
(parallel-configuration)=
# Parallel Pipeline configuration

QIIME 2 provides formal support for parallel computing of {term}`Pipelines <pipeline>` through [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html).

## Parsl configuration

A [Parsl configuration](https://parsl.readthedocs.io/en/stable/userguide/configuring.html) tells Parsl what resources are available and how to use them; one is required in order to use Parsl.
The [Parsl documentation](https://parsl.readthedocs.io/en/stable/) provides full detail on [Parsl configuration](https://parsl.readthedocs.io/en/stable/userguide/configuring.html#).

In the context of QIIME 2, Parsl configuration information is maintained in a QIIME 2 configuration file.
QIIME 2 configuration files are stored on disk in [TOML](https://toml.io/en/) files.

### Default Parsl configuration

For basic multi-processor usage, QIIME 2 writes a default configuration file the first time it's needed (e.g., if you instruct QIIME 2 to execute in parallel without a particular configuration).

The default `qiime2_config.toml` file, as of QIIME 2 2024.10, looks like the following:

(default-parsl-configuration-file)=
```toml
[parsl]
strategy = "None"

[[parsl.executors]]
class = "ThreadPoolExecutor"
label = "tpool"
max_threads = ...

[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"
max_workers = ...

[parsl.executors.provider]
class = "LocalProvider"
```

When this file is written to disk, the `max_threads` and `max_workers` values (represented above by `...`) are computed by QIIME 2 as one less than the CPU count on the computer where it is running (`max(psutil.cpu_count() - 1, 1)`).
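That computation can be sketched as follows (the source uses `psutil.cpu_count()`; `os.cpu_count()` is substituted here to keep the sketch dependency-free, and the two can differ in restricted environments):

```python
import os

def default_worker_count() -> int:
    # One less than the CPU count, but never less than 1.
    cpus = os.cpu_count() or 1
    return max(cpus - 1, 1)

print(default_worker_count())
```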

This configuration defines two `Executors`.

1. A [`ThreadPoolExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.ThreadPoolExecutor.html?highlight=Threadpoolexecutor), which parallelizes jobs across multiple threads within a single process.
2. A [`HighThroughputExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.HighThroughputExecutor.html?highlight=HighThroughputExecutor), which parallelizes jobs across multiple processes.

In this case, the `HighThroughputExecutor` is designated as the default by virtue of its `label` value being `default`.
Your Parsl configuration **must** define an executor with the label `default`; this is the executor that QIIME 2 will dispatch your jobs to if you do not specify an alternative.
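A minimal sketch of what that label lookup might look like over the parsed configuration (illustrative only, not QIIME 2's actual implementation):

```python
def find_executor(executors, label="default"):
    """Return the first executor table whose label matches."""
    for executor in executors:
        if executor.get("label") == label:
            return executor
    raise ValueError(f"no executor labeled {label!r}")

# Toy data mirroring the two executors defined above.
executors = [
    {"class": "ThreadPoolExecutor", "label": "tpool"},
    {"class": "HighThroughputExecutor", "label": "default"},
]
print(find_executor(executors)["class"])
```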

````{admonition} The parsl.Config object
:class: tip

This parsl configuration is ultimately read into a `parsl.Config` object internally in QIIME 2.
The `parsl.Config` object that corresponds to the above example would look like the following:

```python
import parsl
from parsl.executors import ThreadPoolExecutor, HighThroughputExecutor
from parsl.providers import LocalProvider

config = parsl.Config(
executors=[
ThreadPoolExecutor(
label='tpool',
max_threads=... # will be an integer value
),
HighThroughputExecutor(
label='default',
max_workers=..., # will be an integer value
provider=LocalProvider()
)
],
strategy=None
)
```
````

### Parsl configuration, line-by-line

The first line of [the default configuration file presented above](default-parsl-configuration-file) indicates that this is the parsl section (or [table](https://toml.io/en/v1.0.0#table), to use TOML's terminology) of our configuration file.

```toml
[parsl]
```

The next line:

```toml
strategy = "None"
```

is a top-level Parsl configuration parameter that you can [read more about in the Parsl documentation](https://parsl.readthedocs.io/en/stable/userguide/configuring.html#multi-threaded-applications).
This may need to be set differently depending on your system.

Next, the first executor is added.

```toml
[[parsl.executors]]
class = "ThreadPoolExecutor"
label = "tpool"
max_threads = 7
```

The double square brackets (`[[ ... ]]`) indicate that [this is an array of tables](https://toml.io/en/v1.0.0#array-of-tables), `executors`, nested under the `parsl` table.
`class` indicates the specific Parsl class that is being configured ([`parsl.executors.ThreadPoolExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.ThreadPoolExecutor.html#parsl.executors.ThreadPoolExecutor) in this case); `label` provides a name that you can use to refer to this executor elsewhere; and `max_threads` is a configuration value for the `ThreadPoolExecutor` class, corresponding to a parameter of the class's constructor.
In this example a value of 7 is specified for `max_threads`, but as noted above this will be computed specifically for your machine when this file is created.

Parsl's `ThreadPoolExecutor` runs on a single node, so we provide a second executor which can utilize up to 2000 nodes.

```toml
[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"
max_workers = 7

[parsl.executors.provider]
class = "LocalProvider"
```

The definition of this executor, [`parsl.executors.HighThroughputExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.HighThroughputExecutor.html#parsl.executors.HighThroughputExecutor), looks similar to the definition of the `ThreadPoolExecutor`, but it additionally defines a `provider`.
The provider class provides access to computational resources.
In this case, we use [`parsl.providers.LocalProvider`](https://parsl.readthedocs.io/en/stable/stubs/parsl.providers.LocalProvider.html), which provides access to local resources (i.e., the machine you are running on, such as a laptop or workstation).
[Other providers are available as well](https://parsl.readthedocs.io/en/stable/reference.html#providers), including for Slurm, Amazon Web Services, Kubernetes, and more.
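For example, on a Slurm cluster the provider table might look like the following hypothetical fragment. The class name `SlurmProvider` is real, but check the Parsl documentation for its current constructor parameters before relying on any shown here:

```toml
[[parsl.executors]]
class = "HighThroughputExecutor"
label = "default"
max_workers = 7

[parsl.executors.provider]
class = "SlurmProvider"
partition = "short"    # queue/partition name on your cluster
walltime = "01:00:00"  # requested wall-clock time
```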

### Mapping {term}`Actions <action>` to executors

An executor mapping can be added to your parsl configuration that defines which actions should run on which executors.
If an action is unmapped, it will run on the default executor.
The mapping can be specified as follows:

```toml
[parsl.executor_mapping]
action_name = "tpool"
```
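The fallback behavior can be sketched in a few lines (illustrative, not QIIME 2's code; `action_name` is the placeholder used above):

```python
executor_mapping = {"action_name": "tpool"}

def executor_for(action: str, mapping: dict, default: str = "default") -> str:
    # Unmapped actions fall back to the executor labeled "default".
    return mapping.get(action, default)

print(executor_for("action_name", executor_mapping))
print(executor_for("some_other_action", executor_mapping))
```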

```{warning}
The mechanism for specifying action names at present does not handle the case of different plugins defining actions with the same name.
This mechanism will likely change soon, and may be a {term}`breaking change`.
You can track progress on this [here](https://github.com/qiime2/qiime2/issues/802).
```

(view-parsl-configuration)=
### Viewing the current configuration

Using {term}`q2cli`, you can see your current `qiime2_config.toml` file by running:

```shell
qiime info --config-level 2
```

(qiime2-configuration-precedence)=
### QIIME 2 configuration file precedence

When QIIME 2 needs configuration information, the following precedence order is followed to load a configuration file:

1. The path specified in the environment variable `$QIIME2_CONFIG`.
2. The file at `<user_config_dir>/qiime2/qiime2_config.toml`.
3. The file at `<site_config_dir>/qiime2/qiime2_config.toml`.
4. The file at `$CONDA_PREFIX/etc/qiime2_config.toml`.

If no configuration is found after checking those four locations, QIIME 2 writes a default configuration file to `$CONDA_PREFIX/etc/qiime2_config.toml` and uses that.
This implies that after the first time you run QIIME 2 in parallel without a configuration file in any of the first three locations, the path referenced in step 4 will exist and contain a configuration file.
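The precedence order can be sketched as follows (illustrative only; the path-construction details are assumptions, not QIIME 2's implementation):

```python
import os

def resolve_config_path(user_path, site_path, conda_path):
    """Return the first config location that applies, per the order above."""
    # 1. The environment variable wins if it is set.
    env_path = os.environ.get("QIIME2_CONFIG")
    if env_path:
        return env_path
    # 2-4. Otherwise take the first existing file.
    for candidate in (user_path, site_path, conda_path):
        if os.path.exists(candidate):
            return candidate
    # Nothing found: QIIME 2 would write a default config to the conda path.
    return conda_path

print(resolve_config_path("/nonexistent/a", "/nonexistent/b", "/nonexistent/c"))
```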

Alternatively, when using {term}`q2cli`, you can provide a specific configuration file for Parsl using the `--parallel-config` option.
If provided, this overrides the priority order above.

````{admonition} user_config_dir and site_config_dir
:class: note
On Linux, `user_config_dir` will usually be `$HOME/.config/qiime2/`.
On macOS, it will usually be `$HOME/Library/Application Support/qiime2/`.

You can find the directory used on your system by running the following command:

```bash
python -c "import appdirs; print(appdirs.user_config_dir('qiime2'))"
```

On Linux `site_config_dir` will usually be something like `/etc/xdg/qiime2/`, but it may vary based on Linux distribution.
On macOS it will usually be `/Library/Application Support/qiime2/`.

You can find the directory used on your system by running the following command:

```bash
python -c "import appdirs; print(appdirs.site_config_dir('qiime2'))"
```
````