From 7825093bdf2bbfacf557e60b1bb4adb6fe548fa0 Mon Sep 17 00:00:00 2001
From: Greg Caporaso
Date: Tue, 24 Sep 2024 10:26:15 -0700
Subject: [PATCH] add parallel and pipeline resumption docs

fixes #4
fixes caporaso-lab/developing-with-qiime2#29
---
 book/_toc.yml                                 |   3 +
 book/explanations/types-of-parallelization.md |  14 +
 book/how-to-guides/parallel-configuration.md  | 290 ++++++++++++++++++
 book/how-to-guides/pipeline-resumption.md     |  41 +++
 4 files changed, 348 insertions(+)
 create mode 100644 book/explanations/types-of-parallelization.md
 create mode 100644 book/how-to-guides/parallel-configuration.md
 create mode 100644 book/how-to-guides/pipeline-resumption.md

diff --git a/book/_toc.yml b/book/_toc.yml
index 2868e86..1290c9d 100644
--- a/book/_toc.yml
+++ b/book/_toc.yml
@@ -10,9 +10,12 @@ parts:
       - file: how-to-guides/validate-metadata
       - file: how-to-guides/artifacts-as-metadata
       - file: how-to-guides/view-visualizations
+      - file: how-to-guides/parallel-configuration
+      - file: how-to-guides/pipeline-resumption
   - caption: Explanations
     chapters:
       - file: explanations/metadata
+      - file: explanations/types-of-parallelization
   - caption: References
     chapters:
       - file: references/metadata
diff --git a/book/explanations/types-of-parallelization.md b/book/explanations/types-of-parallelization.md
new file mode 100644
index 0000000..79ae66c
--- /dev/null
+++ b/book/explanations/types-of-parallelization.md
@@ -0,0 +1,14 @@
+(types-of-parallel-support)=
+# Types of parallel computing support
+
+## Formal parallel support
+
+QIIME 2's formal parallel computing support uses [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html), and is only accessible through QIIME 2 {term}`Pipeline` actions.
+All QIIME 2 `Pipelines` will have parallel computing options (e.g., the `--parallel` parameter in {term}`q2cli`), though whether those actually induce parallel computing is up to the implementation of the `Pipeline`.
+Actions using this formal parallel computing support can make use of high-performance computing hardware that doesn't necessarily have shared memory.
+
+## Informal parallel support
+
+Some {term}`Method` actions (e.g., `qiime dada2 denoise-*`) wrap multi-threaded applications and may define a parameter (like `--p-n`) that gives the user control over the level of parallelism.
+The QIIME 2 parameter type associated with these parameters should always be `NTHREADS` or `NJOBS` (if you observe a parameter where this isn't the case, it was probably an error on the developer's part - reach out on the forum to let us know).
+Actions using this informal parallel computing support are generally restricted to running on systems with shared memory.
\ No newline at end of file
diff --git a/book/how-to-guides/parallel-configuration.md b/book/how-to-guides/parallel-configuration.md
new file mode 100644
index 0000000..4798d86
--- /dev/null
+++ b/book/how-to-guides/parallel-configuration.md
@@ -0,0 +1,290 @@
+(parallel-configuration)=
+# How to configure and use parallel computing
+
+QIIME 2 provides formal support for parallel computing of {term}`Pipelines <Pipeline>` through [Parsl](https://parsl.readthedocs.io/en/stable/1-parsl-introduction.html).[^formal-informal-parallel]
+This allows for faster execution of QIIME 2 `Pipelines` by ensuring that pipeline steps that can run simultaneously do run simultaneously, assuming the compute resources are available.
+
+A [Parsl configuration](https://parsl.readthedocs.io/en/stable/userguide/configuring.html) is required to use Parsl.
+This configuration tells Parsl what resources are available to it, and how to use them.
+How to create and use a Parsl configuration through QIIME 2 depends on which interface you're using and will be detailed on a per-interface basis below.
+
+For basic usage, we have supplied a vendored configuration that we load from a [`.toml`](https://toml.io/en/) file that will be used by default if you instruct QIIME 2 to execute in parallel without a particular configuration.
+This configuration file is shown below.
+We write this file the first time you attempt to use it, and the `max(psutil.cpu_count() - 1, 1)` is evaluated and written to the file at that time.
+An actual number is required there for all user-made config files.
+
+**TODO: The following needs to be updated for the 2024.10 default! Pick up here.**
+
+```
+[parsl]
+strategy = "None"
+
+[[parsl.executors]]
+class = "ThreadPoolExecutor"
+label = "default"
+max_threads = max(psutil.cpu_count() - 1, 1)
+
+[[parsl.executors]]
+class = "HighThroughputExecutor"
+label = "htex"
+max_workers = max(psutil.cpu_count() - 1, 1)
+
+[parsl.executors.provider]
+class = "LocalProvider"
+
+[parsl.executor_mapping]
+some_action = "htex"
+```
+
+An actual `parsl.Config` object in Python looks like the following:
+
+```python
+config = Config(
+    executors=[
+        ThreadPoolExecutor(
+            label='default',
+            max_threads=max(psutil.cpu_count() - 1, 1)
+        ),
+        HighThroughputExecutor(
+            label='htex',
+            max_workers=max(psutil.cpu_count() - 1, 1),
+            provider=LocalProvider()
+        )
+    ],
+    # AdHoc Clusters should not be set up with scaling strategy.
+    strategy=None
+)
+
+# This bit is not part of the actual parsl.Config, but rather is used to tell
+# QIIME 2 which non-default executors (if any) you want specific actions to run
+# on
+mapping = {'some_action': 'htex'}
+```
+
+The [Parsl documentation](https://parsl.readthedocs.io/en/stable/) provides full detail.
+Briefly, we create a [`ThreadPoolExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.ThreadPoolExecutor.html?highlight=Threadpoolexecutor) that parallelizes jobs across multiple threads in a process.
+We also create a [`HighThroughputExecutor`](https://parsl.readthedocs.io/en/stable/stubs/parsl.executors.HighThroughputExecutor.html?highlight=HighThroughputExecutor) that parallelizes jobs across multiple processes.
+
+```{note}
+Your config MUST contain an executor with the label `default`.
+This is the executor that QIIME 2 will dispatch your jobs to if you do not specify an executor to use.
+The default executor in the default config is the `ThreadPoolExecutor`, meaning that unless you specify otherwise, all jobs that use the default config will run on the `ThreadPoolExecutor`.
+```
+
+## The Config File
+
+Let's break down that config file further by constructing it from the ground up, using 7 as our max threads/workers.
+
+```
+[parsl]
+strategy = "None"
+```
+
+This very first part of the file indicates that this is the `parsl` section of our config.
+That will be the only section we define at the moment, but in the future we expect to expand on this to provide additional QIIME 2 configuration options through this file.
+`strategy = 'None'` is a top-level Parsl configuration parameter that you can read more about in the Parsl documentation.
+This may need to be set differently depending on your system.
+If you were to load this into Python using tomlkit you would get the following dictionary:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None'
+    }
+}
+```
+
+Next, let's add an executor:
+
+```
+[[parsl.executors]]
+class = "ThreadPoolExecutor"
+label = "default"
+max_threads = 7
+```
+
+The `[[ ]]` indicates that this is a list, and the `parsl.executors` in the middle indicates that this list is called `executors` and belongs under `parsl`.
+Now our dictionary looks like the following:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None',
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+             'label': 'default',
+             'max_threads': 7}
+        ]
+    }
+}
+```
+
+To add another executor, we simply add another list element.
+Notice that we also have `parsl.executors.provider` for this one.
+Some classes of Parsl executor require additional classes to fully configure them.
+These classes must be specified beneath the executor they belong to.
+
+```
+[[parsl.executors]]
+class = "HighThroughputExecutor"
+label = "htex"
+max_workers = 7
+
+[parsl.executors.provider]
+class = "LocalProvider"
+```
+
+Now our dictionary looks like the following:
+
+```python
+{
+    'parsl': {
+        'strategy': 'None',
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+             'label': 'default',
+             'max_threads': 7},
+            {'class': 'HighThroughputExecutor',
+             'label': 'htex',
+             'max_workers': 7,
+             'provider': {'class': 'LocalProvider'}}]
+    }
+}
+```
+
+Finally, we have the `executor_mapping`, where you can define which actions, if any, you would like to run on which executors.
+If an action is unmapped, it will run on the default executor.
+
+```
+[parsl.executor_mapping]
+some_action = "htex"
+```
+
+Our final result looks like the following.
+The `executor_mapping` section is used internally to tell Parsl where you want your actions to run, while the rest of the information is used to instantiate the `parsl.Config` object shown above.
+
+```python
+{
+    'parsl': {
+        'strategy': 'None',
+        'executors': [
+            {'class': 'ThreadPoolExecutor',
+             'label': 'default',
+             'max_threads': 7},
+            {'class': 'HighThroughputExecutor',
+             'label': 'htex',
+             'max_workers': 7,
+             'provider': {'class': 'LocalProvider'}}],
+        'executor_mapping': {'some_action': 'htex'}
+    }
+}
+```
+
+## Using QIIME 2 in parallel through the command line interface (CLI)
+
+There are two flags that allow you to parallelize a pipeline through the CLI.
+The first is the `--parallel` flag.
+This flag will use the following priority order to load a Parsl configuration.
+
+1. Check the environment variable `$QIIME2_CONFIG` for a filepath to a configuration file.
+2. Check the path `<user_config_dir>/qiime2/qiime2_config.toml`
+3. Check the path `<site_config_dir>/qiime2/qiime2_config.toml`
+4. Check the path `$CONDA_PREFIX/etc/qiime2_config.toml`
+5. Write a default configuration to the path in step 4 and use that.
+
+This implies that after your first time running QIIME 2 in parallel without a config in at least one of the first 3 locations, the path referenced in step 4 will exist and contain the default config (unless you remove the file or switch to a different conda environment).
+
+The second flag related to parallelization through the command line interface is the `--parallel-config` flag, which is used to provide a path to a configuration file.
+This allows you to easily create and use your own custom configuration based on your system, and a value provided using this parameter overrides the above priority order.
+
+````{admonition} user_config_dir
+:class: note
+On Linux, `user_config_dir` will usually be `$HOME/.config/qiime2/`.
+On macOS, it will usually be `$HOME/Library/Application Support/qiime2/`.
+
+You can find the directory used on your system by running the following command:
+
+```bash
+python -c "import appdirs; print(appdirs.user_config_dir('qiime2'))"
+```
+````
+
+````{admonition} site_config_dir
+:class: note
+On Linux, `site_config_dir` will usually be something like `/etc/xdg/qiime2/`, but it may vary based on Linux distribution.
+On macOS, it will usually be `/Library/Application Support/qiime2/`.
+
+You can find the directory used on your system by running the following command:
+
+```bash
+python -c "import appdirs; print(appdirs.site_config_dir('qiime2'))"
+```
+````
+
+## Using QIIME 2 in parallel through the Python 3 API
+
+Parallelization through the Python API is done using `ParallelConfig` objects as context managers.
+These objects take a `parsl.Config` object and a dictionary mapping action names to executor names.
+If no config is provided, your default config will be used (found using the same priority order as described for the `--parallel` flag above).
+
+A `parsl.Config` object itself can be created in several different ways.
+
+First, you can just create it using Parsl directly.
+
+```python
+import psutil
+
+from parsl.config import Config
+from parsl.providers import LocalProvider
+from parsl.executors.threads import ThreadPoolExecutor
+from parsl.executors import HighThroughputExecutor
+
+
+config = Config(
+    executors=[
+        ThreadPoolExecutor(
+            label='default',
+            max_threads=max(psutil.cpu_count() - 1, 1)
+        ),
+        HighThroughputExecutor(
+            label='htex',
+            max_workers=max(psutil.cpu_count() - 1, 1),
+            provider=LocalProvider()
+        )
+    ],
+    # AdHoc Clusters should not be set up with scaling strategy.
+    strategy=None
+)
+```
+
+Alternatively, you can create it from a QIIME 2 config file.
+
+```python
+from qiime2.sdk.parallel_config import get_config_from_file
+
+config, mapping = get_config_from_file('path_to_config')
+
+# Or if you have no mapping
+config, _ = get_config_from_file('path_to_config')
+
+# Or if you only have a mapping and are getting the config from elsewhere
+_, mapping = get_config_from_file('path_to_config')
+```
+
+Once you have your config and/or your mapping, you can use it as follows:
+
+```python
+from qiime2.sdk.parallel_config import ParallelConfig
+
+
+# Note that the mapping can also be a dictionary literal
+with ParallelConfig(parallel_config=config, action_executor_mapping=mapping):
+    future = # <your action>.parallel(args)
+    # Make sure to call _result inside of the context manager
+    result = future._result()
+```
+
+[^formal-informal-parallel]: QIIME 2 {term}`Actions <Action>` can provide formal (i.e., Parsl-based) or informal (e.g., multi-threaded execution of a third party program) parallel computing support.
+    To learn more about the distinction, see [](types-of-parallel-support).
\ No newline at end of file
diff --git a/book/how-to-guides/pipeline-resumption.md b/book/how-to-guides/pipeline-resumption.md
new file mode 100644
index 0000000..a8a1dfa
--- /dev/null
+++ b/book/how-to-guides/pipeline-resumption.md
@@ -0,0 +1,41 @@
+(pipeline-resumption)=
+# How to resume failed Pipeline runs
+
+```{note}
+This is more of an advanced user or system administrator usage document.
+[This is slated to move](https://github.com/caporaso-lab/developing-with-qiime2/issues/29) to the new general-purpose user documentation.
+```
+
+If a {term}`Pipeline` fails at some point during its execution, and you rerun it, QIIME 2 can attempt to reuse the results that were calculated by the `Pipeline` before it failed.
+
+## Pipeline resumption through the command line interface (CLI)
+
+By default, when you run a {term}`Pipeline` on the CLI, QIIME 2 will create a pool in its cache (either the default cache, or the cache specified using the `--use-cache` parameter).
+This pool will be named based on the scheme: `recycle_<plugin>_<action>_<sha1('plugin_action')>`.
+This pool will store all intermediate results created by the pipeline.
+
+Should the `Pipeline` run succeed, this pool will be removed.
+However, should the `Pipeline` run fail, you can rerun the `Pipeline` and the intermediate results stored in the pool will be reused to avoid doing duplicate work.
+
+If you wish to specify the specific pool that you would like QIIME 2 to use, either on a `Pipeline`'s first run or on a resumption, you can specify the pool using the `--recycle-pool` option, followed by the name of the pool you wish to use.
+This pool will be created in the cache if it does not already exist.
+The `--no-recycle` flag may be passed if you do not want QIIME 2 to attempt to recycle any past results or to save the results from this run for future reuse.
+
+It is not necessarily possible to reuse prior results if your inputs to the `Pipeline` differ on resumption with respect to what was provided on the initial run.
+In this situation, QIIME 2 will still try to reuse any results that are not dependent on the inputs that changed, but there is no guarantee anything will be usable.
+
+## Pipeline resumption through the Python 3 API
+
+When using the Python API, pools are specified using context managers (i.e., using Python's `with` statement).
+If you don't want to enable resumption, don't use the context manager.
+
+```python
+from qiime2.core.cache import Cache
+
+cache = Cache('cache_path')
+pool = cache.create_pool('pool', reuse=True)
+
+with pool:
+    ...  # run your pipeline here
+```
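For intuition about the automatically generated pool names you will see in the cache, here is a rough sketch of the naming logic. It assumes the name combines the plugin name, the action name, and a SHA-1 digest of `'plugin_action'`; the `recycle_pool_name` helper is purely illustrative and is not part of the QIIME 2 API.

```python
import hashlib

def recycle_pool_name(plugin: str, action: str) -> str:
    # Assumed scheme: recycle_<plugin>_<action>_<sha1('plugin_action')>
    digest = hashlib.sha1(f'{plugin}_{action}'.encode()).hexdigest()
    return f'recycle_{plugin}_{action}_{digest}'

# e.g., for a hypothetical 'core_metrics' action in a 'diversity' plugin:
print(recycle_pool_name('diversity', 'core_metrics'))
```

Because the name is deterministic, rerunning the same action finds the same pool, which is what makes automatic resumption possible without the user naming a pool explicitly.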