Module Usage in Projects

As a concrete example, we will apply the unsupervised_analysis module to MyData stored in data/MyData.

Data

We provide a minimal example of the analysis of the UCI ML hand-written digits dataset imported from sklearn:

config
- configuration: config/MyData/MyData_unsupervised_analysis_config.yaml
- annotation: config/MyData/MyData_unsupervised_analysis_annotation.csv
data
- dataset (1797 observations, 64 features): data/MyData/digits_data.csv
- metadata (consisting only of the ground truth label "target"): data/MyData/digits_labels.csv
results will be generated in the configured results folder results/MyData/
performance: on an HPC, a full run completed in less than 7 minutes (with up to 32 GB of memory per task).
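
The two input files listed above could be regenerated from scikit-learn roughly as follows. This is a minimal sketch, not the script used to create the example data: it assumes that the first column of each CSV holds a sample identifier, and the column name `sample_name`, the IDs `sample_<i>`, and both function names are hypothetical.

```python
import csv
from pathlib import Path


def write_tables(features, labels, feature_names, out_dir):
    """Write a feature table and a one-column metadata table as CSV files.

    Assumption: both files share a first column of sample identifiers; the
    column name "sample_name" and the IDs "sample_<i>" are hypothetical.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sample_ids = [f"sample_{i}" for i in range(len(features))]

    # Feature table: one row per observation, one column per feature.
    data_path = out_dir / "digits_data.csv"
    with open(data_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name", *feature_names])
        for sample_id, row in zip(sample_ids, features):
            writer.writerow([sample_id, *row])

    # Metadata table: only the ground truth label "target".
    labels_path = out_dir / "digits_labels.csv"
    with open(labels_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["sample_name", "target"])
        for sample_id, label in zip(sample_ids, labels):
            writer.writerow([sample_id, label])

    return data_path, labels_path


def export_digits(out_dir="data/MyData"):
    """Export the scikit-learn digits dataset (requires scikit-learn)."""
    from sklearn.datasets import load_digits  # imported lazily on purpose

    digits = load_digits()
    return write_tables(digits.data, digits.target, digits.feature_names, out_dir)
```

Calling `export_digits()` should produce 1797 rows and 64 feature columns, matching the dimensions listed above. Whether the module expects exactly this layout is an assumption to check against the module's own documentation.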

Code

First, we register the configuration file of the unsupervised_analysis module for MyData in your project's config/config.yaml, using the following predefined structure.

#### Datasets and Workflows to include ###
workflows:
    MyData:
        unsupervised_analysis: "config/MyData/MyData_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
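
The naming scheme can be expressed as a small helper, for example to generate paths for several datasets consistently. This is only an illustrative sketch; the function name `config_path` and its `config_dir` parameter are hypothetical and not part of the module.

```python
def config_path(dataset_name, module, config_dir="config"):
    """Build a config file path following the recommended naming scheme."""
    return f"{config_dir}/{dataset_name}/{dataset_name}_{module}_config.yaml"


print(config_path("MyData", "unsupervised_analysis"))
# config/MyData/MyData_unsupervised_analysis_config.yaml
```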

Second, within the main Snakefile (workflow/Snakefile), we have to do three things:

load and parse all configurations into a structured dictionary.

# load configs for all workflows and datasets
import yaml  # PyYAML; needed for yaml.safe_load below

config_wf = dict()

for ds in config["workflows"]:
    for wf in config["workflows"][ds]:
        with open(config["workflows"][ds][wf], 'r') as stream:
            try:
                # key pattern: {dataset_name}_{module}
                config_wf[ds+'_'+wf]=yaml.safe_load(stream)
            except yaml.YAMLError as exc:
                print(exc)

include the MyData analysis snakefile from the rules subfolder (see last step).

##### load rules (one per dataset) #####
include: os.path.join("rules", "MyData.smk")

require all outputs from the used module as inputs to the target rule all.

#### Target Rule ####
rule all:
    input:
        #### MyData Analysis
        rules.MyData_unsupervised_analysis_all.input,
        ...
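
The configuration-loading loop in the first of the three steps prints YAML errors and continues, so a broken config file may only surface later as a missing key in config_wf. One possible variant, sketched here as a plain Python function, builds the same `{dataset_name}_{module}` keys but stops at the first file that cannot be parsed. The function name `load_workflow_configs` and its `parse` parameter are hypothetical and are not part of Snakemake or the module.

```python
def load_workflow_configs(workflows, parse=None):
    """Return {"<dataset>_<workflow>": parsed_config} for all listed configs.

    `workflows` mirrors config["workflows"]: {dataset: {workflow: path}}.
    `parse` is a hypothetical hook that takes an open file; it defaults to
    yaml.safe_load (PyYAML), imported lazily.
    """
    if parse is None:
        import yaml

        parse = yaml.safe_load

    configs = {}
    for dataset, modules in workflows.items():
        for workflow, path in modules.items():
            with open(path, "r") as stream:
                try:
                    configs[f"{dataset}_{workflow}"] = parse(stream)
                except Exception as exc:
                    # Fail fast and name the offending file.
                    raise ValueError(f"Could not parse config {path!r}") from exc
    return configs
```

In the Snakefile this could be used as `config_wf = load_workflow_configs(config["workflows"])`, assuming the function is defined or imported before the include statements.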

Finally, within the dedicated snakefile for the analysis of MyData (workflow/rules/MyData.smk), we load the specified version of the unsupervised_analysis module directly from GitHub, provide it with the previously loaded configuration, and use a prefix for all (*) loaded rules.

# MyData Analysis

### MyData - Unsupervised Analysis ####
module MyData_unsupervised_analysis:
    snakefile:
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v2.0.0")
    config:
        config_wf["MyData_unsupervised_analysis"]

use rule * from MyData_unsupervised_analysis as MyData_unsupervised_analysis_*

Tip

Recommended file name for the analysis-specific snakefile: workflow/rules/{dataset_name}.smk.

Recommended prefix for the loaded rules: {dataset_name}_{module}_.
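
To illustrate what the prefix does, the sketch below mimics the renaming applied by `use rule * from ... as {prefix}*` in plain Python. It explains why the target rule refers to `rules.MyData_unsupervised_analysis_all.input`. The helper name `prefixed_rule_names` is hypothetical, and `some_rule` is a placeholder rather than an actual rule of the module.

```python
def prefixed_rule_names(rule_names, dataset_name, module):
    """Map original rule names to their names after prefixing on import."""
    prefix = f"{dataset_name}_{module}_"
    return {name: prefix + name for name in rule_names}


# "all" is the module's target rule referenced above; "some_rule" is a placeholder.
print(prefixed_rule_names(["all", "some_rule"], "MyData", "unsupervised_analysis"))
# {'all': 'MyData_unsupervised_analysis_all', 'some_rule': 'MyData_unsupervised_analysis_some_rule'}
```

Using a distinct prefix per dataset and module keeps rule names unique when the same module is loaded more than once in a project.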

Results

====================== COMING SOON ======================
