Module Usage in Projects

Stephan Reichl edited this page Sep 26, 2024 · 9 revisions

As a concrete example, we will apply the unsupervised_analysis module to MyData stored in data/MyData.

Data

We provide a minimal example: an analysis of the UCI ML handwritten digits dataset, imported from sklearn:

  • Configuration
    • configuration: config/MyData/MyData_unsupervised_analysis_config.yaml
    • annotation: config/MyData/MyData_unsupervised_analysis_annotation.csv
  • Data
    • dataset (1797 observations, 64 features): data/MyData/digits_data.csv
    • metadata (consisting only of the ground truth label "target"): data/MyData/digits_labels.csv
  • Results will be generated in the configured results folder results/MyData/
  • Performance: on an HPC cluster, a full run completed in less than 7 minutes (with up to 32 GB of memory per task).

Code & Configuration

First, we register the configuration file for applying the unsupervised_analysis module to MyData in the project's config/config.yaml, using this predefined structure:

#### Datasets and Workflows to include ###
workflows:
    MyData:
        unsupervised_analysis: "config/MyData/MyData_unsupervised_analysis_config.yaml"

Tip

Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
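
The naming scheme above can also be applied programmatically, for example when many datasets and modules have to be registered. The following is a minimal Python sketch, not part of the module itself: the helper names config_path and build_workflows are hypothetical and only illustrate how the recommended paths and the nested workflows mapping fit together.

```python
def config_path(dataset_name: str, module: str) -> str:
    """Return the recommended config file path for a dataset/module pair."""
    return f"config/{dataset_name}/{dataset_name}_{module}_config.yaml"


def build_workflows(datasets_to_modules: dict) -> dict:
    """Build the nested mapping stored under the `workflows` key of config/config.yaml.

    `datasets_to_modules` maps each dataset name to a list of module names
    (a hypothetical input format chosen for this sketch).
    """
    return {
        ds: {module: config_path(ds, module) for module in modules}
        for ds, modules in datasets_to_modules.items()
    }


# Reproduces the structure of the `workflows` entry shown above.
workflows = build_workflows({"MyData": ["unsupervised_analysis"]})
print(workflows)
```

The resulting dictionary has the same shape as the workflows entry in config/config.yaml, so it could be dumped to YAML if the configuration is generated rather than written by hand.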

Second, within the main Snakefile (workflow/Snakefile), we have to do three things:

  • load and parse all configurations into a structured dictionary.
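    # load configs for all workflows and datasets
    import yaml  # PyYAML, needed for yaml.safe_load below (omit if already imported)

    # maps "{dataset_name}_{module}" to the parsed module configuration
    config_wf = dict()
    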
    for ds in config["workflows"]:
        for wf in config["workflows"][ds]:
            with open(config["workflows"][ds][wf], 'r') as stream:
                try:
                    config_wf[ds+'_'+wf]=yaml.safe_load(stream)
                except yaml.YAMLError as exc:
                    print(exc)
  • include the analysis snakefile workflow/rules/MyData.smk from the rules subfolder (see last step).
    ##### load rules (one per dataset) #####
    include: os.path.join("rules", "MyData.smk")
  • require all outputs from the used module as inputs to the target rule all.
    #### Target Rule ####
    rule all:
        input:
            #### MyData Analysis
            rules.MyData_unsupervised_analysis_all.input,
            ...
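
The configuration-loading step in the list above determines the keys that are used later to hand each module its configuration. As a hedged illustration, the loop can be written as a standalone Python function. This is a sketch under assumptions: the function name load_workflow_configs and the injectable read_config parameter are hypothetical, and a stand-in reader replaces the file access and YAML parsing so that the key construction can be inspected without any files on disk.

```python
from typing import Any, Callable, Dict


def load_workflow_configs(
    workflows: Dict[str, Dict[str, str]],
    read_config: Callable[[str], Any],
) -> Dict[str, Any]:
    """Flatten the nested `workflows` mapping into {dataset}_{module} keys.

    `read_config` receives the config file path and returns the parsed
    content; in a real Snakefile this would open the file and parse the YAML.
    """
    config_wf = {}
    for ds, modules in workflows.items():
        for wf, path in modules.items():
            # Same key construction as in the Snakefile loop: ds + '_' + wf
            config_wf[f"{ds}_{wf}"] = read_config(path)
    return config_wf


def fake_reader(path: str) -> dict:
    """Stand-in reader for this sketch; it only records the path it was given."""
    return {"loaded_from": path}


config_wf = load_workflow_configs(
    {"MyData": {"unsupervised_analysis": "config/MyData/MyData_unsupervised_analysis_config.yaml"}},
    fake_reader,
)
print(sorted(config_wf))
```

The single key produced here, MyData_unsupervised_analysis, is the one that must be used when passing the configuration to the module in the dataset-specific snakefile.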

Finally, within the dedicated snakefile for the analysis of MyData (workflow/rules/MyData.smk), we do three things: load the specified version of the unsupervised_analysis module directly from GitHub, pass it the previously loaded configuration, and import all of its rules (*) under a common prefix.

# MyData Analysis

### MyData - Unsupervised Analysis ####
module MyData_unsupervised_analysis:
    snakefile:
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v2.0.0")
    config:
        config_wf["MyData_unsupervised_analysis"]

use rule * from MyData_unsupervised_analysis as MyData_unsupervised_analysis_*

Tip

Recommended naming scheme:

  • Dataset/project names: always in camelCase (underscores are not recommended), e.g., ATACstim.
  • Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk.
  • Module name: {dataset_name}_{module}
  • Prefix for the loaded rules: {dataset_name}_{module}_.
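
To keep the three derived names consistent, they can be generated from the dataset and module names in one place. The following Python sketch is illustrative only: derive_names and ModuleNames are hypothetical helpers, and the sketch assumes that the recommendation against underscores in dataset names is enforced as a hard rule, which is stricter than the tip itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleNames:
    rule_file: str    # dataset-specific rule file
    module_name: str  # name used in the `module` statement
    rule_prefix: str  # prefix used in `use rule * from ... as <prefix>*`


def derive_names(dataset_name: str, module: str) -> ModuleNames:
    """Derive the recommended names for one dataset/module pair."""
    # Assumption of this sketch: reject underscores instead of only discouraging them.
    if "_" in dataset_name:
        raise ValueError(f"dataset name should not contain '_': {dataset_name!r}")
    module_name = f"{dataset_name}_{module}"
    return ModuleNames(
        rule_file=f"./workflow/rules/{dataset_name}.smk",
        module_name=module_name,
        rule_prefix=f"{module_name}_",
    )


names = derive_names("MyData", "unsupervised_analysis")
print(names)
```

For MyData this yields the rule file, module name, and prefix used in the example above; because the prefix ends in an underscore, the target MyData_unsupervised_analysis_all is simply the module's all rule with the prefix prepended.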

Results

====================== COMING SOON ======================