Module Usage in Projects
As a concrete example, we will apply the unsupervised_analysis module to MyData stored in data/MyData.
We provide a minimal example of the analysis of the UCI ML hand-written digits dataset imported from sklearn:
- Configuration
  - configuration: config/MyData/MyData_unsupervised_analysis_config.yaml
  - annotation: config/MyData/MyData_unsupervised_analysis_annotation.csv
- Data
  - dataset (1797 observations, 64 features): data/MyData/digits_data.csv
  - metadata (consisting only of the ground truth label "target"): data/MyData/digits_labels.csv
- Results will be generated in the configured results folder results/MyData/
- Performance: on an HPC it took less than 7 minutes to complete a full run (with up to 32GB of memory per task).
First, we register the configuration file for applying the unsupervised_analysis module to MyData in the project's config/config.yaml, using this specific and predefined structure:
#### Datasets and Workflows to include ###
workflows:
    MyData:
        unsupervised_analysis: "config/MyData/MyData_unsupervised_analysis_config.yaml"
Tip
Recommended folder and naming scheme for config files: config/{dataset_name}/{dataset_name}_{module}_config.yaml.
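The naming scheme above can be sketched as a small helper. This is only an illustration: the function name recommended_config_path is hypothetical and not part of the module or of Snakemake.

```python
from pathlib import PurePosixPath


def recommended_config_path(dataset_name: str, module: str) -> str:
    """Build config/{dataset_name}/{dataset_name}_{module}_config.yaml.

    Hypothetical helper, shown only to make the naming scheme concrete.
    """
    file_name = f"{dataset_name}_{module}_config.yaml"
    # PurePosixPath keeps forward slashes regardless of the operating system.
    return str(PurePosixPath("config") / dataset_name / file_name)


print(recommended_config_path("MyData", "unsupervised_analysis"))
# → config/MyData/MyData_unsupervised_analysis_config.yaml
```

The printed path matches the entry used for MyData in config/config.yaml.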
Second, within the main Snakefile (workflow/Snakefile) we have to do three things:
- Load and parse all configurations into a structured dictionary.

      import os
      import yaml

      # load configs for all workflows and datasets
      config_wf = dict()
      for ds in config["workflows"]:
          for wf in config["workflows"][ds]:
              with open(config["workflows"][ds][wf], 'r') as stream:
                  try:
                      config_wf[ds+'_'+wf]=yaml.safe_load(stream)
                  except yaml.YAMLError as exc:
                      print(exc)

- Include the workflow/rules/MyData.smk analysis snakefile from the rule subfolder (see last step).

      ##### load rules (one per dataset) #####
      include: os.path.join("rules", "MyData.smk")

- Require all outputs from the used module as inputs to the target rule all.

      #### Target Rule ####
      rule all:
          input:
              #### MyData Analysis
              rules.MyData_unsupervised_analysis_all.input,
              ...
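To make the resulting dictionary keys explicit, here is a minimal standalone sketch of the loading step. It assumes the nested workflows structure from config/config.yaml; the function load_workflow_configs and its parse argument are hypothetical, and parse stands in for opening the file and calling yaml.safe_load so that the sketch runs without any config files on disk.

```python
from typing import Any, Callable, Dict


def load_workflow_configs(
    workflows: Dict[str, Dict[str, str]],
    parse: Callable[[str], Any],
) -> Dict[str, Any]:
    """Flatten {dataset: {module: path}} into {"{dataset}_{module}": parsed config}."""
    config_wf: Dict[str, Any] = {}
    for ds, modules in workflows.items():
        for wf, path in modules.items():
            # Same key scheme as in the Snakefile: dataset name + "_" + module name.
            config_wf[f"{ds}_{wf}"] = parse(path)
    return config_wf


# Mirrors the "workflows" section of config/config.yaml.
workflows = {
    "MyData": {
        "unsupervised_analysis": "config/MyData/MyData_unsupervised_analysis_config.yaml",
    }
}

# Stand-in parser (assumption): records the path instead of reading YAML.
config_wf = load_workflow_configs(workflows, parse=lambda path: {"source": path})
print(list(config_wf))
# → ['MyData_unsupervised_analysis']
```

The key MyData_unsupervised_analysis is the one later used to look up the configuration passed to the module.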
Finally, within the dedicated snakefile for the analysis of MyData (workflow/rules/MyData.smk) we load the specified version of the unsupervised_analysis module directly from GitHub, provide it with the previously loaded configuration and use a prefix for all (*) loaded rules.
# MyData Analysis
### MyData - Unsupervised Analysis ####
module MyData_unsupervised_analysis:
    snakefile:
        github("epigen/unsupervised_analysis", path="workflow/Snakefile", tag="v2.0.0")
    config:
        config_wf["MyData_unsupervised_analysis"]

use rule * from MyData_unsupervised_analysis as MyData_unsupervised_analysis_*
Tip
Recommended nomenclature:
- Filename for the analysis/dataset-specific rule file: ./workflow/rules/{dataset_name}.smk
- Module name: {dataset_name}_{module}
- Prefix for the loaded rules: {dataset_name}_{module}_
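As an illustration of how these names fit together, the sketch below derives all of them from a dataset name and a module name. The class ModuleNaming and its members are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModuleNaming:
    """Hypothetical helper that derives the recommended names."""

    dataset_name: str
    module: str

    @property
    def rule_file(self) -> str:
        # Analysis/dataset-specific rule file.
        return f"./workflow/rules/{self.dataset_name}.smk"

    @property
    def module_name(self) -> str:
        return f"{self.dataset_name}_{self.module}"

    @property
    def rule_prefix(self) -> str:
        return f"{self.module_name}_"

    def prefixed_rule(self, rule: str) -> str:
        # Name a module rule receives once the prefix is applied.
        return f"{self.rule_prefix}{rule}"


names = ModuleNaming("MyData", "unsupervised_analysis")
print(names.rule_file)             # → ./workflow/rules/MyData.smk
print(names.module_name)           # → MyData_unsupervised_analysis
print(names.prefixed_rule("all"))  # → MyData_unsupervised_analysis_all
```

The last name is the one referenced in the target rule as rules.MyData_unsupervised_analysis_all.input.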
====================== COMING SOON ======================