mskcc/ACCESS-Pipeline

Getting Started

Disclaimer: Running the pipeline depends on certain dependencies being installed. Moving to Docker containers is the long-term solution for this. For now, these tools must be installed beforehand:

Tool                            Version
GCC                             4.4.7
glibc                           2.12
Java 7                          jdk1.7.0_75
Java 8                          jdk1.8.0_31
Python (must exist in PATH)     2.7.10
R (must exist in PATH)          3.5.0
Perl (must exist in PATH)       5.20.2
Node (must exist in PATH)       v6.10.1
Trimgalore                      v0.2.5 (also needs to have paths to fastqc and cutadapt updated manually)
BWA                             0.7.15-r1140
bedtools (must exist in PATH)   v2.26.0
Cutadapt                        1.1
Fastqc                          v0.10.1
Marianas                        1.5
Waltz                           2.0
Picard                          picard-2.8.1.jar
Picard AddOrReplaceReadGroups   AddOrReplaceReadGroups-1.96.jar
Picard FixMateInformation       FixMateInformation.jar (1.96)
GATK                            3.3.0
Abra                            2.17
  • HG19 Reference fasta + fai
  • dbSNP & Mills_1000G vcf + .vcf.idx files
  • Conda (miniconda is recommended)
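
Several of the tools above are marked "must exist in PATH". A minimal sketch of how that could be checked before installing is shown here. It is not part of the pipeline; the executable names (for example, that R is invoked as "R") and the helper names find_on_path and missing_tools are assumptions made for illustration.

```python
import os

# Executables the list above marks as "must exist in PATH".
# The command names themselves are assumptions.
REQUIRED_ON_PATH = ["python", "R", "perl", "node", "bedtools"]


def find_on_path(name, path=None):
    """Return the full path of an executable found on PATH, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        if not directory:
            continue  # skip empty PATH entries
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(names=REQUIRED_ON_PATH, path=None):
    """Return the subset of names that cannot be found on PATH."""
    return [name for name in names if find_on_path(name, path) is None]
```

Calling missing_tools() with no arguments returns the names that still need to be installed or added to PATH. It does not check versions.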

Provenance

These CWL modules and Python scripts originated from the Roslin pipeline at MSKCC.

Installation

Note: In these instructions, please replace 1.3.17 with the latest stable version of the pipeline (look for the latest green release on the Releases page).

1. Copy the latest release of the pipeline

(Make sure your virtualenv is active)

$ git clone https://github.com/mskcc/ACCESS-Pipeline.git --branch 1.3.17

2. Run the installation

This will create a new Conda environment and install the pipeline and its dependencies.

$ ./setup.sh

3. Update your environment variables:

Use the following script to get LUNA-specific environment variables for Toil and ACCESS dependencies.

(ACCESS) $ source ~/ACCESS-Pipeline/python_tools/pipeline_kickoff/workspace_init.sh

4. Update the run variables

Contact [email protected] or [email protected] for the latest ACCESS-specific interval lists, and get access to all of the required resources.

Then update the paths to these variables inside of the /resources folder.

5. Install Python libraries

Unfortunately, we use a combination of Conda and Pip to get all the pipeline requirements, so you must enter the Conda environment and install these libraries using pip.

$ source activate ACCESS

(ACCESS) $ pip install .

Running the test pipeline

NOTE: These steps should be run from a new directory, but still while inside your ACCESS Conda environment, and after sourcing the workspace_init.sh script.

1. Create a run title file from a sample manifest

(example manifests exist in /test/test_data/...)

(ACCESS) $ create_title_file_from_manifest \
  -i ~/ACCESS-Pipeline/test/test_data/umi-T_N-PanCancer/test_manifest.xlsx \
  -o test_title_file.txt

2. Create an inputs file from the title file

This step will create a file inputs.yaml, and pull in the run parameters (-t for test, -c for collapsing) and paths to run files from step 5.

(ACCESS) $ create_inputs_from_title_file \
  -i test_title_file.txt \
  -d ~/ACCESS-Pipeline/test/test_data/umi-T_N-PanCancer \
  -p TEST_run \
  -o inputs.yaml \
  -t \
  -f

3. Run the test pipeline

To run with the CWL reference implementation (faster for testing purposes):

(ACCESS) $ cwltool \
  --debug \
  --tmpdir-prefix ~/my_TEST_run \
  --cachedir ~/my_TEST_run \
  ~/ACCESS-Pipeline/workflows/ACCESS_pipeline.cwl \
  inputs.yaml

Where:

  • --debug enables debug-level logging
  • --tmpdir-prefix sets where temp directories are created
  • --cachedir sets where intermediate outputs are cached (useful for restarting after a failure)
  • the workflow file and the inputs file are both required
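
If the test run is launched from a script rather than typed by hand, the same cwltool invocation can be assembled as an argument list, which avoids line-continuation mistakes. This is a rough sketch only; build_cwltool_command is a hypothetical helper, not something shipped with the pipeline.

```python
import os


def build_cwltool_command(workflow, inputs, work_dir, debug=True):
    """Build the cwltool argument list for the test run described above."""
    work_dir = os.path.expanduser(work_dir)
    command = ["cwltool"]
    if debug:
        command.append("--debug")  # debug-level logging
    command += [
        "--tmpdir-prefix", work_dir,   # where to put temp directories
        "--cachedir", work_dir,        # where to cache intermediate outputs
        os.path.expanduser(workflow),  # the workflow (required)
        inputs,                        # the inputs to the workflow (required)
    ]
    return command
```

The returned list can be passed to subprocess.check_call from inside the ACCESS Conda environment.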

Or, to run with the Toil batch system runner:

(ACCESS) $ toil-cwl-runner ~/ACCESS-Pipeline/workflows/ACCESS_pipeline.cwl inputs.yaml

Running a real run

NOTE: These steps should be run from a new directory, but still while inside your virtual environment, and after sourcing the workspace_init.sh script.

I usually start pipeline runs from a fresh directory, with ample storage space. This is where the batch system log files will be written. However, these logs are different from the Toil log files, which will be placed alongside the pipeline outputs as specified by the output_location parameter. Both sets of log files can be quite large (up to ~50GB if running in debug mode on a large pool).
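
Since the logs can reach roughly 50GB, it may be worth checking free space in the run directory before submitting. The following is a minimal sketch for Unix systems; the function names and the idea of using 50 GB as the default threshold are assumptions for illustration, not pipeline behavior.

```python
import os


def free_gigabytes(directory):
    """Return the space available to the current user in directory, in GB."""
    stats = os.statvfs(directory)  # Unix only
    return stats.f_bavail * stats.f_frsize / float(1024 ** 3)


def has_room_for_logs(directory, required_gb=50.0):
    """Return True if directory has at least required_gb of free space."""
    return free_gigabytes(directory) >= required_gb
```

For example, has_room_for_logs(".") checks the current directory against the 50 GB figure mentioned above.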

Note that there are several validation requirements when running on your own data (see the example manifests in test/test_data):

  1. The header names in the sample manifest should match those in the examples in test/test_data
  2. The sample IDs in the manifest must appear somewhere in the paths to the fastqs and sample sheets from the -d data folder
  3. Each sample in the -d data folder must have these three files:
'_R1_001.fastq.gz'
'_R2_001.fastq.gz'
'SampleSheet.csv'
  4. The i5 and i7 barcode indexes from the manifest/title_file must match what is found in the SampleSheet.csv files (i5 may be reverse-complemented depending on the machine).
  5. The sample_class field must always be either "Tumor" or "Normal"
  6. The sample_type field must always be either "Plasma" or "Buffy Coat"
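
Some of the requirements above can be pre-checked on your own data before kicking off a run. The sketch below illustrates the file, barcode, and field rules only; it is not the pipeline's own validation code, and every function name in it is hypothetical.

```python
ALLOWED_SAMPLE_CLASSES = {"Tumor", "Normal"}
ALLOWED_SAMPLE_TYPES = {"Plasma", "Buffy Coat"}
REQUIRED_SUFFIXES = ("_R1_001.fastq.gz", "_R2_001.fastq.gz", "SampleSheet.csv")

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(barcode):
    """Return the reverse complement of a DNA barcode."""
    return "".join(_COMPLEMENT[base] for base in reversed(barcode.upper()))


def check_sample_fields(sample_class, sample_type):
    """Return a list of problems with the sample_class and sample_type values."""
    problems = []
    if sample_class not in ALLOWED_SAMPLE_CLASSES:
        problems.append("bad sample_class: {!r}".format(sample_class))
    if sample_type not in ALLOWED_SAMPLE_TYPES:
        problems.append("bad sample_type: {!r}".format(sample_type))
    return problems


def missing_sample_files(file_names):
    """Return the required suffixes that no file name in file_names ends with."""
    return [suffix for suffix in REQUIRED_SUFFIXES
            if not any(name.endswith(suffix) for name in file_names)]


def barcodes_match(manifest_i7, manifest_i5, sheet_i7, sheet_i5):
    """i7 must match exactly; i5 may match as-is or reverse-complemented."""
    return (manifest_i7 == sheet_i7 and
            sheet_i5 in (manifest_i5, reverse_complement(manifest_i5)))
```

For example, barcodes_match("ACGT", "AACG", "ACGT", "CGTT") is True because "CGTT" is the reverse complement of "AACG".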

Certain validation requirements can be skipped by using the -f parameter in the pipeline kickoff step.

Example:

1. Use the inputs generation scripts

These are the same as when used for running a test with cwltool or toil-cwl-runner. Note that if there are multiple lanes in the manifest the first script will create multiple title files on a per-lane basis.

(ACCESS) $ create_title_file_from_manifest \
  -i ~/manifests/ES_manifest.xlsx \
  -o ./ES_title_file.txt
(ACCESS) $ create_inputs_from_title_file \
  -i lane-5_ES_title_file.txt \
  -d /home/johnsoni/Data/JAX_0149_AHT3N3BBXX/Project_05500_ES \
  -p 5500-ES_lane-5 \
  -o inputs_lane_5.yaml
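
When a manifest yields several per-lane title files, the second command has to be repeated once per lane. A possible way to generate those commands is sketched below. The "lane-N_" file name prefix and the project and output naming are inferred from the single example above and should be treated as assumptions; build_inputs_command is a hypothetical helper.

```python
import os
import re

# Assumed from the example file name "lane-5_ES_title_file.txt".
LANE_PATTERN = re.compile(r"^lane-(\d+)_")


def build_inputs_command(title_file, data_dir, project):
    """Build the create_inputs_from_title_file argument list for one lane."""
    match = LANE_PATTERN.match(os.path.basename(title_file))
    if match is None:
        raise ValueError("no lane prefix in file name: " + title_file)
    lane = match.group(1)
    return [
        "create_inputs_from_title_file",
        "-i", title_file,
        "-d", data_dir,
        "-p", "{}_lane-{}".format(project, lane),
        "-o", "inputs_lane_{}.yaml".format(lane),
    ]
```

Each returned list can be run with subprocess.check_call, once per title file.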

2. Use the pipeline runner/submit scripts

Note that we use pipeline_submit here to submit both the leader job as well as the worker jobs to the cluster.

Right now the only supported options for the --batch_system parameter are lsf and singleMachine.

(ACCESS) $ pipeline_submit \
--output_location /home/johnsoni/projects/EJ_4-27_MarkDuplicatesTest \
--inputs_file ./inputs.yaml \
--workflow ~/ACCESS-Pipeline/workflows/ACCESS_pipeline.cwl \
--batch_system lsf

Or alternatively, use pipeline_runner to make use of the gridEngine, mesos, htcondor or slurm options.

This script can be run in the background with &, and will make use of worker nodes for the jobs themselves.

(ACCESS) $ pipeline_runner \
--output_location /home/projects/EJ_4-27_MarkDuplicatesTest \
--inputs_file ./inputs.yaml \
--workflow ~/ACCESS-Pipeline/workflows/ACCESS_pipeline.cwl \
--batch_system gridEngine

This will create the output directory (or, if --restart is given, restart a failed run in that output directory), and start the workflow using SGE.

3. Clean up the output files

Note: Do not run this step until the pipeline has completed. The way to ensure that the run has finished is to download and review the QC report PDF, which can be found in the QC_Results folder. Once you've confirmed that the run is completed, and all files have been copied to the final outputs folder, there is a script included to create symlinks to the output bams and delete unnecessary output folders left behind by Toil.

(ACCESS) ~$ pipeline_postprocessing -d <path/to/outputs>
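
As a guard against running the cleanup command above too early, a script could first look for the QC report. The sketch below assumes the QC_Results folder sits directly under the output directory and that the report has a .pdf extension; both are assumptions, and find_qc_reports is a hypothetical helper. Finding a PDF does not replace reviewing the report.

```python
import glob
import os


def find_qc_reports(output_dir, qc_folder="QC_Results"):
    """Return the sorted PDF paths found under <output_dir>/<qc_folder>."""
    pattern = os.path.join(output_dir, qc_folder, "*.pdf")
    return sorted(glob.glob(pattern))
```

An empty list suggests the run has not finished, so postprocessing should wait.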

4. Test the output files

There is a script included to check that the correct samples are paired in the correct folders, and that expected files are present in the final output directory.

(ACCESS) ~$ python -m python_tools.test.test_pipeline_outputs -o <path_to_outputs> -l debug

Issues

Bug reports and questions are helpful. Please report any issues, comments, or concerns to the issues page.

Documentation

Additional information can be found in the Wiki, including tips for CWL and Toil, and working with ACCESS log files.