Bioinformatic Components
For the script `somatic.nf`, the pipeline performs the following:

- `delly call` -> `delly filter`
- `snp-pileup` -> `facets`
- `manta` -> `strelka`
- `mutect2`

NOTE: `manta` -> `strelka` and `mutect2` were lifted and modified from Sarek's implementation at https://github.com/mskcc/Sarek.
Input File columns:

`"idTumor idNormal bamTumor bamNormal baiTumor baiNormal"`
Outputs:

They are found in `${params.outDir}/VariantCalling/<tool_name>`.
Variables used in pipeline:

- `genomeFile`: reference fasta
- `idTumor`: tumor sample name
- `idNormal`: normal sample name
- `bamTumor`: tumor bam
- `bamNormal`: normal bam
Execution on LSF:

```
nextflow run somatic.nf --sample test_inputs/lsf/test_somatic.tsv -profile juno --outDir $PWD
```

Execution on AWS:

```
nextflow run somatic.nf --sample test_inputs/aws/test_somatic.tsv -profile awsbatch
```
You can also specify which tool(s) to run with the `--tools <toolname(s), comma-delimited>` flag. If `--tools` is not specified, `somatic.nf` runs all tools: currently `delly`, `facets`, `manta`, `strelka2`, `mutect2`, and `msisensor`.
- Tool name `delly` runs processes `dellyCall`, `makeSampleFile`, and `dellyFilter`.
- Tool name `manta` runs ONLY the process `runManta`; both `manta` and `strelka2` are needed to run the process `runStrelka`.
- Tool name `facets` runs processes `doSNPPileup` and `doFacets`.
- Tool name `mutect2` runs the process `runMutect2`.
- Tool name `msisensor` runs the process `runMsiSensor`.
Example run of only `delly`, `manta`, and `strelka2` on LSF (note that the comma-delimited tool list must not contain spaces):

```
nextflow run somatic.nf --sample test_inputs/lsf/test_somatic.tsv -profile juno --outDir $PWD --tools delly,manta,strelka2
```
Input File columns:

`"sequenceType idTumor idNormal bamTumor bamNormal baiTumor baiNormal"`

Outputs:

They are found in `${params.outDir}/VariantCalling/${idTumor}_${idNormal}/<tool_name>`.
Variables used in pipeline:

- `genomeFile`: reference fasta
- `sequenceType`: either `exome` or `genome`; currently unused
- `idTumor`: tumor sample name
- `idNormal`: normal sample name
- `bamTumor`: tumor bam
- `bamNormal`: normal bam
delly
https://github.com/dellytools/delly
For now we assume that we want `delly call` performed for all five SV types; `sv_variant` is defined before the process is initiated, in a list named `sv_variants`. We use Nextflow to iterate through that list, submitting the processes in parallel.
```
outfile="${idTumor}_${idNormal}_\${sv_variant}.bcf"

delly call \
  -t "\${sv_variant}" \
  -o "\${outfile}" \
  -g ${genomeFile} \
  ${bamTumor} \
  ${bamNormal}
```
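As a rough sketch of how this fan-out can be wired in Nextflow (DSL1 style; the script body below is only a placeholder for the `delly call` command above, and the real process also takes the BAMs and reference as inputs):

```groovy
// One dellyCall task per SV type, submitted in parallel.
sv_variants = ['DEL', 'DUP', 'INV', 'BND', 'INS']

process dellyCall {
  input:
  each sv_variant from sv_variants

  output:
  file("${sv_variant}.bcf") into dellyCallOutput

  script:
  """
  # placeholder for the real `delly call -t ${sv_variant} ...` command
  touch ${sv_variant}.bcf
  """
}
```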
The next step, `delly filter`, requires a sample input file that contains a tumor ID and a normal ID that is considered "control"; this is done in the process `makeSampleFile`:

```
echo -e "${idTumor}\ttumor\n${idNormal}\tcontrol" > samples.tsv
```
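The resulting `samples.tsv` is a two-line, tab-separated file mapping each sample ID to its role; e.g. for a hypothetical pair `TumorA`/`NormalA`:

```
TumorA	tumor
NormalA	control
```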
The outputs of these per-`sv_variant` calls are then fed to another process, `dellyFilter`:
```
delly_call_file="${idTumor}_${idNormal}_\${sv_variant}.bcf"
outfile="${idTumor}_${idNormal}_\${sv_variant}.filter.bcf"

delly filter \
  -f somatic \
  -o "\${outfile}" \
  -s "samples.tsv" \
  "\${delly_call_file}"
```
mutect2
Running `mutect2` consists of five processes: `CreateIntervalBeds`, `runMutect2`, `indexVCF`, `runMutect2Filter`, and `combineMutect2VCF`. The output of `combineMutect2VCF` is later sent to the process `combineChannel` to be merged with `strelka2` output into one MAF file.
Before running `mutect2`, we use `SplitIntervals` (in process `CreateIntervalBeds`) against the interval file `human.b37.genome.bed`. Currently we only split into a `--scatter-count` of 10; we can adjust this in the future.
```
gatk SplitIntervals \
  -R ${genomeFile} \
  -L ${intervals} \
  --scatter-count 10 \
  --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW \
  -O interval_beds
```
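With a scatter count of 10, `SplitIntervals` writes ten numbered interval files into the `interval_beds` directory, named along the lines of the following (exact naming varies by GATK version):

```
interval_beds/0000-scattered.interval_list
interval_beds/0001-scattered.interval_list
...
interval_beds/0009-scattered.interval_list
```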
Currently our implementation of process `runMutect2` has hard-coded `-Xmx` parameters; these should be adjusted later.

NOTE: A `somatic.nf` run of `mutect2` will create, for each of the subsequent steps, as many processes as there are splits performed above (currently set at 10), up until `combineMutect2VCF`.
```
gatk --java-options "-Xmx8g" \
  Mutect2 \
  -R ${genomeFile} \
  -I ${bamTumor} -tumor ${idTumor} \
  -I ${bamNormal} -normal ${idNormal} \
  -O "${idTumor}_vs_${idNormal}_somatic.vcf"
```
The output of `runMutect2`, `mutect2Vcf`, is indexed with `tabix` (in process `indexVCF`):
```
tabix -p vcf ${mutect2Vcf}
```
This creates a `.tbi` file stored in `mutect2VcfIndex`. The indexed VCFs are then fed into `FilterMutectCalls` (in process `runMutect2Filter`):
```
# Xmx hard-coded for now due to lsf bug
gatk --java-options "-Xmx8g" \
  FilterMutectCalls \
  --variant "${mutect2Vcf}" \
  --output "${outfile}"
```
All generated, filtered `mutect2` VCF files are then submitted to the process `combineMutect2VCF`. This uses `bcftools concat | bcftools sort` to create the final `mutect2` output VCF.
```
outfile="${idTumor}_${idNormal}.mutect2.filtered.combined.vcf.gz"

bcftools concat ${mutect2Vcfs} | bcftools sort --output-type z --output-file ${outfile}
```
manta

```
configManta.py \
  --normalBam ${bamNormal} \
  --tumorBam ${bamTumor} \
  --reference ${genomeFile} \
  --runDir Manta

python Manta/runWorkflow.py -m local -j ${task.cpus}

mv Manta/results/variants/candidateSmallIndels.vcf.gz \
  Manta_${idTumor}_vs_${idNormal}.candidateSmallIndels.vcf.gz
mv Manta/results/variants/candidateSmallIndels.vcf.gz.tbi \
  Manta_${idTumor}_vs_${idNormal}.candidateSmallIndels.vcf.gz.tbi
mv Manta/results/variants/candidateSV.vcf.gz \
  Manta_${idTumor}_vs_${idNormal}.candidateSV.vcf.gz
mv Manta/results/variants/candidateSV.vcf.gz.tbi \
  Manta_${idTumor}_vs_${idNormal}.candidateSV.vcf.gz.tbi
mv Manta/results/variants/diploidSV.vcf.gz \
  Manta_${idTumor}_vs_${idNormal}.diploidSV.vcf.gz
mv Manta/results/variants/diploidSV.vcf.gz.tbi \
  Manta_${idTumor}_vs_${idNormal}.diploidSV.vcf.gz.tbi
mv Manta/results/variants/somaticSV.vcf.gz \
  Manta_${idTumor}_vs_${idNormal}.somaticSV.vcf.gz
mv Manta/results/variants/somaticSV.vcf.gz.tbi \
  Manta_${idTumor}_vs_${idNormal}.somaticSV.vcf.gz.tbi
```
strelka2
`mantaCSI` is `Manta_${idTumor}_vs_${idNormal}.candidateSmallIndels.vcf.gz`, made in the `manta` step and passed to `strelka2`:
```
configureStrelkaSomaticWorkflow.py \
  --tumor ${bamTumor} \
  --normal ${bamNormal} \
  --referenceFasta ${genomeFile} \
  --indelCandidates ${mantaCSI} \
  --runDir Strelka

python Strelka/runWorkflow.py -m local -j ${task.cpus}

mv Strelka/results/variants/somatic.indels.vcf.gz \
  Strelka_${idTumor}_vs_${idNormal}_somatic_indels.vcf.gz
mv Strelka/results/variants/somatic.indels.vcf.gz.tbi \
  Strelka_${idTumor}_vs_${idNormal}_somatic_indels.vcf.gz.tbi
mv Strelka/results/variants/somatic.snvs.vcf.gz \
  Strelka_${idTumor}_vs_${idNormal}_somatic_snvs.vcf.gz
mv Strelka/results/variants/somatic.snvs.vcf.gz.tbi \
  Strelka_${idTumor}_vs_${idNormal}_somatic_snvs.vcf.gz.tbi
```
The `combineChannel` process collects the VCF files from the `combineMutect2VCF` and `runStrelka` processes into one channel, `vcfOutputSet`. This allows parallel job submission for each VCF.
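A minimal sketch of this channel merge, assuming the upstream output channels are named `mutect2CombinedVcfOutput` and `strelkaOutput` (the real channel names in `somatic.nf` may differ):

```groovy
// Merge both tools' VCF outputs into a single channel; each VCF
// emitted by vcfOutputSet becomes its own downstream task.
vcfOutputSet = mutect2CombinedVcfOutput.mix(strelkaOutput)
```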
Process `runBCFToolsFilterNorm` performs three commands on the `${vcf}` file: `tabix` to index it; `bcftools filter` to filter the file and output a compressed VCF; and `bcftools norm` to perform a normalization step.
```
tabix -p vcf ${vcf}

bcftools filter \
  -r 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,MT,X,Y \
  --output-type z \
  "${vcf}" | \
bcftools norm \
  --fasta-ref ${genomeFile} \
  --output-type z \
  --output "${outfile}"
```
The process `runBCFToolsMerge` takes `vcfFilterNormOutput` as input. Each `${vcf}` in `vcfFilterNormOutput` is first indexed with `tabix` and then fed into `bcftools merge`. The output is an uncompressed VCF file so that it can be processed by `vcf2maf`. Output is sent to the channel `vcfMergedOutput`.
```
for f in *.vcf.gz
do
  tabix -p vcf \$f
done

bcftools merge \
  --force-samples \
  --merge none \
  --output-type v \
  --output "${idTumor}_${idNormal}.mutect2.strelka2.filtered.norm.merge.vcf" \
  *.vcf.gz
```
Process `runVCF2MAF` takes the VCF file in channel `vcfMergedOutput` and runs `vcf2maf` to make one MAF file. The hard-coded paths in the script below are relative to the container used by the process.
```
perl /opt/vcf2maf.pl \
  --input-vcf ${vcfMerged} \
  --tumor-id ${idTumor} \
  --normal-id ${idNormal} \
  --vep-path /opt/vep/src/ensembl-vep \
  --vep-data ${vepCache} \
  --filter-vcf ${vcf2mafFilterVcf} \
  --output-maf ${outfile} \
  --ref-fasta ${genomeFile}
```
facets
`doFacets.R` requires a counts file, so `snp-pileup` is run first in process `doSNPPileup`:
```
output_filename = idTumor + "_" + idNormal + ".snppileup.dat.gz"

snp-pileup -A -P 50 --gzip "${facetsVcf}" "${output_filename}" "${bamTumor}" "${bamNormal}"
```
Process `doFacets` needs its parameters generalized, but for now it is shown below:
```
snp_pileup_prefix = idTumor + "_" + idNormal
counts_file = "${snp_pileup_prefix}.snppileup.dat.gz"
genome_value = "hg19"
TAG = "${snp_pileup_prefix}"
directory = "."

/usr/bin/facets-suite/doFacets.R \
  --cval 100 \
  --snp_nbhd 250 \
  --ndepth 35 \
  --min_nhet 25 \
  --purity_cval 500 \
  --purity_snp_nbhd 250 \
  --purity_ndepth 35 \
  --purity_min_nhet 25 \
  --genome "${genome_value}" \
  --counts_file "${counts_file}" \
  --TAG "${TAG}" \
  --directory "${directory}" \
  --R_lib latest \
  --single_chrom F \
  --ggplot2 T \
  --seed 1000 \
  --tumor_id "${idTumor}"
```
msisensor
https://github.com/ding-lab/msisensor

`msiSensorList` is defined in `references.config`.
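As a purely hypothetical illustration of what that entry could look like (the actual key and path in `references.config` may differ), the microsatellite list is typically a file generated by `msisensor scan` against the same reference fasta:

```groovy
// Hypothetical references.config entry; check the real file for the
// actual parameter name and path.
params {
  msiSensorList = '/path/to/b37_microsatellites.list'
}
```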
```
output_prefix = "${idTumor}_${idNormal}"

msisensor msi -d "${msiSensorList}" -t "${bamTumor}" -n "${bamNormal}" -o "${output_prefix}"
```
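For a paired tumor/normal run, `msisensor msi` writes its summary to `${output_prefix}` plus companion files; per the msisensor documentation these are typically:

```
${output_prefix}            # overall MSI score summary
${output_prefix}_dis        # read-count distributions per microsatellite site
${output_prefix}_somatic    # sites called somatic
${output_prefix}_germline   # sites called germline
```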