Commit 78750b7

Merge pull request #1158 from maxplanck-ie/allelic_whatshap_dna

Allelic whatshap dna

2 parents f177eb2 + 3efd1b8

File tree

15 files changed: +297 −16 lines changed

.github/conda_ci.yml

Lines changed: 1 addition & 0 deletions

@@ -1,6 +1,7 @@
 name: conda_ci
 dependencies:
   - python>=3.11,<3.13
+  - python_abi
   - anaconda-client
   - conda-build
   - conda-verify

docs/content/News.rst

Lines changed: 1 addition & 0 deletions

@@ -5,6 +5,7 @@ snakePipes 3.5.0
 ----------------

 * Allelic-whatshap mode was added to DNAmapping. Requires --phasedVcf.
+* --useSpikeinForNorm option to split bam files by host and spikein genome, and to generate spikein-normalized bamCoverage tracks, was added to DNAmapping.
 * Allelic-whatshap mode now works with alignment-free mode and fromBAM in the mRNAseq workflow.

docs/content/cookbook.rst

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ Process and analyze Cut&Tag data for chromatin binding. Fastq files are mapped t

 .. code-block:: bash

-    DNAmapping --cutntag --trim --trimmerOptions ' -a nexteraF=CTGTCTCTTATA -A nexteraR=CTGTCTCTTATA ' --fastqc --dedup --mapq 3 -i $input_folder -o analysis_dedup customIndices/GRch38_dm6/GRCh38_g31_dm6.yaml
+    DNAmapping --cutntag --trim --trimmerOptions ' -a nexteraF=CTGTCTCTTATA -A nexteraR=CTGTCTCTTATA ' --fastqc --dedup --useSpikeinForNorm --mapq 3 -i $input_folder -o analysis_dedup customIndices/GRch38_dm6/GRCh38_g31_dm6.yaml

 .. code-block:: bash

docs/content/workflows/DNAmapping.rst

Lines changed: 8 additions & 0 deletions

@@ -73,6 +73,12 @@ There is a configuration file in ``snakePipes/workflows/DNAmapping/defaults.yaml
 fragmentLength: 200
 qualimap: false
 verbose: false
+#phased vcf
+pvcf:
+#arguments related to hybrid genome alignment
+splitHybridGenome: False
+spikeinExt: "_spikein"
+spikein_bin_size: 1000

 Many of these options can be more conveniently set on the command-line. However, you may need to change the ``reads:`` setting if your paired-end files are not denoted by ``sample_R1.fastq.gz`` and ``sample_R2.fastq.gz``, but rather ``sample_1.fastq.gz`` and ``sample_2.fastq.gz``.

@@ -119,6 +125,8 @@ A number of other directories may optionally be present if you specified read tr

 A fair number of useful QC plots are or can be generated by the pipeline. These include correlation and PCA plots as well as the output from MultiQC.

+Specifying `--useSpikeinForNorm` will cause the bam files aligned to a hybrid host and spikein genome to be split into correspondingly named parts. Split bam files will be found in the "split_bams" folder. deepTools QC for the split bam files will be performed and the results will be found in the "split_deepTools_qc" folder. Spikein-scaled bam coverage tracks will be output in the "bamCoverage" folder.
+
 .. image:: ../images/DNAmapping_correlation.png

 Command line options

pyproject.toml

Lines changed: 1 addition & 1 deletion

@@ -61,7 +61,7 @@ build = [
 ]

 [tool.setuptools.package-data]
-snakePipes = ["**/*.yaml", "**/*.R", "**/*.Rmd", "**/*snakefile", "**/*Snakefile", "**/*sh", "**/*txt"]
+snakePipes = ["**/*.yaml", "**/*.R", "**/*.py", "**/*.Rmd", "**/*snakefile", "**/*Snakefile", "**/*sh", "**/*txt"]

 [project.scripts]
 ATACseq = "snakePipes.workflows.ATACseq.ATACseq:main"

snakePipes/shared/rules/deepTools_qc_allelic.snakefile

Lines changed: 3 additions & 2 deletions

@@ -46,14 +46,15 @@ rule multiBamSummary_allelic:
         bams = expand("allelic_bams/{sample}.{suffix}.sorted.bam", sample=samples, suffix = ['genome1', 'genome2']),
         bais = expand("allelic_bams/{sample}.{suffix}.sorted.bam.bai", sample=samples, suffix = ['genome1', 'genome2'])
     output:
-        npz = "deepTools_qc/multiBamSummary/read_coverage_allelic.bins.npz"
+        npz = "deepTools_qc/multiBamSummary/read_coverage_allelic.bins.npz",
+        sf = "deepTools_qc/multiBamSummary/allelic.scaling_factors.txt"
     params:
         labels = " ".join(expand('{sample}.{suffix}', sample=samples, suffix = ['genome1', 'genome2'])),
         blacklist = "--blackListFileName " + blacklist_bed if blacklist_bed
                     else "",
         read_extension = "--extendReads" if pairedEnd
                          else "--extendReads " + str(fragmentLength),
-        scaling_factors = "",
+        scaling_factors = "--scalingFactors deepTools_qc/multiBamSummary/allelic.scaling_factors.txt",
         binSize = "",
         spikein_region = ""
     benchmark:
Lines changed: 41 additions & 0 deletions

@@ -0,0 +1,41 @@
+from pathlib import Path
+tools_dir = Path(maindir) / "shared" / "tools"
+
+
+def get_scaling_factor(sample, input):
+    sample_names = []
+    scale_factors = []
+    if os.path.isfile(os.path.join(outdir, input)):
+        with open(os.path.join(outdir, input)) as f:
+            for idx, line in enumerate(f):
+                if idx > 0:
+                    sample_names.append(line.split('\t')[0])
+                    scale_factors.append((line.split('\t')[1]).rstrip("\n"))
+        sf_dict = dict(zip(sample_names, scale_factors))
+        scale_factor = sf_dict[sample]
+        return float(scale_factor)
+    else:
+        return float(1)
+
+
+rule bamCoverage_spikein:
+    input:
+        bam = "allelic_bams/{sample}.{suffix}.sorted.bam",
+        bai = "allelic_bams/{sample}.{suffix}.sorted.bam.bai",
+        scale_factors = "split_deepTools_qc/multiBamSummary/spikein.scaling_factors.txt"
+    output:
+        "bamCoverage/allele_specific/{sample}.{suffix}.host_scaled.BYspikein.bw"
+    params:
+        bwBinSize = bwBinSize,
+        genome_size = int(genome_size),
+        ignoreForNorm = "--ignoreForNormalization {}".format(ignoreForNormalization) if ignoreForNormalization else "",
+        read_extension = "--extendReads" if pairedEnd
+                         else "--extendReads {}".format(fragmentLength),
+        blacklist = "--blackListFileName {}".format(blacklist_bed) if blacklist_bed
+                    else "",
+        scaling_factors = lambda wildcards, input: "--scaleFactor {}".format(get_scaling_factor(wildcards.sample, input.scale_factors))  ## subset for the one factor needed
+    benchmark:
+        "bamCoverage/allele_specific/.benchmark/bamCoverage.{sample}.{suffix}_BYspikein.benchmark"
+    threads: lambda wildcards: 16 if 16 < max_thread else max_thread  # 4GB per core
+    conda: CONDA_SHARED_ENV
+    shell: bamcov_spikein_cmd
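The `get_scaling_factor` helper above reads the scaling-factors table that the multiBamSummary rule writes out. A minimal standalone sketch of the same parsing logic, reading from a string instead of a file on disk (the sample names, header label, and factor values below are invented for illustration):

```python
import io

def parse_scaling_factor(sample, text, default=1.0):
    # The table has a header row, then one "sample<TAB>factor" row per bam.
    factors = {}
    for idx, line in enumerate(io.StringIO(text)):
        if idx == 0:
            continue  # skip the header row
        fields = line.rstrip("\n").split("\t")
        factors[fields[0]] = float(fields[1])
    # Mirror the rule's fallback of 1.0 when no factor is available.
    return factors.get(sample, default)

tsv = "sample\tscalingFactor\nS1.genome1\t0.85\nS1.genome2\t1.12\n"
print(parse_scaling_factor("S1.genome2", tsv))  # 1.12
```

The lambda in `params` then interpolates this per-sample value into a single `--scaleFactor` argument for bamCoverage.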
Lines changed: 95 additions & 0 deletions

@@ -0,0 +1,95 @@
+part = ['host', 'spikein']
+blacklist_dict = {"host": blacklist_bed, "spikein": spikein_blacklist_bed}
+region_dict = {"host": " ".join(host_chr.keys()), "spikein": " ".join(spikein_chr.keys())}
+
+
+def get_scaling_factor(sample, input):
+    sample_names = []
+    scale_factors = []
+    if os.path.isfile(os.path.join(outdir, input)):
+        with open(os.path.join(outdir, input)) as f:
+            for idx, line in enumerate(f):
+                if idx > 0:
+                    sample_names.append(line.split('\t')[0])
+                    scale_factors.append((line.split('\t')[1]).rstrip("\n"))
+        sf_dict = dict(zip(sample_names, scale_factors))
+        scale_factor = sf_dict[sample]
+        return float(scale_factor)
+    else:
+        return float(1)
+
+
+rule split_bamfiles_by_genome:
+    input:
+        bam = "filtered_bam/{sample}.filtered.bam",
+        bai = "filtered_bam/{sample}.filtered.bam.bai"
+    output:
+        bam = "split_bam/{sample}_{part}.bam",
+        bai = "split_bam/{sample}_{part}.bam.bai"
+    params:
+        region = lambda wildcards: region_dict[wildcards.part]
+    conda: CONDA_SAMBAMBA_ENV
+    threads: 4
+    shell: """
+        sambamba slice -o {output.bam} {input.bam} {params.region};
+        sambamba index -t {threads} {output.bam}
+        """
+
+rule multiBamSummary_by_part:
+    input:
+        bams = lambda wildcards: expand("split_bam/{sample}_{part}.bam", sample=samples, part=wildcards.part),
+        bais = lambda wildcards: expand("split_bam/{sample}_{part}.bam.bai", sample=samples, part=wildcards.part)
+    output:
+        npz = "split_deepTools_qc/multiBamSummary/{part}_read_coverage.bins.npz",
+        scale_factors = "split_deepTools_qc/multiBamSummary/{part}.scaling_factors.txt"
+    params:
+        labels = " ".join(samples),
+        blacklist = lambda wildcards: "--blackListFileName {}".format(blacklist_dict[wildcards.part]) if blacklist_dict[wildcards.part] else "",
+        read_extension = "--extendReads" if pairedEnd
+                         else "--extendReads {}".format(fragmentLength),
+        scaling_factors = "--scalingFactors split_deepTools_qc/multiBamSummary/{part}.scaling_factors.txt",
+        binSize = lambda wildcards: " --binSize " + str(spikein_bin_size) if wildcards.part == "spikein" else "",
+        spikein_region = lambda wildcards: " --region " + spikein_region if ((wildcards.part == "spikein") and (spikein_region != "")) else ""
+    benchmark:
+        "split_deepTools_qc/.benchmark/{part}_multiBamSummary.benchmark"
+    threads: lambda wildcards: 24 if 24 < max_thread else max_thread
+    conda: CONDA_SHARED_ENV
+    shell: multiBamSummary_cmd
+
+
+rule bamCoverage_by_part:
+    input:
+        bam = "split_bam/{sample}_host.bam",
+        bai = "split_bam/{sample}_host.bam.bai",
+        scale_factors = "split_deepTools_qc/multiBamSummary/{part}.scaling_factors.txt"
+    output:
+        "bamCoverage/{sample}.host_scaled.BY{part}.bw"
+    params:
+        bwBinSize = bwBinSize,
+        genome_size = int(genome_size),
+        ignoreForNorm = "--ignoreForNormalization {}".format(ignoreForNormalization) if ignoreForNormalization else "",
+        read_extension = "--extendReads" if pairedEnd
+                         else "--extendReads {}".format(fragmentLength),
+        blacklist = "--blackListFileName {}".format(blacklist_bed) if blacklist_bed
+                    else "",
+        scaling_factors = lambda wildcards, input: "--scaleFactor {}".format(get_scaling_factor(wildcards.sample, input.scale_factors))  ## subset for the one factor needed
+    benchmark:
+        "bamCoverage/.benchmark/bamCoverage.{sample}.BY{part}.filtered.benchmark"
+    threads: lambda wildcards: 16 if 16 < max_thread else max_thread  # 4GB per core
+    conda: CONDA_SHARED_ENV
+    shell: bamcov_spikein_cmd
+
+
+rule bamPE_fragment_size_by_part:
+    input:
+        bams = lambda wildcards: expand("split_bam/{sample}_{part}.bam", sample=samples, part=wildcards.part),
+        bais = lambda wildcards: expand("split_bam/{sample}_{part}.bam.bai", sample=samples, part=wildcards.part)
+    output:
+        "split_deepTools_qc/bamPEFragmentSize/{part}.fragmentSize.metric.tsv"
+    params:
+        plotcmd = lambda wildcards: "" if plotFormat == 'None' else
+                  "-o split_deepTools_qc/bamPEFragmentSize/" + wildcards.part + ".fragmentSizes.{}".format(plotFormat)
+    threads: lambda wildcards: 24 if 24 < max_thread else max_thread
+    conda: CONDA_SHARED_ENV
+    shell: bamPEFragmentSize_cmd
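The `split_bamfiles_by_genome` rule drives `sambamba slice` with a space-separated list of chromosome names for each genome part. A small sketch of how `region_dict` resolves into the rendered shell command (the chromosome names, sizes, and sample name below are hypothetical):

```python
# Hypothetical chromosome tables for a hybrid host + spike-in genome.
host_chr = {"1": 195471971, "2": 182113224}
spikein_chr = {"2L_spikein": 23513712, "2R_spikein": 25286936}

# Same construction as in the snakefile: one region string per part.
region_dict = {"host": " ".join(host_chr.keys()),
               "spikein": " ".join(spikein_chr.keys())}

# What the rule's shell block renders to for one sample and part.
cmd = "sambamba slice -o {out} {bam} {region}".format(
    out="split_bam/sample1_spikein.bam",
    bam="filtered_bam/sample1.filtered.bam",
    region=region_dict["spikein"])
print(cmd)
```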

snakePipes/shared/rules/whatshap.snakefile

Lines changed: 8 additions & 2 deletions

@@ -1,9 +1,15 @@
+def collect_bam():
+    bam = "filtered_bam/{sample}.filtered.bam"
+    if pipeline == "dnamapping" and useSpikeInForNorm:
+        bam = "split_bam/{sample}_host.bam"
+    return(bam)
+
 rule whatshap_haplotag:
     input:
         ref = genome_fasta,
         pvcf = pvcf,
-        bam = "filtered_bam/{sample}.filtered.bam",
-        bai = "filtered_bam/{sample}.filtered.bam.bai"
+        bam = collect_bam(),
+        bai = lambda wildcards: f"{collect_bam()}.bai"
    output:
        hbam = "allelic_bams/{sample}.allele_flagged.sorted.bam",
        hlist = "allelic_bams/{sample}_haplotype_list.tsv"
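The new `collect_bam` switch can be exercised on its own. A minimal sketch with the two workflow globals passed in explicitly rather than read from the Snakemake namespace:

```python
def collect_bam(pipeline, useSpikeInForNorm):
    # With spike-in normalization active in DNAmapping, haplotagging
    # consumes the host-only split bam; otherwise the plain filtered bam.
    bam = "filtered_bam/{sample}.filtered.bam"
    if pipeline == "dnamapping" and useSpikeInForNorm:
        bam = "split_bam/{sample}_host.bam"
    return bam

print(collect_bam("dnamapping", True))   # split_bam/{sample}_host.bam
print(collect_bam("dnamapping", False))  # filtered_bam/{sample}.filtered.bam
```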

snakePipes/workflows/ChIPseq/internals.snakefile

Lines changed: 15 additions & 5 deletions

@@ -6,6 +6,7 @@ from ruamel.yaml import YAML
 import sys
 import pandas as pd
 import warnings
+from itertools import chain

 ### Functions ##################################################################

@@ -230,18 +231,27 @@ if sampleSheet:
     filtered_dict = filter_dict(sampleSheet, dict(zip(chip_samples_w_ctrl, [get_control_name(x) for x in chip_samples_w_ctrl])))
 else:
     filtered_dict = filter_dict(sampleSheet, dict(zip(chip_samples_wo_ctrl, [None]*len(chip_samples_wo_ctrl))))
+print(filtered_dict)
 genrichDict = cf.sampleSheetGroups(sampleSheet, isMultipleComparison)
 if not isMultipleComparison:
     for k in genrichDict.keys():
         genrichDict[k] = [item for item in genrichDict[k] if item in chip_samples]
     reordered_dict = {k: filtered_dict[k] for k in [item for sublist in genrichDict.values() for item in sublist]}
 else:
-    #print(genrichDict)
+    print(genrichDict)
     reordered_dict = {}
-    for g in genrichDict.keys():
-        for k in genrichDict[g].keys():
-            genrichDict[g][k] = [item for item in genrichDict[g][k] if item in chip_samples]
-        reordered_dict[g] = {k: filtered_dict[k] for k in [item for sublist in genrichDict[g].values() for item in sublist]}
+    #for g in genrichDict.keys():
+    #    for k in genrichDict[g].keys():
+    #        genrichDict[g][k] = [item for item in genrichDict[g][k] if item in chip_samples]
+    #    reordered_dict[g] = {k: filtered_dict[k] for k in [item for sublist in genrichDict[g].values() for item in sublist]}
+    for g in genrichDict:
+        # filter each condition list to only chip samples
+        for k in genrichDict[g]:
+            genrichDict[g][k] = [item for item in genrichDict[g][k] if item in chip_samples]
+
+        # flatten the lists and build mapping, skipping missing keys just in case
+        flattened = chain.from_iterable(genrichDict[g].values())
+        reordered_dict[g] = {fk: filtered_dict[fk] for fk in flattened if fk in filtered_dict}
 else:
     genrichDict = {"all_samples": chip_samples}
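The rewritten multiple-comparison branch flattens the per-group sample lists with `itertools.chain.from_iterable` before rebuilding the ordered sample-to-control mapping. The same pattern in isolation (the group, sample, and control names here are invented):

```python
from itertools import chain

# One comparison group with two conditions, each holding ChIP sample names.
genrich_group = {"WT": ["s1", "s2"], "KO": ["s3"]}
# Sample -> control mapping produced earlier by filter_dict.
filtered_dict = {"s1": "Input1", "s2": "Input1", "s3": "Input2"}

# Flatten the condition lists in order, then rebuild the mapping,
# skipping any sample that has no entry in filtered_dict.
flattened = chain.from_iterable(genrich_group.values())
reordered = {fk: filtered_dict[fk] for fk in flattened if fk in filtered_dict}
print(reordered)  # {'s1': 'Input1', 's2': 'Input1', 's3': 'Input2'}
```

Compared with the commented-out nested-comprehension version, the `chain` form also guards against samples missing from `filtered_dict` instead of raising a `KeyError`.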
