This pipeline processes raw UMI-ATAC-seq data: it removes the sequencing adapters, extracts UMIs from the original FASTQ read1 file, removes ME sequences, and uses the UMIs to remove PCR duplicates.
UMI-ATAC-dedup is mainly tested in Python 3. It requires the Python modules gzip, Bio.SeqIO.QualityIO, fuzzysearch and pysam. It also requires the software UMI tools, trimmomatic and bbmap.
To install these packages with conda run:
Run each Python program with the -h argument for detailed help on its command-line parameters.
After removing the sequencing adapters, we use the extract function in the UMI tools package. This program extracts UMIs from Illumina sequence reads and adds them to the FASTQ read1 and read2 headers. Here we set --bc-pattern=NNNNNN, which takes the first six bases of read1 as the UMI sequence. Paired-end UMI-ATAC-seq data can be processed like this:
$ umi_tools extract --stdin=pair.1.fastq.gz --bc-pattern=NNNNNN --read2-in=pair.2.fastq.gz --stdout=processed.1.fastq.gz --read2-out=processed.2.fastq.gz
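To make the effect of --bc-pattern=NNNNNN concrete, here is a minimal Python sketch of the idea: the first six bases of read1 are cut off and appended to the identifier of both mates. It is an illustration only, not the UMI tools implementation; the function name `extract_umi` and the tuple layout are assumptions made for this example.

```python
def extract_umi(read1, read2, umi_len=6):
    """Move the first `umi_len` bases of read 1 into both read names.

    Each read is a (name, sequence, quality) tuple; names carry no leading '@'.
    Hypothetical helper for illustration, not part of UMI tools.
    """
    name1, seq1, qual1 = read1
    name2, seq2, qual2 = read2
    umi = seq1[:umi_len]

    def tag(name):
        # Append the UMI to the identifier (the text before the first space)
        # and leave any description after it unchanged.
        identifier, sep, rest = name.partition(" ")
        return f"{identifier}_{umi}{sep}{rest}"

    # Read 1 loses the UMI bases and their qualities; read 2 is only renamed.
    return (tag(name1), seq1[umi_len:], qual1[umi_len:]), (tag(name2), seq2, qual2)


r1 = ("read1 1:N:0", "ACGTTAGGCATT", "IIIIIIFFFFFF")
r2 = ("read1 2:N:0", "TTGACC", "FFFFFF")
new_r1, new_r2 = extract_umi(r1, r2)
print(new_r1)  # ('read1_ACGTTA 1:N:0', 'GGCATT', 'FFFFFF')
print(new_r2)  # ('read1_ACGTTA 2:N:0', 'TTGACC', 'FFFFFF')
```

Because the UMI travels in the read name, it survives alignment and can be read back from the BAM file during deduplication.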
This program removes the ME sequence (AGATGTGTATAAGAGACAG) and everything before it (both bases and qualities) from each read in the FASTQ read1 file. It reads and writes FASTQ format; both the input and the output are gzip-compressed (.gz).
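The trimming step can be sketched in a few lines of Python. This version is a simplification under stated assumptions: it uses an exact string match, whereas the fuzzysearch dependency suggests the actual script tolerates mismatches, and it leaves reads without an ME unchanged, which may not match the script's behaviour. The function name `remove_me` is chosen for the example, and gzip/FASTQ I/O is omitted.

```python
ME = "AGATGTGTATAAGAGACAG"


def remove_me(seq, qual, me=ME):
    """Trim the ME sequence and everything before it from one read.

    Exact matching only (assumption); reads without an ME are returned as-is.
    """
    start = seq.find(me)
    if start == -1:
        return seq, qual
    end = start + len(me)
    # Sequence and quality strings are cut at the same position so they stay aligned.
    return seq[end:], qual[end:]


seq = "TTGCA" + ME + "GGATCCAT"
qual = "I" * 5 + "F" * len(ME) + "ABCDEFGH"
trimmed_seq, trimmed_qual = remove_me(seq, qual)
print(trimmed_seq)   # GGATCCAT
print(trimmed_qual)  # ABCDEFGH
```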
After removing the ME sequence from the FASTQ read1 file, we need to re-pair the read1 and read2 files so that every read has its mate. Here we use the repair.sh tool from bbmap. It pairs umi_fastq_read2.gz with umi_fastq_read1_rm_me.gz (generated by remove_me.py).
$ repair.sh in1=umi_fastq_read1_rm_me.gz in2=umi_fastq_read2.gz out1=umi_read1_paired.fq out2=umi_read2_paired.fq
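Conceptually, re-pairing means keeping only reads whose identifier is present in both files and writing the mates in the same order. The sketch below shows that idea on in-memory lists; it does not reproduce how repair.sh works internally, and `repair_pairs` is a hypothetical name used only here.

```python
def repair_pairs(reads1, reads2):
    """Keep reads whose identifier occurs in both lists, in read 1 order.

    Each read is a (name, sequence, quality) tuple. Illustrative only.
    """
    def identifier(read):
        # Compare on the text before the first whitespace in the read name.
        return read[0].split()[0]

    by_id2 = {identifier(r): r for r in reads2}
    paired1, paired2 = [], []
    for r in reads1:
        mate = by_id2.get(identifier(r))
        if mate is not None:
            paired1.append(r)
            paired2.append(mate)
    return paired1, paired2


reads1 = [("a_ACGTTA 1", "GGC", "FFF"), ("c_TTGACA 1", "TTA", "FFF")]
reads2 = [("a_ACGTTA 2", "CCA", "FFF"), ("b_GGATCC 2", "AAT", "FFF"), ("c_TTGACA 2", "GGA", "FFF")]
p1, p2 = repair_pairs(reads1, reads2)
print([r[0] for r in p1])  # ['a_ACGTTA 1', 'c_TTGACA 1']
print([r[0] for r in p2])  # ['a_ACGTTA 2', 'c_TTGACA 2']
```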
This program removes PCR duplicates using mapping coordinates only. Other software (such as Picard or samtools) can also do this.
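As a rough illustration of coordinate-based deduplication, the sketch below keeps the first fragment seen at each position and drops the rest. The dictionary fields and the function name `dedup_by_coordinates` are assumptions for the example; the actual program works on aligned reads and may define the key differently.

```python
def dedup_by_coordinates(fragments):
    """Keep the first fragment seen for each (chrom, start, end, strand)."""
    seen = set()
    kept = []
    for frag in fragments:
        key = (frag["chrom"], frag["start"], frag["end"], frag["strand"])
        if key not in seen:
            seen.add(key)
            kept.append(frag)
    return kept


fragments = [
    {"name": "r1", "chrom": "chr1", "start": 100, "end": 250, "strand": "+"},
    {"name": "r2", "chrom": "chr1", "start": 100, "end": 250, "strand": "+"},  # duplicate of r1
    {"name": "r3", "chrom": "chr1", "start": 100, "end": 260, "strand": "+"},
]
kept_names = [f["name"] for f in dedup_by_coordinates(fragments)]
print(kept_names)  # ['r1', 'r3']
```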
This program removes PCR duplicates using both mapping coordinates and UMIs. Reads that share identical mapping coordinates but carry different UMIs are considered to come from different Tn5 insertion events rather than being true PCR duplicates.
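The rule can be sketched by adding the UMI to the deduplication key. This example assumes the UMI is the text after the last underscore in the read name and compares UMIs exactly, with no tolerance for sequencing errors; the actual program may handle both points differently. The names `umi_of` and `dedup_with_umi` are hypothetical.

```python
def umi_of(read_name):
    # Assumes the UMI is the text after the last underscore in the read name.
    return read_name.rsplit("_", 1)[-1]


def dedup_with_umi(fragments):
    """Keep one fragment per (chrom, start, end, UMI).

    Each fragment is a (name, chrom, start, end) tuple. Illustrative only.
    """
    kept = {}
    for name, chrom, start, end in fragments:
        key = (chrom, start, end, umi_of(name))
        # setdefault keeps the first fragment seen for each key.
        kept.setdefault(key, (name, chrom, start, end))
    return list(kept.values())


fragments = [
    ("r1_ACGTTA", "chr1", 100, 250),
    ("r2_ACGTTA", "chr1", 100, 250),  # same coordinates, same UMI: PCR duplicate
    ("r3_GGCATC", "chr1", 100, 250),  # same coordinates, different UMI: kept
    ("r4_ACGTTA", "chr2", 500, 640),
]
unique_names = [f[0] for f in dedup_with_umi(fragments)]
print(unique_names)  # ['r1_ACGTTA', 'r3_GGCATC', 'r4_ACGTTA']
```

Deduplicating on coordinates alone would discard r3 as well, losing a fragment that the UMI identifies as an independent insertion event.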
Zhu, T., Liao, K., Zhou, R. et al. ATAC-seq with unique molecular identifiers improves quantification and footprinting. Commun Biol 3, 675 (2020). DOI: https://doi.org/10.1038/s42003-020-01403-4