Transcription Factor and Histone ChIP-seq processing pipeline using BDS (BigDataScript).
- Java (have been installed)
- Conda
- bds (BigDataScript)
- samtools
$ cd $HOME
$ wget https://repo.anaconda.com/archive/Anaconda3-2019.07-Linux-x86_64.sh
$ bash Anaconda3-2019.07-Linux-x86_64.sh
Answer yes
for the final question. If you choose no
, you need to manually add Anaconda3 to you $HOME/.bashrc
.
Do you wish the installer to prepend the Anaconda3 install location
to PATH in your /your/home/.bashrc ? [yes|no]
[no] >>> yes
Run souce ~/.bashrc
after installation.
$ cd $HOME
$ wget https://github.com/leepc12/BigDataScript/blob/master/distro/bds_Linux.tgz?raw=true -O bds_Linux.tgz
$ tar zxvf bds_Linux.tgz
Add export PATH=$PATH:$HOME/.bds
to your $HOME/.bashrc
. If Java memory occurs, add export _JAVA_OPTIONS="-Xms256M -Xmx728M -XX:ParallelGCThreads=1"
too.
Get ChIP-Seq pipeline.
$ cd $PATH/ChIP-seq/package
$ tar zxvf TF_chipseq_pipeline.tar.gz
$ mv TF_chipseq_pipeline ../ && cd ../
Install software dependencies automatically. It will create two conda environments (aquas_chipseq and aquas_chipseq_py3) under your conda.
$ cd $PATH/ChIP-seq/TF_chipseq_pipeline
$ bash install_dependencies.sh
If install_dependencies.sh
fails, run ./uninstall_dependencies.sh
, fix problems and then try bash install_dependencies.sh
again.
Install genome data for a specific genome [GENOME]
. Currently hg19
, mm9
, hg38
and mm10
are available. Specify a directory [DATA_DIR]
to download genome data. A species file generated on [DATA_DIR]
will be automatically added to your ./default.env
so that the pipeline knows that you have installed genome data using install_genome_data.sh
. If you want to install multiple genomes make sure that you use the same directory [DATA_DIR]
for them. Each genome data will be installed on [DATA_DIR]/[GENOME]
. If you use other BDS pipelines, it is recommended to use the same directory [DATA_DIR]
to save disk space.
NOTE: install_genome_data.sh
can take longer than an hour for downloading data and building index. DO NOT run the script on a login node, use qlogin
for SGE and srun --pty bash
for SLURM.
$ cd $PATH/ChIP-seq/TF_chipseq_pipeline
$ bash install_genome_data.sh [GENOME] [DATA_DIR]
There are two ways to define parameters for ChIP-Seq pipelines. 1. Parameters from command line arguments; 2.Parameters from a configuration file. We recommend using configure file to run pipeline.
-
Whenever you want to process fastq file, the first one to do is check the quality of fastq file.
fastqc c1.clean_1.fastq.gz -o ./
-
If quality is good then you can run the ChIP-seq pipeline, otherwise you need processed fastq data to clean data. (For example : adapter error, bar code error) (Note : cutadapter or trim_galore can be used to covert raw data to clean data).
$ python $PATH/ChIP-seq/TF_chipseq_pipeline/chipseq.py [CONF_JSON_FILE]
Parameters from a configuration file:
Choose [CHIPSEQ_TYPE]
between TF
and histone
.
## for pair-end fastq (no replicate)
{
"type" : "[CHIPSEQ_TYPE]",
"species" : [GENOME],
"nth" : [NUM_THREADS],
"pe" : true,
"fastq1_1" : "...",
"fastq1_2" : "...",
"ctl_fastq1_1" : "...",
"ctl_fastq1_2" : "...",
"out_dir" : "[path of output result directory]"
}
## for signal-end fastq (no replicate)
{
"type" : "[CHIPSEQ_TYPE]",
"species" : [GENOME],
"nth" : [NUM_THREADS],
"se" : true,
"fastq1" : "...",
"ctl_fastq1" : "...",
"out_dir" : "[path of output result directory]"
}
Above are base parameters, see the detailed information below.
There are five data types; fastq
, bam
, filt_bam
, tag
and peak
.
- For treat. replicates: define data path with
-[DATA_TYPE][REPLICATE_ID]
. - For contols: define data path with
-ctl_[DATA_TYPE][CONTROL_ID]
.
For fastqs:
- For treat. replicates:
- Define data path as -fastq[REPLICATE_ID], then it's SE (single ended).
- Define data path as -fastq[REPLICATE_ID]_[PAIRING_ID], then it's PE.
- For controls:
- Define data path as -ctl_fastq[REPLICATE_ID], it's SE.
- Define data path as -ctl_fastq[REPLICATE_ID]_[PAIRING_ID], it's PE.
## for pair-end fastq (with replicate)
{
"type" : "[CHIPSEQ_TYPE]",
"species" : [GENOME],
"nth" : [NUM_THREADS],
"pe" : true,
"fastq1_1" : "...",
"fastq1_2" : "...",
"fastq2_1" : "...",
"fastq2_2" : "...",
"ctl_fastq1_1" : "...",
"ctl_fastq1_2" : "...",
"out_dir" : "[path of output result directory]"
}
## for signal-end fastq (with replicate)
{
"type" : "[CHIPSEQ_TYPE]",
"species" : [GENOME],
"nth" : [NUM_THREADS],
"se" : true,
"fastq1" : "...",
"fastq2" : "...",
"ctl_fastq1" : "...",
"out_dir" : "[path of output result directory]"
}
NOTE: ChIP-Seq pipeline disabled gapped/broad peak generation because MACS2 has some flaws in its algorithm for them. (reference ENCODE-DCC/chip-seq-pipeline2#30)
If you want call broad peak, only need to modify chipseq.bds #line34 (true modify to false)
- idr thresh is 0.05 (idr peak is only for tf)
- P-value cutoff (macs2 callpeak) : 0.01
- spp for TF ChIP-seq and macs2 for Histone ChIP-seq
out # root dir. of outputs
│
├ *report.html # HTML report
├ *tracks.json # Tracks datahub (JSON) for WashU browser
├ ENCODE_summary.json # Metadata of all datafiles and QC results
│
├ align # mapped alignments
│ ├ rep1 # for true replicate 1
│ │ ├ *.bam # raw bam
│ │ ├ *.nodup.bam # filtered and deduped bam
│ │ ├ *.tagAlign.gz # tagAlign (bed6) generated from filtered bam
│ │ ├ *.*M.tagAlign.gz # subsampled tagAlign for cross-corr. analysis
│ ├ rep2 # for true repilicate 2
│ ...
│ ├ ctl1 # for control 1
│ ...
│ ├ pooled_rep # for pooled replicate
│ ├ pseudo_reps # for self pseudo replicates
│ │ ├ rep1 # for replicate 1
│ │ │ ├ pr1 # for self pseudo replicate 1 of replicate 1
│ │ │ └ pr2 # for self pseudo replicate 2 of replicate 1
│ │ ├ rep2 # for repilicate 2
│ │ ...
│ └ pooled_pseudo_reps # for pooled pseudo replicates
│ ├ ppr1 # for pooled pseudo replicate 1 (rep1-pr1 + rep2-pr1 + ...)
│ └ ppr2 # for pooled pseudo replicate 2 (rep1-pr2 + rep2-pr2 + ...)
│
├ peak # peaks called
│ ├ macs2 # peaks generated by MACS2
│ │ ├ rep1 # for replicate 1
│ │ │ ├ *.narrowPeak.gz # narrowPeak
│ │ │ ├ *.gappedPeak.gz # gappedPeak
│ │ │ ├ *.filt.narrowPeak.gz # blacklist filtered narrowPeak
│ │ │ ├ *.filt.gappedPeak.gz # blacklist filtered gappedPeak
│ │ ├ rep2 # for replicate 2
│ │ ...
│ │ ├ pseudo_reps # for self pseudo replicates
│ │ └ pooled_pseudo_reps # for pooled pseudo replicates
│ │
│ ├ spp # peaks generated by SPP
│ │ ├ rep1 # for replicate 1
│ │ │ ├ *.regionPeak.gz # regionPeak (narrowPeak format) generated from SPP
│ │ │ ├ *.filt.regionPeak.gz # blacklist filtered narrowPeak
│ │ ├ rep2 # for replicate 2
│ │ ...
│ │ ├ pseudo_reps # for self pseudo replicates
│ │ └ pooled_pseudo_reps # for pooled pseudo replicates
│ │
│ └ idr # IDR thresholded peaks (using peaks from SPP)
│ ├ true_reps # for replicate 1
│ │ ├ *.narrowPeak.gz # IDR thresholded narrowPeak
│ │ ├ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
│ │ └ *.12-col.bed.gz # IDR thresholded narrowPeak track for WashU browser
│ ├ pseudo_reps # for self pseudo replicates
│ │ ├ rep1 # for replicate 1
│ │ ...
│ ├ optimal_set # optimal IDR thresholded peaks
│ │ └ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
│ ├ conservative_set # optimal IDR thresholded peaks
│ │ └ *.filt.narrowPeak.gz # IDR thresholded narrowPeak (blacklist filtered)
│ ├ pseudo_reps # for self pseudo replicates
│ └ pooled_pseudo_reps # for pooled pseudo replicate
│
├ qc # QC logs
│ ├ *IDR_final.qc # Final IDR QC
│ ├ rep1 # for true replicate 1
│ │ ├ *.flagstat.qc # Flagstat QC for raw bam
│ │ ├ *.dup.qc # Picard (or sambamba) MarkDuplicate QC log
│ │ ├ *.pbc.qc # PBC QC
│ │ ├ *.nodup.flagstat.qc # Flagstat QC for filtered bam
│ │ ├ *M.cc.qc # Cross-correlation analysis score for tagAlign
│ │ └ *M.cc.plot.pdf/png # Cross-correlation analysis plot for tagAlign
│ ...
│
├ signal # signal tracks
│ ├ macs2 # signal tracks generated by MACS2
│ │ ├ rep1 # for true replicate 1
│ │ │ ├ *.pval.signal.bigwig (E) # signal track for p-val
│ │ │ └ *.fc.signal.bigwig (E) # signal track for fold change
│ ...
│ └ pooled_rep # for pooled replicate
│
└ report # files for HTML report
-
NRF=Distinct/Total (NRF (Non-Redundant Fraction) : Number of distinct uniquely mapping reads(i.e. after removing duplicates) /Total number of reads)
-
PBC1=OnePair/Distinct (One Read Pairs : number of genomic locations where exactly one read maps uniquely) PBC2=OnePair/TwoPair (Two Read Pairs : number of genomic locations where exactly two read maps uniquely )
PBC1 PBC2 Bottlenecking level NRF Complexity Flag colors <0.5 <1 Severe <0.5 Concerning Orange 0.5<=PBC1<0.8 1<=PBC2<3 Moderate 0.5<=NFR<0.8 Acceptable Yellow 0.8<=PBC1<0.9 3<=PBC2<10 Mild 0.8<=NFR<0.9 Compliant None >=0.9 >=10 None >0.9 Ideal None
- A very useful ChIP-seq quality metric that is independent of peak calling is strand cross-correlation.
- Strand cross-correlation is computed as the Pearson correlation between the positive and the negative strand profiles at different strand shift distances, k
- NSC (normalized strand coefficient) : <1 (no enrichment) , <1.1 (are relatively low NSC scores).
- RSC (relative strand correlation): 0 (无信号), >1 high enrichment
- fraction of all mapped reads that fall into the called peak region
- FPiR > 0.3, or > 0.2 (acceptable)
pipeline https://github.com/kundajelab/chipseq_pipeline#aquas-pipeline
broad peak ENCODE-DCC/chip-seq-pipeline2#30
https://www.jianshu.com/p/0d3515420170
https://www.plob.org/article/10866.html
-
ImportError: Something is wrong with the numpy installation. While importing we detected an older version of numpy in ['/g/zhanye/anaconda3/envs/aquas_chipseq/lib/python2.7/site-packages/numpy']. One method of fixing this is to repeatedly uninstall numpy until none is found, then reinstall this version.
method:
$ source activate aquas_chipseq $ conda uninstall numpy $ conda install numpy $ source deactivate aquas_chipseq
-
plotFingerprint not found
method:
$ source activate aquas_chipseq $ pip install deeptools $ source deactivate aquas_chipseq
-
ImportError: numpy.core.multiarray failed to import
method:
$ pip install --upgrade numpy