DNA methylation analysis pipeline for reduced representation bisulfite sequencing data

DNA-methylation-analysis

This Nextflow pipeline for reduced representation bisulfite sequencing (RRBS) data identifies the genes associated with differentially methylated regions, which are derived from CpG methylation patterns using MethylDackel and Metilene.

The pipeline takes BAM files as input and produces the following outputs (the number of files depends on the number of samples):

  • Per base methylation metrics (.bedGraph)
  • Differentially methylated regions (.bedGraph)
  • Correlation matrix and PCA (.png)
  • Heatmap with signature differences between the controls and samples (.pdf)
  • Genomic distribution across the hg38 reference genome of CpGs with different methylation frequencies between samples and controls (.png)
  • Genomic distribution across the hg38 reference genome of differentially methylated regions (.png)
  • Closest RefSeq genes (version from 2023-11-24) to the differentially methylated regions (.txt/.bedGraph)
  • Venn diagram of the closest genes (only if two or more samples were used as input) (.png)

Install conda environment

To use this pipeline, you need conda and Nextflow installed.

git clone https://github.com/AnaValente/DNA-methylation-analysis/
cd DNA-methylation-analysis
conda env create -f methylation_env.yml
conda activate methylation
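
Before creating the environment, it can help to confirm that both prerequisites are reachable from the shell. The following is a minimal sketch, not part of the repository; the helper name `missing_tools` is hypothetical, and it only checks that the `conda` and `nextflow` commands are on `PATH`, not their versions.

```shell
# Print the name of every given command that is not found on PATH,
# one per line. Prints nothing if all commands are available.
# (missing_tools is a hypothetical helper, not shipped with the pipeline.)
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

missing=$(missing_tools conda nextflow)
if [ -n "$missing" ]; then
    echo "Missing required tools:" >&2
    echo "$missing" >&2
else
    echo "conda and nextflow were found on PATH."
fi
```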

Usage

Mandatory inputs:

  • --files        Path to the scripts and samples folder
  • --samples      [String] comma-separated sample names (always list the control name first!)
  • --replicates   [Integer] number of sample replicates
  • --genome       Path to the hg38 reference genome file (.fa.gz) (available at: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz)

Note: All samples and additional files must be placed in the scripts folder.
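
The rules for `--samples` and `--replicates` can be checked before launching a run. The sketch below is illustrative only and is not how the pipeline itself validates its inputs; `check_inputs` is a hypothetical helper, and it assumes that a run needs the control plus at least one sample and that the replicate count must be a positive integer.

```shell
# Hypothetical pre-run check; not part of the pipeline itself.
# Usage: check_inputs "Control,Sample1,Sample2" 2
check_inputs() {
    samples=$1
    replicates=$2

    # Reject empty names such as ",A", "A," or "A,,B".
    case $samples in
        ''|,*|*,|*,,*)
            echo "error: empty sample name in --samples" >&2
            return 1 ;;
    esac
    # Assumption: a run needs the control plus at least one sample.
    case $samples in
        *,*) ;;
        *)
            echo "error: expected a control and at least one sample" >&2
            return 1 ;;
    esac
    # --replicates is documented as an integer; assume it must be positive.
    case $replicates in
        ''|*[!0-9]*|0*)
            echo "error: --replicates must be a positive integer" >&2
            return 1 ;;
    esac

    # The first name is treated as the control, as the README requires.
    echo "control=${samples%%,*}"
    echo "samples=${samples#*,}"
    echo "replicates=$replicates"
}

check_inputs "Control,Sample1,Sample2" 2
```

For the call shown, the function reports `Control` as the control and `Sample1,Sample2` as the remaining samples.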

Optional inputs:

  • --concat            Option to concatenate BAM files from different runs
  • --cell_tpm          Optional file for gene name filtering, containing two columns: gene names and expression levels in transcripts per million (TPM) for a cell line or cell type identical or similar to the cells under study (available at: https://www.ebi.ac.uk/gxa/experiments/E-MTAB-2770/Results)
  • --cutoff_regions    [Integer] cutoff (from 1 to 100) for the difference in methylation frequency between samples and control, used for genomic annotations (default: 75)
  • --cutoff_heatmap    [Integer] cutoff (from 1 to 100) for the difference in methylation frequency between samples and control, used for clustering analysis (default: 100)
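
To make the effect of the two cutoffs more concrete, here is a small sketch of what filtering on a methylation-frequency difference could look like. It is not the pipeline's own code. The input layout (tab-separated chrom, start, end, control percentage, sample percentage), the use of the absolute difference, and the "greater than or equal" comparison are all assumptions made for illustration, and `filter_by_cutoff` is a hypothetical helper.

```shell
# filter_by_cutoff CUTOFF
# Reads tab-separated lines from standard input in a hypothetical layout:
#   chrom  start  end  control_pct  sample_pct
# and prints the lines whose absolute difference in methylation
# percentage is at least CUTOFF (assumed semantics, see lead-in).
filter_by_cutoff() {
    awk -F '\t' -v cutoff="$1" '
        {
            diff = $5 - $4
            if (diff < 0) diff = -diff
            if (diff >= cutoff) print
        }'
}

# Three made-up CpG positions with differences of 100, 80 and 10 points.
printf 'chr1\t100\t101\t0\t100\nchr1\t200\t201\t10\t90\nchr2\t300\t301\t50\t60\n' |
    filter_by_cutoff 75
```

Under these assumptions, a cutoff of 75 keeps the first two made-up positions, while a cutoff of 100 would keep only the first, which is why a higher cutoff yields a stricter selection.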

Examples

Basic example

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome Scripts/hg38.fa.gz

Example with BAM concatenation

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome Scripts/hg38.fa.gz --concat

Example with genes filtered by file

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome Scripts/hg38.fa.gz --cell_tpm E-MTAB-2770-query-results.tsv 
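
Since `--cell_tpm` expects a two-column file, a quick shape check before the run may catch a wrong download. This is a sketch under stated assumptions: `has_two_columns` is a hypothetical helper, the tab delimiter is assumed from the `.tsv` extension in the example above, and the demo file contents are made up.

```shell
# has_two_columns FILE
# Succeeds only if every non-empty line of FILE has exactly two
# tab-separated fields. (Tab as the delimiter is an assumption.)
has_two_columns() {
    awk -F '\t' 'NF > 0 && NF != 2 { bad = 1 } END { exit bad }' "$1"
}

# Demo with a made-up file: gene name in column 1, TPM in column 2.
demo=$(mktemp)
printf 'GENE_A\t12.5\nGENE_B\t0.3\n' > "$demo"
if has_two_columns "$demo"; then
    echo "File has two columns on every line."
else
    echo "File does not look like a two-column table." >&2
fi
rm -f "$demo"
```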

Example with different cutoffs

nextflow run Methylation_pipeline.nf --files "Scripts/*" --samples 'Control','Sample1','Sample2' --replicates 2 --genome Scripts/hg38.fa.gz --cutoff_regions 50 --cutoff_heatmap 75