The MutationalPatterns R package provides a comprehensive set of flexible functions for easy finding and plotting of mutational patterns in Single Nucleotide Variant (SNV) data.
This package is dependent on R version 3.2.4
Install and load devtools package
install.packages("devtools")
library(devtools)
Install and load MutationalPatterns package
options(unzip = 'internal')
install_github("CuppenResearch/MutationalPatterns")
library(MutationalPatterns)
- List all available reference genomes (BSgenome)
available.genomes()
- Download and load your reference genome of interest
ref_genome = "BSgenome.Hsapiens.UCSC.hg19"
source("http://bioconductor.org/biocLite.R")
biocLite(ref_genome)
library(ref_genome, character.only = T)
Find package example data
vcf_files = list.files(system.file("extdata", package="MutationalPatterns"), full.names = T)
Load a single vcf file
vcf = read_vcf(vcf_files[1], "sample1")
Load a list of vcf files
sample_names = c("colon1", "colon2", "colon3", "intestine1", "intestine2", "intestine3", "liver1", "liver2", "liver3")
vcfs = read_vcf(vcf_files, sample_names)
Include relevant metadata in your analysis, e.g. donor id, cell type, age, tissue type, mutant or wild type
tissue = c("colon", "colon", "colon", "intestine", "intestine", "intestine", "liver", "liver", "liver")
Retrieve base substitutions from vcf object as "REF>ALT"
get_muts(vcfs[[1]])
Retrieve base substitutions from vcf and convert to the 6 types of base substitution types that are distinguished by convention: C>A, C>G, C>T, T>A, T>C, T>G. For example, if the reference allele is G and the alternative allele is T (G>T), this functions returns the G:C>T:A mutation as a C>A mutation.
get_types(vcfs[[1]])
Retrieve the context (1 base upstream and 1 base downstream) of the positions in the vcf object from the reference genome.
get_mut_context(vcfs[[1]], ref_genome)
Retrieve the types and context of the base substitution types for all positions in the vcf object. For the base substitutions that are converted to the conventional base substitution types, the reverse complement of the context is returned.
get_type_context(vcfs[[1]], ref_genome)
Count mutation type occurences for one vcf object
type_occurences = mut_type_occurences(vcfs[1], ref_genome)
Count mutation type occurences for all samples in a list of vcf objects
type_occurences = mut_type_occurences(vcfs, ref_genome)
Plot mutation spectrum over all samples. Plottes is the mean relative contribution of each of the 6 base substitution types. Error bars indicate standard deviation over all samples. The n indicates the total number of mutations in the set.
plot_spectrum(type_occurences)
Plot mutation spectrum with distinction between C>T at CpG sites
plot_spectrum(type_occurences, CT = T)
Plot spectrum without legend
plot_spectrum(type_occurences, CT = T, legend = F)
Plot spectrum for each tissue separately
plot_spectrum(type_occurences, by = tissue, CT = T)
Specify 7 colors for spectrum plotting
my_colors = c("pink", "orange", "blue", "lightblue", "green", "red", "purple")
plot_spectrum(type_occurences, CT = T, legend = T, colors = my_colors)
Make 96 trinucleodide mutation count matrix
mut_matrix = make_mut_matrix(vcf_list = vcfs, ref_genome = ref_genome)
Plot 96 profile of three samples
plot_96_profile(mut_matrix[,c(1,4,7)])
Compare two profiles by bootstrapping analysis and get a p-value for their difference
comparison <- profile_bootstrap_comparison(mut_matrix[,1], #colon1
mut_matrix[,4], #intestine1
random.seed = 123)
comparison$overallPvalue
> 0.136
Identify what mutations cause significant differences between two profiles and get their p-values
comparison <- profile_bootstrap_comparison(mut_matrix[,1], #colon1
mut_matrix[,7], #liver1
random.seed = 123)
comparison$mutTypePvalues_corrected[comparison$mutTypePvalues_corrected < 0.05]
GCG C>G ACG C>T CCG C>T GCG C>T TCA C>T TCG C>T ATG T>C CTG T>G GTG T>G TTC T>G
0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.03919184 0.00000000 0.00000000
Estimate optimal rank for NMF mutation matrix decomposition
estimate_rank(mut_matrix, rank_range = 2:5, nrun = 50)
Extract and plot 3 signatures
nmf_res = extract_signatures(mut_matrix, rank = 3)
# provide signature names (optional)
colnames(nmf_res$signatures) = c("Signature A", "Signature B" , "Signature C")
# plot signatures
plot_96_profile(nmf_res$signatures)
Plot signature contribution
# provide signature names (optional)
rownames(nmf_res$contribution) = c("Signature A", "Signature B" , "Signature C")
# plot signature contribution
plot_contribution(nmf_res$contribution, coord_flip = T)
Compare reconstructed mutation profile with original mutation profile
plot_compare_profiles(mut_matrix[,1], nmf_res$reconstructed[,1], profile_names = c("Original", "Reconstructed"))
Download signatures from pan-cancer study Alexandrov et al.
url = "http://cancer.sanger.ac.uk/cancergenome/assets/signatures_probabilities.txt"
cancer_signatures = read.table(url, sep = "\t", header = T)
# reorder (to make the order of the trinucleotide changes the same)
cancer_signatures = cancer_signatures[order(cancer_signatures[,1]),]
# only signatures in matrix
cancer_signatures = as.matrix(cancer_signatures[,4:33])
Fit mutation matrix to cancer signatures. This function finds the optimal linear combination of mutation signatures that most closely reconstructs the mutation matrix by solving nonnegative least-squares constraints problem
fit_res = fit_to_signatures(mut_matrix, cancer_signatures)
# select signatures with some contribution
select = which(rowSums(fit_res$contribution) > 0)
# plot contribution
plot_contribution(fit_res$contribution[select,], coord_flip = T)
Compare reconstructed mutation profile of sample 1 using cancer signatures with original profile
plot_compare_profiles(mut_matrix[,1], fit_res$reconstructed[,1], profile_names = c("Original", "Reconstructed \n cancer signatures"))
A rainfall plot visualizes mutation types and intermutation distance. Rainfall plots can be used to visualize the distribution of mutations along the genome or a subset of chromosomes. The y-axis corresponds to the distance of a mutation with the previous mutation and is log10 transformed. Drop-downs from the plots indicate clusters or "hotspots" of mutations.
Make rainfall plot of sample 1 over all autosomal chromosomes
# define autosomal chromosomes
chromosomes = seqnames(get(ref_genome))[1:22]
# make rainfall plot
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes, cex = 1.5)
Make rainfall plot of sample 1 over chromosome 1
chromosomes = seqnames(get(ref_genome))[1]
plot_rainfall(vcfs[[1]], title = names(vcfs[1]), ref_genome = ref_genome, chromosomes = chromosomes[1], cex = 2)
1 Genomic distribution
- genomic regions bed input
- surveyed area??
2 Strand bias analysis
- get gene bodies for each ref genome