The following work is part of my undergraduate thesis project at the Indian Institute of Technology Delhi (IIT Delhi), India.
> shapiro.test(subset(df, NRC_regions == "Over rep.")$Frequency)
Shapiro-Wilk normality test
data: subset(df, NRC_regions == "Over rep.")$Frequency
W = 0.86094, p-value < 2.2e-16
> shapiro.test(subset(df, NRC_regions == "Under rep.")$Frequency)
Shapiro-Wilk normality test
data: subset(df, NRC_regions == "Under rep.")$Frequency
W = 0.88901, p-value < 2.2e-16
In both histograms, the methylation data do not appear to be normally distributed. The Shapiro-Wilk test confirms this statistically: the p-value is below 2.2e-16 for both groups, far smaller than 0.05.
Thus, we reject the null hypothesis of normality for both distributions at the 5% significance level.
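As a minimal sketch of the normality check described above, the following R snippet runs the Shapiro-Wilk test once per group and tabulates the results. The data frame `df_demo` and its simulated beta-distributed values are assumptions made purely for illustration; they are not the thesis data. Note that `shapiro.test` only accepts between 3 and 5000 non-missing values.

```r
# Simulated stand-in for the methylation data; the values and group sizes
# are illustrative assumptions, not the thesis data.
set.seed(1)
df_demo <- data.frame(
  NRC_regions = rep(c("Over rep.", "Under rep."), each = 500),
  Frequency   = c(rbeta(500, 0.5, 2), rbeta(500, 2, 0.5))
)

# Run the Shapiro-Wilk test separately for each group.
sw <- lapply(split(df_demo$Frequency, df_demo$NRC_regions), shapiro.test)

# Collect W and the p-value per group into one small table.
sw_summary <- data.frame(
  group     = names(sw),
  W         = sapply(sw, function(x) unname(x$statistic)),
  p_value   = sapply(sw, function(x) x$p.value),
  row.names = NULL
)

# Flag groups where normality is rejected at the 5% level.
sw_summary$reject_normality <- sw_summary$p_value < 0.05
print(sw_summary)
```

Splitting by group first keeps the two tests symmetric and avoids repeating the `subset(...)` call for each level.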
To statistically validate the difference in methylation status between over- and under-represented NRC regions, we use the non-parametric Wilcoxon rank-sum test.
> wilcox.test(df$Frequency ~ df$NRC_regions)
Wilcoxon rank sum test with continuity correction
data: df$Frequency by df$NRC_regions
W = 1632757, p-value < 2.2e-16
Alternative hypothesis: true location shift is not equal to 0
The p-value is less than 0.05, so we reject the null hypothesis and conclude that methylation status differs significantly between the two NRC regions.
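To make the reported statistic more concrete, here is a small sketch of the same kind of test on simulated data, together with a by-hand computation of W. In R, W is the sum of the ranks of the first factor level minus its smallest possible value, n1(n1 + 1)/2. The data frame `df_demo` and its values are hypothetical and only mimic the column names used above.

```r
# Simulated stand-in data; group sizes and distributions are assumptions.
set.seed(2)
df_demo <- data.frame(
  NRC_regions = factor(rep(c("Over rep.", "Under rep."), times = c(300, 200))),
  Frequency   = c(rbeta(300, 2, 5), rbeta(200, 5, 2))
)

# Two-sided Wilcoxon rank-sum test using the formula interface.
res <- wilcox.test(Frequency ~ NRC_regions, data = df_demo)

# Recompute W by hand: rank all observations jointly, sum the ranks of the
# first factor level ("Over rep."), and subtract n1 * (n1 + 1) / 2.
r  <- rank(df_demo$Frequency)
n1 <- sum(df_demo$NRC_regions == "Over rep.")
W_manual <- sum(r[df_demo$NRC_regions == "Over rep."]) - n1 * (n1 + 1) / 2

print(res)
print(c(W_test = unname(res$statistic), W_manual = W_manual))
```

Passing `data = df_demo` instead of writing `df_demo$Frequency ~ df_demo$NRC_regions` gives the same test and keeps the printed `data:` line shorter.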
To understand which genomic regions within the NRC regions account for this differential methylation, we analyzed methylation separately in intergenic, exonic and intronic regions.
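The per-region comparison can be sketched as a loop that applies the same Wilcoxon rank-sum test to each subset and collects the results. The list `regions`, the helper `make_region`, and the simulated values are all hypothetical; in the real analysis each list element would be the data frame holding the methylation frequencies for that genomic region.

```r
# Hypothetical helper that simulates one region's data; `shift` controls how
# far the "Under rep." distribution is moved relative to "Over rep.".
set.seed(3)
make_region <- function(n, shift) {
  data.frame(
    NRC_regions = factor(rep(c("Over rep.", "Under rep."), each = n)),
    Frequency   = c(rbeta(n, 2, 5), rbeta(n, 2 + shift, 5))
  )
}

# One data frame per genomic region (simulated, not the thesis data).
regions <- list(
  intergenic = make_region(400, 1),
  exonic     = make_region(150, 0.5),
  intronic   = make_region(100, 0)
)

# Run the same test on every region and bind the results into one table.
results <- do.call(rbind, lapply(names(regions), function(nm) {
  wt <- wilcox.test(Frequency ~ NRC_regions, data = regions[[nm]])
  data.frame(region = nm, W = unname(wt$statistic), p_value = wt$p.value)
}))
results$significant <- results$p_value < 0.05
print(results)
```

Keeping the regions in a named list means the test call is written once, which reduces the risk of testing the wrong data frame for a given region.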
> shapiro.test(subset(df, NRC_regions == "Over rep.")$Frequency)
Shapiro-Wilk normality test
data: subset(df, NRC_regions == "Over rep.")$Frequency
W = 0.82099, p-value < 2.2e-16
> shapiro.test(subset(df, NRC_regions == "Under rep.")$Frequency)
Shapiro-Wilk normality test
data: subset(df, NRC_regions == "Under rep.")$Frequency
W = 0.87614, p-value < 2.2e-16
Wilcoxon rank sum test with continuity correction
data: df$Frequency by df$NRC_regions
W = 801389, p-value = 2.46e-13
Alternative hypothesis: true location shift is not equal to 0
The p-value is less than 0.05. Therefore, methylation status differs significantly between the intergenic regions of over- and under-represented NRC regions.
> shapiro.test(subset(df_exon, NRC_regions == "Over rep.")$Frequency)
Shapiro-Wilk normality test
W = 0.84151, p-value < 2.2e-16
> shapiro.test(subset(df_exon, NRC_regions == "Under rep.")$Frequency)
Shapiro-Wilk normality test
W = 0.85718, p-value = 4.223e-13
Wilcoxon rank sum test with continuity correction
data: dfe$Frequency by dfe$NRC_regions
W = 39913, p-value = 0.0007649
Alternative hypothesis: true location shift is not equal to 0
As the p-value is less than 0.05, methylation status differs significantly between the exonic regions of over- and under-represented NRC regions.
> shapiro.test(subset(df_intron, NRC_regions == "Over rep.")$Frequency)
Shapiro-Wilk normality test
W = 0.85528, p-value = 1.697e-12
> shapiro.test(subset(df_intron, NRC_regions == "Under rep.")$Frequency)
Shapiro-Wilk normality test
W = 0.81156, p-value = 2.905e-10
Wilcoxon rank sum test with continuity correction
data: dfi$Frequency by dfi$NRC_regions
W = 10252, p-value = 0.7405
Alternative hypothesis: true location shift is not equal to 0
As the p-value is greater than 0.05, we fail to reject the null hypothesis. That is, we find no significant difference in the methylation frequencies of introns between over- and under-represented NRC regions.
Thus, we have presented the variation and distribution of 5mC-methylated CpG sites in the intergenic, intronic and exonic regions of over- and under-represented NRC DNA.