This package provides functions that reproduce the analyses reported in Rahmani et al. 2021 (bioRxiv) comparing TCA (Rahmani et al. 2019, Nature Communications) and CellDMC (Zheng et al. 2018, Nature Methods). These functions were also used to generate the figuers in Rahmani et al. 2022, Frontiers in Bioinformatics; see under "2.1 Software and computational tools" in the Supplementary Information of the latter publication for more details.
The package has a requirement of R version 3.5.0 or greater. The results found in Rahmani et al. (2021) were run using R version 3.6.0. Other package requirements are listed in the Imports section of the DESCRIPTION file.
With the devtools
package, you can install this package using the following command:
devtools::install_github("cozygene/CellTypeSpecificMethylationAnalysis")
The functions exported by this package will download most of the source data and output each figure found in Rahmani et al. (2021) into user-specified directories. Note that the data directory will contain just under 5GB of data and may temporarily max out at 13GB for temporary files.
Figure 1 and Supplementary Figure 7 evaluate the methods when a phenotype affects methylation (X|Y). They can be reproduced with the following
parametric_simulations(data_dir, results_dir, plots_dir,
experiment_index=1)
which will save the simulation data, results, and plots (in tiff format) in the provided directories.
Figure 2 and Supplementary Figures 1 and 9 evaluate the methods when a phenotype is affected by methlation (Y|X), which can be reproduced with
parametric_simulations(data_dir, results_dir, plots_dir,
experiment_index=3)
Supplmentary Figures 3 and 4 perform the experiments in Figure 1 with smaller sample size and are generated with:
parametric_simulations(data_dir, results_dir, plots_dir,
experiment_index=4)
Supplementary Figure 8 performs the experiment in Figure 1 but considers the estimated effect direction when determining performance. It can be generated as follows:
parametric_simulations(data_dir, results_dir, plots_dir,
experiment_index=2)
The methods were benchmarked on two independent whole-blood datasets (Hannum et al., Liu et al.) .
Figure 3 visualizes these results and can be reproduced as follows:
smoking.res <- smoking_analysis(hannum_smk_status_path="hannum_smoking_status.txt",
data_dir, results_dir, plot_dir)
where hannum_smk_status_path
is the path to a CSV file that contains the smoking status of samples in the Hannum et al. dataset. This data is currently not publicly available but is required for this analysis. It contains a header indicating ID
and smk
columns that contain the sample IDs and smoking status, respectively. Smoking status is listed as either Never
, Past Smoker
, or Current Smoker
in this file.
Supplementary Figure 2 evaluates estimation of cell fractions in two whole-blood datasets (Koestler et al., Jing et al.) with FACS-based fractions as ground truth. It can be reproduced with
refit.w.res <- refit_w_comparison(bbc_pheno_path="PhenoTypesBBC.Rd",
bbc_facs_path="facsBBC.Rd",
data_dir, plot_dir)
The Koestler et al. data that is analyzed within this function is available as an object exported by this package and is also publicly available on GEO (GSE77797). Preprocessing of these data in the above package followed procedures described in Lehne et al. (Genome Biology 2015). The BBC data, which is available through NODE (OER035661), is downloaded and processed by this function. Note that the phenotype and FACS data are currently under restriced acces. The bbc_pheno_path
and bbc_facs_path
parameters point to the Rd files that contain these data and were provided by Jing et al.
The runtime of each method was also compared in Supplementary Figure 5. These results can be reproduced with the following function:
runtime.res <- runtime_comparison(plot_dir)
Note that this function runs each simulation serially (took on the order of a day on a single core).
Rahmani et al. (2021) also introduces an alternative procedure for fitting the TCA model in an efficient manner. While this generalized method of moments (GMM) approach provides different estimates than the original TCA maximum likelihood (MLE) model, we show in Supplementary Figure 6 that their estimates are highly concordant as sample sizes increase. It can be run with
mle.gmm.comparison.res <- var_mle_gmm_comparison(plot_dir)