“Investigation of early-stage detection of colorectal cancer using machine learning models based on functional profiling of the human gut microbiome”
CRC prediction models based on functional profiling of gut microbiome.
TBU
TBU
This repository contains the code and a brief description of the workflow used for the CRC microbiome analysis in a Master's thesis undertaken at the Albertsen lab (AAU).
This folder contains the files generated in this analysis:
- e.g. predictions_lasso.tsv
This folder contains data & other relevant information:
- Feature tables*
- Metadata
- Library sizes
- Run accession numbers
This folder contains the generated figures.
This folder contains the models built for each feature category. Models are saved as R objects and can be loaded into an R environment without re-running the scripts.
This folder contains the source code used in this project for the analyses carried out in RStudio.
This folder contains the source code used in this project for the analyses carried out on the command line.
* Files too large to be uploaded to this repository
Publicly available raw sequencing data from CRC studies were used in this analysis. The data are available from the European Nucleotide Archive (ENA) at EMBL-EBI, except for the Indian cohort, which was downloaded from the NCBI BioProject database.
Study | Country | Accession number(s) |
---|---|---|
Feng et al., 2015 | Austria | ERP008729 |
Gupta et al., 2019 | India | PRJNA531273 & PRJNA39711 |
Yachida et al., 2019 | Japan | DRA006684 & DRA008156 |
Thomas et al., 2019 | Italy | SRP136711 |
Vogtmann et al., 2016 | USA | PRJEB12449 |
Wirbel et al., 2019 | Germany | PRJEB27928 |
Yu et al., 2015 | China | PRJEB10878 |
Zeller et al., 2014 | France | ERP005534 |
Note: The list of all studies, including metadata, was retrieved and modified from Wirbel et al., 2019. The metadata from Gupta et al., 2019 and Yachida et al., 2019 were obtained separately, according to the information provided by the researchers in their papers.
The command line was used to pre-process the raw data and to perform taxonomic and functional profiling of the high-quality reads.
Reads were trimmed and filtered with the TrimGalore (v.0.6.5) wrapper using the following parameters:
--stringency 5 --length 45 --quality 20 --max_n 2 --trim-n --paired
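As an illustration, the parameters listed above could be applied to one paired-end sample roughly as follows. This is a minimal sketch, not the thesis script: the function name `trim_pair`, the file names and the output directory are hypothetical, and `--output_dir` is added only so the example has a defined destination.

```bash
# Sketch: trim one paired-end sample with the filtering parameters listed above.
# trim_pair, its arguments and the output layout are illustrative assumptions.
trim_pair() {
    local r1="$1" r2="$2" outdir="$3"
    mkdir -p "$outdir"
    trim_galore \
        --stringency 5 \
        --length 45 \
        --quality 20 \
        --max_n 2 \
        --trim-n \
        --paired \
        --output_dir "$outdir" \
        "$r1" "$r2"
}

# Example call (hypothetical file names):
# trim_pair sample_1.fastq.gz sample_2.fastq.gz trimmed
```

The actual invocation used in the project is in the bash scripts of this repository.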
Bowtie2 (v.2.3.4.1) was used to align reads to the human genome (hg19) and discard contaminant reads from the host.
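One common way to do this kind of host-read removal is to keep only the read pairs that Bowtie2 fails to align concordantly to the human index. The sketch below assumes that approach; the function name `remove_host`, the index prefix and the file naming are hypothetical, and the criterion used in the thesis scripts may differ.

```bash
# Sketch: discard host reads by keeping pairs that do NOT align concordantly
# to the human genome. remove_host, the index prefix and the output naming
# are illustrative assumptions.
remove_host() {
    local index="$1" r1="$2" r2="$3" prefix="$4"
    # --un-conc-gz writes the non-aligning pairs (gzip-compressed);
    # Bowtie2 replaces "%" with 1 and 2 for the two mates.
    # The alignments themselves are not needed, so the SAM output is dropped.
    bowtie2 \
        -x "$index" \
        -1 "$r1" -2 "$r2" \
        --un-conc-gz "${prefix}_clean_%.fastq.gz" \
        -S /dev/null
}

# Example call (hypothetical paths):
# remove_host /path/to/hg19 sample_1_val_1.fq.gz sample_2_val_2.fq.gz cleaned/sample
```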
The bash scripts can be found in the CRC_MasterThesis/bash_scripts folder.
Functional profiles of the high-quality shotgun metagenomic sequences were determined using HUMAnN 3.0 (Beghini et al., 2020):
bash humann.sh
Taxonomic profiles were determined using MetaPhlAn 3.0:
bash mpa.sh
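For orientation, the two profiling steps might look roughly like this for a single sample. This is a hedged sketch rather than the contents of humann.sh or mpa.sh: the function name `profile_sample`, the output layout and the default thread count are assumptions, and the input is assumed to be one FASTQ file per sample.

```bash
# Sketch: functional and taxonomic profiling of one sample.
# profile_sample, the output layout and the thread default are
# illustrative assumptions.
profile_sample() {
    local reads="$1" outdir="$2" threads="${3:-4}"
    mkdir -p "$outdir"

    # Functional profiling with HUMAnN (gene family and pathway tables).
    humann --input "$reads" --output "$outdir/humann" --threads "$threads"

    # Taxonomic profiling with MetaPhlAn; the intermediate Bowtie2 output
    # is kept so the profile can be regenerated without re-mapping.
    metaphlan "$reads" --input_type fastq --nproc "$threads" \
        --bowtie2out "$outdir/sample.bowtie2.bz2" \
        -o "$outdir/metaphlan_profile.tsv"
}

# Example call (hypothetical paths):
# profile_sample cleaned/sample.fastq.gz profiles/sample 8
```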
The metadata and feature tables were prepared in R using RStudio.
The metadata were retrieved and modified from Wirbel et al., 2019. The metadata from Gupta et al., 2019 and Yachida et al., 2019 were obtained separately, according to the information provided by the researchers in their papers.
The final metadata used in this project can be found in the CRC_MasterThesis/data/meta folder as "meta.crc.tsv", or can be generated by running the script:
2.1_prepare_metadata.R
The feature tables produced by HUMAnN 3.0 were post-processed in RStudio.
First, samples that had been sequenced more than once were merged, taking their library sizes into account, using these scripts:
2.2_prepare_functional_data.R
2.2_prepare_taxonomic_data.R
Note: the CN-CRC study is an exception, because its number of samples matches the number of metadata entries.
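One plausible reading of merging by library size is a weighted mean, where each run's abundance is weighted by that run's library size. The sketch below shows only that idea and is not taken from the R scripts: the function name `merge_runs` and the long-format input layout are hypothetical, and every run is assumed to list every feature (zeros included).

```bash
# Sketch: merge runs of the same sample as a library-size-weighted mean.
# merge_runs and the input layout are illustrative assumptions.
# Input (tab-separated, with header): sample, run, library_size, feature, abundance
# Output (tab-separated, sorted):     sample, feature, merged abundance
merge_runs() {
    awk -F '\t' -v OFS='\t' '
        NR == 1 { next }                  # skip the header row
        {
            key = $1 OFS $4               # one entry per sample and feature
            weighted[key] += $5 * $3      # abundance weighted by library size
            total[key]    += $3           # summed library size
        }
        END {
            for (key in weighted)
                print key, weighted[key] / total[key]
        }
    ' "$1" | sort
}

# Example call (hypothetical file name):
# merge_runs runs_long.tsv > merged_long.tsv
```

For a sample with two runs of 100 and 300 reads and abundances 0.2 and 0.6, the merged abundance is (0.2 × 100 + 0.6 × 300) / 400 = 0.5, so the deeper run contributes more.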
Second, the feature tables were cleaned and filtered to remove low-abundance features:
2.3_clean_functional_data.R
2.3_clean_taxonomic_data.R
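A simple form of low-abundance filtering keeps a feature only if it reaches some cutoff in at least one sample. The sketch below assumes that rule; the function name `filter_features`, the wide table layout and the cutoff value are hypothetical, and the criteria actually used are defined in the R scripts above.

```bash
# Sketch: drop features whose maximum abundance across samples is below a cutoff.
# filter_features, the table layout and the cutoff are illustrative assumptions.
# Input (tab-separated): header row, then one row per feature with one
# abundance column per sample.
filter_features() {
    local table="$1" cutoff="$2"
    awk -F '\t' -v cutoff="$cutoff" '
        NR == 1 { print; next }           # keep the header row
        {
            max = 0
            for (i = 2; i <= NF; i++)
                if ($i + 0 > max) max = $i + 0
            if (max >= cutoff) print      # keep features seen above the cutoff
        }
    ' "$table"
}

# Example call (hypothetical file name and cutoff):
# filter_features pathways_relab.tsv 0.001 > pathways_filtered.tsv
```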
3.1_profiler_comparison_combined.R
3.2_profiler_comparison_boxplots.R
The R package ampvis2 was used for exploratory analysis of the functional and taxonomic feature tables. The package was originally developed for visualising amplicon data, but it can also handle shotgun metagenomic data.
3.3_explorative_analysis_ampvis2.R
Machine learning models were built using SIAMCAT, a pipeline for finding associations between the gut microbiome and host phenotypes (Wirbel et al., 2021). The machine learning workflow, including feature filtering, was adapted from an established CRC meta-analysis (Wirbel et al., 2019).
The machine learning scripts were run in the following order:
4.1_train_models.R
4.2_model_predictions.R
4.3_ml_figures.R
4.4_ml_external_validation_figures.R
4.5_ml_evaluation_figures.R
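If the scripts are run from a terminal rather than interactively in RStudio, the order above could be enforced with a small wrapper like the one below. This is a sketch under the assumption that each script runs non-interactively with `Rscript`; the function name `run_ml_pipeline` and the script directory argument are hypothetical.

```bash
# Sketch: run the machine learning scripts in the order listed above and
# stop at the first failure. run_ml_pipeline and the directory argument
# are illustrative assumptions.
run_ml_pipeline() {
    local script_dir="$1" script
    for script in \
        4.1_train_models.R \
        4.2_model_predictions.R \
        4.3_ml_figures.R \
        4.4_ml_external_validation_figures.R \
        4.5_ml_evaluation_figures.R
    do
        echo "Running $script" >&2        # progress message on stderr
        Rscript "$script_dir/$script" || return 1
    done
}

# Example call (hypothetical path):
# run_ml_pipeline path/to/R_scripts
```

Stopping at the first failure matters here because the later scripts depend on the models and predictions written by the earlier ones.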
E-mail: Erika Dvarionaite
Twitter: erika_dva