CRC_MasterThesis_21

Title

“Investigation of early-stage detection of colorectal cancer using machine learning models based on functional profiling of the human gut microbiome”

Background

This project builds colorectal cancer (CRC) prediction models based on functional profiling of the gut microbiome.

Results

TBU

Conclusion

TBU

Repository structure

This repository contains the code and a brief description of the workflow used for the CRC microbiome analysis in a Master's thesis undertaken at the Albertsen lab (AAU).

Files (TBU)

This folder contains files generated in this analysis:

e.g. predictions_lasso.tsv

Data (TBU)

This folder contains data & other relevant information:

Feature tables*
Metadata
Library sizes
Run accession numbers

Figures (TBU)

This folder contains the generated figures.

Models (TBU)

This folder contains the models built for each feature category. Models are saved as R objects and can be loaded into an R environment without re-running the scripts.

Scripts

This folder contains the source code used in this project for analysis carried out in RStudio.

Bash Scripts

This folder contains the source code used in this project for analysis carried out on the command line.

* Files too large to be uploaded in this repository

Workflow

1. Data collection & processing

1.1 Data availability

Publicly available raw sequencing data from CRC studies were used in this analysis. The data are available from the European Nucleotide Archive (ENA) at EMBL-EBI, except for the Indian cohort, which was downloaded from the NCBI BioProject database.

Study	Country	Accession number(s)
Feng et al., 2015	Austria	ERP008729
Gupta et al., 2019	India	PRJNA531273 & PRJNA39711
Yachida et al., 2019	Japan	DRA006684 & DRA008156
Thomas et al., 2019	Italy	SRP136711
Vogtmann et al., 2016	USA	PRJEB12449
Wirbel et al., 2019	Germany	PRJEB27928
Yu et al., 2015	China	PRJEB10878
Zeller et al., 2014	France	ERP005534

Note: The list of all studies, including metadata, was retrieved and modified from Wirbel et al., 2019. The metadata from Gupta et al., 2019 and Yachida et al., 2019 were obtained separately, according to the information provided by the researchers in their papers.

1.2 Raw data pre-processing

Command-line tools were used to pre-process the raw data and to perform taxonomic and functional profiling of the high-quality reads.

Reads were trimmed and filtered with the TrimGalore (v.0.6.5) wrapper using the following parameters:
--stringency 5 --length 45 --quality 20 --max_n 2 --trim-n --paired
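As a minimal sketch of how these parameters fit into a single call, the function below prints a TrimGalore command for one paired-end sample instead of running it. The FASTQ file names and the output directory are hypothetical placeholders, and the actual invocation in the repository's bash scripts may differ.

```shell
# Print the TrimGalore command for one paired-end sample (dry run).
# $1 and $2 are the two mate files, $3 is the output directory.
# All three values are hypothetical placeholders, not taken from the thesis.
build_trim_cmd() {
  echo trim_galore --stringency 5 --length 45 --quality 20 \
    --max_n 2 --trim-n --paired \
    --output_dir "$3" "$1" "$2"
}

# Inspect the command; remove "echo" in the function to execute it.
build_trim_cmd sample_1.fastq.gz sample_2.fastq.gz trimmed
```

Printing the command first makes it easy to check the flags before committing compute time to a whole cohort.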

Bowtie2 (v.2.3.4.1) was used to align reads to the human genome (hg19) and discard contaminant reads from the host.
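One common way to discard host reads with Bowtie2 is to keep only the read pairs that do not align concordantly to the human index. The sketch below prints such a command; whether the thesis scripts used this exact option set is an assumption, and the index prefix and file names are hypothetical.

```shell
# Print a Bowtie2 command that writes read pairs failing to align
# concordantly to the host index (dry run).
# $1 = Bowtie2 index prefix (e.g. a prebuilt hg19 index), $2/$3 = trimmed mates,
# $4 = output prefix. All values are hypothetical placeholders.
build_host_filter_cmd() {
  echo bowtie2 -x "$1" -1 "$2" -2 "$3" \
    --un-conc-gz "${4}_clean_%.fastq.gz" -S /dev/null
}

build_host_filter_cmd hg19 sample_1_val_1.fq.gz sample_2_val_2.fq.gz sample
```

Bowtie2 replaces the `%` in the `--un-conc-gz` path with 1 or 2 for the two mates, and `-S /dev/null` drops the SAM output because only the unaligned reads are needed downstream.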

The bash scripts can be found in CRC_MasterThesis/bash_scripts folder.

1.3 Functional profiling

Functional profiles of the high-quality metagenomic shotgun reads were determined using HUMAnN 3.0 (Beghini et al., 2020).

bash humann.sh
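The contents of humann.sh are not reproduced here. As a rough sketch of what a per-sample HUMAnN call typically looks like, the function below prints one command. The input file, output directory, and thread count are hypothetical; HUMAnN takes a single input file, so paired reads are usually concatenated beforehand, which is assumed here.

```shell
# Print the HUMAnN command for one sample (dry run).
# $1 = FASTQ with host-filtered reads (mates concatenated, an assumption),
# $2 = output directory, $3 = number of threads. All values are hypothetical.
build_humann_cmd() {
  echo humann --input "$1" --output "$2" --threads "$3"
}

build_humann_cmd sample_clean.fastq.gz humann_out 8
```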

1.4 Taxonomic profiling

Taxonomic profiles were determined using MetaPhlAn 3.0.

bash mpa.sh

2. Metadata and feature table preparation

The metadata and feature table preparation was carried out in RStudio using R.

2.1 Metadata

The metadata were retrieved and modified from Wirbel et al., 2019. The metadata from Gupta et al., 2019 and Yachida et al., 2019 were obtained separately, according to the information provided by the researchers in their papers.

The final metadata used in this project can be found in the CRC_MasterThesis/data/meta folder as "meta.crc.tsv", or can be generated by running the script:
2.1_prepare_metadata.R

2.2 Feature tables

The feature tables produced by HUMAnN 3.0 were post-processed in RStudio.

Firstly, samples that were sequenced in multiple runs were merged, taking the library sizes into account, using these scripts:
2.2_prepare_functional_data.R
2.2_prepare_taxonomic_data.R

Note: the CN-CRC study is an exception, as its number of samples already matches the number of metadata entries.
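The merging itself is done in the R scripts listed above. To illustrate one plausible reading of "taking the library sizes into account", the sketch below computes a library-size-weighted mean of relative abundances per sample and feature. The weighting scheme, the long-format input layout (sample, run, library size, feature, abundance), and the function name merge_runs are all assumptions for illustration, not the author's implementation.

```shell
# Merge runs of the same sample with a library-size-weighted mean (sketch).
# Input: tab-separated file with a header line and the hypothetical columns
#   sample  run  library_size  feature  abundance
# Assumes every run lists every feature (zeros included).
merge_runs() {
  awk -F'\t' 'BEGIN { OFS = "\t" }
    NR > 1 {
      key = $1 OFS $4          # sample and feature
      num[key] += $3 * $5      # library size times abundance
      den[key] += $3           # summed library size
    }
    END { for (k in num) printf "%s\t%.4f\n", k, num[k] / den[k] }' "$1" | sort
}

# Usage: merge_runs long_table.tsv > merged.tsv
```

Weighting by library size means a deeply sequenced run contributes more to the merged profile than a shallow one, which is equivalent to pooling the reads before normalising.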

Secondly, the feature tables were cleaned and filtered to remove low-abundance features:
2.3_clean_functional_data.R
2.3_clean_taxonomic_data.R

3. Explorative analysis

Overview of different profilers

3.1_profiler_comparison_combined.R
3.2_profiler_comparison_boxplots.R

Ordination with ampvis2

The R package ampvis2 was used for explorative analysis of the functional and taxonomic feature tables. The package was originally developed for visualising amplicon data, but it can also handle shotgun metagenomics data.
3.3_explorative_analysis_ampvis2.R

4. Machine learning models

Machine learning models were built using the SIAMCAT pipeline for inferring associations between the gut microbiome and host phenotypes (Wirbel et al., 2021). The machine learning workflow, including feature filtering, was adapted from an established CRC meta-analysis (Wirbel et al., 2019).

Logistic LASSO regression

The machine learning scripts were run in the following order:

4.1_train_models.R
4.2_model_predictions.R
4.3_ml_figures.R
4.4_ml_external_validation_figures.R
4.5_ml_evaluation_figures.R
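The scripts were developed in RStudio, but the same order can be reproduced from the command line. The sketch below prints one Rscript call per script in the documented order; it assumes the scripts run non-interactively, and the default directory name "scripts" is a hypothetical placeholder.

```shell
# Print one Rscript call per machine learning script, in the documented order
# (dry run). $1 is the scripts directory; "scripts" is a hypothetical default.
ml_run_order() {
  dir="${1:-scripts}"
  for s in 4.1_train_models.R 4.2_model_predictions.R 4.3_ml_figures.R \
           4.4_ml_external_validation_figures.R 4.5_ml_evaluation_figures.R; do
    echo Rscript "$dir/$s"
  done
}

ml_run_order
```

The order matters because the prediction and figure scripts presumably read the model objects written by 4.1_train_models.R.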

Contact

E-mail: Erika Dvarionaite
Twitter: erika_dva
