RGStraP

RGStraP (RNA-seq-based Genetic Stratification PCs) is a bioinformatics pipeline that calculates principal components (PCs) capturing genetic stratification from RNA-seq data. It relies mainly on GATK4 for variant calling and FlashPCA2 for principal component analysis (PCA), and is built with Snakemake.

Contact

Muhamad Fachrul, [email protected]

Citation

Fachrul, M., Karkey, A., Shakya, M., Judd, L. M., Harshegyi, T., Sim, K. S., Tonks, S., Dongol, S., Shrestha, R., Salim, A., Baker, S., Pollard, A. J., Khor, C. C., Dolecek, C., Basnyat, B., Dunstan, S. J., Holt, K. E., & Inouye, M. (2023). Direct inference and control of genetic population structure from RNA sequencing data. Communications Biology, 6(1), Article 1. https://doi.org/10.1038/s42003-023-05171-9

Highlighted by Health Data Research UK (HDRUK): Understanding genetic diversity through RNA data to inform future research

Requirements

Most of the dependencies (including FastQC v0.11.8, Trim Galore v0.6.0, BBMap (for Clumpify.sh), STAR v2.7.10a, Picard v2.24.0, Samtools v1.8, GATK4 v4.0.6.0, and PLINK 1.9 v1.90b6.16) are installed as part of the Conda environment setup described below.

Please install FlashPCA v2.0 from source.

How to use

Installing Conda and Snakemake

Install a Conda-based Python 3 distribution such as Miniconda or Mambaforge. The commands below use Mambaforge as an example.

curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -o Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh

Create and/or navigate to the directory in which you want the analysis of your project to take place, then clone this repository.

git clone https://github.com/fachrulm/RGStraP

Change into the RGStraP directory, and create a Conda environment to run the pipeline.

cd RGStraP

# Activate Conda environment
conda activate base

# Create RGStraP environment
mamba env create --name RGStraP --file environment.yaml

Activate the RGStraP environment. This environment needs to be active every time you use the pipeline.

conda activate RGStraP

# To deactivate the environment
conda deactivate
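
As a quick sanity check with the RGStraP environment active, the sketch below reports which expected executables are on your PATH. The function name check_tools is hypothetical, and the executable names are assumptions based on common Bioconda packaging rather than something confirmed by this repository; FlashPCA is not checked because its path is set separately in the config.

```shell
# Report which of the given executables are missing from PATH.
# Prints one line per tool and returns the number of missing tools.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "MISSING: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

# Assumed executable names (not taken from environment.yaml).
check_tools fastqc trim_galore clumpify.sh STAR picard samtools gatk plink snakemake \
    || echo "Some tools were not found; is the RGStraP environment active?"
```

If any tool is reported as missing, re-activate the environment or recreate it from environment.yaml before running the pipeline.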

Running the pipeline

Modify the config/config.yaml file according to where the necessary files are in your system. Variables to modify include:
- Path to a file listing ONLY the first file of each pair (the first mate) of the paired-end FASTQ samples to be analyzed.
- Path to metadata file (required for adding read-group information with GATK).
  - Must be a tab-delimited file with 6 columns and no header. The first column holds the BAM file location in the format 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam; the remaining five columns hold the read-group ID, platform, sample name, library, and platform unit, in that order.
  - More info here, example here.
- Path to directory of reference genome index generated by STAR.
- Path to reference genome fasta file.
- Path to indel files (for GATK's BaseRecalibrator).
- Path to flashpca.
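
The metadata format described above can be checked before a run. The following is a minimal sketch, not part of the pipeline: validate_metadata is a hypothetical helper, and the sample names and read-group values in the example rows are made up for illustration.

```shell
# Validate a read-group metadata file: tab-delimited, 6 columns, no header,
# column 1 of the form 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam.
# Prints each problem found and returns non-zero if any line fails.
validate_metadata() {
    awk -F'\t' '
        NF != 6 {
            printf "line %d: expected 6 tab-separated columns, found %d\n", NR, NF
            bad = 1
            next
        }
        $1 !~ /^2_mapped\/.+_Aligned\.sortedByCoord\.out\.bam$/ {
            printf "line %d: unexpected BAM path: %s\n", NR, $1
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# Hypothetical example rows, in the column order:
# BAM path, read-group ID, platform, sample name, library, platform unit.
meta=$(mktemp)
printf '2_mapped/sampleA_Aligned.sortedByCoord.out.bam\trg1\tILLUMINA\tsampleA\tlib1\tunit1\n' > "$meta"
printf '2_mapped/sampleB_Aligned.sortedByCoord.out.bam\trg2\tILLUMINA\tsampleB\tlib1\tunit2\n' >> "$meta"

if validate_metadata "$meta"; then
    echo "metadata OK"
fi
rm -f "$meta"
```

Running validate_metadata on your own file flags lines with the wrong number of columns (often caused by spaces used instead of tabs) or a BAM path that does not follow the expected pattern.
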

Adjust the 'dupedist' value in scripts/clumpify_OpDup.sh to match your sequencing platform (recommendations are included in the script).

Test the pipeline by performing a dry run.

snakemake -n

Running the pipeline on a cluster with a workload manager / job scheduler such as Slurm is highly recommended. An example Snakemake profile for Slurm is included.
- Set the partition name in the slurm/config.yaml file to match your cluster.
- You can also change the maximum number of jobs run at once in the slurm/config.yaml file.

# To run pipeline on slurm
snakemake --profile slurm

Running the pipeline from VCF file (lite version)

RGStraP can also be used to capture RG-PCs from existing VCF files via the lite version.

Make sure to modify the config/lite_config.yaml file accordingly.

# To run the lite pipeline (here with 2 cores)
snakemake -s lite_Snakefile --cores 2
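
Because PCA is computed across samples, it can help to confirm that the input VCF contains multiple samples before starting the lite run. The sketch below is an illustration only: vcf_summary is a hypothetical helper, it assumes a plain-text (uncompressed) VCF, and the tiny example file is made up.

```shell
# Summarise a plain-text VCF: number of sample columns and variant records.
vcf_summary() {
    awk -F'\t' '
        /^##/ { next }                                   # skip meta-information lines
        /^#CHROM/ { samples = (NF > 9) ? NF - 9 : 0; next }  # samples follow 9 fixed columns
        { variants++ }
        END { printf "samples=%d variants=%d\n", samples, variants }
    ' "$1"
}

# Made-up two-sample, two-variant VCF for demonstration.
vcf=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n' >> "$vcf"
printf '1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n' >> "$vcf"
printf '1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/0\t0/1\n' >> "$vcf"

vcf_summary "$vcf"   # prints: samples=2 variants=2
rm -f "$vcf"
```

For a gzip-compressed VCF, the file would need to be decompressed to a temporary file first, since the helper reads its argument directly.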

License

Apache 2.0 License
