RGStraP (RNA-seq-based Genetic Stratification PCs) is a bioinformatics pipeline for calculating principal components (PCs) that capture genetic stratification from RNA-seq data. The pipeline relies mainly on the variant calling capabilities of GATK4 and the principal component analysis (PCA) of FlashPCA2. The pipeline was built using Snakemake.
Muhamad Fachrul, [email protected]
Fachrul, M., Karkey, A., Shakya, M., Judd, L. M., Harshegyi, T., Sim, K. S., Tonks, S., Dongol, S., Shrestha, R., Salim, A., Baker, S., Pollard, A. J., Khor, C. C., Dolecek, C., Basnyat, B., Dunstan, S. J., Holt, K. E., & Inouye, M. (2023). Direct inference and control of genetic population structure from RNA sequencing data. Communications Biology, 6(1), Article 1. https://doi.org/10.1038/s42003-023-05171-9
Highlighted by Health Data Research UK (HDRUK): Understanding genetic diversity through RNA data to inform future research
Most of the dependencies (including FastQC v0.11.8, Trim galore v0.6.0, BBMap (for Clumpify.sh), STAR v2.7.10a, Picard v2.24.0, Samtools v1.8, GATK4 v4.0.6.0, and PLINK 1.9 v1.90b6.16) are included in the setup.
Please install FlashPCA v2.0 from source.
- Install a Conda-based Python3 distribution such as Miniconda or Mambaforge. In this case, we will use the latter as an example.
curl -L https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh -o Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
- Create and/or navigate to the directory in which you want the analysis of your project to take place, then clone this repository.
git clone https://github.com/fachrulm/RGStraP
- Change into the RGStraP directory, and create a Conda environment to run the pipeline.
cd RGStraP
# Activate Conda environment
conda activate base
# Create RGStraP environment
mamba env create --name RGStraP --file environment.yaml
- Activate the RGStraP environment. This environment needs to be active every time you want to use the pipeline.
conda activate RGStraP
# To deactivate the environment
conda deactivate
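With the RGStraP environment active, a quick check that the bundled tools resolve on PATH can catch a broken installation before a long run. The sketch below is a hypothetical helper, not part of RGStraP. The executable names in the example call are assumptions based on how these tools are commonly packaged for Conda, and they may differ from what environment.yaml installs.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RGStraP): print every given command name
# that cannot be found on PATH. Returns non-zero if at least one is missing.
missing_tools() {
  local rc=0
  local tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "not found: $tool"
      rc=1
    fi
  done
  return "$rc"
}

# Example call (assumed executable names; adjust to your installation):
# missing_tools fastqc trim_galore clumpify.sh STAR picard samtools gatk plink
```

FlashPCA is left out of the example call because it is installed separately and referenced by path in the configuration.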
- Modify the config/config.yaml file according to where the necessary files are in your system. Variables to modify include:
- Path to a file containing list of ONLY the first pair of paired-end fastq samples to be analyzed.
- Path to metadata file (required for adding read-group information with GATK).
- Has to be a tab-delimited file with 6 columns and no header, with the first column containing BAM file locations with the format 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam and the next five columns representing read-group ID, platform, sample name, library, and platform unit, respectively.
- More info here, example here.
- Path to directory of reference genome index generated by STAR.
- Path to reference genome fasta file.
- Path to indel files (for GATK's BaseRecalibrator).
- Path to flashpca.
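The metadata layout described above is easy to get subtly wrong (a missing column, spaces instead of tabs, a BAM path in a different form), so it can help to check the file before starting the pipeline. The following is a minimal sketch of such a check, assuming only the rules stated above. The function name is hypothetical and the script is not part of RGStraP.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RGStraP): check a metadata file against
# the layout described above. Prints one message per problem and returns
# non-zero if any line fails.
validate_metadata() {
  # $1: path to the tab-delimited metadata file (no header expected)
  awk -F'\t' '
    # Every line must have exactly 6 tab-delimited columns.
    NF != 6 {
      printf "line %d: expected 6 tab-delimited columns, found %d\n", NR, NF
      bad = 1
      next
    }
    # Column 1 must look like 2_mapped/[FILENAME]_Aligned.sortedByCoord.out.bam
    $1 !~ /^2_mapped\/.+_Aligned\.sortedByCoord\.out\.bam$/ {
      printf "line %d: unexpected BAM path: %s\n", NR, $1
      bad = 1
    }
    # Columns 2-6 (read-group fields) must not be empty.
    {
      for (i = 2; i <= 6; i++) {
        if ($i == "") {
          printf "line %d: column %d is empty\n", NR, i
          bad = 1
        }
      }
    }
    END { exit bad + 0 }
  ' "$1"
}

# Example call (the file name is a placeholder):
# validate_metadata metadata.tsv && echo "metadata layout looks OK"
```

The check covers only the file's shape. It does not verify that the BAM files exist or that the read-group values are meaningful for GATK.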
- Please adjust the 'dupedist' value according to your sequencing platform in the scripts/clumpify_OpDup.sh file (recommendations included within the script).
- Test the pipeline by performing a dry-run.
snakemake -n
- Running the pipeline on a cluster using a workload manager / job scheduler, such as slurm, is highly recommended. An example of a snakemake profile to run it on slurm is included.
- Please modify the partition name in the slurm/config.yaml file accordingly.
- You can also modify the maximum number of jobs to be run at once in the slurm/config.yaml file.
# To run pipeline on slurm
snakemake --profile slurm
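The dry run and the actual submission can be chained so that jobs are submitted only when the dry run succeeds with the same arguments. This is a small sketch of that idea. The wrapper name is hypothetical and not part of RGStraP, and it uses only the snakemake -n and --profile options already shown above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (not part of RGStraP): run Snakemake for real only if
# a dry run with the same arguments succeeds first.
run_after_dry_run() {
  if snakemake -n "$@"; then
    snakemake "$@"
  else
    echo "dry run failed; nothing was submitted" >&2
    return 1
  fi
}

# Example call, using the slurm profile included in the repository:
# run_after_dry_run --profile slurm
```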
RGStraP can also be used to capture RG-PCs from existing VCF files via the lite version.
- Make sure to modify the config/lite_config.yaml file accordingly.
# To run the lite pipeline (this example uses 2 cores)
snakemake -s lite_Snakefile --cores 2
Apache 2.0 License