gibran hemani edited this page Jan 16, 2025 · 17 revisions

System requirements

  • Linux x86 operating system
  • GNU Bash 5.1.16 or higher
  • R 4.1.0 or higher
  • (Optional) Snakemake 8.19 or higher
  • (Optional) Docker version 26.1.3 or higher
  • (Optional) Apptainer version 1.3.2 or higher
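The two mandatory version requirements above can be checked from the command line before going further. The following is a minimal sketch, not part of the pipeline: `version_ge` is a hypothetical helper name, and it assumes GNU `sort` (for the `-V` version-sort option) is available, which is normally the case on Linux.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the pipeline): succeeds if version $1 >= version $2.
version_ge() {
    # sort -V orders version strings; if the required minimum sorts first
    # (or the two are equal), the installed version is new enough.
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# BASH_VERSION looks like "5.1.16(1)-release"; keep only the leading digits and dots.
bash_version="${BASH_VERSION%%[^0-9.]*}"
if version_ge "$bash_version" "5.1.16"; then
    echo "Bash $bash_version: OK"
else
    echo "Bash $bash_version: older than 5.1.16"
fi

# R is only checked if Rscript is on the PATH.
if command -v Rscript >/dev/null 2>&1; then
    r_version="$(Rscript -e 'cat(as.character(getRversion()))')"
    if version_ge "$r_version" "4.1.0"; then
        echo "R $r_version: OK"
    else
        echo "R $r_version: older than 4.1.0"
    fi
else
    echo "Rscript not found on PATH"
fi
```

The optional tools (Snakemake, Docker, Apptainer) could be checked in the same way by passing their reported version strings to `version_ge`.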

We have attempted to make the pipeline as resource-efficient as possible. In our UK Biobank analysis, steps 00-03 required < 10 GB RAM and completed in < 1 hour using 100 cores. Step 04 (the GWAS step) takes a couple of minutes per GWAS using 100 cores and < 5 GB RAM.

Get the code

Use git to clone the repository:

git clone https://github.com/MRCIEU/Lifecourse-GWAS.git

Configure the analysis

Set up your directory locations. Copy config-template.env to a new file called config.env, then edit it to give the paths to the genotype and phenotype data locations etc. as required:

cp config-template.env config.env

We recommend using data paths that are outside the cloned code repository. The config file asks for the following working data locations, ideally on fast disk that can be accessed by HPC nodes:

genotype_input_list="/EDIT/THIS/PATH"
phenotype_input_dir="/EDIT/THIS/PATH"
phenotype_processed_dir="/EDIT/THIS/PATH"
genotype_processed_dir="/EDIT/THIS/PATH"

Note that we will never ask you to transfer any data from the raw individual-level data directories listed above. All the results from the pipeline will be stored in the results_dir:

results_dir="/EDIT/THIS/PATH"

We will only store non-disclosive summary data here, i.e. data that is safe to transfer to our servers for checking and subsequent meta-analysis etc.
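It is easy to leave one of the template paths unedited. The sketch below lists any variables in config.env that are still set to the `/EDIT/THIS/PATH` placeholder shown above. It is an illustration only: `unedited_paths` is a hypothetical helper, not something shipped with the pipeline, and it assumes each setting sits on its own line in `name="value"` form, as in the excerpts above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of variables in the given file
# whose value is still the template placeholder "/EDIT/THIS/PATH".
unedited_paths() {
    # -n with the trailing p prints only lines where the substitution matched,
    # and the substitution reduces each such line to the variable name.
    sed -n -E 's|^([A-Za-z_][A-Za-z0-9_]*)="?/EDIT/THIS/PATH"?[[:space:]]*$|\1|p' "$1"
}

config_file="config.env"
if [ -f "$config_file" ]; then
    remaining="$(unedited_paths "$config_file")"
    if [ -n "$remaining" ]; then
        echo "Still to edit in $config_file:"
        echo "$remaining"
    else
        echo "No placeholder paths left in $config_file"
    fi
fi
```

Run it from the root of the cloned repository, where config.env lives.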

The sftp_username parameter is used to upload results from the analysis to the SFTP server. Please contact us if you haven't received this username and its password.

For the cohort_name parameter, please provide only an alphanumeric string with no spaces or special characters (e.g. ALSPAC).
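The alphanumeric rule can be tested quickly in the shell before the name goes into config.env. This is a sketch under our own reading of the rule (letters A-Z, a-z and digits 0-9 only, non-empty); `is_valid_cohort_name` is a hypothetical function and may not match exactly what the pipeline's own check does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the name is a non-empty string
# made up entirely of ASCII letters and digits.
is_valid_cohort_name() {
    case "$1" in
        # Empty, or containing any character outside A-Z, a-z, 0-9: reject.
        ''|*[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

for name in "ALSPAC" "my cohort" "cohort-1"; do
    if is_valid_cohort_name "$name"; then
        echo "'$name' is valid"
    else
        echo "'$name' is not valid"
    fi
done
```

Of the three example names, only ALSPAC passes; the other two contain a space and a hyphen respectively.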

Please check that the genome_build field is correct.

Specifying genotype_input_list

The locations of the .bgen and .sample files need to be specified in a particular way; please see here for details.

Installing R packages

In R (ideally version 4.4.2) run the following:

install.packages("renv")
renv::restore()

This will automatically install all the correctly versioned R packages required to run the pipeline.

Check things are working

To check that the R packages are installed and the binaries are working, please run:

./utils/check.sh

You should see the following output:

Checking R packages...
No issues found -- the project is in a consistent state.
Checking plink...
All good!
Checking flashpca...
All good!
Checking king...
All good!
Checking cohort name...
<cohort_name> is a valid cohort name!
Checking genotype input list...
Checking 23 bgen files exist
All good!

If you see otherwise, let us know. You may need to use containers to provide the appropriate executable environment.

Containers (optional)

If you wish to run the analysis in a controlled containerised environment, with the relevant R packages and binaries pre-installed and tested, you can use Docker or Apptainer. If you need to use an alternative container environment, let us know.

For Docker:

docker pull mrcieu/lifecourse-gwas:latest
./utils/run_docker.sh ./utils/check.sh

For Apptainer:

apptainer pull docker://mrcieu/lifecourse-gwas:latest
./utils/run_apptainer.sh ./utils/check.sh