Data preparation

gibran hemani edited this page Jan 16, 2025 · 33 revisions

An example of how to format the various files required for the pipeline is provided here: https://github.com/MRCIEU/Lifecourse-GWAS/tree/main/test/example-data

Covariate data

There are three types of covariate data used in the pipeline:

  • Genetic principal components - generated by the pipeline or optionally provided by the analyst
  • Static covariates - provided by the analyst in a file called static_covariates.txt (see below).
  • Time varying covariates - provided by the analyst for relevant phenotypes within the long-format phenotype files (see below).

Static covariates

There should be a single covariate file for the 'static' covariates in the analysis. It must be named static_covariates.txt and stored in the phenotype_input_dir folder specified in config.env, with the following minimum format (note you may need to provide more covariate columns as explained further below):

FID IID sex yob my_covariate1 my_covariate2
ID1 ID1 1 1970 1 1
ID2 ID2 1 1974 0 NA
ID3 ID3 2 NA 0 0
  • FID (required) = family ID (recommended to be same as individual ID)
  • IID (required) = individual ID
  • yob (required) = Year of birth. This may be difficult to source for some cohorts, and it matters most for multi-generational cohorts. If year of birth is not easily available, an indicator of generation such as approximate decade of birth will suffice.
  • sex (required) = Number of X chromosomes (e.g. 1 for male, 2 for female)
  • my_covariate1 (optional) = Additional study-specific covariate e.g. batch
  • my_covariate2 (optional) = Additional study-specific covariate e.g. batch

Please ensure that every individual being analysed has an entry in this static_covariates.txt file. This file will be used to provide covariates for all phenotypes, so if individuals are present in the phenotype files but don't have static covariates then they will be omitted from the analysis for that phenotype.
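
One way to check this before running the pipeline is to compare the IDs in a phenotype file against those in static_covariates.txt. The following is a minimal sketch, assuming whitespace-delimited files with a header row as in the examples on this page. The function names read_ids and missing_static_covariates are hypothetical helpers and are not part of the pipeline.

```python
from pathlib import Path


def read_ids(path):
    """Return the set of (FID, IID) pairs in a whitespace-delimited file with a header row."""
    lines = Path(path).read_text().splitlines()
    header = lines[0].split()
    fid_col, iid_col = header.index("FID"), header.index("IID")
    ids = set()
    for line in lines[1:]:
        fields = line.split()
        if fields:  # skip blank lines
            ids.add((fields[fid_col], fields[iid_col]))
    return ids


def missing_static_covariates(phenotype_path, covariate_path):
    """Return sorted (FID, IID) pairs found in the phenotype file but not in the covariate file."""
    return sorted(read_ids(phenotype_path) - read_ids(covariate_path))
```

For example, missing_static_covariates("bmi.txt", "static_covariates.txt") returns an empty list when every individual in bmi.txt has a static covariate entry.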

If you are adding optional study-specific static covariates, please ensure that your config.env file is also updated. For example, the line defining static_covariates will look like:

static_covariates="sex yob my_covariate1 my_covariate2"

Note that if year of birth is not available you can omit it: don't include the variable in static_covariates.txt, and make sure it is removed from the static_covariates field in config.env.

Missing values are accepted and should be specified with NA in all phenotype and covariate text files.

Phenotype data

The file phenotype_list.csv lists all the phenotypes that can be contributed to the analysis. We require one file for every phenotype that is to be contributed.

For example, to run the analysis for body mass index there must be a file named bmi.txt located in the phenotype_input_dir folder specified in config.env, with the following format:

FID IID value age
ID1 ID1 23.000 15.25
ID2 ID2 27.000 23.33
ID3 ID3 18.000 5.00
  • FID = family ID (recommended to be same as individual ID)
  • IID = individual ID
  • value = phenotypic value. Note that the phenotypic values must be in the units specified in phenotype_list.csv, and for sanity checking the phenotype values should be within the specified permitted range in this spreadsheet.
  • age = age in years at which the phenotypic value was measured. Non-integer values are allowed and are encouraged for younger ages. For example, if phenotype measurements were recorded in months, please convert to years using age = months / 12.

Note that FID IID doesn't have to be unique, as a single individual could be measured at multiple time points. However, duplicate values of FID IID age will be removed in the pipeline.
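
The age conversion and the duplicate rule can be checked before the files are written. This is a minimal sketch, assuming the measurements are held as a list of dicts with the keys FID, IID and age. The helper names months_to_years and duplicated_keys are hypothetical, and the sketch only reports duplicates; it does not reproduce how the pipeline chooses which row to drop.

```python
from collections import Counter


def months_to_years(age_months):
    """Convert an age recorded in months to years (age = months / 12)."""
    return age_months / 12


def duplicated_keys(rows):
    """Return the (FID, IID, age) keys that occur more than once in rows.

    Each row is a dict with at least the keys "FID", "IID" and "age".
    """
    counts = Counter((r["FID"], r["IID"], r["age"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


rows = [
    {"FID": "ID1", "IID": "ID1", "value": 23.0, "age": months_to_years(183)},
    # Same person at a later time point: allowed.
    {"FID": "ID1", "IID": "ID1", "value": 24.5, "age": months_to_years(240)},
    # Same person at the same age: reported as a duplicate.
    {"FID": "ID1", "IID": "ID1", "value": 23.4, "age": months_to_years(183)},
]
print(duplicated_keys(rows))
```

Resolving any reported duplicates yourself keeps the choice of which measurement is retained under your control.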

Extra columns can be included and will be ignored.

Age-specific covariates for some phenotypes

In the phenotype_list.csv each phenotype lists the covariates that are required for the analysis. Most just specify sex, but some require age-specific covariates. For those phenotypes, please make sure that the phenotype file also has the additional covariate column(s) required. The pipeline will be expecting those age-specific covariates in the phenotype file for that phenotype. This is an example for LDL cholesterol:

FID IID value age cholesterol_med
ID1 ID1 2.6 15.33 0
ID2 ID2 2.9 23.50 NA
ID3 ID3 2.5 65.83 1
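
A quick header check can catch a missing age-specific covariate column before the pipeline is run. This is a minimal sketch, assuming a whitespace-delimited header line. The required column names are passed in explicitly because the layout of phenotype_list.csv is not assumed here, and missing_columns is a hypothetical helper name.

```python
def missing_columns(header_line, required):
    """Return the names in `required` that are absent from a whitespace-delimited header line."""
    present = set(header_line.split())
    return [name for name in required if name not in present]


# Columns expected for the LDL cholesterol example above.
required = ["FID", "IID", "value", "age", "cholesterol_med"]
print(missing_columns("FID IID value age", required))
```

An empty list means every required column is present in the header.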

Optional PCs file

The pipeline will generate PCs automatically from the genotype data, accounting for family relatedness where necessary, and using the standard flashpca algorithm. However, if you wish to provide previously generated PCs, save a file named pcs.txt in the $genotype_processed_dir location that is specified in the config.env file. The file should be formatted as:

FID IID PC1 PC2 PC3 PC4
ID1 ID1 0.1 -0.2 0.04 0.07
ID2 ID2 0.2 -0.4 0.03 0.06

i.e. each row is an individual with their family ID and individual ID followed by a value for each principal component that you wish to include in the analysis.

Optional sample inclusion file

It may be convenient to use bulk phenotype or genotype data files and specify a subset of individuals in those files to be analysed in the pipeline. Do this by setting sample_inclusion_list="/path/to/textfile" in config.env and formatting this file with one individual per row, with their family ID and individual ID columns e.g.

ID1 ID1
ID2 ID2
ID3 ID3

This could be useful if you have multiple ancestries in your genotype data. If you know the ancestral group of each individual, you can specify an inclusion list instead of creating genotype data subsets.

If sample_inclusion_list="" is left blank then the pipeline will attempt to use all individuals.
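
If ancestry assignments are already available, the inclusion file can be generated from them. This is a minimal sketch, assuming the assignments are held in a dict keyed by (FID, IID). The group labels and the function names inclusion_lines and write_inclusion_list are hypothetical and not part of the pipeline.

```python
def inclusion_lines(ancestry_by_id, group):
    """Return "FID IID" lines for every individual assigned to `group`, sorted by ID."""
    return [
        f"{fid} {iid}"
        for (fid, iid), label in sorted(ancestry_by_id.items())
        if label == group
    ]


def write_inclusion_list(path, ancestry_by_id, group):
    """Write one "FID IID" row per individual in `group` to `path`."""
    with open(path, "w") as handle:
        for line in inclusion_lines(ancestry_by_id, group):
            handle.write(line + "\n")


# Hypothetical ancestry assignments keyed by (FID, IID).
ancestry_by_id = {
    ("ID1", "ID1"): "EUR",
    ("ID2", "ID2"): "SAS",
    ("ID3", "ID3"): "EUR",
}
print(inclusion_lines(ancestry_by_id, "EUR"))
```

Calling write_inclusion_list with a path of your choice produces a file that the sample inclusion setting in config.env can point to.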

Imputed genotype data

These are the requirements for the imputed genotype data:

  • Imputed to a recent reference panel such as TOPMed or the HRC reference panel v1.1. The Haplotype Reference Consortium offers a free imputation service (including HRC) which you can use here or here. By far the easiest way to do imputation now is to use this service.
    • If this poses a problem, e.g. the cohort has ancestry-specific imputation panels, that's absolutely fine. Please just let us know.
    • If you will struggle to impute to one of the more recent reference panels then we can accept imputations to 1000 genomes phase 3, but please contact us about this.
  • Variant positions in hg19 or hg38
  • Imputed data must be in bgen format.
  • A separate file for each chromosome
  • No spaces in the file names
  • All 22 autosomes should be provided
  • Chromosome X is optional, but should be provided in .pgen format. A single chromosome X file is required, so please take care in combining males and females. If this is done erroneously (e.g. creating apparent multiploidy at some variants) then step 00b will throw an error.
  • The pipeline will not modify any of the data provided in $genotype_input_dir, so that can be a read-only directory if necessary.
  • A single major ancestral group is represented for a single run of the pipeline. An example of how the major ancestral groups in UKB were clustered can be found here.

Creating the genotype_input_list

The imputed genotype data must be provided in .bgen format. If you provide data for the X chromosome, males must be coded as diploid (i.e. 0/2).

The pipeline needs to be able to find these files to use for the analysis. To do this, please create a file (e.g. geno_files.txt) that has a structure like this:

/path/to/data_01.bgen /path/to/data.sample
/path/to/data_02.bgen /path/to/data.sample
/path/to/data_03.bgen /path/to/data.sample
/path/to/data_04.bgen /path/to/data.sample
/path/to/data_05.bgen /path/to/data.sample
/path/to/data_06.bgen /path/to/data.sample
/path/to/data_07.bgen /path/to/data.sample
/path/to/data_08.bgen /path/to/data.sample
/path/to/data_09.bgen /path/to/data.sample
/path/to/data_10.bgen /path/to/data.sample
/path/to/data_11.bgen /path/to/data.sample
/path/to/data_12.bgen /path/to/data.sample
/path/to/data_13.bgen /path/to/data.sample
/path/to/data_14.bgen /path/to/data.sample
/path/to/data_15.bgen /path/to/data.sample
/path/to/data_16.bgen /path/to/data.sample
/path/to/data_17.bgen /path/to/data.sample
/path/to/data_18.bgen /path/to/data.sample
/path/to/data_19.bgen /path/to/data.sample
/path/to/data_20.bgen /path/to/data.sample
/path/to/data_21.bgen /path/to/data.sample
/path/to/data_22.bgen /path/to/data.sample
/path/to/data_23.bgen /path/to/data.sample

Explanation:

  • Each row represents one chromosome
  • For a .bgen file it has two paths in the row - one to the .bgen file for that chromosome and one to the .sample file for that chromosome
  • .sample files can be 're-used' for multiple chromosomes if necessary
  • Ideally order the rows in chromosome order. Please ensure that the X chromosome is the last row.
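
When the files follow a regular naming pattern, the list can be generated rather than typed by hand. This is a minimal sketch, assuming the file names match the example above (data_01.bgen to data_23.bgen, with the X chromosome numbered 23) and a single shared .sample file. The function name genotype_input_rows and the {chrom} placeholder are hypothetical.

```python
def genotype_input_rows(bgen_pattern, sample_path, chromosomes=range(1, 24)):
    """Return one "bgen sample" row per chromosome, in chromosome order.

    `bgen_pattern` is a format string with a `{chrom}` placeholder,
    e.g. "/path/to/data_{chrom:02d}.bgen". With the default range the
    X chromosome (numbered 23 here) ends up on the last row.
    """
    return [f"{bgen_pattern.format(chrom=c)} {sample_path}" for c in chromosomes]


rows = genotype_input_rows("/path/to/data_{chrom:02d}.bgen", "/path/to/data.sample")
print("\n".join(rows))
```

Redirecting the printed rows to geno_files.txt gives a file with the structure shown above. Adjust the pattern if your files are named differently.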

Once you've created this file please specify where to find it in the config.env file e.g.

genotype_input_list="/path/to/geno_files.txt"

bgen-1.2 and indexing

If you are using .bgen files, note that the analysis will run faster if the .bgen files are version 1.2 and have corresponding index files (e.g. each .bgen file also has a .bgen.bgi index file). We have created a helper script to migrate to version 1.2 and create an index:

./utils/update_bgen.sh /path/to/new/bgen1.2/directory

Note that this will copy your original ${genotype_input_list} file to ${genotype_input_list}.original and create a new file with the bgen1.2 files listed in the original ${genotype_input_list}.