Data preparation
An example of how to format the various files required for the pipeline is provided here: https://github.com/MRCIEU/Lifecourse-GWAS/tree/main/test/example-data
There are three types of covariate data used in the pipeline:
- Genetic principal components - generated by the pipeline or optionally provided by the analyst.
- Static covariates - provided by the analyst in a file called static_covariates.txt (see below).
- Time-varying covariates - provided by the analyst for relevant phenotypes within the long-format phenotype files (see below).
There should be a single covariate file for the 'static' covariates in the analysis. It must be named static_covariates.txt and stored in the phenotype_input_dir folder specified in config.env, with the following minimum format (note you may need to provide more covariate columns as explained further below):
FID IID sex yob my_covariate1 my_covariate2
ID1 ID1 1 1970 1 1
ID2 ID2 1 1974 0 NA
ID3 ID3 2 NA 0 0
- FID (required) = family ID (recommended to be same as individual ID)
- IID (required) = individual ID
- yob (required) = Year of birth. Note that this may be difficult to source for some cohorts. This covariate matters most for multi-generational cohorts. If year of birth is not easily available, an indicator of generation, such as approximate decade of birth, will suffice.
- sex (required) = Number of X chromosomes (e.g. 1 for male, 2 for female)
- my_covariate1 (optional) = Additional study-specific covariate e.g. batch
- my_covariate2 (optional) = Additional study-specific covariate e.g. batch
Please ensure that every individual being analysed has an entry in this static_covariates.txt file. This file will be used to provide covariates for all phenotypes, so individuals who are present in the phenotype files but have no static covariates will be omitted from the analysis for that phenotype.
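Before running the pipeline it can be useful to check which individuals would be dropped for this reason. The following is a minimal sketch, not part of the pipeline: the function names are hypothetical, and it assumes both files are whitespace-delimited with a header row containing FID and IID, as in the examples on this page.

```python
def read_ids(path):
    """Return the set of (FID, IID) pairs in a whitespace-delimited file with a header row."""
    with open(path) as handle:
        header = handle.readline().split()
        fid_col, iid_col = header.index("FID"), header.index("IID")
        ids = set()
        for line in handle:
            fields = line.split()
            if fields:  # skip blank lines
                ids.add((fields[fid_col], fields[iid_col]))
    return ids


def missing_from_covariates(phenotype_path, covariate_path):
    """Individuals present in the phenotype file but absent from the covariate file."""
    return sorted(read_ids(phenotype_path) - read_ids(covariate_path))
```

Calling `missing_from_covariates("bmi.txt", "static_covariates.txt")` returns a sorted list of (FID, IID) pairs; an empty list means every individual in that phenotype file has static covariates.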
If you are adding optional study-specific static covariates, please ensure that your config.env file is also updated. For example, the line defining static_covariates will look like:
static_covariates="sex yob my_covariate1 my_covariate2"
Note that if year of birth is not available you can omit it. In that case, do not include the variable in static_covariates.txt, and make sure it is removed from the static_covariates field in config.env.
Missing values are accepted and should be specified with NA in all phenotype and covariate text files.
This file lists all the phenotypes that can be contributed to the analysis: phenotype_list.csv. We require one file for every phenotype that is to be contributed.
For example, to run the analysis for body mass index there must be a file named bmi.txt located in the phenotype_input_dir folder specified in config.env, with the following format:
FID IID value age
ID1 ID1 23.000 15.25
ID2 ID2 27.000 23.33
ID3 ID3 18.000 5.00
- FID = family ID (recommended to be same as individual ID)
- IID = individual ID
- value = phenotypic value. Note that the phenotypic values must be in the units specified in phenotype_list.csv and, as a sanity check, should fall within the permitted range specified in that spreadsheet.
- age = age at which the phenotypic value was measured, in years. Non-integer values are allowed, and encouraged for younger ages. For example, if phenotype measurements are recorded in months, please convert to years using age = months / 12.
Note that FID IID doesn't have to be unique, as a single individual could be measured at multiple time points. However, duplicate values of FID IID age will be removed in the pipeline.
Extra columns can be included and will be ignored.
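As an illustration of the age conversion and the duplicate rule described above, here is a minimal sketch. The helper names are hypothetical and not part of the pipeline; the number formatting (three decimals for value, two for age) simply mirrors the example rows and is an assumption, not a requirement.

```python
from collections import Counter


def months_to_years(age_months):
    """Convert an age recorded in months to (possibly fractional) years."""
    return age_months / 12


def format_row(fid, iid, value, age_years):
    """Build one line of a phenotype file; a missing value (None) is written as NA."""
    value_text = "NA" if value is None else f"{value:.3f}"
    return f"{fid} {iid} {value_text} {age_years:.2f}"


def duplicated_keys(rows):
    """Return (FID, IID, age) keys occurring more than once in rows of (fid, iid, value, age)."""
    counts = Counter((fid, iid, age) for fid, iid, _value, age in rows)
    return sorted(key for key, n in counts.items() if n > 1)
```

Checking `duplicated_keys` before writing the file lets you decide which of the repeated measurements to keep, rather than leaving that choice to the pipeline's duplicate removal.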
In phenotype_list.csv, each phenotype lists the covariates that are required for the analysis. Most just specify sex, but some require age-specific covariates. For those phenotypes, please make sure that the phenotype file also has the additional covariate column(s) required. The pipeline will be expecting those age-specific covariates in the phenotype file for that phenotype. This is an example for LDL cholesterol:
FID IID value age cholesterol_med
ID1 ID1 2.6 15.33 0
ID2 ID2 2.9 23.50 NA
ID3 ID3 2.5 65.83 1
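A quick header check can catch a missing age-specific covariate column before the pipeline does. This is a sketch under assumptions: the function name is hypothetical, the file is assumed to be whitespace-delimited with a header row, and the covariate names are passed in by hand rather than read from phenotype_list.csv, whose column layout is not described here.

```python
def missing_columns(phenotype_path, time_varying_covariates=()):
    """Return required column names that are absent from the phenotype file header."""
    required = ["FID", "IID", "value", "age", *time_varying_covariates]
    with open(phenotype_path) as handle:
        header = handle.readline().split()
    return [name for name in required if name not in header]
```

For the LDL cholesterol example, `missing_columns("ldl.txt", ["cholesterol_med"])` would return an empty list when the file has all five columns; the file name `ldl.txt` is illustrative only.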
The pipeline will generate PCs automatically from the genotype data, accounting for family relatedness where necessary, and using the standard flashpca algorithm. However, if you wish to provide previously generated PCs then this is possible. Save a file named pcs.txt in the $genotype_processed_dir location that is specified in the config.env file. The file should be formatted as:
FID IID PC1 PC2 PC3 PC4
ID1 ID1 0.1 -0.2 0.04 0.07
ID2 ID2 0.2 -0.4 0.03 0.06
i.e. each row is an individual with their family ID and individual ID followed by a value for each principal component that you wish to include in the analysis.
It may be convenient to use bulk phenotype or genotype data files and specify a subset of individuals in those files to be analysed in the pipeline. Do this by setting sample_inclusion_list="/path/to/textfile" in config.env and formatting this file with one individual per row, with their family ID and individual ID columns e.g.
ID1 ID1
ID2 ID2
ID3 ID3
This could be useful if you have multiple ancestries in your genotype data: if you know the ancestral group of each individual, you can specify an inclusion list instead of creating genotype data subsets.
If sample_include_list="" is left blank then the pipeline will attempt to use all individuals.
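For the multiple-ancestry case, an inclusion list could be produced from existing ancestry assignments along these lines. This is only a sketch: the function name, the dictionary input and the ancestry labels are all hypothetical, and how you obtain ancestry assignments is outside the scope of this page.

```python
def write_inclusion_list(ancestry_by_id, keep_label, path):
    """Write 'FID IID' rows (no header) for individuals whose ancestry label matches keep_label.

    ancestry_by_id maps (FID, IID) tuples to an ancestry label.
    Returns the number of individuals written.
    """
    kept = sorted(ids for ids, label in ancestry_by_id.items() if label == keep_label)
    with open(path, "w") as out:
        for fid, iid in kept:
            out.write(f"{fid} {iid}\n")
    return len(kept)
```

Running this once per ancestral group gives one inclusion list per pipeline run, while the bulk genotype files stay untouched.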
These are the requirements for the imputed genotype data:
- Imputed to a recent reference panel such as TOPMED or HRC reference panel v1.1. The Haplotype Reference Consortium offers a free imputation service (including HRC) which you can use here or here. By far the easiest way to do imputation now is to use this service.
  - If this poses a problem, e.g. the cohort has ancestry-specific imputation panels, that's absolutely fine; please just let us know.
  - If you will struggle to impute to one of the more recent reference panels then we can accept imputations to 1000 genomes phase 3, but please contact us about this.
- Variant positions in hg19 or hg38
- Imputed data must be in bgen format.
- A separate file for each chromosome
- No spaces in the file names
- All 22 autosomes should be provided
- Chromosome X is optional, but should be provided in .pgen format. A single chromosome X file is required, so please take care in combining males and females. If this is done erroneously (e.g. creating apparent multiploidy at some variants) then step00b will throw an error.
- The pipeline will not modify any of the data provided in $genotype_input_dir, so that can be a read-only directory if necessary.
- A single major ancestral group is represented for a single run of the pipeline. An example of how the major ancestral groups in UKB were clustered can be found here.
The imputed genotype data must be provided in .bgen format. If you provide data for the X chromosome, males must be coded as diploid (i.e. 0/2).
The pipeline needs to be able to find these files to use for the analysis. To do this, please create a file (e.g. geno_files.txt) that has a structure like this:
/path/to/data_01.bgen /path/to/data.sample
/path/to/data_02.bgen /path/to/data.sample
/path/to/data_03.bgen /path/to/data.sample
/path/to/data_04.bgen /path/to/data.sample
/path/to/data_05.bgen /path/to/data.sample
/path/to/data_06.bgen /path/to/data.sample
/path/to/data_07.bgen /path/to/data.sample
/path/to/data_08.bgen /path/to/data.sample
/path/to/data_09.bgen /path/to/data.sample
/path/to/data_10.bgen /path/to/data.sample
/path/to/data_11.bgen /path/to/data.sample
/path/to/data_12.bgen /path/to/data.sample
/path/to/data_13.bgen /path/to/data.sample
/path/to/data_14.bgen /path/to/data.sample
/path/to/data_15.bgen /path/to/data.sample
/path/to/data_16.bgen /path/to/data.sample
/path/to/data_17.bgen /path/to/data.sample
/path/to/data_18.bgen /path/to/data.sample
/path/to/data_19.bgen /path/to/data.sample
/path/to/data_20.bgen /path/to/data.sample
/path/to/data_21.bgen /path/to/data.sample
/path/to/data_22.bgen /path/to/data.sample
/path/to/data_23.bgen /path/to/data.sample
Explanation:
- Each row represents one chromosome
- For a .bgen file it has two paths in the row - one to the .bgen file for that chromosome and one to the .sample file for that chromosome
- .sample files can be 're-used' for multiple chromosomes if necessary
- Ideally order the rows in chromosome order. Please ensure that the X chromosome is the last row.
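If your files follow a regular naming pattern, the list can be generated rather than typed by hand. The sketch below is an illustration, not a pipeline utility: the function name and the `{chrom:02d}` template are assumptions about your own file naming, and it assumes one shared .sample file with the X chromosome numbered 23, as in the example above.

```python
def geno_file_lines(bgen_template, sample_path, include_x=True):
    """Return one '<bgen path> <sample path>' line per chromosome, X (23) last.

    bgen_template is a format string with a {chrom} field,
    e.g. "/path/to/data_{chrom:02d}.bgen".
    """
    chromosomes = list(range(1, 23)) + ([23] if include_x else [])
    lines = []
    for chrom in chromosomes:
        bgen_path = bgen_template.format(chrom=chrom)
        # File names must not contain spaces (see the genotype data requirements).
        if " " in bgen_path or " " in sample_path:
            raise ValueError(f"space in file path for chromosome {chrom}")
        lines.append(f"{bgen_path} {sample_path}")
    return lines
```

Writing `"\n".join(lines) + "\n"` to geno_files.txt then gives a file with the rows in chromosome order and the X chromosome as the last row.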
Once you've created this file, please specify where to find it in the config.env file e.g.
genotype_input_list="/path/to/geno_files.txt"
If you are using .bgen files, note that the analysis will run faster if the .bgen files are version 1.2 and have corresponding index files (e.g. each .bgen file also has a .bgen.bgi index file). We have created a helper script to migrate to version 1.2 and create an index:
./utils/update_bgen.sh /path/to/new/bgen1.2/directory
Note that this will copy your original ${genotype_input_list} file to ${genotype_input_list}.original and create a new file with the bgen1.2 files listed in the original ${genotype_input_list}.