-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpractical.rmd
227 lines (150 loc) · 9.2 KB
/
practical.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
---
title: Multi-ancestry PRS practical
author: Gibran Hemani
---
## Background
A GWAS has been conducted for an adiposity-related trait in 5 populations:
- European (EUR), n = 100,000
- African (AFR), n = 30,000
- East Asian (EAS), n = 30,000
- South Asian (SAS), n = 30,000
- Admixed American (AMR), n = 30,000
As you can see the sample sizes are different across populations. The file `gwas_res.csv` has a subset of the results from this GWAS, retaining only the results that were GWAS significant (p < 5e-8) in the EUR population.
In addition, the GWAS results were used to calculate the polygenic risk scores (PRS) in a small independent sample from each population. The file `individual_values.csv` contains the PRS and other phenotypes for each individual in each population.
## Objectives
- Understand the two datasets
- Look for differences in the effect sizes across populations
- Examine why differences in effect sizes might arise
- Investigate the performance of the PRS across populations
## Setup
Load relevant libraries
```{r}
library(ggplot2)
library(dplyr)
```
Read in the relevant data. We are reading the files directly from github here but you can also download the files and read them locally if you prefer
```{r}
gwas_res <- read.csv(url("https://raw.githubusercontent.com/MRCIEU/multi-ancestry-prs-practical/refs/heads/main/gwas_res.csv"))
individual_values <- read.csv(url("https://raw.githubusercontent.com/MRCIEU/multi-ancestry-prs-practical/refs/heads/main/individual_values.csv"))
```
## 1. Simple GWAS hits comparisons
Have a look at the GWAS summary data
```{r}
str(gwas_res)
```
**QUESTION**: What are the rows and columns in the dataset and what do they represent?
Let's see which SNPs are significant in Europeans
```{r}
table(gwas_res$pval_EUR < 5e-8)
```
**TASK**: How many SNPs are significant in each of the other 4 populations?
**QUESTION**: What factors might be driving the results that you have found?
Let's look at the allele frequencies of the significant SNPs, for example comparing AFR and EUR:
```{r}
ggplot(gwas_res, aes(x=af_EUR, y=af_AFR)) +
geom_point() +
geom_abline() +
geom_density_2d_filled(alpha=0.5)
```
**QUESTION**: In which population are the SNPs more common? Does this match what you expected based on the p-value comparison?
## 2. Comparing effect sizes across populations
Let's compare the effect sizes now. Subset the data to only include the effect estimates, then plot them against each other:
```{r}
subset(gwas_res, select = c("bhat_EUR", "bhat_AFR", "bhat_EAS", "bhat_SAS", "bhat_AMR")) |> pairs()
```
**QUESTION**: What do you observe from the plot? Does this match what you expected based on the p-value comparison?
## 3. Accounting for differences in power
Instead of looking at p-values to determine if the associations are consistent across populations, we could examine whether the effects are consistent. This can be achieved using a Cochran's Q heterogeneity test.
$$
Q = \sum_{i=1}^k w_i (\hat{\beta}_i - \hat{\beta})^2
$$
where $w_i = 1 / \text{SE}(\hat{\beta}_i)^2$ and $\hat{\beta}$ is the overall effect estimate. Essentially, this test asks if the confidence intervals of the effect estimates overlap. If they overlap, we would conclude that the effects are statistically consistent across populations.
The `Qpval` column gives us a p-value for the Cochran's Q test statistic for each SNP. Are any of these significant for heterogeneity after multiple testing?
```{r}
table(gwas_res$Qpval < 0.05 / nrow(gwas_res))
min(gwas_res$Qpval)
```
Let's examine the effect estimates for two SNPs, one that has a significant Qpval and one that doesn't. First the one that is not significant:
```{r}
plot_effect_size <- function(gwas_res, i) {
dat <- data.frame(
population = c("EUR", "AFR", "EAS", "SAS", "AMR"),
beta = c(gwas_res$bhat_EUR[i], gwas_res$bhat_AFR[i], gwas_res$bhat_EAS[i], gwas_res$bhat_SAS[i], gwas_res$bhat_AMR[i]),
se = c(gwas_res$se_EUR[i], gwas_res$se_AFR[i], gwas_res$se_EAS[i], gwas_res$se_SAS[i], gwas_res$se_AMR[i])
)
ggplot(dat, aes(x=beta, y=population)) +
geom_point(aes(size=1/se)) +
geom_errorbarh(aes(xmin=beta - 1.96 * se, xmax=beta + 1.96 * se)) +
geom_vline(xintercept=0, linetype="dashed") +
theme_minimal()
}
plot_effect_size(gwas_res, 1)
```
We can see that this SNP has slightly different effect estimates across populations, but the confidence intervals overlap quite substantially, suggesting that the true causal effect of the SNPs is relatively consistent across populations.
**TASK**: Use the `plot_effect_size` function above to now plot the effect estimates for the SNP that has strong evidence of heterogeneity. Which population appears to be driving the heterogeneity?
## 4. Examining why differences in effect sizes might arise
The SNP that you identified as having strong evidence of heterogeneity is available in another dataset, along with the adiposity phenotype, and a potential interacting variable (`alcohol` intake). Could a genotype x environment interaction explain the result, where the environmental variable here is alcohol consumption?
Let's load the data:
```{r}
str(individual_values)
```
**QUESTION**: Describe this dataset - what are the columns and rows, what do they represent?
**QUESTION**: If this were real data (it's not in this case, it's all simulated!), would you handle it differently to the data in `gwas_res`? Why?
Let's look at whether the heterogeneity effect by population still exists here:
```{r}
# For each population, fit a linear model
effect_estimates <- lapply(c("EUR", "AFR", "EAS", "SAS", "AMR"), \(p) {
# The linear model is estimating the effect of the SNP on phen in that subgroup
mod <- summary(lm(phen ~ rs58903867, data=subset(individual_values, pop == p)))
# Save the effect estimate and standard error
data.frame(pop=p, beta=mod$coefficients[2,1], se=mod$coefficients[2,2])
}) %>% bind_rows()
# Plot the effect estimates in a forest plot
ggplot(effect_estimates, aes(x=beta, y=pop)) +
geom_point(aes(size=1/se)) +
geom_errorbarh(aes(xmin=beta - 1.96 * se, xmax=beta + 1.96 * se)) +
geom_vline(xintercept=0, linetype="dashed") +
theme_minimal()
```
**QUESTION**: Does this replicate the evidence for heterogeneity that you observed in the GWAS summary data?
Instead of partitioning the samples by ancestry, let's now partition the samples by alcohol consumption instead to see if that is the relevant mechanism. First examine if alcohol consumption varies by population:
```{r}
individual_values %>% group_by(pop) %>% summarise(mean_alcohol=mean(alcohol))
```
It looks like alcohol consumption is much lower in South Asians. Let's now examine the effect estimates for the SNP by alcohol consumption:
```{r}
# For each alcohol value, fit a linear model
effect_estimates <- lapply(c(0, 1), \(p) {
# The linear model is estimating the effect of the SNP on phen in that subgroup
mod <- summary(lm(phen ~ rs58903867, data=subset(individual_values, alcohol == p)))
# Save the effect estimate and standard error
data.frame(alcohol=p, beta=mod$coefficients[2,1], se=mod$coefficients[2,2])
}) %>% bind_rows()
# Plot the effect estimates in a forest plot
ggplot(effect_estimates, aes(x=beta, y=as.factor(alcohol))) +
geom_point(aes(size=1/se)) +
geom_errorbarh(aes(xmin=beta - 1.96 * se, xmax=beta + 1.96 * se)) +
geom_vline(xintercept=0, linetype="dashed") +
theme_minimal()
```
**QUESTION**: Propose a mechanism that would explain the results you have seen.
**QUESTION**: How could the linear model be improved to reduce the possibility of bias?
## 5. Investigating the performance of the PRS across populations
Now let's examine the polygenic risk scores across populations. In the `individual_values.csv` file, this is simulated data, and we have the true polygenic scores that explain all the heritability of the trait in the `true_score` column. Let's look at the distribution of the true scores across populations:
```{r}
ggplot(individual_values, aes(x=true_score, fill=pop)) +
geom_density(alpha=0.5) +
theme_minimal()
ggplot(individual_values, aes(x=eur_score, fill=pop)) +
geom_density(alpha=0.5) +
theme_minimal()
```
The true polygenic scores have very similar distributions across populations, with some evidence that the mean score is slightly lower in East Asians compared to other populations.
We also generated polygenic scores based on the GWAS results. Here, **GWAS hits were chosen based on the European discovery sample**. This means that instead of using all causal variants to generate the score, it was only generated from the 287 SNPs that were significant in the European GWAS.
**TASK:** The estimated PRS is in the column `dat$eur_score`. Plot the distributions of the estimated PRS across populations. Do you see any differences?
We can also evaluate the prediction accuracy of the PRS in each population. This is how to do it for the EUR population:
```{r}
cor(individual_values$eur_score[individual_values$pop == "EUR"], individual_values$phen[individual_values$pop == "EUR"])^2
```
**QUESTION:** Which population has the highest prediction accuracy using the EUR PRS? Why do you think this is the case?
**QUESTION:** What could be done to improve the prediction accuracy in the other populations?