-
Notifications
You must be signed in to change notification settings - Fork 91
/
Copy path11_missing_data.Rmd
561 lines (425 loc) · 24.4 KB
/
11_missing_data.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
# (PART) Workflow {-}
Throughout this book we have tried to provide the most efficient approaches to data analysis using R.
In this section, we will provide workflows, or ways-of-working, which maximise efficiency, incorporate reporting of results within analyses, make exporting of tables and plots easy, and keep data safe, secured and backed up.
We also include a section on dealing with missing data in R. Something that we both feel strongly about and which is often poorly described and dealt with in academic publishing.
# The problem of missing data{#chap11-h1}
\index{missing data@\textbf{missing data}}
>In heaven, all the interesting people are missing.
>Friedrich Nietzsche
<!-- >People say nothing is impossible, but I do nothing every day. -->
<!-- >A.A.Milne, Winnie-the-Pooh -->
```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
mykable = function(x, caption = "CAPTION", ...){
kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = caption,
linesep = c("", "", "\\addlinespace"), ...) %>%
kable_styling(latex_options = c("scale_down", "hold_position"))
}
```
## Identification of missing data
As journal editors, we often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data.
This is frustrating, as it is often of the utmost importance.
Conclusions may (and do) change when missing data are accounted for.
Some folk seem to not even appreciate that in a conventional regression, only rows with complete data are included.
By reading this, you will not be one of them!
These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:
1. Ensure your data are coded correctly.
2. Identify missing values within each variable.
3. Look for patterns of missingness.
4. Check for associations between missing and observed data.
5. Decide how to handle missing data.
We will work through a number of functions that will help with each of these.
But first, here are some terms that are easy to mix up.
These are important as they describe the mechanism of missingness and this determines how you can handle the missing data.
For each of the following examples we will imagine that we are collecting data on the relationship between gender, smoking and the outcome of cancer treatment.
The ground truth in this imagined scenario is that both gender and smoking influence the outcome from cancer treatment.
### Missing completely at random (MCAR)
\index{missing data@\textbf{missing data}!missing completely at random}
As it says, values are randomly missing from your dataset.
Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.
In our example, smoking status is missing from a random subset of male and female patients.
This may have the effect of making our population smaller, but the complete case population has the same characteristics as the missing data population.
This is easy to handle, but unfortunately, data are almost never missing completely at random.
### Missing at random (MAR)
\index{missing data@\textbf{missing data}!missing at random}
This is confusing and would be better named *missing conditionally at random*.
Here, missingness in a particular variable has an association with one or more other variables in the dataset.
However, the *actual values of the missing data are random*.
In our example, smoking status is missing for some female patients but not for male patients.
But data is missing from the same number of female smokers as female non-smokers.
So the complete case female patients has the same characteristics as the missing data female patients.
### Missing not at random (MNAR)
\index{missing data@\textbf{missing data}!missing not at random}
The pattern of missingness is related to other variables in the dataset, but in addition, the *actual values of the missing data are not random*.
In our example, smoking status is missing in female patients who are more likely to smoke, but not for male patients.
Thus, the complete case female patients have different characteristics to the missing data female patients.
For instance, the missing data female patients may be more likely to die after cancer treatment.
Looking at our available population, we therefore under estimate the likelihood of a female dying from cancer treatment.
Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle.
They can only be detected by collecting and examining some of the missing data.
This is often difficult or impossible to do.
How you deal with missing data is dependent on the type of missingness.
Once you know the type, you can start addressing it.
More on this below.
## Ensure your data are coded correctly: `ff_glimpse()`
\index{variable types@\textbf{variable types}}
While it sounds obvious, this step is often ignored in the rush to get results.
The first step in any analysis is robust data cleaning and coding.
Lots of packages have a glimpse-type function and our own **finalfit** is no different.
This function has three specific goals:
1. Ensure all variables are of the type you expect them to be. That is the commonest reason to get an error with a **finalfit** function. Numbers should be numeric, categorical variables should be characters or factors, and dates should be dates (for a reminder on these, see Section \@ref(chap02-vartypes).
2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned `NA`.
3. Ensure factor levels and variable labels are assigned correctly.
### The Question
Using the `colon_s` colon cancer dataset, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.
For demonstration purposes, we will make up MCAR and MAR smoking variables (`smoking_mcar` and `smoking_mar`).
Do not worry about understanding the long cascading mutate and `sample()` functions below, this is merely for creating the example variables.
You would not be 'creating' your data, we hope.
```{r message=FALSE, warning=FALSE}
# Create some extra missing data
library(finalfit)
library(dplyr)
set.seed(1)
colon_s <- colon_s %>%
mutate(
## Smoking missing completely at random
smoking_mcar = sample(c("Smoker", "Non-smoker", NA),
n(), replace=TRUE,
prob = c(0.2, 0.7, 0.1)) %>%
factor() %>%
ff_label("Smoking (MCAR)"),
## Smoking missing conditional on patient sex
smoking_mar = ifelse(sex.factor == "Female",
sample(c("Smoker", "Non-smoker", NA),
sum(sex.factor == "Female"),
replace = TRUE,
prob = c(0.1, 0.5, 0.4)),
sample(c("Smoker", "Non-smoker", NA),
sum(sex.factor == "Male"),
replace=TRUE, prob = c(0.15, 0.75, 0.1))
) %>%
factor() %>%
ff_label("Smoking (MAR)")
)
```
We will then examine our variables of interest using `ff_glimpse()`:
```{r}
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor",
"smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"
colon_s %>%
ff_glimpse(dependent, explanatory)
```
You don't need to specify the variables, and if you don't, `ff_glimpse()` will summarise all variables:
```{r results="hide"}
colon_s %>%
ff_glimpse()
```
Use this to check that the variables are all assigned and behaving as expected.
The proportion of missing data can be seen, e.g., `smoking_mar` has `r round(sum(100*is.na(colon_s$smoking_mar))/dim(colon_s)[1], 0)`% missing data.
## Identify missing values in each variable: `missing_plot()`
\index{missing data@\textbf{missing data}!missingness plot}
Visualising data is essential to help understand it, and missing data is no exception.
`missing_plot()` function also from **finalfit** is useful for grasping the amount of missing data in each variable.
Row number is on the x-axis and all included variables are on the y-axis.
```{r}
colon_s %>%
missing_plot(dependent, explanatory)
```
Further visualisations of missingness can be done using the [naniar](http://naniar.njtierney.com) package.
## Look for patterns of missingness: `missing_pattern()`
\index{missing data@\textbf{missing data}!missingness pattern}
Using **finalfit**, `missing_pattern()` wraps a function from the **mice** package, `md.pattern()`.
This produces a table and a plot showing the pattern of missingness between variables.
```{r fig.height=4, fig.width=10}
explanatory <- c("age", "sex.factor",
"obstruct.factor",
"smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"
colon_s %>%
missing_pattern(dependent, explanatory)
```
This allows us to look for patterns of missingness between variables.
There are 11 patterns in these data.
The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.
## Including missing data in demographics tables
\index{missing data@\textbf{missing data}!demographics table}
"Table 1" in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders.
Do not silently drop missing values in this table.
It is easy to do this correctly with `summary_factorlist()`.
This function provides a useful summary of a dependent variable against explanatory variables.
Despite its name, continuous variables are handled nicely.
`na_include=TRUE` ensures missing data from the explanatory variables (but not dependent) are included.
To include missing values from the dependent, add `na_include_dependent = TRUE`.
Including a total column (`total_col = TRUE`) is also useful, as well as column totals (`add_col_totals = TRUE`).
If you are using a lot of continuous explanatory variables with missing values, then these can be seen easily using `add_row_totals = TRUE`.
Note that missing data is not included when *p*-values are generated.
If you wish missing data to be passed to statistical tests, then include `na_to_p = TRUE`.
```{r}
# Explanatory or confounding variables
explanatory <- c("age", "sex.factor",
"nodes",
"smoking_mcar", "smoking_mar")
# Explanatory variable of interest
dependent <- "obstruct.factor" # Bowel obstruction
table1 <- colon_s %>%
summary_factorlist(dependent, explanatory,
na_include=TRUE, na_include_dependent = TRUE,
total_col = TRUE, add_col_totals = TRUE, p=TRUE)
```
```{r echo=FALSE}
table1 %>%
mykable(caption = "Simulated missing completely at random (MCAR) and missing at random (MAR) dataset.")
```
## Check for associations between missing and observed data
\index{missing data@\textbf{missing data}!associations}
In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables.
This is particularly important (we would say absolutely required) for a primary outcome measure / dependent variable.
Take for example “death”.
When that outcome is missing it is often for a particular reason.
For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery.
And of course, death is more likely after emergency surgery.
`missing_pairs()` uses functions from the **GGally** package.
It produces pairs plots to show relationships between missing values and observed values in all variables.
```{r fig.height=13.5, fig.width=13.5, message=FALSE, warning=FALSE, fig.cap="Missing data matrix with `missing_pairs()`."}
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor",
"smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"
colon_s %>%
missing_pairs(dependent, explanatory)
```
For continuous variables (age and nodes), the distributions of observed and missing data can immediately be visually compared.
For example, look at Row 1 Column 2.
The age of patients who's mortality data is known is the blue box plot, and the age of patients with missing mortality data is the grey box plot.
For categorical data, the comparisons are presented as counts (remember `geom_bar()` from Chapter \@ref(chap04-h1)).
To be able to compare proportions, we can add the `position = "fill"` argument:
```{r fig.height=13.5, fig.width=13.5, message=FALSE, warning=FALSE, fig.cap="Missing data matrix with `missing_pairs(position = 'fill')` ."}
colon_s %>%
missing_pairs(dependent, explanatory, position = "fill")
```
Find the two sets of bar plots that show the proportion of missing smoking data for sex (bottom of Column 3).
Missingness in Smoking (MCAR) does not relate to sex - females and males have the same proportion of missing data.
Missingness in Smoking (MAR), however, does differ by sex as females have more missing data than men here.
This is how we designed the example at the top of this chapter, so it all makes sense.
We can also confirm this by using `missing_compare()`:
```{r message=FALSE, warning=FALSE}
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor")
dependent <- "smoking_mcar"
missing_mcar <- colon_s %>%
missing_compare(dependent, explanatory)
```
```{r echo=FALSE}
missing_mcar %>%
mykable(caption = "Missing data comparison: Smoking (MCAR).")
```
```{r message=FALSE, warning=FALSE}
dependent <- "smoking_mar"
missing_mar <- colon_s %>%
missing_compare(dependent, explanatory)
```
```{r echo=FALSE}
missing_mar %>%
mykable(caption = "Missing data comparison: Smoking (MAR).")
```
It takes dependent and explanatory variables, and in this context "dependent" refers to the variable being tested for missingness against the explanatory variables.
^[By default, `missing_compare()` uses an F-test test for continuous variables and chi-squared for categorical variables; you can change these the same way you change tests in `summary_factorlist()`.
Check the Help tab or online documentation for a reminder.]
As expected, a relationship is seen between sex and smoking (MAR) but not smoking (MCAR).
### For those who like an omnibus test
If you work predominately with continuous rather than categorical data, you may find these tests from the `MissMech` package useful.
It provides two tests which can be used to determine whether data are MCAR; the package and its output are well documented.
```{r}
library(MissMech)
explanatory <- c("age", "nodes")
dependent <- "mort_5yr"
colon_s %>%
select(all_of(explanatory)) %>%
MissMech::TestMCARNormality()
```
## Handling missing data: MCAR
\index{missing data@\textbf{missing data}!handling}
Prior to a standard regression analysis, we can either:
* Delete the variable with the missing data
* Delete the cases with the missing data
* Impute (fill in) the missing data
* Model the missing data
Using the examples, we identify that smoking (MCAR) is missing completely at random.
We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data, for say, people who died should be different to the values of the missing data for those who survived.
The pattern of missingness is therefore not felt to be MNAR.
### Common solution: row-wise deletion
Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.
We therefore elect to omit the patients in whom smoking is missing.
This is known as list-wise deletion and will be performed by default and usually silently by any standard regression function.
```{r message=FALSE, warning=FALSE}
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor",
"smoking_mcar")
dependent <- "mort_5yr"
fit = colon_s %>%
finalfit(dependent, explanatory)
```
```{r echo=FALSE}
fit %>%
mykable(caption = "Regression analysis with missing data: List-wise deletion.")
```
### Other considerations
* Sensitivity analysis
* Omit the variable
* Imputation
* Model the missing data
If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis.
A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model.
Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.
If smoking is not associated with the explanatory variable of interest or the outcome, it may be considered not to be a confounder and so could be omitted.
That deals with the missing data issue, but of course may not always be appropriate.
Imputation and modelling are considered below.
## Handling missing data: MAR
But life is rarely that simple.
Considering that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship).
But, say, that the missing values are not different from the observed values.
Missingness is then MAR.
If we simply drop all the patients for whom smoking is missing (list-wise deletion), then we drop relatively more females than men.
This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.
### Common solution: Multivariate Imputation by Chained Equations (mice)
\index{missing data@\textbf{missing data}!imputation}
\index{missing data@\textbf{missing data}!multiple imputation}
\index{imputation}
\index{multiple imputation}
**mice** is our go to package for multiple imputation.
That’s the process of filling in missing data using a best-estimate from all the other data that exists.
When first encountered, this may not sound like a good idea.
However, taking our simple example, if missingness in smoking is predicted strongly by sex (and other observed variables), and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.
Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable, although these can be used to impute other variables.
In both cases, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.
The process of multiple imputation involves:
* **Impute** missing data *m* times, which results in *m* complete datasets
* **Diagnose** the quality of the imputed values
* **Analyse** each completed dataset
* **Pool** the results of the repeated analyses
We will present a `mice()` example here.
The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process.
Read the documentation carefully prior to doing this yourself.
Note also `missing_predictorMatrix()` from **finalfit**.
This provides a straightforward way to include or exclude variables to be imputed or to be used for imputation.
**Impute**
```{r message=FALSE, warning=FALSE}
# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
dependent <- "mort_5yr"
```
Choose which variable to input missing values for and which variables to use for the imputation process.
```{r message=FALSE, warning=FALSE}
colon_s %>%
select(dependent, explanatory) %>%
missing_predictorMatrix(
drop_from_imputed = c("obstruct.factor", "mort_5yr")
) -> predM
```
Make 10 imputed datasets and run our logistic regression analysis on each set.
```{r}
fits <- colon_s %>%
select(dependent, explanatory) %>%
# Usually run imputation with 10 imputed sets, 4 here for demonstration
mice(m = 4, predictorMatrix = predM) %>%
# Run logistic regression on each imputed set
with(glm(formula(ff_formula(dependent, explanatory)),
family="binomial"))
```
**Extract metrics from each model**
```{r message=FALSE, warning=FALSE}
# Examples of extracting metrics from fits and taking the mean
## AICs
fits %>%
getfit() %>%
purrr::map(AIC) %>%
unlist() %>%
mean()
# C-statistic
fits %>%
getfit() %>%
purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc) %>%
unlist() %>%
mean()
```
**Pool models together**
```{r fig.height=4, fig.width=10, message=FALSE, warning=FALSE}
# Pool results
fits_pool <- fits %>%
pool()
## Can be passed to or_plot
colon_s %>%
or_plot(dependent, explanatory, glmfit = fits_pool, table_text_size=4)
# Summarise and put in table
fit_imputed <- fits_pool %>%
fit2df(estimate_name = "OR (multiple imputation)", exp = TRUE)
# Use finalfit merge methods to create and compare results
explanatory <- c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
table_uni_multi <- colon_s %>%
finalfit(dependent, explanatory, keep_fit_id = TRUE)
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor")
fit_multi_no_smoking <- colon_s %>%
glmmulti(dependent, explanatory) %>%
fit2df(estimate_suffix = " (multivariable without smoking)")
# Combine to final table
table_imputed <-
table_uni_multi %>%
ff_merge(fit_multi_no_smoking) %>%
ff_merge(fit_imputed, last_merge = TRUE)
```
```{r echo=FALSE}
table_imputed %>%
mykable(caption = "Regression analysis with missing data: Multiple imputation using mice().")
```
By examining the coefficients, the effect of the imputation compared with the complete case analysis can be seen.
**Other considerations**
* Omit the variable
* Model the missing data
As above, if the variable does not appear to be important, it may be omitted from the analysis.
A sensitivity analysis in this context is another form of imputation.
But rather than using all other available information to best-guess the missing data, we simply assign the value as above.
Imputation is therefore likely to be more appropriate.
There is an alternative method to model the missing data for the categorical in this setting – just consider the missing data as a factor level.
This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model.
```{r message=FALSE, warning=FALSE}
library(dplyr)
explanatory = c("age", "sex.factor",
"nodes", "obstruct.factor", "smoking_mar")
fit_explicit_na = colon_s %>%
mutate(
smoking_mar = forcats::fct_explicit_na(smoking_mar)
) %>%
finalfit(dependent, explanatory)
```
```{r echo=FALSE}
fit_explicit_na %>%
mykable(caption = "Regression analysis with missing data: Explicitly modelling missing data.")
```
## Handling missing data: MNAR
Missing not at random data is tough in healthcare.
To determine if data are MNAR for definite, we need to know their value in a subset of observations (patients).
Imagine that smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing bowel cancer.
Obstructing bowel cancers may be larger or their position may make the prognosis worse.
Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis.
The missing values for smoking may therefore not be random.
Smoking may be more common in the emergency patients and may be more common in those that die.
There is no easy way to handle this.
If at all possible, try to get the missing data.
Otherwise, be careful when drawing conclusions from analyses where data are thought to be missing not at random.
## Summary
The more data analysis you do, the more you realise just how important missing data is.
It is imperative that you understand where missing values exist in your own data.
By following the simple steps in this chapter, you will be able to determine whether the cases (commonly patients) with missing values are a different population to those with complete data.
This is the basis for understanding the impact of missing data on your analyses.
Whether you remove cases, remove variables, impute data, or model missing values, always check how each approach alters the conclusions of your analysis.
Be transparent when you report your results and include the alternative approaches in appendices of published work.