11_missing_data.Rmd

# (PART) Workflow {-}

Throughout this book we have tried to provide the most efficient approaches to data analysis using R. 
In this section, we will provide workflows, or ways-of-working, which maximise efficiency, incorporate reporting of results within analyses, make exporting of tables and plots easy, and keep data safe, secured and backed up. 

We also include a section on dealing with missing data in R. Something that we both feel strongly about and which is often poorly described and dealt with in academic publishing. 

# The problem of missing data{#chap11-h1}
\index{missing data@\textbf{missing data}}

>In heaven, all the interesting people are missing.  
>Friedrich Nietzsche
<!-- >People say nothing is impossible, but I do nothing every day.   -->
<!-- >A.A.Milne, Winnie-the-Pooh -->

```{r echo=FALSE, message=FALSE}
library(knitr)
library(kableExtra)
mykable = function(x, caption = "CAPTION", ...){
  kable(x, row.names = FALSE, align = c("l", "l", "r", "r", "r", "r", "r", "r", "r"), 
        booktabs = TRUE, caption = caption, 
        linesep = c("", "", "\\addlinespace"), ...) %>%
    kable_styling(latex_options = c("scale_down", "hold_position"))
}
```

## Identification of missing data

As journal editors, we often receive studies in which the investigators fail to describe, analyse, or even acknowledge missing data. 
This is frustrating, as it is often of the utmost importance. 
Conclusions may (and do) change when missing data are accounted for. 
Some folk seem to not even appreciate that in a conventional regression, only rows with complete data are included.
By reading this, you will not be one of them!

These are the five steps to ensuring missing data are correctly identified and appropriately dealt with:

1. Ensure your data are coded correctly.
2. Identify missing values within each variable.
3. Look for patterns of missingness.
4. Check for associations between missing and observed data.
5. Decide how to handle missing data.

We will work through a number of functions that will help with each of these. 
But first, here are some terms that are easy to mix up. 
These are important as they describe the mechanism of missingness and this determines how you can handle the missing data.

For each of the following examples we will imagine that we are collecting data on the relationship between gender, smoking and the outcome of cancer treatment. 
The ground truth in this imagined scenario is that both gender and smoking influence the outcome from cancer treatment.  

### Missing completely at random (MCAR)
\index{missing data@\textbf{missing data}!missing completely at random}
As it says, values are randomly missing from your dataset. 
Missing data values do not relate to any other data in the dataset and there is no pattern to the actual values of the missing data themselves.

In our example, smoking status is missing from a random subset of male and female patients.

This may have the effect of making our population smaller, but the complete case population has the same characteristics as the missing data population. 
This is easy to handle, but unfortunately, data are almost never missing completely at random.

### Missing at random (MAR)
\index{missing data@\textbf{missing data}!missing at random}
This is confusing and would be better named *missing conditionally at random*. 
Here, missingness in a particular variable has an association with one or more other variables in the dataset. 
However, the *actual values of the missing data are random*.

In our example, smoking status is missing for some female patients but not for male patients. 

But data is missing from the same number of female smokers as female non-smokers. 
So the complete case female patients has the same characteristics as the missing data female patients.  

### Missing not at random (MNAR)
\index{missing data@\textbf{missing data}!missing not at random}
The pattern of missingness is related to other variables in the dataset, but in addition, the *actual values of the missing data are not random*.

In our example, smoking status is missing in female patients who are more likely to smoke, but not for male patients. 

Thus, the complete case female patients have different characteristics to the missing data female patients. 
For instance, the missing data female patients may be more likely to die after cancer treatment. 
Looking at our available population, we therefore under estimate the likelihood of a female dying from cancer treatment. 

Missing not at random data are important, can alter your conclusions, and are the most difficult to diagnose and handle. 
They can only be detected by collecting and examining some of the missing data. 
This is often difficult or impossible to do.

How you deal with missing data is dependent on the type of missingness. 
Once you know the type, you can start addressing it. 
More on this below.

## Ensure your data are coded correctly: `ff_glimpse()`
\index{variable types@\textbf{variable types}}

While it sounds obvious, this step is often ignored in the rush to get results. 
The first step in any analysis is robust data cleaning and coding. 
Lots of packages have a glimpse-type function and our own **finalfit** is no different. 
This function has three specific goals:

1. Ensure all variables are of the type you expect them to be. That is the commonest reason to get an error with a **finalfit** function. Numbers should be numeric, categorical variables should be characters or factors, and dates should be dates (for a reminder on these, see Section \@ref(chap02-vartypes).
2. Ensure you know which variables have missing data. This presumes missing values are correctly assigned `NA`. 
3. Ensure factor levels and variable labels are assigned correctly.

### The Question
Using the `colon_s` colon cancer dataset, we are interested in exploring the association between a cancer obstructing the bowel and 5-year survival, accounting for other patient and disease characteristics.

For demonstration purposes, we will make up MCAR and MAR smoking variables (`smoking_mcar` and `smoking_mar`).
Do not worry about understanding the long cascading mutate and `sample()` functions below, this is merely for creating the example variables.
You would not be 'creating' your data, we hope.

```{r message=FALSE, warning=FALSE}
# Create some extra missing data
library(finalfit)
library(dplyr)
set.seed(1)
colon_s <- colon_s %>% 
  mutate(
    ## Smoking missing completely at random
    smoking_mcar = sample(c("Smoker", "Non-smoker", NA), 
                          n(), replace=TRUE, 
                          prob = c(0.2, 0.7, 0.1)) %>% 
      factor() %>% 
      ff_label("Smoking (MCAR)"),
    
    ## Smoking missing conditional on patient sex
    smoking_mar = ifelse(sex.factor == "Female",
                         sample(c("Smoker", "Non-smoker", NA), 
                                sum(sex.factor == "Female"), 
                                replace = TRUE,
                                prob = c(0.1, 0.5, 0.4)),
                         
                         sample(c("Smoker", "Non-smoker", NA), 
                                sum(sex.factor == "Male"), 
                                replace=TRUE, prob = c(0.15, 0.75, 0.1))
    ) %>% 
      factor() %>% 
      ff_label("Smoking (MAR)")
  )
```

We will then examine our variables of interest using `ff_glimpse()`:

```{r}

explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor",  
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"

colon_s %>% 
  ff_glimpse(dependent, explanatory)
```

You don't need to specify the variables, and if you don't, `ff_glimpse()` will summarise all variables:

```{r results="hide"}
colon_s %>%
  ff_glimpse()
```

Use this to check that the variables are all assigned and behaving as expected. 
The proportion of missing data can be seen, e.g., `smoking_mar` has `r round(sum(100*is.na(colon_s$smoking_mar))/dim(colon_s)[1], 0)`% missing data.

## Identify missing values in each variable: `missing_plot()`
\index{missing data@\textbf{missing data}!missingness plot}

Visualising data is essential to help understand it, and missing data is no exception. 
`missing_plot()` function also from **finalfit** is useful for grasping the amount of missing data in each variable.
Row number is on the x-axis and all included variables are on the y-axis. 

```{r}
colon_s %>%
  missing_plot(dependent, explanatory)
```

Further visualisations of missingness can be done using the [naniar](http://naniar.njtierney.com) package. 

## Look for patterns of missingness: `missing_pattern()`
\index{missing data@\textbf{missing data}!missingness pattern}

Using **finalfit**, `missing_pattern()` wraps a function from the **mice** package, `md.pattern()`. 
This produces a table and a plot showing the pattern of missingness between variables.

```{r fig.height=4, fig.width=10}
explanatory <- c("age", "sex.factor", 
                 "obstruct.factor",  
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"

colon_s %>% 
  missing_pattern(dependent, explanatory)
```

This allows us to look for patterns of missingness between variables. 
There are 11 patterns in these data. 
The number and pattern of missingness help us to determine the likelihood of it being random rather than systematic.

## Including missing data in demographics tables
\index{missing data@\textbf{missing data}!demographics table}

"Table 1" in a healthcare study is often a demographics table of an “explanatory variable of interest” against other explanatory variables/confounders. 
Do not silently drop missing values in this table. 
It is easy to do this correctly with `summary_factorlist()`. 
This function provides a useful summary of a dependent variable against explanatory variables. 
Despite its name, continuous variables are handled nicely.

`na_include=TRUE` ensures missing data from the explanatory variables (but not dependent) are included. 
To include missing values from the dependent, add `na_include_dependent = TRUE`. 
Including a total column (`total_col = TRUE`) is also useful, as well as column totals (`add_col_totals = TRUE`). 

If you are using a lot of continuous explanatory variables with missing values, then these can be seen easily using `add_row_totals = TRUE`. 

Note that missing data is not included when *p*-values are generated. 
If you wish missing data to be passed to statistical tests, then include `na_to_p = TRUE`. 

```{r}
# Explanatory or confounding variables
explanatory <- c("age", "sex.factor", 
                 "nodes",  
                 "smoking_mcar", "smoking_mar")

# Explanatory variable of interest
dependent <- "obstruct.factor" # Bowel obstruction

table1 <- colon_s %>% 
  summary_factorlist(dependent, explanatory, 
                     na_include=TRUE, na_include_dependent = TRUE, 
                     total_col = TRUE, add_col_totals = TRUE, p=TRUE)
```

```{r echo=FALSE}
table1 %>% 
  mykable(caption = "Simulated missing completely at random (MCAR) and missing at random (MAR) dataset.")
```

## Check for associations between missing and observed data
\index{missing data@\textbf{missing data}!associations}

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables. 
This is particularly important (we would say absolutely required) for a primary outcome measure / dependent variable.

Take for example “death”. 
When that outcome is missing it is often for a particular reason. 
For example, perhaps patients undergoing emergency surgery were less likely to have complete records compared with those undergoing planned surgery. 
And of course, death is more likely after emergency surgery.

`missing_pairs()` uses functions from the **GGally** package. 
It produces pairs plots to show relationships between missing values and observed values in all variables.

```{r fig.height=13.5, fig.width=13.5, message=FALSE, warning=FALSE, fig.cap="Missing data matrix with `missing_pairs()`."}
explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor",  
                 "smoking_mcar", "smoking_mar")
dependent <- "mort_5yr"
colon_s %>% 
  missing_pairs(dependent, explanatory)
```

For continuous variables (age and nodes), the distributions of observed and missing data can immediately be visually compared.
For example, look at Row 1 Column 2.
The age of patients who's mortality data is known is the blue box plot, and the age of patients with missing mortality data is the grey box plot.

For categorical data, the comparisons are presented as counts (remember `geom_bar()` from Chapter \@ref(chap04-h1)).
To be able to compare proportions, we can add the `position = "fill"` argument:

```{r fig.height=13.5, fig.width=13.5, message=FALSE, warning=FALSE, fig.cap="Missing data matrix with `missing_pairs(position = 'fill')` ."}
colon_s %>% 
  missing_pairs(dependent, explanatory, position = "fill")
```

Find the two sets of bar plots that show the proportion of missing smoking data for sex (bottom of Column 3).
Missingness in Smoking (MCAR) does not relate to sex - females and males have the same proportion of missing data.
Missingness in Smoking (MAR), however, does differ by sex as females have more missing data than men here. 
This is how we designed the example at the top of this chapter, so it all makes sense.

We can also confirm this by using `missing_compare()`:

```{r message=FALSE, warning=FALSE}
explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor")
dependent <- "smoking_mcar"

missing_mcar <- colon_s %>% 
  missing_compare(dependent, explanatory)
```

```{r echo=FALSE}
missing_mcar %>% 
  mykable(caption = "Missing data comparison: Smoking (MCAR).")
```

```{r message=FALSE, warning=FALSE}
dependent <- "smoking_mar"

missing_mar <- colon_s %>% 
  missing_compare(dependent, explanatory)
```

```{r echo=FALSE}
missing_mar %>% 
  mykable(caption = "Missing data comparison: Smoking (MAR).")
```

It takes dependent and explanatory variables, and in this context "dependent" refers to the variable being tested for missingness against the explanatory variables.
^[By default, `missing_compare()` uses an F-test test for continuous variables and chi-squared for categorical variables; you can change these the same way you change tests in `summary_factorlist()`.
Check the Help tab or  online documentation for a reminder.]
As expected, a relationship is seen between sex and smoking (MAR) but not smoking (MCAR).

### For those who like an omnibus test
If you work predominately with continuous rather than categorical data, you may find these tests from the `MissMech` package useful. 
It provides two tests which can be used to determine whether data are MCAR; the package and its output are well documented.

```{r}
library(MissMech)
explanatory <- c("age", "nodes")
dependent <- "mort_5yr" 

colon_s %>% 
  select(all_of(explanatory)) %>% 
  MissMech::TestMCARNormality()
```

## Handling missing data: MCAR
\index{missing data@\textbf{missing data}!handling}

Prior to a standard regression analysis, we can either:

* Delete the variable with the missing data
* Delete the cases with the missing data
* Impute (fill in) the missing data
* Model the missing data

Using the examples, we identify that smoking (MCAR) is missing completely at random.

We know nothing about the missing values themselves, but we know of no plausible reason that the values of the missing data, for say, people who died should be different to the values of the missing data for those who survived. 
The pattern of missingness is therefore not felt to be MNAR.

### Common solution: row-wise deletion
Depending on the number of data points that are missing, we may have sufficient power with complete cases to examine the relationships of interest.

We therefore elect to omit the patients in whom smoking is missing.
This is known as list-wise deletion and will be performed by default and usually silently by any standard regression function.

```{r message=FALSE, warning=FALSE}
explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor",  
                 "smoking_mcar")
dependent <- "mort_5yr"
fit = colon_s %>% 
  finalfit(dependent, explanatory)
```

```{r echo=FALSE}
fit %>% 
  mykable(caption = "Regression analysis with missing data: List-wise deletion.")
```

### Other considerations

* Sensitivity analysis
* Omit the variable
* Imputation
* Model the missing data

If the variable in question is thought to be particularly important, you may wish to perform a sensitivity analysis.
A sensitivity analysis in this context aims to capture the effect of uncertainty on the conclusions drawn from the model.
Thus, you may choose to re-label all missing smoking values as “smoker”, and see if that changes the conclusions of your analysis. The same procedure can be performed labelling with “non-smoker”.

If smoking is not associated with the explanatory variable of interest or the outcome, it may be considered not to be a confounder and so could be omitted.
That deals with the missing data issue, but of course may not always be appropriate.

Imputation and modelling are considered below.

## Handling missing data: MAR

But life is rarely that simple.

Considering that the smoking variable is more likely to be missing if the patient is female (missing_compare shows a relationship).
But, say, that the missing values are not different from the observed values.
Missingness is then MAR.

If we simply drop all the patients for whom smoking is missing (list-wise deletion), then we drop relatively more females than men. 
This may have consequences for our conclusions if sex is associated with our explanatory variable of interest or outcome.

### Common solution: Multivariate Imputation by Chained Equations (mice)
\index{missing data@\textbf{missing data}!imputation}
\index{missing data@\textbf{missing data}!multiple imputation}
\index{imputation}
\index{multiple imputation}

**mice** is our go to package for multiple imputation. 
That’s the process of filling in missing data using a best-estimate from all the other data that exists. 
When first encountered, this may not sound like a good idea.

However, taking our simple example, if missingness in smoking is predicted strongly by sex (and other observed variables), and the values of the missing data are random, then we can impute (best-guess) the missing smoking values using sex and other variables in the dataset.

Imputation is not usually appropriate for the explanatory variable of interest or the outcome variable, although these can be used to impute other variables. 
In both cases, the hypothesis is that there is a meaningful association with other variables in the dataset, therefore it doesn’t make sense to use these variables to impute them.

The process of multiple imputation involves: 

* **Impute** missing data *m* times, which results in *m* complete datasets
* **Diagnose** the quality of the imputed values
* **Analyse** each completed dataset
* **Pool** the results of the repeated analyses

We will present a `mice()` example here. 
The package is well documented, and there are a number of checks and considerations that should be made to inform the imputation process.
Read the documentation carefully prior to doing this yourself.

Note also `missing_predictorMatrix()` from **finalfit**.
This provides a straightforward way to include or exclude variables to be imputed or to be used for imputation.

**Impute**

```{r message=FALSE, warning=FALSE}
# Multivariate Imputation by Chained Equations (mice)
library(finalfit)
library(dplyr)
library(mice)
explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor", "smoking_mar")
dependent <- "mort_5yr"
```

Choose which variable to input missing values for and which variables to use for the imputation process. 

```{r message=FALSE, warning=FALSE}
colon_s %>% 
  select(dependent, explanatory) %>% 
  missing_predictorMatrix(
    drop_from_imputed = c("obstruct.factor", "mort_5yr")
  ) -> predM
```

Make 10 imputed datasets and run our logistic regression analysis on each set. 

```{r}
fits <- colon_s %>% 
  select(dependent, explanatory) %>% 
  
  # Usually run imputation with 10 imputed sets, 4 here for demonstration
  mice(m = 4, predictorMatrix = predM) %>% 
  
  # Run logistic regression on each imputed set
  with(glm(formula(ff_formula(dependent, explanatory)), 
           family="binomial"))
```

**Extract metrics from each model** 

```{r message=FALSE, warning=FALSE}
# Examples of extracting metrics from fits and taking the mean
## AICs
fits %>% 
  getfit() %>% 
  purrr::map(AIC) %>%
  unlist() %>% 
  mean()

# C-statistic
fits %>% 
  getfit() %>% 
  purrr::map(~ pROC::roc(.x$y, .x$fitted)$auc) %>% 
  unlist() %>% 
  mean()
```

**Pool models together**

```{r fig.height=4, fig.width=10, message=FALSE, warning=FALSE}
# Pool  results
fits_pool <- fits %>% 
  pool()

## Can be passed to or_plot
colon_s %>% 
  or_plot(dependent, explanatory, glmfit = fits_pool, table_text_size=4)

# Summarise and put in table
fit_imputed <- fits_pool %>%                                  
  fit2df(estimate_name = "OR (multiple imputation)", exp = TRUE)

# Use finalfit merge methods to create and compare results
explanatory <- c("age", "sex.factor", 
                 "nodes", "obstruct.factor", "smoking_mar")

table_uni_multi <- colon_s %>% 
  finalfit(dependent, explanatory, keep_fit_id = TRUE) 

explanatory = c("age", "sex.factor", 
                "nodes", "obstruct.factor")

fit_multi_no_smoking <- colon_s %>% 
  glmmulti(dependent, explanatory) %>% 
  fit2df(estimate_suffix = " (multivariable without smoking)") 

# Combine to final table
table_imputed <- 
  table_uni_multi %>% 
  ff_merge(fit_multi_no_smoking) %>% 
  ff_merge(fit_imputed, last_merge = TRUE)
```

```{r echo=FALSE}
table_imputed  %>% 
  mykable(caption = "Regression analysis with missing data: Multiple imputation using mice().")
```

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be seen.

**Other considerations**

* Omit the variable
* Model the missing data

As above, if the variable does not appear to be important, it may be omitted from the analysis.
A sensitivity analysis in this context is another form of imputation.
But rather than using all other available information to best-guess the missing data, we simply assign the value as above.
Imputation is therefore likely to be more appropriate.

There is an alternative method to model the missing data for the categorical in this setting – just consider the missing data as a factor level. 
This has the advantage of simplicity, with the disadvantage of increasing the number of terms in the model.

```{r message=FALSE, warning=FALSE}
library(dplyr)
explanatory = c("age", "sex.factor", 
                "nodes", "obstruct.factor", "smoking_mar")
fit_explicit_na = colon_s %>% 
  mutate(
    smoking_mar = forcats::fct_explicit_na(smoking_mar)
  ) %>% 
  finalfit(dependent, explanatory)
```

```{r echo=FALSE}
fit_explicit_na %>% 
  mykable(caption = "Regression analysis with missing data: Explicitly modelling missing data.")
```

## Handling missing data: MNAR

Missing not at random data is tough in healthcare.
To determine if data are MNAR for definite, we need to know their value in a subset of observations (patients).

Imagine that smoking status is poorly recorded in patients admitted to hospital as an emergency with an obstructing bowel cancer.
Obstructing bowel cancers may be larger or their position may make the prognosis worse.
Smoking may relate to the aggressiveness of the cancer and may be an independent predictor of prognosis. 
The missing values for smoking may therefore not be random. 
Smoking may be more common in the emergency patients and may be more common in those that die.

There is no easy way to handle this.
If at all possible, try to get the missing data. 
Otherwise, be careful when drawing conclusions from analyses where data are thought to be missing not at random.

## Summary

The more data analysis you do, the more you realise just how important missing data is. 
It is imperative that you understand where missing values exist in your own data. 
By following the simple steps in this chapter, you will be able to determine whether the cases (commonly patients) with missing values are a different population to those with complete data. 
This is the basis for understanding the impact of missing data on your analyses. 

Whether you remove cases, remove variables, impute data, or model missing values, always check how each approach alters the conclusions of your analysis.
Be transparent when you report your results and include the alternative approaches in appendices of published work.