
Commit

After 2nd copy edit
ewenharrison committed Sep 7, 2020
1 parent 21a0f3f commit 90fc837
Showing 15 changed files with 197 additions and 170 deletions.
4 changes: 2 additions & 2 deletions 01_introduction.Rmd
Original file line number Diff line number Diff line change
@@ -184,7 +184,7 @@ For finding help on things you have not used before, it is best to Google it.

R has about 2 million users so someone somewhere has probably had the same question or problem.

- RStudio also has a Help drop-down menu at the very top (same row where you find "File", "Edit, ...).
+ RStudio also has a Help drop-down menu at the very top (same row where you find "File", "Edit", ...).
The most notable things in the Help drop-down menu are the Cheatsheets.
These tightly packed two-pagers include many of the most useful functions from `tidyverse` packages.
They are not particularly easy to learn from, but invaluable as an *aide-mémoire*.
@@ -226,7 +226,7 @@ After installing RStudio, you should go change two small but important things in
1. **Uncheck** "Restore .RData into Workspace on startup"
2. Set "Save .RData on exit" to **Never**

- ```{r chap01-fig-settings, echo = FALSE, fig.cap = "Configuring your RStudio Tools -> Global Options: Untick \"Restore .RData into Workspace on Exit\" and Set \"Save .RData on exit\" to Never.", out.width="100%"}
+ ```{r chap01-fig-settings, echo = FALSE, fig.cap = "Configuring your RStudio Tools -> Global Options: Untick \`\`Restore .RData into Workspace on Exit\" and Set \'\'Save .RData on exit\" to Never.", out.width="100%"}
knitr::include_graphics("images/chapter01/rstudio_settings.png")
```
23 changes: 16 additions & 7 deletions 02_basics.Rmd
@@ -364,7 +364,7 @@ A lot more about factor handling will be covered later (\@ref(chap08-h1)).
R is good for working with dates.
For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what future date is (e.g., "what's the date exactly 60 days from now?").
It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.).
- Since R displays dates and times between quotes (""), they look similar to characters.
+ Since R displays dates and times between quotes (`` ''), they look similar to characters.
However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.
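For example (the dates below are made up), the `lubridate` package provides a parser for each common ordering of date components:

```r
library(lubridate)

# Day-month-year, month as a number:
dmy("15/03/2021")
#> "2021-03-15"

# Month spelled out, four-digit year:
mdy("March 15, 2021")

# Date-time with an explicit time zone:
ymd_hm("2021-03-15 13:30", tz = "Europe/London")
```

Whichever function matches the order of your raw data will parse it, regardless of the separators used.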

```{r, message = FALSE}
@@ -379,10 +379,15 @@ my_datetime
When printed, the two objects - `current_datetime` and `my_datetime` - seem to have a similar format.
But if we try to calculate the difference between these two dates, we get an error:

- ```{r, error = TRUE}
+ ```{r eval=FALSE, include=TRUE}
my_datetime - current_datetime
```

+ ```{r echo=FALSE}
+ print("Error in `-.POSIXt`(my_datetime, current_datetime)")
+ ```


That's because when we assigned a value to `my_datetime`, R assumed the simpler type for it - so a character.
We can check what the type of an object or variable is using the `class()` function:
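For example (the values here are hypothetical):

```r
current_datetime <- Sys.time()
my_datetime      <- "2020-12-01 12:00"

class(current_datetime)  # "POSIXct" "POSIXt" - a real date-time
class(my_datetime)       # "character" - just text that looks like a date
```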

@@ -419,10 +424,14 @@ ymd_hm("2021-01-02 12:00") + my_datesdiff
But what if we want to use the number of days in a normal calculation, e.g., if a measurement increased by 560 arbitrary units during this time period?
We might want to calculate the increase per day like this:

- ```{r, error = TRUE}
+ ```{r eval=FALSE, include=TRUE}
560/my_datesdiff
```

+ ```{r echo=FALSE}
+ print("Error in `/.difftime`(560, my_datesdiff)")
+ ```

Doesn't work, does it.
We need to convert `my_datesdiff` (which is a difftime value) into a numeric value by using the `as.numeric()` function:
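As a sketch with made-up dates:

```r
library(lubridate)

my_datesdiff <- ymd("2021-03-01") - ymd("2021-01-01")  # difftime of 59 days
560 / as.numeric(my_datesdiff)                         # increase per day
```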

@@ -492,7 +501,7 @@ mydata <- tibble(
)
mydata %>%
- knitr::kable(booktabs = TRUE, caption = "Example of data in columns and rows, including missing values denoted `NA` (Not applicable/Not available). Once this dataset has been read into R it gets called dataframe/tibble.") %>%
+ knitr::kable(booktabs = TRUE, caption = "Example of data in columns and rows, including missing values denoted NA (Not applicable/Not available). Once this dataset has been read into R it gets called dataframe/tibble.") %>%
kableExtra::kable_styling(font_size=9)
```

@@ -726,7 +735,7 @@ If you use the assignment arrow, an object holding the results will get saved in
\index{pipe@\textbf{pipe}}

The pipe - denoted `%>%` - is probably the oddest looking thing you'll see in this book.
- But please bear with; it is not as scary as it looks!
+ But please bear with us; it is not as scary as it looks!
Furthermore, it is super useful.
We use the pipe to send objects into functions.
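For example, these two lines give the same answer; the pipe just lets us read the steps from left to right:

```r
library(dplyr)

values <- c(1, 2, 3.5)

round(mean(values), digits = 1)          # nested: read from the inside out
values %>% mean() %>% round(digits = 1)  # piped: read from left to right
#> 2.2
```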

@@ -836,7 +845,7 @@ gbd_short %>%
filter(year = 1995)
```

- > The answer to 'do you need ==?" is almost always, "Yes R, I do, thank you".
+ > The answer to "do you need ==?" is almost always, "Yes R, I do, thank you".
But that's just because `filter()` is a clever cookie and is used to this common mistake.
There are other useful functions we use these operators in, but they don't always know to tell us that we've just confused `=` for `==`.
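A minimal sketch of the difference (the tibble here is made up, not the real `gbd_short`):

```r
library(dplyr)

gbd_mini <- tibble(year   = c(1990, 1995),
                   deaths = c(10.5, 12.3))

gbd_mini %>% filter(year == 1995)   # == asks: is year equal to 1995?
# gbd_mini %>% filter(year = 1995)  # = means assignment, so this errors
```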
@@ -1117,7 +1126,7 @@ typesdata %>%
mutate(mean_measurement = mean(measurement))
```

- Which in return can be useful for calculating a standardized measurement (i.e. relative to the mean):
+ Which in return can be useful for calculating a standardized measurement (i.e., relative to the mean):
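For example (the measurements are hypothetical):

```r
library(dplyr)

tibble(measurement = c(2.3, 4.1, 3.6)) %>%
  mutate(mean_measurement = mean(measurement),
         standardised     = measurement / mean_measurement)
```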

```{r}
typesdata %>%
12 changes: 8 additions & 4 deletions 03_summarising.Rmd
@@ -453,7 +453,9 @@ For example, here we want to collect all the columns that include the words Fema

```{r}
gbd_wide %>%
- pivot_longer(matches("Female|Male"), names_to = "sex_year", values_to = "deaths_millions") %>%
+ pivot_longer(matches("Female|Male"),
+ names_to = "sex_year",
+ values_to = "deaths_millions") %>%
slice(1:6)
```
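A minimal sketch of how `matches()` selects those columns (a made-up two-column version of `gbd_wide`):

```r
library(tidyr)

wide <- tibble::tibble(cause       = "All causes",
                       Female_1990 = 10.2,
                       Male_1990   = 12.1)

wide %>%
  pivot_longer(matches("Female|Male"),
               names_to  = "sex_year",
               values_to = "deaths_millions")
```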

@@ -477,7 +479,9 @@ We can use the `separate()` function to deal with that.
```{r}
gbd_wide %>%
# same pivot_longer as before
- pivot_longer(matches("Female|Male"), names_to = "sex_year", values_to = "deaths_millions") %>%
+ pivot_longer(matches("Female|Male"),
+ names_to = "sex_year",
+ values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
```
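A self-contained sketch of `separate()` on hypothetical values:

```r
library(tidyr)

tibble::tibble(sex_year = c("Female_1990", "Male_1990")) %>%
  separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
# convert = TRUE makes the new year column an integer rather than a character
```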

@@ -506,7 +510,7 @@ gbd_long %>%
slice(1:3)
```

- The `-` doesn't work for categorical variables, they need to be put in `desc()` for arranging in descending order:
+ The `-` doesn't work for categorical variables; they need to be put in `desc()` for arranging in descending order:
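For example (values made up):

```r
library(dplyr)

causes <- tibble(cause  = c("Injuries", "Communicable diseases"),
                 deaths = c(10.2, 12.3))

causes %>% arrange(-deaths)      # numeric: the minus works
causes %>% arrange(desc(cause))  # categorical: needs desc()
# causes %>% arrange(-cause)     # this would give an error
```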

```{r}
gbd_long %>%
@@ -669,7 +673,7 @@ full_join(summary_data1, summary_data2) %>%

Instead of creating the two summarised tibbles and using a `full_join()`, achieve the same result as in the previous exercise with a single pipeline using `summarise()` and then `mutate()`.

- Hint: you have to do it the other way round, so `group_by(year, cause) %>% summarise(...)` first, then `group_by(year) %>% mutate()`.
+ Hint: you have to do it the other way around, so `group_by(year, cause) %>% summarise(...)` first, then `group_by(year) %>% mutate()`.

Bonus: `select()` columns `year`, `cause`, `percentage`, then `pivot_wider()` the `cause` variable using `percentage` as values.
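A sketch of that pipeline shape with made-up numbers:

```r
library(dplyr)

deaths_data <- tibble(year  = c(1990, 1990, 2017, 2017),
                      cause = c("A", "B", "A", "B"),
                      deaths_millions = c(10, 30, 20, 20))

deaths_data %>%
  group_by(year, cause) %>%
  summarise(deaths = sum(deaths_millions)) %>%
  group_by(year) %>%
  mutate(percentage = 100 * deaths / sum(deaths))
```

Within each year the percentages then sum to 100.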

2 changes: 1 addition & 1 deletion 05_fine_tuning_plots.Rmd
@@ -369,7 +369,7 @@ ggsave(p0, file = "my_saved_plot_larger.pdf", width = 10, height = 8)
```


- ```{r chap05-fig-ggsave, echo = FALSE, out.width="100%", fig.cap = "Experimenting with the width and height options within `ggsave()` can be used to quickly change how big or small some of the text on your plot looks."}
+ ```{r chap05-fig-ggsave, echo = FALSE, out.width="100%", fig.cap = "Experimenting with the width and height options within ggsave() can be used to quickly change how big or small some of the text on your plot looks."}
# these get put together into a single figure by hand - then to images/chapter05/
ggsave(p0 + labs(title = "ggsave(..., width = 5, height = 4)"), file = "my_saved_plot.pdf", width = 5, height = 4)
ggsave(p0 + labs(title = "ggsave(..., width = 10, height = 8)") + theme(title = element_text(size = 24)), file = "my_saved_plot_larger.pdf", width = 10, height = 8)
8 changes: 5 additions & 3 deletions 06_working_continuous.Rmd
@@ -88,8 +88,10 @@ sum_gapdata[[1]] %>%


```{r message=FALSE, echo=FALSE}
- sum_gapdata[[2]] %>%
- select(-c(5, 9)) %>%
+ t = sum_gapdata[[2]] %>%
+ select(-c(5, 9))
+ t$levels[2] = c("\`\`Africa\", \`\`Americas\", \`\`Asia\", \`\`Europe\", \`\`Oceania\"")
+ t %>%
kable(row.names = FALSE, align = c("l", "l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = "Gapminder dataset, ff\\_glimpse: categorical.",
linesep = c("", "", "\\addlinespace")) %>%
@@ -140,7 +142,7 @@ Quantile-quantile sounds more complicated than it really is.
It is a graphical method for comparing the distribution (think shape) of our own data to a theoretical distribution, such as the normal distribution.
In this context, quantiles are just cut points which divide our data into bins each containing the same number of observations.
For example, if we have the life expectancy for 100 countries, then quartiles (note the quar-) for life expectancy are the three ages which split the observations into 4 groups each containing 25 countries.
- A Q-Q plot simply plots the quantiles for our data against the theoretical quantiles for a particular distributions (the default shown below is the normal distribution).
+ A Q-Q plot simply plots the quantiles for our data against the theoretical quantiles for a particular distribution (the default shown below is the normal distribution).
If our data follow that distribution (e.g., normal), then our data points fall on the theoretical straight line.
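A base R sketch of the same idea using simulated data:

```r
set.seed(1)
x <- rnorm(100, mean = 70, sd = 5)  # simulated "life expectancies"

qqnorm(x)  # our quantiles against theoretical normal quantiles
qqline(x)  # points hugging this line suggest a normal distribution
```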

```{r chap06-fig-qq-life-year, fig.width=7, fig.height=3.5, fig.cap="Q-Q plot: Country life expectancy by continent and year."}
26 changes: 18 additions & 8 deletions 07_linear_regression.Rmd
@@ -119,7 +119,7 @@ If the observations are not equally distributed around the line, the histogram o

The distance of the observations from the fitted line should be the same on the left side as on the right side.
Look at the fan-shaped data on the simple regression diagnostics Shiny app.
- This fan shape can be seen on the residuals vs. fitted values plot.
+ This fan shape can be seen on the residuals vs fitted values plot.

Everything we talk about in this chapter is really about making sure that the line you draw through your data points is valid.
It is about ensuring that the regression line is appropriate across the range of the explanatory variable and dependent variable.
@@ -654,25 +654,31 @@ sum_wcgs[[1]] %>%
select(-c(5, 8, 9, 11, 12)) %>%
mykable(caption = "WCGS data, ff\\_glimpse: continuous.") %>%
column_spec(1, width = "4cm")
- sum_wcgs[[2]] %>%
- select(-c(5, 9)) %>%
+ ```
+
+ ```{r echo=FALSE, message=FALSE}
+ t = sum_wcgs[[2]] %>%
+ select(-c(5, 9))
+ t$levels = stringr::str_replace_all(t$levels, "\"", "\\`\\`")
+ t$levels = stringr::str_replace_all(t$levels, "``,", "\"")
+ t %>%
mykable(caption = "WCGS data, ff\\_glimpse: categorical.") %>%
- kable_styling(latex_options = c("scale_down", "hold_position")) %>%
+ kable_styling(latex_options = c("scale_down")) %>%
column_spec(6, width = "3cm") %>%
column_spec(7, width = "3cm")
```

### Plot the data

- ```{r chap07-fig-bp-personality_type, fig.height=3, fig.width=4.5, fig.cap="Scatter and line plot. Systolic blood pressure by weight and personality type."}
+ ```{r chap07-fig-bp-personality-type, fig.height=3, fig.width=4.5, fig.cap="Scatter and line plot. Systolic blood pressure by weight and personality type."}
wcgsdata %>%
ggplot(aes(y = sbp, x = weight,
colour = personality_2L)) + # Personality type
geom_point(alpha = 0.2) + # Add transparency
geom_smooth(method = "lm", se = FALSE)
```

- From Figure \@ref(fig:chap07-fig-bp-personality_type), we can see that there is a weak relationship between weight and blood pressure.
+ From Figure \@ref(fig:chap07-fig-bp-personality-type), we can see that there is a weak relationship between weight and blood pressure.

In addition, there is really no meaningful effect of personality type on blood pressure.
This is really important because, as you will see below, we are about to "find" some highly statistically significant effects in a model.
@@ -732,6 +738,10 @@ fit_sbp2 <- wcgsdata %>%
```{r chap07-tab-bp-personality-weight, echo=FALSE}
fit_sbp2[[1]] %>% mykable(caption = "Multivariable linear regression: Systolic blood pressure by personality type and weight.") %>%
column_spec(1, width = "4cm")
+ ```
+
+
+ ```{r chap07-tab-bp-personality-weight2, echo=FALSE}
fit_sbp2[[2]] %>% mykable(caption = "Multivariable linear regression metrics: Systolic blood pressure by personality type and weight.", col.names = "") %>%
column_spec(1, width = "18cm")
```
@@ -741,7 +751,7 @@ The output shows us the range for weight (78 to 320 pounds) and the mean (standard
The coefficient with 95% confidence interval is provided by default.
This is interpreted as: for each pound increase in weight, there is on average a corresponding increase of 0.18 mmHg in systolic blood pressure.

- Note the difference in the interpretation of continuous and categorical variables in the regression model output (Figure \@ref(tab:chap07-tab-bp-personality-weight)).
+ Note the difference in the interpretation of continuous and categorical variables in the regression model output (Table \@ref(tab:chap07-tab-bp-personality-weight)).

The adjusted R-squared is now higher - the personality and weight together explain 6.8% of the variation in blood pressure.

@@ -876,7 +886,7 @@ wcgsdata %>%
```

An important message in the results relates to the highly significant *p*-values in the table above.
- Should we conclude that in a "multivariable regression model controlling for BMI, age, and serum cholesterol, blood pressure was significantly elevated in those with a Type A personality (1.56 (0.57 to 2.56, p=0.002) compared with Type B?
+ Should we conclude that in a multivariable regression model controlling for BMI, age, and serum cholesterol, blood pressure was significantly elevated in those with a Type A personality (1.56 (0.57 to 2.56, p=0.002) compared with Type B?
The *p*-value looks impressive, but the actual difference in blood pressure is only 1.6 mmHg.
Even at a population level, that may not be clinically significant, fitting with our first thoughts when we saw the scatter plot.

10 changes: 6 additions & 4 deletions 09_logistic_regression.Rmd
@@ -30,7 +30,7 @@ It allows the principles of linear regression to be applied when outcomes are no
\index{binary data}

A regression analysis is a statistical approach to estimating the relationships between variables, often by drawing straight lines through data points.
- For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter\@ref(chap07-h1)).
+ For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter \@ref(chap07-h1)).
As blood pressure and coffee consumption can be considered on a continuous scale, this is an example of simple linear regression.

Logistic regression is an extension of this, where the variable being predicted is *categorical*.
@@ -99,8 +99,8 @@ Why?
Because in a logistic regression the slopes of fitted lines (coefficients) can be interpreted as odds ratios.
This is very useful when interpreting the association of a particular predictor with an outcome.

- For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio Figure \@ref(fig:chap09-fig-or).
- For example, if the odds of a smoker have a CV event are 1.5 and the odds of a non smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
+ For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio (Figure \@ref(fig:chap09-fig-or)).
+ For example, if the odds of a smoker having a CV event are 1.5 and the odds of a non-smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
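The arithmetic from this paragraph as a quick sketch:

```r
# Odds = probability of the event / probability of no event
odds_smoker    <- 0.6 / (1 - 0.6)  # hypothetical probabilities
odds_nonsmoker <- 0.5 / (1 - 0.5)

odds_smoker / odds_nonsmoker       # odds ratio = 1.5
```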

```{r chap09-fig-or, echo = FALSE, fig.cap="Odds ratios."}
knitr::include_graphics("images/chapter09/1_or.png", auto_pdf = TRUE)
@@ -659,6 +659,8 @@ We recommend looking at three metrics:
* C-statistic (area under the receiver operator curve), which should be maximised;
* Hosmer–Lemeshow test, which should be non-significant.

+ \newpage

**AIC**
\index{logistic regression@\textbf{logistic regression}!AIC}
\index{AIC}
@@ -760,7 +762,7 @@ melanoma <- melanoma %>%
fit <- melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
- fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using `cut` to convert a continuous variable as a factor (fit 3).") %>%
+ fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using cut to convert a continuous variable as a factor (fit 3).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: using `cut` to convert a continuous variable as a factor (fit 3).", col.names = "") %>%
column_spec(1, width = "18cm")
6 changes: 3 additions & 3 deletions 10_survival.Rmd
@@ -354,8 +354,7 @@ melanoma <- melanoma %>%
explanatory <- c("age", "sex", "thickness", "ulcer",
"cluster(hospital_id)")
melanoma %>%
- finalfit(dependent_os, explanatory) %>%
- mykable(caption = "Cox Proportional Hazards: Overall survival following surgery for melanoma with robust standard errors (cluster model).")
+ finalfit(dependent_os, explanatory)
```

```{r echo=FALSE}
@@ -367,7 +366,8 @@ melanoma <- melanoma %>%
explanatory <- c("age", "sex", "thickness", "ulcer",
"cluster(hospital_id)")
melanoma %>%
- finalfit(dependent_os, explanatory)
+ finalfit(dependent_os, explanatory) %>%
+ mykable(caption = "Cox Proportional Hazards: Overall survival following surgery for melanoma with robust standard errors (cluster model).")
```

```{r eval=FALSE}
8 changes: 4 additions & 4 deletions 11_missing_data.Rmd
@@ -60,13 +60,13 @@ This is easy to handle, but unfortunately, data are almost never missing complete
### Missing at random (MAR)
\index{missing data@\textbf{missing data}!missing at random}
This is confusing and would be better named *missing conditionally at random*.
- Here, missingness in particular variable has an association with one or more other variables in the dataset.
+ Here, missingness in a particular variable has an association with one or more other variables in the dataset.
However, the *actual values of the missing data are random*.

In our example, smoking status is missing for some female patients but not for male patients.

But data is missing from the same number of female smokers as female non-smokers.
- So the complete case female patients have the same characteristics as the missing data female patients.
+ So the complete case female patients has the same characteristics as the missing data female patients.

### Missing not at random (MNAR)
\index{missing data@\textbf{missing data}!missing not at random}
@@ -230,7 +230,7 @@ table1 %>%
mykable(caption = "Simulated missing completely at random (MCAR) and missing at random (MAR) dataset.")
```

- ## Check for associations between missing and observed data`
+ ## Check for associations between missing and observed data
\index{missing data@\textbf{missing data}!associations}

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables.
@@ -501,7 +501,7 @@ table_imputed <-

```{r echo=FALSE}
table_imputed %>%
- mykable(caption = "Regression analysis with missing data: Multiple imputation using `mice()`.")
+ mykable(caption = "Regression analysis with missing data: Multiple imputation using mice().")
```

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be seen.
