
Commit

After 2nd copy edit
ewenharrison committed Sep 7, 2020
1 parent 21a0f3f commit 90fc837
Showing 15 changed files with 197 additions and 170 deletions.
4 changes: 2 additions & 2 deletions 01_introduction.Rmd
Original file line number Diff line number Diff line change
@@ -184,7 +184,7 @@ For finding help on things you have not used before, it is best to Google it.

R has about 2 million users so someone somewhere has probably had the same question or problem.

- RStudio also has a Help drop-down menu at the very top (same row where you find "File", "Edit, ...).
+ RStudio also has a Help drop-down menu at the very top (same row where you find "File", "Edit", ...).
The most notable things in the Help drop-down menu are the Cheatsheets.
These tightly packed two-pagers include many of the most useful functions from `tidyverse` packages.
They are not particularly easy to learn from, but invaluable as an *aide-mémoire*.
@@ -226,7 +226,7 @@ After installing RStudio, you should go change two small but important things in
1. **Uncheck** "Restore .RData into Workspace on startup"
2. Set "Save .RData on exit" to **Never**

- ```{r chap01-fig-settings, echo = FALSE, fig.cap = "Configuring your RStudio Tools -> Global Options: Untick \"Restore .RData into Workspace on Exit\" and Set \"Save .RData on exit\" to Never.", out.width="100%"}
+ ```{r chap01-fig-settings, echo = FALSE, fig.cap = "Configuring your RStudio Tools -> Global Options: Untick \`\`Restore .RData into Workspace on Exit\" and Set \'\'Save .RData on exit\" to Never.", out.width="100%"}
knitr::include_graphics("images/chapter01/rstudio_settings.png")
```
23 changes: 16 additions & 7 deletions 02_basics.Rmd
@@ -364,7 +364,7 @@ A lot more about factor handling will be covered later (\@ref(chap08-h1)).
R is good for working with dates.
For example, it can calculate the number of days/weeks/months between two dates, or it can be used to find what future date is (e.g., "what's the date exactly 60 days from now?").
It also knows about time zones and is happy to parse dates in pretty much any format - as long as you tell R how your date is formatted (e.g., day before month, month name abbreviated, year in 2 or 4 digits, etc.).
- Since R displays dates and times between quotes (""), they look similar to characters.
+ Since R displays dates and times between quotes (`` ''), they look similar to characters.
However, it is important to know whether R has understood which of your columns contain date/time information, and which are just normal characters.
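For example (the dates below are made up), the `lubridate` package provides a parser for each common ordering of date components:

```r
library(lubridate)

# Day-month-year, month as a number:
dmy("15/03/2021")
#> "2021-03-15"

# Month spelled out, four-digit year:
mdy("March 15, 2021")

# Date-time with an explicit time zone:
ymd_hm("2021-03-15 13:30", tz = "Europe/London")
```

Whichever function matches the order of your raw data will parse it, regardless of the separators used.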

```{r, message = FALSE}
@@ -379,10 +379,15 @@ my_datetime
When printed, the two objects - `current_datetime` and `my_datetime` - seem to have a similar format.
But if we try to calculate the difference between these two dates, we get an error:

- ```{r, error = TRUE}
+ ```{r eval=FALSE, include=TRUE}
my_datetime - current_datetime
```

+ ```{r echo=FALSE}
+ print("Error in `-.POSIXt`(my_datetime, current_datetime)")
+ ```


That's because when we assigned a value to `my_datetime`, R assumed the simpler type for it - so a character.
We can check what the type of an object or variable is using the `class()` function:
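For example (the values here are hypothetical):

```r
current_datetime <- Sys.time()
my_datetime      <- "2020-12-01 12:00"

class(current_datetime)  # "POSIXct" "POSIXt" - a real date-time
class(my_datetime)       # "character" - just text that looks like a date
```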

@@ -419,10 +424,14 @@ ymd_hm("2021-01-02 12:00") + my_datesdiff
But what if we want to use the number of days in a normal calculation, e.g., if a measurement increased by 560 arbitrary units during this time period?
We might want to calculate the increase per day like this:

- ```{r, error = TRUE}
+ ```{r eval=FALSE, include=TRUE}
560/my_datesdiff
```

+ ```{r echo=FALSE}
+ print("Error in `/.difftime`(560, my_datesdiff)")
+ ```

Doesn't work, does it.
We need to convert `my_datesdiff` (which is a difftime value) into a numeric value by using the `as.numeric()` function:
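As a sketch with made-up dates:

```r
library(lubridate)

my_datesdiff <- ymd("2021-03-01") - ymd("2021-01-01")  # difftime of 59 days
560 / as.numeric(my_datesdiff)                         # increase per day
```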

@@ -492,7 +501,7 @@ mydata <- tibble(
)
mydata %>%
- knitr::kable(booktabs = TRUE, caption = "Example of data in columns and rows, including missing values denoted `NA` (Not applicable/Not available). Once this dataset has been read into R it gets called dataframe/tibble.") %>%
+ knitr::kable(booktabs = TRUE, caption = "Example of data in columns and rows, including missing values denoted NA (Not applicable/Not available). Once this dataset has been read into R it gets called dataframe/tibble.") %>%
kableExtra::kable_styling(font_size=9)
```

@@ -726,7 +735,7 @@ If you use the assignment arrow, an object holding the results will get saved in
\index{pipe@\textbf{pipe}}

The pipe - denoted `%>%` - is probably the oddest looking thing you'll see in this book.
- But please bear with; it is not as scary as it looks!
+ But please bear with us; it is not as scary as it looks!
Furthermore, it is super useful.
We use the pipe to send objects into functions.
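For example, these two lines give the same answer; the pipe just lets us read the steps from left to right:

```r
library(dplyr)

values <- c(1, 2, 3.5)

round(mean(values), digits = 1)          # nested: read from the inside out
values %>% mean() %>% round(digits = 1)  # piped: read from left to right
#> 2.2
```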

@@ -836,7 +845,7 @@ gbd_short %>%
filter(year = 1995)
```

- > The answer to 'do you need ==?" is almost always, "Yes R, I do, thank you".
+ > The answer to "do you need ==?" is almost always, "Yes R, I do, thank you".
But that's just because `filter()` is a clever cookie and is used to this common mistake.
There are other useful functions we use these operators in, but they don't always know to tell us that we've just confused `=` for `==`.
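A minimal sketch of the difference (the tibble here is made up, not the real `gbd_short`):

```r
library(dplyr)

gbd_mini <- tibble(year   = c(1990, 1995),
                   deaths = c(10.5, 12.3))

gbd_mini %>% filter(year == 1995)   # == asks: is year equal to 1995?
# gbd_mini %>% filter(year = 1995)  # = means assignment, so this errors
```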
@@ -1117,7 +1126,7 @@ typesdata %>%
mutate(mean_measurement = mean(measurement))
```

- Which in return can be useful for calculating a standardized measurement (i.e. relative to the mean):
+ Which in return can be useful for calculating a standardized measurement (i.e., relative to the mean):
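For example (the measurements are hypothetical):

```r
library(dplyr)

tibble(measurement = c(2.3, 4.1, 3.6)) %>%
  mutate(mean_measurement = mean(measurement),
         standardised     = measurement / mean_measurement)
```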

```{r}
typesdata %>%
12 changes: 8 additions & 4 deletions 03_summarising.Rmd
@@ -453,7 +453,9 @@ For example, here we want to collect all the columns that include the words Fema

```{r}
gbd_wide %>%
- pivot_longer(matches("Female|Male"), names_to = "sex_year", values_to = "deaths_millions") %>%
+ pivot_longer(matches("Female|Male"),
+ names_to = "sex_year",
+ values_to = "deaths_millions") %>%
slice(1:6)
```
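A minimal sketch of how `matches()` selects those columns (a made-up two-column version of `gbd_wide`):

```r
library(tidyr)

wide <- tibble::tibble(cause       = "All causes",
                       Female_1990 = 10.2,
                       Male_1990   = 12.1)

wide %>%
  pivot_longer(matches("Female|Male"),
               names_to  = "sex_year",
               values_to = "deaths_millions")
```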

@@ -477,7 +479,9 @@ We can use the `separate()` function to deal with that.
```{r}
gbd_wide %>%
# same pivot_longer as before
- pivot_longer(matches("Female|Male"), names_to = "sex_year", values_to = "deaths_millions") %>%
+ pivot_longer(matches("Female|Male"),
+ names_to = "sex_year",
+ values_to = "deaths_millions") %>%
separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
```
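A self-contained sketch of `separate()` on hypothetical values:

```r
library(tidyr)

tibble::tibble(sex_year = c("Female_1990", "Male_1990")) %>%
  separate(sex_year, into = c("sex", "year"), sep = "_", convert = TRUE)
# convert = TRUE makes the new year column an integer rather than a character
```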

@@ -506,7 +510,7 @@ gbd_long %>%
slice(1:3)
```

- The `-` doesn't work for categorical variables, they need to be put in `desc()` for arranging in descending order:
+ The `-` doesn't work for categorical variables; they need to be put in `desc()` for arranging in descending order:
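For example (values made up):

```r
library(dplyr)

causes <- tibble(cause  = c("Injuries", "Communicable diseases"),
                 deaths = c(10.2, 12.3))

causes %>% arrange(-deaths)      # numeric: the minus works
causes %>% arrange(desc(cause))  # categorical: needs desc()
# causes %>% arrange(-cause)     # this would give an error
```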

```{r}
gbd_long %>%
@@ -669,7 +673,7 @@ full_join(summary_data1, summary_data2) %>%

Instead of creating the two summarised tibbles and using a `full_join()`, achieve the same result as in the previous exercise with a single pipeline using `summarise()` and then `mutate()`.

- Hint: you have to do it the other way round, so `group_by(year, cause) %>% summarise(...)` first, then `group_by(year) %>% mutate()`.
+ Hint: you have to do it the other way around, so `group_by(year, cause) %>% summarise(...)` first, then `group_by(year) %>% mutate()`.

Bonus: `select()` columns `year`, `cause`, `percentage`, then `pivot_wider()` the `cause` variable using `percentage` as values.
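A sketch of that pipeline shape with made-up numbers:

```r
library(dplyr)

deaths_data <- tibble(year  = c(1990, 1990, 2017, 2017),
                      cause = c("A", "B", "A", "B"),
                      deaths_millions = c(10, 30, 20, 20))

deaths_data %>%
  group_by(year, cause) %>%
  summarise(deaths = sum(deaths_millions)) %>%
  group_by(year) %>%
  mutate(percentage = 100 * deaths / sum(deaths))
```

Within each year the percentages then sum to 100.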

2 changes: 1 addition & 1 deletion 05_fine_tuning_plots.Rmd
@@ -369,7 +369,7 @@ ggsave(p0, file = "my_saved_plot_larger.pdf", width = 10, height = 8)
```


- ```{r chap05-fig-ggsave, echo = FALSE, out.width="100%", fig.cap = "Experimenting with the width and height options within `ggsave()` can be used to quickly change how big or small some of the text on your plot looks."}
+ ```{r chap05-fig-ggsave, echo = FALSE, out.width="100%", fig.cap = "Experimenting with the width and height options within ggsave() can be used to quickly change how big or small some of the text on your plot looks."}
# these get put together into a single figure by hand - then to images/chapter05/
ggsave(p0 + labs(title = "ggsave(..., width = 5, height = 4)"), file = "my_saved_plot.pdf", width = 5, height = 4)
ggsave(p0 + labs(title = "ggsave(..., width = 10, height = 8)") + theme(title = element_text(size = 24)), file = "my_saved_plot_larger.pdf", width = 10, height = 8)
8 changes: 5 additions & 3 deletions 06_working_continuous.Rmd
@@ -88,8 +88,10 @@ sum_gapdata[[1]] %>%


```{r message=FALSE, echo=FALSE}
- sum_gapdata[[2]] %>%
- select(-c(5, 9)) %>%
+ t = sum_gapdata[[2]] %>%
+ select(-c(5, 9))
+ t$levels[2] = c("\`\`Africa\", \`\`Americas\", \`\`Asia\", \`\`Europe\", \`\`Oceania\"")
+ t %>%
kable(row.names = FALSE, align = c("l", "l", "l", "r", "r", "r", "r", "r", "r", "r"),
booktabs = TRUE, caption = "Gapminder dataset, ff\\_glimpse: categorical.",
linesep = c("", "", "\\addlinespace")) %>%
@@ -140,7 +142,7 @@ Quantile-quantile sounds more complicated than it really is.
It is a graphical method for comparing the distribution (think shape) of our own data to a theoretical distribution, such as the normal distribution.
In this context, quantiles are just cut points which divide our data into bins each containing the same number of observations.
For example, if we have the life expectancy for 100 countries, then quartiles (note the quar-) for life expectancy are the three ages which split the observations into 4 groups each containing 25 countries.
- A Q-Q plot simply plots the quantiles for our data against the theoretical quantiles for a particular distributions (the default shown below is the normal distribution).
+ A Q-Q plot simply plots the quantiles for our data against the theoretical quantiles for a particular distribution (the default shown below is the normal distribution).
If our data follow that distribution (e.g., normal), then our data points fall on the theoretical straight line.
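A base R sketch of the same idea using simulated data:

```r
set.seed(1)
x <- rnorm(100, mean = 70, sd = 5)  # simulated "life expectancies"

qqnorm(x)  # our quantiles against theoretical normal quantiles
qqline(x)  # points hugging this line suggest a normal distribution
```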

```{r chap06-fig-qq-life-year, fig.width=7, fig.height=3.5, fig.cap="Q-Q plot: Country life expectancy by continent and year."}
26 changes: 18 additions & 8 deletions 07_linear_regression.Rmd
@@ -119,7 +119,7 @@ If the observations are not equally distributed around the line, the histogram o

The distance of the observations from the fitted line should be the same on the left side as on the right side.
Look at the fan-shaped data on the simple regression diagnostics Shiny app.
- This fan shape can be seen on the residuals vs. fitted values plot.
+ This fan shape can be seen on the residuals vs fitted values plot.

Everything we talk about in this chapter is really about making sure that the line you draw through your data points is valid.
It is about ensuring that the regression line is appropriate across the range of the explanatory variable and dependent variable.
@@ -654,25 +654,31 @@ sum_wcgs[[1]] %>%
select(-c(5, 8, 9, 11, 12)) %>%
mykable(caption = "WCGS data, ff\\_glimpse: continuous.") %>%
column_spec(1, width = "4cm")
- sum_wcgs[[2]] %>%
- select(-c(5, 9)) %>%
+ ```
+
+ ```{r echo=FALSE, message=FALSE}
+ t = sum_wcgs[[2]] %>%
+ select(-c(5, 9))
+ t$levels = stringr::str_replace_all(t$levels, "\"", "\\`\\`")
+ t$levels = stringr::str_replace_all(t$levels, "``,", "\"")
+ t %>%
mykable(caption = "WCGS data, ff\\_glimpse: categorical.") %>%
- kable_styling(latex_options = c("scale_down", "hold_position")) %>%
+ kable_styling(latex_options = c("scale_down")) %>%
column_spec(6, width = "3cm") %>%
column_spec(7, width = "3cm")
```

### Plot the data

- ```{r chap07-fig-bp-personality_type, fig.height=3, fig.width=4.5, fig.cap="Scatter and line plot. Systolic blood pressure by weight and personality type."}
+ ```{r chap07-fig-bp-personality-type, fig.height=3, fig.width=4.5, fig.cap="Scatter and line plot. Systolic blood pressure by weight and personality type."}
wcgsdata %>%
ggplot(aes(y = sbp, x = weight,
colour = personality_2L)) + # Personality type
geom_point(alpha = 0.2) + # Add transparency
geom_smooth(method = "lm", se = FALSE)
```

- From Figure \@ref(fig:chap07-fig-bp-personality_type), we can see that there is a weak relationship between weight and blood pressure.
+ From Figure \@ref(fig:chap07-fig-bp-personality-type), we can see that there is a weak relationship between weight and blood pressure.

In addition, there is really no meaningful effect of personality type on blood pressure.
This is really important because, as you will see below, we are about to "find" some highly statistically significant effects in a model.
@@ -732,6 +738,10 @@ fit_sbp2 <- wcgsdata %>%
```{r chap07-tab-bp-personality-weight, echo=FALSE}
fit_sbp2[[1]] %>% mykable(caption = "Multivariable linear regression: Systolic blood pressure by personality type and weight.") %>%
column_spec(1, width = "4cm")
+ ```
+
+
+ ```{r chap07-tab-bp-personality-weight2, echo=FALSE}
fit_sbp2[[2]] %>% mykable(caption = "Multivariable linear regression metrics: Systolic blood pressure by personality type and weight.", col.names = "") %>%
column_spec(1, width = "18cm")
```
@@ -741,7 +751,7 @@ The output shows us the range for weight (78 to 320 pounds) and the mean (standard
The coefficient with 95% confidence interval is provided by default.
This is interpreted as: for each pound increase in weight, there is on average a corresponding increase of 0.18 mmHg in systolic blood pressure.

- Note the difference in the interpretation of continuous and categorical variables in the regression model output (Figure \@ref(tab:chap07-tab-bp-personality-weight)).
+ Note the difference in the interpretation of continuous and categorical variables in the regression model output (Table \@ref(tab:chap07-tab-bp-personality-weight)).

The adjusted R-squared is now higher - the personality and weight together explain 6.8% of the variation in blood pressure.

@@ -876,7 +886,7 @@ wcgsdata %>%
```

An important message in the results relates to the highly significant *p*-values in the table above.
- Should we conclude that in a "multivariable regression model controlling for BMI, age, and serum cholesterol, blood pressure was significantly elevated in those with a Type A personality (1.56 (0.57 to 2.56, p=0.002) compared with Type B?
+ Should we conclude that in a multivariable regression model controlling for BMI, age, and serum cholesterol, blood pressure was significantly elevated in those with a Type A personality (1.56 (0.57 to 2.56, p=0.002) compared with Type B?
The *p*-value looks impressive, but the actual difference in blood pressure is only 1.6 mmHg.
Even at a population level, that may not be clinically significant, fitting with our first thoughts when we saw the scatter plot.

10 changes: 6 additions & 4 deletions 09_logistic_regression.Rmd
@@ -30,7 +30,7 @@ It allows the principles of linear regression to be applied when outcomes are no
\index{binary data}

A regression analysis is a statistical approach to estimating the relationships between variables, often by drawing straight lines through data points.
- For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter\@ref(chap07-h1)).
+ For instance, we may try to predict blood pressure in a group of patients based on their coffee consumption (Figure \@ref(fig:chap07-fig-regression) from Chapter \@ref(chap07-h1)).
As blood pressure and coffee consumption can be considered on a continuous scale, this is an example of simple linear regression.

Logistic regression is an extension of this, where the variable being predicted is *categorical*.
@@ -99,8 +99,8 @@ Why?
Because in a logistic regression the slopes of fitted lines (coefficients) can be interpreted as odds ratios.
This is very useful when interpreting the association of a particular predictor with an outcome.

- For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio Figure \@ref(fig:chap09-fig-or).
- For example, if the odds of a smoker have a CV event are 1.5 and the odds of a non smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
+ For a given categorical predictor such as smoking, the difference in chance of the outcome occurring for smokers vs non-smokers can be expressed as a ratio of odds or odds ratio (Figure \@ref(fig:chap09-fig-or)).
+ For example, if the odds of a smoker having a CV event are 1.5 and the odds of a non-smoker are 1.0, then the odds of a smoker having an event are 1.5-times greater than a non-smoker, odds ratio = 1.5.
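The arithmetic from this paragraph as a quick sketch:

```r
# Odds = probability of the event / probability of no event
odds_smoker    <- 0.6 / (1 - 0.6)  # hypothetical probabilities
odds_nonsmoker <- 0.5 / (1 - 0.5)

odds_smoker / odds_nonsmoker       # odds ratio = 1.5
```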

```{r chap09-fig-or, echo = FALSE, fig.cap="Odds ratios."}
knitr::include_graphics("images/chapter09/1_or.png", auto_pdf = TRUE)
@@ -659,6 +659,8 @@ We recommend looking at three metrics:
* C-statistic (area under the receiver operator curve), which should be maximised;
* Hosmer–Lemeshow test, which should be non-significant.

+ \newpage

**AIC**
\index{logistic regression@\textbf{logistic regression}!AIC}
\index{AIC}
@@ -760,7 +762,7 @@ melanoma <- melanoma %>%
fit <- melanoma %>%
finalfit(dependent, c("ulcer.factor", "age.factor"), metrics = TRUE)
- fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using `cut` to convert a continuous variable as a factor (fit 3).") %>%
+ fit[[1]] %>% mykable(caption = "Multivariable logistic regression: using cut to convert a continuous variable as a factor (fit 3).") %>%
column_spec(1, width = "3.5cm")
fit[[2]] %>% mykable(caption = "Model metrics: using `cut` to convert a continuous variable as a factor (fit 3).", col.names = "") %>%
column_spec(1, width = "18cm")
6 changes: 3 additions & 3 deletions 10_survival.Rmd
@@ -354,8 +354,7 @@ melanoma <- melanoma %>%
explanatory <- c("age", "sex", "thickness", "ulcer",
"cluster(hospital_id)")
melanoma %>%
- finalfit(dependent_os, explanatory) %>%
- mykable(caption = "Cox Proportional Hazards: Overall survival following surgery for melanoma with robust standard errors (cluster model).")
+ finalfit(dependent_os, explanatory)
```

```{r echo=FALSE}
@@ -367,7 +366,8 @@ melanoma <- melanoma %>%
explanatory <- c("age", "sex", "thickness", "ulcer",
"cluster(hospital_id)")
melanoma %>%
- finalfit(dependent_os, explanatory)
+ finalfit(dependent_os, explanatory) %>%
+ mykable(caption = "Cox Proportional Hazards: Overall survival following surgery for melanoma with robust standard errors (cluster model).")
```

```{r eval=FALSE}
8 changes: 4 additions & 4 deletions 11_missing_data.Rmd
@@ -60,13 +60,13 @@ This is easy to handle, but unfortunately, data are almost never missing complete
### Missing at random (MAR)
\index{missing data@\textbf{missing data}!missing at random}
This is confusing and would be better named *missing conditionally at random*.
- Here, missingness in particular variable has an association with one or more other variables in the dataset.
+ Here, missingness in a particular variable has an association with one or more other variables in the dataset.
However, the *actual values of the missing data are random*.

In our example, smoking status is missing for some female patients but not for male patients.

But data is missing from the same number of female smokers as female non-smokers.
- So the complete case female patients have the same characteristics as the missing data female patients.
+ So the complete case female patients has the same characteristics as the missing data female patients.

### Missing not at random (MNAR)
\index{missing data@\textbf{missing data}!missing not at random}
@@ -230,7 +230,7 @@ table1 %>%
mykable(caption = "Simulated missing completely at random (MCAR) and missing at random (MAR) dataset.")
```

- ## Check for associations between missing and observed data`
+ ## Check for associations between missing and observed data
\index{missing data@\textbf{missing data}!associations}

In deciding whether data is MCAR or MAR, one approach is to explore patterns of missingness between levels of included variables.
@@ -501,7 +501,7 @@ table_imputed <-

```{r echo=FALSE}
table_imputed %>%
- mykable(caption = "Regression analysis with missing data: Multiple imputation using `mice()`.")
+ mykable(caption = "Regression analysis with missing data: Multiple imputation using mice().")
```

By examining the coefficients, the effect of the imputation compared with the complete case analysis can be seen.
