06-linear_models-plots.qmd

```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch06/")
```

::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::


# Plots for univariate response models {#sec-linear-models-plots}

For a univariate linear model fit using `lm()`, `glm()` and similar functions, the standard `plot()`
method gives basic versions of _diagnostic_ plots of residuals and other calculated quantities for assessing
possible violations of the model assumptions.
Some of these can be considerably enhanced using other packages.

Beyond this, 

* tables of model coefficients, standard errors and test statistics can often be usefully 
supplemented or even replaced by suitable _coeficient plots_ providing essentially the same information.

* when there are two or more predictors, standard plots of $y$ vs. each $x$ can be misleading because they ignore the other predictors in the model.
_Added-variable plots_ (also called _partial regression plots_) allow you to visualize _partial_ relations
between the response and a predictor by adjusting (controlling for) all other predictors. _Component + residual_ plots help to diagnose nonlinearity.

* in this same situation, you can more easily understand their separate impact
on the response by plotting the marginal _predicted effects_ of one or more focal variables, averaging
over other variables not shown in a given plot. 

* when there are highly correlated predictors, some specialized plots are useful to
understand the nature of _multicolinearity_.

The classic reference on regression diagnostics is @Belsley-etal:80.
My favorite modern texts are the brief @Fox2020 and the more complete @FoxWeisberg:2018,
both of which are supported by the `r package("car", cite=TRUE)`. Some of the examples in
this chapter are inspired by @FoxWeisberg:2018.

**Packages**

In this chapter I use the following packages. Load them now:

```{r load-pkgs}
library(car)
library(dplyr)
library(easystats)
library(effects)
library(ggeffects)
library(ggplot2)
library(ggstats)
library(marginaleffects)
library(modelsummary)
library(performance)
library(tidyr)
```

```{r bib-pkgs}
#| echo: false
# to generate .bib items
library(ggeffects)
library(marginaleffects)
```

## The "regression quartet"

For a fitted model, plotting the model object with `plot(model)` provides for any of six basic plots,
of which four are produced by default, giving rise to the term _regression quartet_ for this collection.
These are:

* **Residuals vs. Fitted**: For well-behaved data, the points should hover around a horizontal line at residual = 0,
with no obvious pattern or trend.

* **Normal Q-Q plot**: A plot of sorted standardized residuals $e_i$ (obtained from`rstudent(model)`) against the theoretical values those values would have in a standard normal $\mathcal{N}(0, 1)$ distribution.

* **Scale-Location**: Plots the square-root of the absolute values of the standardized residuals $\sqrt{| e_i |}$
as a measure of "scale" against the fitted values $\hat{y}_i$ as a measure of "location". This provides an assessment
of homogeneity of variance. Violations appear as a tendency for scale (variability) to vary with location.


* **Residuals vs. Leverage**: Plots standardized residuals against leverage to help identify possibly influential observations.
Leverage, or "hat" values (given by `hat(model)`) are proportional to the squared Mahalanobis distances of
the predictor values $\mathbf{x}_i$ from the means, and measure the potential of an observation to
change the fitted coefficients if that observation was deleted. Actual influence is measured by Cooks's distance
(`cooks.distance(model)`) and is proportional to the product of residual times leverage. Contours of constant
Cook's $D$ are added to the plot.

One key feature of these plots is providing **reference** lines or smoothed curves for ease of judging the
extent to which a plot conforms to the expected pattern; another is the **labeling** of observations which
deviate from an assumption.

The base-R `plot(model)` plots are done much better in a variety of packages.
I illustrate some versions from the **car** [@R-car] and **performance** [@Ludecke-etal-performance] packages,
part of the **easystats** [@R-easystats] suite of packages.


### Example: Duncan's occupational prestige {#sec-example-duncan}

In a classic study in sociology, @Duncan:61 used data from the U.S. Census in 1950 to study how one could
predict the prestige of occupational categories --- which is hard to measure ---
from available information in the census for those occupations. His data is available in `carData:Duncan`, and contains

* `type`: the category of occupation, one of `prof` (professional), `wc` (white collar) or `bc` (blue collar);
* `income`: the percentage of occupational incumbents with a reported income > 3500 (about 40,000 in current dollars);
* `education`: the percentage of occupational incumbents who were high school graduates;
* `prestige`: the percentage of respondents in a social survey who rated the occupation as “good” or better in prestige.

These variables are a bit quirky in they are measured in percents, 0-100, rather dollars for `income` and
years for `education`, but this common scale permitted Duncan to ask an interesting sociological question:
Assuming that both income and education predict prestige, are they equally important, as might be
assessed by testing the hypothesis $\mathcal{H}_0: \beta_{\text{income}} = \beta_{\text{education}}$.
If so, this would provide a very simple theory for occupational prestige.

A quick look at the data shows the variables and a selection of the occupational categories, which are
the `row.names()` of the dataset.
```{r duncan}
data(Duncan, package = "carData")
set.seed(42)
car::some(Duncan)
```

Let's start by fitting a simple model using just income and education as predictors. The results look very good!
Both `income` and `education` are highly significant and the $R^2 = 0.828$ for the model indicates
that `prestige` is very well predicted by just these variables.

```{r duncan-mod}
duncan.mod <- lm(prestige ~ income + education, data=Duncan)
summary(duncan.mod)
```

Beyond this, Duncan was interested in the coefficients and whether income and education could be said
to have equal impacts on predicting occupational prestige. A nice display of model coefficients with
confidence intervals is provided by `parameters::model_parameters()`.

```{r duncan-coef}
parameters::model_parameters(duncan.mod)
```

We can also test Duncan's hypothesis that income and education have equal effects on prestige
with `car::linearHypothesis()`. This is constructed as a test of a restricted model in which
the two coefficients are forced to be equal against the unrestricted model. Duncan was very happy with
this result.
```{r duncan-linhyp}
car::linearHypothesis(duncan.mod, "income = education")
```

Equivalently, the linear hypothesis that $\beta_{\text{Inc}} = \beta_{\text{Educ}}$ can be tested
with a Wald test for difference between these coefficients, expressed as
$\mathcal{H}_0 : \mathbf{C} \mathbf{\beta} = 0$, using $\mathbf{C} = (0, -1, 1)$.
The estimated value, -0.053 has a confidence interval [-0.462, 0.356], consistent with 
Duncan's hypothesis.

```{r duncan-wald}
wtest <- spida2::wald(duncan.mod, c(0, -1, 1))[[1]]
wtest$estimate
```

We can visualize this test and confidence intervals using a joint confidence ellipse for the
coefficients for income and education in the model `duncan.mod`. In @fig-duncan-beta-diff
the size of the ellipse is set to $\sqrt{F^{0.95}_{1,\nu}} = t^{0.95}_{\nu}$, so that
its shadows on the horizontal and vertical axes correspond to 1D 95% confidence intervals.
In this plot, the line through the origin with slope $= 1$ corresponds to
equal coefficients for income and education and the line with slope $= -1$
corresponds to their difference, $\beta_{\text{Educ}} - \beta_{\text{Inc}}$.
The orthogonal projection of the coefficient vector 
$(\widehat{\beta}_{\text{Inc}}, \widehat{\beta}_{\text{Educ}})$ (the center of the ellipse)
is the point estimate of $\widehat{\beta}_{\text{Educ}} - \widehat{\beta}_{\text{Inc}}$
and the shadow of the ellipse along this axis is the 95% confidence interval for the difference
in slopes.

<!-- figure-code: R/Duncan/dunc-conf-ellipse.R -->

```{r echo=-1}
#| label: fig-duncan-beta-diff
#| code-fold: true
#| code-summary: See the code
#| out-width: "80%"
#| fig-width: 8
#| fig-height: 7
#| fig-cap: "Joint 95% confidence ellipse for $(\\beta_{\\text{Inc}}, \\beta_{\\text{Educ}})$, together with their 1D shadows, which give 95% confidence intervals for the separate coefficients and the linear hypothesis that the coefficients are equal. Projecting the confidence ellipse along the line with unit slope gives a confidence interval for the difference between coefficients, shown by the dark red line."
op <- par(mar = c(4,4, 1, 0)+ .3)
confidenceEllipse(duncan.mod, col = "blue",
  levels = 0.95, dfn = 1,
  fill = TRUE, fill.alpha = 0.2,
  xlim = c(-.4, 1),
  ylim = c(-.4, 1), asp = 1,
  cex.lab = 1.3,
  grid = FALSE,
  xlab = expression(paste("Income coefficient, ", beta[Inc])),
  ylab = expression(paste("Education coefficient, ", beta[Educ])))

abline(h=0, v=0, lwd = 2)

# confidence intervals for each coefficient
beta <- coef( duncan.mod )[-1]
CI <- confint(duncan.mod)       # confidence intervals
lines( y = c(0,0), x = CI["income",] , lwd = 5, col = 'blue')
lines( x = c(0,0), y = CI["education",] , lwd = 5, col = 'blue')
points(rbind(beta), col = 'black', pch = 16, cex=1.5)
points(diag(beta) , col = 'black', pch = 16, cex=1.4)
arrows(beta[1], beta[2], beta[1], 0, angle=8, len=0.2)
arrows(beta[1], beta[2], 0, beta[2], angle=8, len=0.2)

# add line for equal slopes
abline(a=0, b = 1, lwd = 2, col = "darkgreen")
text(0.8, 0.8, expression(beta[Educ] == beta[Inc]), 
     srt=45, pos=3, cex = 1.5, col = "darkgreen")

# add line for difference in slopes
col <- "darkred"
x <- c(-1.5, .5)
lines(x=x, y=-x)
text(-.15, -.15, expression(~beta["Educ"] - ~beta["Inc"]), 
     col=col, cex=1.5, srt=-45)

# confidence interval for b1 - b2
wtest <- spida2::wald(duncan.mod, c(0, -1, 1))[[1]]
lower <- wtest$estimate$Lower /2
upper <- wtest$estimate$Upper / 2
lines(-c(lower, upper), c(lower,upper), lwd=6, col=col)

# projection of (b1, b2) on b1-b2 axis
beta <- coef( duncan.mod )[-1]
bdiff <- beta %*% c(1, -1)/2
points(bdiff, -bdiff, pch=16, cex=1.3)
arrows(beta[1], beta[2], bdiff, -bdiff, 
       angle=8, len=0.2, col=col, lwd = 2)

# calibrate the diff axis
ticks <- seq(-0.3, 0.3, by=0.2)
ticklen <- 0.02
segments(ticks, -ticks, ticks-sqrt(2)*ticklen, -ticks-sqrt(2)*ticklen)
text(ticks-2.4*ticklen, -ticks-2.4*ticklen, ticks, srt=-45)
```


<!-- ```{r} -->
<!-- #| label: fig-duncan-beta-diff-include -->
<!-- #| echo: false -->
<!-- #| out-width: "80%" -->
<!-- #| fig-cap: "Joint 95% confidence ellipse for $(\beta_{\text{Inc}}, \beta_{\text{Educ}}), together with their 1D shadows, which give 95% confidence intervals for the separate coefficients and the linear hypothesis that the coefficients are equal." -->
<!-- knitr::include_graphics("images/Duncan-beta-diff.png") -->
<!-- ``` -->


#### Diagnostic plots
But, should Duncan be **so** happy? It is unlikely that he ran any model diagnostics or plotted
his model; we do so now. Here is the regression quartet (@fig-duncan-plot-model) for this model. Each plot shows
some trend lines, and importantly, labels some observations that stand out and might deserve attention.

```{r}
#| label: fig-duncan-plot-model
#| out-width: "100%"
#| fig-cap: "Regression quartet of diagnostic plots for the `Duncan` data. Several possibly unusual observations are labeled."
op <- par(mfrow = c(2,2), 
          mar = c(4,4,3,1)+.1)
plot(duncan.mod, lwd=2, pch=16)
par(op)
```

Some points to note:

* A few observations (minister, reporter, conductor, contractor) are flagged in multiple panels.
It turns out (@sec-duncan-influence)
that the observations for minister and reporter noted in the residuals vs. leverage plot are highly influential and largely responsible for Duncan's finding that the slopes for income and
education could be considered equal.
* The `r colorize("red")` trend line in the scale-location plot indicates that residual variance is not constant, but rather increases from both ends. This is a consequence of the fact that `prestige` is measured as a percentage, bounded at [0, 100], and the standard deviation of a percentage $p$ is proportional to $\sqrt{p \times (1-p)}$ which is maximal at $p = 0.5.

Similar, but nicer-looking diagnostic plots are provided by `performance::check_model()`
which uses `ggplot2` for graphics. These include helpful captions indicating
what should be observed for each for a good-fitting model. However, they don't have
as good facilities for labeling unusual observations as the base R `plot()` or
functions in the `car` package.

```{r}
#| label: fig-duncan-check-model
#| out-width: "100%"
#| fig-cap: "Diagnostic plots for the `Duncan` data, using `check_model()`."
check_model(duncan.mod, check=c("linearity", "qq", "homogeneity", "outliers"))
```


### Example: Canadian occupational prestige {#sec-example-prestige}

<!-- **TODO**:  Already introduced in @sec-prestige Pair this down & integrate with that -->

Following @Duncan:61, occupational prestige was studied in a Canadian context by Bernard Blishen
and others at York University, giving the dataset `carData::Prestige` which we looked at
in @sec-prestige.
It differs from the `Duncan` dataset primarily in that the main variables---prestige, income and education
were revamped to better reflect the underlying constructs in more meaningful units.

* `prestige`: Rather than a simple percentage of "good+" ratings, this uses a wider and more reliable scale
from @Pineo-Porter-1967 on a scale from 10--90.

* `income` is measured as the average income of incumbents in each occupation, in 1971 dollars, rather than
percent exceeding a given threshold ($3500)

* `education` is measured as the average education of occupational incumbents, years.

The dataset again includes `type` of occupation with the same levels `"bc"` (blue collar), `"wc"` (white collar) and `"prof"` (professional)[^reorder-type],
but in addition includes the percent of `women` in these
occupational categories.

[^reorder-type]: Note that the factor `type` in the dataset has its levels ordered alphabetically.
For analysis and graphing it is useful to reorder the levels in the natural increasing order.
An alternative is to make `type` an _ordered_ factor, but this would represent it
using polynomial contrasts for linear and quadratic trends, which seems unuseful in this context.

Our interest again is in understanding how `prestige` 
is related to census measures of the average education, income, percent women of incumbents in those
occupations, but with attention to the scales of measurement and possibly more complex relationships.

```{r prestige}
data(Prestige, package="carData")
# Reorder levels of type
Prestige$type <- factor(Prestige$type, 
                        levels=c("bc", "wc", "prof")) 
str(Prestige)
```

We fit a main-effects model using all predictors (ignoring `census`, the Canadian Census occupational code):

```{r prestige-mod}
prestige.mod <- lm(prestige ~ education + income + women + type,
                   data=Prestige)
```

`plot(model)` produces four separate plots. For a quick look, I like to arrange them in a single 2x2 figure.

```{r}
#| label: fig-plot-prestige-mod
#| out-width: "100%"
#| fig-show: hold
#| fig-cap: "Regression quartet of diagnostic plots for the `Prestige` data. Several possibly unusual observations are labeled in each plot."
op <- par(mfrow = c(2,2), 
          mar=c(4,4,3,1)+.1)
plot(prestige.mod, lwd=2, cex.lab=1.4)
par(op)
```


## Other Model plots

**TODO**: What goes here?

## Coefficient displays

The results of linear models are most often reported in tables and typically with "significance stars"
(`*, **, ***`) to indicate the outcome of hypothesis tests. These are useful for looking up precise values and you can use this format to compare a small number of competing models side-by-side.
However, as illustrated by @KastellecLeoni:2007, plots of coefficients can increase the clarity of presentation
and make it easier to draw correct conclusions. Yet, when you need to present tables, there is a variety of tools
in R that can help make them attractive in publications.

For illustration, I'll consider three models for the `Prestige` data of increasing complexity:

* `mod1` fits the main effects of the three quantitative predictors; 
* `mod2` adds the categorical variable `type` of occupation;
* `mod3` allows an interaction of `income` with `type`. 

```{r prestige-models}
mod1 <- lm(prestige ~ education + income + women,
           data=Prestige)
mod2 <- lm(prestige ~ education + women + income + type,
           data=Prestige)
mod3 <- lm(prestige ~ education + women + income * type,
           data=Prestige)
```

From our earlier analyses (@sec-prestige) we saw that the marginal relationship between `income` and `prestige` was nonlinear @fig-Prestige-scatterplot-income1), and was better represented in a linear model using
`log(income)` (@sec-log-scale) shown in @fig-Prestige-scatterplot2. However, this possibly non-linear relationship could also be explained by stratifying (@sec-stratifying) the data by `type` of occupation (@fig-Prestige-scatterplot3).


### Displaying coefficients

`summary()` gives the complete precis of a fitted model, with information about the estimated coefficients, residuals and goodness-of fit statistics like $R^2$. But if you only want to see the coefficients,
standard errors, etc. `lmtest::coeftest()` gives these results in the familiar format for console output.
`broom::tidy()` places these in a tidy format common to many modeling functions which is useful for 
futher processing (e.g., comparing models).
```{r coeftest}
lmtest::coeftest(mod1)

broom::tidy(mod1)
```

The `r package("modelsummary", cite=TRUE)` is an easy to use,
very general package to summarize data and statistical models in R. The main function
`modelsummary()` can produce highly customizable tables of coefficients
in a wide variety of output formats, including HTML, PDF, LaTeX, Markdown, and MS Word.
You can select the statistics displayed for any model term with the `estimate` and
`statistic` arguments. 
```{r}
#| label: tbl-modelsummary1
#| tbl-cap: Table of coefficients for the main effects model.
modelsummary(list("Model1" = mod1),
  coef_omit = "Intercept",
  shape = term ~ statistic,
  estimate = "{estimate} [{conf.low}, {conf.high}]",
  statistic = c("std.error", "p.value"),
  fmt = fmt_statistic("estimate" = 3, "conf.low" = 4, "conf.high" = 4),
  gof_omit = ".")
```

`gof_omit` allows you to omit or select the goodness-of-fit statistics and other model information
available from those listed by `get_gof()`:

```{r}
get_gof(mod1)
```


### Visualizing coefficients

`modelplot()` is the companion function. It allows you to plot model estimates and confidence intervals. It makes it easy to subset, rename, reorder, and customize plots using same mechanics as in `modelsummary()`.
```{r}
#| label: fig-modelplot1
#| out-width: "80%"
#| fig-cap: "Plot of coefficients and their standard error bar for the simple main effects model"
theme_set(theme_minimal(base_size = 14))

modelplot(mod1, coef_omit="Intercept", 
          color="red", size=1, linewidth=2) +
  labs(title="Raw coefficients for mod1")
```

But this plot is disappointing and misleading because it show the **raw** coefficients. 
From the plot, it looks like only `education` has a non-zero effect, but the effect of `income` is also
highly significant. The problem is that the magnitude of the coefficient $\hat{b}_{\text{education}}$
is more than 40,000 times that of the other coefficients, because education is measured years,
while income is measured in dollars. 
The 95% confidence interval for $\hat{b}_{\text{income}} = [0.0008, 0.0019]$, but this is invisible in the plot.

Before figuring out how to fix this issue, I show the comparable displays from `modelsummary()`
and `modelplot()` for all three models.
When you give `modelsummary()` a list of models, it displays their coefficients side-by-side
as shown in @tbl-modelsummary2. 


```{r}
#| label: tbl-modelsummary2
#| tbl-cap: Table of coefficients for three models.
models <- list("Model1" = mod1, "Model2" = mod2, "Model3" = mod3)
modelsummary(models,
     coef_omit = "Intercept",
     fmt = 2,
     stars = TRUE,
     shape = term ~ statistic,
     statistic = c("std.error", "p.value"),
     gof_map = c("rmse", "r.squared")
     )
```

Note that a factor predictor (like `type` here) with $d$ levels is represented by $d-1$ coefficients
in main effects and in interactions with quantitative variables. These levels are coded with _treatment contrasts_ by default. Also by default, the first level is set as the reference level in alphabetical order.
Here the reference level is blue collar (`bc`), so the coefficient `typeprof = 5.91`
indicates that professional occupations on average are rated 5.91 greater on the Prestige scale
than blue collar workers.

Note also that unlike the table,
the coefficients in @fig-modelplot1 are ordered from bottom to top, because the Y axis starts
at the lower left corner. In @fig-modelplot2  I use `scale_y_discrete()` to reverse the order.
It is also useful to add a vertical reference line at $\beta = 0$.
```{r}
#| label: fig-modelplot2
#| out-width: "90%"
#| fig-cap: "Plot of raw coefficients and their confidence intervals for all three models"
modelplot(models, 
          coef_omit="Intercept", 
          size=1.3, linewidth=2) +
  ggtitle("Raw coefficients") +
  geom_vline(xintercept = 0, linewidth=1.5) +
  scale_y_discrete(limits=rev) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85, 0.2))
```

### More useful coefficient plots

The problem with plots of raw coefficients shown in @fig-modelplot1 and @fig-modelplot2
is that the coefficients for different predictors are not directly comparable because
they are measured in different units.

One alternative is to plot the _standardized coefficients_. 
Another way is to re-scale the predictors into more comparable and meaningful units.
I illustrate these ideas below.


#### Standardized coefficients {.unnumbered}

The simplest way to do this is to transform all variables to standardized ($z$) scores.
The coefficients are then interpreted
as the standardized change in prestige for a one standard deviation change in the
predictors.
The syntax below uses `scale` to transform all the numeric variables.
Then, we re-fit the models using the standardized data.

```{r standardize}
Prestige_std <- Prestige |>
  as_tibble() |>
  mutate(across(where(is.numeric), scale))

mod1_std <- lm(prestige ~ education + income + women, 
               data=Prestige_std)
mod2_std <- lm(prestige ~ education + women + income + type, 
               data=Prestige_std)
mod3_std <- lm(prestige ~ education + women + income * type, 
               data=Prestige_std)
```

The plot in @fig-modelplot3 now shows the significant effect of income in all three models.
As well, it offers a more sensitive comparison of the coefficients of other terms
across models; for example `women` is not significant in models 1 and 2, but becomes
significant in Model 3 when the interaction of `income * type` is included.
```{r}
#| label: fig-modelplot3
#| out-width: "90%"
#| fig-cap: "Plot of standardized coefficients and their confidence intervals for all three models"
models <- list("Model1" = mod1_std, "Model2" = mod2_std, "Model3" = mod3_std)
modelplot(models, 
          coef_omit="Intercept", size=1.3) +
  ggtitle("Standardized coefficients") +
  geom_vline(xintercept = 0, linewidth = 1.5) +
  scale_y_discrete(limits=rev) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85, 0.2))
```

It turns out there is an easier way to get plots of standardized coefficients.
`modelsummary()` extracts coefficients from model objects using the `parameters` package, 
and that package offers several options for standardization: 
See [model parameters documentation](https://easystats.github.io/parameters/reference/model_parameters.default.html).
We can pass the `standardize="refit"` (or other) argument directly to `modelsummary()` or `modelplot()`, and that argument will be forwarded to `parameters`. The plot produced by the code below is identical to @fig-modelplot3 and is not shown.

```{r}
#| eval: false
modelplot(list("mod1" = mod1, "mod2" = mod2, "mod3" = mod3),
          standardize = "refit",
          coef_omit="Intercept", size=1.3) +
  ggtitle("Standardized coefficients") +
  geom_vline(xintercept = 0, linewidth=1.5) +
  scale_y_discrete(limits=rev) +
  theme(legend.position = "inside",
        legend.position.inside = c(0.85, 0.2))
```

The `r package("ggstats", cite=TRUE)` provides even nicer versions
of coefficient plots that handle factors in a more reasonable way, as levels within
the factor. `ggcoef_model()` plots a single model and `ggcoef_compare()` plots 
a list of models using sensible defaults. A small but nice feature is that it explicitly
shows the 0 value for the reference level of a factor (`type = "bc"` here)
and uses better labels for factors and their interactions.

```{r}
#| label: fig-ggcoef-compare
#| out-width: "90%"
#| fig-cap: "Model comparison plot from `ggcoef_compare()`"
models <- list(
  "Base model"      = mod1_std,
  "Add type"        = mod2_std,
  "Add interaction" = mod3_std)

ggcoef_compare(models) +
  labs(x = "Standarized Coefficient")
```

#### More meaningful units {.unnumbered}

Standardizing the variables makes the coefficients directly comparable, but it may be harder
to understand what they mean in terms of the variables. For example, the coefficient of
`income` in `mod2_std` is 0.25. A literal interpretation is that occupational prestige
is expected to increase 0.25 standard deviations for each standard deviation increase in income,
but it may be difficult to appreciate what this means.

A better substantive comparison of the coefficients could use understandable scales for the
predictors, e.g., months of education, $100,000 of income or 10% of women's participation.
Note that the effect of this is just to multiply the coefficients and their standard errors by a factor. 
The statistical conclusions of significance are unchanged.

For simplicity, I do this just for Model 1.

```{r scaled-units}
Prestige_scaled <- Prestige |>
  mutate(education = 12 * education,
         income = income / 100,
         women = women / 10)

mod1_scaled <- lm(prestige ~ education + income + women,
                  data=Prestige_scaled)
```

When we plot this with `ggcoef_model()`, there are many options to control how variables are
labeled and other details.
```{r}
#| label: fig-ggcoef-compare2
#| out-width: "90%"
#| fig-cap: "Plot of coefficients for prestige with scaled predictors for Model 1."
ggcoef_model(mod1_scaled,
  signif_stars = FALSE,
  variable_labels = c(education = "education\n(months)",
                      income = "income\n(/$100K)",
                      women = "women\n(/10%)")) +
  xlab("Coefficients for prestige with scaled predictors")
```

So, on average, each additional month of education increases the prestige rating by 0.34 units,
while an additional $100,000 of income increases it by 0.13 units. While these are significant
effects, they are not large in relation to the scale of `prestige` which ranges 14.8---87.2.

<!-- ## Spread-level plot -->


## Added-variable and related plots {#sec-avplots}

In multiple regression problems, it is most often useful to construct a scatterplot matrix
and examine the plot of the response vs. each of the predictors as well as those
of the predictors against each other.
However, the simple, _marginal_ scatterplots of a response $y$ against _each_ of several predictors $x_1, x_2, \dots$ can be misleading because each one ignores the other predictors.

To see this consider a toy dataset, `coffee`, giving measures of coffee consumption,
occupational stress and an index of heart problems in a sample of $n=20$ graduate students
and professors.

```{r}
#| label: fig-coffee-spm
#| fig-width: 7
#| fig-height: 6
#| out-width: "100%"
#| fig-cap: "Scatterplot matrix showing pairwise relations among `Heart` ($y$), `Coffee` consumption ($x_1$) and `Stress` ($x_2$), with linear regression lines and 68% data ellipses for the bivariate relations"
data(coffee, package="matlib")

scatterplotMatrix(~ Heart + Coffee + Stress, 
  data=coffee,
  smooth = FALSE,
  ellipse = list(levels=0.68, fill.alpha = 0.1),
  pch = 19, cex.labels = 2.5)
```

The message from these marginal plots in @fig-coffee-spm
seems to be that coffee is bad for your heart,
stress is bad for your heart, and stress is also strongly
related to coffee consumption. Yet, when we fit a model with both variables together,
we get the following results:

```{r fit-both}
fit.both   <- lm(Heart ~ Coffee + Stress, data=coffee)
lmtest::coeftest(fit.both)
```

The coefficients suggest that stress is indeed bad for your heart,
but the negative (though non-significant) coefficient for coffee suggests that coffee is good for you.How can this be? Does that mean I should drink more coffee, while avoiding stress?

The reason for this apparent paradox is that
the general linear model fit by `lm()` estimates all effects together and so the coefficients pertain to the _partial_ effect of a given predictor, _adjusting_ for the effects of all others.
That is, the coefficient for coffee ($\beta_{\text{Coffee}} = -0.41$) estimates the effect of coffee for
people with _same_ level of stress. In the marginal scatterplot, the positive slope for coffee
(1.10) ignores the correlation of coffee and stress.

This is an example of _confounding_ in regression when an important predictor is omitted.
Stress is positively associated with both coffee consumption and heart damage.
When stress is omitted, the coefficient for coffee is biased because it "picks up"
the relation with the omitted variable.

A solution to this problem is
the _added-variable_ plot ("AV plot", also called _partial regression_ plot, MostellerTukey-1977).
This is a multivariate analog of a
a simple marginal scatterplot, designed to visualize directly the partial relation between $y$ and the predictors
$x_1, x_2, \dots$ in a multiple regression model. 

You can think of this as a magic window that hides
the relations of all other variables with each of the $y$ and $x_i$ shown in a given added-variable plot.
This gives an unobstructed view of the net relation between $y$ and $x_i$ with the effect of
all other variables removed. In effect, it reduces the problem of viewing the complete model
in $p$-dimensional space to a sequence of $p$ 2D plots, each of which tells the story of one
predictor, unentangled from the others. This is essentially the same idea as
the partial variables plot (@sec-pvPlot) used to understand partial correlations.

The construction of an AV plot is conceptually very simple. For variable $x_i$, imagine that we fit two supplementary regressions:

* Regress $\mathbf{y}$ on $\mathbf{X_{(-i)}}$, the model matrix of all of the regressors except $x_i$. 
By definition, the residuals from this regression, 
$\mathbf{y}^\star \equiv \mathbf{y} \,\vert\, \text{others} = \mathbf{y} - \widehat{\mathbf{y}} \,\vert\, \mathbf{X_{(-i)}}$,
<!-- $r_y(x_i) = y - \widehat{y}_{\mathbf{X_{(-i)}}} \equiv y \,\vert\, \text{others}$, -->
are the part of $\mathbf{y}$ that cannot be
explained by all the other regression terms. These residuals are necessarily uncorrelated
with the other predictors.

* Regress $x_i$ on the other predictors, $\mathbf{X_{(-i)}}$ and again obtain the residuals.
These residuals, 
$\mathbf{x}_i^\star \equiv \mathbf{x}_i \,\vert\, \text{others} = \mathbf{x}_i - \widehat{\mathbf{x}}_i \,\vert\, \mathbf{X_{(-i)}}$
give the part of $x_i$ that cannot be explained by the others, and so are uncorrelated with them.

The AV plot is then just a simple scatterplot of these residuals, $\mathbf{y}^\star$
on the vertical axis, and $\mathbf{x}^\star$ on the horizontal.
In practice, it is unnecessary to run the auxilliary regressions this way [@VellemanWelsh:81].
Both can be calculated using `stats::lsfit()` roughly as follows:

```{r AVcalc}
#| eval: false
AVcalc <- function(model, variable)
X <- model.matrix(model)
response <- model.response(model)
x <- X[, -variable]
y <- cbind(X[, variable], response)
fit <- lsfit(x, y, intercept = FALSE)
resids <- residuals(fit)
return(resids)
```

Note that `y` here contains both the current predictor, $\mathbf{x}_i$ and the response $\mathbf{y}$, so the
residuals `resids` have two columns, one for $x_i \,\vert\, \text{others}$
and one for $y \,\vert\, \text{others}$.

Added-variable plots are produced using `car::avPlot()` for one predictor or
`avPlots()` for any number of model terms. The `id` argument controls
which points are identified in the plots; `n=2` labels the two points
that are furthest from the mean on the horizontal axis _and_ the two with the largest
absolute residuals. For instance, in @fig-coffee-avPlots, observations 5 and 13 are flagged
because their conditional $\mathbf{x}_i^\star$ values are extreme; observation 17 has a
large absolute residual, $\mathbf{y}^\star = \text{Heart} \,\vert\, \text{others}$.


```{r}
#| label: fig-coffee-avPlots
#| out-width: "100%"
#| fig-cap: "Added-variable plots for Coffee and Stress in the multiple regression model"
avPlots(fit.both,
  ellipse = list(levels = 0.68, fill=TRUE, fill.alpha = 0.1),
  pch = 19,
  id = list(n = 2),
  cex.lab = 1.5,
  main = "Added-variable plots for Coffee data")
```

The data ellipses for $\mathbf{x}_i^\star$ and $\mathbf{y}^\star$ summarize the conditional
(or partial) relations of the response to each predictor controlling for all other predictors
in each plot.  The essential idea is that the data ellipse for $(\mathbf{x}_i^\star, \mathbf{y}^\star)$ has the identical relation to the estimate $\hat{b}_i$ in a
multiple regression as the data ellipse of $(\mathbf{x}, \mathbf{y})$ has to the slope in
a simple regression.

### Properties of AV plots
AV plots are particularly interesting and useful for the following noteworthy properties:

* The slope of the simple regression in the AV plot for variable $x_i$ is identical to the slope
$b_i$ for that variable in the full multiple regression model.

* The residuals in this plot are the same as the residuals using all predictors. This means you can
see the degree of fit for observations directly in relation to the various predictors, which is not
the case for marginal scatterplots.

* Consequentially, the standard deviation of the (vertical) residuals in the AV plot is the same as $s = \sqrt(MSE)$ in the full model and the standard error of a coefficient is $\text{SE}(b_i) = s / \sqrt{\Sigma (\mathbf{x}_i^\star)^2}$. This is shown by the size of the shadow of the data ellipses on
the vertical axis in @fig-coffee-avPlots.

* The horizontal positions, $\mathbf{x}_i^\star$, of points adjust for all other predictors, and so we can see points at the extreme left and right as unusual in relation to the others. If these points are also badly fitted (large residuals), we can see their influence on the fitted relation in the full model. AV plots thus provide visual displays of (partial) _leverage_ and _influence_ on each of the regression coefficients.

* The correlation of $\mathbf{x}_i^\star$ and $\mathbf{y}^\star$ shown by the shape of the data ellipses is the _partial correlation_ between $\mathbf{x}_i$ and $\mathbf{y}_i$
with other predictors partialled out.

### Marginal - conditional plots
The relation of the _conditional_ data ellipses in AV plots to those in _marginal_ plots of the same variables
provides further insight into what it means to "control for" other variables.
@fig-coffee-mcplot shows the same added-variable plots for Heart disease on Stress and Coffee
as in @fig-coffee-avPlots (with a zoomed-out scaling), but here we also overlay the
marginal data ellipses for $(\mathbf{x}_i, \mathbf{y})$ (centered at the means),
and marginal regression lines for Stress and Coffee separately. Drawing arrows connecting the
original data points to their positions in the AV plot shows what happens when we condition on
or partial out the other variable.

These _marginal - conditional plots_ are produced by `car::mcPlot()` (for one regressor)
and `car::mcPlots()` (for several). The plots for the marginal and conditional relations
can be compared separately using the same scales for both, or overlaid as shown here.
The points labeled here are only those with large absolute residuals $\mathbf{y}^\star$ in the vertical
direction.

```{r}
#| label: fig-coffee-mcplot
#| fig-width: 12
#| fig-height: 7
#| out-width: "100%"
#| fig-cap: Marginal $+$ conditional (added-variable) plots for Coffee and Stress in the multiple regression predicting Heart disease. Each panel shows the 68% conditional data ellipse for $x_i^\star, y^\star$ residuals (shaded, blue) as well as the marginal 68% data ellipse for the $(x_i, y)$ variables, shifted to the origin. Arrows connect the mean-centered marginal points (red) to the residual points (blue).
mcPlots(fit.both, 
  ellipse = list(levels=0.68, fill=TRUE, fill.alpha=0.2),
  id = list(n=2),
  pch = c(16, 16),
  col.marginal = "red", col.conditional = "blue",
  col.arrows = "black",
  cex.lab = 1.5)
```


The most obvious feature of @fig-coffee-mcplot
is that Coffee has a negative slope in the conditional AV plot but a positive slope in the marginal
plot. This is an example of Simpson's paradox in a regression context: marginal and conditional relations
can have opposite signs.
\index{Simpson's paradox}

Less obvious is the relation between the marginal and AVP ellipses. In 3D, the marginal
data ellipse is the shadow of the ellipsoid for $(\mathbf{y}, \mathbf{x}_1, \mathbf{x}_2)$ on one of the coordinate planes,
while the AV plot is a slice through the ellipsoid where either $\mathbf{x}_1$ or $\mathbf{x}_2$
is held constant.
Thus, the AVP ellipse must be contained in the marginal ellipse, as we can see in @fig-coffee-mcplot.
If there are only two $x$s, then the AVP ellipse must touch the marginal ellipse at two points.

Finally, @fig-coffee-mcplot also shows how conditioning on other predictors works for individual
observations, where each point of  $(\mathbf{x}_i^\star, \mathbf{y}^\star)$ is the image of $(\mathbf{x}_i, \mathbf{y})$ along the path of the marginal regression. The variability in the response and in the focal
predictor are both reduced, leaving only the uncontaminated relation of $\mathbf{y}$ with $\mathbf{x}_i$.

These plots are similar in spirit to the ARES plot ("Adding REgressors Smoothly") proposed
by @CookWeisberg-1994, but their idea was an interactive animation,
displaying a smooth transition between the fit of a marginal model and
the fit of a larger model. They used linear interpolation,
$$
(\mathbf{x}_i, \mathbf{y})_{\text{interp}} = (\mathbf{x}_i, \mathbf{y}) + \lambda [(\mathbf{x}_i^\star, \mathbf{y}^\star) - (\mathbf{x}_i, \mathbf{y})] \comma
$$
controlled by a slider
whose value, $\lambda \in [0, 1]$, was the weight given to the smaller marginal model.
See [this animation](https://www.datavis.ca/gallery/animation/duncanAV/) for an example
using the Duncan data.

### Prestige data

For a substantive example, let's return to the model for income, education and women in the
`Prestige` data. The plot in @fig-prestige-avplot shows the strong positive relations of
income and education to prestige in the full model, and the negligible relation of
percent women. But, in the plot for income, two occupations (physicians and general managers)
with high income strongly pull the regression line down from what can be seen in
the orientation of the conditional data ellipse.

```{r}
#| label: fig-prestige-avplot
#| fig-width: 10
#| fig-height: 10
#| out-width: "100%"
#| fig-cap: "Added-variable plot for the quantitative predictors in the `Prestige` data."
prestige.mod1 <- lm(prestige ~ education + income + women,
           data=Prestige)

avPlots(prestige.mod1, 
  ellipse = list(levels = 0.68),
  id = list(n = 2, cex = 1.2),
  pch = 19,
  cex.lab = 1.5,
  main = "Added-variable plots for prestige")
```

The influential points for physicians and general managers could just be unusual, or
suggest that the relation of income to prestige is nonlinear. A rough test of this
is to fit a smoothed curve through the points in the AV plot as shown in @fig-prestige-av-income.

```{r}
#| label: fig-prestige-av-income
#| fig-width: 7
#| fig-height: 6
#| out-width: "75%"
#| fig-cap: "Added-variable plot for income, with a loess smooth."
op <- par(mar=c(4, 4, 1, 0) + 0.5)
res <- avPlot(prestige.mod1, "income",
              ellipse = list(levels = 0.68),
              pch = 19,
              cex.lab = 1.5)
smooth <- loess.smooth(res[,1], res[,2])
lines(smooth, col = "red", lwd = 2.5)
```

However, this use of AV plots to diagnose nonlinearity or suggest transformations can be misleading 
[@Cook-1996]. Curvature in these plots is an indication of _some_ model deficiency,
but unless the predictors are uncorrelated, they cannot determine the form of a possible
transformation of the predictors.

### Component + Residual plots

A plot more suited to detecting the need to transform a predictor $\mathbf{x}_i$
to a form $f(\mathbf{x}_i)$ to make it's relationship with the response $\mathbf{y}$
more nearly linear is the _component + residual plot_ ("C+R plot", also called _partial residual plot_, @LarsenMcCleary:72; @Cook:93). This plot displays
the partial residual $\mathbf{e} + \hat{b}_i \mathbf{x}_i$ on the vertical axis
against $\mathbf{x}_i$ on the horizontal, where $\mathbf{e}$ are the residuals from the full model.
A smoothed curve through the points will often suggest the form of the transformation $f()$.
The fact that the horizontal axis is $\mathbf{x}_i$ itself rather than $\mathbf{x}^\star_i$
makes it easier to see the functional form.

The C+R plot has the same desirable properties as the AV plot: The slope $\hat{b}_i$
and residuals $\mathbf{e}$ in this plot are the same as those in the full model.

C+R plots are produced by `car::crPlots()` and `car::crPlot()`.
@fig-prestige-crplot shows this just for income in the model `prestige.mod1`.
(These plots for education and women show no strong evidence of curvilinearity.)
The dashed `r colorize("blue")` line is the linear _partial fit_, $\hat{b}_i \mathbf{x}_i$,
whose slope $\hat{b}_2 = 0.0013$ is the same as that for income in `prestige.mod1`.
The solid `r colorize("red")` curve is the loess smooth through the points.
The same points are identified as noteworthy as in AV plot in @fig-prestige-av-income.

```{r}
#| label: fig-prestige-crplot
#| fig-width: 7
#| fig-height: 6
#| out-width: "75%"
#| results: hide
#| fig-keep: all
#| fig-cap: !expr paste("Component + residual plot for income in the model for the quantitative predictors of prestige. The dashed", blue, "line is the partial linear fit for income. The solid", red, "curve is the loess smooth.")
crPlot(prestige.mod1, "income",
       smooth = TRUE,
       order = 2,
       pch = 19,
       col.lines = c("blue", "red"),
       id = list(n=2, cex = 1.2),
       cex.lab = 1.5) 
```


The partial relation between prestige and income is clearly curved, so it would be appropriate
to transform income or to include a polynomial (quadratic) term and refit
the model. As suggested earlier (@sec-prestige) it makes sense statistically and substantively
to model the effect of income on a log scale, so then the slope for `log(income)`
would measure the increment in prestige for a constant _percentage increase_ in income.

The effect of percent women on prestige seen in @fig-prestige-avplot appears very small
and essentially linear. However, if we wished to examine this more closely,
we could use the C+R plot in @fig-prestige-crplot-women.

```{r}
#| label: fig-prestige-crplot-women
#| fig-width: 7
#| fig-height: 6
#| out-width: "75%"
#| results: hide
#| fig-keep: all
#| fig-cap: "Component + residual plot for women in the model for the quantitative predictors of prestige."
crPlot(prestige.mod1, "women",
       pch = 19,
       col.lines = c("blue", "red"),
       id = list(n=2, cex = 1.2),
       cex.lab = 1.5)
```

This shows a slight degree of curvature, with modestly larger values in the extremes.
If we wished to test this statistically, we could fit a model with a quadratic effect of
women, and compare that to the linear-only effect using `anova()`.
```{r prestige-mod2}
prestige.mod2 <- lm(prestige ~ education + income + poly(women,2),
           data=Prestige)

anova(prestige.mod1, prestige.mod2)
```

This model ignores the `type` of occupation ("bc", "wc", "prof")
as well as any possible interactions of type with other predictors. We examine this next,
using effect displays.

## Effect displays


For two predictors it is possible, even if awkward,
to display the fitted response surface in a 3D plot or faceted 2D views
in what I call a _full model plot_.
For more than two predictors
such displays become cumbersome if not impractical, particularly when there are interactions
in the model, when some effects are curvilinear,
or when the main substantive interest is focused understanding on one or more
main effects or interaction terms in the presence of others.
The method of _effect displays_, largely introduced by John Fox [@Fox:87;@Fox:03:effects;@FoxWeisberg2018]
is a generally useful solution to this problem. [^effects]
These plots are nearly always easier to understand than
tables of coefficients.

[^effects]: Earlier, but less general expression of these ideas go back to the use of
**adjusted means** in analysis of covariance [@Fisher-1936]
or **least squares means** or **population marginal means**
in analysis of variance [@Searle-etal:80]

The idea of effect displays is quite simple, but very general and handles models of arbitrary complexity.
Imagine that in a model we have a particular subset of predictors (_focal predictors_) whose effects
on the response variable we wish to visualize. The essence of an effect display is that we calculate
the predicted values (and standard errors) of the response for the model term(s) involving the
focal predictors (and all low-order relatives, e.g, main effects that are marginal to an interaction)
as those predictors are allowed to vary over a grid covering their range.

For a given plot, the other, non-focal variables are "controlled"
by being fixed at typical values.
For example, a quantitative predictor could be fixed at it's mean, median or some representative value.
A factor could be fixed at equal proportions of its levels or its proportions in the data.
The result, when plotted, shows the predicted effects of the focal variables,
either with multiple lines or in a faceted display, but with all the other variables controlled,
adjusted for or averaged over. For interaction effects all low-order relatives are typically included
in the fitted values for the term being graphed.

In practical terms, a scoring matrix $\mathbf{X}^\bullet$ is defined by
the focal variables varied over their ranges and the other variables held
fixed.  The fitted values for a model term are then calculated as 
$\widehat{\mathbf{y}}^\bullet = \mathbf{X}^\bullet \; \widehat{\mathbf{b}}$
using the equivalent of:
```{r}
#| eval: false
predict(model, newdata = X, se.fit = TRUE) 
```
which also calculates the standard errors
as the square roots of
$\diag (\mathbf{X}^\bullet \, \widehat{\mathbf{\Var}} (\mathbf{b}) \, \mathbf{X}^{\bullet\mathsf{T}} )$
where $\widehat{\mathbf{\Var}} (\mathbf{b})$ is the estimated covariance matrix of the coefficients.
Consequently, predictor effect values can be obtained for _any_ modelling function that has
`predict()` and `vcov()` methods. To date, effect displays are available for
over 100 different model types in various packages.

To illustrate the mechanics for the effect of education in the `prestige.mod1` model,
construct a data frame varying education, but fixing income and women at their means:

```{r effect-calc}
X <- expand.grid(
      education = seq(8, 16, 2),
      income = mean(Prestige$income),
      women = mean(Prestige$women)) |> 
  print(digits = 3)
```

`predict()` then gives the fitted values for a simple effect plot of prestige against education.
`predict.lm()` returns list, so it is necessary to massage this to a data frame for graphing.
```{r}
pred <- predict(prestige.mod1, newdata=X, se.fit = TRUE)
cbind(X, fit = pred$fit, se = pred$se.fit) |> 
  print(digits=3)
```


As @FoxWeisberg2018 note, effect displays can be combined with partial residuals to visualize _both_
fit and potential lack of fit simultaneously, by plotting residuals from a model around
2D slices of the fitted response surface. This adds the benefits of C+R plots, in that we can
see the impact of unmodeled curvilinearity and interactions in addition to those of predictor effect displays.

There are several implementations of effect displays in R, whose details, terminology
and ease of use vary. Among these,
<!-- **ggeffects** [@R-ggeffects]  -->
`r pkg("ggeffects", cite=TRUE)`
calculates adjusted predicted values under several methods for conditioning.
<!-- **marginaleffects** [@R-marginaleffects]  -->
`r pkg("marginaleffects", cite=TRUE)`
is similar and also provides estimation of marginal slopes, contrasts, odds ratios, etc. Both have `plot()` methods based on `ggplot2`. 
My favorite is the 
<!-- **effects** [@R-effects]  -->
`r pkg("effects", cite=TRUE)`
package, which alone provides partial residuals,
and is somewhat easier to use, though it uses `r pkg("lattice")` graphics.
See the vignette [Predictor Effects Graphics Gallery](https://cran.r-project.org/web/packages/effects/vignettes/predictor-effects-gallery.pdf)
for details of the computations for effect displays.

The main functions for computing fitted effects are `predictorEffect()` (for one predictor) and
`predictorEffects()` (for one or more). For a model `mod` with formula `y ~ x1 + x2 + x3 + x1:x2`, the call to
`predictorEffects(mod, ~ x1)` recognizes that an interaction is present and
calculates the fitted values for combinations of `x1` and `x2`, holding `x3` fixed at its average value.
This returns an object of
class `"eff"` which can be graphed using the `plot.eff()` method.

The effect displays for several predictors can be plotted together,
as with `avplots()` (@fig-prestige-avplot) by including them in the plot formula,
e.g., `predictorEffects(mod, ~ x1 + x3)`.
Another function, `allEffects()` calculates the effects for each high-order term in the model,
so `allEffects(mod) |> plot()` is handy for getting a visual overview of a fitted model.

### Prestige data

To illustrate effect plots, I consider a more complex model, 
allowing a quadratic effect of women, representing income on a $\log_{10}$ scale,
and allowing this to interact with type of occupation.
`Anova()` provides the Type II tests of each of the model terms.
```{r prestige-mod3}
prestige.mod3 <- lm(prestige ~ education + poly(women,2) +
                       log10(income)*type, data=Prestige)

# test model terms
Anova(prestige.mod3)
```

The fitted coefficients, standard errors and $t$-tests from `coeftest()` are shown below.
The coefficient for education means that an
increase of one year of education, holding other predictors fixed,
gives an expected increase of 2.96 in prestige.
The other coefficients are more difficult to understand.
For example, the effect of women is represented by
two coefficients for the linear and quadratic components of `poly(women, 2)`.
The interpretation of coefficients of terms involving `type` 
depend on the contrasts used. Here, with the default treatment contrasts,
they represent comparisons with `type = "bc"` as the reference level.
It is not obvious how to understand the interaction effects like
`log10(income):typeprof`.

<!-- Note that in this model, the linear effect of `women` is highly significant -->
<!-- and the quadratic effect is nearly so. -->
<!-- The coefficient for `log10(income) = 40.33` means that multiplying income by -->
<!-- 10 gives an expected increase of over 40 in the prestige of an occupation. -->
<!-- With `type = "bc"` as the reference level, the coefficient for -->
<!-- `typeprof` means that professional occupations are rated 74.3 higher -->
<!-- than blue collar ones; the interaction coefficient for -->
<!-- `log10(income):typeprof` pertains to the difference in slope -->
<!-- for professional vs. blue collar. -->


```{r prestige-mod3-coef}
lmtest::coeftest(prestige.mod3)
```

It is easiest to produce effect displays for all terms in the model using `allEffects()`,
accepting all defaults. This gives (@fig-prestige-allEffects) effect plots for the main effects of education and income and the interaction of income with type, with the non-focal variables held fixed.
Each plot shows the fitted regression relation and a default 95% pointwise confidence band using the
standard errors. Rug plots at the bottom show the locations of observations for the horizontal 
focal variable, which is useful when the observations are not otherwise plotted.

```{r}
#| label: fig-prestige-allEffects
#| fig-height: 8
#| fig-width: 8
#| out-width: "100%"
#| fig-cap: Predictor effect plot for all terms in the model with 95% confidence bands.
allEffects(prestige.mod3) |>
  plot()
```

The effect for women, holding education, income and type constant looks to be quite strong and
curved upwards.
But note that these plots use different vertical scales for prestige in each plot and the
range in the plot for women is much smaller than in the others.
The interaction is graphed showing separate curves for the three levels of type.

For a more detailed look, it is useful to make separate plots for the predictors in the model,
which allows customizing options for calculation and display.
Partial residuals for the observations are computed by using `residuals = TRUE` in the call
to `predictorEffects()`. The slope of the fitted line (in `r colorize("blue")`)
is exactly coefficient for education in the full model.
As with C+R plots, a smooth loess curve (in `r colorize("red")`) gives a visual assessment of
linearity for a given predictor.
A wide variety of graphing options are available in the call to `plot()`.
@fig-prestige-effplot-educ shows the effect display for education with partial residuals
and point identification of those points with the largest Mahalanobis distances from the centroid.


<!-- #| fig-cap: Predictor effect plot for education displaying partial residuals. The blue line shows the slice of the fitted regression surface where other variables are held fixed. The red curve shows a loess smooth of the partial residuals. -->

```{r}
#| label: fig-prestige-effplot-educ
#| out-width: "80%"
#| fig-cap: !expr paste("Predictor effect plot for education displaying partial residuals. The", blue, "line shows the slice of the fitted regression surface where other variables are held fixed. The", red, "curve shows a loess smooth of the partial residuals.")
lattice::trellis.par.set(par.xlab.text=list(cex=1.5),
                         par.ylab.text=list(cex=1.5))

predictorEffects(prestige.mod3, ~ education,
                 residuals = TRUE) |>
  plot(partial.residuals = list(pch = 16, col="blue"),
       id=list(n=4, col="black")) 
```

The effect plot for women in this model is shown in @fig-prestige-effplot-women.
This uses the same vertical scale as in @fig-prestige-effplot-educ,
showing a more modest effect of percent women.
```{r}
#| label: fig-prestige-effplot-women
#| out-width: "80%"
#| fig-cap: Predictor effect plot for women with partial residuals
predictorEffects(prestige.mod3, ~women,
                 residuals = TRUE) |>
  plot(partial.residuals = list(pch = 16, col="blue", cex=0.8),
       id=list(n=4, col="black"))
```

<!--
for income:

```{r}
#| label: fig-prestige-effplot-inc
#| out-width: "80%"
#| fig-cap: Predictor effect plot for income ...
predictorEffects(prestige.mod3, ~ income,
                 confidence.level = 0.68) |>
  plot(lines=list(multiline=TRUE, lwd=3),
       confint=list(style="bands"),
       key.args = list(x=.7, y=.35)) 
```
-->


Because of the interaction with `type`, the fitted effects for `income` are calculated
for the three types of occupation. It is easiest to compare these in the a single
plot (using `multiline = TRUE`),
rather than in separate panels as in @fig-prestige-allEffects.
Income is represented as `log10(income)` in the model `prestige.mod3`, and it is also easier to
understand the interaction by plotting income on a log scale, using the `axes` argument to
specify a transformation of the $x$ axis. I use 68% confidence bands here to make the differences
among type more apparent.

```{r}
#| label: fig-prestige-effplot-inc-log
#| out-width: "80%"
#| fig-cap: Predictor effect plot for income, plotted on a log scale.
predictorEffects(prestige.mod3, ~ income,
                 confidence.level = 0.68) |>
  plot(lines=list(multiline=TRUE, lwd=3),
       confint=list(style="bands"),
       axes=list(
          x=list(income=list(transform=list(trans=log, inverse=exp)))),
       key.args = list(x=.7, y=.35)) 
```

@fig-prestige-effplot-inc-log provides a clear interpretation of the interaction, represented by the coefficients shown above for
`log10(income):typewc` and `log10(income):typeprof` in the model. Averaging over three occupation types,
prestige increases linearly with log income with a coefficient of 40.33. This means that increasing income by 10% (say)
gives an increase of $40.33 / 10 = 4.033$ in prestige. The slope for professional workers
is less steep: the coefficient for  `log10(income):typeprof` is -17.725.
For these workers compared with blue collar jobs, prestige increases 1.77 less with a 10% increase in income. The difference in slopes for blue collar and white collar jobs is negligible.


<!-- ## Outliers, leverage influence -->
<!-- was: r child="child/06-leverage.qmd" -->

## Outliers, leverage and influence {#sec-leverage}

In small to moderate samples, "unusual" observations can have dramatic effects on a fitted
regression model, as we saw in the analysis of Davis's data on
reported and measured weight (@sec-davis) where one erroneous observations hugely
altered the fitted line. As well, it turns out that two observations in Duncan's data
are unusual enough that removing them alters his conclusion that income and education
have nearly equal effects on occupational prestige.

An observation can be unusual in three archetypal ways, with different consequences:

* Unusual in the response $y$, but typical in the predictor(s), $\mathbf{x}$ --- a badly fitted case with a large absolute residual, but with $x$ not far from the mean, as in @fig-ch02-davis-reg2. This case does not do much harm to the fitted model.

* Unusual in the predictor(s) $\mathbf{x}$, but typical in $y$ --- an otherwise well-fitted point. This case also does litle
harm, and in fact can be considered to improve precision, a "good leverage" point.

* Unusual in **both** $\mathbf{x}$ and $y$ --- This is the case, a "bad leverage" point, revealed in the analysis of Davis's data, @fig-ch02-davis-reg1, where the one erroneous point for women was highly influential, pulling the regression line towards it and affecting the estimated coefficient as well as all the fitted values. In addition, subsets of observations can be _jointly_ influential, in that their effects combine, or can mask each other's influence.

Influential cases are the ones that matter most. As suggested above, to be influential an observation must be
unusual in **both** $\mathbf{x}$ and $y$, and affects the estimated coefficients, thereby also altering the
predicted values for all observations.
A heuristic formula capturing the relations among leverage, "outlyingness" on $y$ and influence is

$$
\text{Influence}_{\text{coefficients}} \;=\; X_\text{leverage} \;\times\; Y_\text{residual}
$$
As described below, leverage is proportional to the squared distance $(x_i - \bar{x})^2$ of an observation $x_i$
from its mean in simple regression and to the squared Mahalanobis distance
in the general case. The $Y_\text{residual}$ is best measured by a _studentized_ residual,
obtained by omitting each case $i$ in turn and calculating its residual from the coefficients obtained
from the remaining cases.

### The leverage-influence quartet {#sec-lev-inf-quartet}

These ideas can be illustrated in the "leverage-influence quartet" by considering a standard simple linear regression
for a sample and then adding one additional point reflecting the three situations described above.
Below, I generate a sample of $N = 15$ points with $x$ uniformly distributed between (40, 60)
and $y \sim 10 + 0.75 x + \mathcal{N}(0, 1.25^2)$, duplicated four times.

```{r levdemo1}
library(tidyverse)
library(car)
set.seed(42)
N <- 15
case_labels <- paste(1:4, c("OK", "Outlier", "Leverage", "Influence"))
levdemo <- tibble(
	case = rep(case_labels, 
	           each = N),
	x = rep(round(40 + 20 * runif(N), 1), 4),
	y = rep(round(10 + .75 * x + rnorm(N, 0, 1.25), 4)),
	id = " "
)

mod <- lm(y ~ x, data=levdemo)
coef(mod)
```

The additional points, one for each situation are set to the values below.

* `Outlier`: (52, 60) a low leverage point, but an outlier (`O`) with a large residual
* `Leverage`: (75, 65) a "good" high leverage point (`L`) that fits well with the regression line
* `Influence`: (70, 40) a "bad" high leverage point (`OL`) with a large residual.

```{r levdemo2}
extra <- tibble(
  case = case_labels,
  x  = c(65, 52, 75, 70),
  y  = c(NA, 65, 65, 40),
  id = c("  ", "O", "L", "OL")
)

#' Join these to the data
both <- bind_rows(levdemo, extra) |>
  mutate(case = factor(case))
```

We can plot these four situations with `ggplot2` in panels faceted by `case` as shown below.
The standard version of this plot shows the regression line for the 
`r colorize("original data", "blue")` and that for the `r colorize("ammended data", "red")`
with the additional point. Note that we use the `levdemo` dataset in `geom_smooth()` for the
regression line with the original data, but specify `data = both` for that with the additional point.

```{r}
#| label: fig-levdemo
#| out-width: 100%
#| fig-cap: !expr paste('Leverage influence quartet with data 50% ellipses. Case (1) original data; (2) adding one low-leverage outlier, "O"; (3) adding one "good" leverage point, "L"; (4) adding one "bad" leverage point, "OL". The dashed', blue, 'line is the fitted line for the original data, while the solid', red, 'line reflects the additional point. The data ellipses show the effect of the additional point on precision.')
ggplot(levdemo, aes(x = x, y = y)) +
  geom_point(color = "blue", size = 2) +
  geom_smooth(data = both, 
              method = "lm", formula = y ~ x, se = FALSE,
              color = "red", linewidth = 1.3, linetype = 1) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE,
              color = "blue", linewidth = 1, linetype = "longdash" ) +
  stat_ellipse(data = both, level = 0.5, color="blue", type="norm", linewidth = 1.4) +
  geom_point(data=extra, color = "red", size = 4) +
  geom_text(data=extra, aes(label = id), nudge_x = -2, size = 5) +
  facet_wrap(~case, labeller = label_both) +
  theme_bw(base_size = 14)
```


The standard version of this graph shows only the fitted regression lines in each panel.
As can be seen, the fitted line doesn't change very much in panels (2) and (3); only
the bad leverage point, "OL" in panel (4) is harmful.
Adding data ellipses to each panel immediately makes it clear that there is another
part to this story--- the effect of the unusual point on _precision_ (standard errors)
of our estimates of the coefficients.

Now, we see _directly_ that there is a big difference in impact between
the low-leverage outlier [panel (2)] and the high-leverage, small-residual case [panel (3)],
even though their effect on coefficient estimates is negligible.
In panel (2), the single outlier inflates the estimate of residual variance (the size of the
vertical slice of the data ellipse at $\bar{x}$), while in panel (3) this is decreased.

To allow direct comparison and make the added value of the data ellipse more apparent, we overlay the data ellipses from
@fig-levdemo in a single graph, shown in @fig-levdemo2.
Here, we can also see why the high-leverage
point "L" (added in panel (c) of @fig-levdemo) is called a "good leverage" point.
By increasing the standard deviation of $x$, it makes the data ellipse somewhat more elongated,
giving increased precision of our estimates of $\vec{\beta}$. 

```{r}
#| label: fig-levdemo2
#| out-width: 80%
#| code-fold: true
#| fig-cap: "Data ellipses in the Leverage-influence quartet. This graph overlays the data ellipses
#|     and additional points from the four panels of @fig-levdemo2. It can be seen that only the
#|     OL point affects the slope, while the O and L points affect precision of the estimates in opposite
#|     directions."
colors <- c("black", "blue", "darkgreen", "red")
with(both,
     {dataEllipse(x, y, groups = case, 
          levels = 0.68,
          plot.points = FALSE, add = FALSE,
          center.pch = "+",
          col = colors,
          fill = TRUE, fill.alpha = 0.1)
     })

case1 <- both |> filter(case == "1 OK")
points(case1[, c("x", "y")], cex=1)

points(extra[, c("x", "y")], 
       col = colors,
       pch = 16, cex = 2)

text(extra[, c("x", "y")],
     labels = extra$id,
     col = colors, pos = 2, offset = 0.5)
```

#### Measuring leverage

Leverage is thus an index of the _potential_ impact of an observation on the model due to its'
atypical value in the X space of the predictor(s). It is commonly measured by the "hat" value, $h_i$,
so called because it puts the hat $\mathbf{\widehat{(\bullet)}}$ on $\mathbf{y}$, i.e., the vector of fitted values can 
be expressed as

\begin{align*}
\mathbf{\hat{y}} 
   &= \underbrace{\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}}_\mathbf{H}\mathbf{y} \\[2ex]
   &= \mathbf{H\:y} \period
\end{align*}


<!-- \begin{align*} -->
<!-- \hat{\mathbf{y}} & =  \mathbf{H} \mathbf{y} \\ -->
<!--                  & =  [\mathbf{X} (\mathbf{X}^\mathsf{T} \mathbf{X})^{-1} \mathbf{X}^\mathsf{T}] \; \mathbf{y} \; , -->
<!-- \end{align*} -->


Here, $h_i \equiv h_{ii}$ are the diagonal elements of the Hat matrix $\mathbf{H}$. In simple regression,
hat values are proportional to the squared distance of the observation $x_i$ from the mean,
$h_i \propto (x_i - \bar{x})^2$,
$$
h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\Sigma_i (x_i - \bar{x})^2} \; ,
$$ {#eq-hat-univar}

and range from $1/n$ to 1, with an average value $\bar{h} = 2/n$. Consequently, observations with
$h_i$ greater than $2 \bar{h}$ or $3 \bar{h}$ are commonly considered to be of high leverage.

With $p \ge 2$ predictors, an analogous relationship holds, but the correlations among the
predictors must be taken into account.
It is demonstrated below that $h_i \propto D^2 (\mathbf{x} - \bar{\mathbf{x}})$,
the Mahalanobis squared distance of $\mathbf{x}$ from the centroid $\bar{\mathbf{x}}$[^1].

The generalized version of @eq-hat-univar is

$$
h_i = \frac{1}{n} + \frac{1}{n-1} D^2 (\mathbf{x} - \bar{\mathbf{x}}) \; ,
$$ {#eq-hat-multivar}

where $D^2 (\mathbf{x} - \bar{\mathbf{x}}) = (\mathbf{x} - \bar{\mathbf{x}})^\mathsf{T} \mathbf{S}_X^{-1} (\mathbf{x} - \bar{\mathbf{x}})$. From @sec-data-ellipse, it follows that contours of constant leverage correspond to
data ellipses or ellipsoids of the predictors in $\mathbf{x}$, whose boundaries, assuming normality, correspond to quantiles of the $\chi^2_p$ distribution

[^1]: See this [Stats StackExchange discussion](https://bit.ly/45x2T0Q) for a proof.

**Example**:

To illustrate @eq-hat-multivar, I generate $N = 100$ points from a bivariate normal distribution with means
$\mu = (30, 30)$, variances = 10, and a correlation $\rho = 0.7$ and add two noteworthy points
that show an apparently paradoxical result.

```{r hatval1}
set.seed(421)
N <- 100
r <- 0.7
mu <- c(30, 30)
cov <- matrix(c(10,   10*r,
                10*r, 10), ncol=2)

X <- MASS::mvrnorm(N, mu, cov) |> as.data.frame()
colnames(X) <- c("x1", "x2")

# add 2 points
X <- rbind(X,
           data.frame(x1 = c(28, 38),
                      x2 = c(42, 35)))
```

The Mahalanobis squared distances of these points can be calculated using `heplots::Mahalanobis()`,
and their corresponding hatvalues found using `hatvalues()`
for any linear model using both `x1` and `x2`.

```{r hatval2}
X <- X |>
  mutate(Dsq = heplots::Mahalanobis(X)) |>
  mutate(y = 2*x1 + 3*x2 + rnorm(nrow(X), 0, 5),
         hat = hatvalues(lm(y ~ x1 + x2))) 

```

Plotting `x1` and `x2` with data ellipses shows the relation of leverage to squared distance from the mean.
The `r colorize("blue", "blue")` point looks to be farther from the `r colorize("mean", "green")`, but 
the `r colorize("red", "red")` point is actually very much further by Mahalanobis squared distance, which takes the correlation into account; it thus has much greater leverage.


```{r echo=-1}
#| label: fig-hatvalues-demo1
#| fig-show: "hold"
#| fig-cap: !expr paste("Data ellipses for a bivariate normal sample with correlation 0.7, and two additional noteworthy points. The", blue, "point looks to be farther from the mean, but the", red, "point is actually more than 5 times further by Mahalanobis squared distance, and thus has much greater leverage.")
par(mar = c(4, 4, 1, 1) + 0.1)
dataEllipse(X$x1, X$x2, 
            levels = c(0.40, 0.68, 0.95),
            fill = TRUE, fill.alpha = 0.05,
            col = "darkgreen",
            xlab = "X1", ylab = "X2")
points(X[1:nrow(X) > N, 1:2], pch = 16, 
       col=c("red", "blue"), cex = 2)
X |> slice_tail(n = 2) |>      # last two rows
  points(pch = 16, col=c("red", "blue"), cex = 2)
```

The fact that hatvalues are proportional to leverage can be seen by plotting one against the other.
I highlight the two noteworthy points in their colors from @fig-hatvalues-demo1 to illustrate
how much greater leverage the `r colorize("red", "red")` point has compared to the `r colorize("blue", "blue")`
point.

```{r}
#| label: fig-hatvalues-demo2
#| fig-cap: "Hat values are proportional to squared Mahalanobis distances from the mean."
plot(hat ~ Dsq, data = X,
     cex = c(rep(1, N), rep(2, 2)), 
     col = c(rep("black", N), "red", "blue"),
     pch = 16,
     ylab = "Hatvalue",
     xlab = "Mahalanobis Dsq")
```

Look back at these two points in @fig-hatvalues-demo1. Can you guess how much further the `r colorize("red", "red")`
point is from the mean than the `r colorize("blue", "blue")` point? You might be surprised that its'
$D^2$ and leverage are about five times as great!

```{r hatval3}
X |> slice_tail(n=2)
```

#### Outliers: Measuring residuals

From the discussion in @sec-leverage, outliers for the response $y$ are those observations for which the
residual $e_i = y_i - \hat{y}_i$ are unusually large in magnitude. However, as demonstrated in @fig-levdemo,
a high-leverage point will pull the fitted line towards it, reducing its' residual and thus making them
look less unusual.

The standard approach [@CookWeisberg:82; @HoaglinWelsch1978] is to consider a _deleted residual_ $e_{(-i)}$, conceptually as that
obtained by re-fitting the model with observation $i$ omitted and obtaining the fitted value $\hat{y}_{(-i)}$
from the remaining $n-1$ observations,
$$
e_{(-i)} = y_i - \hat{y}_{(-i)} \; .
$$
The (externally) _studentized residual_ is then obtained by dividing $e_{(-i)}$ by it's estimated standard error, giving
$$
e^\star_{(-i)} = \frac{e_{(-i)}}{\text{sd}(e_{(-i)})} = \frac{e_i}{\sqrt{\text{MSE}_{(-i)}\; (1 - h_i)}} \; .
$$

This is just the ordinary residual $e_i$ divided by a factor that increases with the residual variance
but decreases with leverage.  It can be shown that these studentized residuals follow a $t$ distribution
with $n - p -2$ degrees of freedom, so a value $|e^\star_{(-i)}| > 2$ can be considered large enough to
pay attention to.  
In practice for classical linear models, it is unnecessary to actually re-fit the model $n$ times.
@VellemanWelsh:81 show that all these leave-one-out quantities can be calculated from the
model fitted to the full data set and the hat (projection) matrix
$\mathbf{H} = (\mathbf{X}\trans \mathbf{X})^{-1} \mathbf{X}\trans$ from which $\widehat{\mathbf{b}} = \mathbf{H} \mathbf{y}$.


#### Measuring influence

As described at the start of this section, the actual influence of a given case depends multiplicatively
on its' leverage and residual. But how can we measure it?

The essential idea introduced above, is to 
delete the observations one at a time, each time refitting the regression model on the remaining $n–1$ observations.
Then, for observation $i$ compare the results using all $n$ observations to those with the $i^{th}$ observation deleted to see how much influence the observation has on the analysis. 

The simplest such measure, called DFFITS, compares the predicted value for case $i$ with what would be obtained
when that observation is excluded.

\begin{align*}
\text{DFFITS}_i & = & \frac{\hat{y}_i - \hat{y}_{(-i)}}{\sqrt{\text{MSE}_{(-i)}\; h_i}} \\
   & = & e^\star_{(-i)} \times \sqrt{\frac{h_i}{1-h_i}} \;\; .
\end{align*}

The first equation gives the signed difference in fitted values in units of the standard deviation of that difference weighted by leverage;
the second version [@Belsley-etal:80] represents that as a product of residual and leverage.
A rule of thumb is that an observation is deemed to be influential if $| \text{DFFITS}_i | > 2 \sqrt{(p+1) / n}$.

Influence can also be assessed in terms of the change in the estimated coefficients
$\mathbf{b} = \widehat{\mathbf{\beta}}$ versus their values $\mathbf{b}_{(-i)}$
when case $i$ is removed.
Cook's distance, $D_i$, summarizes the size of the difference as a weighted sum of squares
of the differences $\mathbf{d} =\mathbf{b} - \mathbf{b}_{(-i)}$ [@Cook:77].

$$
D_i = \mathbf{d}\trans \, (\mathbf{X}\trans \mathbf{X}) \,\mathbf{d} / (p+1) \hat{\sigma}^2
$$
This can be re-expressed in terms of the components of residual and leverage

$$
D_i = \frac{e^{\star 2}_{(-i)}}{p+1} \times \frac{h_i}{(1- h_i)}
$$

Cook's distance is in the metric of an $F$ distribution with $p$  and $n − p$  degrees of freedom,
so values $D_i > 4/n$ are considered large.

### Influence plots

The most common plot to detect influence is a bubble plot of the studentized residuals
versus hat values, with the size (area) of the plotting symbol proportional to Cook's $D$.
These plots are constructed using `car::influencePlot()` which also fills the
bubble symbols with color whose opacity is proportional to Cook's $D$.

This is shown in @fig-levdemo-infl for the demonstration dataset
constructed in @sec-lev-inf-quartet. In this plot, notable cutoffs for hatvalues
at $2 \bar{h}$ and $3 \bar{h}$ are shown by dashed vertical lines
and horizontal cutoffs for studentized residuals are shown at values of $\pm 2$.


The demonstration data of @sec-lev-inf-quartet has four copies of the same
$(x, y)$ data, three of which have an unusual observation.
The influence plot in @fig-levdemo-infl subsets the data to give
the $19 = 15 + 4$ unique observations, including
the three unusual cases.
As can be seen, the high "Leverage" point has 
has less influence than the point labeled "Influence", which has moderate
leverage but a large absolute residual.


```{r}
#| label: fig-levdemo-infl
#| code-fold: true
#| code-summary: See the code
#| fig-cap: Influence plot for the demonstration data. The areas of the bubble symbols are proportional to Cook's $D$. The impact of the three unusual points on Cook's $D$ is clearly seen.
once <- both[c(1:16, 62, 63, 64),]      # unique observations
once.mod <- lm(y ~ x, data=once)
inf <- influencePlot(once.mod, 
                     id = list(cex = 0.01),
                     fill.alpha = 0.5,
                     cex.lab = 1.5)
# custom labels
unusual <- bind_cols(once[17:19,], inf) |> 
  print(digits=3)
with(unusual, {
  casetype <- gsub("\\d ", "", case)
  text(Hat, StudRes, label = casetype,
       pos = c(4, 2, 3), cex=1.5)
})
```


### Duncan data {#sec-duncan-influence}

Let's return to the `Duncan` data used as an example in @sec-example-duncan where a few points stood out as unusual in the basic diagnostic plots (@fig-duncan-plot-model).
The influence plot in @fig-duncan-infl helps to make sense of these noteworthy observations.
The default method for identifying points in `influencePlot()`
labels points with any of large studentized residuals, hat-values or Cook's distances.

```{r}
#| label: fig-duncan-infl
#| fig-cap: Influence plot for the model predicting occupational prestige in Duncan's data. Cases with large studentized residuals, hat-values or Cook's distances are labeled.
inf <- influencePlot(duncan.mod, id = list(n=3),
                     cex.lab = 1.5)
```

`influencePlot()` returns (invisibly) the influence diagnostics for the cases identified in the
plot. It is often useful to look at data values for these cases to understand why each of these
was flagged.

```{r}
merge(Duncan, inf, by="row.names", all.x = FALSE) |> 
  arrange(desc(CookD)) |> 
  print(digits=3)
```

* _minister_ has by far the largest influence, because it has an extremely positive residual and a large hat value. Looking at the data, we see that ministers have very low income, so their prestige is under-predicted. The large hat value reflects the fact that ministers have low income combined with very high education.

* _conductor_ has the next largest Cook's $D$. It has a large hat value because its combination of relatively high income and low education is unusual in the data.

* Among the others, _reporter_ has a relatively large negative residual---its prestige is far less than the model predicts---but its low leverage make it not highly influential.
_railroad engineer_ has an extremely large hat value because its income is very high in relation to education. But this case is well-predicted and has a small residual, so its leverage is not large.

### Influence in added-variable plots

The properties of added-variable plots discussed in @sec-avplots make them also useful for understanding why cases are influential because they control for other predictors in each plot, and therefore
show the _partial_ contributions of each observation to hat values and residuals. As a consequence, we
can see directly the how individual cases become individually or jointly influential.

The Duncan data provides a particularly instructive example of this. @fig-duncan-av-influence
shows the AV plots for both income and education in the model `duncan.mod`, with some annotations
added. I want to focus here on the _joint_ influence of the occupations minister and conductor
which were seen to be the most influential in @fig-duncan-infl. The `r colorize("green")`
vertical lines show their residuals in each panel and the `r colorize("red")` lines show the regressions
when these two observations are deleted.

The basic AV plots are produced using the call to `avPlots()` below. 
To avoid clutter, I use the argument `id = list(method = "mahal", n=3)` so that
only the three points with the greatest Mahalanobis distances from the centroid in each plot
are labeled. These are the cases with the largest leverage seen in @fig-duncan-infl.

```{r}
#| label: duncan-av-infl
#| eval: false
avPlots(duncan.mod,
  ellipse = list(levels = 0.68, fill = TRUE, fill.alpha = 0.1),
  id = list(method = "mahal", n=3),
  pch = 16, cex = 0.9,
  cex.lab = 1.5)
```


```{r}
#| label: fig-duncan-av-influence
#| echo: false
#| out-width: "100%"
#| fig-cap: Added variable plots for the Duncan model, highlighting the impact of the observations for minister and conductor in each plot. The green lines show the residuals for these observations. The red line in each panel shows the regression line omitting these observations.
knitr::include_graphics("images/duncan-av-influence.png")
```

The two cases---minister and conductor---are the most highly influential, but as we can see in 
@fig-duncan-av-influence their influence combines because they are at opposite sides of the horizontal axis and their residuals are of opposite signs. They act together to decrease the slope for income and increase that for education.

```{r}
#| label: duncan-av-infl-code
#| eval: false
#| code-fold: true
#| code-summary: Code for income AV plot
res <- avPlot(duncan.mod, "income",
              ellipse = list(levels = 0.68),
              id = list(method = "mahal", n=3),
              pch = 16,
              cex.lab = 1.5) |>
  as.data.frame()
fit <- lm(prestige ~ income, data = res)
info <- cbind(res, fitted = fitted(fit), 
             resids = residuals(fit),
             hat = hatvalues(fit),
             cookd = cooks.distance(fit))

# noteworthy points in this plot
big <- which(info$cookd > .20)
with(info, {
  arrows(income[big], fitted[big], income[big], prestige[big], 
         angle = 12, length = .18, lwd = 2, col = "darkgreen")
  })

# line w/o the unusual points
duncan.mod2 <- update(duncan.mod, subset = -c(6, 16))
bs <- coef(duncan.mod2)["income"]
abline(a=0, b=bs, col = "red", lwd=2)
```

Duncan's hypothesis that the slopes for income and education were equal thus fails when these
two observations are deleted. The slope for income then becomes 2.6 times that of education.

```{r}
#| label: duncan.mod2
duncan.mod2 <- update(duncan.mod, subset = -c(6, 16))
coef(duncan.mod2)
```


**Package summary**

```{r}
#| echo: false
#| results: asis
#cat("**Packages used here**:\n\n")
write_pkgs(file = .pkg_file)
```

<!-- ## References {.unnumbered} -->