```{r include=FALSE}
source("R/common.R")
knitr::opts_chunk$set(fig.path = "figs/ch05/")
```
::: {.content-visible unless-format="pdf"}
{{< include latex/latex-commands.qmd >}}
:::
# Overview of Linear models {#sec-linear-models}
Although this book is primarily about multivariate models, it is useful to have an overview of the range of available
techniques for univariate response models
to see their uses and to appreciate how easily univariate models generalize to multivariate ones.
Hence, this chapter reviews the characteristics of the standard univariate methods for explaining or predicting
a single outcome variable from a set of predictors.
The key ideas are:
* For a single quantitative outcome variable $\mathbf{y}$, methods of linear regression and analysis of variance (ANOVA) are subsumed within the single framework of the **general linear model** (GLM). Regression and ANOVA differ only in that the predictors in the former are quantitative, while those in the latter are discrete factors. They can all be fit using
`lm()`.
* These models extend directly to the multivariate case of $q > 1$ outcomes, $\mathbf{Y} = (\mathbf{y}_1, \mathbf{y}_2, \dots \mathbf{y}_q)$, and are also fit using `lm()`.
* A binary outcome $y = (0, 1)$ and categorical outcomes like marital status ("never married", "married", "separated", "divorced") can be handled within a different extension, the **generalized** linear model, as logistic regression
or multinomial regression, fit using `glm()` (or, for multinomial models, functions such as `nnet::multinom()`).
* All of these models involve linear combinations of predictors (weighted sums) fit to optimize some criterion, for example minimizing some function of the residuals or maximizing some measure of fit.
* Models and data can be more easily understood with graphics, and many statistical ideas have a visual representation in geometry.
@fig-techniques summarizes a variety of methods for linear models, classified by number of predictors and number of response variables, and whether these are quantitative vs. discrete. For the present purposes, the key columns are the first two, for the case
of one or more quantitative outcome variables.
<!-- **TODO**: This should be a table, but that's not working -->
```{r}
#| label: fig-techniques
#| echo: false
#| out-width: "100%"
#| fig-cap: "Techniques for linear models classified by number of predictors and number of response variables, and whether these are quantitative vs. discrete"
knitr::include_graphics("images/techniques-table.png")
```
When the _predictors_ are also quantitative, simple regression ($p=1$) generalizes to multivariate regression with two or more outcomes ($q > 1$). For example, we might want to predict weight and body mass index _jointly_ from a person's height.
The situation is more interesting when there are $p>1$ predictors. The most common multivariate generalization is
_multivariate multiple regression_ (MMRA), where each outcome is regressed on the predictors, as if done separately for each outcome,
but using multivariate tests that take the correlations among the outcomes into account.
Other methods for this case include canonical correlation analysis, which tries to explain all relations between $\mathbf{Y}$ and
a set of $\mathbf{x}$s through maximally correlated linear combinations of each.
When the predictor variables are all discrete or categorical, such as gender or level of education, methods like the simple $t$-test,
one-way ANOVA and factorial ANOVA with $q=1$ outcome measures all have simple extensions to the case of $q>1$ outcomes.
Not shown in @fig-techniques are rows for cases where the predictor variables include
a mixture of quantitative variables and discrete factors. These have two "flavors", depending on our emphasis.
* **Analysis of covariance** (ANCOVA) is the
term for models where the main emphasis is on comparing the means of different groups,
but where there are one or more quantitative predictors (for example, a pre-test score
on the outcome variable) that need to be taken into account, so we can assess the differences
among groups _controlling_ or _adjusting_ for those predictors, which are called
_covariates_. (Strictly speaking, ANCOVA models typically assume that the regression
slopes are the same for all groups.)
* **Homogeneity of regression** treats the same type of data,
but the main interest is in the regression relation between $\mathbf{y}$ and a set of
quantitative $\mathbf{x}$s when there are also one or more grouping factors.
In this case, we want to determine whether the _same regression_ relation applies in all groups
or whether the groups differ in their intercepts and/or slopes.
Another term for this class of questions is **moderator** effects, which refers to
interactions between the quantitative predictors and the group factors.
::: {.callout-note title="History Corner"}
Why are there so many different names for regression vs. ANOVA concepts, statistics and techniques? In regression, we use notation like $x_1, x_2, ...$ to refer to _predictors_ in a model, while in ANOVA, factors $A, B, ...$ are called _main effects_. In regression applications, we often test _linear hypotheses_, are interested in _coefficients_ and
evaluate a model with an $R^2$ statistic, while in ANOVA we may test _contrasts_ among factor levels, and use $F$-tests
to evaluate models.
Well, like twins separated at birth, they grew up in homes in different places and with different parents, who were each free to choose their own names, not recognizing their shared DNA.
Methods of regression began in evolutionary biology with Francis Galton's
[-@Galton:1886; -@Galton:1889] studies of heritability of traits, trying to understand how
strongly the physical characteristics of one generation of living things resembled those in the next generation.
From a study of the diameters of sweet peas in parent plants and their size in the next generation, and another on the relationship between heights of human parents and their offspring, he developed the fundamental ideas
of regression. Karl Pearson [-@Pearson:1896] later gave these ideas a rigorous mathematical treatment, developing the product-moment correlation and the general theory of regression.
In contrast, analysis of variance methods were raised on farms, notably the Rothamsted Experimental Station, where R. A. Fisher
analyzed vast amounts of data on crop experiments designed to determine the conditions (soil condition, fertilizer treatments) that gave the greatest yields, while controlling for extraneous determinants such as the plots used for planting.
With multiple factors determining the outcome,
Fisher [-@Fisher1923], in an experiment on yields of different varieties of potatoes
given various manure treatments,
devised the method of breaking down the total variance into portions attributable to each factor
and presented the first ANOVA table. The method became well-known after Fisher's [-@Fisher:25]
_Statistical Methods for Research Workers_.
The great synthesis of regression and ANOVA did not take place until the 1960s.
At that time, methods for computing were beginning to move from programmable desk calculators
to mainframe computers, largely using a collection of separate FORTRAN programs, designed for
regression, one-way ANOVA, two-way ANOVA, models with interactions, and so forth.
To complete an analysis, as a graduate student I often had to use three or more different programs.
Then, something remarkable happened on two fronts: theory and computation.
First, in quick succession, textbooks by @Scheffe1960, @Graybill1961, @Winer1962 and others
began to lay out a general theory of linear models that encompassed all of these separate models,
giving the "General Linear Model" a well-deserved name.
Second, two symposia, one at IBM Yorktown Heights (@IBM1965) and the other at the University of Georgia (@BradshawFindley1967),
resulted in the first general programs to handle all these cases in an understandable way.
A bit of matrix algebra thrown into the mix showed that most of the ideas for univariate models
could be extended to multiple response variables, and so the "Multivariate Linear Model" was born.
R. Darrell Bock [@Bock1963;@Bock1964] sketched a flowchart of the computational steps, which was implemented at the University of Chicago
by Jeremy Finn [-@Finn1967] in the MULTIVARIANCE program. A group at the University of North Carolina
headed by Elliot Cramer developed their MANOVA program [@Clyde-etal-1966] and Wilfrid Dixon [-@Dixon1965] at UCLA developed the BMD programs incorporating these ideas.
The ANOVA and regression twins had finally become part of a larger family.
<!--
...
Graybill, Winer, Bock ... -> Symposium 1967
Birth of the general linear models : https://files.eric.ed.gov/fulltext/ED026737.pdf
Bock1966: flowchart -> Jeremy Finn (1967) MULTIVARIANCE
Elliot Cramer (UNC) -> Clyde, Cramer & Shering (1966) MANOVA
Dixon, UCLA (1965): BMD
-->
:::
```{r child="child/05-linear-combinations.qmd"}
```
## The General Linear Model {#sec-GLM}
To establish notation and terminology, it is worthwhile to state the general linear model formally.
For convenience, I use vector and matrix notation.
This expresses a response variable, $\mathbf{y} = (y_1, y_2, \dots , y_n)^\mathsf{T}$ for $n$ observations, as a sum of terms involving $p$ _regressors_, $\mathbf{x}_1, \mathbf{x}_2, \dots , \mathbf{x}_p$, each of length $n$.
<!-- $$ -->
<!-- \begin{align*} -->
<!-- \mathbf{y} & = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_p \mathbf{x}_p + \mathbf{\epsilon} \\ -->
<!-- & = \left[ \mathbf{1},\; \mathbf{x}_1,\; \mathbf{x}_2,\; \dots ,\; \mathbf{x}_p \right] \; \boldsymbol{\beta} + \boldsymbol{\epsilon} \\ -->
<!-- & = \mathbf{X}_{n \times (p+1)}\; \boldsymbol{\beta}_{(p+1) \times 1} + \boldsymbol{\epsilon} -->
<!-- \end{align*} -->
<!-- $$ -->
$$
\begin{aligned}
\mathbf{y} & = \beta_0 + \beta_1 \mathbf{x}_1 + \beta_2 \mathbf{x}_2 + \cdots + \beta_p \mathbf{x}_p + \boldsymbol{\epsilon} \\
 & = \left[ \mathbf{1},\; \mathbf{x}_1,\; \mathbf{x}_2,\; \dots ,\; \mathbf{x}_p \right] \; \boldsymbol{\beta} + \boldsymbol{\epsilon}
\end{aligned}
$$ {#eq-glm}
<!-- & = \LARGE{\sizedmat{X}{n \times (p+1)}\; \sizedmat{\boldsymbol{\beta}}{(p+1) \times 1} + \boldsymbol{\epsilon}} -->
or, expressed in matrices,
$$
\Large{\sizedmat{y}{n \times 1} = \sizedmat{X}{n \times (p+1)}\; \sizedmat{\boldsymbol{\beta}}{(p+1) \times 1} + \boldsymbol{\epsilon}}
$$
The matrix $\mathbf{X}$ is called the _model matrix_ and contains the numerical representations of the predictor variables called _regressors_. The essential thing about a linear model
is that it is linear in the **parameters** $\beta_i$. That is,
the predicted value of $\mathbf{y}$ is a linear combination of some
$\mathbf{x}_i$ with weights $\beta_i$. An example of a nonlinear model is
the exponential growth model, $y = \beta_0 \, e^{\beta_1 x}$, where the parameter $\beta_1$ appears in an exponent.[^logs]
[^logs]: Taking logarithms of both sides would yield the linear model, $\log(y) = c + \beta_1 x$, where $c = \log(\beta_0)$.
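
To make the matrix form concrete, the following minimal sketch (with data simulated purely for illustration) shows that the familiar least-squares solution of the normal equations, $\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\mathsf{T}\mathbf{X})^{-1}\,\mathbf{X}^\mathsf{T}\mathbf{y}$, reproduces the coefficients that `lm()` computes.

```{r glm-matrix-demo}
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 0.5 * x1 - 1.2 * x2 + rnorm(n)

X <- cbind(Intercept = 1, x1, x2)           # model matrix with a column of 1s
b <- drop(solve(t(X) %*% X, t(X) %*% y))    # solve the normal equations
cbind(normal_eqns = b, lm = coef(lm(y ~ x1 + x2)))
```
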
The regressors making up the columns of $\mathbf{X}$ can arise in several ways:
* They can be quantitative variables like `age`, `salary` or `years` of education. But they can also be transformed versions, like `sqrt(age)` or `log(salary)`.
* A quantitative variable can be represented by more than one model regressor, for example if it is expressed
as a polynomial like `poly(age, degree=2)` or a natural spline like `ns(salary, df=5)`. The model matrix portion
for such terms contains one column for each degree of freedom (`df`) and there are `df` coefficients in the corresponding portion of
$\boldsymbol{\beta}$.
* A categorical or discrete predictor -- a factor variable in R --
with $d$ levels is expressed as $d - 1$ columns in $\mathbf{X}$. Typically these are contrasts or comparisons
between a baseline or reference level and each of the remaining ones, but any set of $d - 1$ linearly
independent contrasts can be used by assigning to `contrasts(factor)`. For example,
`contrasts(factor) <- contr.treatment(4)` for a 4-level factor assigns 3 contrasts representing comparisons
with a baseline level, typically the first (in alphabetic order). For an _ordered_ factor, such as one
for political knowledge with levels "low", "medium", "high", `contr.poly()` returns the coefficients
of orthogonal polynomial contrasts representing linear and quadratic trends. (A small example follows this list.)
* Interactions between predictors are represented as the direct products of the corresponding columns of $\mathbf{X}$.
This allows the effect of one predictor on the response to depend on values of other predictors.
For example, the interaction of two quantitative variables, $\mathbf{x}_1, \mathbf{x}_2$ is represented by the product
$\mathbf{x}_1 \times \mathbf{x}_2$. More generally, for variables or
factors $A$ and $B$ with degrees of freedom $\text{df}_A$ and $\text{df}_B$ the regressors in $\mathbf{X}$ are the $\text{df}_A \times \text{df}_B$
products of each column for $A$ with each column for $B$.
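
To see what these contrast codings look like, here is a small illustration using the built-in contrast functions (the 4-level and 3-level factors are hypothetical):

```{r contrasts-demo}
# Treatment contrasts for a hypothetical 4-level factor:
# each column compares one level with the baseline (first) level
contr.treatment(4)
# Orthogonal polynomial contrasts for a hypothetical 3-level ordered factor:
# the columns give linear (.L) and quadratic (.Q) trend coefficients
contr.poly(3)
```
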
### Model formulas {#sec-model-formulas}
Statistical models in R, such as those fit by `lm()`, `glm()` and many other modeling functions, are expressed in a simple notation developed by @WilkinsonRogers1973
for the GENSTAT software system at the Rothamsted Research Station. It provides a compact way
to specify any model composed of quantitative variables and discrete factors, their interactions, and arbitrary transformations of these.
In this notation, a **model formula** takes forms such as
```{r eval=FALSE}
response ~ terms
response ~ term1 + term2 + ...
```
where the left-hand side, `response`, specifies the response variable in the model,
and the right-hand side specifies the `terms` in the model, which define the columns of the $\mathbf{X}$ matrix of @eq-glm; the coefficients $\boldsymbol{\beta}$ are implied and not represented explicitly in the formula.
The notation `y ~ x` is read as "`y` _is modeled by_ `x`".
The left-hand side is usually a variable name (such as `height`), but it could be an expression that
evaluates to the response, such as `log(salary)` or `weight/height^2`, which represents the
body mass index.
On the right-hand side (RHS), the usual arithmetic operators, `+, -, *, /, ^`, have special
meanings as described below. The most fundamental is that `y ~ a + b`
is interpreted as "`y` is modeled by `a` _and_ `b`"; that is, the sum of linear terms for `a` and `b`.
Some examples for regression-like models using only quantitative variables,
`x, x1, x2, x3, ...` are shown below:
```{r eval=FALSE}
y ~ x # simple linear regression
y ~ x - 1 # no intercept: regression through the origin
y ~ x + I(x^2) # quadratic model
y ~ poly(x, 3) # cubic model
y ~ x1 * x2 # crossing: x1 + x2 + x1 : x2
y ~ x1 + x2 + x3 # multiple regression
y ~ (x1 + x2 + x3)^2 # all main effects & all two-way interactions
log(y) ~ x1 + poly(x, 2) # arbitrary transformation of response
y1 + y2 ~ x1 + x2 + x3 # response is sum of y1 and y2
```
The intercept $\beta_0$ is automatically included in the model without the need to specify it
explicitly. The minus sign, `-`, on the right-hand side removes terms from the model,
so a model with no intercept ($\beta_0 = 0$) can be specified as `y ~ x - 1` (or perhaps more
naturally, `y ~ 0 + x`).
Function calls on the RHS, such as `poly(x, 3)`, are evaluated directly, but to use an
arithmetic operator such as `^` in its ordinary sense, the term must be "protected"
by wrapping it in `I()`, meaning "identity" or "inhibit". Thus, the model
`y ~ x + I(x^2)` means the quadratic model $y = \beta_0 + \beta_1 x + \beta_2 x^2$.
This differs from the model `y ~ poly(x, 2)` in that the former uses the raw `x, x^2` values
(which are typically highly correlated),
while `poly()` converts these to orthogonal polynomial scores, which are uncorrelated
(and therefore free from problems of collinearity).
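
A small numerical check, using an arbitrary positive `x` just for illustration, shows the difference:

```{r poly-demo}
x <- 1:20
cor(x, x^2)                      # raw terms: very highly correlated
P <- poly(x, 2)                  # orthogonal polynomial scores
round(cor(P[, 1], P[, 2]), 10)   # essentially zero
```
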
Factor variables are treated specially in linear models, but have simple
notations in R formulas. The following examples use `A, B, C` to represent
discrete factors with two or more levels.
```{r eval=FALSE}
y ~ A # one-way ANOVA
y ~ A + B # two-way, main effects only
y ~ A * B # full two-way, with interaction
y ~ A + B + A:B # same, in long-hand
y ~ x + A # one-way ANCOVA
y ~ (A + B + C)^2 # three-way ANOVA, incl. all two-way interactions
```
#### Crossing {.unnumbered}
The `*` operator specifies the crossing of variables and factors, and `:` specifies interactions (products of variables).
So, the model `y ~ x1 * x2` is expanded to give `y ~ x1 + x2 + x1:x2`
and the interaction term `x1:x2` is calculated as $x_1 \times x_2$.
In algebraic notation (omitting the error term) this works out to the model,
\begin{eqnarray*}
y & = & \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 \, x_1 x_2 \\
  & = & \beta_0 + (\beta_1 + \beta_3 x_2) \, x_1 + \beta_2 x_2 \comma \\
\end{eqnarray*}
which means that the coefficient for $x_1$ in the model is not constant for all values of
$x_2$, but rather changes with the value of $x_2$. If $\beta_3 > 0$, the slope for
$x_1$ increases with $x_2$ and vice-versa.
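
A quick check with a couple of made-up vectors shows the columns that `*` generates; the `x1:x2` column is just the elementwise product:

```{r crossing-demo}
x1 <- c(1, 2, 3, 4)
x2 <- c(2, 0, 1, 3)
model.matrix(~ x1 * x2)   # columns: (Intercept), x1, x2, x1:x2
```
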
`y ~ A * B` for factors is similar, expanding to `y ~ A + B + A:B`, but the columns in the model matrix
represent contrasts among the factor levels as described above. The main effects, `A` and `B`,
come from contrasts among the means of their factor levels, and the interaction term `A:B`
reflects _differences_ among the means of `A` across the levels of `B` (and vice-versa).
The model formula `y ~ x + A` specifies an ANCOVA model with different intercepts for the levels of `A`,
but with a common slope for `x`. Adding the interaction `x:A` in the model `y ~ x * A`
allows separate slopes and intercepts for the groups.
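
As a small illustration (with data simulated here purely for this purpose), the two models can be compared directly; the test of the `x:A` term asks whether separate slopes are needed:

```{r ancova-demo}
set.seed(7)
A <- factor(rep(c("g1", "g2"), each = 20))
x <- runif(40, 0, 10)
y <- ifelse(A == "g1", 2 + 1.0 * x, 5 + 0.5 * x) + rnorm(40)

m1 <- lm(y ~ x + A)   # parallel lines: common slope, different intercepts
m2 <- lm(y ~ x * A)   # separate intercepts and slopes
anova(m1, m2)         # does allowing separate slopes improve the fit?
```
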
#### Powers {.unnumbered}
The `^` exponent operator indicates _powers of a term expression_ to a specified degree. Thus the term
`(A + B)^2` is identical to `(A + B) * (A + B)` which expands to the main effects of `A`, `B` and
their interaction, also identical to `A * B`. In general, the product of parenthesized terms
expands as in ordinary algebra,
```{r eval=FALSE}
y ~ (A + B) * (C + D) -> A + B + C + D + A:C + A:D + B:C + B:D
```
Powers get more interesting with more terms, so `(A + B + C)^2` is the same as `(A + B + C) * (A + B + C)`,
which includes main effects of `A`, `B` and `C` as well as all two-way interactions,
`A:B, A:C` and `B:C`.
The model formula `(A + B + C)^3` expands to include all two-way interactions and the three-way
interaction `A:B:C`.
```{r eval=FALSE}
(A + B + C)^3 -> A + B + C + A:B + A:C + B:C + A:B:C
```
In this context, `-` can be used to remove terms, as shown in the following examples:
```{r eval=FALSE}
(A + B + C)^2 <-> (A + B + C)^3 - A:B:C
(A + B + C)^3 - B:C - A:B:C <-> A + B + C + A:B + A:C
```
Finally, the symbol `.` on the right-hand side specifies _all terms_ in the current dataset other than the
response. Thus if you have a data.frame containing `y, x1, x2, ..., x6`, you can specify a model
with all variables except `x6` as predictors as
```{r eval=FALSE}
y ~ . - x6
```
To test what we've covered above:
* What do you think the model formula `y ~ .^2` means in a data set containing variables `x1, x2, x3,` and `x4`?
* What about the formula `y ~ .^2 - A:B:C:D` with factors `A, B, C, D`?
You can work out questions like these or explore model formulas
using `terms()` applied to a `"formula"` object.
The labels of these terms can then be concatenated into a string and turned back into a formula using `as.formula()`:
```{r terms-formula}
f <- formula(y ~ (x1 + x2 + x3 + x4)^2)
term_labels <- attr(terms(f), "term.labels")
term_labels |> paste(collapse = " + ")
# convert back to a formula
as.formula(sprintf("y ~ %s", paste(term_labels, collapse = " + ")))
```
### Model matrices
As noted above, a model formula is used to generate the $n \times (p+1)$ model matrix, $\mathbf{X}$, typically containing the column of 1s for the intercept $\beta_0$ in the model, followed by
$p$ columns representing the regressors $\mathbf{x}_1, \mathbf{x}_2, \dots , \mathbf{x}_p$.
Internally, `lm()` uses `stats::model.matrix()` and you can use this to explore how factors,
interactions and other model terms are represented in a model object.
For a small example, here are a few observations representing income (`inc`) and `type` of occupation.
`model.matrix()` takes a _one-sided_ formula with the terms on the right-hand side.
The main effect model looks like this:
```{r model-mat1}
set.seed(42)
inc <- round(runif(n=9, 20, 40))
type <- rep(c("bc", "wc", "prof"), each = 3)
mm <- model.matrix(~ inc + type)
data.frame(type, mm)
```
As you can see, `type`, with 2 degrees of freedom, is represented by two dummy (0/1) variables,
`typeprof` and `typewc`. Together, these represent _treatment contrasts_ (comparisons) between the
_baseline_ group `type=="bc"`, which is coded (0, 0), and each of the others: `type=="prof"`,
coded (1, 0), and `type=="wc"`, coded (0, 1).
By default, the baseline level is the first in alphabetic or numerical order, but
you can change this using `relevel()`.
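
For example, a sketch of how `relevel()` changes the coding, continuing with the hypothetical `inc` and `type` values above:

```{r model-mat-relevel}
type2 <- relevel(factor(type), ref = "prof")   # make "prof" the baseline level
head(model.matrix(~ inc + type2), 3)           # dummies now compare "bc" and "wc" to "prof"
```
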
In a model with the interaction `inc * type`, additional columns are constructed as the product
of `inc` with each of the columns for `type`.
```{r model-mat2}
model.matrix(~ inc * type)
```
### Contrasts {#sec-contrasts}
## Regression
## ANOVA
## ANCOVA
## Regression trees
<!-- ## References {.unnumbered} -->
**Package summary**
```{r}
#| echo: false
#| results: asis
cat("**Packages used here**:\n\n")
write_pkgs(file = .pkg_file)
```