---
title: 03b - Geocentric Models
author: "Abdullah Mahmood"
date: "last-modified"
format:
  html:
    theme: cosmo # united is darker
    css: style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: "Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace"
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 3
    toc: true
    toc-location: right
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    html-math-method:
      method: mathjax
      url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
editor: source
jupyter: main
---
### Imports
```{python}
from init import *
```
```{python}
d = pd.read_csv("data/Howell1.csv", sep=';')
d2 = d[d['age'] >= 18]
```
## Linear Prediction
What we’ve done above is a Gaussian model of height in a population of adults. But it doesn’t really have the usual feel of “regression” to it. Typically, we are interested in modeling how an outcome is related to some other variable, a predictor variable. If the predictor variable has any statistical association with the outcome variable, then we can use it to predict the outcome. When the predictor variable is built inside the model in a particular way, we’ll have linear regression.
So now let’s look at how height in these Kalahari foragers (the outcome variable) covaries with weight (the predictor variable). We’ll reconsider this example later from a more causal perspective. Right now, let's focus on the mechanics of estimating an association between two variables.
Go ahead and plot adult height and weight against one another:
```{python}
def plot_scatter():
    plt.plot(d2['weight'], d2['height'], 'o', fillstyle='none')
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")
utils.inline_plot(plot_scatter)
```
There’s obviously a relationship: Knowing a person’s weight helps you predict height.
To make this vague observation into a more precise quantitative model that relates values of `weight` to plausible values of `height`, we need some more technology. How do we take our Gaussian model from the previous section and **incorporate predictor variables**?
------------------------------------------------------------------------
#### What is "regression"?
Many diverse types of models are called “regression.” The term has come to mean using one or more predictor variables to model the distribution of one or more outcome variables. The original use of the term, however, arose from anthropologist Francis Galton’s (1822–1911) observation that the sons of tall and short men tended to be more similar to the population mean, hence **regression to the mean**.
The causal reasons for regression to the mean are diverse. In the case of height, the causal explanation is a key piece of the foundation of population genetics. But this phenomenon arises statistically whenever individual measurements are assigned a common distribution, leading to shrinkage as each measurement informs the others. In the context of Galton’s height data, attempting to predict each son’s height on the basis of only his father’s height is folly. Better to use the population of fathers. This leads to a prediction for each son which is similar to each father but “shrunk” towards the overall mean. Such predictions are routinely better. This same regression/shrinkage phenomenon applies at higher levels of abstraction and forms one basis of multilevel modeling.
------------------------------------------------------------------------
### The Linear Model Strategy
The strategy is to make the parameter for the mean of a Gaussian distribution, $μ$, into a linear function of the predictor variable and other, new parameters that we invent. This strategy is often simply called the **linear model**. The linear model strategy instructs the model to assume that the **predictor variable has a constant and additive relationship to the mean of the outcome**. The model then computes the posterior distribution of this constant relationship.
What this means, recall, is that the machine considers every possible combination of the parameter values. With a linear model, some of the parameters now stand for the strength of association between the mean of the outcome, $μ$, and the value of some other variable. For each combination of values, the machine computes the posterior probability, which is a measure of relative plausibility, given the model and data. So the posterior distribution ranks the infinite possible combinations of parameter values by their logical plausibility. As a result, the posterior distribution provides relative plausibilities of the different possible strengths of association, given the assumptions you programmed into the model. We ask the model: “Consider all the lines that relate one variable to the other. Rank all of these lines by plausibility, given these data.” The model answers with a posterior distribution.
Here’s how it works, in the simplest case of only one predictor variable. We’ll wait until the next chapter to confront more than one predictor. Recall the basic Gaussian model:
$$\begin{align*}
h_{i} &\sim \text{Normal}(\mu, \sigma)\tag{Likelihood}\\
\mu &\sim \text{Normal}(178,20)\tag{μ prior} \\
\sigma &\sim \text{Uniform}(0,50)\tag{σ prior} \\
\end{align*}$$
Now how do we get weight into a Gaussian model of height? Let $x$ be the name for the column of weight measurements, `d2['weight']`. Let the average of the $x$ values be $\bar{x}$, “ex bar”. Now we have a predictor variable $x$, which is a list of measures of the same length as $h$. To get weight into the model, we define the mean $μ$ as a function of the values in $x$. This is what it looks like, with explanation to follow:
$$\begin{align*}
h_{i} &\sim \text{Normal}(\mu_i, \sigma)\tag{Likelihood}\\
\mu_i &= \alpha + \beta(x_i - \bar{x})\tag{Linear Model}\\
\alpha &\sim \text{Normal}(178,20)\tag{α prior} \\
\beta &\sim \text{Normal}(0,10)\tag{β prior} \\
\sigma &\sim \text{Uniform}(0,50)\tag{σ prior} \\
\end{align*}$$
#### Probability of the data
Let’s begin with just the probability of the observed height, the first line of the model. This is nearly identical to before, except now there is a little index $i$ on the $μ$ as well as the $h$. You can read $h_i$ as “each $h$” and $μ_i$ as “each $μ$.” The mean $μ$ now depends upon unique values on each row $i$. So the little $i$ on $μ_i$ indicates that the mean depends upon the row.
#### Linear model
The mean $μ$ is no longer a parameter to be estimated. Rather, as seen in the second line of the model, $μ_i$ is constructed from other parameters, $α$ and $β$, and the observed variable $x$. This line is not a stochastic relationship—there is no $\sim$ in it, but rather an $=$ in it—because the definition of $μ_i$ is deterministic. That is to say that, once we know $α$ and $β$ and $x_i$, we know $μ_i$ with certainty.
The value $x_i$ is just the weight value on row $i$. It refers to the same individual as the height value, $h_i$, on the same row. The parameters $α$ and $β$ are more mysterious. Where did they come from? We made them up. The parameters $μ$ and $σ$ are necessary and sufficient to describe a Gaussian distribution. But $α$ and $β$ are instead devices we invent for manipulating $μ$, allowing it to vary systematically across cases in the data.
You’ll be making up all manner of parameters as your skills improve. One way to understand these made-up parameters is to think of them as targets of learning. Each parameter is something that must be described in the posterior distribution. So when you want to know something about the data, you ask your model by inventing a parameter for it. This will make more and more sense as you progress. Here’s how it works in this context. The second line of the model definition is just:
$$\mu_i = \alpha + \beta(x_i - \bar{x})$$
What this tells the regression golem is that you are asking two questions about the mean of the outcome.
1) What is the expected height when $x_i = \bar{x}$? The parameter $α$ answers this question, because when $x_i = \bar{x}$, $μ_i = α$. For this reason, $α$ is often called the **intercept**. But we should think not in terms of some abstract line, but rather in terms of the meaning with respect to the observable variables.
2) What is the change in expected height, when $x_i$ changes by 1 unit? The parameter $β$ answers this question. It is often called a “**slope**,” again because of the abstract line. Better to think of it as a rate of change in expectation.
Jointly these two parameters ask the model to find a line that relates $x$ to $h$, a line that passes through $α$ when $x_i = \bar{x}$ and has slope $β$. That is a task that models are very good at. It’s up to you, though, to be sure it’s a good question.
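To make these two questions concrete, here is a toy calculation (the parameter values are made up for illustration, not estimates): when $x_i = \bar{x}$ the mean is just $α$, and each additional kilogram shifts it by $β$.
```{python}
# Toy illustration -- alpha and beta are made-up values, not posterior estimates
xbar = d2['weight'].mean()
alpha, beta = 178.0, 0.9
for x_i in (xbar, xbar + 1, xbar + 2):
    print(f"x = {x_i:.1f} kg -> mu = {alpha + beta * (x_i - xbar):.1f} cm")
```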
------------------------------------------------------------------------
#### Nothing special or natural about linear models
Note that there’s nothing special about the linear model, really. You can choose a different relationship between $α$ and $β$ and $μ$. For example, the following is a perfectly legitimate definition for $μ_i$:
$$\mu_i = \alpha \space \text{exp}(-\beta x_i)$$
This does not define a linear regression, but it does define a regression model. The linear relationship we are using instead is conventional, but nothing requires that you use it. It is very common in some fields, like ecology and demography, to use functional forms for $μ$ that come from theory, rather than the geocentrism of linear models. Models built out of substantive theory can dramatically outperform linear models of the same phenomena. We’ll revisit this point later.
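To see the shape such a model implies, here is a quick sketch of that exponential mean function, using made-up parameter values (my illustration, not part of the original text):
```{python}
def plot_exp_mu():
    # Illustrative (made-up) values for mu = alpha * exp(-beta * x)
    alpha, beta = 178.0, 0.02
    x = np.linspace(30, 65, 100)
    plt.plot(x, alpha * np.exp(-beta * x), 'k')
    plt.xlabel("Weight (kg)")
    plt.ylabel(r"$\mu$")

utils.inline_plot(plot_exp_mu)
```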
------------------------------------------------------------------------
#### Units and regression models
Readers who had a traditional training in physical sciences will know how to carry units through equations of this kind. For their benefit, here’s the model again (omitting priors for brevity), now with units of each symbol added.
$$\begin{align*}
h_{i}\text{cm} &\sim \text{Normal}(\mu_i\text{cm}, \sigma\text{cm})\\
\mu_i\text{cm} &= \alpha\text{cm} + \beta\frac{\text{cm}}{\text{kg}}(x_i \text{kg} - \bar{x}\text{kg})\\
\end{align*}$$
So you can see that $β$ must have units of cm/kg in order for the mean $μ_i$ to have units of cm. One of the facts that labeling with units clears up is that a parameter like $β$ is a kind of rate—centimeters per kilogram. There’s also a tradition called *dimensionless analysis* that advocates constructing variables so that they are unit-less ratios. In this context, for example, we might divide height by a reference height, removing its units. Measurement scales are arbitrary human constructions, and sometimes the unit-less analysis is more natural and general.
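As a minimal sketch of that idea (my own illustration): divide height by its sample mean, so a value of 1.1 reads as “10% taller than average” on any measurement scale.
```{python}
# Unit-less height: a ratio relative to the sample mean (illustrative only)
h_ratio = d2['height'] / d2['height'].mean()
h_ratio.describe()
```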
------------------------------------------------------------------------
#### Priors
The remaining lines in the model define distributions for the **unobserved variables**. These variables are commonly known as parameters, and their distributions as priors. There are three parameters: $α$, $β$, and $σ$. You’ve seen priors for $α$ and $σ$ before, although $α$ was called $μ$ back then.
The prior for $β$ deserves explanation. Why have a Gaussian prior with mean zero? This prior places just as much probability below zero as it does above zero, and when $β=0$, weight has no relationship to height. To figure out what this prior implies, we have to simulate the prior predictive distribution. There is no other reliable way to understand.
The goal is to simulate heights from the model, using only the priors. First, let’s consider a range of weight values to simulate over. The range of observed weights will do fine. Then we need to simulate a bunch of lines, the lines implied by the priors for $α$ and $β$. Here’s how to do it:
We'll plot the lines corresponding to 100 pairs of $α$ and $β$ values:
```{python}
def plot_prior_pred():
    N = 100  # 100 lines
    x = np.linspace(np.min(d2['weight']), np.max(d2['weight']), 100)
    xbar = d2['weight'].mean()
    a = stats.norm.rvs(loc=178, scale=20, size=N)
    b = stats.norm.rvs(loc=0, scale=10, size=N)
    fig, axs = plt.subplots(ncols=2, figsize=(9, 4))
    for i in range(N):
        axs[0].plot(x, a[i] + b[i] * (x - xbar), color="black", alpha=0.2, linewidth=0.5)
    axs[0].set_title(r'$\beta \sim \text{Normal}(0,10)$')
    # Sample from new prior
    a = stats.norm.rvs(loc=178, scale=20, size=N)
    b = stats.lognorm.rvs(scale=np.exp(0), s=1, size=N)
    for i in range(N):
        axs[1].plot(x, a[i] + b[i] * (x - xbar), color="black", alpha=0.2, linewidth=0.5)
    axs[1].set_title(r'$\text{log}(\beta) \sim \text{Normal}(0,1)$')
    for ax in axs:
        ax.axhline(272, color="black", linewidth=0.5)
        ax.axhline(0, color="black", linestyle="--", linewidth=0.5)
        ax.set(xlabel="Weight", ylabel="Height", ylim=(-100, 400))
    fig.tight_layout()
utils.inline_plot(plot_prior_pred)
```
Prior predictive simulation for the height and weight model. Left: Simulation using the $β \sim \text{Normal}(0,10)$ prior. Right: A more sensible $log(β)\sim \text{Normal}(0,1)$ prior.
For reference, I’ve added a dashed line at zero—no one is shorter than zero—and the “Wadlow” line at 272 cm for the world’s tallest person. The pattern doesn’t look like any human population at all. It essentially says that the relationship between weight and height could be absurdly positive or negative. Before we’ve even seen the data, this is a bad model. Can we do better?
We know that average height increases with average weight, at least up to a point. Let’s try restricting it to positive values. The easiest way to do this is to define the prior as **Log-Normal** instead.
Defining $β$ as Log-Normal(0,1) means to claim that the logarithm of $β$ has a Normal(0,1) distribution. Plainly:
$$\beta \sim \text{Log-Normal}(0,1)$$
SciPy provides `stats.lognorm.rvs` for sampling from a log-normal distribution and `stats.lognorm.pdf` for evaluating its density. We can simulate this relationship to see what this means for $\beta$:
```{python}
def plot_dens():
    b = stats.lognorm.rvs(s=1, scale=np.exp(0), size=int(1e4))
    bw = utils.bw_nrd0(b)
    az.plot_kde(b, bw=bw*0.1)
    plt.xlabel(r"$\beta$")
    plt.ylabel("Density")
    plt.xlim(-0.5, 5)
utils.inline_plot(plot_dens)
```
If the logarithm of $β$ is normal, then $β$ itself is strictly positive. The reason is that exp(x) is greater than zero for any real number $x$. This is the reason that Log-Normal priors are commonplace. They are an easy way to enforce positive relationships. So what does this earn us? Do the prior predictive simulation again, now with the Log-Normal prior:
```{python}
N = 100
a = stats.norm.rvs(loc=178, scale=20, size=N)
b = stats.lognorm.rvs(scale=np.exp(0), s=1, size=N)
```
Plotting as before produces the right-hand plot above. This is much more sensible. There is still a rare impossible relationship. But nearly all lines in the joint prior for $α$ and $β$ are now within human reason.
We’re fussing about this prior, even though as you’ll see in the next section there is so much data in this example that the priors end up not mattering. We fuss for two reasons. First, there are many analyses in which no amount of data makes the prior irrelevant. In such cases, non-Bayesian procedures are no better off. They also depend upon structural features of the model. Paying careful attention to those features is essential. Second, thinking about the priors helps us develop better models, maybe even eventually going beyond geocentrism.
------------------------------------------------------------------------
#### What’s the correct prior?
People commonly ask what the correct prior is for a given analysis. The question sometimes implies that for any given set of data, there is a uniquely correct prior that must be used, or else the analysis will be invalid. This is a mistake. There is no more a uniquely correct prior than there is a uniquely correct likelihood. Statistical models are machines for inference. Many machines will work, but some work better than others. Priors can be wrong, but only in the same sense that a kind of hammer can be wrong for building a table.
In choosing priors, there are simple guidelines to get you started. Priors encode states of information before seeing data. So priors allow us to explore the consequences of beginning with different information. In cases in which we have good prior information that discounts the plausibility of some parameter values, like negative associations between height and weight, we can encode that information directly into priors. When we don’t have such information, we still usually know enough about the plausible range of values. And you can vary the priors and repeat the analysis in order to study how different states of initial information influence inference. Frequently, there are many reasonable choices for a prior, and all of them produce the same inference. And conventional Bayesian priors are conservative, relative to conventional non-Bayesian approaches. We’ll see how this conservatism arises later.
Making choices tends to make novices nervous. There’s an illusion sometimes that default procedures are more objective than procedures that require user choice, such as choosing priors. If that’s true, then all “objective” means is that everyone does the same thing. It carries no guarantees of realism or accuracy.
------------------------------------------------------------------------
#### Prior predictive simulation and *p*-hacking
A serious problem in contemporary applied statistics is “*p*-hacking,” the practice of adjusting the model and the data to achieve a desired result. The desired result is usually a *p*-value less than 5%. The problem is that when the model is adjusted in light of the observed data, then *p*-values no longer retain their original meaning. False results are to be expected. We don’t pay any attention to *p*-values in this book. But the danger remains, if we choose our priors conditional on the observed sample, just to get some desired result. The procedure we’ve performed here is to choose priors conditional on pre-data knowledge of the variables—their constraints, ranges, and theoretical relationships. This is why the actual data are not shown in the earlier section. We are judging our priors against general facts, not the sample. We’ll look at how the model performs against the real data next.
------------------------------------------------------------------------
### Finding The Posterior Distribution
The code needed to approximate the posterior is a straightforward modification of the kind of code you’ve already seen. All we have to do is incorporate our new model for the mean into the model specification inside the Stan code and be sure to add a prior for the new parameter, $β$. Let’s repeat the model definition:
$$\begin{align*}
h_{i} &\sim \text{Normal}(\mu_i, \sigma)\\
\mu_i &= \alpha + \beta(x_i - \bar{x})\\
\alpha &\sim \text{Normal}(178,20)\\
\beta &\sim \text{Log-Normal}(0,1)\\
\sigma &\sim \text{Uniform}(0,50)\\
\end{align*}$$
```{python}
m4_3 = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
'''

data = {'N': len(d2),
        'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(),
        'weight': d2['weight'].tolist()}
m4_3_model = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm = 'Newton')
m4_3_model.precis().round(2)
m4_3_model.vcov_matrix().round(3)
```
------------------------------------------------------------------------
#### Everything that depends upon parameters has a posterior distribution
In the model above, the parameter $μ$ is no longer a parameter, since it has become a function of the parameters $α$ and $β$. But since the parameters $α$ and $β$ have a joint posterior, so too does $μ$. Later in the chapter, you’ll work directly with the posterior distribution of $μ$, even though it’s not a parameter anymore. Since parameters are uncertain, everything that depends upon them is also uncertain. This includes statistics like $μ$, as well as model-based predictions, measures of fit, and everything else that uses parameters. By working with samples from the posterior, all you have to do to account for posterior uncertainty in any quantity is to compute that quantity for each sample from the posterior. The resulting quantities, one for each posterior sample, will approximate the quantity’s posterior distribution.
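As a quick illustration of this principle (my own example, using the model fit above): the posterior distribution of any derived quantity, say the expected height difference across a 10 kg span, is obtained by computing that quantity once per posterior sample.
```{python}
post = pd.DataFrame(m4_3_model.extract_samples())
# One derived value per posterior sample: expected height change over 10 kg
delta_10kg = post['b'] * 10
delta_10kg.quantile([0.055, 0.5, 0.945])
```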
------------------------------------------------------------------------
#### Logs and exps, oh my
My experience is that many natural and social scientists have naturally forgotten whatever they once knew about logarithms. Logarithms appear all the time in applied statistics. You can usefully think of $y=log(x)$ as assigning to $y$ the order of magnitude of $x$. The function $x=exp(y)$ is the reverse, turning a magnitude into a value. These definitions will make a mathematician shriek. But much of our computational work relies only on these intuitions.
These definitions allow the Log-Normal prior for $β$ to be coded another way. Instead of defining a parameter $β$, we define a parameter that is the logarithm of $β$ and then assign it a normal distribution. Then we can reverse the logarithm inside the linear model. It looks like this:
```{python}
m4_3b = '''
data {
    int<lower=0> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
}
parameters {
    real a;
    real log_b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + exp(log_b) * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    log_b ~ normal(0, 1);
    sigma ~ uniform(0, 50);
}
'''

data = {'N': len(d2),
        'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(),
        'weight': d2['weight'].tolist()}
m4_3b_model = utils.StanQuap('stan_models/m4_3b', m4_3b, data, algorithm = 'Newton')
m4_3b_model.precis().round(2)
m4_3b_model.vcov_matrix().round(3)
```
```{python}
np.exp(m4_3b_model.precis()['Mean'].iloc[1])  # back-transform log_b to the b scale
```
Note the `exp(log_b)` in the definition of `mu`. This is the same model as `m4_3`. It will make the same predictions. But instead of $β$ in the posterior distribution, you get $\text{log}(β)$. It is easy to translate between the two, because $β=\text{exp}(\text{log}(β))$. In code form: `b = np.exp(log_b)`.
------------------------------------------------------------------------
### Interpreting the Posterior Distribution
One trouble with statistical models is that they are hard to understand. Once you’ve fit the model, it can only report the posterior distribution. This is the right answer to the question you asked. But it’s your responsibility to process the answer and make sense of it.
There are two broad categories of processing: (1) reading tables and (2) plotting simulations. For some simple questions, it’s possible to learn a lot just from tables of marginal values. But most models are very hard to understand from tables of numbers alone. A major difficulty with tables alone is their apparent simplicity compared to the complexity of the model and data that generated them. Once you have more than a couple of parameters in a model, it is very hard to figure out from numbers alone how all of them act to influence prediction. This is also the reason we simulate from priors. Once you begin adding **interaction terms** or **polynomials**, it may not even be possible to guess the direction of influence a predictor variable has on an outcome.
So throughout this book, I emphasize plotting posterior distributions and posterior predictions, instead of attempting to understand a table. Plotting the implications of your models will allow you to inquire about things that are hard to read from tables:
1) Whether or not the model fitting procedure worked correctly
2) The *absolute* magnitude, rather than merely *relative* magnitude, of a relationship between outcome and predictor
3) The uncertainty surrounding an average relationship
4) The uncertainty surrounding the implied predictions of the model, as these are distinct from mere parameter uncertainty
In addition, once you get the hang of processing posterior distributions into plots, you can ask any question you can think of, for any model type. And readers of your results will appreciate a figure much more than they will a table of estimates.
So in the remainder of this section, I first spend a little time talking about tables of estimates. Then I move on to show how to plot estimates that always incorporate information from the full posterior distribution, including correlations among parameters.
------------------------------------------------------------------------
#### What do parameters mean?
A basic issue with interpreting model-based estimates is in knowing the meaning of parameters. There is no consensus about what a parameter means, however, because different people take different philosophical stances towards models, probability, and prediction. The perspective in this book is a common Bayesian perspective: *Posterior probabilities of parameter values describe the relative compatibility of different states of the world with the data, according to the model*. These are small world numbers. So reasonable people may disagree about the large world meaning, and the details of those disagreements depend strongly upon context. Such disagreements are productive, because they lead to model criticism and revision, something that models cannot do for themselves. Later, we'll see that parameters can refer to observable quantities—data—as well as unobservable values. This makes parameters even more useful and their interpretation even more context dependent.
------------------------------------------------------------------------
#### Tables of Marginal Distributions
With the new linear regression trained on the Kalahari data, we inspect the marginal posterior distributions of the parameters:
```{python}
m4_3_model.precis().round(2)
```
The first row gives the quadratic approximation for $α$, the second the approximation for $β$, and the third the approximation for $σ$. Let’s try to make some sense of them.
Let’s focus on `b` ($β$), because it’s the new parameter. Since $β$ is a slope, the value 0.90 can be read as *a person 1 kg heavier is expected to be 0.90 cm taller*. 89% of the posterior probability lies between 0.84 and 0.97. That suggests that $β$ values close to zero or greatly above one are highly incompatible with these data and this model. It is most certainly not evidence that the relationship between weight and height is linear, because the model only considered lines. It just says that, if you are committed to a line, then lines with a slope around 0.9 are plausible ones.
Remember, the numbers in the default `precis` output aren’t sufficient to describe the quadratic posterior completely. For that, we also require the variance-covariance matrix. You can see the covariances among the parameters with `vcov`:
```{python}
np.round(m4_3_model.vcov_matrix(), 3)
```
Very little covariation among the parameters in this case. Constructing a pairs plot shows both the marginal posteriors and the covariance. Later, you’ll see that the lack of covariance among the parameters results from **centering**.
```{python}
def plot_pair():
    posterior = m4_3_model.extract_samples(n=1000)
    az.plot_pair(posterior, var_names=['a', 'b', 'sigma'], figsize=[6, 6], marginals=True, scatter_kwargs={'alpha': 0.2})
utils.inline_plot(plot_pair)
```
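As an aside, a quick way to see why centering helps is to compare the classical intercept-slope covariance for raw versus centered weight. This is my own least-squares sketch, not part of the original analysis:
```{python}
# Least-squares illustration: centering the predictor removes the
# correlation between the intercept and slope estimates
x = d2['weight'].to_numpy()
y = d2['height'].to_numpy()
for name, xc in [("raw", x), ("centered", x - x.mean())]:
    X = np.column_stack([np.ones_like(xc), xc])
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    cov = resid.var() * np.linalg.inv(X.T @ X)  # classical cov of (intercept, slope)
    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(f"{name}: intercept-slope correlation = {corr:.3f}")
```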
#### Plotting Posterior Inference Against the Data
It’s almost always much more useful to plot the posterior inference against the data. Not only does plotting help in interpreting the posterior, but it also provides an informal check on model assumptions. When the model’s predictions don’t come close to key observations or patterns in the plotted data, then you might suspect the model either did not fit correctly or is rather badly specified. But even if you only treat plots as a way to help in interpreting the posterior, they are invaluable. For simple models like this one, it is possible (but not always easy) to just read the table of numbers and understand what the model says. But for even slightly more complex models, especially those that include interaction effects, interpreting posterior distributions is hard. Combine with this the problem of incorporating the information in `vcov` into your interpretations, and the plots are irreplaceable.
We’re going to start with a simple version of that task, superimposing just the **posterior mean** values over the height and weight data. Then we’ll slowly add more and more information to the prediction plots, until we’ve used the entire posterior distribution.
We’ll start with just the raw data and a single line. The code below plots the raw data, computes the posterior mean values for `a` and `b`, then draws the implied line:
```{python}
def plot_post_infer():
    plt.plot(d2['weight'], d2['height'], 'o', fillstyle='none')
    posterior = pd.DataFrame(m4_3_model.extract_samples())
    a_map = posterior["a"].mean()
    b_map = posterior["b"].mean()
    xbar = d2['weight'].mean()
    plt.plot(d2['weight'], a_map + b_map * (d2['weight'] - xbar), linewidth=1.5)
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")
utils.inline_plot(plot_post_infer)
```
Height in centimeters (vertical) plotted against weight in kilograms (horizontal), with the line at the posterior mean plotted in blue.
Each point in this plot is a single individual. The blue line is defined by the mean slope $β$ and mean intercept $α$. This is not a bad line. It certainly looks highly plausible. But there are an infinite number of other highly plausible lines near it. Let’s draw those too.
#### Adding Uncertainty Around the Mean
The posterior mean line is just the posterior mean, the most plausible line in the infinite universe of lines the posterior distribution has considered. Plots of the average line, like the plot above, are useful for getting an impression of the magnitude of the estimated influence of a variable. But they do a poor job of communicating uncertainty. Remember, the posterior distribution considers every possible regression line connecting height to weight. It assigns a relative plausibility to each. This means that each combination of $α$ and $β$ has a posterior probability. It could be that there are many lines with nearly the same posterior probability as the average line. Or it could be instead that the posterior distribution is rather narrow near the average line.
So how can we get that uncertainty onto the plot? Together, a combination of $α$ and $β$ define a line. And so we could sample a bunch of lines from the posterior distribution. Then we could display those lines on the plot, to visualize the uncertainty in the regression relationship.
To better appreciate how the posterior distribution contains lines, we work with all of the samples from the model. Let’s take a closer look at the samples now:
```{python}
posterior = pd.DataFrame(m4_3_model.extract_samples())
posterior.head()
```
Each row is a correlated random sample from the joint posterior of all three parameters, using the covariances provided by the model. The paired values of `a` and `b` on each row define a line. The average of very many of these lines is the posterior mean line. But the scatter around that average is meaningful, because it alters our confidence in the relationship between the predictor and the outcome.
So now let’s display a bunch of these lines, so you can see the scatter. This lesson will be easier to appreciate, if we use only some of the data to begin. Then you can see how adding in more data changes the scatter of the lines. So we’ll begin with just the first 10 cases in `d2`. The following code extracts the first 10 cases and re-estimates the model:
```{python}
N = 10
dN = d2[:N]
data = {'N': len(dN),
'xbar': dN['weight'].mean(),
'height': dN['height'].tolist(),
'weight': dN['weight'].tolist()}
mN = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm = 'Newton')
```
Now let’s plot 20 of these lines, to see what the uncertainty looks like.
```{python}
# extract 20 samples from the posterior
post = pd.DataFrame(mN.extract_samples(n=20))
def plot_post_infer():
    plt.scatter(dN['weight'], dN['height'], facecolor="none", edgecolor="C0")
    for i in range(20):
        a, b = post['a'][i], post['b'][i]
        x = np.linspace(d2['weight'].min(), d2['weight'].max(), 10)
        y = a + b * (x - dN['weight'].mean())
        plt.plot(x, y, "C1-", alpha=0.5)
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")
    plt.tight_layout()
utils.inline_plot(plot_post_infer)
```
By plotting multiple regression lines, sampled from the posterior, it is easy to see both the highly confident aspects of the relationship and the less confident aspects. The cloud of regression lines displays greater uncertainty at extreme values for weight.
The other plots below show the same relationships, but for increasing amounts of data. Just re-use the code from before, but change `N = 10` to some other value. Notice that the cloud of regression lines grows more compact as the sample size increases. This is a result of the model growing more confident about the location of the mean.
```{python}
def plot_reg_rel():
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9), sharey=True, sharex=True)
    for N, ax in zip((10, 50, 150, 352), axes.flatten()):
        weight, height = d2['weight'][:N], d2['height'][:N]
        xbar = np.mean(weight)
        data = {'N': len(height),
                'xbar': xbar,
                'height': height.tolist(),
                'weight': weight.tolist()}
        mN = utils.StanQuap('stan_models/m4_3', m4_3, data, algorithm='Newton')
        post = pd.DataFrame(mN.extract_samples(n=20))
        ax.scatter(weight, height, facecolor="none", edgecolor="C0")
        for i in range(20):
            a, b = post['a'][i], post['b'][i]
            x = np.linspace(d2['weight'].min(), d2['weight'].max(), 10)
            y = a + b * (x - xbar)
            ax.plot(x, y, "C1-", alpha=0.5)
        ax.set(xlabel="Weight (kg)", ylabel="Height (cm)", title=f"N = {N}")
    fig.tight_layout()
utils.inline_plot(plot_reg_rel)
```
#### Plotting Regression Intervals and Contours
The cloud of regression lines in the plot above is an appealing display, because it communicates uncertainty about the relationship in a way that many people find intuitive. But it’s more common, and often much clearer, to see the uncertainty displayed by plotting an interval or contour around the average regression line. In this section, I’ll walk you through how to compute any arbitrary interval you like, using the underlying cloud of regression lines embodied in the posterior distribution.
Focus for the moment on a single weight value, say 50 kilograms. You can quickly make a list of 10,000 values of $μ$ for an individual who weighs 50 kilograms, by using your samples from the posterior:
```{python}
post = pd.DataFrame(m4_3_model.extract_samples())
mu_at_50 = post["a"] + post["b"] * (50 - d2.weight.mean())
```
The code above takes its form from the equation for $μ_i$:
$$\mu_i = \alpha + \beta(x_i - \bar{x})$$
The value of $x_i$ in this case is 50. Go ahead and take a look inside the result, `mu_at_50`. It’s a vector of predicted means, one for each random sample from the posterior. Since jointly sampled values of `a` and `b` went into computing each, the variation across those means incorporates the uncertainty in and correlation between both parameters. It might be helpful at this point to actually plot the density for this vector of means:
```{python}
def plot_dens():
    bw = utils.bw_nrd0(mu_at_50.values)
    az.plot_kde(mu_at_50.values, bw=bw*0.5)
    plt.xlabel(r"$\mu \mid Weight = 50$")
    plt.ylabel("Density")
utils.inline_plot(plot_dens)
```
The quadratic approximate posterior distribution of the mean height, $μ$, when weight is 50 kg. This distribution represents the relative plausibility of different values of the mean.
Since the components of $μ$ have distributions, so too does $μ$. And since the distributions of $α$ and $β$ are Gaussian, so too is the distribution of $μ$ (adding Gaussian distributions always produces a Gaussian distribution).
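A quick simulation check of that parenthetical claim (my own; for $α$ and $β$ the same conclusion holds even though they are correlated, because the quadratic approximation makes them jointly Gaussian):
```{python}
# Sum of independent Gaussians: expect mean 0 + 5 = 5, sd sqrt(1 + 4), about 2.24
z = stats.norm.rvs(loc=0, scale=1, size=10_000) + stats.norm.rvs(loc=5, scale=2, size=10_000)
z.mean(), z.std()
```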
Since the posterior for $μ$ is a distribution, you can find intervals for it, just like for any posterior distribution. To find the 89% compatibility interval of $μ$ at 50 kg, we can use the quantile or `precis` function as usual:
```{python}
utils.precis(pd.DataFrame({'mu': mu_at_50}))
```
What these numbers mean is that the central 89% of the ways for the model to produce the data place the average height between about 159 cm and 160 cm (conditional on the model and data), assuming the weight is 50 kg.
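Equivalently, the same bounds can be read off with raw quantiles:
```{python}
np.quantile(mu_at_50, [0.055, 0.945])
```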
That’s good so far, but we need to repeat the above calculation for every weight value on the horizontal axis, not just when it is 50 kg. We want to draw 89% intervals around the average slope.
This is made simple by strategic use of the `link` function. What `link` will do is take our `StanQuap` approximation, sample from the posterior distribution, and then compute $μ$ for each case in the data and each sample from the posterior. Here’s what it looks like for the data you used to fit the model:
```{python}
def lm(post, data):
    weight = data
    xbar = d2['weight'].mean()  # ensure that xbar is based on the dataset supplied during fit
    a, b = post['a'].reshape(-1, 1), post['b'].reshape(-1, 1)
    return a + b * (weight - xbar)
mu = utils.link(m4_3_model, lm_func=lm, data=d2['weight'].to_numpy())
print(mu)
```
We end up with a big matrix of values of $μ$. Each row is a sample from the posterior distribution. The default is 1000 samples, but you can use as many or as few as you like. Each column is a case (row) in the data. There are 352 rows in `d2`, corresponding to 352 individuals. So there are 352 columns in the matrix `mu` above.
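A quick shape check makes that layout concrete:
```{python}
mu.shape  # (posterior samples, cases); expect (1000, 352)
```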
Now what can we do with this big matrix? Lots of things. The function `link` provides a posterior distribution of $μ$ for each case we feed it. So above we have a distribution of $μ$ for each individual in the original data. We actually want something slightly different: a distribution of $μ$ for each unique weight value on the horizontal axis. It’s only slightly harder to compute that, by just passing `link` some new data:
```{python}
# define sequence of weights to compute predictions for
# these values will be on the horizontal axis
weight_seq = np.arange(25, 71)
# use link to compute mu
# for each sample from posterior
# and for each weight in weight.seq
mu = utils.link(m4_3_model, lm_func=lm, data=weight_seq)
mu
```
And now there are only 46 columns in `mu`, because we fed it 46 different values for weight. To visualize what you’ve got here, let’s plot the distribution of $μ$ values at each weight.
```{python}
def plot_mu():
    plt.plot(weight_seq, mu[:100, :].T, "C0.", alpha=0.1)
    plt.xlabel("Weight")
    plt.ylabel("Height")
utils.inline_plot(plot_mu)
```
The first 100 values in the distribution of $μ$ at each weight value.
At each weight value in `weight_seq`, a pile of computed $μ$ values are shown. Each of these piles is a Gaussian distribution, like that in the density plot for `mu_at_50`. You can see now that the amount of uncertainty in $μ$ depends upon the value of weight. And this is the same fact we saw in the 2x2 plot of uncertainty in the regression relationship.
The final step is to summarize the distribution for each weight value:
```{python}
# summarize the distribution of mu
mu_df = pd.DataFrame(
    {str(weight): mu[:, i] for i, weight in enumerate(weight_seq)}
)
mu_summary = utils.precis(mu_df, index_name='Weight')
mu_summary
```
Now `mu_summary` contains the average $μ$ at each weight value and 89% lower and upper bounds for each weight value. They are just different kinds of summaries of the distributions in `mu`, with each column being for a different weight value. These summaries are only summaries. The “estimate” is the entire distribution.
You can plot these summaries on top of the data:
```{python}
def plot_mu():
    plt.plot(d2['weight'], d2['height'], 'o', fillstyle='none')
    plt.plot(weight_seq, mu_summary['Mean'], 'k')
    plt.fill_between(weight_seq, mu_summary['5.5%'], mu_summary['94.5%'], alpha=0.2)
    plt.xlim(d2['weight'].min()-2, d2['weight'].max()+2)
    plt.ylabel("Height (cm)")
    plt.xlabel("Weight (kg)")
utils.inline_plot(plot_mu)
```
The !Kung height data again, now with 89% compatibility interval of the mean indicated by the shaded region. Compare this region to the distributions of points in the above plot.
Using this approach, you can derive and plot **posterior prediction means and intervals** for quite complicated models, for any data you choose. It’s true that it is possible to use analytical formulas to compute intervals like this. But without much training in probability theory and integral calculus, this would be challenging to do. We can quickly learn to generate and summarize samples derived from the posterior distribution. So while the mathematics would be a more elegant approach, and there is some additional insight that comes from knowing the mathematics, the pseudo-empirical approach presented here is very flexible and allows a much broader audience to pull insight from their statistical modeling. And again, when you start estimating models with MCMC, this is really the only approach available. So it’s worth learning now.
To summarize, here’s the recipe for generating predictions and intervals from the posterior of a fit model.
1) Use `link` to generate distributions of posterior values for $μ$. The default behavior of `link` is to use the original data, so you have to pass it a list of new horizontal axis values you want to plot posterior predictions across.
2) Use summary functions like `numpy.mean` or `arviz.hdi` or `precis` to find averages and lower and upper bounds of $μ$ for each value of the predictor variable.
3) Finally, use plotting functions to draw the lines and intervals. Or you might plot the distributions of the predictions, or do further numerical calculations with them.
This recipe works for every model we fit in this repo. As long as you know the structure of the model—how parameters relate to the data—you can use samples from the posterior to describe any aspect of the model’s behavior.
------------------------------------------------------------------------
#### Overconfident intervals
The compatibility interval for the regression line in the plot above clings tightly to the MAP line. Thus there is very little uncertainty about the *average height as a function of average weight*. But you have to keep in mind that these inferences are always conditional on the model. Even a very bad model can have very tight compatibility intervals. It may help if you think of the regression line as saying: *Conditional on the assumption that height and weight are related by a straight line, then this is the most plausible line, and these are its plausible bounds*.
------------------------------------------------------------------------
#### How `link` works
The function `link` is not really very sophisticated. All it is doing is using the linear model function you provide as an argument to compute the value of the linear model. It does this for each sample from the posterior distribution, for each case in the data. You could accomplish the same thing for any model, fit by any means, by performing these steps yourself. This is how it’d look for `m4_3_model`.
```{python}
post = m4_3_model.extract_samples(n=1000, dict_out=True)
xbar = d2['weight'].mean()
mu_link = lambda weight: post['a'].reshape(-1,1) + post['b'].reshape(-1,1) * (weight - xbar)
weight_seq = np.arange(25, 71)
mu = mu_link(weight_seq)
mu_mean = mu.mean(axis=0)
mu_CI = np.quantile(mu, q=[0.055, 0.945], axis=0)
```
And the values in `mu_mean` and `mu_CI` should be very similar (allowing for simulation variance) to what you got the automated way, using `link`.
Knowing this manual method is useful both for (1) understanding and (2) sheer power. Whatever the model you find yourself with, this approach can be used to generate posterior predictions for any component of it. Automated tools like `link` save effort, but they are never as flexible as the code you can write yourself.
------------------------------------------------------------------------
#### Prediction Intervals
Now let’s walk through generating an 89% prediction interval for actual heights, not just the average height, $μ$. This means we’ll incorporate the standard deviation $σ$ and its uncertainty as well. Remember, the first line of the statistical model here is:
$$h_i \sim \text{Normal}(\mu_i, \sigma)$$
What you’ve done so far is just use samples from the posterior to visualize the uncertainty in $μ_i$, the linear model of the mean. But actual predictions of heights depend also upon the distribution in the first line. The Gaussian distribution on the first line tells us that the model expects observed heights to be distributed around $μ$, not right on top of it. And the spread around $μ$ is governed by $σ$. All of this suggests we need to incorporate $σ$ in the predictions somehow.
Here’s how you do it. Imagine simulating heights. For any unique weight value, you sample from a Gaussian distribution with the correct mean $μ$ for that weight, using the correct value of $σ$ sampled from the same posterior distribution. If you do this for every sample from the posterior, for every weight value of interest, you end up with a collection of simulated heights that embody the uncertainty in the posterior as well as the uncertainty in the Gaussian distribution of heights. This is how we can do it:
```{python}
m4_3_pred = '''
data {
    int<lower=1> N;
    vector[N] height;
    vector[N] weight;
    real xbar;
    int<lower=1> N_tilde;
    vector[N_tilde] weight_tilde;
}
parameters {
    real a;
    real<lower=0> b;
    real<lower=0, upper=50> sigma;
}
model {
    vector[N] mu;
    mu = a + b * (weight - xbar);
    // Likelihood Function
    height ~ normal(mu, sigma);
    // Priors
    a ~ normal(178, 20);
    b ~ lognormal(0, 1);
    sigma ~ uniform(0, 50);
}
generated quantities {
    vector[N_tilde] y_tilde;
    for (i in 1:N_tilde) {
        y_tilde[i] = normal_rng(a + b * (weight_tilde[i] - xbar), sigma);
    }
}
'''

data = {'N': len(d2),
        'xbar': d2['weight'].mean(),
        'height': d2['height'].tolist(),
        'weight': d2['weight'].tolist(),
        'N_tilde': len(weight_seq),
        'weight_tilde': weight_seq.tolist()}
m4_3_pred_model = utils.StanQuap('stan_models/m4_3_pred', m4_3_pred, data, algorithm = 'Newton')
sim_height = m4_3_pred_model.sim(data=data)['y_tilde']
sim_height
```
This matrix is much like the earlier one, `mu`, but it contains simulated heights, not distributions of plausible average height, `µ`.
We can summarize these simulated heights in the same way we summarized the distributions of `µ`:
```{python}
sim_height_df = pd.DataFrame(
    {str(weight): sim_height[:, i] for i, weight in enumerate(weight_seq)}
)
sim_height_df = utils.precis(sim_height_df, index_name='Weight')
sim_height_df
```
Now `sim_height_df` contains the 89% posterior prediction interval of observable (according to the model) heights, across the values of weight in `weight_seq`.
Let’s plot everything we’ve built up: (1) the average line, (2) the shaded region of 89% plausible $μ$, and (3) the boundaries of the simulated heights the model expects.
```{python}
def plot_post_sim():
    plt.plot(d2.weight, d2.height, 'o', fillstyle='none', label='Observed Data', color='k')
    plt.plot(weight_seq, mu_summary['Mean'], 'k', label='Mean')
    plt.fill_between(weight_seq, mu_summary['5.5%'], mu_summary['94.5%'], alpha=0.2)
    plt.fill_between(weight_seq, sim_height_df['5.5%'], sim_height_df['94.5%'], alpha=0.2)
utils.inline_plot(plot_post_sim)
```
89% prediction interval for height, as a function of weight. The solid line is the average line for the mean height at each weight. The two shaded regions show different 89% plausible regions. The narrow shaded interval around the line is the distribution of $μ$. The wider shaded region represents the region within which the model expects to find 89% of actual heights in the population, at each weight.
The wide shaded region in the figure represents the area within which the model expects to find 89% of actual heights in the population, at each weight. There is nothing special about the value 89% here. You could plot the boundary for other percents, such as 67% and 97% (also both primes), and add those to the plot. Doing so would help you see more of the shape of the predicted distribution of heights.
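For instance, here’s a minimal sketch (reusing `sim_height` and `weight_seq` from above) that stacks several prediction intervals:
```{python}
def plot_multi_pi():
    plt.plot(d2['weight'], d2['height'], 'o', fillstyle='none', color='k')
    for p in (0.67, 0.89, 0.97):
        lo, hi = np.quantile(sim_height, [(1 - p) / 2, (1 + p) / 2], axis=0)
        plt.fill_between(weight_seq, lo, hi, color="C1", alpha=0.15)
    plt.xlabel("Weight (kg)")
    plt.ylabel("Height (cm)")

utils.inline_plot(plot_multi_pi)
```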
Notice that the outline for the wide shaded interval is a little rough. This is the simulation variance in the tails of the sampled Gaussian values. If it really bothers you, increase the number of samples you take from the posterior distribution.
With extreme percentiles, it can be very hard to get out all of the roughness. Luckily, it hardly matters, except for aesthetics. Moreover, it serves to remind us that all statistical inference is approximate. The fact that we can compute an expected value to the 10th decimal place does not imply that our inferences are precise to the 10th decimal place.
------------------------------------------------------------------------
#### Two kinds of uncertainty
In the procedure above, we encountered both **uncertainty in parameter values** and **uncertainty in a sampling process**. These are distinct concepts, even though they are processed much the same way and end up blended together in the **posterior predictive simulation**. The *posterior distribution is a ranking of the relative plausibilities of every possible combination of parameter values*. The distribution of simulated outcomes, like height, is instead a distribution that includes sampling variation from some process that generates Gaussian random variables. This sampling variation is still a model assumption. It’s no more or less objective than the posterior distribution. Both kinds of uncertainty matter, at least sometimes. But it’s important to keep them straight, because they depend upon different model assumptions. Furthermore, it’s possible to view the **Gaussian likelihood as a purely epistemological assumption (a device for estimating the mean and variance of a variable), rather than an ontological assumption about what future data will look like**. In that case, it may not make complete sense to simulate outcomes.
------------------------------------------------------------------------
#### Rolling your own `sim`
Just like with `link`, it’s useful to know a little about how `sim` operates. For every distribution like `stats.norm.pdf`, there is a companion simulation (random number generator `rng`) function. For the Gaussian distribution, the companion is `stats.norm.rvs`, and it simulates sampling from a Gaussian distribution. What we want Python to do is simulate a height for each set of samples, and to do this for each value of weight. The following will do it:
```{python}
post = m4_3_model.extract_samples(n=1000)
weight_seq = np.arange(25, 71)
xbar = d2['weight'].mean()
sim_height = stats.norm.rvs(
    loc=post['a'] + post['b'] * (weight_seq.reshape(-1, 1) - xbar),
    scale=post['sigma']
).T
sim_height_df = pd.DataFrame(
    {str(weight): sim_height[:, i] for i, weight in enumerate(weight_seq)}
)
sim_height_df = utils.precis(sim_height_df, index_name='Weight')
sim_height_df
```
The values in `sim_height` are practically identical to the ones computed in the main text and displayed in the plot above.
------------------------------------------------------------------------
## Curves from Lines
In the next notebook, you’ll see how to use linear models to build regressions with more than one predictor variable. But before then, it helps to see how to model the outcome as a curved function of a predictor. The models so far all assume that a straight line describes the relationship. But there’s nothing special about straight lines, aside from their simplicity.
We’ll consider two commonplace methods that use linear regression to build curves. The first is **polynomial regression**. The second is **B-splines**. Both approaches work by transforming a single predictor variable into several synthetic variables. But splines have some clear advantages. Neither approach aims to do more than describe the function that relates one variable to another. Causal inference, which we’ll consider much more beginning in the next note, wants more.
### Polynomial Regression
Polynomial regression uses powers of a variable—squares and cubes—as extra predictors. This is an easy way to build curved associations. Polynomial regressions are very common, and understanding how they work will help scaffold later models. To understand how polynomial regression works, let’s work through an example, using the full !Kung data, not just the adults:
```{python}
d = pd.read_csv("data/Howell1.csv", sep=';')
d.head()
```
Go ahead and plot height and weight. The relationship is visibly curved, now that we’ve included the non-adult individuals.
```{python}
plt.plot(d.weight, d.height, 'o', fillstyle='none', label='Observed Data', color='k')
plt.xlabel('Weight (kg)')
plt.ylabel('Height (cm)');
```
The most common polynomial regression is a parabolic model of the mean. Let $x$ be standardized body weight. Then the parabolic equation for the mean height is:
$$\mu_i = \alpha + \beta_1 x_i + \beta_2 x^2_i$$
The above is a parabolic (second-order) polynomial. The $α + β_1 x_i$ part is the same linear function of $x$ as in a linear regression, just with a little “1” subscript added to the parameter name, so we can tell it apart from the new parameter. The additional term uses the square of $x_i$ to construct a parabola, rather than a perfectly straight line. The new parameter $β_2$ measures the curvature of the relationship.
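To see how $β_2$ controls curvature, here is a small sketch; the values of the other parameters are made up, chosen only for illustration:
```{python}
# How beta_2 bends the mean function: negative values open the parabola
# downward, zero recovers a straight line, positive values open it upward.
x = np.linspace(-2, 2, 100)
alpha, beta1 = 150, 20    # arbitrary values, for illustration only
for beta2 in (-10, 0, 10):
    plt.plot(x, alpha + beta1 * x + beta2 * x**2, label=f'β2 = {beta2}')
plt.xlabel('x (standardized)')
plt.ylabel('μ')
plt.legend();
```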
Fitting these models to data is easy. Interpreting them can be hard. We’ll begin with the easy part, fitting a parabolic model of height on weight. The first thing to do is to **standardize the predictor variable**. We’ve done this in previous examples. But this is especially helpful for working with polynomial models. When predictor variables have very large values in them, there are sometimes numerical glitches. Even well-known statistical software can suffer from these glitches, leading to mistaken estimates. These problems are very common for polynomial regression, because the square or cube of a large number can be truly massive. Standardizing largely resolves this issue. It should be your default behavior.
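To make the glitch concrete, here is a rough sketch that uses the condition number of the polynomial design matrix as a crude proxy for numerical difficulty; the exact values don’t matter, only the orders of magnitude:
```{python}
# Standardizing shrinks the condition number of the design matrix by
# orders of magnitude, which is why it stabilizes polynomial fits.
w_raw = d.weight.values
w_std = (w_raw - w_raw.mean()) / w_raw.std()

X_raw = np.column_stack([np.ones_like(w_raw), w_raw, w_raw**2])
X_std = np.column_stack([np.ones_like(w_std), w_std, w_std**2])

print(f"condition number, raw:          {np.linalg.cond(X_raw):,.0f}")
print(f"condition number, standardized: {np.linalg.cond(X_std):,.1f}")
```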
To define the parabolic model, just modify the definition of $μ_i$. Here’s the model: $$\begin{align*}
h_{i} &\sim \text{Normal}(\mu_i, \sigma)\tag{Likelihood}\\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x^2_i\tag{Quadratic Model}\\
\alpha &\sim \text{Normal}(178,20)\tag{α prior} \\
\beta_1 &\sim \text{Log-Normal}(0,1)\tag{β1 prior} \\
\beta_2 &\sim \text{Normal}(0,1)\tag{β2 prior} \\
\sigma &\sim \text{Uniform}(0,50)\tag{σ prior} \\
\end{align*}$$
The confusing issue here is assigning a prior for $β_2$, the parameter on the squared value of $x$. Unlike $β_1$, we don’t want a positive constraint. In the practice problems at the end of the chapter, you’ll use prior predictive simulation to understand why. These polynomial parameters are in general very difficult to understand. But prior predictive simulation does help a lot.
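As a preview of that exercise, here is a quick prior predictive sketch under the priors above. A Normal(0, 1) prior on $β_2$ lets the curve bend either way; a strictly positive prior would force every curve to open upward, which makes no sense for heights:
```{python}
# Prior predictive curves for the parabolic model. With beta_2 ~ Normal(0, 1)
# the parabola can open in either direction.
rng = np.random.default_rng(42)
n = 50
a = rng.normal(178, 20, n)
b1 = rng.lognormal(0, 1, n)
b2 = rng.normal(0, 1, n)
x = np.linspace(-2, 2, 50)
for i in range(n):
    plt.plot(x, a[i] + b1[i] * x + b2[i] * x**2, 'k', alpha=0.2)
plt.xlabel('Standardized weight')
plt.ylabel('Height (cm)');
```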
Approximating the posterior is straightforward. Just modify the definition of `mu` so that it contains both the linear and quadratic terms. But in general it is better to pre-process any variable transformations—you don’t need the computer to recalculate the transformations on every iteration of the fitting procedure. So I’ll also build the square of `weight_s` as a separate variable:
```{python}
# Standardize weight and pre-compute its square, so the sampler doesn't
# recalculate the transformation on every iteration.
d['weight_s'] = (d.weight.values - d.weight.mean()) / d.weight.std()
d['weight_s2'] = d.weight_s**2

with pm.Model() as m4_5_quad:
    # Mutable data containers, so pm.set_data can swap in new weight
    # values for predictions later.
    weight_s = pm.MutableData("w", d.weight_s, dims="obs_id")
    weight_s2 = pm.MutableData("w_s", d.weight_s2, dims="obs_id")
    a = pm.Normal("a", mu=178, sigma=20)
    b1 = pm.LogNormal("b1", mu=0, sigma=1)
    b2 = pm.Normal("b2", mu=0, sigma=1)
    sigma = pm.Uniform("sigma", 0, 50)
    mu = pm.Deterministic("mu", a + b1 * weight_s + b2 * weight_s2, dims="obs_id")
    height = pm.Normal("height", mu=mu, sigma=sigma, observed=d.height.values, dims="obs_id")
```
```{python}
with m4_5_quad:
custom_step_m4_5 = utils.QuadraticApproximation([a, b1, b2, sigma], m4_5_quad)
trace_quap_quad = pm.sample(draws=10_000, chains=1, tune=0, step=custom_step_m4_5, progressbar=False)
```
```{python}
pm.model_to_graphviz(m4_5_quad)
```
Now, since the relationship between the outcome `height` and the predictor `weight` depends upon two slopes, `b1` and `b2`, it isn’t so easy to read the relationship off a table of coefficients:
```{python}
az.summary(trace_quap_quad, var_names=["~mu"], hdi_prob=0.89, kind="stats", round_to=2)
```
The parameter $α$ (`a`) is still the intercept, so it tells us the expected value of height when weight is at its mean value. But it is no longer equal to the mean height in the sample, since there is no guarantee it should be in a polynomial regression. And the $β_1$ and $β_2$ parameters are the linear and square components of the curve. But that doesn’t make them transparent.
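You can check the intercept against the sample directly. A quick sketch, using the trace computed above:
```{python}
# The intercept is the expected height at the mean weight, which is not
# the same as the sample mean height once the curve bends.
print(f"posterior mean of a: {trace_quap_quad.posterior['a'].mean().item():.1f}")
print(f"sample mean height:  {d.height.mean():.1f}")
```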
You have to plot these model fits to understand what they are saying. So let’s do that. We’ll calculate the mean relationship and the 89% intervals of the mean and the predictions, like in the previous section. Here’s the working code:
```{python}
weight_seq = np.linspace(-2.2, 2, 30)
with m4_5_quad:
pm.set_data({"w": weight_seq, 'w_s': weight_seq**2})
trace_quap_quad = pm.sample_posterior_predictive(
trace_quap_quad,
var_names=["height", "mu"],
return_inferencedata=True,
predictions=True,
extend_inferencedata=True,
progressbar=False
)
```
```{python}
fig, ax = plt.subplots()
hdi_1 = az.plot_hdi(weight_seq, trace_quap_quad.predictions["height"],
hdi_prob=0.89, smooth=False)
hdi_2 = az.plot_hdi(weight_seq, trace_quap_quad.predictions["mu"],
hdi_prob=0.89, smooth=False)
mean_line, = ax.plot(weight_seq, trace_quap_quad.predictions["mu"].mean(axis=1)[0], 'k', label='Mean')
obs_data, = ax.plot(d.weight_s, d.height, 'o', fillstyle='none', label='Observed Data', color='k')
hdi_patch1 = plt.Line2D([0], [1], color='C0', lw=3, label="Posterior Predictive Sample")
hdi_patch2 = plt.Line2D([0], [1], color='C1', lw=3, label="Uncertainty in Mean")
ax.set(xlabel="Weight (standardized)",
       ylabel="Height (cm)",
       xlim=(weight_seq.min(), weight_seq.max()))
ax.legend(handles=[hdi_patch1, hdi_patch2, mean_line, obs_data]);
```
Let’s build a higher-order polynomial regression: a cubic regression on weight. The model is:
$$\begin{align*}
h_{i} &\sim \text{Normal}(\mu_i, \sigma)\tag{Likelihood}\\
\mu_i &= \alpha + \beta_1 x_i + \beta_2 x^2_i + \beta_3 x^3_i\tag{Cubic Model}\\
\alpha &\sim \text{Normal}(178,20)\tag{α prior} \\
\beta_1 &\sim \text{Log-Normal}(0,1)\tag{β1 prior} \\
\beta_2 &\sim \text{Normal}(0,1)\tag{β2 prior} \\
\beta_3 &\sim \text{Normal}(0,1)\tag{β3 prior} \\
\sigma &\sim \text{Uniform}(0,50)\tag{σ prior} \\
\end{align*}$$
```{python}
# Pre-compute the cube of standardized weight.
d['weight_s3'] = d.weight_s**3

with pm.Model() as m4_5_cubic:
    # Mutable data containers, as before, so predictions can reuse the model.
    weight_s = pm.MutableData("w", d.weight_s, dims="obs_id")
    weight_s2 = pm.MutableData("w_s", d.weight_s2, dims="obs_id")
    weight_s3 = pm.MutableData("w_s3", d.weight_s3, dims="obs_id")
    a = pm.Normal("a", mu=178, sigma=20)
    b1 = pm.LogNormal("b1", mu=0, sigma=1)
    b2 = pm.Normal("b2", mu=0, sigma=1)
    b3 = pm.Normal("b3", mu=0, sigma=1)
    sigma = pm.Uniform("sigma", 0, 50)
    mu = pm.Deterministic("mu", a + b1 * weight_s + b2 * weight_s2 + b3 * weight_s3, dims="obs_id")
    height = pm.Normal("height", mu=mu, sigma=sigma, observed=d.height.values, dims="obs_id")
```
```{python}
with m4_5_cubic:
custom_step_m4_5 = utils.QuadraticApproximation([a, b1, b2, b3, sigma], m4_5_cubic)
trace_quap_cubic = pm.sample(draws=10_000, chains=1, tune=0, step=custom_step_m4_5, progressbar=False)
```
```{python}
weight_seq = np.linspace(-2.2, 2, 30)
with m4_5_cubic:
pm.set_data({"w": weight_seq, 'w_s': weight_seq**2, 'w_s3': weight_seq**3})
trace_quap_cubic = pm.sample_posterior_predictive(
trace_quap_cubic,
var_names=["height", "mu"],
return_inferencedata=True,
predictions=True,
extend_inferencedata=True,
progressbar=False
)
```
```{python}
fig, ax = plt.subplots()
hdi_1 = az.plot_hdi(weight_seq, trace_quap_cubic.predictions["height"],
hdi_prob=0.89, smooth=False)
hdi_2 = az.plot_hdi(weight_seq, trace_quap_cubic.predictions["mu"],
hdi_prob=0.89, smooth=False)
mean_line, = ax.plot(weight_seq, trace_quap_cubic.predictions["mu"].mean(axis=1)[0], 'k', label='Mean')
obs_data, = ax.plot(d.weight_s, d.height, 'o', fillstyle='none', label='Observed Data', color='k')
hdi_patch1 = plt.Line2D([0], [1], color='C0', lw=3, label="Posterior Predictive Sample")
hdi_patch2 = plt.Line2D([0], [1], color='C1', lw=3, label="Uncertainty in Mean")
ax.set(xlabel="Weight (standardized)",
       ylabel="Height (cm)",
       xlim=(weight_seq.min(), weight_seq.max()))
ax.legend(handles=[hdi_patch1, hdi_patch2, mean_line, obs_data]);
```
This cubic curve is even more flexible than the parabola, so it fits the data even better.
But it’s not clear that any of these models make a lot of sense. They are good geocentric descriptions of the sample, yes. But there are two problems. First, a better fit to the sample might not actually be a better model. Second, the model contains no biological information. We aren’t learning any causal relationship between height and weight.
**Linear, additive, funky**. The parabolic model of $μ_i$ above is still a “linear model” of the mean, even though the equation is clearly not of a straight line. Unfortunately, the word “linear” means different things in different contexts, and different people use it differently in the same context. What “linear” means in this context is that $μ_i$ is a linear function of any single parameter. Such models have the advantage of being easier to fit to data. They are also often easier to interpret, because they assume that parameters act independently on the mean. They have the disadvantage of being used thoughtlessly. When you have expert knowledge, it is often easy to do better than a linear model. These models are geocentric devices for describing partial correlations. We should feel embarrassed to use them, just so we don’t become satisfied with the phenomenological explanations they provide.
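One way to make the “linear in the parameters” point concrete is to note that the whole mean function can be written as a matrix product, $\mu = X\beta$, where each column of $X$ holds one (possibly transformed) version of the predictor. A minimal sketch, with arbitrary illustrative coefficients:
```{python}
# mu = X @ beta is linear in beta, even though the third column of X
# makes the fitted curve a parabola in weight.
X = np.column_stack([
    np.ones(len(d)),   # intercept
    d.weight_s,        # linear term
    d.weight_s**2,     # quadratic term
])
beta = np.array([146.0, 21.0, -7.8])  # arbitrary values, for illustration only
mu = X @ beta
mu[:5]
```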
**Converting back to natural scale**. The plots above have standardized units on the horizontal axis. These units are sometimes called **z-scores**. But suppose you fit the model using standardized variables, yet want to plot the estimates on the original scale. All that’s really needed is to transform the `xticks` when you plot the raw data:
```{python}
fig, ax = plt.subplots()
hdi_1 = az.plot_hdi(weight_seq, trace_quap_cubic.predictions["height"],
hdi_prob=0.89, smooth=False)
hdi_2 = az.plot_hdi(weight_seq, trace_quap_cubic.predictions["mu"],
hdi_prob=0.89, smooth=False)
mean_line, = ax.plot(weight_seq, trace_quap_cubic.predictions["mu"].mean(axis=1)[0], 'k', label='Mean')
obs_data, = ax.plot(d.weight_s, d.height, 'o', fillstyle='none', label='Observed Data', color='k')
hdi_patch1 = plt.Line2D([0], [1], color='C0', lw=3, label="Posterior Predictive Sample")
hdi_patch2 = plt.Line2D([0], [1], color='C1', lw=3, label="Uncertainty in Mean")
ax.set(xlabel="Weight",
ylabel="Height (cm)",
xlim=(weight_seq.min(), weight_seq.max()))