---
title: "01 - The Garden of Forking Data"
author: "Abdullah Mahmood"
date: "last-modified"
format:
  html:
    theme: cosmo # united is darker
    css: style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: "Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace"
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 3
    toc: true
    toc-location: right
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    html-math-method:
      method: mathjax
      url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
editor: source
jupyter: main
---
**Sources**:
- [Statistical Rethinking - Chapter 2: Small Worlds and Large Worlds](https://www.routledge.com/Statistical-Rethinking-A-Bayesian-Course-with-Examples-in-R-and-STAN/McElreath/p/book/9780367139919)
- [Video: Statistical Rethinking 2023 - 02 - The Garden of Forking Data](https://youtu.be/R1vcdhPBlXA?si=rL3BOz9hHxkPt79m)
#### Imports
```{python}
from init import *
```
## Introduction
The **small world** is the self-contained logical world of the model. Within the small world, all possibilities are nominated. There are no pure surprises. Within the small world of the model, it is important to be able to verify the model’s logic, making sure that it performs as expected under favorable assumptions. Bayesian models have some advantages in this regard, as they have reasonable claims to optimality: No alternative model could make better use of the information in the data and support better decisions, assuming the small world is an accurate description of the real world.
The **large world** is the broader context in which one deploys a model. In the large world, there may be events that were not imagined in the small world. Moreover, the model is always an incomplete representation of the large world, and so will make mistakes, even if all kinds of events have been properly nominated. The logical consistency of a model in the small world is no guarantee that it will be optimal in the large world. But it is certainly a warm comfort.
In this chapter, we will begin to build Bayesian models. The way that Bayesian models learn from evidence is arguably optimal in the small world. When their assumptions approximate reality, they also perform well in the large world. But large world performance has to be demonstrated rather than logically deduced. Passing back and forth between these two worlds allows both formal methods, like Bayesian inference, and informal methods, like peer review, to play an indispensable role.
This chapter focuses on the small world. It explains probability theory in its essential form: counting the ways things can happen. Bayesian inference arises automatically from this perspective. Then the chapter presents the stylized components of a Bayesian statistical model, a model for learning from data. Then it shows you how to animate the model, to produce estimates.
## The Garden of Forking Data
Our goal in this section will be to build Bayesian inference up from humble beginnings, so there is no superstition about it. Bayesian inference is really just counting and comparing of possibilities. Consider by analogy Jorge Luis Borges’ short story “The Garden of Forking Paths.” The story is about a man who encounters a book filled with contradictions. In most books, characters arrive at plot points and must decide among alternative paths. A protagonist may arrive at a man’s home. She might kill the man, or rather take a cup of tea. Only one of these paths is taken—murder or tea. But the book within Borges’ story explores all paths, with each decision branching outward into an expanding garden of forking paths.
This is the same device that Bayesian inference offers. In order to make good inference about what actually happened, it helps to consider everything that could have happened. A Bayesian analysis is a garden of forking data, in which alternative sequences of events are cultivated. As we learn about what did happen, some of these alternative sequences are pruned. In the end, what remains is only what is logically consistent with our knowledge.
This approach provides a quantitative ranking of hypotheses, a ranking that is maximally conservative, given the assumptions and data that go into it. The approach cannot guarantee a correct answer, on large world terms. But it can guarantee the best possible answer, on small world terms, that could be derived from the information fed into it.
Consider the following toy example.
### Counting Possibilities
Suppose there’s a bag, and it contains four marbles. These marbles come in two colors: blue and white. We know there are four marbles in the bag, but we don’t know how many are of each color. We do know that there are five possibilities: (1) \['W', 'W', 'W', 'W'\], (2) \['B', 'W', 'W', 'W'\], (3) \['B', 'B', 'W', 'W'\], (4) \['B', 'B', 'B', 'W'\], (5) \['B', 'B', 'B', 'B'\]. These are the only possibilities consistent with what we know about the contents of the bag. Call these five possibilities the ***conjectures***.
Our goal is to figure out which of these conjectures is most plausible, given some evidence about the contents of the bag. We do have some evidence: A sequence of three marbles is pulled from the bag, one at a time, replacing the marble each time and shaking the bag before drawing another marble. The sequence that emerges is: \['B', 'W', 'B'\], in that order. These are the data.
So now let’s plant the garden and see how to use the data to infer what’s in the bag. Let’s begin by considering just the single conjecture, \['B', 'W', 'W', 'W'\], that the bag contains one blue and three white marbles. On the first draw from the bag, one of four things could happen, corresponding to one of four marbles in the bag. So we can visualize the possibilities branching outward:
*(Figure: the possibilities branching outward from the first draw)*
Notice that even though the three white marbles look the same from a data perspective—we just record the color of the marbles, after all—they are really different events. This is important, because it means that there are three more ways to see W than to see B.
Now consider the garden as we get another draw from the bag. It expands the garden out one layer:
*(Figure: the garden expanded one layer, after the second draw)*
Now there are 16 possible paths through the garden, one for each pair of draws. On the second draw from the bag, each of the paths above again forks into four possible paths. Why?
Because we believe that our shaking of the bag gives each marble a fair chance at being drawn, regardless of which marble was drawn previously. The third layer is built in the same way, and the full garden is shown below. There are $4^3 = 64$ possible paths in total.
*(Figure: the full garden of 64 possible paths)*
As we consider each draw from the bag, some of these paths are logically eliminated. The first draw turned out to be 'B', recall, so the three white paths at the bottom of the garden are eliminated right away. If you imagine the real data tracing out a path through the garden, it must have passed through the one blue path near the origin. The second draw from the bag produces 'W', so three of the paths forking out of the first blue marble remain. As the data trace out a path, we know it must have passed through one of those three white paths (after the first blue path), but we don’t know which one, because we recorded only the color of each marble. Finally, the third draw is 'B'. Each of the remaining three paths in the middle layer sustains one blue path, leaving a total of three ways for the sequence to appear, assuming the bag contains \['B', 'W', 'W', 'W'\].
*(Figure: the garden with logically eliminated paths grayed out)*
The figure above shows the garden again, now with logically eliminated paths grayed out. We can’t be sure which of those three paths the actual data took. But as long as we’re considering only the possibility that the bag contains one blue and three white marbles, we can be sure that the data took one of those three paths. Those are the only paths consistent with both our knowledge of the bag’s contents (four marbles, white or blue) and the data ('B', 'W', 'B').
This demonstrates that there are three (out of 64) ways for a bag containing \['B', 'W', 'W', 'W'\] to produce the data. We have no way to decide among these three ways. The inferential power comes from comparing this count to the numbers of ways each of the other conjectures of the bag’s contents could produce the same data. For example, consider the conjecture \['W', 'W', 'W', 'W'\]. There are zero ways for this conjecture to produce the observed data, because even one 'B' observation is logically incompatible with it. The conjecture \['B', 'B', 'B', 'B'\] is likewise logically incompatible with the data. So we can eliminate these two conjectures, because neither provides even a single path that is consistent with the data.
The figure below displays the full garden now, for the remaining three conjectures: \['B', 'W', 'W', 'W'\], \['B', 'B', 'W', 'W'\], and \['B', 'B', 'B', 'W'\]. The upper-left wedge displays the same garden as earlier. The upper-right shows the analogous garden for the conjecture that the bag contains three blue marbles and one white marble. And the bottom wedge shows the garden for two blue and two white marbles. Now we count up all of the ways each conjecture could produce the observed data. For one blue and three white, there are three ways, as we counted already. For two blue and two white, there are eight paths forking through the garden that are logically consistent with the observed sequence. For three blue and one white, there are nine paths that survive.
*(Figure: garden_of_forking_data — the full gardens for the three remaining conjectures)*
To summarize, we’ve considered five different conjectures about the contents of the bag, ranging from zero blue marbles to four blue marbles. For each of these conjectures, we’ve counted up how many sequences, paths through the garden of forking data, could potentially produce the observed data, ('B', 'W', 'B'):
| Conjecture | Ways to Produce (B, W, B) |
|----------------|---------------------------|
| \[W, W, W, W\] | 0 x 4 x 0 = 0 |
| \[B, W, W, W\] | 1 x 3 x 1 = 3 |
| \[B, B, W, W\] | 2 x 2 x 2 = 8 |
| \[B, B, B, W\] | 3 x 1 x 3 = 9 |
| \[B, B, B, B\] | 4 x 0 x 4 = 0 |
Notice that the number of ways to produce the data, for each conjecture, can be computed by first counting the number of paths in each “ring” of the garden and then by multiplying these counts together. This is just a computational device. It tells us the same thing as the figure above but without having to draw the garden. The fact that numbers are multiplied during calculation doesn’t change the fact that this is still just counting of logically possible paths. This point will come up again, when you meet a formal representation of Bayesian inference.
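The ring-count shortcut can be checked against brute-force enumeration of all $4^3 = 64$ paths. Below is a minimal sketch; the helper names (`count_paths`, `ring_product`) are illustrative, not from the text:

```{python}
import itertools
import numpy as np

data = ("B", "W", "B")

def count_paths(bag, data):
    """Brute force: enumerate every sequence of draws (with replacement,
    one path per marble per draw) and count those matching the observed colors."""
    return sum(
        all(m == d for m, d in zip(seq, data))
        for seq in itertools.product(bag, repeat=len(data))
    )

def ring_product(bag, data):
    """Shortcut: multiply the per-draw counts of color-matching marbles."""
    return int(np.prod([bag.count(d) for d in data]))

for bag in (["W", "W", "W", "W"], ["B", "W", "W", "W"], ["B", "B", "W", "W"],
            ["B", "B", "B", "W"], ["B", "B", "B", "B"]):
    assert count_paths(bag, data) == ring_product(bag, data)
    print(bag, ring_product(bag, data))
```

Both methods agree on the counts 0, 3, 8, 9, 0 from the table above, which is the sense in which the multiplication is "just counting."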
So what good are these counts? By comparing these counts, we have part of a solution for a way to rate the relative plausibility of each conjectured bag composition. But it’s only a part of a solution, because in order to compare these counts we first have to decide *how many ways each conjecture could itself be realized.* We might argue that when we have no reason to assume otherwise, we can just consider *each conjecture equally plausible and compare the counts directly*. But often we do have reason to assume otherwise.
### Combining Other Information
We may have additional information about the relative plausibility of each conjecture. This information could arise from knowledge of how the contents of the bag were generated. It could also arise from previous data. Whatever the source, it would help to have a way to combine different sources of information to update the plausibilities. Luckily there is a natural solution: Just multiply the counts.
To grasp this solution, suppose we’re willing to say each conjecture is equally plausible at the start. So we just compare the counts of ways in which each conjecture is compatible with the observed data. This comparison suggests that \['B', 'B', 'B', 'W'\] is slightly more plausible than \['B', 'B', 'W', 'W'\], and both are about three times more plausible than \['B', 'W', 'W', 'W'\]. Since these are our initial counts, and we are going to update them next, let’s label them *priors*.
Now suppose we draw another marble from the bag to get another observation: 'B'. Now you have two choices. You could start all over again, making a garden with four layers to trace out the paths compatible with the data sequence \['B', 'W', 'B', 'B'\]. Or you could take the previous counts—the prior counts—over conjectures (0, 3, 8, 9, 0) and just update them in light of the new observation. It turns out that these two methods are mathematically identical, as long as the *new observation is logically independent of the previous observations*.
Here’s how to do it. First we count the numbers of ways each conjecture could produce the new observation, 'B'. Then we multiply each of these new counts by the prior numbers of ways for each conjecture. In table form:
| Conjecture | Ways to Produce B | Prior Counts | New Count |
|----------------|-------------------|--------------|------------|
| \[W, W, W, W\] | 0 | 0 | 0 x 0 = 0 |
| \[B, W, W, W\] | 1 | 3 | 1 x 3 = 3 |
| \[B, B, W, W\] | 2 | 8 | 2 x 8 = 16 |
| \[B, B, B, W\] | 3 | 9 | 3 x 9 = 27 |
| \[B, B, B, B\] | 4 | 0 | 4 x 0 = 0 |
The new counts in the right-hand column above summarize all the evidence for each conjecture. As new data arrive, and provided those data are independent of previous observations, then the number of logically possible ways for a conjecture to produce all the data up to that point can be computed just by multiplying the new count by the old count.
This updating approach amounts to nothing more than asserting that (1) when we have previous information suggesting there are $W_{prior}$ ways for a conjecture to produce a previous observation $D_{prior}$ and (2) we acquire new observation $D_{new}$ that the same conjecture can produce in $W_{new}$ ways, then (3) the number of ways the conjecture can account for both $D_{prior}$ as well as $D_{new}$ is just the product $W_{prior} \times W_{new}$. For example, in the table above the conjecture \['B', 'B', 'W', 'W'\] has $W_{prior}=8$ ways to produce $D_{prior}=$ \['B', 'W', 'B'\]. It also has $W_{new} = 2$ ways to produce the new observation $D_{new}=$ 'B'. So there are $8 \times 2 = 16$ ways for the conjecture to produce both $D_{prior}$ and $D_{new}$. Why multiply? Multiplication is just a shortcut to enumerating and counting up all of the paths through the garden that could produce all the observations.
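The update step in the table can be sketched in a few lines; the variable names are illustrative:

```{python}
import numpy as np

bags = [["W", "W", "W", "W"], ["B", "W", "W", "W"], ["B", "B", "W", "W"],
        ["B", "B", "B", "W"], ["B", "B", "B", "B"]]

prior_counts = np.array([0, 3, 8, 9, 0])               # ways to produce ['B', 'W', 'B']
ways_new = np.array([bag.count("B") for bag in bags])  # ways to produce the new 'B'

new_counts = prior_counts * ways_new                   # update by multiplication
print(new_counts)  # [ 0  3 16 27  0]
```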
In this example, the prior data and new data are of the same type: marbles drawn from the bag. But in general, the prior data and new data can be of different types. Suppose for example that someone from the marble factory tells you that blue marbles are rare. So for every bag containing \['B', 'B', 'B', 'W'\], they made two bags containing \['B', 'B', 'W', 'W'\] and three bags containing \['B', 'W', 'W', 'W'\]. They also ensured that every bag contained at least one blue and one white marble. We can update our counts again:
| Conjecture | Prior Counts | Factory Count | New Count |
|----------------|--------------|---------------|-------------|
| \[W, W, W, W\] | 0 | 0 | 0 x 0 = 0 |
| \[B, W, W, W\] | 3 | 3 | 3 x 3 = 9 |
| \[B, B, W, W\] | 16 | 2 | 16 x 2 = 32 |
| \[B, B, B, W\] | 27 | 1 | 27 x 1 = 27 |
| \[B, B, B, B\] | 0 | 0 | 0 x 0 = 0 |
Now the conjecture \['B', 'B', 'W', 'W'\] is most plausible, but barely better than \['B', 'B', 'B', 'W'\]. Is there a threshold difference in these counts at which we can safely decide that one of the conjectures is the correct one? You’ll spend the next chapter exploring that question.
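The factory information enters as one more multiplicative update; a sketch using the counts from the table above:

```{python}
import numpy as np

prior_counts = np.array([0, 3, 16, 27, 0])  # ways to produce ['B', 'W', 'B', 'B']
factory = np.array([0, 3, 2, 1, 0])         # relative bag frequencies from the factory

updated = prior_counts * factory
print(updated)           # [ 0  9 32 27  0]
print(updated.argmax())  # index of the most plausible conjecture: ['B', 'B', 'W', 'W']
```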
------------------------------------------------------------------------
**Original Ignorance**
Which assumption should we use, when there is no previous information about the conjectures? The most common solution is to assign an equal number of ways that each conjecture could be correct, before seeing any data. This is sometimes known as the **principle of indifference:** When there is no reason to say that one conjecture is more plausible than another, weigh all of the conjectures equally. This book neither uses nor endorses “ignorance” priors. As we’ll see in later chapters, the structure of the model and the scientific context always provide information that allows us to do better than ignorance.
For the sort of problems we examine in this book, the principle of indifference results in inferences very comparable to mainstream non-Bayesian approaches, most of which contain implicit equal weighting of possibilities. For example, a typical non-Bayesian confidence interval weighs equally all of the possible values a parameter could take, regardless of how implausible some of them are. In addition, many non-Bayesian procedures have moved away from equal weighting, through the use of penalized likelihood and other methods.
------------------------------------------------------------------------
### From Counts to Probability
It is helpful to think of this strategy as adhering to a principle of honest ignorance: *When we don't know what caused the data, potential causes that may produce the data in more ways are more plausible*. This leads us to count paths through the garden of forking data. We’re counting the implications of assumptions.
It’s hard to use these counts though, so we almost always standardize them in a way that transforms them into probabilities. Why is it hard to work with the counts? First, since relative value is all that matters, the size of the counts 3, 8, and 9 contain no information of value. They could just as easily be 30, 80, and 90. The meaning would be the same. It’s just the relative values that matter. Second, as the amount of data grows, the counts will very quickly grow very large and become difficult to manipulate. By the time we have 10 data points, there are already more than one million possible sequences. We’ll want to analyze data sets with thousands of observations, so explicitly counting these things isn’t practical.
Luckily, there’s a mathematical way to compress all of this. Specifically, we define the updated plausibility of each possible composition of the bag, after seeing the data, as:
$$
\begin{align*}
\text{Plausibility of ['B', 'W', 'W', 'W']} &\text{ after seeing ['B', 'W', 'B']} \\
&\propto \\
\text{ways ['B', 'W', 'W', 'W']} &\text{ can produce ['B', 'W', 'B']} \\
&\times \\
\text{prior plausibility} &\text{ ['B', 'W', 'W', 'W']}
\end{align*}
$$
That little $\propto$ means *proportional to.* We want to compare the plausibility of each possible composition. So it’ll be helpful to define $p$ as the proportion of the marbles that is blue. For \['B', 'W', 'W', 'W'\], $p = 1/4 = 0.25$. And so we can now write:
$$
\text{plausibility of } p \text{ after } D_{new} \propto \text{ways } p \text{ can produce } D_{new} \times \text{prior plausibility of } p
$$
The above just means that for any value $p$ can take, we judge the plausibility of that value $p$ as proportional to the number of ways it can get through the garden of forking data. This expression just summarizes the calculations you did in the tables of the previous section.
Finally, we construct probabilities by standardizing the plausibility so that the sum of the plausibilities for all possible conjectures will be one. All you need to do in order to standardize is to add up all of the products, one for each value $p$ can take, and then divide each product by the sum of products:
$$
\text{plausibility of } p \text{ after } D_{new} = \frac{\text{ways }p\text{ can produce } D_{new} \times \text{prior plausibility }p}{\text{sum of products}}
$$
A worked example is needed for this to really make sense. So consider again the table from before, now updated using our definitions of $p$ and "plausibility":
| Possible Compositions | p | Ways to Produce Data | Plausibility |
|-----------------------|------|----------------------|--------------|
| \[W, W, W, W\] | 0 | 0 | 0.00 |
| \[B, W, W, W\] | 0.25 | 3 | 0.15 |
| \[B, B, W, W\] | 0.5 | 8 | 0.40 |
| \[B, B, B, W\] | 0.75 | 9 | 0.45 |
| \[B, B, B, B\] | 1 | 0 | 0.00 |
You can quickly compute these plausibilities in Python:
```{python}
ways = np.array([0, 3, 8, 9, 0])
ways/np.sum(ways)
```
The values in `ways` are the products mentioned before. And `np.sum(ways)` is the denominator "sum of products" in the expression near the top of the page.
These plausibilities are also *probabilities*—they are non-negative (zero or positive) real numbers that sum to one. And all of the mathematical things you can do with probabilities you can also do with these values. Specifically, each piece of the calculation has a direct partner in applied probability theory. These partners have stereotyped names, so it’s worth learning them, as you’ll see them again and again.
- A conjectured proportion of blue marbles, *p*, is usually called a **parameter** value. It's just a way of indexing possible explanations of the data.
- The relative number of ways that a value *p* can produce the data is usually called a **likelihood**. It is derived by enumerating all the possible data sequences that could have happened and then eliminating those sequences inconsistent with the data.
- The prior plausibility of any specific *p* is usually called the **prior probability**.
- The new, updated plausibility of any specific *p* is usually called the **posterior probability**.
In the next major section, you’ll meet the more formal notation for these objects and see how they compose a simple statistical model.
------------------------------------------------------------------------
#### Analytical Solution
- A single toss of the globe has a probability $p$ of producing a water ($W$) observation. It has a probability $1−p$ of producing a land ($L$) observation.
- Each toss of the globe is independent of the others.
Results suggest the following analytical solution: $$\text{ways}(W, L \mid p) = (Rp)^W \times \big((1 - p)R\big)^L$$ where $R$ is the resolution (the number of possible sides of the globe), in this case 4.
```{python}
def plausibility_count(p, W, L, resolution=4):
"""
p: proportion of water
W: number of water observations in the sample
L: number of land observations in sample
resolution: number of globe sides
"""
return (resolution * p) ** W * ((1 - p)*resolution) ** L
```
```{python}
RESOLUTION = 4
# data: ['B', 'W', 'B', 'B'] -> W = 3, L = 1
W = 3
L = 1
p_water = np.linspace(0, 1, 5)
n_possible_ways = plausibility_count(p_water, W, L, RESOLUTION)
n_possible_ways
```
------------------------------------------------------------------------
#### Randomization
When you shuffle a deck of cards or assign subjects to treatments by flipping a coin, it is common to say that the resulting deck and treatment assignments are *randomized*. What does it mean to randomize something? It just means that we have processed the thing so that we know almost nothing about its arrangement. Shuffling a deck of cards changes our state of knowledge, so that we no longer have any specific information about the ordering of cards. However, the bonus that arises from this is that, if we really have shuffled enough to erase any prior knowledge of the ordering, then the order the cards end up in is very likely to be one of the many orderings with high **information entropy**.
------------------------------------------------------------------------
## Building a Model
By working with probabilities instead of raw counts, Bayesian inference is made much easier, but it looks much harder. So in this section, we follow up on the garden of forking data by presenting the conventional form of a Bayesian statistical model. The toy example we’ll use here has the anatomy of a typical statistical analysis, so it’s the style that you’ll grow accustomed to. But every piece of it can be mapped onto the garden of forking data. The logic is the same.
Suppose you have a globe representing our planet, the Earth. This version of the world is small enough to hold in your hands. You are curious how much of the surface is covered in water. You adopt the following strategy: You will toss the globe up in the air. When you catch it, you will record whether or not the surface under your right index finger is water or land. Then you toss the globe up in the air again and repeat the procedure. This strategy generates a sequence of samples from the globe. The first nine samples might look like:
$$
W \space L \space W \space W \space W \space L \space W \space L \space W
$$
where W indicates water and L indicates land. So in this example you observe six W (water) observations and three L (land) observations. Call this sequence of observations the *data*.
To get the logic moving, we need to make assumptions, and these assumptions constitute the model. Designing a simple Bayesian model benefits from a design loop with three steps:
1. **Data Story:** Motivate the model by narrating how the data might arise.
2. **Update**: Educate your model by feeding it the data.
3. **Evaluate**: All statistical models require supervision, leading to model revision.
The next sections walk through these steps, in the context of the globe tossing evidence.
### A Data Story
Bayesian data analysis usually means producing a story for how the data came to be. This story may be *descriptive*, specifying associations that can be used to predict outcomes, given observations. Or it may be *causal*, a theory of how some events produce other events. Typically, any story you intend to be causal may also be descriptive. But many descriptive stories are hard to interpret causally. However, all data stories are complete, in the sense that they are sufficient for specifying an algorithm for simulating new data. In the next chapter, you’ll see examples of doing just that, as simulating new data is useful not only for model criticism but also for model construction.
You can motivate your data story by trying to explain how each piece of data is born. This usually means describing aspects of the underlying reality as well as the sampling process. The data story in this case is simply a restatement of the sampling process:
1. The true proportion of water covering the globe is $p$.
2. A single toss of the globe has a probability $p$ of producing a water (W) observation. It has a probability $1 - p$ of producing a land (L) observation.
3. Each toss of the globe is independent of the others.
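Because the data story above is a complete sampling process, it is already enough to simulate new data. A minimal sketch, where the value of $p$ and the sample size are arbitrary illustrative choices:

```{python}
import numpy as np

rng = np.random.default_rng(0)

def toss_globe(p, n):
    """Simulate n independent tosses: 'W' with probability p, else 'L'."""
    return np.where(rng.random(n) < p, "W", "L")

sample = toss_globe(p=0.7, n=9)
print(sample, "->", np.sum(sample == "W"), "W observations")
```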
The data story is then translated into a formal probability model. This probability model is easy to build, because the construction process can be usefully broken down into a series of component decisions. Before meeting these components, however, it’ll be useful to visualize how a Bayesian model behaves. After you’ve become acquainted with how such a model learns from data, we’ll pop the machine open and investigate its engineering.
------------------------------------------------------------------------
#### The value of storytelling
The data story has value, even if you quickly abandon it and never use it to build a model or simulate new observations. Indeed, it is important to eventually discard the story, because many different stories correspond to the same model. As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered. Most data stories are much more specific than are the verbal hypotheses that inspire data collection. Hypotheses can be vague, such as “it’s more likely to rain on warm days.” When you are forced to consider sampling and measurement and make a precise statement of how temperature predicts rain, many stories and resulting models will be consistent with the same vague hypothesis. Resolving that ambiguity often leads to important realizations and model revisions, before any model is fit to data.
------------------------------------------------------------------------
### Bayesian Updating
Our problem is one of using the evidence—the sequence of globe tosses—to decide among different possible proportions of water on the globe. These proportions are like the conjectured marbles inside the bag, from earlier in the chapter. Each possible proportion may be more or less plausible, given the evidence. A Bayesian model begins with one set of plausibilities assigned to each of these possibilities. These are the prior plausibilities. Then it updates them in light of the data, to produce the posterior plausibilities. This updating process is a kind of learning, called **Bayesian updating**. The details of this updating—how it is mechanically achieved—can wait until later in the chapter. For now, let’s look only at how such a machine behaves.
For the sake of the example only, let’s program our Bayesian machine to initially assign the same plausibility to every proportion of water, every value of $p$. We’ll do better than this later. Now look at the top-left plot below. The dashed horizontal line represents this initial plausibility of each possible value of $p$. After seeing the first toss, which is a “W,” the model updates the plausibilities to the solid line. The plausibility of $p=0$ has now fallen to exactly zero—the equivalent of “impossible.” Why? Because we observed at least one speck of water on the globe, so now we know there is some water. The model executes this logic automatically. You don’t have to instruct it to account for this consequence. Probability theory takes care of it for you, because it is essentially counting paths through the garden of forking data, as in the previous section.
Likewise, the plausibility of $p > 0.5$ has increased. This is because there is not yet any evidence that there is land on the globe, so the initial plausibilities are modified to be consistent with this. Note however that the relative plausibilities are what matter, and there isn’t yet much evidence. So the differences in plausibility are not yet very large. In this way, the amount of evidence seen so far is embodied in the plausibilities of each value of $p$.
In the remaining plots in below, the additional samples from the globe are introduced to the model, one at a time. Each dashed curve is just the solid curve from the previous plot, moving left to right and top to bottom. Every time a “W” is seen, the peak of the plausibility curve moves to the right, towards larger values of $p$. Every time an “L” is seen, it moves the other direction. The maximum height of the curve increases with each sample, meaning that fewer values of $p$ amass more plausibility as the amount of evidence increases. As each new observation is added, the curve is updated consistent with all previous observations.
Notice that every updated set of plausibilities becomes the initial plausibilities for the next observation. Every conclusion is the starting point for future inference. However, this updating process works backwards, as well as forwards. Given the final set of plausibilities in the bottom-right plot, and knowing the final observation (W), it is possible to mathematically divide out the observation, to infer the previous plausibility curve. So the data could be presented to your model in any order, or all at once even. In most cases, you will present the data all at once, for the sake of convenience. But it’s important to realize that this merely represents abbreviation of an iterated learning process.
```{python}
def globe_toss_plot(observations):
    def plot_beta_posterior(ax, k, n, **kwargs):
        p = np.linspace(0, 1)
        # Beta continuous probability density function
        pmf = stats.beta.pdf(p, k + 1, n - k + 1)
        return ax.plot(p, pmf, **kwargs)

    fig, axes = plt.subplots(3, 3, figsize=(8, 8))
    for i, ax in enumerate(axes.flat):
        obs_prev, obs_new = observations[:i], observations[:i + 1]
        # dashed curve: plausibility before the newest observation
        n, k = len(obs_prev), np.sum(obs_prev)
        plot_beta_posterior(ax, k, n, linestyle='--')
        # solid curve: plausibility after the newest observation
        n, k = len(obs_new), np.sum(obs_new)
        color = 'C1' if observations[i] == 1 else 'C0'
        plot_beta_posterior(ax, k, n, color=color, linewidth=2, alpha=.5)
        obs_str = ''.join("W" if x == 1 else "L" for x in obs_new)
        ax.set(title=obs_str)
        ax.set_yticks([])
    fig.supxlabel('Proportion of Water')
    fig.supylabel('Plausibility')

# Globe tossing simulation
# W = 1, L = 0
observations = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])
utils.inline_plot(globe_toss_plot, observations)
```
------------------------------------------------------------------------
#### Sample Size and Reliable Inference
It is common to hear that there is a minimum number of observations for a useful statistical estimate. For example, there is a widespread superstition that 30 observations are needed before one can use a Gaussian distribution. Why? In non-Bayesian statistical inference, procedures are often justified by the method’s behavior at very large sample sizes, so-called *asymptotic* behavior. As a result, performance at small sample sizes is questionable.
In contrast, Bayesian estimates are valid for any sample size. This does not mean that more data isn’t helpful—it certainly is. Rather, the estimates have a clear and valid interpretation, no matter the sample size. But the price for this power is dependency upon the initial plausibilities, the prior. If the prior is a bad one, then the resulting inference will be misleading. There’s no free lunch, when it comes to learning about the world. A Bayesian golem must choose an initial plausibility, and a non-Bayesian golem must choose an estimator. Both golems pay for lunch with their assumptions.
------------------------------------------------------------------------
### Evaluate
The Bayesian model learns in a way that is demonstrably optimal, provided that it accurately describes the real, large world. This is to say that your Bayesian machine guarantees perfect inference within the small world. No other way of using the available information, beginning with the same state of information, could do better.
Don’t get too excited about this logical virtue, however. The calculations may malfunction, so results always have to be checked. And if there are important differences between the model and reality, then there is no logical guarantee of large world performance. And even if the two worlds did match, any particular sample of data could still be misleading. So it’s worth keeping in mind at least two cautious principles.
First, the model’s certainty is no guarantee that the model is a good one. As the amount of data increases, the globe tossing model will grow increasingly sure of the proportion of water. This means that the curves in the plot above will become increasingly narrow and tall, restricting plausible values within a very narrow range. But models of all sorts—Bayesian or not—can be very confident about an inference, even when the model is seriously misleading. This is because the inferences are conditional on the model. What your model is telling you is that, given a commitment to this particular model, it can be very sure that the plausible values are in a narrow range. Under a different model, things might look different.
Second, it is important to supervise and critique your model’s work. Consider again the fact that the updating in the previous section works in any order of data arrival. We could shuffle the order of the observations, as long as six W’s and three L’s remain, and still end up with the same final plausibility curve. That is only true, however, because the model assumes that order is irrelevant to inference. When something is irrelevant to the machine, it won’t affect the inference directly. But it may affect it indirectly, because the data will depend upon order. So it is important to check the model’s inferences in light of aspects of the data it does not know about. Such checks are an inherently creative enterprise, left to the analyst and the scientific community.
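This order-invariance is easy to verify numerically: updating a flat prior one toss at a time on a grid gives the same final curve for any shuffle of six W’s and three L’s. A minimal sketch (the helper name `grid_posterior` and the grid size are illustrative):

```{python}
import numpy as np

def grid_posterior(observations, grid_points=1000):
    """Sequentially update a flat prior on a grid, one toss at a time."""
    p_grid = np.linspace(0, 1, grid_points)
    plausibility = np.ones(grid_points)  # flat initial plausibility
    for obs in observations:
        # Bernoulli plausibility of a single W (1) or L (0)
        likelihood = p_grid if obs == 1 else (1 - p_grid)
        plausibility = plausibility * likelihood
        plausibility = plausibility / plausibility.sum()  # re-standardize
    return plausibility

original = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])  # six W, three L
shuffled = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])  # same counts, new order

# The final plausibility curves agree, regardless of toss order
print(np.allclose(grid_posterior(original), grid_posterior(shuffled)))
```

The model only “sees” the counts, so any permutation of the same nine tosses lands on the same posterior.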
For now, note that the goal is not to test the truth value of the model’s assumptions. We know the model’s assumptions are never exactly right, in the sense of matching the true data generating process. Therefore there’s no point in checking if the model is true. Failure to conclude that a model is false must be a failure of our imagination, not a success of the model. Moreover, models do not need to be exactly true in order to produce highly precise and useful inferences. All manner of small world assumptions about error distributions and the like can be violated in the large world, but a model may still produce a perfectly useful estimate. This is because models are essentially information processing machines, and there are some surprising aspects of information that cannot be easily captured by framing the problem in terms of the truth of assumptions.
Instead, the objective is to check the model’s adequacy for some purpose. This usually means asking and answering additional questions, beyond those that originally constructed the model. Both the questions and answers will depend upon the scientific context. So it’s hard to provide general advice. There will be many examples, throughout the book, and of course the scientific literature is replete with evaluations of the suitability of models for different jobs—prediction, comprehension, measurement, and persuasion.
------------------------------------------------------------------------
#### Deflationary Statistics
It may be that Bayesian inference is the best general purpose method of inference known. However, Bayesian inference is much less powerful than we’d like it to be. There is no approach to inference that provides universal guarantees. No branch of applied mathematics has unfettered access to reality, because math is not discovered, like the proton. Instead it is invented, like the shovel.
------------------------------------------------------------------------
## Components of the Model
Now that you’ve seen how the Bayesian model behaves, it’s time to open up the machine and learn how it works. Consider three different things that we counted in the previous sections.
1. The number of ways each conjecture could produce an observation
2. The accumulated number of ways each conjecture could produce the entire data
3. The initial plausibility of each conjectured cause of the data
Each of these things has a direct analog in conventional probability theory. And so the usual way we build a statistical model involves choosing distributions and devices for each that represent the relative numbers of ways things can happen.
In this section, you’ll meet these components in some detail and see how each relates to the counting you did earlier in the chapter. The job in front of us is really nothing more than naming all of the variables and defining each. We’ll take these tasks in turn.
### Variables
Variables are just symbols that can take on different values. In a scientific context, variables include things we wish to infer, such as proportions and rates, as well as things we might observe, the data. In the globe tossing model, there are three variables.
The first variable is our target of inference, $p$, the proportion of water on the globe. This variable cannot be observed. Unobserved variables are usually called **parameters**. But while $p$ itself is unobserved, we can infer it from the other variables.
The other variables are the observed variables, the counts of water and land. Call the count of water $W$ and the count of land $L$. The sum of these two variables is the number of globe tosses: $N = W + L$.
### Definitions
Once we have the variables listed, we then have to define each of them. In defining each, we build a model that relates the variables to one another. Remember, the goal is to count all the ways the data could arise, given the assumptions. This means, as in the globe tossing model, that for each possible value of the unobserved variables, such as $p$, we need to define the relative number of ways—the probability—that the values of each observed variable could arise. And then for each unobserved variable, we need to define the prior plausibility of each value it could take. I appreciate that this is all a bit abstract. So here are the specifics, for the globe.
#### Observed Variable
For the count of water $W$ and land $L$, we define how plausible any combination of $W$ and $L$ would be, for a specific value of $p$. This is very much like the marble counting we did earlier in the chapter. Each specific value of $p$ corresponds to a specific plausibility of the data, as in the plot above.
So that we don’t have to literally count, we can use a mathematical function that tells us the right plausibility. In conventional statistics, a distribution function assigned to an observed variable is usually called a **likelihood**. That term has special meaning in non-Bayesian statistics, however. We will be able to do things with our distributions that non-Bayesian models forbid. So I will sometimes avoid the term likelihood and just talk about distributions of variables. But when someone says, “likelihood,” they will usually mean a distribution function assigned to an observed variable.
In the case of the globe tossing model, the function we need can be derived directly from the data story. Begin by nominating all of the possible events. There are two: *water* ($W$) and *land* ($L$). There are no other events. The globe never gets stuck to the ceiling, for example. When we observe a sample of $W$’s and $L$’s of length $N$ (nine in the actual sample), we need to say how likely that exact sample is, out of the universe of potential samples of the same length. That might sound challenging, but it’s the kind of thing you get good at very quickly, once you start practicing.
In this case, once we add our assumptions that (1) every toss is independent of the other tosses and (2) the probability of W is the same on every toss, probability theory provides a unique answer, known as the *binomial distribution*. This is the common “coin tossing” distribution. And so the probability of observing $W$ waters and $L$ lands, with a probability $p$ of water on each toss, is:
$$
Pr(W, L \mid p) = \frac{(W + L)!}{W!\,L!} p^W(1-p)^L
$$
The function is read as: *The counts of “water”* $W$ *and “land”* $L$ *are distributed binomially, with probability* $p$ *of “water” on each toss.*
And the binomial distribution formula is built into `scipy.stats`, so you can easily compute the likelihood of the data—six $W$’s in nine tosses—under any value of $p$ with:
```{python}
stats.binom.pmf(k=6, n=9, p=0.5)
```
That number is the relative number of ways to get six water, holding $p$ at 0.5 and $N= W + L$ at nine. So it does the job of counting relative number of paths through the garden. Change the 0.5 to any other value, to see how the value changes.
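Doing exactly that in a small loop makes the point quickly (a sketch; the candidate values of $p$ are arbitrary):

```{python}
from scipy import stats

# Relative plausibility of six W in nine tosses, for several candidate p values
for p in [0.25, 0.50, 6 / 9, 0.75, 1.00]:
    print(f"p = {p:.2f}  ->  {stats.binom.pmf(k=6, n=9, p=p):.4f}")
```

The sample is most plausible near $p = 6/9 \approx 0.67$, and $p = 1$ is ruled out entirely, because three L’s were observed.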
Later, we’ll see that the binomial distribution is rather special, because it represents the **maximum entropy** way to count binary events. “Maximum entropy” might sound like a bad thing. Isn’t entropy disorder? Doesn’t “maximum entropy” mean the death of the universe? Actually it means that the distribution contains no additional information other than: There are two events, and the probabilities of each in each trial are $p$ and $1−p$.
------------------------------------------------------------------------
#### Binomial & Beta Distribution
The binomial *probability mass function* (pmf) has the form:
$$
P(k, n \mid p) = \binom{n}{k} p^k (1-p)^{n-k}
$$
This analytical function gives us the pmf directly, standing in for the literal path counting as the number of possibilities grows toward $\infty$; the coefficient $\binom{n}{k}$ counts the orderings and makes the distribution sum to 1. As a likelihood, the binomial describes the probability of observing $k$ successes in $n$ independent Bernoulli trials, where each trial has probability $p$ of success. It is a discrete distribution and is used when the data consist of counts (e.g., the number of heads in a series of coin tosses).
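As a quick check, the formula can be computed by hand and compared with the built-in `scipy.stats` version (a minimal sketch; `binom_pmf` is just an illustrative helper name):

```{python}
import math
from scipy import stats

def binom_pmf(k, n, p):
    """Binomial pmf computed directly from the formula above."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# The hand-rolled formula matches scipy's implementation
print(binom_pmf(6, 9, 0.5), stats.binom.pmf(k=6, n=9, p=0.5))
```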
The beta *probability density function* (pdf) is $f(p \mid \alpha, \beta)$; in the plotting code earlier, a flat prior was assumed, so $\alpha = k + 1$ and $\beta = n - k + 1$. The beta pdf has the form:
$$
P(p \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1 − p)^{\beta-1}
$$
This function gives us a density over the continuous parameter $p$, and in this model it plays the role of the posterior distribution.
The beta distribution is a conjugate prior of the binomial likelihood. The beta distribution is used to model uncertainty about a probability $p$, and describes the probability distribution of $p$ itself.
When you use a beta prior with a binomial likelihood, the posterior distribution of $p$ is also a beta distribution. This is called **conjugacy**:
$$
\text{Posterior} \propto \text{Likelihood} \times \text{Prior}
$$
For a binomial likelihood and beta prior, the posterior is:
$$
\text{Beta}(\alpha + k, \beta + n - k)
$$
- Use the **binomial distribution** when:
- You have a fixed number of trials $n$.
- You want to calculate the probability of observing $k$ successes.
- The probability $p$ is known or fixed.
- Use the **beta distribution** when:
- You want to model uncertainty about the probability $p$.
- You are performing Bayesian inference and need a prior or posterior distribution for $p$.
- You are combining a binomial likelihood with a beta prior to obtain a posterior distribution.
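The conjugate update can be verified numerically: a flat $\text{Beta}(1, 1)$ prior combined with six W in nine tosses should give a $\text{Beta}(1 + 6,\ 1 + 3) = \text{Beta}(7, 4)$ posterior. A sketch (grid size is arbitrary):

```{python}
import numpy as np
from scipy import stats

k, n = 6, 9          # six W in nine tosses
a, b = 1, 1          # flat Beta(1, 1) prior

p_grid = np.linspace(0, 1, 1001)

# Brute force: prior density times binomial likelihood, standardized on the grid
unstd = stats.beta.pdf(p_grid, a, b) * stats.binom.pmf(k, n, p_grid)
brute = unstd / unstd.sum()

# Conjugacy: the posterior is Beta(a + k, b + n - k) = Beta(7, 4)
conj = stats.beta.pdf(p_grid, a + k, b + n - k)
conj = conj / conj.sum()

print(np.allclose(brute, conj))  # the two routes agree
```

Conjugacy is a computational convenience, not a requirement: the brute-force product works for any prior, conjugate or not.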
------------------------------------------------------------------------
### A central role for likelihood
The most influential assumptions in both Bayesian and many non-Bayesian models are the distributions assigned to data, the likelihood functions. The likelihoods influence inference for every piece of data, and as sample size increases, the likelihood matters more and more. This helps to explain why Bayesian and non-Bayesian inferences are often so similar. If we had to explain Bayesian inference using only one aspect of it, we should describe likelihood, not priors.
------------------------------------------------------------------------
#### Unobserved variables.
The distributions we assign to the observed variables typically have their own variables. In the binomial above, there is $p$, the probability of sampling water. Since $p$ is not observed, we usually call it a **parameter**. Even though we cannot observe $p$, we still have to define it.
Later, there will be more parameters in your models. In statistical modeling, many of the most common questions we ask about data are answered directly by parameters:
- What is the average difference between treatment groups?
- How strong is the association between a treatment and an outcome?
- Does the effect of the treatment depend upon a covariate?
- How much variation is there among groups?
You’ll see how these questions become extra parameters inside the distribution function we assign to the data.
For every parameter you intend your Bayesian machine to consider, you must provide a distribution of prior plausibility, its prior. A Bayesian machine must have an initial plausibility assignment for each possible value of the parameter, and these initial assignments do useful work. When you have a previous estimate to provide to the machine, that can become the prior, as in the steps in the plot above. Back in the associated code of the plot, the machine did its learning one piece of data at a time. As a result, each estimate becomes the prior for the next step. But this doesn’t resolve the problem of providing a prior, because at the dawn of time, when $N = 0$, the machine still had an initial state of information for the parameter $p$: a flat line specifying equal plausibility for every possible value.
So where do priors come from? They are both engineering assumptions, chosen to help the machine learn, and scientific assumptions, chosen to reflect what we know about a phenomenon. The flat prior in the plot above is very common, but it is hardly ever the best prior. Later chapters will focus on prior choice a lot more.
There is a school of Bayesian inference that emphasizes choosing priors based upon the personal beliefs of the analyst. While this **subjective Bayesian** approach thrives in some statistics and philosophy and economics programs, it is rare in the sciences. Within Bayesian data analysis in the natural and social sciences, the prior is considered to be just part of the model. As such it should be chosen, evaluated, and revised just like all of the other components of the model. In practice, the subjectivist and the non-subjectivist will often analyze data in nearly the same way.
None of this should be understood to mean that any statistical analysis is not inherently subjective, because of course it is—lots of little subjective decisions are involved in all parts of science. It’s just that priors and Bayesian data analysis are no more inherently subjective than are likelihoods and the repeat sampling assumptions required for significance testing. Anyone who has visited a statistics help desk at a university has probably experienced this subjectivity—statisticians do not in general exactly agree on how to analyze anything but the simplest of problems. The fact that statistical inference uses mathematics does not imply that there is only one reasonable or useful way to conduct an analysis. Engineering uses math as well, but there are many ways to build a bridge.
Beyond all of the above, there’s no law mandating we use only one prior. If you don’t have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. No one is required to swear an oath to the assumptions of a model, and no set of assumptions deserves our obedience.
------------------------------------------------------------------------
#### Prior as Probability Distribution
You could write the prior in the example here as:
$$
Pr(p) = \frac{1}{1-0} = 1
$$
The prior is a probability distribution for the parameter. In general, for a uniform prior from $a$ to $b$, the probability of any point in the interval is $1/(b-a)$. If you’re bothered by the fact that the probability of every value of $p$ is 1, remember that every probability distribution must sum (integrate) to 1. The expression $1/(b-a)$ ensures that the area under the flat line from $a$ to $b$ is equal to 1. There will be more to say about this later.
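Both claims are easy to check numerically (a sketch using `scipy`; the evaluation point 0.3 is arbitrary):

```{python}
from scipy import stats, integrate

# Uniform(a, b) density: 1/(b - a) at every point inside the interval
a, b = 0, 1
print(stats.uniform.pdf(0.3, loc=a, scale=b - a))   # 1/(1 - 0) = 1.0

# The area under the flat line is 1, as every probability density requires
area, _ = integrate.quad(lambda p: stats.uniform.pdf(p, loc=a, scale=b - a), a, b)
print(round(area, 10))
```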
------------------------------------------------------------------------
#### Datum or parameter?
It is typical to conceive of data and parameters as completely different kinds of entities. Data are measured and known; parameters are unknown and must be estimated from data. Usefully, in the Bayesian framework the distinction between a datum and a parameter is not so fundamental. Sometimes we observe a variable, but sometimes we do not. In that case, the same distribution function applies, even though we didn’t observe the variable. As a result, the same assumption can look like a “likelihood” or a “prior,” depending upon context, without any change to the model. Much later, you’ll see how to exploit this deep identity between certainty (data) and uncertainty (parameters) to incorporate measurement error and missing data into your modeling.
------------------------------------------------------------------------
#### Prior, prior pants on fire
Historically, some opponents of Bayesian inference objected to the arbitrariness of priors. It’s true that priors are very flexible, being able to encode many different states of information. If the prior can be anything, isn’t it possible to get any answer you want? Indeed it is. Regardless, after a couple hundred years of Bayesian calculation, it hasn’t turned out that people use priors to lie. If your goal is to lie with statistics, you’d be a fool to do it with priors, because such a lie would be easily uncovered. Better to use the more opaque machinery of the likelihood. Or better yet—don’t actually take this advice!—massage the data, drop some “outliers,” and otherwise engage in motivated data transformation.
It is true though that choice of the likelihood is much more conventionalized than choice of prior. But conventional choices are often poor ones, smuggling in influences that can be hard to discover. In this regard, both Bayesian and non-Bayesian models are equally harried, because both traditions depend heavily upon likelihood functions and conventionalized model forms. And the fact that the non-Bayesian procedure doesn’t have to make an assumption about the prior is of little comfort. This is because non-Bayesian procedures need to make choices that Bayesian ones do not, such as choice of estimator or likelihood penalty. Often, such choices can be shown to be equivalent to some Bayesian choice of prior, or to some choice of loss function.
------------------------------------------------------------------------
### A Model is Born
With all the above work, we can now summarize our model. The observed variables $W$ and $L$ are given relative counts through the binomial distribution. So we can write, as a shortcut:
$$
W \sim \text{Binomial}(N,p)
$$
where $N = W + L$. The above is just a convention for communicating the assumption that the relative counts of ways to realize $W$ in $N$ trials with probability $p$ on each trial come from the binomial distribution.
The unobserved parameter $p$ similarly gets:
$$
p \sim \text{Uniform}(0,1)
$$
This means that $p$ has a uniform—flat—prior over its entire possible range, from zero to one. This is obviously not the best we could do, since we know the Earth has more water than land, even if we do not know the exact proportion yet.
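One way to see what these two assumption lines imply jointly is to simulate from them before any data arrive, drawing $p$ from its prior and then $W$ from the binomial (a prior predictive simulation; the seed and sample size below are arbitrary):

```{python}
import numpy as np

rng = np.random.default_rng(7)
N = 9                                     # nine tosses, as in the sample

# p ~ Uniform(0, 1): candidate proportions drawn from the prior
p_samples = rng.uniform(0, 1, size=100_000)

# W ~ Binomial(N, p): a simulated water count for each candidate proportion
w_samples = rng.binomial(N, p_samples)

# Implied distribution of W before any tossing
freq = np.bincount(w_samples, minlength=N + 1) / len(w_samples)
print(np.round(freq, 3))
```

Under the flat prior, every count from 0 to $N$ turns out to be (nearly exactly) equally plausible before tossing, which is a direct picture of just how weak the uniform prior is.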
Next, let’s see how to use these assumptions to generate inference.
## Making the Model Go
Once you have named all the variables and chosen definitions for each, a Bayesian model can update all of the prior distributions to their purely logical consequences: the **posterior distribution**. For every unique combination of data, likelihood, parameters, and prior, there is a unique posterior distribution. This distribution contains the relative plausibility of different parameter values, conditional on the data and model. The posterior distribution takes the form of the probability of the parameters, conditional on the data. In this case, it would be $Pr(p \mid W,L)$, the probability of each possible value of $p$, conditional on the specific $W$ and $L$ that we observed.
### Bayes' Theorem
The mathematical definition of the posterior distribution arises from Bayes’ theorem. This is the theorem that gives Bayesian data analysis its name. But the theorem itself is a trivial implication of probability theory. Here’s a quick derivation of it, in the context of the globe tossing example. Really this will just be a re-expression of the garden of forking data derivation from earlier in the chapter. What makes it look different is that it will use the rules of probability theory to coax out the updating rule. But it is still just counting.
The **joint probability** of the data $W$ and $L$ and any particular value of $p$ is: $$Pr(W, L, p) = Pr(W, L \mid p)Pr(p)$$
This just says that the probability of $W$, $L$ and $p$ is the product of $Pr(W,L \mid p)$ and the prior probability $Pr(p)$. This is like saying that the probability of rain and cold on the same day is equal to the probability of rain, when it’s cold, times the probability that it’s cold. This much is just definition. But it’s just as true that:
$$
Pr(W, L, p) = Pr(p \mid W, L)Pr(W,L)
$$
This is just the reverse of which probability is conditional, on the right-hand side. It is still a true definition. It’s like saying that the probability of rain and cold on the same day is equal to the probability that it’s cold, when it’s raining, times the probability of rain.
Now since both right-hand sides above are equal to the same thing, $Pr(W,L,p)$, they are also equal to one another:
$$
Pr(W, L \mid p)Pr(p) = Pr(p \mid W, L)Pr(W,L)
$$
So we can now solve for the thing that we want, $Pr(p|W,L)$:
$$
Pr(p \mid W, L) = \frac{Pr(W, L \mid p)Pr(p)}{Pr(W,L)}
$$
This is Bayes’ theorem. It says that the probability of any particular value of $p$, considering the data, is equal to the product of the relative plausibility of the data, conditional on $p$, and the prior plausibility of $p$, divided by this thing $Pr(W,L)$, which I’ll call the average probability of the data.
$$
\text{Posterior} = \frac{\text{Probability of the data} \times \text{Prior}}{\text{Average Probability of the data}}
$$
The **average probability of the data**, $Pr(W,L)$, can be confusing. It is commonly called the **“evidence”** or the “**average likelihood**,” neither of which is a transparent name. The probability $Pr(W,L)$ is literally the average probability of the data. Averaged over what? Averaged over the prior. Its job is just to standardize the posterior, to ensure it sums (integrates) to one. In mathematical form:
$$
Pr(W,L) = E(Pr(W,L \mid p)) = \int Pr(W,L\mid p)Pr(p)dp
$$
The operator $E$ means to take an *expectation*. Such averages are commonly called *marginals* in mathematical statistics, and so you may also see this same probability called a *marginal likelihood*. And the integral above just defines the proper way to compute the average over a continuous distribution of values, like the infinite possible values of $p$.
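For the globe data with a flat prior, $Pr(p) = 1$, this average is easy to compute numerically (a sketch using `scipy.integrate.quad`; under a flat prior the result works out to $1/(n+1)$):

```{python}
from scipy import stats, integrate

k, n = 6, 9   # six W in nine tosses

# Average probability of the data: integrate Pr(W, L | p) * Pr(p) dp, Pr(p) = 1
evidence, _ = integrate.quad(lambda p: stats.binom.pmf(k, n, p) * 1.0, 0, 1)
print(evidence)   # about 0.1, i.e. 1 / (n + 1) for a flat prior

# Standardizing by it turns likelihood-times-prior into a proper density;
# at p = 0.5 this matches the Beta(7, 4) posterior density
print(stats.binom.pmf(k, n, 0.5) / evidence)
```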
The key lesson is that the **posterior is proportional to the product of the prior and the probability of the data**. Why? Because for each specific value of p, the number of paths through the garden of forking data is the product of the prior number of paths and the new number of paths. Multiplication is just compressed counting. The average probability on the bottom just standardizes the counts so they sum to one. So while Bayes’ theorem looks complicated, because the relationship with counting paths is obscured, it just **expresses the counting that logic demands**.
```{python}
def uniform_prior(grid_points):
    """
    Returns a flat (Uniform) prior density; any positive constant works here,
    since the posterior is standardized later
    """
    return np.repeat(5, grid_points)

def truncated_prior(grid_points, trunc_point=0.5):
    """
    Returns Truncated prior density
    """
    return (np.linspace(0, 1, grid_points) >= trunc_point).astype(int)

def double_exp_prior(grid_points):
    """
    Returns Double Exponential prior density
    """
    return np.exp(-5 * abs(np.linspace(0, 1, grid_points) - 0.5))

def binom_grid_approx(prior_func, grid_points, k, n):
    # define grid
    p_grid = np.linspace(0, 1, grid_points)
    # define prior
    prior = prior_func(grid_points)
    # compute likelihood at each point in the grid
    likelihood = stats.binom.pmf(k, n, p_grid)
    # compute product of likelihood and prior
    unstd_posterior = likelihood * prior
    # standardize the posterior, so it sums to 1
    posterior = unstd_posterior / np.sum(unstd_posterior)
    return p_grid, likelihood, posterior
```
```{python}
def example_plot():
    w, n = 6, 9
    priors = [uniform_prior, truncated_prior, double_exp_prior]
    titles = ['Prior', 'Likelihood', 'Posterior']
    _, axs = plt.subplots(3, 3, figsize=(8, 8))
    for i in range(3):
        ax = axs[i]
        p_grid, likelihood, posterior = binom_grid_approx(priors[i], 100, w, n)
        ax[0].plot(p_grid, priors[i](100), "k-")
        ax[1].plot(p_grid, likelihood, "k-")
        ax[2].plot(p_grid, posterior, "k-")
        for ii in range(3):
            ax[ii].axes.get_yaxis().set_visible(False)
            if i == 0:
                ax[ii].set_title(f"{titles[ii]}")

utils.inline_plot(example_plot)
```
The posterior distribution as a product of the prior distribution and likelihood. Top: A flat prior constructs a posterior that is simply proportional to the likelihood. Middle: A step prior, assigning zero probability to all values less than 0.5, results in a truncated posterior. Bottom: A peaked prior that shifts and skews the posterior, relative to the likelihood.
The plot illustrates the multiplicative interaction of a prior and a probability of data. On each row, a prior on the left is multiplied by the probability of data in the middle to produce a posterior on the right. The probability of data in each case is the same. The priors however vary. As a result, the posterior distributions vary.
------------------------------------------------------------------------
#### Bayesian Data Analysis isn’t about Bayes’ Theorem
A common notion about Bayesian data analysis, and Bayesian inference more generally, is that it is distinguished by the use of Bayes’ theorem. This is a mistake. Inference under any probability concept will eventually make use of Bayes’ theorem. Common introductory examples of “Bayesian” analysis using HIV and DNA testing are not uniquely Bayesian. Since all of the elements of the calculation are frequencies of observations, a non-Bayesian analysis would do exactly the same thing. Instead, Bayesian approaches get to use Bayes’ theorem more generally, to quantify uncertainty about theoretical entities that cannot be observed, like parameters and models. Powerful inferences can be produced under both Bayesian and non-Bayesian probability concepts, but different justifications and sacrifices are necessary.
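To see why such testing examples are not uniquely Bayesian, note that the calculation involves only frequencies. Here is a minimal sketch with hypothetical numbers (prevalence, sensitivity, and false-positive rate are illustrative assumptions, not values from the text):

```python
# Hypothetical test frequencies (illustrative assumptions)
prevalence = 0.001    # Pr(disease)
sensitivity = 0.95    # Pr(positive | disease)
false_pos = 0.01      # Pr(positive | no disease)

# Bayes' theorem as pure frequency bookkeeping:
# Pr(disease | positive) = Pr(positive | disease) Pr(disease) / Pr(positive)
p_pos = sensitivity * prevalence + false_pos * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos
print(f"{p_disease_given_pos:.3f}")  # prints 0.087
```

Every quantity above is an observable frequency, so a frequentist would perform exactly the same arithmetic.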
------------------------------------------------------------------------
### Motors: Conditioning Engines
The Bayesian model is a machine with built-in definitions for the likelihood, the parameters, and the prior. At its heart lies a motor that processes data, producing a posterior distribution. The action of this motor can be thought of as *conditioning* the prior on the data. As explained in the previous section, this conditioning is governed by the rules of probability theory, which defines a uniquely logical posterior for each set of assumptions and observations.
However, knowing the mathematical rule is often of little help, because many of the interesting models in contemporary science cannot be conditioned formally, no matter your skill in mathematics. And while some broadly useful models like linear regression can be conditioned formally, this is only possible if you constrain your choice of prior to special forms that are easy to do mathematics with. We’d like to avoid forced modeling choices of this kind, instead favoring conditioning engines that can accommodate whichever prior is most useful for inference.
What this means is that various numerical techniques are needed to approximate the mathematics that follows from the definition of Bayes’ theorem. In this book, you’ll meet three different conditioning engines, numerical techniques for computing posterior distributions:
1) Grid approximation
2) Quadratic approximation
3) Markov chain Monte Carlo (MCMC)
There are many other engines, and new ones are being invented all the time. But the three you’ll get to know here are common and widely useful. In addition, as you learn them, you’ll also learn principles that will help you understand other techniques.
------------------------------------------------------------------------
#### How you fit the model is part of the model
Earlier, we implicitly defined the model as a composite of a prior and a likelihood. That definition is typical. But in practical terms, we should also consider how the model is fit to data as part of the model. In very simple problems, like the globe tossing example that consumes this chapter, calculation of the posterior density is trivial and foolproof. In even moderately complex problems, however, the details of fitting the model to data force us to recognize that our numerical technique influences our inferences. This is because different mistakes and compromises arise under different techniques. The same model fit to the same data using different techniques may produce different answers. When something goes wrong, every piece of the machine may be suspect. And so our golems carry with them their updating engines, as much slaves to their engineering as they are to the priors and likelihoods we program into them.
------------------------------------------------------------------------
### Grid Approximation
One of the simplest conditioning techniques is grid approximation. While most parameters are continuous, capable of taking on an infinite number of values, it turns out that we can achieve an excellent approximation of the continuous posterior distribution by considering only a finite grid of parameter values. At any particular value of a parameter, $p'$, it’s a simple matter to compute the posterior probability: just multiply the prior probability of $p'$ by the likelihood at $p'$. Repeating this procedure for each value in the grid generates an approximate picture of the exact posterior distribution. This procedure is called **grid approximation**.
Grid approximation will mainly be useful as a pedagogical tool, as learning it forces the user to really understand the nature of Bayesian updating. But in most of your real modeling, grid approximation isn’t practical. The reason is that it scales very poorly, as the number of parameters increases. So in later chapters, grid approximation will fade away, to be replaced by other, more efficient techniques.
In the context of the globe tossing problem, grid approximation works extremely well. So let’s build a grid approximation for the model we’ve constructed so far. Here is the recipe:
1) Define the grid. This means you decide how many points to use in estimating the posterior, and then you make a list of the parameter values on the grid.
2) Compute the value of the prior at each parameter value on the grid.
3) Compute the likelihood at each parameter value.
4) Compute the unstandardized posterior at each parameter value, by multiplying the prior by the likelihood.
5) Finally, standardize the posterior, by dividing each value by the sum of all values.
In the globe tossing context, here’s the code to complete all five of these steps:
```{python}
def binom_grid_approx(prior_func, grid_points, k, n):
# define grid
p_grid = np.linspace(0, 1, grid_points)
# define prior
prior = prior_func(grid_points)
# compute likelihood at each point in the grid
likelihood = stats.binom.pmf(k, n, p_grid)
# compute product of likelihood and prior
unstd_posterior = likelihood * prior
# standardize the posterior, so it sums to 1
posterior = unstd_posterior / np.sum(unstd_posterior)
return p_grid, likelihood, posterior
```
The above code makes a grid of the specified number of points. Let's display the posterior distribution computed with 5, 20, and 100 points:
```{python}
def example_plot():
w, n = 6, 9
points = (5, 20, 100)
_, ax = plt.subplots(1, len(points), figsize=(4 * len(points), 4))
for idx, ps in enumerate(points):
p_grid, _, posterior = binom_grid_approx(uniform_prior, ps, w, n)
ax[idx].plot(p_grid, posterior, "o-", label=f"successes = {w}\ntosses = {n}")
ax[idx].set_xlabel("probability of water")
ax[idx].set_ylabel("posterior probability")
ax[idx].set_title(f"{ps} points")
ax[idx].legend(loc=0)
utils.inline_plot(example_plot)
```
How fine to make the grid is determined by how accurate you want your approximation to be. More points means more precision. In this simple example, you can go crazy and use 100,000 points, but there won’t be much change in inference after the first 100.
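As a quick check of that claim, here is a self-contained sketch (re-implementing the flat-prior grid recipe above) comparing the posterior mean at 100 and 100,000 grid points:

```python
import numpy as np
from scipy import stats

def grid_posterior(grid_points, k=6, n=9):
    # uniform prior over an evenly spaced grid of candidate p values
    p_grid = np.linspace(0, 1, grid_points)
    likelihood = stats.binom.pmf(k, n, p_grid)
    posterior = likelihood / likelihood.sum()  # flat prior drops out of the normalization
    return p_grid, posterior

means = {}
for pts in (100, 100_000):
    p_grid, posterior = grid_posterior(pts)
    means[pts] = np.sum(p_grid * posterior)  # posterior mean from the grid
    print(f"{pts:>7} points: posterior mean = {means[pts]:.5f}")
```

Both values agree with the analytical posterior mean $(w + 1)/(n + 2) = 7/11 \approx 0.636$ to several decimal places, so extra grid points buy essentially nothing here.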
### Quadratic Approximation
We’ll stick with the grid approximation to the globe tossing posterior, for the first few exercises. But before long we’ll have to resort to another approximation, one that makes stronger assumptions. The reason is that the number of unique values to consider in the grid grows rapidly as the number of parameters in your model increases. For the single-parameter globe tossing model, it’s no problem to compute a grid of 100 or 1000 values. But for two parameters approximated by 100 values each, that’s already $100^{2} =10,000$ values to compute. For 10 parameters, the grid becomes many billions of values. These days, it’s routine to have models with hundreds or thousands of parameters. The grid approximation strategy scales very poorly with model complexity, so it won’t get us very far.
A useful approach is **quadratic approximation**. Under quite general conditions, the region near the peak of the posterior distribution will be nearly Gaussian—or “normal”—in shape. This means the posterior distribution can be usefully approximated by a Gaussian distribution. A Gaussian distribution is convenient, because it can be completely described by only two numbers: the location of its center (**mean**) and its spread (**variance**).
A **Gaussian approximation** is called “quadratic approximation” because the logarithm of a Gaussian distribution forms a parabola. And a parabola is a quadratic function. So this approximation essentially represents any log-posterior with a parabola.
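This is easy to verify numerically: the log-density of any Gaussian is fit exactly by a second-degree polynomial. A quick sketch, with an arbitrarily chosen mean of 1.0 and standard deviation of 1.5:

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 5, 50)
log_pdf = stats.norm.logpdf(x, loc=1.0, scale=1.5)  # log of a Gaussian density

# fit a parabola to the log-density; the residual is numerically zero
coeffs = np.polyfit(x, log_pdf, deg=2)
residual = np.max(np.abs(np.polyval(coeffs, x) - log_pdf))
print(f"quadratic coefficient: {coeffs[0]:.4f}")  # equals -1/(2 * sigma^2)
print(f"max residual: {residual:.2e}")
```

The leading coefficient comes out as $-1/(2\sigma^2) \approx -0.2222$, and the fit is exact up to floating-point error: the log of a Gaussian really is a parabola.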
For many of the most common procedures in applied statistics—linear regression, for example—the approximation works very well. Often, it is even exactly correct, not actually an approximation at all. Computationally, quadratic approximation is very inexpensive, at least compared to grid approximation and MCMC (discussed next). The procedure contains two steps.
1) Find the posterior mode. This is usually accomplished by some optimization algorithm, a procedure that virtually “climbs” the posterior distribution, as if it were a mountain. The algorithm doesn’t know where the peak is, but it does know the slope under its feet. There are many well-developed optimization procedures, most of them more clever than simple hill climbing. But all of them try to find peaks.
2) Once you find the peak of the posterior, you must estimate the curvature near the peak. This curvature is sufficient to compute a quadratic approximation of the entire posterior distribution. In some cases, these calculations can be done analytically, but usually your computer uses some numerical technique instead.
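The two steps above can be sketched by hand for the globe tossing posterior, using `scipy.optimize` for the hill climb and a central finite difference for the curvature. This is only an illustration of the idea, not the machinery used in this book's models:

```python
import numpy as np
from scipy import optimize, stats

W, L = 6, 3  # six waters, three lands

def neg_log_posterior(p):
    # flat prior, so the log posterior is the log likelihood up to a constant
    return -stats.binom.logpmf(W, W + L, p)

# Step 1: climb to the posterior mode
res = optimize.minimize_scalar(neg_log_posterior, bounds=(1e-5, 1 - 1e-5), method="bounded")
mode = res.x

# Step 2: curvature of the negative log posterior at the mode,
# via a central finite difference
h = 1e-4
second_deriv = (neg_log_posterior(mode + h)
                - 2 * neg_log_posterior(mode)
                + neg_log_posterior(mode - h)) / h**2
sd = np.sqrt(1 / second_deriv)

print(f"mode = {mode:.3f}, sd = {sd:.3f}")  # mode = 0.667, sd = 0.157
```

The resulting standard deviation of about 0.16 matches the `precis()` summary reported below.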
To compute the quadratic approximation for the globe tossing data, we’ll use Stan's `optimize()` function to obtain the MLE (or MAP) of the parameter $p$, and BridgeStan's `log_density_hessian()` function to extract the model's **Hessian matrix**, from which we will obtain the standard deviation. All of this is conveniently wrapped inside the `StanQuap` class.
Let us compute the quadratic approximation to the globe tossing data:
```{python}
#| output: false
# Define the Bayesian model in Stan
globe_qa = '''
data {
int<lower=0> W; // Number of successes (Water)
int<lower=0> L; // Number of failures (Land)
}
parameters {
real<lower=0, upper=1> p; // Probability of water
}
model {
p ~ uniform(0, 1); // Uniform prior
W ~ binomial(W + L, p); // Binomial likelihood
}
'''
# Set-up the data in a dictionary
toss_data = {'W': 6, 'L': 3}
# Instantiate StanQuap object
globe_qa_model = utils.StanQuap('stan_models/globe_qa', globe_qa, toss_data)
# display summary of quadratic approximation
globe_qa_summary = globe_qa_model.precis().round(2)
globe_qa_summary
```
The `precis()` method presents a brief summary of the quadratic approximation. In this case, it shows the posterior mean value of $p=0.67$, which it calls the “Mean.” The curvature is labeled “StdDev,” which stands for *standard deviation*. This value is the standard deviation of the posterior distribution, while the mean value is its peak. Finally, the last two values in the output show the 89% percentile interval, which you’ll learn more about later. You can read this kind of approximation like: *Assuming the posterior is Gaussian, it is maximized at 0.67, and its standard deviation is 0.16*.
Since we already know the posterior, let’s compare to see how good the approximation is. We will use the analytical approach here, which uses `stats.beta.pdf`.
```{python}
Ws, Ls = [6, 12, 24], [3, 6, 12]
# quadratic approximation
def plot_quap():
_, axes = plt.subplots(1, 3, figsize=(12, 4))
x = np.linspace(0, 1, 100)
for i, ax in enumerate(axes):
toss_data = {'W': Ws[i], 'L': Ls[i]}
globe_qa_summary = utils.StanQuap('stan_models/globe_qa', globe_qa, toss_data).precis()
ax.plot(x,
stats.beta.pdf(x, Ws[i] + 1, Ls[i] + 1), # analytical posterior calculation
color='k',
label="True posterior")
ax.plot(x,
stats.norm.pdf(x, globe_qa_summary['Mean'][0], globe_qa_summary['StDev'][0]),
color='k',
linestyle='--',
label="Quadratic approximation")
        ax.set(xlabel='Proportion water', title=f"n = {Ws[i]+Ls[i]}")
utils.inline_plot(plot_quap)
```
Accuracy of the quadratic approximation. In each plot, the exact posterior distribution is plotted as the solid curve, and the quadratic approximation as the dashed curve. Left: The globe tossing data with $n=9$ tosses and $w=6$ waters. Middle: Double the amount of data, with the same fraction of water, $n=18$ and $w=12$. Right: Four times as much data, $n=36$ and $w=24$.
The quadratic approximation curve does alright on its left side, but looks pretty bad on its right side. It even assigns positive probability to $p=1$, which we know is **impossible**, since we saw at least one land sample.
As the amount of data increases, however, the quadratic approximation gets better. In the middle plot, the sample size is doubled to $n =18$ tosses, but with the same fraction of water, so that the mode of the posterior is in the same place. The quadratic approximation looks better now, although still not great. At quadruple the data, on the right side of the figure, the two curves are nearly the same now.
This phenomenon, where the quadratic approximation improves with the amount of data, is very common. It’s one of the reasons that so many classical statistical procedures are nervous about small samples: Those procedures use quadratic (or other) approximations that are only known to be safe with infinite data. Often, these approximations are useful with less than infinite data, obviously. But the rate of improvement as sample size increases varies greatly depending upon the details. In some models, the quadratic approximation can remain terrible even with thousands of samples.
Using the quadratic approximation in a Bayesian context brings with it all the same concerns. But you can always lean on some algorithm other than quadratic approximation, if you have doubts. Indeed, grid approximation works very well with small samples, because in such cases the model must be simple and the computations will be quite fast. You can also use MCMC, which is introduced next.
------------------------------------------------------------------------
#### Maximum Likelihood Estimation (MLE)
The quadratic approximation, either with a uniform prior or with a lot of data, is often equivalent to a **maximum likelihood estimate** (MLE) and its **standard error**. The MLE is a very common non-Bayesian parameter estimate. This correspondence between a Bayesian approximation and a common non-Bayesian estimator is both a blessing and a curse. It is a blessing, because it allows us to reinterpret a wide range of published non-Bayesian model fits in Bayesian terms. It is a curse, because maximum likelihood estimates have some curious drawbacks, and the quadratic approximation can share them. We’ll explore these drawbacks later; they are one of the reasons we’ll turn to Markov chain Monte Carlo.
------------------------------------------------------------------------
#### Hessian Matrix
A Hessian is a square matrix of second derivatives. It is used for many purposes in mathematics, but in the quadratic approximation it is second derivatives of the log of posterior probability with respect to the parameters. It turns out that these derivatives are sufficient to describe a Gaussian distribution, because the logarithm of a Gaussian distribution is just a parabola. Parabolas have no derivatives beyond the second, so once we know the center of the parabola (the posterior mode) and its second derivative, we know everything about it. And indeed the second derivative (with respect to the outcome) of the logarithm of a Gaussian distribution is proportional to its **inverse squared standard deviation** (its “precision”). So knowing the standard deviation tells us everything about its shape.
The standard deviation is typically computed from the Hessian, so computing the Hessian is nearly always a necessary step. For now it’s enough to recognize the term and associate it with an attempt to find the standard deviation for a quadratic approximation.
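For the globe tossing model with a flat prior, this connection can be written out exactly. The log posterior is, up to a constant, $W\log p + L\log(1-p)$, so its second derivative is

$$
\frac{d^{2}}{dp^{2}}\,\log \Pr(p \mid W, L) = -\frac{W}{p^{2}} - \frac{L}{(1-p)^{2}}
$$

Evaluated at the mode $\hat{p} = W/(W+L) = 2/3$ with $W=6$ and $L=3$, this gives $-13.5 - 27 = -40.5$, so $\sigma = \sqrt{-1/H} = \sqrt{1/40.5} \approx 0.16$, the same standard deviation that `precis()` reported above.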
------------------------------------------------------------------------
#### Conjugacy
- Source: [Conjugate Prior](https://en.wikipedia.org/wiki/Conjugate_prior)
If, given a likelihood function $p(x\mid \theta)$, the posterior distribution $p(\theta \mid x)$ is in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior are then called **conjugate distributions** with respect to that likelihood function, and the prior is called a **conjugate prior** for the likelihood function $p(x\mid \theta)$.
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise, numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.
The form of the conjugate prior can generally be determined by inspection of the probability density or probability mass function of a distribution. For example, consider a random variable which consists of the number of successes $s$ in $n$ Bernoulli trials with unknown probability of success $q$ in \[0,1\]. This random variable will follow the binomial distribution, with a probability mass function of the form
$$
p(s)={n \choose s}q^{s}(1-q)^{n-s}
$$
The usual conjugate prior is the beta distribution with parameters $\alpha$, $\beta$:
$$
p(q)={q^{\alpha -1}(1-q)^{\beta -1} \over \mathrm {B} (\alpha ,\beta )}
$$
where $\alpha$ and $\beta$ are chosen to reflect any existing belief or information ($\alpha =1$ and $\beta =1$ would give a uniform distribution) and $B(\alpha ,\beta)$ is the Beta function acting as a normalizing constant.
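Multiplying this prior by the binomial likelihood shows why the update stays in the beta family:

$$
p(q \mid s) \;\propto\; q^{s}(1-q)^{n-s} \cdot q^{\alpha-1}(1-q)^{\beta-1} \;=\; q^{(\alpha+s)-1}(1-q)^{(\beta+n-s)-1}
$$

which is the kernel of a $\text{Beta}(\alpha+s,\ \beta+n-s)$ distribution. With a uniform prior ($\alpha=\beta=1$) and the globe tossing data ($s = W = 6$, $n = 9$), the posterior is $\text{Beta}(7, 4)$, which is why `stats.beta.pdf(x, W + 1, L + 1)` serves as the exact posterior in the comparisons in this chapter.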
The value computed below is the relative number of ways to get six water, holding $p$ at 0.5 and $N = W + L$ at nine. So it does the job of counting the relative number of paths through the garden.
```{python}
# Binomial distribution - likelihood of six W's in nine tosses, under p = 0.5
print(stats.binom.pmf(6, 9, 0.5))
# Beta distribution - posterior density at p = 0.5, given six W's and three L's
print(stats.beta.pdf(0.5, 6 + 1, 3 + 1))
```
```{python}
p = np.linspace(0, 1, 100)
def beta_binom_plot():
fig, ax1 = plt.subplots()
ax1.set_xlabel('$p$')
ax1.set_ylabel('Binomial - Likelihood', color='k')
ax1.plot(p, stats.binom.pmf(7, 10, p), color='k')
ax1.tick_params(axis='y')
ax2 = ax1.twinx()
ax2.set_ylabel('Beta - Posterior Probability Density', color='r')
ax2.plot(p, stats.beta.pdf(p, 7+1, 3+1), color='r', linewidth=6, alpha=0.2)
ax2.tick_params(axis='y', labelcolor='r')
utils.inline_plot(beta_binom_plot)
```
------------------------------------------------------------------------
### Markov Chain Monte Carlo (MCMC)
There are lots of important model types, like multilevel (mixed-effects) models, for which neither grid approximation nor quadratic approximation is always satisfactory. Such models may have hundreds, thousands, or tens of thousands of parameters. Grid approximation routinely fails here, because it just takes too long. Special forms of quadratic approximation might work, if everything is just right. But commonly, something is not just right. Furthermore, multilevel models do not always allow us to write down a single, unified function for the posterior distribution. This means that the function to maximize (when finding the MAP) is not known, but must be computed in pieces.
As a result, various counterintuitive model fitting techniques have arisen. The most popular of these is Markov chain Monte Carlo (MCMC), which is a family of conditioning engines capable of handling highly complex models.
The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples.
We nearly always work directly with these samples, rather than first constructing some mathematical estimate from them. And the samples are in many ways more convenient than having the posterior, because they are easier to think with.
------------------------------------------------------------------------
#### A) From Prior to Posterior - Monte Carlo globe tossing
The code below implements a simple **Metropolis-Hastings algorithm**, a Markov Chain Monte Carlo (MCMC) method, to approximate the posterior distribution of a probability parameter $p$, given a likelihood defined by a binomial model.
**Key Components**:
1. **Inputs**:
    - `W = 6`: Number of water observations (successes).
    - `L = 3`: Number of land observations (failures).
- `n_samples = 1000`: Total number of samples to generate for the Markov chain.
- Initial probability, `p[0] = 0.5`.
2. **Model**:
    - $p$ represents the probability of success (the proportion of water) in a binomial distribution. The algorithm explores possible values for $p$ to estimate its posterior distribution.
- The likelihood is computed using the binomial probability mass function (`stats.binom.pmf`).
3. **Proposal Distribution**:
- A new candidate value, $p_{\text{new}}$, is proposed at each step, sampled from a normal distribution centered at the current value $p[i-1]$ with a small standard deviation (`0.1`).
4. **Reflection Boundary**:
- To ensure $p_{\text{new}}$ stays in the valid range $[0, 1]$:
- If $p_{\text{new}} < 0$, it is reflected back: $p_{\text{new}} = -p_{\text{new}}$.
- If $p_{\text{new}} > 1$, it is reflected back: $p_{\text{new}} = 2 - p_{\text{new}}$.
5. **Acceptance Step**:
- The acceptance ratio, $\frac{q_1}{q_0}$, is calculated, where:
- $q_0$: Likelihood of observing `W` successes given $p[i-1]$.
- $q_1$: Likelihood of observing `W` successes given $p_{\text{new}}$.
- A random number from a uniform distribution is drawn, and if it is less than $\frac{q_1}{q_0}$, the new value is accepted ($p[i] = p_{\text{new}}$). Otherwise, the chain retains the previous value ($p[i] = p[i-1]$).
**Outcome**:
- The algorithm generates a sequence of 1000 samples (`p`) that approximate the posterior distribution of the probability parameter $p$.
- Over many iterations, the chain will converge to the posterior distribution of $p$, allowing statistical inference about the probability of success given the observed data ($W = 6$, $L = 3$).
This is a simple example of Bayesian inference using MCMC to estimate a posterior distribution for a parameter in a binomial model.
```{python}
n_samples = 1000
p = np.zeros(n_samples)
p[0] = 0.5
W = 6
L = 3
for i in range(1, n_samples):
    p_new = stats.norm.rvs(p[i - 1], 0.1)  # propose from a normal centered at the current value
if p_new < 0: p_new = -p_new
if p_new > 1: p_new = 2 - p_new
q0 = stats.binom.pmf(W, n=W + L, p=p[i - 1])
q1 = stats.binom.pmf(W, n=W + L, p=p_new)
if stats.uniform.rvs(0, 1) < q1 / q0:
p[i] = p_new
else:
p[i] = p[i - 1]
```
The values in $p$ are samples from the posterior distribution. To compare to the analytical posterior:
```{python}
def plot_post_dist():
bw = utils.bw_nrd0(p)
az.plot_kde(p, bw=bw, label="Metropolis approximation", plot_kwargs={'color': 'black', 'linestyle':'--'})
x = np.linspace(0, 1, 100)
plt.plot(x, stats.beta.pdf(x, W + 1, L + 1), color='k', label="True posterior")
plt.legend();
utils.inline_plot(plot_post_dist)
```
------------------------------------------------------------------------
## Summary
> - For each possible explanation of the sample
> - Count all the ways the sample could occur
> - The explanations with the largest number of ways to produce the observed sample are more plausible