---
title: "01 - The Garden of Forking Data"
author: "Abdullah Mahmood"
date: "last-modified"
format:
  html:
    theme: cosmo # united is darker
    css: style.css
    highlight-style: atom-one
    mainfont: Palatino
    fontcolor: black
    monobackgroundcolor: white
    monofont: "Menlo, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace"
    fontsize: 13pt
    linestretch: 1.4
    number-sections: true
    number-depth: 3
    toc: true
    toc-location: right
    code-fold: false
    code-copy: true
    cap-location: bottom
    format-links: false
    embed-resources: true
    anchor-sections: true
    html-math-method:
      method: mathjax
      url: "https://cdn.jsdelivr.net/npm/mathjax@3/es5/tex-mml-chtml.js"
editor: source
jupyter: main
---
**Sources**:
- [Statistical Rethinking - Chapter 2: Small Worlds and Large Worlds](https://www.routledge.com/Statistical-Rethinking-A-Bayesian-Course-with-Examples-in-R-and-STAN/McElreath/p/book/9780367139919)
- [Video: Statistical Rethinking 2023 - 02 - The Garden of Forking Data](https://youtu.be/R1vcdhPBlXA?si=rL3BOz9hHxkPt79m)
#### Imports
```{python}
from init import *
```
## Introduction
The **small world** is the self-contained logical world of the model. Within the small world, all possibilities are nominated. There are no pure surprises. Within the small world of the model, it is important to be able to verify the model’s logic, making sure that it performs as expected under favorable assumptions. Bayesian models have some advantages in this regard, as they have reasonable claims to optimality: No alternative model could make better use of the information in the data and support better decisions, assuming the small world is an accurate description of the real world.
The **large world** is the broader context in which one deploys a model. In the large world, there may be events that were not imagined in the small world. Moreover, the model is always an incomplete representation of the large world, and so will make mistakes, even if all kinds of events have been properly nominated. The logical consistency of a model in the small world is no guarantee that it will be optimal in the large world. But it is certainly a warm comfort.
In this chapter, we will begin to build Bayesian models. The way that Bayesian models learn from evidence is arguably optimal in the small world. When their assumptions approximate reality, they also perform well in the large world. But large world performance has to be demonstrated rather than logically deduced. Passing back and forth between these two worlds allows both formal methods, like Bayesian inference, and informal methods, like peer review, to play an indispensable role.
This chapter focuses on the small world. It explains probability theory in its essential form: counting the ways things can happen. Bayesian inference arises automatically from this perspective. Then the chapter presents the stylized components of a Bayesian statistical model, a model for learning from data. Then it shows you how to animate the model, to produce estimates.
## The Garden of Forking Data
Our goal in this section will be to build Bayesian inference up from humble beginnings, so there is no superstition about it. Bayesian inference is really just counting and comparing of possibilities. Consider by analogy Jorge Luis Borges’ short story “The Garden of Forking Paths.” The story is about a man who encounters a book filled with contradictions. In most books, characters arrive at plot points and must decide among alternative paths. A protagonist may arrive at a man’s home. She might kill the man, or rather take a cup of tea. Only one of these paths is taken—murder or tea. But the book within Borges’ story explores all paths, with each decision branching outward into an expanding garden of forking paths.
This is the same device that Bayesian inference offers. In order to make good inference about what actually happened, it helps to consider everything that could have happened. A Bayesian analysis is a garden of forking data, in which alternative sequences of events are cultivated. As we learn about what did happen, some of these alternative sequences are pruned. In the end, what remains is only what is logically consistent with our knowledge.
This approach provides a quantitative ranking of hypotheses, a ranking that is maximally conservative, given the assumptions and data that go into it. The approach cannot guarantee a correct answer, on large world terms. But it can guarantee the best possible answer, on small world terms, that could be derived from the information fed into it.
Consider the following toy example.
### Counting Possibilities
Suppose there’s a bag, and it contains four marbles. These marbles come in two colors: blue and white. We know there are four marbles in the bag, but we don’t know how many are of each color. We do know that there are five possibilities: (1) \['W', 'W', 'W', 'W'\], (2) \['B', 'W', 'W', 'W'\], (3) \['B', 'B', 'W', 'W'\], (4) \['B', 'B', 'B', 'W'\], (5) \['B', 'B', 'B', 'B'\]. These are the only possibilities consistent with what we know about the contents of the bag. Call these five possibilities the ***conjectures***.
Our goal is to figure out which of these conjectures is most plausible, given some evidence about the contents of the bag. We do have some evidence: A sequence of three marbles is pulled from the bag, one at a time, replacing the marble each time and shaking the bag before drawing another marble. The sequence that emerges is: \['B', 'W', 'B'\], in that order. These are the data.
So now let’s plant the garden and see how to use the data to infer what’s in the bag. Let’s begin by considering just the single conjecture, \['B', 'W', 'W', 'W'\], that the bag contains one blue and three white marbles. On the first draw from the bag, one of four things could happen, corresponding to one of four marbles in the bag. So we can visualize the possibilities branching outward:
*(Figure: the possibilities branching outward from the first draw)*
Notice that even though the three white marbles look the same from a data perspective—we just record the color of the marbles, after all—they are really different events. This is important, because it means that there are three more ways to see W than to see B.
Now consider the garden as we get another draw from the bag. It expands the garden out one layer:
*(Figure: the garden expanded one layer, after the second draw)*
Now there are 16 possible paths through the garden, one for each pair of draws. On the second draw from the bag, each of the paths above again forks into four possible paths. Why?
Because we believe that our shaking of the bag gives each marble a fair chance at being drawn, regardless of which marble was drawn previously. The third layer is built in the same way, and the full garden is shown below. There are $4^3 = 64$ possible paths in total.
*(Figure: the full garden of 64 possible paths)*
As we consider each draw from the bag, some of these paths are logically eliminated. The first draw turned out to be 'B', recall, so the three white paths at the bottom of the garden are eliminated right away. If you imagine the real data tracing out a path through the garden, it must have passed through the one blue path near the origin. The second draw from the bag produces 'W', so three of the paths forking out of the first blue marble remain. As the data trace out a path, we know it must have passed through one of those three white paths (after the first blue path), but we don’t know which one, because we recorded only the color of each marble. Finally, the third draw is 'B'. Each of the remaining three paths in the middle layer sustains one blue path, leaving a total of three ways for the sequence to appear, assuming the bag contains \['B', 'W', 'W', 'W'\].
*(Figure: the garden with logically eliminated paths grayed out)*
The figure above shows the garden again, now with logically eliminated paths grayed out. We can’t be sure which of those three paths the actual data took. But as long as we’re considering only the possibility that the bag contains one blue and three white marbles, we can be sure that the data took one of those three paths. Those are the only paths consistent with both our knowledge of the bag’s contents (four marbles, white or blue) and the data ('B', 'W', 'B').
This demonstrates that there are three (out of 64) ways for a bag containing \['B', 'W', 'W', 'W'\] to produce the data. We have no way to decide among these three ways. The inferential power comes from comparing this count to the numbers of ways each of the other conjectures of the bag’s contents could produce the same data. For example, consider the conjecture \['W', 'W', 'W', 'W'\]. There are zero ways for this conjecture to produce the observed data, because even one 'B' observation is logically incompatible with it. The conjecture \['B', 'B', 'B', 'B'\] is likewise logically incompatible with the data. So we can eliminate these two conjectures, because neither provides even a single path that is consistent with the data.
The figure below displays the full garden now, for the remaining three conjectures: \['B', 'W', 'W', 'W'\], \['B', 'B', 'W', 'W'\], and \['B', 'B', 'B', 'W'\]. The upper-left wedge displays the same garden as earlier. The upper-right shows the analogous garden for the conjecture that the bag contains three blue marbles and one white marble. And the bottom wedge shows the garden for two blue and two white marbles. Now we count up all of the ways each conjecture could produce the observed data. For one blue and three white, there are three ways, as we counted already. For two blue and two white, there are eight paths forking through the garden that are logically consistent with the observed sequence. For three blue and one white, there are nine paths that survive.
*(Figure: garden_of_forking_data — the full gardens for the three remaining conjectures)*
To summarize, we’ve considered five different conjectures about the contents of the bag, ranging from zero blue marbles to four blue marbles. For each of these conjectures, we’ve counted up how many sequences, paths through the garden of forking data, could potentially produce the observed data, ('B', 'W', 'B'):
| Conjecture | Ways to Produce (B, W, B) |
|----------------|---------------------------|
| \[W, W, W, W\] | 0 x 4 x 0 = 0 |
| \[B, W, W, W\] | 1 x 3 x 1 = 3 |
| \[B, B, W, W\] | 2 x 2 x 2 = 8 |
| \[B, B, B, W\] | 3 x 1 x 3 = 9 |
| \[B, B, B, B\] | 4 x 0 x 4 = 0 |
Notice that the number of ways to produce the data, for each conjecture, can be computed by first counting the number of paths in each “ring” of the garden and then by multiplying these counts together. This is just a computational device. It tells us the same thing as the figure above but without having to draw the garden. The fact that numbers are multiplied during calculation doesn’t change the fact that this is still just counting of logically possible paths. This point will come up again, when you meet a formal representation of Bayesian inference.
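The ring-count shortcut can be checked against brute-force enumeration of all $4^3 = 64$ paths. Below is a minimal sketch; the helper names (`count_paths`, `ring_product`) are illustrative, not from the text:

```{python}
import itertools
import numpy as np

data = ("B", "W", "B")

def count_paths(bag, data):
    """Brute force: enumerate every sequence of draws (with replacement,
    one path per marble per draw) and count those matching the observed colors."""
    return sum(
        all(m == d for m, d in zip(seq, data))
        for seq in itertools.product(bag, repeat=len(data))
    )

def ring_product(bag, data):
    """Shortcut: multiply the per-draw counts of color-matching marbles."""
    return int(np.prod([bag.count(d) for d in data]))

for bag in (["W", "W", "W", "W"], ["B", "W", "W", "W"], ["B", "B", "W", "W"],
            ["B", "B", "B", "W"], ["B", "B", "B", "B"]):
    assert count_paths(bag, data) == ring_product(bag, data)
    print(bag, ring_product(bag, data))
```

Both methods agree on the counts 0, 3, 8, 9, 0 from the table above, which is the sense in which the multiplication is "just counting."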
So what good are these counts? By comparing these counts, we have part of a solution for a way to rate the relative plausibility of each conjectured bag composition. But it’s only a part of a solution, because in order to compare these counts we first have to decide *how many ways each conjecture could itself be realized.* We might argue that when we have no reason to assume otherwise, we can just consider *each conjecture equally plausible and compare the counts directly*. But often we do have reason to assume otherwise.
### Combining Other Information
We may have additional information about the relative plausibility of each conjecture. This information could arise from knowledge of how the contents of the bag were generated. It could also arise from previous data. Whatever the source, it would help to have a way to combine different sources of information to update the plausibilities. Luckily there is a natural solution: Just multiply the counts.
To grasp this solution, suppose we’re willing to say each conjecture is equally plausible at the start. So we just compare the counts of ways in which each conjecture is compatible with the observed data. This comparison suggests that \['B', 'B', 'B', 'W'\] is slightly more plausible than \['B', 'B', 'W', 'W'\], and both are about three times more plausible than \['B', 'W', 'W', 'W'\]. Since these are our initial counts, and we are going to update them next, let’s label them *priors*.
Now suppose we draw another marble from the bag to get another observation: 'B'. Now you have two choices. You could start all over again, making a garden with four layers to trace out the paths compatible with the data sequence \['B', 'W', 'B', 'B'\]. Or you could take the previous counts—the prior counts—over conjectures (0, 3, 8, 9, 0) and just update them in light of the new observation. It turns out that these two methods are mathematically identical, as long as the *new observation is logically independent of the previous observations*.
Here’s how to do it. First we count the numbers of ways each conjecture could produce the new observation, 'B'. Then we multiply each of these new counts by the prior numbers of ways for each conjecture. In table form:
| Conjecture | Ways to Produce B | Prior Counts | New Count |
|----------------|-------------------|--------------|------------|
| \[W, W, W, W\] | 0 | 0 | 0 x 0 = 0 |
| \[B, W, W, W\] | 1 | 3 | 1 x 3 = 3 |
| \[B, B, W, W\] | 2 | 8 | 2 x 8 = 16 |
| \[B, B, B, W\] | 3 | 9 | 3 x 9 = 27 |
| \[B, B, B, B\] | 4 | 0 | 4 x 0 = 0 |
The new counts in the right-hand column above summarize all the evidence for each conjecture. As new data arrive, and provided those data are independent of previous observations, then the number of logically possible ways for a conjecture to produce all the data up to that point can be computed just by multiplying the new count by the old count.
This updating approach amounts to nothing more than asserting that (1) when we have previous information suggesting there are $W_{prior}$ ways for a conjecture to produce a previous observation $D_{prior}$ and (2) we acquire new observation $D_{new}$ that the same conjecture can produce in $W_{new}$ ways, then (3) the number of ways the conjecture can account for both $D_{prior}$ as well as $D_{new}$ is just the product $W_{prior} \times W_{new}$. For example, in the table above the conjecture \['B', 'B', 'W', 'W'\] has $W_{prior}=8$ ways to produce $D_{prior}=$ \['B', 'W', 'B'\]. It also has $W_{new} = 2$ ways to produce the new observation $D_{new}=$ 'B'. So there are $8 \times 2 = 16$ ways for the conjecture to produce both $D_{prior}$ and $D_{new}$. Why multiply? Multiplication is just a shortcut to enumerating and counting up all of the paths through the garden that could produce all the observations.
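The update step in the table can be sketched in a few lines; the variable names are illustrative:

```{python}
import numpy as np

bags = [["W", "W", "W", "W"], ["B", "W", "W", "W"], ["B", "B", "W", "W"],
        ["B", "B", "B", "W"], ["B", "B", "B", "B"]]

prior_counts = np.array([0, 3, 8, 9, 0])               # ways to produce ['B', 'W', 'B']
ways_new = np.array([bag.count("B") for bag in bags])  # ways to produce the new 'B'

new_counts = prior_counts * ways_new                   # update by multiplication
print(new_counts)  # [ 0  3 16 27  0]
```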
In this example, the prior data and new data are of the same type: marbles drawn from the bag. But in general, the prior data and new data can be of different types. Suppose for example that someone from the marble factory tells you that blue marbles are rare. So for every bag containing \['B', 'B', 'B', 'W'\], they made two bags containing \['B', 'B', 'W', 'W'\] and three bags containing \['B', 'W', 'W', 'W'\]. They also ensured that every bag contained at least one blue and one white marble. We can update our counts again:
| Conjecture | Prior Counts | Factory Count | New Count |
|----------------|--------------|---------------|-------------|
| \[W, W, W, W\] | 0 | 0 | 0 x 0 = 0 |
| \[B, W, W, W\] | 3 | 3 | 3 x 3 = 9 |
| \[B, B, W, W\] | 16 | 2 | 16 x 2 = 32 |
| \[B, B, B, W\] | 27 | 1 | 27 x 1 = 27 |
| \[B, B, B, B\] | 0 | 0 | 0 x 0 = 0 |
Now the conjecture \['B', 'B', 'W', 'W'\] is most plausible, but barely better than \['B', 'B', 'B', 'W'\]. Is there a threshold difference in these counts at which we can safely decide that one of the conjectures is the correct one? You’ll spend the next chapter exploring that question.
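The factory information enters as one more multiplicative update; a sketch using the counts from the table above:

```{python}
import numpy as np

prior_counts = np.array([0, 3, 16, 27, 0])  # ways to produce ['B', 'W', 'B', 'B']
factory = np.array([0, 3, 2, 1, 0])         # relative bag frequencies from the factory

updated = prior_counts * factory
print(updated)           # [ 0  9 32 27  0]
print(updated.argmax())  # index of the most plausible conjecture: ['B', 'B', 'W', 'W']
```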
------------------------------------------------------------------------
**Original Ignorance**
Which assumption should we use, when there is no previous information about the conjectures? The most common solution is to assign an equal number of ways that each conjecture could be correct, before seeing any data. This is sometimes known as the **principle of indifference:** When there is no reason to say that one conjecture is more plausible than another, weigh all of the conjectures equally. This book neither uses nor endorses “ignorance” priors. As we’ll see in later chapters, the structure of the model and the scientific context always provide information that allows us to do better than ignorance.
For the sort of problems we examine in this book, the principle of indifference results in inferences very comparable to mainstream non-Bayesian approaches, most of which contain implicit equal weighting of possibilities. For example, a typical non-Bayesian confidence interval weighs equally all of the possible values a parameter could take, regardless of how implausible some of them are. In addition, many non-Bayesian procedures have moved away from equal weighting, through the use of penalized likelihood and other methods.
------------------------------------------------------------------------
### From Counts to Probability
It is helpful to think of this strategy as adhering to a principle of honest ignorance: *When we don't know what caused the data, potential causes that may produce the data in more ways are more plausible*. This leads us to count paths through the garden of forking data. We’re counting the implications of assumptions.
It’s hard to use these counts though, so we almost always standardize them in a way that transforms them into probabilities. Why is it hard to work with the counts? First, since relative value is all that matters, the size of the counts 3, 8, and 9 contain no information of value. They could just as easily be 30, 80, and 90. The meaning would be the same. It’s just the relative values that matter. Second, as the amount of data grows, the counts will very quickly grow very large and become difficult to manipulate. By the time we have 10 data points, there are already more than one million possible sequences. We’ll want to analyze data sets with thousands of observations, so explicitly counting these things isn’t practical.
Luckily, there’s a mathematical way to compress all of this. Specifically, we define the updated plausibility of each possible composition of the bag, after seeing the data, as:
$$
\begin{align*}
\text{Plausibility of ['B', 'W', 'W', 'W']} &\text{ after seeing ['B', 'W', 'B']} \\
&\propto \\
\text{ways ['B', 'W', 'W', 'W']} &\text{ can produce ['B', 'W', 'B']} \\
&\times \\
\text{prior plausibility} &\text{ ['B', 'W', 'W', 'W']}
\end{align*}
$$
That little $\propto$ means *proportional to.* We want to compare the plausibility of each possible composition. So it’ll be helpful to define $p$ as the proportion of the marbles that is blue. For \['B', 'W', 'W', 'W'\], $p = 1/4 = 0.25$. And so we can now write:
$$
\text{plausibility of } p \text{ after } D_{new} \propto \text{ways } p \text{ can produce } D_{new} \times \text{prior plausibility of } p
$$
The above just means that for any value $p$ can take, we judge the plausibility of that value $p$ as proportional to the number of ways it can get through the garden of forking data. This expression just summarizes the calculations you did in the tables of the previous section.
Finally, we construct probabilities by standardizing the plausibility so that the sum of the plausibilities for all possible conjectures will be one. All you need to do in order to standardize is to add up all of the products, one for each value $p$ can take, and then divide each product by the sum of products:
$$
\text{plausibility of } p \text{ after } D_{new} = \frac{\text{ways }p\text{ can produce } D_{new} \times \text{prior plausibility }p}{\text{sum of products}}
$$
A worked example is needed for this to really make sense. So consider again the table from before, now updated using our definitions of $p$ and "plausibility":
| Possible Compositions | p | Ways to Produce Data | Plausibility |
|-----------------------|------|----------------------|--------------|
| \[W, W, W, W\] | 0 | 0 | 0.00 |
| \[B, W, W, W\] | 0.25 | 3 | 0.15 |
| \[B, B, W, W\] | 0.5 | 8 | 0.40 |
| \[B, B, B, W\] | 0.75 | 9 | 0.45 |
| \[B, B, B, B\] | 1 | 0 | 0.00 |
You can quickly compute these plausibilities in Python:
```{python}
ways = np.array([0, 3, 8, 9, 0])
ways/np.sum(ways)
```
The values in `ways` are the products mentioned before. And `np.sum(ways)` is the denominator "sum of products" in the expression near the top of the page.
These plausibilities are also *probabilities*—they are non-negative (zero or positive) real numbers that sum to one. And all of the mathematical things you can do with probabilities you can also do with these values. Specifically, each piece of the calculation has a direct partner in applied probability theory. These partners have stereotyped names, so it’s worth learning them, as you’ll see them again and again.
- A conjectured proportion of blue marbles, *p*, is usually called a **parameter** value. It's just a way of indexing possible explanations of the data.
- The relative number of ways that a value *p* can produce the data is usually called a **likelihood**. It is derived by enumerating all the possible data sequences that could have happened and then eliminating those sequences inconsistent with the data.
- The prior plausibility of any specific *p* is usually called the **prior probability**.
- The new, updated plausibility of any specific *p* is usually called the **posterior probability**.
In the next major section, you’ll meet the more formal notation for these objects and see how they compose a simple statistical model.
------------------------------------------------------------------------
#### Analytical Solution
- A single toss of the globe has a probability $p$ of producing a water ($W$) observation. It has a probability $1−p$ of producing a land ($L$) observation.
- Each toss of the globe is independent of the others.
Results suggest the following analytical solution: $$\text{ways}(W, L \mid p) = (Rp)^W \times \big((1 - p)R\big)^L$$ where $R$ is the resolution (the number of possible sides of the globe), in this case 4.
```{python}
def plausibility_count(p, W, L, resolution=4):
"""
p: proportion of water
W: number of water observations in the sample
L: number of land observations in sample
resolution: number of globe sides
"""
return (resolution * p) ** W * ((1 - p)*resolution) ** L
```
```{python}
RESOLUTION = 4
# data: ['B', 'W', 'B', 'B'] -> W = 3, L = 1
W = 3
L = 1
p_water = np.linspace(0, 1, 5)
n_possible_ways = plausibility_count(p_water, W, L, RESOLUTION)
n_possible_ways
```
------------------------------------------------------------------------
#### Randomization
When you shuffle a deck of cards or assign subjects to treatments by flipping a coin, it is common to say that the resulting deck and treatment assignments are *randomized*. What does it mean to randomize something? It just means that we have processed the thing so that we know almost nothing about its arrangement. Shuffling a deck of cards changes our state of knowledge, so that we no longer have any specific information about the ordering of cards. However, the bonus that arises from this is that, if we really have shuffled enough to erase any prior knowledge of the ordering, then the order the cards end up in is very likely to be one of the many orderings with high **information entropy**.
------------------------------------------------------------------------
## Building a Model
By working with probabilities instead of raw counts, Bayesian inference is made much easier, but it looks much harder. So in this section, we follow up on the garden of forking data by presenting the conventional form of a Bayesian statistical model. The toy example we’ll use here has the anatomy of a typical statistical analysis, so it’s the style that you’ll grow accustomed to. But every piece of it can be mapped onto the garden of forking data. The logic is the same.
Suppose you have a globe representing our planet, the Earth. This version of the world is small enough to hold in your hands. You are curious how much of the surface is covered in water. You adopt the following strategy: You will toss the globe up in the air. When you catch it, you will record whether or not the surface under your right index finger is water or land. Then you toss the globe up in the air again and repeat the procedure. This strategy generates a sequence of samples from the globe. The first nine samples might look like:
$$
W \space L \space W \space W \space W \space L \space W \space L \space W
$$
where W indicates water and L indicates land. So in this example you observe six W (water) observations and three L (land) observations. Call this sequence of observations the *data*.
To get the logic moving, we need to make assumptions, and these assumptions constitute the model. Designing a simple Bayesian model benefits from a design loop with three steps:
1. **Data Story:** Motivate the model by narrating how the data might arise.
2. **Update**: Educate your model by feeding it the data.
3. **Evaluate**: All statistical models require supervision, leading to model revision.
The next sections walk through these steps, in the context of the globe tossing evidence.
### A Data Story
Bayesian data analysis usually means producing a story for how the data came to be. This story may be *descriptive*, specifying associations that can be used to predict outcomes, given observations. Or it may be *causal*, a theory of how some events produce other events. Typically, any story you intend to be causal may also be descriptive. But many descriptive stories are hard to interpret causally. However, all data stories are complete, in the sense that they are sufficient for specifying an algorithm for simulating new data. In the next chapter, you’ll see examples of doing just that, as simulating new data is useful not only for model criticism but also for model construction.
You can motivate your data story by trying to explain how each piece of data is born. This usually means describing aspects of the underlying reality as well as the sampling process. The data story in this case is simply a restatement of the sampling process:
1. The true proportion of water covering the globe is $p$.
2. A single toss of the globe has a probability $p$ of producing a water (W) observation. It has a probability $1 - p$ of producing a land (L) observation.
3. Each toss of the globe is independent of the others.
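Because the data story above is a complete sampling process, it is already enough to simulate new data. A minimal sketch, where the value of $p$ and the sample size are arbitrary illustrative choices:

```{python}
import numpy as np

rng = np.random.default_rng(0)

def toss_globe(p, n):
    """Simulate n independent tosses: 'W' with probability p, else 'L'."""
    return np.where(rng.random(n) < p, "W", "L")

sample = toss_globe(p=0.7, n=9)
print(sample, "->", np.sum(sample == "W"), "W observations")
```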
The data story is then translated into a formal probability model. This probability model is easy to build, because the construction process can be usefully broken down into a series of component decisions. Before meeting these components, however, it’ll be useful to visualize how a Bayesian model behaves. After you’ve become acquainted with how such a model learns from data, we’ll pop the machine open and investigate its engineering.
------------------------------------------------------------------------
#### The value of storytelling
The data story has value, even if you quickly abandon it and never use it to build a model or simulate new observations. Indeed, it is important to eventually discard the story, because many different stories correspond to the same model. As a result, showing that a model does a good job does not in turn uniquely support our data story. Still, the story has value because in trying to outline the story, often one realizes that additional questions must be answered. Most data stories are much more specific than are the verbal hypotheses that inspire data collection. Hypotheses can be vague, such as “it’s more likely to rain on warm days.” When you are forced to consider sampling and measurement and make a precise statement of how temperature predicts rain, many stories and resulting models will be consistent with the same vague hypothesis. Resolving that ambiguity often leads to important realizations and model revisions, before any model is fit to data.
------------------------------------------------------------------------
### Bayesian Updating
Our problem is one of using the evidence—the sequence of globe tosses—to decide among different possible proportions of water on the globe. These proportions are like the conjectured marbles inside the bag, from earlier in the chapter. Each possible proportion may be more or less plausible, given the evidence. A Bayesian model begins with one set of plausibilities assigned to each of these possibilities. These are the prior plausibilities. Then it updates them in light of the data, to produce the posterior plausibilities. This updating process is a kind of learning, called **Bayesian updating**. The details of this updating—how it is mechanically achieved—can wait until later in the chapter. For now, let’s look only at how such a machine behaves.
For the sake of the example only, let’s program our Bayesian machine to initially assign the same plausibility to every proportion of water, every value of $p$. We’ll do better than this later. Now look at the top-left plot below. The dashed horizontal line represents this initial plausibility of each possible value of $p$. After seeing the first toss, which is a “W,” the model updates the plausibilities to the solid line. The plausibility of $p=0$ has now fallen to exactly zero—the equivalent of “impossible.” Why? Because we observed at least one speck of water on the globe, so now we know there is some water. The model executes this logic automatically. You don’t have to instruct it to account for this consequence. Probability theory takes care of it for you, because it is essentially counting paths through the garden of forking data, as in the previous section.
Likewise, the plausibility of $p > 0.5$ has increased. This is because there is not yet any evidence that there is land on the globe, so the initial plausibilities are modified to be consistent with this. Note however that the relative plausibilities are what matter, and there isn’t yet much evidence. So the differences in plausibility are not yet very large. In this way, the amount of evidence seen so far is embodied in the plausibilities of each value of $p$.
In the remaining plots in below, the additional samples from the globe are introduced to the model, one at a time. Each dashed curve is just the solid curve from the previous plot, moving left to right and top to bottom. Every time a “W” is seen, the peak of the plausibility curve moves to the right, towards larger values of $p$. Every time an “L” is seen, it moves the other direction. The maximum height of the curve increases with each sample, meaning that fewer values of $p$ amass more plausibility as the amount of evidence increases. As each new observation is added, the curve is updated consistent with all previous observations.
Notice that every updated set of plausibilities becomes the initial plausibilities for the next observation. Every conclusion is the starting point for future inference. However, this updating process works backwards, as well as forwards. Given the final set of plausibilities in the bottom-right plot, and knowing the final observation (W), it is possible to mathematically divide out the observation, to infer the previous plausibility curve. So the data could be presented to your model in any order, or all at once even. In most cases, you will present the data all at once, for the sake of convenience. But it’s important to realize that this merely represents abbreviation of an iterated learning process.
```{python}
def globe_toss_plot(observations):
    def plot_beta_posterior(ax, k, n, **kwargs):
        p = np.linspace(0, 1)
        # Beta continuous probability density function
        pmf = stats.beta.pdf(p, k + 1, n - k + 1)
        return ax.plot(p, pmf, **kwargs)

    fig, axes = plt.subplots(3, 3, figsize=(8, 8))
    for i, ax in enumerate(axes.flat):
        obs_prev, obs_new = observations[:i], observations[:i + 1]
        # dashed curve: plausibility before the newest observation
        n, k = len(obs_prev), np.sum(obs_prev)
        plot_beta_posterior(ax, k, n, linestyle='--')
        # solid curve: plausibility after the newest observation
        n, k = len(obs_new), np.sum(obs_new)
        color = 'C1' if observations[i] == 1 else 'C0'
        plot_beta_posterior(ax, k, n, color=color, linewidth=2, alpha=.5)
        obs_str = ''.join("W" if x == 1 else "L" for x in obs_new)
        ax.set(title=obs_str)
        ax.set_yticks([])
    fig.supxlabel('Proportion of Water')
    fig.supylabel('Plausibility')

# Globe tossing simulation
# W = 1, L = 0
observations = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])
utils.inline_plot(globe_toss_plot, observations)
```
------------------------------------------------------------------------
#### Sample Size and Reliable Inference
It is common to hear that there is a minimum number of observations for a useful statistical estimate. For example, there is a widespread superstition that 30 observations are needed before one can use a Gaussian distribution. Why? In non-Bayesian statistical inference, procedures are often justified by the method’s behavior at very large sample sizes, so-called *asymptotic* behavior. As a result, performance at small sample sizes is questionable.
In contrast, Bayesian estimates are valid for any sample size. This does not mean that more data isn’t helpful—it certainly is. Rather, the estimates have a clear and valid interpretation, no matter the sample size. But the price for this power is dependency upon the initial plausibilities, the prior. If the prior is a bad one, then the resulting inference will be misleading. There’s no free lunch, when it comes to learning about the world. A Bayesian golem must choose an initial plausibility, and a non-Bayesian golem must choose an estimator. Both golems pay for lunch with their assumptions.
------------------------------------------------------------------------
### Evaluate
The Bayesian model learns in a way that is demonstrably optimal, provided that it accurately describes the real, large world. This is to say that your Bayesian machine guarantees perfect inference within the small world. No other way of using the available information, beginning with the same state of information, could do better.
Don’t get too excited about this logical virtue, however. The calculations may malfunction, so results always have to be checked. And if there are important differences between the model and reality, then there is no logical guarantee of large world performance. And even if the two worlds did match, any particular sample of data could still be misleading. So it’s worth keeping in mind at least two cautious principles.
First, the model’s certainty is no guarantee that the model is a good one. As the amount of data increases, the globe tossing model will grow increasingly sure of the proportion of water. This means that the curves in the plot above will become increasingly narrow and tall, restricting plausible values within a very narrow range. But models of all sorts—Bayesian or not—can be very confident about an inference, even when the model is seriously misleading. This is because the inferences are conditional on the model. What your model is telling you is that, given a commitment to this particular model, it can be very sure that the plausible values are in a narrow range. Under a different model, things might look different.
Second, it is important to supervise and critique your model’s work. Consider again the fact that the updating in the previous section works in any order of data arrival. We could shuffle the order of the observations, as long as six W’s and three L’s remain, and still end up with the same final plausibility curve. That is only true, however, because the model assumes that order is irrelevant to inference. When something is irrelevant to the machine, it won’t affect the inference directly. But it may affect it indirectly, because the data will depend upon order. So it is important to check the model’s inferences in light of aspects of the data it does not know about. Such checks are an inherently creative enterprise, left to the analyst and the scientific community.
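This order-invariance is easy to verify numerically: updating a flat prior one toss at a time on a grid gives the same final curve for any shuffle of six W’s and three L’s. A minimal sketch (the helper name `grid_posterior` and the grid size are illustrative):

```{python}
import numpy as np

def grid_posterior(observations, grid_points=1000):
    """Sequentially update a flat prior on a grid, one toss at a time."""
    p_grid = np.linspace(0, 1, grid_points)
    plausibility = np.ones(grid_points)  # flat initial plausibility
    for obs in observations:
        # Bernoulli plausibility of a single W (1) or L (0)
        likelihood = p_grid if obs == 1 else (1 - p_grid)
        plausibility = plausibility * likelihood
        plausibility = plausibility / plausibility.sum()  # re-standardize
    return plausibility

original = np.array([1, 0, 1, 1, 1, 0, 1, 0, 1])  # six W, three L
shuffled = np.array([0, 0, 0, 1, 1, 1, 1, 1, 1])  # same counts, new order

# The final plausibility curves agree, regardless of toss order
print(np.allclose(grid_posterior(original), grid_posterior(shuffled)))
```

The model only “sees” the counts, so any permutation of the same nine tosses lands on the same posterior.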
For now, note that the goal is not to test the truth value of the model’s assumptions. We know the model’s assumptions are never exactly right, in the sense of matching the true data generating process. Therefore there’s no point in checking if the model is true. Failure to conclude that a model is false must be a failure of our imagination, not a success of the model. Moreover, models do not need to be exactly true in order to produce highly precise and useful inferences. All manner of small world assumptions about error distributions and the like can be violated in the large world, but a model may still produce a perfectly useful estimate. This is because models are essentially information processing machines, and there are some surprising aspects of information that cannot be easily captured by framing the problem in terms of the truth of assumptions.
Instead, the objective is to check the model’s adequacy for some purpose. This usually means asking and answering additional questions, beyond those that originally constructed the model. Both the questions and answers will depend upon the scientific context. So it’s hard to provide general advice. There will be many examples, throughout the book, and of course the scientific literature is replete with evaluations of the suitability of models for different jobs—prediction, comprehension, measurement, and persuasion.
------------------------------------------------------------------------
#### Deflationary Statistics
It may be that Bayesian inference is the best general purpose method of inference known. However, Bayesian inference is much less powerful than we’d like it to be. There is no approach to inference that provides universal guarantees. No branch of applied mathematics has unfettered access to reality, because math is not discovered, like the proton. Instead it is invented, like the shovel.
------------------------------------------------------------------------
## Components of the Model
Now that you’ve seen how the Bayesian model behaves, it’s time to open up the machine and learn how it works. Consider three different things that we counted in the previous sections.
1. The number of ways each conjecture could produce an observation
2. The accumulated number of ways each conjecture could produce the entire data
3. The initial plausibility of each conjectured cause of the data
Each of these things has a direct analog in conventional probability theory. And so the usual way we build a statistical model involves choosing distributions and devices for each that represent the relative numbers of ways things can happen.
In this section, you’ll meet these components in some detail and see how each relates to the counting you did earlier in the chapter. The job in front of us is really nothing more than naming all of the variables and defining each. We’ll take these tasks in turn.
### Variables
Variables are just symbols that can take on different values. In a scientific context, variables include things we wish to infer, such as proportions and rates, as well as things we might observe, the data. In the globe tossing model, there are three variables.
The first variable is our target of inference, $p$, the proportion of water on the globe. This variable cannot be observed. Unobserved variables are usually called **parameters**. But while $p$ itself is unobserved, we can infer it from the other variables.
The other variables are the observed variables, the counts of water and land. Call the count of water $W$ and the count of land $L$. The sum of these two variables is the number of globe tosses: $N = W + L$.
### Definitions
Once we have the variables listed, we then have to define each of them. In defining each, we build a model that relates the variables to one another. Remember, the goal is to count all the ways the data could arise, given the assumptions. This means, as in the globe tossing model, that for each possible value of the unobserved variables, such as $p$, we need to define the relative number of ways—the probability—that the values of each observed variable could arise. And then for each unobserved variable, we need to define the prior plausibility of each value it could take. I appreciate that this is all a bit abstract. So here are the specifics, for the globe.
#### Observed Variable
For the count of water $W$ and land $L$, we define how plausible any combination of $W$ and $L$ would be, for a specific value of $p$. This is very much like the marble counting we did earlier in the chapter. Each specific value of $p$ corresponds to a specific plausibility of the data, as in the plot above.
So that we don’t have to literally count, we can use a mathematical function that tells us the right plausibility. In conventional statistics, a distribution function assigned to an observed variable is usually called a **likelihood**. That term has special meaning in non-Bayesian statistics, however. We will be able to do things with our distributions that non-Bayesian models forbid. So I will sometimes avoid the term likelihood and just talk about distributions of variables. But when someone says, “likelihood,” they will usually mean a distribution function assigned to an observed variable.
In the case of the globe tossing model, the function we need can be derived directly from the data story. Begin by nominating all of the possible events. There are two: *water* ($W$) and *land* ($L$). There are no other events. The globe never gets stuck to the ceiling, for example. When we observe a sample of $W$’s and $L$’s of length $N$ (nine in the actual sample), we need to say how likely that exact sample is, out of the universe of potential samples of the same length. That might sound challenging, but it’s the kind of thing you get good at very quickly, once you start practicing.
In this case, once we add our assumptions that (1) every toss is independent of the other tosses and (2) the probability of W is the same on every toss, probability theory provides a unique answer, known as the *binomial distribution*. This is the common “coin tossing” distribution. And so the probability of observing $W$ waters and $L$ lands, with a probability $p$ of water on each toss, is:
$$
Pr(W, L \mid p) = \frac{(W + L)!}{W!\,L!} p^W(1-p)^L
$$
The function is read as: *The counts of “water”* $W$ *and “land”* $L$ *are distributed binomially, with probability* $p$ *of “water” on each toss.*
And the binomial distribution formula is built into `scipy.stats`, so you can easily compute the likelihood of the data—six $W$’s in nine tosses—under any value of $p$ with:
```{python}
stats.binom.pmf(k=6, n=9, p=0.5)
```
That number is the relative number of ways to get six water, holding $p$ at 0.5 and $N= W + L$ at nine. So it does the job of counting relative number of paths through the garden. Change the 0.5 to any other value, to see how the value changes.
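Doing exactly that in a small loop makes the point quickly (a sketch; the candidate values of $p$ are arbitrary):

```{python}
from scipy import stats

# Relative plausibility of six W in nine tosses, for several candidate p values
for p in [0.25, 0.50, 6 / 9, 0.75, 1.00]:
    print(f"p = {p:.2f}  ->  {stats.binom.pmf(k=6, n=9, p=p):.4f}")
```

The sample is most plausible near $p = 6/9 \approx 0.67$, and $p = 1$ is ruled out entirely, because three L’s were observed.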
Later, we’ll see that the binomial distribution is rather special, because it represents the **maximum entropy** way to count binary events. “Maximum entropy” might sound like a bad thing. Isn’t entropy disorder? Doesn’t “maximum entropy” mean the death of the universe? Actually it means that the distribution contains no additional information other than: There are two events, and the probabilities of each in each trial are $p$ and $1−p$.
------------------------------------------------------------------------
#### Binomial & Beta Distribution
The binomial *probability mass function* (pmf) has the form:
$$
P(k, n \mid p) = \binom{n}{k} p^k (1-p)^{n-k}
$$
This analytical function gives us the pmf directly, standing in for the literal path counting as the number of possibilities grows toward $\infty$; the coefficient $\binom{n}{k}$ counts the orderings and makes the distribution sum to 1. As a likelihood, the binomial describes the probability of observing $k$ successes in $n$ independent Bernoulli trials, where each trial has probability $p$ of success. It is a discrete distribution and is used when the data consist of counts (e.g., the number of heads in a series of coin tosses).
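As a quick check, the formula can be computed by hand and compared with the built-in `scipy.stats` version (a minimal sketch; `binom_pmf` is just an illustrative helper name):

```{python}
import math
from scipy import stats

def binom_pmf(k, n, p):
    """Binomial pmf computed directly from the formula above."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# The hand-rolled formula matches scipy's implementation
print(binom_pmf(6, 9, 0.5), stats.binom.pmf(k=6, n=9, p=0.5))
```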
The beta *probability density function* (pdf) is $f(p \mid \alpha, \beta)$; in the plotting code earlier, a flat prior was assumed, so $\alpha = k + 1$ and $\beta = n - k + 1$. The beta pdf has the form:
$$
P(p \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1 − p)^{\beta-1}
$$
This function gives us a density over the continuous parameter $p$, and in this model it plays the role of the posterior distribution.
The beta distribution is a conjugate prior of the binomial likelihood. The beta distribution is used to model uncertainty about a probability $p$, and describes the probability distribution of $p$ itself.
When you use a beta prior with a binomial likelihood, the posterior distribution of $p$ is also a beta distribution. This is called **conjugacy**:
$$
\text{Posterior} \propto \text{Likelihood} \times \text{Prior}
$$
For a binomial likelihood and beta prior, the posterior is:
$$
\text{Beta}(\alpha + k, \beta + n - k)
$$
- Use the **binomial distribution** when:
- You have a fixed number of trials $n$.
- You want to calculate the probability of observing $k$ successes.
- The probability $p$ is known or fixed.
- Use the **beta distribution** when:
- You want to model uncertainty about the probability $p$.
- You are performing Bayesian inference and need a prior or posterior distribution for $p$.
- You are combining a binomial likelihood with a beta prior to obtain a posterior distribution.
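The conjugate update can be verified numerically: a flat $\text{Beta}(1, 1)$ prior combined with six W in nine tosses should give a $\text{Beta}(1 + 6,\ 1 + 3) = \text{Beta}(7, 4)$ posterior. A sketch (grid size is arbitrary):

```{python}
import numpy as np
from scipy import stats

k, n = 6, 9          # six W in nine tosses
a, b = 1, 1          # flat Beta(1, 1) prior

p_grid = np.linspace(0, 1, 1001)

# Brute force: prior density times binomial likelihood, standardized on the grid
unstd = stats.beta.pdf(p_grid, a, b) * stats.binom.pmf(k, n, p_grid)
brute = unstd / unstd.sum()

# Conjugacy: the posterior is Beta(a + k, b + n - k) = Beta(7, 4)
conj = stats.beta.pdf(p_grid, a + k, b + n - k)
conj = conj / conj.sum()

print(np.allclose(brute, conj))  # the two routes agree
```

Conjugacy is a computational convenience, not a requirement: the brute-force product works for any prior, conjugate or not.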
------------------------------------------------------------------------
### A central role for likelihood
The most influential assumptions in both Bayesian and many non-Bayesian models are the distributions assigned to data, the likelihood functions. The likelihoods influence inference for every piece of data, and as sample size increases, the likelihood matters more and more. This helps to explain why Bayesian and non-Bayesian inferences are often so similar. If we had to explain Bayesian inference using only one aspect of it, we should describe likelihood, not priors.
------------------------------------------------------------------------
#### Unobserved variables.
The distributions we assign to the observed variables typically have their own variables. In the binomial above, there is $p$, the probability of sampling water. Since $p$ is not observed, we usually call it a **parameter**. Even though we cannot observe $p$, we still have to define it.
Later, there will be more parameters in your models. In statistical modeling, many of the most common questions we ask about data are answered directly by parameters:
- What is the average difference between treatment groups?
- How strong is the association between a treatment and an outcome?
- Does the effect of the treatment depend upon a covariate?
- How much variation is there among groups?
You’ll see how these questions become extra parameters inside the distribution function we assign to the data.
For every parameter you intend your Bayesian machine to consider, you must provide a distribution of prior plausibility, its prior. A Bayesian machine must have an initial plausibility assignment for each possible value of the parameter, and these initial assignments do useful work. When you have a previous estimate to provide to the machine, that can become the prior, as in the steps in the plot above. Back in the associated code of the plot, the machine did its learning one piece of data at a time. As a result, each estimate becomes the prior for the next step. But this doesn’t resolve the problem of providing a prior, because at the dawn of time, when $N = 0$, the machine still had an initial state of information for the parameter $p$: a flat line specifying equal plausibility for every possible value.
So where do priors come from? They are both engineering assumptions, chosen to help the machine learn, and scientific assumptions, chosen to reflect what we know about a phenomenon. The flat prior in the plot above is very common, but it is hardly ever the best prior. Later chapters will focus on prior choice a lot more.
There is a school of Bayesian inference that emphasizes choosing priors based upon the personal beliefs of the analyst. While this **subjective Bayesian** approach thrives in some statistics and philosophy and economics programs, it is rare in the sciences. Within Bayesian data analysis in the natural and social sciences, the prior is considered to be just part of the model. As such it should be chosen, evaluated, and revised just like all of the other components of the model. In practice, the subjectivist and the non-subjectivist will often analyze data in nearly the same way.
None of this should be understood to mean that any statistical analysis is not inherently subjective, because of course it is—lots of little subjective decisions are involved in all parts of science. It’s just that priors and Bayesian data analysis are no more inherently subjective than are likelihoods and the repeat sampling assumptions required for significance testing. Anyone who has visited a statistics help desk at a university has probably experienced this subjectivity—statisticians do not in general exactly agree on how to analyze anything but the simplest of problems. The fact that statistical inference uses mathematics does not imply that there is only one reasonable or useful way to conduct an analysis. Engineering uses math as well, but there are many ways to build a bridge.
Beyond all of the above, there’s no law mandating we use only one prior. If you don’t have a strong argument for any particular prior, then try different ones. Because the prior is an assumption, it should be interrogated like other assumptions: by altering it and checking how sensitive inference is to the assumption. No one is required to swear an oath to the assumptions of a model, and no set of assumptions deserves our obedience.
------------------------------------------------------------------------
#### Prior as Probability Distribution
You could write the prior in the example here as:
$$
Pr(p) = \frac{1}{1-0} = 1
$$
The prior is a probability distribution for the parameter. In general, for a uniform prior from $a$ to $b$, the probability of any point in the interval is $1/(b-a)$. If you’re bothered by the fact that the probability of every value of $p$ is 1, remember that every probability distribution must sum (integrate) to 1. The expression $1/(b-a)$ ensures that the area under the flat line from $a$ to $b$ is equal to 1. There will be more to say about this later.
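Both claims are easy to check numerically (a sketch using `scipy`; the evaluation point 0.3 is arbitrary):

```{python}
from scipy import stats, integrate

# Uniform(a, b) density: 1/(b - a) at every point inside the interval
a, b = 0, 1
print(stats.uniform.pdf(0.3, loc=a, scale=b - a))   # 1/(1 - 0) = 1.0

# The area under the flat line is 1, as every probability density requires
area, _ = integrate.quad(lambda p: stats.uniform.pdf(p, loc=a, scale=b - a), a, b)
print(round(area, 10))
```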
------------------------------------------------------------------------
#### Datum or parameter?
It is typical to conceive of data and parameters as completely different kinds of entities. Data are measured and known; parameters are unknown and must be estimated from data. Usefully, in the Bayesian framework the distinction between a datum and a parameter is not so fundamental. Sometimes we observe a variable, but sometimes we do not. In that case, the same distribution function applies, even though we didn’t observe the variable. As a result, the same assumption can look like a “likelihood” or a “prior,” depending upon context, without any change to the model. Much later, you’ll see how to exploit this deep identity between certainty (data) and uncertainty (parameters) to incorporate measurement error and missing data into your modeling.
------------------------------------------------------------------------
#### Prior, prior pants on fire
Historically, some opponents of Bayesian inference objected to the arbitrariness of priors. It’s true that priors are very flexible, being able to encode many different states of information. If the prior can be anything, isn’t it possible to get any answer you want? Indeed it is. Regardless, after a couple hundred years of Bayesian calculation, it hasn’t turned out that people use priors to lie. If your goal is to lie with statistics, you’d be a fool to do it with priors, because such a lie would be easily uncovered. Better to use the more opaque machinery of the likelihood. Or better yet—don’t actually take this advice!—massage the data, drop some “outliers,” and otherwise engage in motivated data transformation.
It is true though that choice of the likelihood is much more conventionalized than choice of prior. But conventional choices are often poor ones, smuggling in influences that can be hard to discover. In this regard, both Bayesian and non-Bayesian models are equally harried, because both traditions depend heavily upon likelihood functions and conventionalized model forms. And the fact that the non-Bayesian procedure doesn’t have to make an assumption about the prior is of little comfort. This is because non-Bayesian procedures need to make choices that Bayesian ones do not, such as choice of estimator or likelihood penalty. Often, such choices can be shown to be equivalent to some Bayesian choice of prior, or to some choice of loss function.
------------------------------------------------------------------------
### A Model is Born
With all the above work, we can now summarize our model. The observed variables $W$ and $L$ are given relative counts through the binomial distribution. So we can write, as a shortcut:
$$
W \sim \text{Binomial}(N,p)
$$
where $N = W + L$. The above is just a convention for communicating the assumption that the relative counts of ways to realize $W$ in $N$ trials with probability $p$ on each trial come from the binomial distribution.
The unobserved parameter $p$ similarly gets:
$$
p \sim \text{Uniform}(0,1)
$$
This means that $p$ has a uniform—flat—prior over its entire possible range, from zero to one. This is obviously not the best we could do, since we know the Earth has more water than land, even if we do not know the exact proportion yet.
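One way to see what these two assumption lines imply jointly is to simulate from them before any data arrive, drawing $p$ from its prior and then $W$ from the binomial (a prior predictive simulation; the seed and sample size below are arbitrary):

```{python}
import numpy as np

rng = np.random.default_rng(7)
N = 9                                     # nine tosses, as in the sample

# p ~ Uniform(0, 1): candidate proportions drawn from the prior
p_samples = rng.uniform(0, 1, size=100_000)

# W ~ Binomial(N, p): a simulated water count for each candidate proportion
w_samples = rng.binomial(N, p_samples)

# Implied distribution of W before any tossing
freq = np.bincount(w_samples, minlength=N + 1) / len(w_samples)
print(np.round(freq, 3))
```

Under the flat prior, every count from 0 to $N$ turns out to be (nearly exactly) equally plausible before tossing, which is a direct picture of just how weak the uniform prior is.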
Next, let’s see how to use these assumptions to generate inference.
## Making the Model Go
Once you have named all the variables and chosen definitions for each, a Bayesian model can update all of the prior distributions to their purely logical consequences: the **posterior distribution**. For every unique combination of data, likelihood, parameters, and prior, there is a unique posterior distribution. This distribution contains the relative plausibility of different parameter values, conditional on the data and model. The posterior distribution takes the form of the probability of the parameters, conditional on the data. In this case, it would be $Pr(p \mid W,L)$, the probability of each possible value of $p$, conditional on the specific $W$ and $L$ that we observed.
### Bayes' Theorem
The mathematical definition of the posterior distribution arises from Bayes’ theorem. This is the theorem that gives Bayesian data analysis its name. But the theorem itself is a trivial implication of probability theory. Here’s a quick derivation of it, in the context of the globe tossing example. Really this will just be a re-expression of the garden of forking data derivation from earlier in the chapter. What makes it look different is that it will use the rules of probability theory to coax out the updating rule. But it is still just counting.
The **joint probability** of the data $W$ and $L$ and any particular value of $p$ is: $$Pr(W, L, p) = Pr(W, L \mid p)Pr(p)$$
This just says that the probability of $W$, $L$ and $p$ is the product of $Pr(W,L \mid p)$ and the prior probability $Pr(p)$. This is like saying that the probability of rain and cold on the same day is equal to the probability of rain, when it’s cold, times the probability that it’s cold. This much is just definition. But it’s just as true that:
$$
Pr(W, L, p) = Pr(p \mid W, L)Pr(W,L)
$$
This is just the reverse of which probability is conditional, on the right-hand side. It is still a true definition. It’s like saying that the probability of rain and cold on the same day is equal to the probability that it’s cold, when it’s raining, times the probability of rain.
Now since both right-hand sides above are equal to the same thing, $Pr(W,L,p)$, they are also equal to one another:
$$
Pr(W, L \mid p)Pr(p) = Pr(p \mid W, L)Pr(W,L)
$$
So we can now solve for the thing that we want, $Pr(p|W,L)$:
$$
Pr(p \mid W, L) = \frac{Pr(W, L \mid p)Pr(p)}{Pr(W,L)}
$$
This is Bayes’ theorem. It says that the probability of any particular value of $p$, considering the data, is equal to the product of the relative plausibility of the data, conditional on $p$, and the prior plausibility of $p$, divided by this thing $Pr(W,L)$, which I’ll call the average probability of the data.
$$
\text{Posterior} = \frac{\text{Probability of the data} \times \text{Prior}}{\text{Average Probability of the data}}
$$
The **average probability of the data**, $Pr(W,L)$, can be confusing. It is commonly called the **“evidence”** or the “**average likelihood**,” neither of which is a transparent name. The probability $Pr(W,L)$ is literally the average probability of the data. Averaged over what? Averaged over the prior. Its job is just to standardize the posterior, to ensure it sums (integrates) to one. In mathematical form:
$$
Pr(W,L) = E(Pr(W,L \mid p)) = \int Pr(W,L\mid p)Pr(p)dp
$$
The operator $E$ means to take an *expectation*. Such averages are commonly called *marginals* in mathematical statistics, and so you may also see this same probability called a *marginal likelihood*. And the integral above just defines the proper way to compute the average over a continuous distribution of values, like the infinite possible values of $p$.
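For the globe data with a flat prior, $Pr(p) = 1$, this average is easy to compute numerically (a sketch using `scipy.integrate.quad`; under a flat prior the result works out to $1/(n+1)$):

```{python}
from scipy import stats, integrate

k, n = 6, 9   # six W in nine tosses

# Average probability of the data: integrate Pr(W, L | p) * Pr(p) dp, Pr(p) = 1
evidence, _ = integrate.quad(lambda p: stats.binom.pmf(k, n, p) * 1.0, 0, 1)
print(evidence)   # about 0.1, i.e. 1 / (n + 1) for a flat prior

# Standardizing by it turns likelihood-times-prior into a proper density;
# at p = 0.5 this matches the Beta(7, 4) posterior density
print(stats.binom.pmf(k, n, 0.5) / evidence)
```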
The key lesson is that the **posterior is proportional to the product of the prior and the probability of the data**. Why? Because for each specific value of p, the number of paths through the garden of forking data is the product of the prior number of paths and the new number of paths. Multiplication is just compressed counting. The average probability on the bottom just standardizes the counts so they sum to one. So while Bayes’ theorem looks complicated, because the relationship with counting paths is obscured, it just **expresses the counting that logic demands**.
```{python}
def uniform_prior(grid_points):
    """
    Returns a flat (Uniform) prior density; any positive constant works here,
    since the posterior is standardized later
    """
    return np.repeat(5, grid_points)

def truncated_prior(grid_points, trunc_point=0.5):
    """
    Returns Truncated prior density
    """
    return (np.linspace(0, 1, grid_points) >= trunc_point).astype(int)

def double_exp_prior(grid_points):
    """
    Returns Double Exponential prior density
    """
    return np.exp(-5 * abs(np.linspace(0, 1, grid_points) - 0.5))

def binom_grid_approx(prior_func, grid_points, k, n):
    # define grid
    p_grid = np.linspace(0, 1, grid_points)
    # define prior
    prior = prior_func(grid_points)
    # compute likelihood at each point in the grid
    likelihood = stats.binom.pmf(k, n, p_grid)
    # compute product of likelihood and prior
    unstd_posterior = likelihood * prior
    # standardize the posterior, so it sums to 1
    posterior = unstd_posterior / np.sum(unstd_posterior)
    return p_grid, likelihood, posterior
```
```{python}
def example_plot():
    w, n = 6, 9
    priors = [uniform_prior, truncated_prior, double_exp_prior]
    titles = ['Prior', 'Likelihood', 'Posterior']
    _, axs = plt.subplots(3, 3, figsize=(8, 8))
    for i in range(3):
        ax = axs[i]
        p_grid, likelihood, posterior = binom_grid_approx(priors[i], 100, w, n)
        ax[0].plot(p_grid, priors[i](100), "k-")
        ax[1].plot(p_grid, likelihood, "k-")
        ax[2].plot(p_grid, posterior, "k-")
        for ii in range(3):
            ax[ii].axes.get_yaxis().set_visible(False)
            if i == 0:
                ax[ii].set_title(f"{titles[ii]}")

utils.inline_plot(example_plot)
```
The posterior distribution as a product of the prior distribution and likelihood. Top: A flat prior constructs a posterior that is simply proportional to the likelihood. Middle: A step prior, assigning zero probability to all values less than 0.5, results in a truncated posterior. Bottom: A peaked prior that shifts and skews the posterior, relative to the likelihood.
The plot illustrates the multiplicative interaction of a prior and a probability of data. On each row, a prior on the left is multiplied by the probability of data in the middle to produce a posterior on the right. The probability of data in each case is the same. The priors however vary. As a result, the posterior distributions vary.
------------------------------------------------------------------------
#### Bayesian Data Analysis isn’t about Bayes’ Theorem
A common notion about Bayesian data analysis, and Bayesian inference more generally, is that it is distinguished by the use of Bayes’ theorem. This is a mistake. Inference under any probability concept will eventually make use of Bayes’ theorem. Common introductory examples of “Bayesian” analysis using HIV and DNA testing are not uniquely Bayesian. Since all of the elements of the calculation are frequencies of observations, a non-Bayesian analysis would do exactly the same thing. Instead, Bayesian approaches get to use Bayes’ theorem more generally, to quantify uncertainty about theoretical entities that cannot be observed, like parameters and models. Powerful inferences can be produced under both Bayesian and non-Bayesian probability concepts, but different justifications and sacrifices are necessary.
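To see why such testing examples are not uniquely Bayesian, note that the calculation involves only frequencies. Here is a minimal sketch with hypothetical numbers (prevalence, sensitivity, and false-positive rate are illustrative assumptions, not values from the text):

```python
# Hypothetical test frequencies (illustrative assumptions)
prevalence = 0.001    # Pr(disease)
sensitivity = 0.95    # Pr(positive | disease)
false_pos = 0.01      # Pr(positive | no disease)

# Bayes' theorem as pure frequency bookkeeping:
# Pr(disease | positive) = Pr(positive | disease) Pr(disease) / Pr(positive)
p_pos = sensitivity * prevalence + false_pos * (1 - prevalence)
p_disease_given_pos = sensitivity * prevalence / p_pos
print(f"{p_disease_given_pos:.3f}")  # prints 0.087
```

Every quantity above is an observable frequency, so a frequentist would perform exactly the same arithmetic.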
------------------------------------------------------------------------
### Motors: Conditioning Engines
The Bayesian model is a machine with built-in definitions for the likelihood, the parameters, and the prior. At its heart lies a motor that processes data, producing a posterior distribution. The action of this motor can be thought of as *conditioning* the prior on the data. As explained in the previous section, this conditioning is governed by the rules of probability theory, which defines a uniquely logical posterior for each set of assumptions and observations.
However, knowing the mathematical rule is often of little help, because many of the interesting models in contemporary science cannot be conditioned formally, no matter your skill in mathematics. And while some broadly useful models like linear regression can be conditioned formally, this is only possible if you constrain your choice of prior to special forms that are easy to do mathematics with. We’d like to avoid forced modeling choices of this kind, instead favoring conditioning engines that can accommodate whichever prior is most useful for inference.
What this means is that various numerical techniques are needed to approximate the mathematics that follows from the definition of Bayes’ theorem. In this book, you’ll meet three different conditioning engines, numerical techniques for computing posterior distributions:
1) Grid approximation
2) Quadratic approximation
3) Markov chain Monte Carlo (MCMC)
There are many other engines, and new ones are being invented all the time. But the three you’ll get to know here are common and widely useful. In addition, as you learn them, you’ll also learn principles that will help you understand other techniques.
------------------------------------------------------------------------
#### How you fit the model is part of the model
Earlier, we implicitly defined the model as a composite of a prior and a likelihood. That definition is typical. But in practical terms, we should also consider how the model is fit to data as part of the model. In very simple problems, like the globe tossing example that consumes this chapter, calculation of the posterior density is trivial and foolproof. In even moderately complex problems, however, the details of fitting the model to data force us to recognize that our numerical technique influences our inferences. This is because different mistakes and compromises arise under different techniques. The same model fit to the same data using different techniques may produce different answers. When something goes wrong, every piece of the machine may be suspect. And so our golems carry with them their updating engines, as much slaves to their engineering as they are to the priors and likelihoods we program into them.
------------------------------------------------------------------------
### Grid Approximation
One of the simplest conditioning techniques is grid approximation. While most parameters are continuous, capable of taking on an infinite number of values, it turns out that we can achieve an excellent approximation of the continuous posterior distribution by considering only a finite grid of parameter values. At any particular value of a parameter, $p'$, it’s a simple matter to compute the posterior probability: just multiply the prior probability of $p'$ by the likelihood at $p'$. Repeating this procedure for each value in the grid generates an approximate picture of the exact posterior distribution. This procedure is called **grid approximation**.
Grid approximation will mainly be useful as a pedagogical tool, as learning it forces the user to really understand the nature of Bayesian updating. But in most of your real modeling, grid approximation isn’t practical. The reason is that it scales very poorly, as the number of parameters increases. So in later chapters, grid approximation will fade away, to be replaced by other, more efficient techniques.
In the context of the globe tossing problem, grid approximation works extremely well. So let’s build a grid approximation for the model we’ve constructed so far. Here is the recipe:
1) Define the grid. This means you decide how many points to use in estimating the posterior, and then you make a list of the parameter values on the grid.
2) Compute the value of the prior at each parameter value on the grid.
3) Compute the likelihood at each parameter value.
4) Compute the unstandardized posterior at each parameter value, by multiplying the prior by the likelihood.
5) Finally, standardize the posterior, by dividing each value by the sum of all values.
In the globe tossing context, here’s the code to complete all five of these steps:
```{python}
def binom_grid_approx(prior_func, grid_points, k, n):
# define grid
p_grid = np.linspace(0, 1, grid_points)
# define prior
prior = prior_func(grid_points)
# compute likelihood at each point in the grid
likelihood = stats.binom.pmf(k, n, p_grid)
# compute product of likelihood and prior
unstd_posterior = likelihood * prior
# standardize the posterior, so it sums to 1
posterior = unstd_posterior / np.sum(unstd_posterior)
return p_grid, likelihood, posterior
```
The above code makes a grid of the specified number of points. Let's display the posterior distribution computed with 5, 20, and 100 points:
```{python}
def example_plot():
w, n = 6, 9
points = (5, 20, 100)
_, ax = plt.subplots(1, len(points), figsize=(4 * len(points), 4))
for idx, ps in enumerate(points):
p_grid, _, posterior = binom_grid_approx(uniform_prior, ps, w, n)
ax[idx].plot(p_grid, posterior, "o-", label=f"successes = {w}\ntosses = {n}")
ax[idx].set_xlabel("probability of water")
ax[idx].set_ylabel("posterior probability")
ax[idx].set_title(f"{ps} points")
ax[idx].legend(loc=0)
utils.inline_plot(example_plot)
```
How fine to make the grid is determined by how accurate you want your approximation to be. More points means more precision. In this simple example, you can go crazy and use 100,000 points, but there won’t be much change in inference after the first 100.
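As a quick check of that claim, here is a self-contained sketch (re-implementing the flat-prior grid recipe above) comparing the posterior mean at 100 and 100,000 grid points:

```python
import numpy as np
from scipy import stats

def grid_posterior(grid_points, k=6, n=9):
    # uniform prior over an evenly spaced grid of candidate p values
    p_grid = np.linspace(0, 1, grid_points)
    likelihood = stats.binom.pmf(k, n, p_grid)
    posterior = likelihood / likelihood.sum()  # flat prior drops out of the normalization
    return p_grid, posterior

means = {}
for pts in (100, 100_000):
    p_grid, posterior = grid_posterior(pts)
    means[pts] = np.sum(p_grid * posterior)  # posterior mean from the grid
    print(f"{pts:>7} points: posterior mean = {means[pts]:.5f}")
```

Both values agree with the analytical posterior mean $(w + 1)/(n + 2) = 7/11 \approx 0.636$ to several decimal places, so extra grid points buy essentially nothing here.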
### Quadratic Approximation
We’ll stick with the grid approximation to the globe tossing posterior, for the first few exercises. But before long we’ll have to resort to another approximation, one that makes stronger assumptions. The reason is that the number of unique values to consider in the grid grows rapidly as the number of parameters in your model increases. For the single-parameter globe tossing model, it’s no problem to compute a grid of 100 or 1000 values. But for two parameters approximated by 100 values each, that’s already $100^{2} =10,000$ values to compute. For 10 parameters, the grid becomes many billions of values. These days, it’s routine to have models with hundreds or thousands of parameters. The grid approximation strategy scales very poorly with model complexity, so it won’t get us very far.
A useful approach is **quadratic approximation**. Under quite general conditions, the region near the peak of the posterior distribution will be nearly Gaussian—or “normal”—in shape. This means the posterior distribution can be usefully approximated by a Gaussian distribution. A Gaussian distribution is convenient, because it can be completely described by only two numbers: the location of its center (**mean**) and its spread (**variance**).
A **Gaussian approximation** is called “quadratic approximation” because the logarithm of a Gaussian distribution forms a parabola. And a parabola is a quadratic function. So this approximation essentially represents any log-posterior with a parabola.
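This is easy to verify numerically: the log-density of any Gaussian is fit exactly by a second-degree polynomial. A quick sketch, with an arbitrarily chosen mean of 1.0 and standard deviation of 1.5:

```python
import numpy as np
from scipy import stats

x = np.linspace(-3, 5, 50)
log_pdf = stats.norm.logpdf(x, loc=1.0, scale=1.5)  # log of a Gaussian density

# fit a parabola to the log-density; the residual is numerically zero
coeffs = np.polyfit(x, log_pdf, deg=2)
residual = np.max(np.abs(np.polyval(coeffs, x) - log_pdf))
print(f"quadratic coefficient: {coeffs[0]:.4f}")  # equals -1/(2 * sigma^2)
print(f"max residual: {residual:.2e}")
```

The leading coefficient comes out as $-1/(2\sigma^2) \approx -0.2222$, and the fit is exact up to floating-point error: the log of a Gaussian really is a parabola.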
For many of the most common procedures in applied statistics—linear regression, for example—the approximation works very well. Often, it is even exactly correct, not actually an approximation at all. Computationally, quadratic approximation is very inexpensive, at least compared to grid approximation and MCMC (discussed next). The procedure contains two steps.
1) Find the posterior mode. This is usually accomplished by some optimization algorithm, a procedure that virtually “climbs” the posterior distribution, as if it were a mountain. The algorithm doesn’t know where the peak is, but it does know the slope under its feet. There are many well-developed optimization procedures, most of them more clever than simple hill climbing. But all of them try to find peaks.
2) Once you find the peak of the posterior, you must estimate the curvature near the peak. This curvature is sufficient to compute a quadratic approximation of the entire posterior distribution. In some cases, these calculations can be done analytically, but usually your computer uses some numerical technique instead.
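The two steps above can be sketched by hand for the globe tossing posterior, using `scipy.optimize` for the hill climb and a central finite difference for the curvature. This is only an illustration of the idea, not the machinery used in this book's models:

```python
import numpy as np
from scipy import optimize, stats

W, L = 6, 3  # six waters, three lands

def neg_log_posterior(p):
    # flat prior, so the log posterior is the log likelihood up to a constant
    return -stats.binom.logpmf(W, W + L, p)

# Step 1: climb to the posterior mode
res = optimize.minimize_scalar(neg_log_posterior, bounds=(1e-5, 1 - 1e-5), method="bounded")
mode = res.x

# Step 2: curvature of the negative log posterior at the mode,
# via a central finite difference
h = 1e-4
second_deriv = (neg_log_posterior(mode + h)
                - 2 * neg_log_posterior(mode)
                + neg_log_posterior(mode - h)) / h**2
sd = np.sqrt(1 / second_deriv)

print(f"mode = {mode:.3f}, sd = {sd:.3f}")  # mode = 0.667, sd = 0.157
```

The resulting standard deviation of about 0.16 matches the `precis()` summary reported below.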
To compute the quadratic approximation for the globe tossing data, we’ll use Stan's `optimize()` function to obtain the MLE (or MAP) of the parameter $p$, and BridgeStan's `log_density_hessian()` function to extract the model's **Hessian matrix**, from which we will obtain the standard deviation. All of this is conveniently wrapped inside the `StanQuap` class.
Let us compute the quadratic approximation to the globe tossing data:
```{python}
#| output: false
# Define the Bayesian model in Stan
globe_qa = '''
data {
int<lower=0> W; // Number of successes (Water)
int<lower=0> L; // Number of failures (Land)
}
parameters {
real<lower=0, upper=1> p; // Probability of water
}
model {
p ~ uniform(0, 1); // Uniform prior
W ~ binomial(W + L, p); // Binomial likelihood
}
'''
# Set-up the data in a dictionary
toss_data = {'W': 6, 'L': 3}
# Instantiate StanQuap object
globe_qa_model = utils.StanQuap('stan_models/globe_qa', globe_qa, toss_data)
# display summary of quadratic approximation
globe_qa_summary = globe_qa_model.precis().round(2)
globe_qa_summary
```
The `precis()` method presents a brief summary of the quadratic approximation. In this case, it shows the posterior mean value of $p=0.67$, which it calls the “Mean.” The curvature is labeled “StdDev,” which stands for *standard deviation*. This value is the standard deviation of the posterior distribution, while the mean value is its peak. Finally, the last two values in the output show the 89% percentile interval, which you’ll learn more about later. You can read this kind of approximation like: *Assuming the posterior is Gaussian, it is maximized at 0.67, and its standard deviation is 0.16*.
Since we already know the posterior, let’s compare to see how good the approximation is. We will use the analytical approach here, which uses `stats.beta.pdf`.
```{python}
Ws, Ls = [6, 12, 24], [3, 6, 12]
# quadratic approximation
def plot_quap():
_, axes = plt.subplots(1, 3, figsize=(12, 4))
x = np.linspace(0, 1, 100)
for i, ax in enumerate(axes):
toss_data = {'W': Ws[i], 'L': Ls[i]}
globe_qa_summary = utils.StanQuap('stan_models/globe_qa', globe_qa, toss_data).precis()
ax.plot(x,
stats.beta.pdf(x, Ws[i] + 1, Ls[i] + 1), # analytical posterior calculation
color='k',
label="True posterior")
ax.plot(x,
stats.norm.pdf(x, globe_qa_summary['Mean'][0], globe_qa_summary['StDev'][0]),
color='k',
linestyle='--',
label="Quadratic approximation")
        ax.set(xlabel='Proportion water', title=f"n = {Ws[i]+Ls[i]}")
utils.inline_plot(plot_quap)
```
Accuracy of the quadratic approximation. In each plot, the exact posterior distribution is plotted as the solid curve, and the quadratic approximation as the dashed curve. Left: The globe tossing data with $n=9$ tosses and $w=6$ waters. Middle: Double the amount of data, with the same fraction of water, $n=18$ and $w=12$. Right: Four times as much data, $n=36$ and $w=24$.
The quadratic approximation curve does alright on its left side, but looks pretty bad on its right side. It even assigns positive probability to $p=1$, which we know is **impossible**, since we saw at least one land sample.
As the amount of data increases, however, the quadratic approximation gets better. In the middle plot, the sample size is doubled to $n =18$ tosses, but with the same fraction of water, so that the mode of the posterior is in the same place. The quadratic approximation looks better now, although still not great. At quadruple the data, on the right side of the figure, the two curves are nearly the same now.
This phenomenon, where the quadratic approximation improves with the amount of data, is very common. It’s one of the reasons that so many classical statistical procedures are nervous about small samples: Those procedures use quadratic (or other) approximations that are only known to be safe with infinite data. Often, these approximations are useful with less than infinite data, obviously. But the rate of improvement as sample size increases varies greatly depending upon the details. In some models, the quadratic approximation can remain terrible even with thousands of samples.
Using the quadratic approximation in a Bayesian context brings with it all the same concerns. But you can always lean on some algorithm other than quadratic approximation, if you have doubts. Indeed, grid approximation works very well with small samples, because in such cases the model must be simple and the computations will be quite fast. You can also use MCMC, which is introduced next.
------------------------------------------------------------------------
#### Maximum Likelihood Estimation (MLE)
The quadratic approximation, either with a uniform prior or with a lot of data, is often equivalent to a **maximum likelihood estimate** (MLE) and its **standard error**. The MLE is a very common non-Bayesian parameter estimate. This correspondence between a Bayesian approximation and a common non-Bayesian estimator is both a blessing and a curse. It is a blessing, because it allows us to reinterpret a wide range of published non-Bayesian model fits in Bayesian terms. It is a curse, because maximum likelihood estimates have some curious drawbacks, and the quadratic approximation can share them. We’ll explore these drawbacks later; they are one of the reasons we’ll turn to Markov chain Monte Carlo.
------------------------------------------------------------------------
#### Hessian Matrix
A Hessian is a square matrix of second derivatives. It is used for many purposes in mathematics, but in the quadratic approximation it is second derivatives of the log of posterior probability with respect to the parameters. It turns out that these derivatives are sufficient to describe a Gaussian distribution, because the logarithm of a Gaussian distribution is just a parabola. Parabolas have no derivatives beyond the second, so once we know the center of the parabola (the posterior mode) and its second derivative, we know everything about it. And indeed the second derivative (with respect to the outcome) of the logarithm of a Gaussian distribution is proportional to its **inverse squared standard deviation** (its “precision”). So knowing the standard deviation tells us everything about its shape.
The standard deviation is typically computed from the Hessian, so computing the Hessian is nearly always a necessary step. For now it’s enough to recognize the term and associate it with an attempt to find the standard deviation for a quadratic approximation.
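For the globe tossing model with a flat prior, this connection can be written out exactly. The log posterior is, up to a constant, $W\log p + L\log(1-p)$, so its second derivative is

$$
\frac{d^{2}}{dp^{2}}\,\log \Pr(p \mid W, L) = -\frac{W}{p^{2}} - \frac{L}{(1-p)^{2}}
$$

Evaluated at the mode $\hat{p} = W/(W+L) = 2/3$ with $W=6$ and $L=3$, this gives $-13.5 - 27 = -40.5$, so $\sigma = \sqrt{-1/H} = \sqrt{1/40.5} \approx 0.16$, the same standard deviation that `precis()` reported above.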
------------------------------------------------------------------------
#### Conjugacy
- Source: [Conjugate Prior](https://en.wikipedia.org/wiki/Conjugate_prior)
If, given a likelihood function $p(x\mid \theta)$, the posterior distribution $p(\theta \mid x)$ is in the same probability distribution family as the prior probability distribution $p(\theta)$, the prior and posterior are then called **conjugate distributions** with respect to that likelihood function, and the prior is called a **conjugate prior** for the likelihood function $p(x\mid \theta)$.
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise, numerical integration may be necessary. Further, conjugate priors may give intuition by more transparently showing how a likelihood function updates a prior distribution.
The form of the conjugate prior can generally be determined by inspection of the probability density or probability mass function of a distribution. For example, consider a random variable which consists of the number of successes $s$ in $n$ Bernoulli trials with unknown probability of success $q$ in \[0,1\]. This random variable will follow the binomial distribution, with a probability mass function of the form
$$
p(s)={n \choose s}q^{s}(1-q)^{n-s}
$$
The usual conjugate prior is the beta distribution with parameters $\alpha$, $\beta$:
$$
p(q)={q^{\alpha -1}(1-q)^{\beta -1} \over \mathrm {B} (\alpha ,\beta )}
$$
where $\alpha$ and $\beta$ are chosen to reflect any existing belief or information ($\alpha =1$ and $\beta =1$ would give a uniform distribution) and $B(\alpha ,\beta)$ is the Beta function acting as a normalizing constant.
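Multiplying this prior by the binomial likelihood shows why the update stays in the beta family:

$$
p(q \mid s) \;\propto\; q^{s}(1-q)^{n-s} \cdot q^{\alpha-1}(1-q)^{\beta-1} \;=\; q^{(\alpha+s)-1}(1-q)^{(\beta+n-s)-1}
$$

which is the kernel of a $\text{Beta}(\alpha+s,\ \beta+n-s)$ distribution. With a uniform prior ($\alpha=\beta=1$) and the globe tossing data ($s = W = 6$, $n = 9$), the posterior is $\text{Beta}(7, 4)$, which is why `stats.beta.pdf(x, W + 1, L + 1)` serves as the exact posterior in the comparisons in this chapter.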
The value computed below is the relative number of ways to get six water, holding $p$ at 0.5 and $N = W + L$ at nine. So it does the job of counting the relative number of paths through the garden.
```{python}
# Binomial distribution - likelihood of six W's in nine tosses, under p = 0.5
print(stats.binom.pmf(6, 9, 0.5))
# Beta distribution - posterior density at p = 0.5, given six W's and three L's
print(stats.beta.pdf(0.5, 6 + 1, 3 + 1))
```
```{python}
p = np.linspace(0, 1, 100)
def beta_binom_plot():
fig, ax1 = plt.subplots()
ax1.set_xlabel('$p$')
ax1.set_ylabel('Binomial - Likelihood', color='k')
ax1.plot(p, stats.binom.pmf(7, 10, p), color='k')
ax1.tick_params(axis='y')
ax2 = ax1.twinx()
ax2.set_ylabel('Beta - Posterior Probability Density', color='r')
ax2.plot(p, stats.beta.pdf(p, 7+1, 3+1), color='r', linewidth=6, alpha=0.2)
ax2.tick_params(axis='y', labelcolor='r')
utils.inline_plot(beta_binom_plot)
```
------------------------------------------------------------------------
### Markov Chain Monte Carlo (MCMC)
There are lots of important model types, like multilevel (mixed-effects) models, for which neither grid approximation nor quadratic approximation is always satisfactory. Such models may have hundreds, thousands, or tens of thousands of parameters. Grid approximation routinely fails here, because it just takes too long. Special forms of quadratic approximation might work, if everything is just right. But commonly, something is not just right. Furthermore, multilevel models do not always allow us to write down a single, unified function for the posterior distribution. This means that the function to maximize (when finding the MAP) is not known, but must be computed in pieces.
As a result, various counterintuitive model fitting techniques have arisen. The most popular of these is Markov chain Monte Carlo (MCMC), which is a family of conditioning engines capable of handling highly complex models.
The conceptual challenge with MCMC lies in its highly non-obvious strategy. Instead of attempting to compute or approximate the posterior distribution directly, MCMC techniques merely draw samples from the posterior. You end up with a collection of parameter values, and the frequencies of these values correspond to the posterior plausibilities. You can then build a picture of the posterior from the histogram of these samples.
We nearly always work directly with these samples, rather than first constructing some mathematical estimate from them. And the samples are in many ways more convenient than having the posterior, because they are easier to think with.
------------------------------------------------------------------------
#### A) From Prior to Posterior - Monte Carlo globe tossing
The code below implements a simple **Metropolis-Hastings algorithm**, a Markov Chain Monte Carlo (MCMC) method, to approximate the posterior distribution of a probability parameter $p$, given a likelihood defined by a binomial model.
**Key Components**:
1. **Inputs**:
    - `W = 6`: Number of water observations (successes).
    - `L = 3`: Number of land observations (failures).
- `n_samples = 1000`: Total number of samples to generate for the Markov chain.
- Initial probability, `p[0] = 0.5`.
2. **Model**:
    - $p$ represents the probability of success (the proportion of water) in a binomial distribution. The algorithm explores possible values for $p$ to estimate its posterior distribution.
- The likelihood is computed using the binomial probability mass function (`stats.binom.pmf`).
3. **Proposal Distribution**:
- A new candidate value, $p_{\text{new}}$, is proposed at each step, sampled from a normal distribution centered at the current value $p[i-1]$ with a small standard deviation (`0.1`).
4. **Reflection Boundary**:
- To ensure $p_{\text{new}}$ stays in the valid range $[0, 1]$:
- If $p_{\text{new}} < 0$, it is reflected back: $p_{\text{new}} = -p_{\text{new}}$.
- If $p_{\text{new}} > 1$, it is reflected back: $p_{\text{new}} = 2 - p_{\text{new}}$.
5. **Acceptance Step**:
- The acceptance ratio, $\frac{q_1}{q_0}$, is calculated, where:
- $q_0$: Likelihood of observing `W` successes given $p[i-1]$.
- $q_1$: Likelihood of observing `W` successes given $p_{\text{new}}$.
- A random number from a uniform distribution is drawn, and if it is less than $\frac{q_1}{q_0}$, the new value is accepted ($p[i] = p_{\text{new}}$). Otherwise, the chain retains the previous value ($p[i] = p[i-1]$).
**Outcome**:
- The algorithm generates a sequence of 1000 samples (`p`) that approximate the posterior distribution of the probability parameter $p$.
- Over many iterations, the chain will converge to the posterior distribution of $p$, allowing statistical inference about the probability of success given the observed data ($W = 6$, $L = 3$).
This is a simple example of Bayesian inference using MCMC to estimate a posterior distribution for a parameter in a binomial model.
```{python}
n_samples = 1000
p = np.zeros(n_samples)
p[0] = 0.5
W = 6
L = 3
for i in range(1, n_samples):
    p_new = stats.norm.rvs(p[i - 1], 0.1)  # propose from a normal centered at the current value
if p_new < 0: p_new = -p_new
if p_new > 1: p_new = 2 - p_new
q0 = stats.binom.pmf(W, n=W + L, p=p[i - 1])
q1 = stats.binom.pmf(W, n=W + L, p=p_new)
if stats.uniform.rvs(0, 1) < q1 / q0:
p[i] = p_new
else:
p[i] = p[i - 1]
```
The values in $p$ are samples from the posterior distribution. To compare to the analytical posterior:
```{python}
def plot_post_dist():
bw = utils.bw_nrd0(p)
az.plot_kde(p, bw=bw, label="Metropolis approximation", plot_kwargs={'color': 'black', 'linestyle':'--'})
x = np.linspace(0, 1, 100)
plt.plot(x, stats.beta.pdf(x, W + 1, L + 1), color='k', label="True posterior")
plt.legend();
utils.inline_plot(plot_post_dist)
```
------------------------------------------------------------------------
## Summary
> - For each possible explanation of the sample
> - Count all the ways the sample could occur
> - The explanations with the largest number of ways to produce the observed sample are more plausible