Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dustin Stansbury's statistical rethinking notebooks #754

Open
wants to merge 51 commits into
base: main
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from 50 commits
Commits
Show all changes
51 commits
Select commit Hold shift + click to select a range
1f02291
Draft update of BNN notebook
fonnesbeck May 6, 2022
d274a08
Pre-commit fixes
fonnesbeck May 6, 2022
3897626
Merge branch 'main' into bnn_update
fonnesbeck May 30, 2022
bad197e
Address reviewer comments
fonnesbeck May 30, 2022
4ad5e16
Merge branch 'fonnesbeck-bnn_update'
fonnesbeck May 30, 2022
287de21
Merge remote-tracking branch 'upstream/main'
Jul 22, 2022
80d58f9
Merge remote-tracking branch 'upstream/main'
Sep 15, 2022
aa2412d
Merge branch 'main' of github.com:fonnesbeck/pymc-examples
Feb 1, 2023
95328be
Merge remote-tracking branch 'upstream/main'
Feb 1, 2023
8b2effe
Merge remote-tracking branch 'upstream/main'
Feb 3, 2023
ecbca86
Merge remote-tracking branch 'upstream/main'
fonnesbeck Mar 17, 2023
2b1faab
Merge remote-tracking branch 'upstream/main'
fonnesbeck Apr 11, 2023
e7bbd7c
Merge remote-tracking branch 'upstream/main'
fonnesbeck Jun 30, 2023
dd4bf34
Merge remote-tracking branch 'upstream/main'
fonnesbeck Nov 17, 2023
5a56149
Merge remote-tracking branch 'upstream/main'
fonnesbeck Nov 19, 2023
1c85de6
Merge remote-tracking branch 'upstream/main'
fonnesbeck Mar 10, 2024
9e804eb
Merge remote-tracking branch 'upstream/main'
fonnesbeck Jul 17, 2024
fae292f
Merge remote-tracking branch 'upstream/main'
fonnesbeck Jul 18, 2024
7c214ef
Merge remote-tracking branch 'upstream/main'
fonnesbeck Sep 11, 2024
92f15dc
Merge remote-tracking branch 'upstream/main'
fonnesbeck Sep 16, 2024
4b8070d
Additional edits
fonnesbeck Sep 23, 2024
037964d
Removed bivariate example; updated formatting
fonnesbeck Sep 24, 2024
e4317db
Merge remote-tracking branch 'upstream/main'
fonnesbeck Dec 14, 2024
bda97af
Merge remote-tracking branch 'upstream/main'
fonnesbeck Dec 14, 2024
bb3e4b3
Merge remote-tracking branch 'upstream/main'
fonnesbeck Dec 17, 2024
32d92df
Draft of first notebook
fonnesbeck Dec 17, 2024
bbac7ad
Lecture 3 draft
fonnesbeck Dec 17, 2024
3338b74
Added two more notebooks, plus all data and images
fonnesbeck Dec 17, 2024
548f82c
Added two more notebooks
fonnesbeck Dec 17, 2024
41a1a43
Added lecture 8
fonnesbeck Dec 17, 2024
5947ff3
Added several more adapted notebooks
fonnesbeck Dec 17, 2024
1c3ce6f
Added lecture 11
fonnesbeck Dec 17, 2024
48e6ea4
Two more adapted notebooks added
fonnesbeck Dec 17, 2024
25183b3
Add statistical rethinking folder to sidebar
fonnesbeck Dec 17, 2024
93084ed
Added lecture 15
fonnesbeck Dec 17, 2024
06fa93c
Added lecture 16
fonnesbeck Dec 17, 2024
097b341
Added lecture 17
fonnesbeck Dec 17, 2024
814efdf
Added lecture 19
fonnesbeck Dec 17, 2024
ce475e6
Added lecture 20
fonnesbeck Dec 17, 2024
0b56e8e
Added lecture 18
fonnesbeck Dec 17, 2024
aaa934e
Merge branch 'main' into statistical_rethinking
fonnesbeck Jan 4, 2025
8e6b7ac
Sort notebook paths by base filename before generating gallery rst
jessegrabowski Jan 4, 2025
6b81693
Draft update of BNN notebook
fonnesbeck May 6, 2022
211420c
Pre-commit fixes
fonnesbeck May 6, 2022
5b12553
Address reviewer comments
fonnesbeck May 30, 2022
26c9926
Additional edits
fonnesbeck Sep 23, 2024
c34dec1
Removed bivariate example; updated formatting
fonnesbeck Sep 24, 2024
c54adc4
Rename lecture notebooks
fonnesbeck Jan 4, 2025
bdd8624
Renamed notebooks
fonnesbeck Jan 4, 2025
3c23a08
Changed style in Categories & Curves
fonnesbeck Jan 19, 2025
b8de082
Replaced MutableData with Data throughout
fonnesbeck Jan 31, 2025
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,262 changes: 1,262 additions & 0 deletions examples/statistical_rethinking_lectures/02-The_Garden_of_Forking_Data.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

1,645 changes: 1,645 additions & 0 deletions examples/statistical_rethinking_lectures/03-Geocentric_Models.ipynb

Large diffs are not rendered by default.

577 changes: 577 additions & 0 deletions examples/statistical_rethinking_lectures/03-Geocentric_Models.myst.md

Large diffs are not rendered by default.

4,950 changes: 4,950 additions & 0 deletions examples/statistical_rethinking_lectures/04-Categories_&_Curves.ipynb

Large diffs are not rendered by default.

1,273 changes: 1,273 additions & 0 deletions examples/statistical_rethinking_lectures/04-Categories_&_Curves.myst.md

Large diffs are not rendered by default.

4,054 changes: 4,054 additions & 0 deletions examples/statistical_rethinking_lectures/05-Elemental_Confounds.ipynb

Large diffs are not rendered by default.

1,387 changes: 1,387 additions & 0 deletions examples/statistical_rethinking_lectures/05-Elemental_Confounds.myst.md

Large diffs are not rendered by default.

5,146 changes: 5,146 additions & 0 deletions examples/statistical_rethinking_lectures/06-Good_&_Bad_Controls.ipynb

Large diffs are not rendered by default.

1,312 changes: 1,312 additions & 0 deletions examples/statistical_rethinking_lectures/06-Good_&_Bad_Controls.myst.md

Large diffs are not rendered by default.

2,155 changes: 2,155 additions & 0 deletions examples/statistical_rethinking_lectures/07-Fitting_Over_&_Under.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

3,598 changes: 3,598 additions & 0 deletions examples/statistical_rethinking_lectures/08-Markov_Chain_Monte_Carlo.ipynb

Large diffs are not rendered by default.

1,073 changes: 1,073 additions & 0 deletions examples/statistical_rethinking_lectures/08-Markov_Chain_Monte_Carlo.myst.md

Large diffs are not rendered by default.

4,680 changes: 4,680 additions & 0 deletions examples/statistical_rethinking_lectures/09-Modeling_Events.ipynb

Large diffs are not rendered by default.

1,226 changes: 1,226 additions & 0 deletions examples/statistical_rethinking_lectures/09-Modeling_Events.myst.md

Large diffs are not rendered by default.

5,721 changes: 5,721 additions & 0 deletions examples/statistical_rethinking_lectures/10-Counts_&_Hidden_Confounds.ipynb

Large diffs are not rendered by default.

1,391 changes: 1,391 additions & 0 deletions examples/statistical_rethinking_lectures/10-Counts_&_Hidden_Confounds.myst.md

Large diffs are not rendered by default.

4,223 changes: 4,223 additions & 0 deletions examples/statistical_rethinking_lectures/11-Ordered_Categories.ipynb

Large diffs are not rendered by default.

1,027 changes: 1,027 additions & 0 deletions examples/statistical_rethinking_lectures/11-Ordered_Categories.myst.md

Large diffs are not rendered by default.

3,337 changes: 3,337 additions & 0 deletions examples/statistical_rethinking_lectures/12-Multilevel_Models.ipynb

Large diffs are not rendered by default.

1,142 changes: 1,142 additions & 0 deletions examples/statistical_rethinking_lectures/12-Multilevel_Models.myst.md

Large diffs are not rendered by default.

1,753 changes: 1,753 additions & 0 deletions examples/statistical_rethinking_lectures/13-Multilevel_Adventures.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

2,007 changes: 2,007 additions & 0 deletions examples/statistical_rethinking_lectures/14-Correlated_Features.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

4,089 changes: 4,089 additions & 0 deletions examples/statistical_rethinking_lectures/15-Social_Networks.ipynb

Large diffs are not rendered by default.

1,293 changes: 1,293 additions & 0 deletions examples/statistical_rethinking_lectures/15-Social_Networks.myst.md

Large diffs are not rendered by default.

5,790 changes: 5,790 additions & 0 deletions examples/statistical_rethinking_lectures/16-Gaussian_Processes.ipynb

Large diffs are not rendered by default.

1,553 changes: 1,553 additions & 0 deletions examples/statistical_rethinking_lectures/16-Gaussian_Processes.myst.md

Large diffs are not rendered by default.

Large diffs are not rendered by default.

Large diffs are not rendered by default.

5,216 changes: 5,216 additions & 0 deletions examples/statistical_rethinking_lectures/18-Missing_Data.ipynb

Large diffs are not rendered by default.

1,420 changes: 1,420 additions & 0 deletions examples/statistical_rethinking_lectures/18-Missing_Data.myst.md

Large diffs are not rendered by default.

2,292 changes: 2,292 additions & 0 deletions examples/statistical_rethinking_lectures/19-Generalized_Linear_Madness.ipynb

Large diffs are not rendered by default.

Large diffs are not rendered by default.

403 changes: 403 additions & 0 deletions examples/statistical_rethinking_lectures/20-Horoscopes.ipynb

Large diffs are not rendered by default.

228 changes: 228 additions & 0 deletions examples/statistical_rethinking_lectures/20-Horoscopes.myst.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,228 @@
---
jupytext:
text_representation:
extension: .md
format_name: myst
format_version: 0.13
kernelspec:
display_name: Python 3 (ipykernel)
language: python
name: python3
---

(lecture_20)=
# Horoscopes
:::{post} Jan 7, 2024
:tags: statistical rethinking, bayesian inference, scientific workflow
:category: intermediate
:author: Dustin Stansbury
:::

This notebook is part of the PyMC port of the [Statistical Rethinking 2023](https://github.com/rmcelreath/stat_rethinking_2023) lecture series by Richard McElreath.

[Video - Lecture 20 - Horoscopes](https://youtu.be/qwF-st2NGTU)# [Lecture 20 - Horoscopes](https://www.youtube.com/watch?v=qwF-st2NGTU)

```{code-cell} ipython3
# Ignore warnings
import warnings

import arviz as az
import numpy as np
import pandas as pd
import pymc as pm
import statsmodels.formula.api as smf
import utils as utils
import xarray as xr

from matplotlib import pyplot as plt
from matplotlib import style
from scipy import stats as stats

warnings.filterwarnings("ignore")

# Set matplotlib style
STYLE = "statistical-rethinking-2023.mplstyle"
style.use(STYLE)
```

# Horoscopes

This lecture mostly outlines a set of high-level heuristics and workflows to improve the quality of scientific research. Therefore there's not a lot of implementation details in the lecture to cover. I won't go through copying the content from each slide, but I cover some highlights (mostly for my own benefit) below:

## Statistics is like fortune telling

- Vague facts lead to vague advice
- Reading tea leaves is like following common flow charts for statistical analysis
- There's little scientific inputs, therefore little scientific interpretation
- That's often the feature and the bug of fortune telling, and statistics:
- by providing vague interpretations (e.g. horoscope predictions) from vague inputs (e.g. birthday), they can "explain" any number of outcomes
- just like vague horoscopes can "explain" any number of possible future events
- Exaggerated importance
- no one wants to hear evil portents in their tea leaves, just as no one wants to hear about NULL or negative statistical results
- there's often incentive to use statistics to find the positive result
- It's often easier to offload subjective scientific responsibility onto objective statistical procedures

## Three pillars of scientific workflow
**1. Planning**
- Goal setting
- estimands
- Theory building
- assumptions
- 4 types of theory building, increasing in specificity
1. Heuristic (DAGs)
- allows us to deduce a lot from establishing causal structure
3. Structural
- moves beyond DAGs by establishing specific functional forms of causes
5. Dynamical models
- usually work over spatial/temporal grid
- tend to collapse large number of micro-states into macro interpretation
7. Agent-based
- focuses on individual micro states
- Justified sampling
- Which data do we use, and what's it's structure
- Verify with simulation
- Justified analysis
- Which golems?
- Can we recover estimands from simulations?
- Documentation
- How did it happen?
- Help others and your future self
- Scripting is self-documenting
- Comments are important
- Don't be clever, be explicit
- Avoid clever one-liners
- I find Python PEP useful here
- Sharing
- open source code and data formats
- proprietary software does not facilitate shareing, and is bad scientific ethics
- the irony here, is that MATLAB is so common in academic setting, particularly engineering 🙄
- proprietery data formats can shoot you in the foot when you (or others) can no longer open them
- Preregistration isn't a silver bullet
- Pre-allocating expectations on a bad analysis approach (e.g. causal salad) doesn't fix the bad approach

**2. Working**

- Research engineering
- Treat research more like software enginnering
- standardized, battle-tested procedures that make software dependable and repeatable
- version control (git)
- testing
- unit testing
- integration testing
- build up tests incrementally, validating each part of the workflow before proceeding to the next
- documentation
- review
- 👀, 👀 have at least one other person review your analysis code and docs, and provide feedback
- will often point out bugs, optimizations, or shortcomings in documentation
- Look at good examples
- e.g. on of [McElreath's Consulting Projects](https://github.com/rmcelreath/CES_rater_2021/tree/main)
- [Data Carpentry](https://datacarpentry.org/)

**3. Reporting**

- Sharing materials
- by following code-based Working flow, sharing is pretty much done for you
- [Nice breakdown and example of Describing Methods](https://youtu.be/qwF-st2NGTU?si=3I7CMalLXv3pIQhr&t=3742)
- Justify priors
- Justify methods, and dealing with reviewers
- Common fallacy: "good scientific design doesn't require complex statistics"
- valid causal modeling requires complexity
- don't try to convince Reviewer 3 to accept your methods, write to editor
- move the convo from statistical to causal modeling
- Describe data
- structure
- missing values: justify imputation if any
- Describe results
- aim to report contrasts and marginal effects
- use densities over intervals
- avoid interpeting coefficients as causal effects
- Making decisions
- this is often the goal (particularly in industry)
- embrace uncertainty
- uncertainty is not admission of weakness
- Bayesian decision theory
- use the posterior to simulate various policy interventions
- can be used to provide posteriors to costs/benefits due to those interventions

+++

## Scientific Reform

- many of the metrics for good science are counterproductive
- e.g. papers that are least replicated continue to have higher citation count
- META POINT: this result in publishing be explained using a causal modeling and colider bias

### Collider bias in scientific publishing

#### Causal model of collider bias

```{code-cell} ipython3
---
jupyter:
source_hidden: true
---
utils.draw_causal_graph(
edge_list=[("newsworhiness, N", "published, P"), ("trustworthiness, T", "published, P")],
)
```

#### Simulating data from collider causal model

```{code-cell} ipython3
np.random.seed(123)
n_samples = 200

# N and T are independent
N = stats.norm.rvs(size=n_samples)
T = stats.norm.rvs(size=n_samples)

# Award criterion; either are large enough to threshold
A = np.where(N + T > 2, 1, 0)

for awarded in [0, 1]:
color = "gray" if not awarded else "C0"
N_A = N[A == awarded]
T_A = T[A == awarded]
utils.plot_scatter(N_A, T_A, color=color)

fit_data = pd.DataFrame({"N": N_A, "T": T_A})
cc = np.corrcoef(fit_data.T, fit_data.N)[0][1]
awarded_model = smf.ols("T ~ N", data=fit_data).fit()
utils.plot_line(
N_A, awarded_model.predict(), color="C0", label=f"Post-selection\nTrend\ncorrelation={cc:0.2}"
)
plt.xlabel("Newsworthiness")
plt.ylabel("Trustworthiness")
plt.axis("square")
plt.legend();
```

By selecting at papers that are published based on a threshold that combines either newsworthiness--i.e. "sexy papers" that get cited a lot--or trustworthiness--i.e. boring papers that are replicable--we end up with highly-cited papers that tend to be less replicable.

+++ {"jp-MarkdownHeadingCollapsed": true}

## Horoscopes of research

- Many things that are "bad" about science (e.g. impact factor) are once well-intentioned reforms
- Some potential fixes are available:
1. No stats before transparently-communicated causal model
- avoid causal salad
2. Prove your code/analysis works within the scope of your project and assumptions
3. Share as much as possible
- sometimes data is not shareable
- but you can create partial, anonomized, or synthetic datasets
4. Beware proxies for research quality (e.g. citation count, impact factor)

+++

## Authors
* Ported to PyMC by Dustin Stansbury (2024)
* Based on Statistical Rethinking (2023) lectures by Richard McElreath

```{code-cell} ipython3
%load_ext watermark
%watermark -n -u -v -iv -w -p pytensor,xarray
```

:::{include} ../page_footer.md
:::
Loading
Loading