Hierarchical Causal Models #236

Open
wants to merge 114 commits into
base: main

Conversation

adamrupe
Collaborator

@adamrupe adamrupe commented Sep 6, 2024

Closes #278

This PR implements Hierarchical Causal Models (Weinstein and Blei, 2024)

This PR will be ready for review when the following algorithms have been implemented and tested.

  • Algorithm 1: Graphical algorithm for collapsing a hierarchical causal graphical model (HCGM). This algorithm transforms the graph of a hierarchical causal model (HCM) into the graph of its collapsed model, following Definition 4.
  • Algorithm 2: Graphical algorithm for augmenting a collapsed model. This algorithm adds an augmentation variable to a collapsed HCGM, following Definition 6.
  • Algorithm 3: Graphical algorithm for marginalizing an augmented model. This algorithm marginalizes out the parent(s) of an augmentation variable (Section 5.2).
  • Causal query pipeline: Uses Algorithms 1-3 (as needed) to check whether a causal query is identifiable in the HCM. Whether Algorithms 2 and 3 are used depends on the causal query, i.e., whether a variable needs to be augmented in (Algorithm 2) and then whether another variable needs to be marginalized out (Algorithm 3).
  • HSCM tests
  • High-level example (with real-world motivation) that shows how to run a causal query on an HCM

@adamrupe adamrupe linked an issue Sep 6, 2024 that may be closed by this pull request

codecov bot commented Sep 6, 2024

Codecov Report

Attention: Patch coverage is 88.12500% with 19 lines in your changes missing coverage. Please review.

Project coverage is 81.27%. Comparing base (05a9456) to head (3af8c66).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/y0/hierarchical.py | 88.12% | 9 Missing and 10 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #236      +/-   ##
==========================================
+ Coverage   80.87%   81.27%   +0.39%     
==========================================
  Files          50       51       +1     
  Lines        4135     4314     +179     
  Branches      845      981     +136     
==========================================
+ Hits         3344     3506     +162     
- Misses        668      670       +2     
- Partials      123      138      +15     


@cthoyt
Member

cthoyt commented Sep 10, 2024

hi @adamrupe - can you add a checklist into the PR description with the tasks to complete for this PR before it needs review?

@Copilot Copilot AI left a comment

Copilot reviewed 18 out of 19 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • tox.ini: Language not supported

@djinnome djinnome marked this pull request as ready for review January 24, 2025 00:01
cthoyt

This comment was marked as outdated.

@cthoyt cthoyt force-pushed the HCM-fig2 branch 2 times, most recently from 3237de1 to ef54579 Compare February 3, 2025 08:42
Member

@cthoyt cthoyt left a comment

I did a major refactor to address the software issues from the last round. The next steps for @adamrupe and @djinnome are:

  1. Read through the code and familiarize yourselves with the new interface
  2. Comment on / address all TODO's I left in the code (there aren't many)
  3. Tests
    • Either implement tests for conversion to HSCM or delete the conversion code
    • Test augment_collapsed_model
  4. Check the notebook, which used to raise some exceptions, but I replaced those with the high-level identify_outcomes API. Please review to make sure that the places where there is no estimand produced because the graph has a single c-component are all correct
  5. Create a high-level, real world example that demonstrates using all of the code in a story-driven workflow (i.e., do not explain the math, only explain which functions you implemented solve the problem). Use https://github.com/y0-causal-inference/y0/blob/main/notebooks/Counterfactual%20Transportability.ipynb as a golden standard for how a great notebook with applications looks

Along the way, please make sure that you check the CI/CD system for automated, objective feedback on code quality. @adamrupe if you're not familiar with how to do this, I am happy to show you.

@adamrupe
Collaborator Author

adamrupe commented Feb 5, 2025

@cthoyt What's your recommendation for handling merge conflicts with jupyter notebooks? I need to do this before I can pull your updates. I'm also not familiar with the CI/CD system, so if you could talk me through it that would be great.

@cthoyt
Member

cthoyt commented Feb 5, 2025

@adamrupe before merging, copy your local notebook to your desktop. While merging, throw away everything from your repository's copy and overwrite it with remote. Then, you can think about manually inspecting your notebook on your desktop, and the new version from the remote repo side-by-side.

The best way to avoid this kind of thing is never to leave changes unpushed when you finish working, and to always pull before you start working again.
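
One way to script the advice above is sketched below. It is only an illustration, not part of the repository's tooling: the notebook path is hypothetical, the helper names are made up for this example, and the git commands assume a plain merge (`git pull` without rebase), where `--theirs` refers to the incoming remote version.

```python
import shutil
import tempfile
from pathlib import Path


def backup_notebook(notebook: Path, backup_dir: Path) -> Path:
    """Copy a notebook outside the repository before merging; return the copy's path."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / notebook.name
    shutil.copy2(notebook, target)  # copy2 also preserves timestamps
    return target


def take_remote_commands(notebook: Path) -> list[list[str]]:
    """Git commands that resolve a conflicted notebook in favor of the remote side.

    Assumption: the conflict comes from a plain merge, not a rebase,
    so `--theirs` is the version coming from the remote branch.
    """
    return [
        ["git", "checkout", "--theirs", "--", str(notebook)],
        ["git", "add", "--", str(notebook)],
    ]


# Demo in a throwaway directory so no real repository is touched.
with tempfile.TemporaryDirectory() as tmp:
    repo_copy = Path(tmp) / "repo" / "notebooks" / "example.ipynb"  # hypothetical path
    repo_copy.parent.mkdir(parents=True)
    repo_copy.write_text('{"cells": []}')
    saved = backup_notebook(repo_copy, Path(tmp) / "desktop")
    backup_matches = saved.read_text() == repo_copy.read_text()
    commands = take_remote_commands(repo_copy)
```

The commands are returned as lists rather than executed, so they can be inspected first and then passed to `subprocess.run` once the merge has actually reported a conflict. The backed-up copy can afterwards be compared by hand with the remote version.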


The short explanation of how to use the CI/CD system is: you can always scroll to the bottom of this pull request (#236) and look at the feedback given by GitHub running our unit tests, linting, and code quality checks.

This is what it looks like to me right now:

Screenshot 2025-02-05 at 23 51 20

You can click on any of the rows with the red x's, and then it will bring you to the page that ran the tests for you. Right now, you will be able to see all of the output from running pytest. You have to scroll up a bit since unfortunately, pytest reports timings and warnings after test failures, but you can see https://github.com/y0-causal-inference/y0/actions/runs/13134504627/job/36646756591?pr=236#step:6:69 for the currently failing test.

Similarly, while you're still getting used to having code quality checks, you will probably see that the linting or type checking scripts also give errors, which you can view in the same way.

It's sort of the expectation in a team setting for coding that you make pushes often, and each time check out what kind of feedback CI gives. This will help you iteratively make your code better, with fully objective feedback that you don't have to wait on someone else to give you. As an alternative to CI/CD in GitHub, you can run tox, which also executes the whole test suite reproducibly.

There's documentation in the README on how to use all of the nice development tools built into this repo at https://github.com/y0-causal-inference/y0?tab=readme-ov-file#%EF%B8%8F-for-developers

If you get caught up on any parts of this that aren't self-explanatory, I'm happy to plan a video chat tomorrow, or sometime next week before 6 PM Germany time.

@adamrupe
Collaborator Author

adamrupe commented Feb 6, 2025

Awesome, thanks @cthoyt! That makes sense, and Jeremy and Richard have already shown me how to use tox a bit. I've pulled your changes and I'm going through them now. I'll add a test_to_hscm.

@cthoyt
Member

cthoyt commented Mar 1, 2025

@adamrupe you should be unblocked on the CI/CD pipeline now. looking forward to seeing a nice case study notebook, then we can finish this PR!

@adamrupe
Collaborator Author

@cthoyt @djinnome I've changed the name of the previous notebook to HCM Manuscript Figures.ipynb and added a new case study notebook called Hierarchical Causal Models.ipynb.

@cthoyt
Member

cthoyt commented Mar 18, 2025

I'd like you to consider what makes https://github.com/y0-causal-inference/y0/blob/main/notebooks/Surrogate%20Outcomes.ipynb a joy to read and try and take some lessons from it to improve the HCM notebook.

  1. Write your notebook keeping in mind that you are the last human being who ever has to understand the math behind the implementation you wrote.
  2. Imagine that all users of y0 want to solve a real problem, and they are reading the documentation to understand how they can model their problem using the data structures and algorithms in y0. They do not appreciate:
    • Abstract headings. Name each case study by the actual problem it's about, not the archetypical HCM as named by the paper. E.g., "Confounder" -> "After-school Tutoring and Test Scores"
    • Prose written like a mathematician. Rather than writing "Consider a school district that is interested in understanding how effective after-school tutoring is at raising test scores.", write "A school district is interested in understanding how effective after-school tutoring is at raising test scores." Use this simple and straightforward language to tell a story, not drag the user through a convoluted proof. Avoid words like "suppose" and "consider".
    • Abstract examples. Give something concrete in any place you're tempted to use a variable to represent a high-level concept. Explain the concrete reason you need to make the modeling choice based on the case study, then after you may explain the theory that corresponds to that choice
    • Cryptic variable names. Any time you use a one letter variable name, you make it harder for readers to follow the example. Why are average test scores using the variable $y$? Call it score!
    • Cryptic notation. Do the bars on top of the variable help understand what's going on? If we're not using individual test scores, then what does this add?
      Further, why are we sub-scripting with i? The explanation for the subscripting comes way too late.
    • Mixture of typography. I think that it's better to re-produce Figure 1 inside the notebook rather than using a mixture of visual styles and fonts.
  3. Use Python naming conventions for all variables. Scores, Tutoring, UnitConfounder should be scores, tutoring, and unit_confounder
  4. I'm not sure you need to explain the process of collapsing and augmenting. Isn't y0 able to abstract this away? Is there a logical reason the reader needs to know this happens in the background, given "here's the problem, here's how to model it, and here's the algorithm to apply to get an answer"? Maybe you can reuse some of the thoughts from above to frame this in a way that it's about the problem instead of about the math, but I think
  5. If you want to include math, you do have to address the difference between continuous integrals and the discrete estimands that come out of y0 functions. Further, try and match your hand-written notation to the output of y0 (e.g., use capital P for probability distributions, capital Q for Q-variables)
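
To make the gap in point 5 concrete, here is a generic adjustment formula written both ways. This is only an illustration: the adjustment variable Z is hypothetical and does not come from the notebook, and the exact way y0 renders its estimands is not reproduced here.

```latex
% Continuous form, as it is often written by hand
P(y \mid \mathrm{do}(a)) = \int P(y \mid a, z)\, P(z)\, dz

% Discrete form, matching an estimand expressed as a sum
P(y \mid \mathrm{do}(a)) = \sum_{z} P(y \mid a, z)\, P(z)
```

The two agree term by term once the integral over z is read as a sum over the values of a discrete Z, which is the correspondence a notebook would need to state if it mixes both styles.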

@adamrupe
Collaborator Author

@cthoyt thanks for the suggestions. Here are some thoughts, in no particular order:

  • We can add our own renderings instead of displaying Figure 1 from the paper. But there will still be a mix of typography because the hierarchical models require pygraphviz. It would be nice to have a pygraphviz backend for NxMixedGraph.draw(); in this case it would match styles, but pygraphviz generally has nice visuals.
  • On a related note, one reason for using abstract variable names like Y and A is that they render nicely with NxMixedGraph.draw(). More verbose names like scores etc. do not fit in the nodes when visualized.
  • Speaking of Variable names, I used capitalized names for y0 Variables, like Scores to follow the same convention used in other y0 notebooks, including the Surrogate Outcomes notebook you have suggested (e.g. Cancer, Smoking, Tar).
  • Something very important to note is that the ideas and algorithms of hierarchical causal modeling are still in development. As of yet, there is no general and sound algorithm to answer whether a given causal query is identifiable from a hierarchical causal model. Therefore, in order to use the graphical algorithms we have implemented for hierarchical causal models, users need to have some baseline understanding of the algorithms and the ideas behind them, like what a Q Variable is. I've tried to distill out this baseline understanding from the (quite long) paper into the notebook. We can try to cut back on the math a little bit, but not too much.
  • For example, you asked whether the overbars are necessary to convey averaged quantities. In the first Confounder example, the distinction between quantities averaged over students in a school vs quantities for each individual student in the school is hugely important. The causal query is not identifiable when using the average values (which is not a hierarchical problem), but does become identifiable when we have the individual student data (and it is now a hierarchical problem). It is important for a user to understand why this is the case if they want to use these algorithms on their own hierarchical causal problems.
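
The last point can be illustrated with a small simulation. This is a sketch under stated assumptions, not the y0 implementation and not code from the notebook: the data-generating process, the effect size, and all names are invented for the example, and it assumes the only confounder is a hidden school-level factor (no student-level confounding). Regressing school-average scores on school tutoring rates absorbs the hidden factor, while comparing tutored and untutored students within each school and then averaging over schools does not.

```python
import random
from statistics import mean

random.seed(0)
TRUE_EFFECT = 5.0  # assumed effect of tutoring on an individual score

# Simulate schools with a hidden school-level confounder (e.g. funding).
schools = []
for _ in range(300):
    confounder = random.gauss(0.0, 1.0)
    # Better-resourced schools tutor a larger share of their students.
    p_tutoring = 0.8 if confounder > 0 else 0.2
    students = []
    for _ in range(200):
        tutored = int(random.random() < p_tutoring)
        score = 70 + 8 * confounder + TRUE_EFFECT * tutored + random.gauss(0, 3)
        students.append((tutored, score))
    schools.append(students)


def slope(xs, ys):
    """Ordinary least-squares slope of ys on xs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    return cov / var


# Aggregated view: one tutoring rate and one average score per school.
school_rates = [mean([t for t, _ in s]) for s in schools]
school_scores = [mean([y for _, y in s]) for s in schools]
aggregate_estimate = slope(school_rates, school_scores)

# Hierarchical view: compare students within each school, then average over schools.
within = []
for s in schools:
    treated = [y for t, y in s if t == 1]
    untreated = [y for t, y in s if t == 0]
    if treated and untreated:  # skip schools lacking one of the groups
        within.append(mean(treated) - mean(untreated))
hierarchical_estimate = mean(within)

print(f"aggregate: {aggregate_estimate:.2f}, within-school: {hierarchical_estimate:.2f}")
```

In this setup the within-school estimate lands close to `TRUE_EFFECT`, while the aggregated slope is several times larger because it mixes the tutoring effect with the hidden school-level factor.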

@cthoyt
Member

cthoyt commented Apr 19, 2025

fyi @adamrupe @djinnome I'll be starting my new job at RWTH Aachen probably the third week of June, but as soon as I sign the contract (sometime between now and then), I plan to submit the y0 paper to JOSS. If we want to credit the implementation of HCMs in that paper, I would like this to be finished. Please let me know if it's not clear what the expectations are for finishing this / merging this PR.

@djinnome
Contributor

Congrats on the new position! June is a good deadline for us. @adamrupe is not at PNNL anymore, but I have some time in May to finish off the remaining items

…les to english names that capture the unit-level (school) and subunit (student) hierarchical distributions
@djinnome
Contributor

Referring back to #236 (comment)

I think we should help set reasonable expectations for the user.
The hierarchical causal model paper does not describe a sound and complete algorithm.
Certain decisions are still left to the user in terms of how to augment a hierarchical causal model so that it can be identified.
The confounder, instrumental variables, and interference scenarios all demonstrate how to identify causal queries over hierarchical models that cannot be identified in a non-hierarchical model.
With that said, we are very much open to adjusting the notation and examples so that they are clearer to the user, even if the mechanics of the algorithms can't be completely ignored by the user.

@cthoyt cthoyt enabled auto-merge (squash) June 20, 2025 14:46
Development

Successfully merging this pull request may close these issues.

Hierarchical Causal Models Implement hierarchical causal models from figure 2 in pygraphviz
3 participants