
Merge pull request #97 from m-clark/dev
grammar and misc
m-clark authored Sep 2, 2024
2 parents b64aea9 + 3dc727a commit ab9db73
Showing 26 changed files with 4,350 additions and 4,335 deletions.
2 changes: 2 additions & 0 deletions abstracts.qmd
@@ -1,3 +1,5 @@
+<!-- TODO: Write abstracts for each of the chapters in the book. -->

# Abstracts

For each subheading below, I'd like to write a short abstract of the topic, and then link to the relevant notebooks.
23 changes: 8 additions & 15 deletions causal.qmd
@@ -8,11 +8,6 @@
:::


-TODO: Reviewer
-- provide a code demo of confounding
-- related: let's move explanation vs. prediction to this chapter.


Causal inference is a very important topic in machine learning and statistics, and it is also a very difficult one to understand well, or consistently, because *not everyone agrees on how to define a cause in the first place*. Our focus here is merely practical: we just want to show some of the common model approaches used when attempting to answer causal questions. Causal modeling in general is such a rabbit hole that we won't be able to go into much detail, but we will try to give you a sense of the landscape, and some of the key ideas.


@@ -36,9 +31,9 @@ Often we need a precise statement about the feature-target relationship, not jus
This section is pretty high level, and we are not going to go into much detail here, so even just some understanding of correlation and modeling would likely be enough.


-:::{.content-visible when-format='html'}
```{r}
#| echo: false
+#| eval: false
#| label: fig-causal-dag
#| fig-cap: A Causal DAG
library(ggdag)
@@ -72,11 +67,9 @@ tidy_ggdag |>

ggsave('img/causal-dag.svg', width = 8, height = 6)
```
-:::

-:::{.content-visible when-format='pdf'}
-![A Causal DAG](img/causal-dag.svg){width=50% #fig-causal-dag}
-:::

+![A Causal DAG](img/causal-dag.svg){width=75% #fig-causal-dag}

## Classic Experimental Design {#sec-causal-classic}

@@ -505,7 +498,7 @@ There are more widely used tools for uplift modeling and meta-learners in Python


- **S-learner** - **s**ingle model for both groups; predict the (counterfactual) difference as when all observations are treated vs when all are not, similar to our previous code demo.
-- **T-learner** - **t**wo models, one for each of the control and treatment groups; predict the values as if all observations are treated vs when all are control using both models, and take the difference.
+- **T-learner** - **t**wo models, one for each of the control and treatment groups; predict the values as if all observations are 'treated' versus when all are 'control' using both models, and take the difference.
- **X-learner** - a more complicated modification to the T-learner also using a multi-step approach.
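To make the S- vs. T-learner distinction concrete, here is a minimal R sketch on simulated data. All names and the simple linear models are our own illustration of the logic, not any particular package's API:

```r
# Minimal sketch of the S- vs. T-learner logic on simulated data
set.seed(123)

n   = 1000
x   = rnorm(n)
trt = rbinom(n, 1, 0.5)                  # treatment indicator
y   = 0.5 * x + 0.25 * trt + rnorm(n)    # outcome with a true uplift of .25

d = data.frame(y, x, trt)

# S-learner: a single model with treatment as just another feature;
# the estimate is the difference between 'all treated' and 'all control' predictions
s_mod = lm(y ~ x + trt, data = d)
tau_s = predict(s_mod, newdata = transform(d, trt = 1)) -
  predict(s_mod, newdata = transform(d, trt = 0))

# T-learner: two models, one per group; difference the predictions
t_mod_1 = lm(y ~ x, data = subset(d, trt == 1))
t_mod_0 = lm(y ~ x, data = subset(d, trt == 0))
tau_t   = predict(t_mod_1, newdata = d) - predict(t_mod_0, newdata = d)

mean(tau_s)   # both should hover around .25 here
mean(tau_t)
```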


@@ -643,9 +636,9 @@ ggsave('img/causal-prediction-vs-explanation-demo-plot.svg', width = 8, height =

But if we are interested in predictive performance, we would be disappointed with this model. It predicts the target at about the same rate as guessing, even on the data it's fit on, and does even worse with new data. Even the effect as shown is quite small by typical standards, as it would take a standard deviation change in the feature to get a ~`r prob_diff` change in the probability of the target (x is standardized).

-If we are concerned solely with explanation, we now would want to ask ourselves first if we can trust our result based on the data, model, and various issues that went into producing it. If so, we can then if the effect is large enough to be of interest, and if the result is useful in making decisions[^seedsofchange]. It may very well be, maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such interpretation, because it is extremely costly to obtain more. For more exploratory efforts however, this sort of result would likely not be enough to come to any strong conclusion even if explanation is the only goal.
+If we are concerned solely with explanation, we now would want to ask ourselves first if we can trust our result based on the data, model, and various issues that went into producing it. If so, we can then see if the effect is large enough to be of interest, and if the result is useful in making decisions[^seedsofchange]. It may very well be, maybe the target concerns the rate of survival, where any increase is worthwhile. Or perhaps the data circumstances demand such interpretation, because it is extremely costly to obtain more. For more exploratory efforts however, this sort of result would likely not be enough to come to any strong conclusion even if explanation is the only goal.

-[^seedsofchange]: This is a contrived example, but it is definitely something what you might see in the wild. The relationship is weak, and though statistically significant, the model can't predict the target well at all. The **statistical power** is actually decent in this case, roughly `r causal_power`, but this is mainly because the sample size is so large and it is a very simple model setting. <br> This is a common issue in many academic fields, and it's why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.
+[^seedsofchange]: This is a contrived example, but it is definitely something that you might see in the wild. The relationship is weak, and though statistically significant, the model can't predict the target well at all. The **statistical power** is actually decent in this case, roughly `r causal_power`, but this is mainly because the sample size is so large and it is a very simple model setting. <br> This is a common issue in many academic fields, and it's why we always need to be careful about how we interpret our models. In practice, we would generally need to consider other factors, such as the cost of a false positive or false negative, or the cost of the data and running the model itself, to determine if the model is worth using.

As another example, consider the world happiness data we've used in previous demonstrations. We want to explain the association of country level characteristics and the population's happiness. We likely aren't going to be as interested in predicting next year's happiness score, but rather what attributes are correlated with a happy populace in general. In this election year (2024) in the U.S., we'd be interested in specific factors related to presidential elections, of which there are relatively few data points. In these cases, explanation is the focus, and we may not even need a model at all to come to our conclusions.

Expand All @@ -660,7 +653,7 @@ Here are some ways we might think about different modeling contexts:
- **Causal Modeling**: Using models to understand causal effects. We focus on explanation, and on prediction for the current data. We may very well be interested in predictive performance as well, and often are in industry.
- **Generalization**: When our goal is generalizing to unseen data, the focus is always on predictive performance. This does not mean we can't use the model to understand the data though, and explanation could possibly be as important.

-Depending on the context, we may be more interested explanation or predictive performance, but in practice we often, and usually, want both. It is crucial to remind yourself why you are interested in the problem, what a model is capable telling you about it, and to be clear about what you want to get out of the result.
+Depending on the context, we may be more interested in explanation or predictive performance, but in practice we often, and usually, want both. It is crucial to remind yourself why you are interested in the problem, what a model is capable of telling you about it, and to be clear about what you want to get out of the result.



@@ -684,7 +677,7 @@ Engaging in causal modeling may not even require you to learn any new models, bu

### Choose your own adventure {#causal-adventure}

-From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we've mentioned here, and see how they are used in practice via the additional resources below.
+From here you might revisit some of the previous models and think about how you might use them to answer a causal question. You might also look into some of the other models we've mentioned here, and see how they are used in practice via the additional resources.


### Additional resources {#causal-resources}
2 changes: 1 addition & 1 deletion danger_zone.qmd
@@ -504,7 +504,7 @@ In terms of features, extreme values can cause strange effects, but often they r

### Big data isn't always as big as you think {#sec-danger-bigdata}

-Consider a model setting with 100,000 samples. Is this large? Let's say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where outcome label you're interested in occurs. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you'd be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don't have enough data to make a reliable estimate of the interaction effect.
+Consider a model setting with 100,000 samples. Is this large? Let's say you have a rare outcome that occurs 1% of the time. This means you have 1000 samples where the outcome label you're interested in is present. Now consider a categorical feature (A) that has four categories, and one of those categories is relatively small, say 5% of the data, or 5000 cases, and you want to interact it with another categorical feature (B), one whose categories are all equally distributed. Assuming no particular correlation between the two, you'd be down to ~1% of the data for the least category of A across the levels of B. Now if there is an actual interaction effect on the target, some of those interaction cells may have only a dozen or so positive target values. Odds are pretty good that you don't have enough data to make a reliable estimate of the interaction effect.
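A quick back-of-the-envelope version of that arithmetic (a hedged illustration, assuming A, B, and the outcome are independent):

```r
# Expected cell sizes under the scenario described (independence assumed)
n         = 100000
p_outcome = 0.01    # rare outcome
p_a_small = 0.05    # smallest category of feature A
p_b_each  = 0.25    # feature B has four equal categories

cell_n = n * p_a_small * p_b_each   # ~1250 observations in the smallest A x B cell
cell_n * p_outcome                  # ~12.5 expected positive outcomes in that cell
```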

Oh wait, did you want to use cross-validation also? A simple random sample approach might result in some validation sets with no positive values at all! Don't forget that you may have already split your 100,000 samples into training and test sets, so you have even less data to start with! The following table shows the final cell count for a dataset with these properties.

8 changes: 4 additions & 4 deletions data.qmd
@@ -28,7 +28,7 @@ Knowing your data is one of the most important aspects of any application of *da

### Helpful context {#sec-data-good-to-know}

-We're talking very generally about data here, so not much background is needed. The models mentioned are covered in other chapters, or build upon those, but we're not doing any actual modeling here.
+We're talking very generally about data here, so not much background is needed. The models mentioned here are covered in other chapters, or build upon those, but we're not doing any actual modeling here.

## Feature & Target Transformations {#sec-data-transformations}

@@ -155,7 +155,7 @@ minmax_scaled_data = apply(data, 2, function(x) {
```
:::
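The function body is elided by the diff view; for reference, a standard min-max scaler along the lines of what the chunk presumably applies:

```r
# Presumed completion of the min-max scaling shown above (standard formula);
# 'data' is whatever numeric matrix or data frame the chapter demo uses
minmax_scaled_data = apply(data, 2, function(x) {
  (x - min(x)) / (max(x) - min(x))
})
```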

-Using a **log** transformation for numeric targets and features is straightforward, and [comes with several benefits](https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution). For example, it can help with **heteroscedasticity**, which is when the variance of the target is not constant across the range of the predictions[^notnormal] (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations more less straightforward. Also if you have a lot of zeros, 'log plus one' transformations are not going to be enough to help you overcome that hurdle[^logp1]. Logging also won't help much when the variables in question have few distinct values, like ordinal variables, which we'll discuss later in @sec-data-ordinal.
+Using a **log** transformation for numeric targets and features is straightforward, and [comes with several benefits](https://stats.stackexchange.com/questions/107610/what-is-the-reason-the-log-transformation-is-used-with-right-skewed-distribution). For example, it can help with **heteroscedasticity**, which is when the variance of the target is not constant across the range of the predictions[^notnormal] (demonstrated below). It can also help to keep predictions positive after transformation, allows for interpretability gains, and more. One issue with logging is that it is not a linear transformation, which can help capture nonlinear feature-target relationships, but can also make some post-modeling transformations less straightforward. Also if you have a lot of zeros, 'log plus one' transformations are not going to be enough to help you overcome that hurdle[^logp1]. Logging also won't help much when the variables in question have few distinct values, like ordinal variables, which we'll discuss later in @sec-data-ordinal.

[^logp1]: That doesn't mean you won't see many people try (and fail).

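As a quick self-contained illustration of the heteroscedasticity point (simulated data, not the chapter's own demonstration):

```r
# Multiplicative error looks heteroscedastic on the raw scale,
# but roughly constant after a log transform
set.seed(42)

x = runif(1000, 1, 10)
y = exp(0.5 * x + rnorm(1000, sd = 0.5))   # strictly positive, skewed target

plot(fitted(lm(y ~ x)), resid(lm(y ~ x)))             # fan shape: variance grows
plot(fitted(lm(log(y) ~ x)), resid(lm(log(y) ~ x)))   # roughly even band
```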
@@ -399,7 +399,7 @@ Ordinality of a categorical outcome is largely ignored in machine learning appli

#### Rank data {#sec-data-rank}

-Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allowed for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called [learning to rank methods](https://en.wikipedia.org/wiki/Learning_to_rank), like the [RankNet and LambdaRank algorithms](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf), and other variants for [deep learning models](https://github.com/tensorflow/ranking).
+Though ranks are ordered, with rank data we are referring to cases where the observations are uniquely ordered. An ordinal vector of 1-6 with numeric labels could be something like [2, 1, 1, 3, 4, 2], whereas rank data would be [2, 1, 3, 4, 5, 6], each being unique (unless you allow for ties). For example, in sports, a ranking problem would regard predicting the actual finish of the runners. Assuming you have a modeling tool that actually handles this situation, the objective will be different from other scenarios. Statistical modeling methods include using the Plackett-Luce distribution (or the simpler variant Bradley-Terry model). In machine learning, you might use so-called [learning to rank methods](https://en.wikipedia.org/wiki/Learning_to_rank), like the [RankNet and LambdaRank algorithms](https://icml.cc/Conferences/2015/wp-content/uploads/2015/06/icml_ranking.pdf), and other variants for [deep learning models](https://github.com/tensorflow/ranking).
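To give a sense of what fitting rank data looks like, here is a minimal sketch with the PlackettLuce R package (assumed installed; the toy rankings matrix is our own):

```r
# Each row is one 'race': entries give the finishing position of items A-D
library(PlackettLuce)

R = matrix(
  c(1, 2, 3, 4,
    2, 1, 4, 3,
    1, 3, 2, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, c('A', 'B', 'C', 'D'))
)

mod = PlackettLuce(as.rankings(R))
coef(mod)   # estimated 'worth' of each item on the log scale
```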



@@ -1153,7 +1153,7 @@ It's easy to see from such a list that latent variables are very common in model

In the tabular domain, data augmentation is less common, but still possible. You'll see it most commonly applied with class-imbalance settings (@sec-data-class-imbalance), where you might create new data points for the minority class to balance the dataset. This can be done by randomly sampling from the existing data points, or by creating new data points based on the existing data points. For the latter, SMOTE and many variants of it are quite common.

-Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process[^vscv]. Downsampling the majority class can potentially throw away usefu information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn't generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.
+Unfortunately for tabular data, these techniques are not nearly as successful as augmentation for computer vision or natural language processing, nor consistently so. Part of the issue is that tabular data is very noisy and fraught with measurement error, so in a sense, such techniques are just adding noise to the modeling process[^vscv]. Downsampling the majority class can potentially throw away useful information. Simple random upsampling of the minority class can potentially lead to an overconfident model that still doesn't generalize well. In the end, the best approach is to get more and/or better data, but hopefully more successful methods will be developed in the future.

[^vscv]: Compare to image settings, where there is relatively little measurement error: by just rotating an image, you are still preserving the underlying structure of the data.
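A hedged base-R sketch of the simple random upsampling mentioned above (hypothetical data frame `d` with a binary `target`; SMOTE-style interpolation is available in packages like smotefamily):

```r
# Randomly duplicate minority-class rows until the classes are balanced
minority_idx  = which(d$target == 1)
majority_idx  = which(d$target == 0)

upsampled_idx = sample(minority_idx, size = length(majority_idx), replace = TRUE)

d_balanced = rbind(d[majority_idx, ], d[upsampled_idx, ])
table(d_balanced$target)   # classes now equal in size
```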

