GH-6698: Shapley values support for ensemble models #15734
Conversation
(branch updated from d50a58b to 6ebfc91)
h2o-algos/src/main/java/hex/deeplearning/DeepLearningModel.java
@@ -1416,7 +1416,7 @@ handle_ice <- function(model, newdata, column, target, centered, show_logodds, s
                         data = results)
   column_value <- as.data.frame(newdata[[column]])
   histogram <- stat_count_or_bin(!.is_continuous(newdata[[column]]),
-                                 ggplot2::aes(x = .data[[col_name]], y = (.data$..count.. / max(.data$..count..)) * diff(y_range) / 1.61),
+                                 ggplot2::aes(x = .data[[col_name]], y = (ggplot2::after_stat(count) / max(ggplot2::after_stat(count))) * diff(y_range) / 1.61),
Deprecated (#15651)
This change requires ggplot2 3.3.0 and above, which is already required for PDP and ICE plots, so there is no need to change the required version of ggplot2. (All other plots should work with v3.0.0.)
(branch updated from 7063783 to 6516897)
dens = _density(col)
color = (
    _uniformize(frame, col_name)[permutation]
    if colorize_factors or not frame.isfactor(col_name)
    else np.full(frame.nrow, 0.5)
)

if not np.any(np.isfinite(color)) or np.nanmin(color) == np.nanmax(color):
    plt.scatter(0, i, alpha=alpha, c="grey")
    continue  # constant variable; plotting it can throw StopIteration in some versions of matplotlib

plt.scatter(
    col,
    i + dens * np.random.uniform(-jitter, jitter, size=len(col)),
    alpha=alpha,
    c=color,
    cmap=colormap
)
fixes #15743
The problem is a bug in matplotlib 3.7.1; it is fixed in later versions of matplotlib (3.7.2+).
@tomasfryda, you did an incredible amount of work here in this PR. I checked the DeepLearning part. I tried to check your implementation against the paper, and I did not notice anything suspicious. I left only a few suggestions to improve the code. Please make sure you cover all DeepLearning cases in the tests.
Good job! Thanks.
h2o-algos/src/main/java/hex/ContributionsWithBackgroundFrameTask.java
h2o-algos/src/main/java/hex/deeplearning/DeepLearningModel.java
h2o-algos/src/main/java/hex/deeplearning/DeepLearningModel.java
Arrays.fill(result, 0);
double biasTerm = (_dinfo._intercept ? _betas[_betas.length - 1] : 0);
for (int i = 0; i < _betas.length - (_dinfo._intercept ? 1 : 0); i++) {
  idx = _compact ? _dinfo.coefOriginalColumnIndices()[i] : i;
I was looking at one way to calculate Shapley for linear models from here: https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/linear_models/Sentiment%20Analysis%20with%20Logistic%20Regression.html
I assume you want to set the SHAP value for feature i as phi_i = beta_i * (x_i - mean(x_i)).
You don't need to loop through the whole frame to get mean(x_i). Just call mean() on each vector. However, we do need to do something special for categorical columns.
I need the contribution with regard to the background dataset:
phi_i = beta_i * (x_i - bg_i)
where bg is a point from the background dataset. The background dataset is required to have "common ground" across the algos so it can be combined in the Stacked Ensemble. And even for the tree algos, it is in general recommended to use SHAP calculated on top of a background set rather than on the tree structure (for example, in Machine Learning for High-Risk Applications by Patrick Hall, James Curtis, and Parul Pandey).
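As a quick numeric sanity check (the coefficients and points below are made up for illustration, not taken from the PR), the baseline SHAP of a linear model against a single reference is:

```python
import numpy as np

# Illustrative coefficients and points (hypothetical, not from the PR)
beta = np.array([0.5, -1.2, 2.0])
intercept = 0.3
x = np.array([1.0, 2.0, 3.0])    # point being explained
bg = np.array([0.0, 1.0, -1.0])  # one reference point from the background frame

def predict(p):
    return beta @ p + intercept

# Baseline Shapley values of a linear model w.r.t. a single reference:
# phi_i = beta_i * (x_i - bg_i)
phi = beta * (x - bg)

# Local accuracy: the contributions sum to f(x) - f(bg)
assert np.isclose(phi.sum(), predict(x) - predict(bg))
```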
For _outputSpace is false: I see, so you are saying that for each row j of x, you need beta_i * (x_j[i] - bg_m[i]) for m = 1 to the size of bg. If you write this out, you will see this:
size_bg * b_i * x_j[i] - sum from m=1 to size_bg of b_i * bg_m[i]
If we take out the summation (and divide by size_bg), we will have
b_i * x_j[i] - b_i * mean(bg[i])
It looks like for each row of x, you only need b_i * mean(bg[i]). Perhaps you can calculate b_i * mean(bg[i]), or mean(bg[i]) as an array, once at the beginning. I noticed that you also need beta * bg_n + beta_0 for each row; you could generate a new frame that contains beta * bg_n + beta_0 for each row n.
Then you can go through your data x and calculate b_i * x_j[i] - b_i * mean(bg[i]).
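The averaging argument above can be checked numerically for a linear model (a hypothetical numpy sketch with random data): by linearity, averaging the per-reference contributions over the background frame equals a single evaluation against the background mean.

```python
import numpy as np

rng = np.random.default_rng(0)
beta = np.array([0.5, -1.2, 2.0])   # illustrative coefficients
x = np.array([1.0, 2.0, 3.0])       # row j being explained
bg = rng.normal(size=(100, 3))      # background frame (random, illustrative)

# Marginal SHAP: average of the per-reference baseline SHAP values
phi_avg = np.mean(beta * (x - bg), axis=0)

# A single evaluation against the mean background row gives the same result
phi_mean_bg = beta * (x - bg.mean(axis=0))
assert np.allclose(phi_avg, phi_mean_bg)
```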
I think you mean output_per_reference=false. The output_space is used for switching between link space and "output" space (in GLM, between contributions for …).
Regarding your suggestion, you are right. I think this is low priority, as GLM is interpretable by itself, and for purely numeric datasets the user can provide a background dataset consisting of the means. (The same applies to DeepSHAP, but the resulting SHAP values would be biased and would correspond to the original DeepSHAP, not the default DeepSHAP in the shap package.)
I will try to add this (for GLM) if I have some spare time.
long _endRow;
Job _job;

public ContributionsWithBackgroundFrameTask(Key<Frame> frKey, Key<Frame> backgroundFrameKey, boolean perReference) {
I don't understand what this class is for.
I need to be able to calculate baseline SHAP for each point from frame with each point from background_frame as the reference.
In pseudo-code, this class is approximately responsible for:
result = Frame()
for x in frame:
    for bg in background_frame:
        result.append(baselineSHAP(x, bg))
This yields the contributions for all the points against all the references. It's not very useful by itself, but it's required for calculating the SHAP for Stacked Ensembles. Another use is, for example, explaining an ensemble of models where the base models are proprietary and can't be shared: from one company you get a fraud score, from another a credit score, and you input these values into your model; those companies can share the baseline SHAP values with you, and you can calculate contributions for the input features without knowing the fraud/credit models. (This example is taken from https://www.nature.com/articles/s41467-022-31384-3)
These baseline SHAP values are often designated as …
Note: the commonly used SHAP values are …
@Override
protected void postGlobal() {
  NewChunk[] ncs = Stream.of(appendables()).map(vec -> vec.chunkForChunkIdx(0)).toArray(NewChunk[]::new);
  for (int i = 0; i < _partialSums.length; i++) {
Can't you just call vec.mean()? I know we may run into problems with categorical columns, though.
No. I need something like frame.groupby("RowIdx").mean(). The number of rows is nrow(frame) * nrow(background_frame), and the result should have nrow(frame) rows.
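The shape bookkeeping can be sketched as follows (a hypothetical numpy stand-in for the per-row group-by mean; the sizes are made up):

```python
import numpy as np

n_rows, n_bg, n_feat = 4, 3, 2
# Per-reference contributions: one block of n_bg rows for each explained row,
# i.e. nrow(frame) * nrow(background_frame) rows in total
per_ref = np.arange(n_rows * n_bg * n_feat, dtype=float).reshape(n_rows * n_bg, n_feat)

# The groupby("RowIdx").mean() step: average each block back down to one row
marginal = per_ref.reshape(n_rows, n_bg, n_feat).mean(axis=1)
assert marginal.shape == (n_rows, n_feat)  # nrow(frame) rows again
```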
It is so much work! This is not the first time you've taken on a big issue.
Just my suggestions for speedup. Also, can you compare the results of your implementation with SHAP values from sklearn or another toolbox?
Arrays.fill(result, 0);
double biasTerm = (_dinfo._intercept ? _betas[_betas.length - 1] : 0);
for (int i = 0; i < _betas.length - (_dinfo._intercept ? 1 : 0); i++) {
  idx = _compact ? _dinfo.coefOriginalColumnIndices()[i] : i;
for (int i = 0; i < _betas.length - (_dinfo._intercept ? 1 : 0); i++) {
  idx = _compact ? _dinfo.coefOriginalColumnIndices()[i] : i;
  result[idx] += _betas[i] * (row.get(i) - bgRow.get(i));
  biasTerm += _betas[i] * bgRow.get(i);
The biasTerm is independent of the dataset. You could form it once and then reuse it.
I think I will postpone the improvements you suggested to some future PR, as they would mainly speed up the use case without StackedEnsemble, and this PR is mainly for StackedEnsemble.
In the use case with SE, there are as many bias terms as there are rows in the background dataset. I could speed it up a bit, but it would require me to reorder the rows in all the other algos as well so that I don't need to sort the contribution frames in the SE (I check for the same order but don't sort).
I had already implemented some version of the marginal SHAP for GLM, but I removed it so as not to add unnecessary complexity (it doesn't really bring much more info than the coefficients in the GLM case).
26045d4#diff-b5a14183b7510c835fd11228e8c99c5e00ad87592933bf2569b3934a7879218fR188-R213
@wendycwong I added a comparison for DeepLearning (DeepSHAP). I don't think we can convert our models to sklearn (or another toolbox), and since SHAP is model dependent, I don't see a way to do it; for most cases I think it's not necessary. Why shouldn't it be necessary for most models? The tree-based models and GLM can now provide exact SHAP values, i.e., the same values you'd get if you calculated the SHAP naively (with exponential complexity), and this is what I do for DeepLearning and StackedEnsembles.
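For reference, the naive exponential-complexity baseline Shapley computation mentioned above can be sketched like this (an illustrative implementation, not H2O's; the linear model and points are made up, and for a linear model the result matches the closed form beta * (x - r)):

```python
import itertools
import math
import numpy as np

def naive_baseline_shap(f, x, r):
    """Exact baseline Shapley values by enumerating every feature subset
    (exponential in the number of features)."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for k in range(n):
            for s in itertools.combinations(others, k):
                # Shapley kernel weight |S|! (n - |S| - 1)! / n!
                w = math.factorial(len(s)) * math.factorial(n - len(s) - 1) / math.factorial(n)
                with_i = r.copy(); with_i[list(s) + [i]] = x[list(s) + [i]]
                without_i = r.copy(); without_i[list(s)] = x[list(s)]
                phi[i] += w * (f(with_i) - f(without_i))
    return phi

# Made-up linear model: its exact baseline SHAP is beta * (x - r)
beta = np.array([0.5, -1.2, 2.0])
f = lambda p: beta @ p
x = np.array([1.0, 2.0, 3.0])
r = np.array([0.0, 1.0, -1.0])
assert np.allclose(naive_baseline_shap(f, x, r), beta * (x - r))
```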
@@ -2747,6 +2750,166 @@ Retrieving graphs via R is not yet supported. An `.ipynb demo showing this examp
    shap.summary_plot(shap_values, X, plot_type="bar")

Interventional / Marginal SHAP
@hannah-tillman would you be so kind as to have a quick look at the documentation, just in case something is not clear and needs rephrasing, and/or there are typos etc.? Thanks in advance!
General updates & added references for maneuverability - LGTM :)
…implementing" a missing method
Great change, respect!!!
GitHub issue: #6698

TODO
[.] Validations in Java (make sure algos use compatible settings etc.) (mostly done)
[?] use top_n=0, bottom_n=0, compare_abs=FALSE in the "new" SHAP
[?] mojo support
[x] Update documentation
[x] Tests in Java
[x] Update docstrings in Python
[x] Update docstrings in R
[x] Validations in Python
[x] Implementation in R
[x] Validations in R
[x] Tests in R
[x] Update Explain plots in R and Python
[x] Update Fairness plots in R and Python
[x] Update explain and inspect_model_fairness in both R and Python

How does it work?
In order to be able to use G-DeepSHAP [0] for Stacked Ensembles, this PR introduces empirical marginal Shapley values, which are calculated by averaging baseline Shapley values over some reference/background dataset [1]. These are sometimes also called interventional SHAP values, but that term can cause confusion, so marginal SHAP is preferred (as there are also methods that incorporate causal knowledge into the Shapley value calculation).
In order to calculate SHAP efficiently, we need to exploit the inner workings of the individual algos (otherwise we'd end up with exponential complexity). For some algos we can calculate exact SHAP values efficiently (tree algos, GLM), and for some other algos we can use a white-box approximation (Deep Learning, SE).
SHAP estimation for the tree algos uses the information from the tree structure - it's easy to see which features influence the prediction for a given (sub)tree [2, 3].
SHAP estimation for GLM is easy, as it is just the beta coefficient multiplied by the difference between the given point and some reference (one reference point or the expected value) [2].
For DeepLearning I use the DeepSHAP algorithm with multiple reference points [2, 4]. DeepSHAP is based on DeepLIFT [4] but uses (approximate) SHAP values instead of $C_{\Delta x^+ \Delta t^+}$ and $C_{\Delta x^- \Delta t^-}$ (the RevealCancel rule). DeepSHAP works by calculating SHAP values and linearly rescaling the contributions. This happens both between layers and between the linear combination and the activation function. This way we can easily calculate approximate contributions even for non-linear activation functions.
For example:
f(x1, x2, x3) = tanh(w1*x1 + w2*x2 + w3*x3 + b) = tanh(g(x))
Since tanh is a one-dimensional function, we know that its contribution is $\phi_{tanh} = \tanh(x) - \tanh(r)$, and the contribution of $g$ can be calculated the same way as for GLM.
Then we use the same rescale rule as between the NN layers.
In our case there is one exception, and that's the MaxOut activation function, which has 2 inputs/channels, but it can easily be calculated exactly.
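The rescale rule for a one-dimensional activation can be sketched as follows (illustrative weights and points, not the H2O implementation): the exact linear contributions are scaled by the ratio of the output difference to the pre-activation difference, which preserves local accuracy.

```python
import numpy as np

beta = np.array([0.5, -1.2, 2.0])  # made-up weights w1, w2, w3
b = 0.1                            # made-up bias
x = np.array([1.0, 2.0, 3.0])      # point being explained
r = np.array([0.0, 1.0, -1.0])     # reference point

g = lambda p: beta @ p + b  # linear pre-activation g(x)
f = lambda p: np.tanh(g(p)) # one-dimensional nonlinearity on top

phi_g = beta * (x - r)  # exact contributions of the linear part

# Rescale rule: distribute tanh(g(x)) - tanh(g(r)) over the inputs
# in proportion to their linear contributions
phi_f = phi_g * (f(x) - f(r)) / (g(x) - g(r))

assert np.isclose(phi_f.sum(), f(x) - f(r))  # local accuracy is preserved
```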
For Stacked Ensembles I use G-DeepSHAP [0], which is closely related to DeepSHAP but uses a generalized rescale rule. This requires having the contributions calculated for each reference separately.
Terminology
Both exact values and approximations should comply with the following properties.
Let x be the data point we want to explain and r a reference data point.
Local accuracy/efficiency/summation-to-delta:
$f(x) - f(r) = \sum_i \phi_i(x, r)$
Dummy property/missingness - if a feature has the same value as in the reference, it doesn't contribute to the difference from the reference:
$x_i = r_i \Rightarrow \phi_i(x, r) = 0$
Consistency - if the model changes so that some simplified input's contribution increases, then that input's attribution ($\phi_i$) also increases or stays the same.
"Symmetry" (that's what I call it in the tests, but in some papers symmetry corresponds to something else): $\phi_i(f, x, b) = -\phi_i(f, b, x)$. In words: if $\phi_i(f, x, b)$ is the contribution of feature i that moves the prediction from f(b) to f(x), and $\phi_i(f, b, x)$ is the contribution of feature i that moves the prediction from f(x) to f(b), then the two contributions should be equal in magnitude but opposite in sign.
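These properties can be checked directly on the linear-model baseline SHAP (a toy sketch with made-up coefficients and points):

```python
import numpy as np

beta = np.array([2.0, -1.0])  # illustrative coefficients
f = lambda p: beta @ p

def phi(x, r):
    # Baseline SHAP of a linear model w.r.t. reference r
    return beta * (x - r)

x = np.array([1.0, 3.0])
r = np.array([1.0, 0.0])  # first feature equals the reference value

# Local accuracy: contributions sum to f(x) - f(r)
assert np.isclose(phi(x, r).sum(), f(x) - f(r))
# Dummy property: x_0 == r_0 implies phi_0 == 0
assert phi(x, r)[0] == 0.0
# "Symmetry": phi(x, r) == -phi(r, x)
assert np.allclose(phi(x, r), -phi(r, x))
```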
Issues
Output_format == "Original"
This is a parameter that corresponds to different things in different algos with the previously implemented TreeSHAP, and I think it might end up not returning a SHAP value.
Differences
In XGBoost, this parameter returns contributions for one-hot encoded categorical variables, while in DRF and GBM it returns contributions for unencoded features. To get unencoded feature contributions from XGBoost, we need to set output_format="compact". output_format="compact" is not supported for GBM and DRF.
Not returning a SHAP value
This is mainly about the newly implemented SHAP, but I think it applies to the previously implemented TreeSHAP as well.
By definition, SHAP corresponds to an average of the function applied to every subset of the powerset of simplified features, but when we encode a categorical variable we don't have data points that would have this categorical variable with multiple levels (e.g., Animal={bird, cat, dog}: animal.dog=1 implies animal.cat=0 and animal.bird=0, but to calculate SHAP for this encoding we would also have to consider animal.dog=1, animal.cat=1, animal.bird=0; animal.dog=1, animal.cat=1, animal.bird=1; animal.dog=1, animal.cat=0, animal.bird=1; ...).
Since it's not a SHAP value, it can break the "symmetry" property, and it does, especially with NAs, which are also encoded differently in different algos. This can be demonstrated with the following example:
x={animal=dog}; b={animal=bird}
Is the difference $f(x) - f(b)$ due to the change animal.bird [1 -> 0] or animal.dog [0 -> 1]? This depends on the implementation of the algos. Naively, I'd expect we should get the contribution for the feature.level that is present in x, but let's consider a GLM with beta coefficients $animal.dog=0$, $animal.cat=0$ and $animal.bird \neq 0$: clearly, setting animal to dog can't have any contribution, as it has no effect (and it would not be intuitive to have contributions that do not agree with varimp), but unsetting animal from bird does.
For the previously implemented SHAP for XGBoost I created this issue: #15644
How to resolve this?
I'd make output_format="compact" the default, rename "original" to "compact" in GBM/DRF (since "compact" is not supported there right now), and deprecate output_format="original".

Binomial DRF TreeSHAP
In the previously implemented SHAP, we do some weird recalculation because we calculate contributions for y=0, not y=1. I think the calculation there is incorrect, and I have a different way of recalculating the contributions. I described it in this issue: #15657. So this is probably not an issue for this PR, but just in case I'm wrong, I think it's good to mention this possible issue here.
DRF SHAP calculation might be wrong
I noticed that DRF with the previously implemented SHAP is sometimes wrong. This is worse with XRT. I think the problem is with weights, since when I enable checkTreeConsistency I get a "Tree inconsistency found" error logged quite often. I mentioned this on #7385. I wonder if we should warn users about this potential error and tell them to use the marginal SHAP until it gets fixed.
Limitations
(… groupby(row).mean()) instead of "streaming" (i.e., calculating one row against all the baselines on all base models, passing it through the SE, and then averaging).

References
[0] CHEN, Hugh, LUNDBERG, Scott M. and LEE, Su-In, 2022. Explaining a series of models by propagating Shapley values. Nature Communications. https://www.nature.com/articles/s41467-022-31384-3
[1] CHEN, Hugh, COVERT, Ian C., LUNDBERG, Scott M. and LEE, Su-In, 2022. Algorithms to estimate Shapley value feature attributions. http://arxiv.org/abs/2207.07605
[2] LUNDBERG, Scott and LEE, Su-In, 2017. A Unified Approach to Interpreting Model Predictions. http://arxiv.org/abs/1705.07874
[3] LABERGE, Gabriel and PEQUIGNOT, Yann, 2022. Understanding Interventional TreeSHAP: How and Why it Works. http://arxiv.org/abs/2209.15123
[4] SHRIKUMAR, Avanti, GREENSIDE, Peyton and KUNDAJE, Anshul, 2019. Learning Important Features Through Propagating Activation Differences. http://arxiv.org/abs/1704.02685