
GH-6698: Shapley values support for ensemble models #15734

Merged

Conversation


@tomasfryda tomasfryda commented Sep 8, 2023

GitHub issue: #6698

TODO

[.] Validations in Java (make sure algos use compatible settings, etc.) (mostly done)

[?] use top_n=0, bottom_n=0, compare_abs=FALSE in the "new" SHAP
[?] mojo support

[x] Update Documentation
[x] Tests in Java
[x] Update docstrings in python
[x] Update docstrings in R
[x] Validations in Python
[x] Implementation in R
[x] Validations in R
[x] Tests in R
[x] Update Explain plots in R and Python
[x] Update Fairness plots in R and Python
[x] Update explain and inspect_model_fairness in both R and python

How does it work?

In order to be able to use G-DeepSHAP[0] for Stacked Ensembles, this PR introduces empirical marginal Shapley values, calculated by averaging baseline Shapley values over some reference/background dataset[1]. These are sometimes also called interventional SHAP values, but that term can cause confusion (there are also methods that incorporate causal knowledge into the Shapley value calculation), so marginal SHAP is the preferred term.

To calculate SHAP efficiently we need to exploit the inner workings of the individual algos (otherwise we'd end up with exponential complexity). For some algos we can calculate exact SHAP values efficiently (tree algos, GLM); for some others we can use a white-box approximation (Deep Learning, SE).

SHAP estimation for the tree algos uses information from the tree structure: it is easy to see which features influence the prediction of a given (sub)tree [2,3].

SHAP estimation for GLM is easy, as it is just the beta coefficient multiplied by the difference between the given point and some reference (one reference point or the expected value)[2].

For DeepLearning I use the DeepSHAP algorithm with multiple reference points[2, 4]. DeepSHAP is based on DeepLIFT[4] but uses (approximate) SHAP values instead of the multipliers $C_{\Delta x^+ \Delta t^+}$ and $C_{\Delta x^- \Delta t^-}$ (Reveal-Cancel Rule). DeepSHAP works by calculating SHAP values and linearly rescaling the contributions. This happens both between layers and between a linear combination and its activation function. This way we can easily calculate approximate contributions even for non-linear activation functions:
For example:

f(x1, x2, x3) = tanh(w1*x1 + w2*x2 + w3*x3 + b) = tanh(g(x))

Since tanh is a one-dimensional function, we know that its contribution is $\phi_{\tanh} = \tanh(g(x)) - \tanh(g(r))$, and the contribution of $g$ can be calculated the same way as for GLM.
Then we use the same rescale rule as between the NN layers.
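The rescale step above can be sketched numerically. This is a hypothetical NumPy illustration of a single tanh unit, not the Java implementation: the linear part is attributed GLM-style and the nonlinearity's total effect is redistributed proportionally onto those linear contributions.

```python
import numpy as np

def tanh_unit_contributions(w, b, x, r):
    """Baseline SHAP sketch for f(x) = tanh(w.x + b) against reference r.

    The linear part g(x) = w.x + b gets exact GLM-style contributions
    w_i * (x_i - r_i); the one-dimensional tanh contributes
    tanh(g(x)) - tanh(g(r)), which is redistributed onto the linear
    contributions proportionally (the rescale rule). Assumes g(x) != g(r)."""
    gx, gr = w @ x + b, w @ r + b
    lin = w * (x - r)                       # contributions through g
    delta_out = np.tanh(gx) - np.tanh(gr)   # total effect through tanh
    return lin * (delta_out / (gx - gr))    # rescale onto each feature

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.0, 0.5])   # point to explain
r = np.array([0.0, 1.0, 0.0])   # reference point
phi = tanh_unit_contributions(w, 0.1, x, r)
# local accuracy: the contributions sum to f(x) - f(r)
assert np.isclose(phi.sum(), np.tanh(w @ x + 0.1) - np.tanh(w @ r + 0.1))
```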

In our case there is one exception: the Maxout activation function, which has 2 inputs/channels, but its contributions can easily be calculated exactly.

For Stacked Ensembles I use G-DeepSHAP[0], which is closely related to DeepSHAP but uses a generalized rescale rule. This requires having the contributions calculated for each reference separately.

Terminology

  • Baseline SHAP: $f(x) - f(b) = \sum_i \phi_i(f, x, b)$
  • Bias Term in baseline SHAP: $f(b)$
  • Marginal SHAP: $\phi_i(f, x) = \frac{1}{|D|} \sum_{b \in D} \phi_i(f, x, b)$
  • Bias Term in Marginal SHAP: $\frac{1}{|D|} \sum_{b \in D} f(b)$
  • Simplified feature/input: binary features where 0 corresponds to the value from the baseline and 1 to the value from the explained point. This is the "lift" function used in this PR (because of the use of baseline SHAP; there are other choices for other types of SHAP)
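For a toy linear model, these definitions can be written out directly (hypothetical values; for a GLM the baseline SHAP formula beta_i * (x_i - b_i) is exact):

```python
import numpy as np

# Toy linear model f(x) = beta . x; for a GLM, baseline SHAP is exact:
# phi_i(f, x, b) = beta_i * (x_i - b_i), with bias term f(b).
beta = np.array([2.0, -1.0])
f = lambda v: v @ beta

x = np.array([1.0, 3.0])           # point to explain
D = np.array([[0.0, 0.0],          # reference/background dataset
              [2.0, 1.0],
              [1.0, 1.0]])

baseline_phi = beta * (x - D)      # one row of contributions per reference
bias_terms = D @ beta              # bias term f(b) for each reference

marginal_phi = baseline_phi.mean(axis=0)   # marginal SHAP
marginal_bias = bias_terms.mean()          # bias term in marginal SHAP

# local accuracy holds in both formulations
assert np.allclose(baseline_phi.sum(axis=1) + bias_terms, f(x))
assert np.isclose(marginal_phi.sum() + marginal_bias, f(x))
```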

Both exact values and approximations should comply with the following rules:

Let $x$ be the data point we want to explain and $r$ a reference data point.

  • Local accuracy/efficiency/summation-to-delta
    $f(x) - f(r) = \sum_i \phi_i(x, r)$

  • Dummy property/missingness - if a feature has the same value as in the reference, it doesn't contribute to the difference from the reference
    $x_i = r_i \Rightarrow \phi_i(x, r) = 0$

  • Consistency - if the model changes so that some simplified input's contribution increases, then that input's attribution ($\phi_i$) also increases or stays the same.

  • "Symmetry" (that's what I call it in the tests, but in some papers symmetry refers to something else): $\phi_i(f, x, b) = -\phi_i(f, b, x)$. In words: if $\phi_i(f, x, b)$ is the contribution of feature $i$ that moves the prediction from $f(b)$ to $f(x)$, and $\phi_i(f, b, x)$ is the contribution of feature $i$ that moves the prediction from $f(x)$ to $f(b)$, then the two contributions should have the same magnitude but opposite sign.
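For the GLM closed form, these rules can be checked mechanically; a minimal sketch with hypothetical coefficients:

```python
import numpy as np

beta = np.array([1.5, -2.0, 0.0])
phi = lambda x, r: beta * (x - r)   # exact baseline SHAP of f(v) = beta . v
f = lambda v: v @ beta

x = np.array([1.0, 2.0, 5.0])
r = np.array([1.0, 0.0, 3.0])

# local accuracy: contributions sum to f(x) - f(r)
assert np.isclose(phi(x, r).sum(), f(x) - f(r))
# dummy property: x_0 == r_0, so feature 0 contributes nothing
assert phi(x, r)[0] == 0.0
# "symmetry": swapping x and r flips the sign of every contribution
assert np.allclose(phi(x, r), -phi(r, x))
```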

Issues

Output_format == "Original"

This parameter corresponds to different things in different algos with the previously implemented TreeSHAP, and I think it might end up not returning a SHAP value at all.

Differences

In XGBoost, this parameter returns contributions for one-hot encoded categorical variables, while in DRF and GBM it returns contributions for unencoded features. To get unencoded feature contributions from XGBoost we need to set output_format="compact"; output_format="compact" is not supported for GBM and DRF.

Not returning a SHAP value

This mainly concerns the newly implemented SHAP, but I think it applies to the previously implemented TreeSHAP as well.

By definition, SHAP averages the function over every subset of the powerset of simplified features, but when we encode a categorical variable we have no data points where the variable takes multiple levels at once (e.g., Animal = {bird, cat, dog}: animal.dog=1 implies animal.cat=0 and animal.bird=0, but to calculate SHAP for this encoding we would also have to consider animal.dog=1, animal.cat=1, animal.bird=0; animal.dog=1, animal.cat=1, animal.bird=1; animal.dog=1, animal.cat=0, animal.bird=1; ...).
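The gap between the SHAP definition and one-hot encoded data can be seen by enumeration; a small sketch of the Animal example:

```python
from itertools import product

levels = ["bird", "cat", "dog"]
# one-hot simplified inputs: each encoded column independently 0 or 1
all_subsets = list(product([0, 1], repeat=len(levels)))
valid = [s for s in all_subsets if sum(s) == 1]  # real one-hot rows

print(len(all_subsets))  # 8 combinations the SHAP definition ranges over
print(len(valid))        # only 3 correspond to an actual animal value
```

The other 5 combinations (no level set, or several levels set at once) have no counterpart in the data, which is why contributions over the encoded columns are not true Shapley values.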

Since it's not a true SHAP, it can break the "symmetry" property, and it does, especially with NAs, which are also encoded differently in different algos. This can be demonstrated with the following example:
x={animal=dog}; b={animal=bird}

Is the difference $f(x) - f(b)$ due to the change animal.bird [1 -> 0] or animal.dog [0 -> 1]? This depends on the implementation of the algos. Naively, I'd expect the contribution to go to the feature level present in $x$, but consider a GLM with beta coefficients $animal.dog = 0$, $animal.cat = 0$, and $animal.bird \ne 0$: setting animal to dog clearly can't have any contribution, as it has no effect (and it would not be intuitive to have contributions that disagree with varimp), but unsetting animal from bird does.
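A numeric sketch of this GLM example (hypothetical coefficient values):

```python
import numpy as np

# Hypothetical GLM over one-hot animal columns [dog, cat, bird]:
# only the bird coefficient is nonzero.
beta = np.array([0.0, 0.0, 4.0])

x = np.array([1.0, 0.0, 0.0])   # animal = dog (explained point)
b = np.array([0.0, 0.0, 1.0])   # animal = bird (reference)

phi = beta * (x - b)            # per encoded-column contributions
print(phi)                      # [ 0.  0. -4.]
```

The entire difference f(x) - f(b) = -4 is attributed to unsetting animal.bird, not to setting animal.dog, consistent with the coefficients (and with varimp).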

For the previously implemented SHAP for XGBoost I created this issue: #15644

How to resolve this?

I'd make output_format="compact" the default, rename "original" to "compact" in GBM/DRF (since "compact" is not supported there right now), and deprecate output_format="original".

Binomial DRF TreeSHAP

In the previously implemented SHAP we do some odd recalculation because we calculate contributions for y=0, not y=1. I think the calculation there is incorrect, and I have a different way of recalculating the contributions; I described it in issue #15657. So this is probably not an issue of this PR, but in case I'm wrong, I think it's good to mention this possible issue here.

DRF SHAP calculation might be wrong

I noticed that DRF with the previously implemented SHAP is sometimes wrong. This is worse with XRT. I think the problem is with weights, since when I enable checkTreeConsistency I quite often get a "Tree inconsistency found" error logged. I mentioned this on #7385. I wonder if we should warn users about this potential error and tell them to use marginal SHAP until it gets fixed.

Limitations

  • Only algos used by AutoML have baseline SHAP implemented
  • Only the default categorical encoding is supported (tested)
  • GLM interactions are not supported
  • Uses more memory than necessary: it uses a "batch" approach (first calculate SHAP on one model, then another, and so on, until we have all the data from all the base models; then pass it to the SE; then groupby(row).mean()) instead of a "streaming" approach (i.e., calculating one row against all the references on all base models, passing it through the SE, and then averaging)
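To illustrate the batch-vs-streaming trade-off, here is a hedged NumPy sketch with linear base models and a fixed linear metalearner weight vector (all names and values are hypothetical; the real SE combination goes through the generalized rescale rule). Both orders produce the same marginal contributions; they differ only in how much is materialized at once:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))                      # rows to explain
bg = rng.normal(size=(5, 3))                     # background frame
betas = [rng.normal(size=3) for _ in range(2)]   # two linear base models
w = np.array([0.7, 0.3])                         # hypothetical metalearner weights

def base_phi(beta, x, b):
    return beta * (x - b)    # exact baseline SHAP of a linear base model

def batch():
    # materialize everything first: shape (n_models, nrow(X), nrow(bg), nfeat)
    all_phi = np.array([[[base_phi(beta, x, b) for b in bg] for x in X]
                        for beta in betas])
    combined = np.tensordot(w, all_phi, axes=1)   # pass through the SE
    return combined.mean(axis=1)                  # groupby(row).mean()

def streaming():
    # one explained row at a time: peak memory ~ nrow(bg) * nfeat per model
    out = []
    for x in X:
        per_ref = sum(wi * np.array([base_phi(beta, x, b) for b in bg])
                      for wi, beta in zip(w, betas))
        out.append(per_ref.mean(axis=0))
    return np.array(out)

assert np.allclose(batch(), streaming())
```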

References

[0] CHEN, Hugh, LUNDBERG, Scott M. and LEE, Su-In, 2022. Explaining a series of models by propagating Shapley values. Nature Communications. https://www.nature.com/articles/s41467-022-31384-3
[1] CHEN, Hugh, COVERT, Ian C., LUNDBERG, Scott M. and LEE, Su-In, 2022. Algorithms to estimate Shapley value feature attributions. http://arxiv.org/abs/2207.07605
[2] LUNDBERG, Scott and LEE, Su-In, 2017. A Unified Approach to Interpreting Model Predictions. http://arxiv.org/abs/1705.07874
[3] LABERGE, Gabriel and PEQUIGNOT, Yann, 2022. Understanding Interventional TreeSHAP: How and Why it Works. http://arxiv.org/abs/2209.15123
[4] SHRIKUMAR, Avanti, GREENSIDE, Peyton and KUNDAJE, Anshul, 2019. Learning Important Features Through Propagating Activation Differences. http://arxiv.org/abs/1704.02685

@tomasfryda tomasfryda self-assigned this Sep 8, 2023
@tomasfryda tomasfryda force-pushed the tomf_GH-6698_-shapley-values-support-for-ensemble-models branch from d50a58b to 6ebfc91 Compare September 12, 2023 13:51
@@ -1416,7 +1416,7 @@ handle_ice <- function(model, newdata, column, target, centered, show_logodds, s
                        data = results)
   column_value <- as.data.frame(newdata[[column]])
   histogram <- stat_count_or_bin(!.is_continuous(newdata[[column]]),
-                                 ggplot2::aes(x = .data[[col_name]], y = (.data$..count.. / max(.data$..count..)) * diff(y_range) / 1.61),
+                                 ggplot2::aes(x = .data[[col_name]], y = (ggplot2::after_stat(count) / max(ggplot2::after_stat(count))) * diff(y_range) / 1.61),
Contributor Author:

Deprecated (#15651)
This change requires ggplot2 3.3.0 and above which is already required for PDP and ICE plots so no need to change the required version of ggplot2. (All other plots should work with v3.0.0.)

@tomasfryda tomasfryda force-pushed the tomf_GH-6698_-shapley-values-support-for-ensemble-models branch 3 times, most recently from 7063783 to 6516897 Compare September 18, 2023 15:17
Comment on lines 732 to 749
        dens = _density(col)
        color = (
            _uniformize(frame, col_name)[permutation]
            if colorize_factors or not frame.isfactor(col_name)
            else np.full(frame.nrow, 0.5)
        )

        if not np.any(np.isfinite(color)) or np.nanmin(color) == np.nanmax(color):
            plt.scatter(0, i, alpha=alpha, c="grey")
            continue  # constant variable; plotting it can throw StopIteration in some versions of matplotlib

        plt.scatter(
            col,
            i + dens * np.random.uniform(-jitter, jitter, size=len(col)),
            alpha=alpha,
-           c=_uniformize(frame, col_name)[permutation]
-           if colorize_factors or not frame.isfactor(col_name)
-           else np.full(frame.nrow, 0.5),
+           c=color,
            cmap=colormap
        )
Contributor Author:

fixes #15743

Contributor Author:

The problem is a bug in matplotlib 3.7.1; it is fixed in later versions (3.7.2+).

@tomasfryda tomasfryda requested review from syzonyuliia and removed request for mn-mikke September 26, 2023 08:03
maurever
maurever previously approved these changes Sep 27, 2023
@maurever (Contributor) left a comment:

@tomasfryda, you did an incredible amount of work in this PR. I checked the DeepLearning part. I compared your implementation against the paper and did not notice anything suspicious. I left only a few suggestions to improve the code. Please make sure you cover all DeepLearning cases in the tests.

Good job! Thanks.

wendycwong
wendycwong previously approved these changes Oct 3, 2023
Arrays.fill(result, 0);
double biasTerm = (_dinfo._intercept ? _betas[_betas.length - 1] : 0);
for (int i = 0; i < _betas.length - (_dinfo._intercept ? 1 : 0); i++) {
idx = _compact ? _dinfo.coefOriginalColumnIndices()[i] : i;
Contributor:

I was looking at one way to calculate Shapley for linear models from here: https://shap.readthedocs.io/en/latest/example_notebooks/tabular_examples/linear_models/Sentiment%20Analysis%20with%20Logistic%20Regression.html

I assume you want to set the SHAP value for feature i to phi_i = beta_i * (x_i - mean(x_i)).

You don't need to loop through the whole frame to get mean(x_i); just call mean() on each vector. However, we do need to do something special for categorical columns.

Contributor Author:

I need the contribution with regard to the background dataset:
phi_i = beta_i * (x_i - bg_i)

where bg is a point from the background dataset. The background dataset is required to have a "common ground" across the algos, so the contributions can be combined in the Stacked Ensemble. And even for the tree algos, it is generally recommended to use SHAP calculated against a background set rather than from the tree structure (for example in Machine Learning for High-Risk Applications by Patrick Hall, James Curtis, Parul Pandey).

Contributor:

For _outputSpace == false: I see, so you are saying that for each row j of x you need beta_i * (x_j[i] - bg_m[i]) for m = 1 to the size of bg. If you write out the sum over the background rows, you get:

(size of bg) * beta_i * x_j[i] - (sum from m = 1 to size of bg of) beta_i * bg_m[i]

If we divide by the size of bg, we get:

beta_i * x_j[i] - beta_i * mean(bg[i])

It looks like for each row of x you only need beta_i * mean(bg[i]). Perhaps you can compute beta_i * mean(bg[i]) (or mean(bg[i])) as an array once at the beginning. I noticed that you also need beta * bg_n + beta_0 for each row n; you could generate a new frame that contains beta * bg_n + beta_0 for each row n.

Then you can go through your data x and calculate beta_i * x_j[i] - beta_i * mean(bg[i]).
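As a sanity check on this algebra: averaging the per-reference GLM contributions over the background rows collapses, by linearity, to a single contribution against the background mean. A quick NumPy sketch with hypothetical toy values:

```python
import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=4)        # linear model coefficients
x = rng.normal(size=4)           # row to explain
bg = rng.normal(size=(100, 4))   # background frame

# per-reference contributions beta_i * (x_i - bg_m[i]), then average over m ...
avg_phi = (beta * (x - bg)).mean(axis=0)
# ... equals a single contribution against the background mean
assert np.allclose(avg_phi, beta * (x - bg.mean(axis=0)))
```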

Contributor Author:

I think you mean output_per_reference=false. The _outputSpace flag switches between the link space and the "output" space (in GLM, between contributions for $\mu = g^{-1}(\eta)$ and for $\eta = X\beta$).

Regarding your suggestion, you are right. I think this is low priority, as GLM is interpretable by itself, and for purely numeric datasets the user can provide a background dataset consisting of the means. (The same applies to DeepSHAP, but the resulting SHAP values would be biased and would correspond to the original DeepSHAP, not the default DeepSHAP in the shap package.)

I will try to add this (for GLM) if I have some spare time.

long _endRow;
Job _job;

public ContributionsWithBackgroundFrameTask(Key<Frame> frKey, Key<Frame> backgroundFrameKey, boolean perReference) {
Contributor:

I don't understand what this class is for?

Contributor Author:

I need to be able to calculate baseline SHAP for each point from frame with each point from background_frame as the reference.

in pseudo-code this class is approximately responsible for:

result = Frame()
for x in frame:
    for bg in background_frame:
        result.append(baselineSHAP(x, bg))

This yields contributions for all the points against all the references. It's not very useful by itself, but it's required for calculating SHAP for Stacked Ensembles. Another use is, for example, explaining an ensemble of proprietary models that can't be shared: e.g., from one company you get a fraud score, from another a credit score, and you feed these values into your model; those companies can share the baseline SHAP values with you, and you can calculate contributions for the input features without knowing the fraud/credit models. (This example is taken from https://www.nature.com/articles/s41467-022-31384-3)

These baseline SHAP values are often denoted $\phi_i(model, x^{(e)}, x^{(b)})$, where $x^{(e)}$ is the row that we want to explain and $x^{(b)}$ is the reference.

Note that the commonly used SHAP values are $$\phi_i(model, x^{(e)}) = \frac{1}{|D|}\sum_{x^{(b)} \in D} \phi_i(model, x^{(e)}, x^{(b)})$$ and this is called marginal SHAP (in older papers, interventional SHAP).

@Override
protected void postGlobal() {
NewChunk[] ncs = Stream.of(appendables()).map(vec -> vec.chunkForChunkIdx(0)).toArray(NewChunk[]::new);
for (int i = 0; i < _partialSums.length; i++) {
Contributor:

Can't you just call vec.mean()? I know we may run into problems with categorical columns, though.

Contributor Author:

No. I need something like frame.groupby("RowIdx").mean(). The number of rows is nrow(frame) * nrow(background_frame), and the result should have nrow(frame) rows.
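Assuming a dense layout where each explained row's nrow(background_frame) per-reference rows are contiguous (an assumption for illustration, not necessarily H2O's chunk layout), the groupby-mean is just a reshape and an axis mean:

```python
import numpy as np

nrow, nbg, nfeat = 3, 4, 2
rng = np.random.default_rng(2)
# per-reference contributions: one block of nbg rows per explained row
contribs = rng.normal(size=(nrow * nbg, nfeat))

# frame.groupby("RowIdx").mean(): average each block of nbg rows
marginal = contribs.reshape(nrow, nbg, nfeat).mean(axis=1)
assert marginal.shape == (nrow, nfeat)   # nrow(frame) rows, as required
```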

wendycwong
wendycwong previously approved these changes Oct 5, 2023
@wendycwong (Contributor) left a comment:

That is so much work! This is not the first time you have taken on a big issue.

These are just my suggestions for a speedup. Also, can you compare the results of your implementation with SHAP values from sklearn or another toolbox?


for (int i = 0; i < _betas.length - (_dinfo._intercept ? 1 : 0); i++) {
idx = _compact ? _dinfo.coefOriginalColumnIndices()[i] : i;
result[idx] += _betas[i] * (row.get(i) - bgRow.get(i));
biasTerm += _betas[i] * bgRow.get(i);
Contributor:

The biasTerm is independent of the dataset. You could compute it once and then reuse it.

Contributor Author:

I think I will postpone the improvements you suggested to some future PR, as they would mainly speed up the use-case without StackedEnsemble, and this PR is mainly for StackedEnsemble.

In the use-case with SE, there are as many bias terms as there are rows in the background dataset. I could speed it up a bit, but it would require me to reorder the rows in all the other algos as well so that I don't need to sort the contribution frames in the SE (I check for the same order but don't sort).

I had already implemented some version of marginal SHAP for GLM, but I removed it to avoid unnecessary complexity (it doesn't really bring much more info than the coefficients in the GLM case).

26045d4#diff-b5a14183b7510c835fd11228e8c99c5e00ad87592933bf2569b3934a7879218fR188-R213

@tomasfryda (Contributor Author):

Also, can you compare results of your implementation with SHAP values from sklearn or other toolbox?

@wendycwong I added a comparison for DeepLearning (DeepSHAP). I don't think we can convert our models to sklearn (or another toolbox), and since SHAP is model dependent I don't see a way to do it; for most cases I think it's not necessary.

Why shouldn't it be necessary for most models? The tree-based models and GLM now provide exact SHAP values, i.e., the same values you'd get by calculating SHAP naively (with exponential complexity), and this is what I do in pyunit_SHAP_NOPASS.py for all those algos. So two algos remain: DeepLearning and Stacked Ensembles.

DeepLearning
During the development of DeepSHAP I compared the SHAP values with the values I get from the shap Python package (which is created by the author of the paper that introduced SHAP). To get those values, I had to export the NN layers (weights and biases) to either TensorFlow or PyTorch; I chose the latter, as I'm more familiar with it. I added the hacky export method to the DeepLearning SHAP Java test, ran it, copied the result to Python to generate the shap values, and copied them back into the Java test. So DeepLearning is tested against values from the reference implementation.

StackedEnsembles
Here the situation is a bit more complicated. The reference implementation of Generalized DeepSHAP depends on an unmerged branch of shap (see the discussion on the reference implementation of G-DeepSHAP), which I was unable to compile. Even if I had, I would have to somehow turn H2O models into models supported by the shap package. But the good news is that SHAP has properties we can test for (and I do), and even more importantly, Generalized DeepSHAP is exact for a stacked ensemble of linear models (with identity link function), so I can test that the same way as for the tree algos/GLM (and I do).
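The exactness argument for linear stacked ensembles can be sketched as follows (synthetic weights, not H2O's API): with an identity-link linear metalearner the ensemble collapses to a single linear model, whose marginal SHAP has a closed form, so propagating per-model baseline SHAP values and averaging must reproduce it exactly.

```python
import numpy as np

rng = np.random.default_rng(3)
betas = rng.normal(size=(3, 5))        # three linear base models
w = np.array([0.5, 0.3, 0.2])          # identity-link metalearner weights
combined = w @ betas                   # the ensemble collapses to one linear model

x = rng.normal(size=5)                 # row to explain
bg = rng.normal(size=(50, 5))          # background frame

# exact marginal SHAP of the collapsed linear ensemble
exact = combined * (x - bg.mean(axis=0))

# per-model baseline SHAP, propagated through the metalearner, then averaged
per_model = betas[:, None, :] * (x - bg)[None, :, :]   # (model, ref, feature)
propagated = np.tensordot(w, per_model, axes=1).mean(axis=0)
assert np.allclose(propagated, exact)
```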

@@ -2747,6 +2750,166 @@ Retrieving graphs via R is not yet supported. An `.ipynb demo showing this examp
shap.summary_plot(shap_values, X, plot_type="bar")


Interventional / Marginal SHAP
Contributor Author:

@hannah-tillman would you be so kind as to have a quick look at the documentation, in case something is unclear and needs rephrasing, and for typos etc.? Thanks in advance!

hannah-tillman
hannah-tillman previously approved these changes Oct 6, 2023
@hannah-tillman (Contributor) left a comment:

General updates & added references for maneuverability - LGTM :)

@mn-mikke (Collaborator) left a comment:

Great change, respect!!!

@tomasfryda tomasfryda merged commit 75c20c3 into master Oct 10, 2023
@tomasfryda tomasfryda deleted the tomf_GH-6698_-shapley-values-support-for-ensemble-models branch October 10, 2023 09:19