
Add an option of scaling data #247

Closed

Conversation

AdamBacon1257624653

While digging into the LIME source code, I found that a parameter named sample_around_instance can be provided when initializing LimeTabularExplainer.
It is used within the member method __data_inverse of LimeTabularExplainer, which is in turn called from explain_instance, another member method of LimeTabularExplainer.

The parameter sample_around_instance simply selects between two ways of initializing the generated data:
one is based on data_row, the original data record to be explained, while the other is based on the mean of the training data.

However, it puzzles me that the data returned from __data_inverse is used within explain_instance with only one scaling option, based on the mean.
I think another option should be provided that scales based on data_row.
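For reference, here is a minimal sketch of the branching described above. It paraphrases the logic rather than quoting the LIME source; the helper name `sample_neighborhood` and the use of a scikit-learn StandardScaler for `scaler` are assumptions for illustration.

```python
import numpy as np


def sample_neighborhood(data_row, scaler, num_samples, sample_around_instance):
    """Sketch of the sampling branch controlled by sample_around_instance.

    `scaler` is assumed to be a StandardScaler fit on the training data.
    """
    num_cols = data_row.shape[0]
    # Standard-normal noise, one row per perturbed sample.
    data = np.random.normal(0, 1, (num_samples, num_cols))
    if sample_around_instance:
        # Center the perturbations on the instance being explained.
        return data * scaler.scale_ + data_row
    # Otherwise center them on the training-data mean.
    return data * scaler.scale_ + scaler.mean_
```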

Add an option of scaling generated data with data_row
@marcotcr
Owner

I think the current behavior is correct: even if we sample around the instance, we want the scaling to be done based on the reference dataset. The point of scaling is just for the weights of features in different scales to be comparable to one another - and this should not depend on the instance being explained, but on dataset statistics.
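To illustrate the comparability point with a toy example (this is not from the issue; the data and the use of Ridge are just stand-ins for a local linear model):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different raw scales, contributing comparably to y.
X = np.column_stack([rng.uniform(0, 100000, 1000), rng.uniform(0, 1, 1000)])
y = X[:, 0] / 100000 + X[:, 1] + rng.normal(0, 0.01, 1000)

print(Ridge().fit(X, y).coef_)  # raw-scale weights differ by orders of magnitude

scaler = StandardScaler().fit(X)  # dataset statistics, independent of any instance
print(Ridge().fit(scaler.transform(X), y).coef_)  # weights are roughly equal
```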
Please reopen the issue if you think I am wrong.

@marcotcr marcotcr closed this Nov 20, 2018
@AdamBacon1257624653
Author

AdamBacon1257624653 commented Nov 27, 2018

You are right that the point of scaling is to make the feature weights comparable to one another. If we generated the data by inverting a scaling based only on the reference dataset, then the current behavior would be fine. But we do not: when we sample around the instance, the inverse operation is also based on the instance being explained. Hence, I think we should scale the data back based on the explained instance when we sample around the instance.
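In code, the mismatch I mean looks roughly like this (a sketch with made-up statistics standing in for scaler.mean_ and scaler.scale_):

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_cols = 5000, 3

# Stand-ins for the training statistics held by the scaler.
mean_ = np.array([10.0, 0.0, 100.0])
scale_ = np.array([2.0, 1.0, 25.0])
data_row = np.array([14.0, 1.5, 50.0])  # instance being explained

# Generation (sample_around_instance=True): perturbations are centered on
# data_row, i.e. the inverse of an *instance*-centered scaling.
data = rng.normal(0, 1, (num_samples, num_cols)) * scale_ + data_row

# Scaling applied afterwards in explain_instance: centered on the training mean.
scaled_data = (data - mean_) / scale_
print(scaled_data.mean(axis=0))  # ~ (data_row - mean_) / scale_, not ~ 0
```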

@marcotcr

@AdamBacon1257624653
Author

AdamBacon1257624653 commented Nov 27, 2018

I have reopened it over here: #258

@marcotcr

@marcotcr
Owner

Sorry, I didn't understand what you meant. Let me try to restate things to see if there is a mismatch in understanding:

  • Operation 1 scales and centers (-mean) the data so that the different feature weights in the explanation are comparable
  • Operation 2 generates data close to the instance. We are multiplying by scaler.scale_ because we sampled from a Normal(0, 1) above, and we want to perturb each feature according to its scale. The other option two lines below does the same, but generates data similar to the training set (around mean) rather than close to the instance.

The two operations are not related to one another, in my mind. I don't see why a change in operation 2 would make us want to change operation 1.
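To make the distinction concrete, a minimal sketch of the two operations on toy data (not the actual LIME code; the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal([10.0, 0.0], [2.0, 25.0], size=(1000, 2))  # toy reference data
data_row = np.array([14.0, 50.0])                               # instance to explain
scaler = StandardScaler().fit(X_train)

noise = rng.normal(0, 1, (5000, 2))

# Operation 2: generate the neighborhood. Multiplying by scaler.scale_ turns the
# N(0, 1) draws into per-feature perturbations of the right magnitude; the two
# options differ only in where the samples are centered.
neighborhood_instance = noise * scaler.scale_ + data_row  # close to the instance
neighborhood_mean = noise * scaler.scale_ + scaler.mean_  # similar to the training set

# Operation 1: scale and center (-mean) with dataset statistics before fitting the
# local model, so the resulting feature weights are comparable to one another.
scaled = (neighborhood_instance - scaler.mean_) / scaler.scale_
```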
