
Add an option of scaling data #247

Closed

Conversation

AdamBacon1257624653

While digging into the LIME source code, I found that a parameter named sample_around_instance can be provided when initializing LimeTabularExplainer.
It is used within the member method __data_inverse of LimeTabularExplainer, which is in turn called from explain_instance, another member method of LimeTabularExplainer.

The parameter sample_around_instance simply selects between two ways of initializing the generated data:
one is based on data_row, the original data record to be explained, while the other is based on the mean of the training data.

However, it puzzles me that the data returned from __data_inverse is used within explain_instance with only one scaling option, based on the mean.
I think another option should be provided that scales based on data_row.
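For reference, here is a minimal sketch of the branching described above. It paraphrases the logic rather than quoting the LIME source; the helper name `sample_neighborhood` and the use of a scikit-learn StandardScaler for `scaler` are assumptions for illustration.

```python
import numpy as np


def sample_neighborhood(data_row, scaler, num_samples, sample_around_instance):
    """Sketch of the sampling branch controlled by sample_around_instance.

    `scaler` is assumed to be a StandardScaler fit on the training data.
    """
    num_cols = data_row.shape[0]
    # Standard-normal noise, one row per perturbed sample.
    data = np.random.normal(0, 1, (num_samples, num_cols))
    if sample_around_instance:
        # Center the perturbations on the instance being explained.
        return data * scaler.scale_ + data_row
    # Otherwise center them on the training-data mean.
    return data * scaler.scale_ + scaler.mean_
```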

Add an option of scaling generated data with data_row
@marcotcr
Owner

I think the current behavior is correct: even if we sample around the instance, we want the scaling to be done based on the reference dataset. The point of scaling is just for the weights of features in different scales to be comparable to one another - and this should not depend on the instance being explained, but on dataset statistics.
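To illustrate the comparability point with a toy example (this is not from the issue; the data and the use of Ridge are just stand-ins for a local linear model):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two features on very different raw scales, contributing comparably to y.
X = np.column_stack([rng.uniform(0, 100000, 1000), rng.uniform(0, 1, 1000)])
y = X[:, 0] / 100000 + X[:, 1] + rng.normal(0, 0.01, 1000)

print(Ridge().fit(X, y).coef_)  # raw-scale weights differ by orders of magnitude

scaler = StandardScaler().fit(X)  # dataset statistics, independent of any instance
print(Ridge().fit(scaler.transform(X), y).coef_)  # weights are roughly equal
```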
Please reopen the issue if you think I am wrong.

@marcotcr marcotcr closed this Nov 20, 2018
@AdamBacon1257624653
Author

AdamBacon1257624653 commented Nov 27, 2018

You are right that the point of scaling is to make the feature weights comparable to one another. If we generated the data by inverting a scaling based only on the reference dataset, then the current behavior would be fine. But we do not: when we sample around the instance, the inverse operation is also based on the instance being explained. Hence, I think we should scale the data back based on the explained instance when we sample around the instance.
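In code, the mismatch I mean looks roughly like this (a sketch with made-up statistics standing in for scaler.mean_ and scaler.scale_):

```python
import numpy as np

rng = np.random.default_rng(0)
num_samples, num_cols = 5000, 3

# Stand-ins for the training statistics held by the scaler.
mean_ = np.array([10.0, 0.0, 100.0])
scale_ = np.array([2.0, 1.0, 25.0])
data_row = np.array([14.0, 1.5, 50.0])  # instance being explained

# Generation (sample_around_instance=True): perturbations are centered on
# data_row, i.e. the inverse of an *instance*-centered scaling.
data = rng.normal(0, 1, (num_samples, num_cols)) * scale_ + data_row

# Scaling applied afterwards in explain_instance: centered on the training mean.
scaled_data = (data - mean_) / scale_
print(scaled_data.mean(axis=0))  # ~ (data_row - mean_) / scale_, not ~ 0
```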

@marcotcr

@AdamBacon1257624653
Author

AdamBacon1257624653 commented Nov 27, 2018

I have reopened it over here: #258

@marcotcr

@marcotcr
Owner

Sorry, I didn't understand what you meant. Let me try to restate things to see if there is a mismatch in understanding:

  • Operation 1 scales and centers (-mean) the data so that the different feature weights in the explanation are comparable
  • Operation 2 generates data close to the instance. We are multiplying by scaler.scale_ because we sampled from a Normal(0, 1) above, and we want to perturb each feature according to its scale. The other option two lines below does the same, but generates data similar to the training set (around mean) rather than close to the instance.

The two operations are not related to one another, in my mind. I don't see why a change in operation 2 would make us want to change operation 1.
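To make the distinction concrete, a minimal sketch of the two operations on toy data (not the actual LIME code; the variable names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal([10.0, 0.0], [2.0, 25.0], size=(1000, 2))  # toy reference data
data_row = np.array([14.0, 50.0])                               # instance to explain
scaler = StandardScaler().fit(X_train)

noise = rng.normal(0, 1, (5000, 2))

# Operation 2: generate the neighborhood. Multiplying by scaler.scale_ turns the
# N(0, 1) draws into per-feature perturbations of the right magnitude; the two
# options differ only in where the samples are centered.
neighborhood_instance = noise * scaler.scale_ + data_row  # close to the instance
neighborhood_mean = noise * scaler.scale_ + scaler.mean_  # similar to the training set

# Operation 1: scale and center (-mean) with dataset statistics before fitting the
# local model, so the resulting feature weights are comparable to one another.
scaled = (neighborhood_instance - scaler.mean_) / scaler.scale_
```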
