
debug_one does not clearly show the explanation of how x is predicted #1549

Closed
leandrolma3 opened this issue May 22, 2024 · 6 comments
@leandrolma3

leandrolma3 commented May 22, 2024

Versions

I'm working in Google Colab with the latest versions of the Python libraries, including River.

Describe your task

I need to get an explanation of how x is predicted so I can convert it into decision rules (IF-THEN). I've been trying all the models in River that implement the Hoeffding Tree concept to classify an incoming stream and obtain the prediction explanation with debug_one, in order to convert it into rules. Unfortunately, for the first instances, debug_one returns only the predicted class, with no explanation of which attributes and conditions were considered. I've tried different Hoeffding Tree models and datasets in River, and the same occurs:

Here are some explanation samples that I got using the debug_one method:
Class True:
P(True) = 1.0

Another sample:
Class True:
P(False) = 0.3
P(True) = 0.7

What kind of performance are you expecting?

I expect to get all the attributes and conditions, with the values used to predict the class for a given instance of the data stream, such as:
Expected explanation:
empty_server_form_handler > 0.5454545454545454
popup_window ≤ 0.2727272727272727
Class False:
P(False) = 0.6
P(True) = 0.4
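For reference, here is a minimal, hypothetical sketch of how such debug_one text could be converted into an IF-THEN rule, assuming the plain-text format shown above (the parse_debug_output helper is my own illustration, not part of River):

```python
import re

def parse_debug_output(text: str) -> str:
    """Convert debug_one-style text into a single IF-THEN rule string.

    Assumes the format shown above: split-condition lines such as
    'popup_window ≤ 0.27', followed by a 'Class X:' line and 'P(...) = ...' lines.
    """
    conditions, predicted = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        m = re.match(r"Class (\w+):", line)
        if m:
            predicted = m.group(1)  # the predicted class
        elif line.startswith("P("):
            continue  # probability lines are not part of the rule antecedent
        else:
            conditions.append(line)  # a split condition, e.g. 'x ≤ 0.27'
    antecedent = " AND ".join(conditions) if conditions else "TRUE"
    return f"IF {antecedent} THEN class = {predicted}"

example = """empty_server_form_handler > 0.5454545454545454
popup_window ≤ 0.2727272727272727
Class False:
    P(False) = 0.6
    P(True) = 0.4"""
print(parse_debug_output(example))
```

Note that for a single-node tree (no conditions in the output, as in the samples above), the antecedent degenerates to the trivial rule `IF TRUE THEN class = ...`.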

Steps/code to reproduce

# Sample code to reproduce the performance issue

from river import datasets
from river import drift
from river import tree
from river import metrics
from typing import List

classifier = tree.HoeffdingAdaptiveTreeClassifier(drift_detector=drift.ADWIN(delta=0.001))

def process_chunk(chunk_data: List[dict], chunk_labels: List[bool], metric, chunk_metric):
    printed = False
    for xi, yi in zip(chunk_data, chunk_labels):
        y_pred = classifier.predict_one(xi)
        metric.update(yi, y_pred)
        chunk_metric.update(yi, y_pred)
        classifier.learn_one(xi, yi)
        rules = classifier.debug_one(xi)
        # Print the explanation for the first instance of each chunk only
        if not printed:
            print(f"Rules for instance {rules}")
            printed = True
    print(f"Accuracy for this chunk: {chunk_metric.get():.2%}")
    # Return a fresh metric so the per-chunk accuracy is reset
    return metrics.Accuracy()


stream = datasets.Phishing()
chunk_size = 100
metric = metrics.Accuracy()
chunk_metric = metrics.Accuracy()

chunk_data, chunk_labels = [], []

for x, y in stream:
    chunk_data.append(x)
    chunk_labels.append(y)
    
    if len(chunk_data) == chunk_size:
        chunk_metric = process_chunk(chunk_data, chunk_labels, metric, chunk_metric)
        chunk_data, chunk_labels = [], []

print(f"Final accuracy of the model: {metric.get():.2%}")

Necessary data

I appreciate all suggestions about this issue.

@smastelini
Member

Hi @leandrolma3, thanks for reporting. How many instances has the tree learned before you ask for explanations?

If the tree consists of a single (root) node, then the expected output of debug_one is what you report.

@leandrolma3
Author

Hi @smastelini, thank you very much for your reply. I checked, and the method only returned a rule with a condition after the middle of the 5th chunk was processed, i.e., after about 440 instances of the stream.

Sorry if I missed something about Hoeffding trees, but I was expecting a decision rule even for a single root node, represented by an attribute chosen from the dataset.

Reading more about it, I realize that a single (root) node is represented by a classification probability based on the data seen so far, correct? Is there some explanation mechanism for a single node of a Hoeffding tree?

@smastelini
Member

smastelini commented May 23, 2024

Hi @leandrolma3. No, a single-node tree does not apply any decision split. The outputs are produced by the underlying leaf decision model. By default, classification Hoeffding Trees use either Naive Bayes or majority vote, depending on which of these two options yields the best results.

Take a look at the grace_period parameter of the trees, which controls the interval between split attempts. If you only notice a change in the structure of the trees after around 400 instances, decreasing the grace period might accelerate tree growth. You can also try increasing the delta parameter to the same end.

Keep in mind that from a data streaming standpoint, hundreds and even a few thousand samples might be just the start of the game :D
These models are designed to process potentially infinite streams of data.
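For intuition, the split timing that grace_period and delta govern follows from the Hoeffding bound used by this family of trees. Below is a minimal sketch of the classic VFDT formulation (not River's internal implementation):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound: with probability 1 - delta, the observed mean of n
    samples is within epsilon of the true mean, for values in a range of
    width value_range."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

# For a binary-class split heuristic such as information gain, range = log2(2) = 1.
for n in (100, 200, 400):
    for delta in (1e-7, 0.05):
        eps = hoeffding_bound(1.0, delta, n)
        print(f"n={n:4d}  delta={delta:<7}  epsilon={eps:.4f}")
```

A split attempt succeeds when the gap between the two best split candidates exceeds epsilon; since epsilon shrinks as n grows or delta increases, a smaller grace period (more frequent attempts) or a larger delta both lead to earlier splits.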

@leandrolma3
Author

Thank you for your time and the great explanation, @smastelini. I'm trying to implement some methods to explain how Hoeffding Trees classify the data before a decision split, and your explanation helped me validate my methodology. Thank you very much.

@smastelini
Member

Nice to hear that, @leandrolma3. Please do not hesitate to ask more questions if needed.

If your question was answered, can we close this issue?

@leandrolma3
Author

Yes @smastelini, you did answer it. I'll close; thank you.
