Any insight on having 0% accuracy in this data? #882

craigbrown-nist · 2022-03-17T06:13:24Z

I'm on v0.9 and have tried several of the anomaly detectors, but none of them works well on simple data from NAB. I'm using the AWS data with an index and the CPU usage (no dates); it contains a level jump and some outliers, and a partial sample is shown below. No parameter change in this particular example affects the scores.

I would appreciate it if anyone could point to where I'm going wrong.
Thanks!

import numpy as np
import pandas as pd
from river import stream
from river import compose
from river import anomaly
from river import metrics
from river import preprocessing

X =  pd.read_csv('NAB/data/realAWSCloudwatch/rds_cpu_utilization_cc0c53-1.csv')

Y = X["value"]
X = X.drop("value", axis = 'columns')
X_y = stream.iter_pandas(X, Y)

model = compose.Pipeline(
    preprocessing.MinMaxScaler(),
    anomaly.ConstantThresholder(
        anomaly.HalfSpaceTrees(seed=42),
        threshold=0.8
    )
)

report = metrics.ClassificationReport()

for x, y in X_y:
    score = model.score_one(x)
    model = model.learn_one(x) 
    report = report.update(y, score) 

report

index	value
1	6.456
2	5.816
3	6.268
4	5.816
5	5.862
6	6.246
7	6.648
8	6.448
9	6.46
10	5.834
11	6.232
12	6.064
13	6.052
14	5.834
15	6.464
16	5.622
17	6.238
18	6.06
19	6.04
20	5.838
21	6.024
22	5.86
...
3079	6.464
3080	6.036
3081	25.1033
3082	17.186
3083	14.452
3084	13.968
3085	13.352
3086	15.6433
3087	14.4533
3088	15.42
3089	18.3333
3090	16.19
3091	15
3092	15.07
3093	15
3094	15.0867
3095	13.9333

MaxHalford · 2022-03-17T08:48:09Z

MaxHalford
Mar 17, 2022
Maintainer

Hello. What exact data are you using? This dataset? The only feature is a date, so it's not clear to me what preprocessing you've done.

craigbrown-nist Mar 17, 2022
Author

Right. I replaced the date with an index, but the date should work too if it is formatted properly. Honestly, it has not been trivial for me to build a stream and iterate over such a simple dataset, compared with all the more complex data in the examples!

VaysseRobin Mar 17, 2022
Maintainer

Hello. The problem is that you only use the date as a feature, so the model will never detect anomalies: the dates themselves are perfectly normal. The feature you need to give to the anomaly detector is "value". The model will then give you an anomaly score between 0 (normal value) and 1 (abnormal value). You can't measure the performance of your model unless you know which values are actually abnormal, and that information doesn't seem to be available in your dataset. This is unsupervised learning.

For an example, see the Half Space Trees documentation: https://riverml.xyz/dev/api/anomaly/HalfSpaceTrees/

craigbrown-nist Mar 17, 2022
Author

Indeed! See the new code below, which works for predicting points, and we can forget that ever happened... and thank you.

import numpy as np
import pandas as pd  
from datetime import datetime

from river import stream 
from river import compose
from river import linear_model
from river import preprocessing
# from river import anomaly
from river import metrics

import matplotlib
import matplotlib.pyplot as plt
# Customize matplotlib
matplotlib.rcParams.update(
    {
        'text.usetex': False,
        'font.family': 'stixgeneral',
        'mathtext.fontset': 'stix',
    }
)

X =  pd.read_csv('/NAB/data/realAWSCloudwatch/rds_cpu_utilization_cc0c53.csv', parse_dates = ["timestamp"])

Y = X["value"]
X = X.drop("value", axis = 'columns') 
# X['timestamp'] = pd.to_datetime(X['timestamp']).dt.datetime
print(type(X['timestamp'][0]))

X_y = stream.iter_pandas(X, Y)

for x, y in X_y:
    print(x, y)
    break

def get_datetime(x):
    dt64 = np.datetime64(x['timestamp'])
    ts = (dt64 - np.datetime64('1970-01-01T00:00:00')) / np.timedelta64(1, 's')
    return {'float_date': ts}


model = compose.Pipeline(
    ('float_date', compose.FuncTransformer(get_datetime)),
    ('scale', preprocessing.StandardScaler()),
    ('lin_reg', linear_model.LinearRegression())
)




def evaluate_model(model):

    metric = metrics.Rolling(metrics.MAE(), 12)

    dates = []
    y_trues = []
    y_preds = []

    for x, y in X_y:
        # Obtain the prior prediction and update the model in one go
        y_pred = model.predict_one(x)
        model.learn_one(x, y)

        # Update the error metric
        metric.update(y, y_pred)

        # Store the date, the true value and the prediction
        dates.append(x['timestamp'])
        y_trues.append(y)
        y_preds.append(y_pred)

    # Plot the results
    fig, ax = plt.subplots(figsize=(10, 6))
    ax.grid(alpha=0.75)
    ax.plot(dates, y_trues, lw=3, color='#2ecc71', alpha=0.8, label='Ground truth')
    ax.plot(dates, y_preds, lw=3, color='#e74c3c', alpha=0.8, label='Prediction')
    ax.legend()
    ax.set_title(metric)
    
evaluate_model(model)
