
Different F1 score for same signal for different time_segments_aggregate interval #511

Open
Sid-030591 opened this issue Feb 10, 2024 · 8 comments
Labels
question Further information is requested

Comments

@Sid-030591

  • Orion version: orion_ml-0.5.3.dev1-py2.py3-none-any.whl
  • Python version: Python 3.10.12
  • Operating System: Windows 11 Home

Description

I am using the AER pipeline to detect anomalies on a synthetic dataset that I created. The dataset follows MA(1) characteristics with 7 anomalies added at random instants, and the timestamp is sampled at 1 hour (3600 seconds). When I run this with time_segments_aggregate at a 3600-second interval, only 1 out of the 7 anomalies is detected and it takes around 15 minutes. On the contrary, when I run the same dataset with time_segments_aggregate at a 21600-second interval, all 7 anomalies are detected in around 3 minutes. Could you please explain how the interval value actually impacts the F1 score? I can understand its impact on the time taken.

What I Did

import os

import pandas as pd
from tqdm import tqdm

from orion import Orion

# folder_path and csv_files are defined earlier (not shown).
for file in tqdm(csv_files):
    file_path = os.path.join(folder_path, file)
    our_data = pd.read_csv(file_path)
    our_data = our_data[['timestamp', 'value']]

    hyperparameters = {
        "mlstars.custom.timeseries_preprocessing.time_segments_aggregate#1": {
            "time_column": "timestamp",
            "interval": 21600,
            "method": "mean"
        },
        "orion.primitives.aer.AER#1": {
            "epochs": 35,
            "verbose": False
        }
    }

    orion = Orion(
        pipeline='aer',
        hyperparameters=hyperparameters
    )

    anomalies = orion.fit_detect(our_data)
@sarahmish sarahmish added the question Further information is requested label Feb 12, 2024
@sarahmish
Collaborator

Hi @Sid-030591, thank you for using Orion!

Please refer to the documentation of time_segments_aggregate to view how the aggregation is made.

With interval=21600 you aggregate 6 hours into a single value, making your time series shorter and thus the model faster.
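For illustration, here is a rough pandas sketch of what that aggregation does (illustrative only, not the actual primitive; the real step is mlstars' time_segments_aggregate):

import pandas as pd

# Sketch: aggregating an hourly signal into 6-hour buckets, analogous to
# time_segments_aggregate with interval=21600 and method="mean".
df = pd.DataFrame({
    'timestamp': pd.date_range('2024-01-01', periods=24, freq='h').astype('int64') // 10**9,
    'value': range(24),
})
df['time'] = pd.to_datetime(df['timestamp'], unit='s')
aggregated = df.set_index('time')['value'].resample('6h').mean()
print(aggregated)  # 24 hourly points become 4 six-hour means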

As for the performance, can you provide a snippet of what the input and output look like? How many intervals did the pipeline detect?

@Sid-030591
Author

Sid-030591 commented Feb 13, 2024

Behavior_description.docx

Hello @sarahmish, thank you for your response.
I have described my observations in the attached document. They turn out to be different from what I had initially thought. Nevertheless, this seems interesting to me. Please let me know your understanding of this.

@sarahmish
Collaborator

Thanks for the description @Sid-030591!

Your reported results make sense: when the threshold is fixed, we always capture the same extreme values (4 standard deviations away from the mean) and therefore obtain the same result. However, when the threshold is dynamic (fixed_threshold=False), the results will change, as there is an element of randomness in this approach.
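To illustrate the fixed case, here is a minimal sketch (not Orion's exact implementation) of flagging points more than 4 standard deviations from the mean error:

import numpy as np

# Sketch: a fixed threshold keeps only points whose error is more than
# k standard deviations away from the mean, so repeated runs on the same
# error values always flag the same points.
def fixed_threshold_anomalies(errors, k=4):
    mu, sigma = errors.mean(), errors.std()
    return np.where(np.abs(errors - mu) > k * sigma)[0]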

I hope that this answers your question!

@Sid-030591
Author

Sid-030591 commented Feb 15, 2024

Thank you @sarahmish for the answer.
I have a few follow-up questions:

  1. I had a quick glance at the find_threshold function where this is handled. Is it possible to give a quick idea of the logic there? I can see you are calculating some sort of cost function and optimizing it to get the best z. I am basically trying to understand the source of randomness in this logic. Also, is it because of this randomness that you chose the fixed one as the default method, or is there any other reason?

  2. For the purpose of reproducibility, is there any way to use some sort of seed for getting consistent results?

  3. For the purpose of benchmarking etc., I would like to know how you do the training and testing. For example, for the NAB datasets, say Tweets or the adex dataset, the total number of datapoints is not large. Do you train on the whole time series and then test on the same series, or is there a general train/test split that you maintain? I could not find this part in the code.

  4. I would also like to know how we can run the pipeline on time series which have multiple columns (features). I saw a previous issue wherein something similar was answered. Say I have a dataset with 50 columns. If I have to fit the pipeline individually for each column, it is an overhead. Of course, I can loop it up, but I wanted to understand whether the current implementation throws an error for multiple columns or would still run fine, and whether there would be a difference when the columns are independent versus correlated.

Points 3 and 4 are not actual follow-ups on the initial topic, but rather than opening a new issue I thought of writing in this one. Hope this is OK for you. I have been doing some work in the AD area and thus have all these questions; hope this is also OK.

@Sid-030591
Author

@sarahmish I think I found the answer to the 3rd and 4th questions. So, for the 3rd question, 0.2 is the validation split that you use. Also, we can train the pipeline on multiple columns, but anomalies will be found on a univariate basis (a single target column has to be provided). Please confirm whether this is the correct understanding. I would also appreciate it if you could answer the 1st and 2nd questions.
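For the multi-column case, the loop-it-up workaround I mentioned could look roughly like this (a sketch under the assumption that each non-timestamp column is treated as its own univariate signal):

# Sketch: fit one pipeline per column, treating each as a univariate signal.
results = {}
for column in our_data.columns.drop('timestamp'):
    signal = our_data[['timestamp', column]].rename(columns={column: 'value'})
    orion = Orion(pipeline='aer', hyperparameters=hyperparameters)
    results[column] = orion.fit_detect(signal)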

@sarahmish
Collaborator

Hi @Sid-030591, I was referring to the randomness of the model, and consequently of the error values.

find_anomalies with a dynamic threshold is more sensitive to changes from one run to another, whereas when you make it fixed, it becomes more consistent.

I would first recommend referring to issue #375 for some detail on find_anomalies. I also made a notebook so you can see how the error values change from one run to another; as a result, the detection is substantially different when fixed_threshold=False, while being more consistent when fixed_threshold=True. I also want to note that with every run you'll get something different.
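For reference, a minimal sketch of pinning that behaviour through the pipeline hyperparameters (the primitive key below is an assumption; check the aer pipeline JSON for the exact name of the find_anomalies step):

# Sketch: force the fixed threshold in post-processing.
# NOTE: the primitive key is assumed; confirm it against the 'aer' pipeline JSON.
hyperparameters = {
    "orion.primitives.timeseries_anomalies.find_anomalies#1": {
        "fixed_threshold": True  # False -> dynamic threshold, more run-to-run variance
    }
}

orion = Orion(pipeline='aer', hyperparameters=hyperparameters)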

Let me know if you have other questions!

@Sid-030591
Author

Sid-030591 commented Feb 22, 2024

Hello @sarahmish, thank you for your response. I understand the point regarding randomness in the model and also in the post-processing step. I would like to know one more thing: let's say you are doing some performance benchmarking/comparison, for example the AER model with reg_ratio at its default of 0.5 versus 0.3 or 0.4. We will get different F1 values. How should we conclude what part of this difference comes from the inherent randomness and what part from the actual change (the reg_ratio value), especially when the values are not too different? One way could be to run many simulations and take an average to come up with a better estimate. What is your understanding of this?

@sarahmish
Collaborator

Depending on the variable you are changing, you can attribute the change accordingly. For example, reg_ratio determines the weight of the forward/backward regression versus the reconstruction; higher values of reg_ratio mean that you are emphasizing the regression model, and vice versa.

In practice, however, running the same model will yield close but not identical results. To reduce the variability, I recommend running the model for n iterations to see some consistency in the results.
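A small sketch of that repetition idea (the contextual_f1_score import path and signature are assumptions; ground_truth is assumed to be a table of known anomaly start/end timestamps):

import numpy as np
from orion import Orion
from orion.evaluation import contextual_f1_score  # assumed import path

# Sketch: repeat training and detection n times and average the F1 score,
# so run-to-run randomness can be separated from the effect of a
# hyperparameter change such as reg_ratio.
def mean_f1(data, ground_truth, hyperparameters, n=5):
    scores = []
    for _ in range(n):
        orion = Orion(pipeline='aer', hyperparameters=hyperparameters)
        detected = orion.fit_detect(data)
        scores.append(contextual_f1_score(ground_truth, detected, data))
    return np.mean(scores), np.std(scores)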
