Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve Coordinated Omission documentation and debug output #300

Merged
merged 2 commits into from
Jul 2, 2021

Conversation

jeremyandrews
Copy link
Member

  • Always log the GooseUser thread's cadence (the time it takes to run through all GooseTasks) in the requests file (if enabled)
  • Generate an INFO level log message if Coordinated Omission detects an abnormally slow request, for example:
    10:10:02 [INFO] coordinated omission alert 6.957s into goose attack: "GET http://apache/node/8848" [200] took abnormally long (2932 ms), task name: "(Anon) node page"
    
  • update the README with improved documenation

@LionsAd
Copy link
Collaborator

LionsAd commented Jul 2, 2021

I would add how long Goose expects it to take not just the abnormally long (42 seconds) normal is 13 seconds. etc.

@jeremyandrews jeremyandrews merged commit a15914e into tag1consulting:main Jul 2, 2021
@jeremyandrews
Copy link
Member Author

It doesn’t actually know this. It’s comparing an individual request against the entire time to run ALL requests. (There’s logic in git history to track all requests, but it’s too much overhead).

@jeremyandrews jeremyandrews deleted the co-info branch July 3, 2021 05:40
@jeremyandrews
Copy link
Member Author

These are the requests that will trigger Coordinated Omission Mitigation, but they are not themselves generated by Coordinated Omission Mitigation.

In other words, these are actual requests that were made to the server, and report the actual time it took. You'd have to look further in the requests file to find the subsequent back-filled "requests" generated by Coordinated Omission Mitigation. (But there's not a 1:1 match, as Goose is monitoring the per GooseUser cadence, not the per request cadence.)

@bbros-dev
Copy link
Contributor

Okay. This sort of spelunking and cross correlating just to get data to analyze if what issue #305 aims to eliminate.

@@ -527,9 +527,27 @@ If something causes the response to a request to take abnormally long, raw Goose

### Mitigation

Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected "respone_time". In this way, Goose is able to estimate the actual effect of a slowdown.
Goose attempts to mitigate Coordinated Omission by back-filling the metrics with the statistically expected requests. To do this, it tracks the normal "cadence" of each `GooseUser`, timing how long it takes to loop through all `GooseTasks` in the assigned `GooseTaskSet`. By default, Goose will trigger Coordinated Omission Mitigation if the time to loop through a `GooseTaskSet` takes more than twice as long as the average time of all previous loops. In this case, on the next loop through the `GooseTaskSet` when tracking the actual metrics for each subsequent request in all `GooseTasks` it will also add in statistically generated "requests" with a `response_time` starting at the unexpectedly long request time, then again with that `response_time` minus the normal "cadence", continuing to generate a metric then subtract the normal "cadence" until arriving at the expected `response_time`. In this way, Goose is able to estimate the actual effect of a slowdown.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

takes more than twice as long as the average time of all previous loops

I expected this to refer to the configured CO threshold, minimum, average, maximum?


If the `--requests-file` is enabled, requests back-filled by Coordinated Omission Mitigation show up in the generated log file, even though they were not actually sent to the server. In the following example, Coordinated Omission Mitigation was triggered when the server took 11,965 milliseconds to loop through all requests, instead of the average cadence of 3,162 milliseconds. This causes it to backfill a block of requests that statistically should have happened, with a `response_time` decreasing by the expected request cadence.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3,162 milliseconds

Typo? 3,161 milliseconds

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants