
Make latency impact of decision logs predictable #5724

Closed
mjungsbluth opened this issue Mar 4, 2023 · 11 comments · Fixed by #7446
Assignees: ashutosh-narkar
Labels: design, feature-request, investigating, monitoring, optimization

Comments

@mjungsbluth (Contributor)

What is the underlying problem you're trying to solve?

The current decision log plugin runs in two variants: unbounded (all decisions are kept) and bounded (decisions are discarded if the buffer overflows). The actual trade-off is between auditability and availability. However, in the unbounded case the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions into a single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we measured a p99 latency one order of magnitude higher with decision logs turned on than without.

Describe the ideal solution

If we change the trade-off to auditability vs. latency guarantees, a lock-free ring buffer with a fixed size could be used as an alternative to the existing solution. This would limit the memory used in both cases.
If auditability is favoured, offered chunks would be retried until they can be placed in the buffer (this creates back pressure and increases latency). If low latency is favoured, offered chunks that cannot be placed in the buffer can be discarded.
In both cases, this can be achieved without holding any locks.
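
To make the two modes concrete, here is a minimal sketch in Go, using a buffered channel as a stand-in for a true lock-free ring buffer; the names are illustrative and none of this is OPA's actual API:

```go
package decisionbuffer

// chunk stands in for an encoded decision-log chunk.
type chunk []byte

// boundedBuffer is a fixed-capacity buffer; a buffered channel is used here
// as a simple stand-in for the proposed lock-free ring buffer.
type boundedBuffer struct {
	ch chan chunk
}

func newBoundedBuffer(size int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan chunk, size)}
}

// offerBackpressure favours auditability: the caller blocks until space is
// available, trading extra request latency for never losing a decision.
func (b *boundedBuffer) offerBackpressure(c chunk) {
	b.ch <- c
}

// offerDrop favours latency: if the buffer is full, the chunk is discarded
// and the caller returns immediately.
func (b *boundedBuffer) offerDrop(c chunk) bool {
	select {
	case b.ch <- c:
		return true
	default:
		return false
	}
}
```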

@ashutosh-narkar ashutosh-narkar added design distribution Issues related to the bundle plugin optimization runtime and removed design distribution Issues related to the bundle plugin labels Mar 7, 2023
@ashutosh-narkar ashutosh-narkar added design monitoring Issues related to decision log and status plugins and removed runtime labels Mar 8, 2023
@ashutosh-narkar ashutosh-narkar self-assigned this Mar 9, 2023
@ashutosh-narkar ashutosh-narkar added the requires-investigation Issues not under active investigation but requires one label Mar 9, 2023
@ashutosh-narkar ashutosh-narkar added investigating Issues being actively investigated and removed requires-investigation Issues not under active investigation but requires one labels Apr 20, 2023
@ashutosh-narkar (Member)

@mjungsbluth currently, OPA assumes an unlimited buffer size, and users can then set one. If there was a ring buffer with a fixed size, how would that size be determined? Or do we make this a mandatory config option? It's not clear how we would select the "right" value.

> On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions into a single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we measured a p99 latency one order of magnitude higher with decision logs turned on than without.

Can you share more details about the setup, policy, etc.? I am asking because I ran a performance test and p99 with and without decision logging was pretty similar, so I wanted to understand a bit more here.

> However, in the unbounded case the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

You're right. What if we allowed the user to specify a memory threshold beyond which OPA starts dropping old logs until memory usage goes below the threshold? There are a few unknowns here, but the idea is that OPA keeps around more logs than the bounded approach without losing them all to an OOM scenario.

@mjungsbluth (Contributor, Author)

Hi @ashutosh-narkar,

I would in either case suggest going with something user-supplied first. As an example, if a buffer size is specified, an additional flag (something like overflow_strategy) could be set to "drop" (the current default behaviour) or "backpressure" to determine whether you would rather keep all logs and delay the decision. Another variant would be to return a 500/error if a maximum latency is reached (this would typically be more monitoring-friendly).
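
As a rough illustration only (overflow_strategy below is purely hypothetical and not an existing OPA option), the config could look something like:

```yaml
decision_logs:
  reporting:
    buffer_size_limit_bytes: 32768   # existing option: cap on buffered log size
    # hypothetical new option discussed above:
    # "drop" = current behaviour, "backpressure" = block until space is free
    overflow_strategy: backpressure
```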

The setup has been shared as part of our relationship with Styra, but we essentially saturated 50 OPA instances at 30,000 rps with a static "allow" rule. I have to check, but it is possible that only the 99.5th percentile was off, which at our scale would already be an issue for clients.

As for the ring buffer, even a standard Go channel with a maximum size is better than the mutex. It just looked suspicious that serialization and other caching happen inside the lock, rather than the lock only protecting the shared memory. The latency issue might also be outside the decision log plugin, but it was easy to reproduce by just turning the plugin off and on.
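
A rough sketch of that idea, assuming nothing about OPA's internals: the request path only hands the raw event to a buffered channel, and a single consumer goroutine does the JSON encoding and uploading, so no serialization happens under a lock or on the hot path:

```go
package decisionbuffer

import (
	"context"
	"encoding/json"
	"log"
)

// logEvent is a stand-in for OPA's decision log event type.
type logEvent struct {
	DecisionID string      `json:"decision_id"`
	Result     interface{} `json:"result"`
}

// startConsumer drains the buffer in the background, so JSON encoding
// (and any chunking/compression) never happens on the request path.
func startConsumer(ctx context.Context, events <-chan logEvent, upload func([]byte)) {
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case ev := <-events:
				b, err := json.Marshal(ev)
				if err != nil {
					log.Printf("encode decision log: %v", err)
					continue
				}
				upload(b)
			}
		}
	}()
}
```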

@Pushpalanka (Contributor) commented Aug 27, 2024

On a related note, sharing this additional information in case it helps.
This is from a 60s profile capture during a load test with a simple allow rule. It was observed that decision logging takes more than 50% of evaluation time.

[profile screenshot]

@ashutosh-narkar (Member)

We did a POC that implements a ring buffer based decision logger. Below are some end-to-end latency results. The tests use a very simple policy. The code is on this branch if y'all want to try it out. Any feedback would be great!

OPA v0.69.0

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 267.833µs
Latencies     [min, mean, 50, 90, 95, 99, max]  104.541µs, 689.407µs, 341.12µs, 504.51µs, 605.328µs, 14.288ms, 77.812ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

dev (ring buffer PoC branch)

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 435.583µs
Latencies     [min, mean, 50, 90, 95, 99, max]  115.75µs, 378.295µs, 352.412µs, 483.206µs, 524.698µs, 784.635µs, 20.485ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

@mjungsbluth (Contributor, Author)

@mefarazath can you try to incorporate the mentioned branch in the Skipper-based benchmarks you're currently running? Especially if you start measuring the p99.5 and p99.9 percentiles.

@mjungsbluth (Contributor, Author)

@ashutosh-narkar this looks very promising, especially the latency stability! In your tests, could you capture the p99.5 and p99.9 percentiles as well? This is where we could see the biggest difference, especially when pushing the instances to CPU saturation…

@mefarazath commented Nov 1, 2024

I tried the PoC using the Skipper OPA filter benchmarks outlined here, with both OPA v0.68.0 (the current version used in Skipper) and the PoC branch.

The 4 scenarios in the benchmark were:

  1. minimal -> Simple allow := true policy
  2. with-body -> allow with request body parameter
  3. jwt-validation -> allow with JWT validation
  4. policy-with-data-lookup -> allow with a datasource lookup (a PR is pending to add this to skipper repo)

Noticed a significant improvement in p99, p99.5, p99.9 (in addition to average/mean) compared to v0.68.0.
So this looks promising indeed :)

$ benchstat v0.68.0_1s.txt fix_1s.txt
goos: darwin
goarch: arm64
pkg: github.com/zalando/skipper/filters/openpolicyagent/opaauthorizerequest
cpu: Apple M1 Max
                                                                     │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                                     │     sec/op     │    sec/op     vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                           54.83µ ±  3%   12.26µ ±  2%  -77.64% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P50-10                       118.9µ ±  3%   102.1µ ±  3%  -14.15% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99-10                     15321.6µ ±  1%   455.7µ ±  1%  -97.03% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.5-10                   15716.0µ ±  2%   547.4µ ±  1%  -96.52% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.9-10                   16828.7µ ± 15%   891.5µ ±  6%  -94.70% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                         56.94µ ±  2%   25.92µ ±  9%  -54.48% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P50-10                     166.4µ ±  1%   235.6µ ±  5%  +41.63% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99-10                   14108.5µ ±  2%   767.0µ ± 13%  -94.56% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.5-10                 14611.9µ ±  2%   937.6µ ± 15%  -93.58% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.9-10                  15.808m ±  3%   1.629m ± 17%  -89.70% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10                    72.04µ ±  3%   22.39µ ±  3%  -68.92% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P50-10                215.6µ ±  3%   191.8µ ±  6%  -11.05% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99-10              15629.4µ ±  1%   661.1µ ±  4%  -95.77% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.5-10            16062.0µ ±  2%   787.9µ ±  5%  -95.09% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.9-10             20.174m ±  4%   1.336m ±  8%  -93.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10           72.18µ ±  4%   30.76µ ±  2%  -57.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P50-10       274.1µ ±  7%   274.3µ ±  2%        ~ (p=0.565 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99-10     14347.6µ ±  1%   809.3µ ±  4%  -94.36% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.5-10   15226.4µ ±  2%   944.0µ ±  4%  -93.80% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.9-10    18.926m ±  4%   1.452m ± 12%  -92.33% (p=0.000 n=10)
geomean                                                                  2.168m         307.1µ        -85.83%

                                                               │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                               │      B/op      │     B/op      vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                     79.42Ki ± 0%   54.52Ki ± 0%  -31.35% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                   88.54Ki ± 0%   62.43Ki ± 0%  -29.49% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10             119.18Ki ± 0%   89.60Ki ± 0%  -24.82% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10    123.89Ki ± 0%   96.63Ki ± 0%  -22.00% (p=0.000 n=10)
geomean                                                            100.9Ki        73.68Ki       -27.01%

                                                               │ v0.68.0_1s.txt │             fix_1s.txt              │
                                                               │   allocs/op    │  allocs/op   vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                      1.339k ± 0%   1.055k ± 0%  -21.18% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                    1.399k ± 0%   1.096k ± 0%  -21.66% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10               1.806k ± 0%   1.520k ± 0%  -15.86% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10      2.529k ± 0%   2.217k ± 0%  -12.34% (p=0.000 n=10)
geomean                                                             1.710k        1.405k       -17.85%

Dumping my benchstat results here in case anyone is interested.

Note: Percentile measurement was introduced in the skipper benchmarks with a PR (yet to be merged).

@adetunjiakintundeakinde commented Nov 29, 2024

@ashutosh-narkar I wanted to say that I noticed a significant improvement in latencies and decision logging when using the ring-buffer PoC. I used a policy and ran the system at 2.5k RPS (requests per second) over a 2-minute test.

Without ring buffer implementation

Requests [total, rate, throughput] 359308, 2847.43, 82.54
Duration [total, attack, wait] 2m34s, 2m6s, 27.404s
Latencies [min, mean, 50, 90, 95, 99, max] 122.375µs, 37.885s, 30.575s, 1m0s, 1m1s, 1m17s, 2m2s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 3.53%
Status Codes [code:count] 0:346631 200:12677

Using the ring buffer implementation

Requests [total, rate, throughput] 360001, 3000.01, 2999.96
Duration [total, attack, wait] 2m0s, 2m0s, 1.542ms
Latencies [min, mean, 50, 90, 95, 99, max] 226.833µs, 13.918ms, 1.353ms, 7.088ms, 18.205ms, 362.479ms, 3.015s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 100.00%
Status Codes [code:count] 0:1 200:360000

@sspaink (Contributor) commented Mar 17, 2025

@mjungsbluth a PR has been made to add the new event buffer: #7446
It behaves the same way as the POC with some implementation differences and adds a way to configure the new buffer.

If you have time could you give the PR a look and make sure it will work for you? Any extra review/testing would be greatly appreciated. Thanks!

@mefarazath

> @mjungsbluth a PR has been made to add the new event buffer: #7446 It behaves the same way as the POC with some implementation differences and adds a way to configure the new buffer.
>
> If you have time could you give the PR a look and make sure it will work for you? Any extra review/testing would be greatly appreciated. Thanks!

I did a quick run pointing to the PR using the benchmark. The results look promising (old = buffer type 'size', new = buffer type 'event').

goos: darwin
goarch: arm64
pkg: github.com/zalando/skipper/filters/openpolicyagent/opaauthorizerequest
cpu: Apple M1 Max
                                                    │      old      │                 new                  │
                                                    │    sec/op     │    sec/op     vs base                │
WithPercentiles/minimal-with-decision-logs-10          52.354µ ± 1%   8.887µ ±  6%  -83.03% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P50-10      112.02µ ± 2%   59.31µ ±  0%  -47.05% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P95-10       702.5µ ± 2%   220.5µ ± 14%  -68.61% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P99-10     15469.8µ ± 1%   543.8µ ± 12%  -96.48% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P99.9-10    16.143m ± 1%   1.499m ± 17%  -90.72% (p=0.000 n=10)
geomean                                                 1.006m        156.8µ        -84.41%

                                              │     old      │                 new                  │
                                              │     B/op     │     B/op      vs base                │
WithPercentiles/minimal-with-decision-logs-10   68.50Ki ± 0%   44.69Ki ± 0%  -34.76% (p=0.000 n=10)

                                              │     old     │                new                 │
                                              │  allocs/op  │ allocs/op   vs base                │
WithPercentiles/minimal-with-decision-logs-10   1084.0 ± 0%   815.0 ± 0%  -24.82% (p=0.000 n=10)

In benchmark tests, to avoid the impact of decision log upload, I usually set reporting.min_delay_seconds to a high value (like 300 secs). This worked fine with buffer_type size.
With buffer_type event, I noticed that unless I set buffer_size_limit_events to a higher value, the benchmark would hang (I am sure this has something to do with the way my benchmark test is set up).

  				"reporting": {
					"buffer_type": "event",
					"buffer_size_limit_events": 1000000,	
					"min_delay_seconds": 300,
					"max_delay_seconds": 600				
				}

@sspaink what is the expected behaviour if buffer_size_limit_events is exceeded before min_delay_seconds is reached (i.e. the limit for events is exceeded before the decision upload happens)?

@sspaink (Contributor) commented Mar 18, 2025

@mefarazath once buffer_size_limit_events is reached, the oldest event is dropped to make room for the new incoming event. The reasoning behind this default is that it matches the behaviour of the current buffer when buffer_size_limit_bytes is set, which also drops events when the limit is reached. The current default size is too low; it is only 100 at the moment.
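
For anyone reasoning about that behaviour, the drop-oldest semantics can be sketched with a buffered channel (illustrative only, assuming a single producer; this is not the actual code in #7446):

```go
package decisionbuffer

// event is a stand-in for an encoded decision-log event.
type event []byte

// push adds ev to a bounded channel, evicting the oldest buffered event
// when the channel is full, so the newest events are always retained.
func push(buf chan event, ev event) {
	for {
		select {
		case buf <- ev:
			return
		default:
			// Buffer full: drop the oldest event and try again.
			select {
			case <-buf:
			default:
			}
		}
	}
}
```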

I created a separate ticket to track adding support to never drop events, which will be in a separate PR: #7454
