
Make latency impact of decision logs predictable #5724

Closed
mjungsbluth opened this issue Mar 4, 2023 · 11 comments · Fixed by #7446
Assignees: ashutosh-narkar
Labels: design, feature-request, investigating, monitoring, optimization

Comments

@mjungsbluth (Contributor)

What is the underlying problem you're trying to solve?

The current decision log plugin runs in two variants: unbounded (all decisions are kept) and bounded (decisions are discarded if the buffer overflows). The actual trade-off is between auditability and availability. However, in the unbounded case the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions into a single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we measured a p99 latency one order of magnitude higher with decision logs turned on than without.

Describe the ideal solution

If we change the trade-off to auditability vs. latency guarantees, a lock-free ring buffer with a fixed size could be used as an alternative to the existing solution. This would limit the memory used in both cases.
If auditability is favoured, offered chunks would be retried until they can be placed in the buffer (this creates back pressure and increases latency). If low latency is favoured, offered chunks that cannot be placed in the buffer can be discarded.
In both cases, this can be achieved without holding any locks.
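
To make the two modes concrete, here is a minimal sketch in Go, using a buffered channel as a stand-in for a true lock-free ring buffer; the names are illustrative and none of this is OPA's actual API:

```go
package decisionbuffer

// chunk stands in for an encoded decision-log chunk.
type chunk []byte

// boundedBuffer is a fixed-capacity buffer; a buffered channel is used here
// as a simple stand-in for the proposed lock-free ring buffer.
type boundedBuffer struct {
	ch chan chunk
}

func newBoundedBuffer(size int) *boundedBuffer {
	return &boundedBuffer{ch: make(chan chunk, size)}
}

// offerBackpressure favours auditability: the caller blocks until space is
// available, trading extra request latency for never losing a decision.
func (b *boundedBuffer) offerBackpressure(c chunk) {
	b.ch <- c
}

// offerDrop favours latency: if the buffer is full, the chunk is discarded
// and the caller returns immediately.
func (b *boundedBuffer) offerDrop(c chunk) bool {
	select {
	case b.ch <- c:
		return true
	default:
		return false
	}
}
```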

@ashutosh-narkar ashutosh-narkar added design distribution Issues related to the bundle plugin optimization runtime and removed design distribution Issues related to the bundle plugin labels Mar 7, 2023
@ashutosh-narkar ashutosh-narkar added design monitoring Issues related to decision log and status plugins and removed runtime labels Mar 8, 2023
@ashutosh-narkar ashutosh-narkar self-assigned this Mar 9, 2023
@ashutosh-narkar ashutosh-narkar added the requires-investigation Issues not under active investigation but requires one label Mar 9, 2023
@ashutosh-narkar ashutosh-narkar added investigating Issues being actively investigated and removed requires-investigation Issues not under active investigation but requires one labels Apr 20, 2023
@ashutosh-narkar (Member)

@mjungsbluth currently, OPA assumes an unlimited buffer size, and users can then set one. If there was a ring buffer with a fixed size, how would that size be determined? Or do we make this a mandatory config option? It's not clear how we would select the "right" value.

> On top of that, there are quite heavy locks in the decision log plugin that force, for example, the encoding of decisions into a single file. When measuring the raw performance of a fleet of OPAs (~50 instances at 30,000 rps), we measured a p99 latency one order of magnitude higher with decision logs turned on than without.

Can you share more details about the setup, policy, etc.? I am asking because I ran a performance test and p99 with and without decision logging was pretty similar, so I wanted to understand a bit more here.

> However, in the unbounded case the likelihood of an OOM kill grows if the decision log API gets overloaded, so decisions are still lost.

You're right. What if we allowed the user to specify a memory threshold beyond which OPA starts dropping old logs until memory usage goes below the threshold? There are a few unknowns here, but the idea is that OPA keeps around more logs than the bounded approach without losing them all to an OOM scenario.

@mjungsbluth (Contributor, Author)

Hi @ashutosh-narkar,

I would in either case suggest going with something user-supplied first. As an example, if a buffer size is specified, an additional flag (something like overflow_strategy) could be set to "drop" (the current default behaviour) or "backpressure" to determine whether you would rather keep all logs and delay the decision. Another variant would be to return a 500/error if a maximum latency is reached (this would typically be more monitoring-friendly).
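
As a rough illustration only (overflow_strategy below is purely hypothetical and not an existing OPA option), the config could look something like:

```yaml
decision_logs:
  reporting:
    buffer_size_limit_bytes: 32768   # existing option: cap on buffered log size
    # hypothetical new option discussed above:
    # "drop" = current behaviour, "backpressure" = block until space is free
    overflow_strategy: backpressure
```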

The setup has been shared as part of our relationship with Styra, but we essentially saturated 50 OPA instances at 30,000 rps with a static "allow" rule. I have to check, but it is possible that only the 99.5th percentile was off, which at our scale would already be an issue for clients.

As for the ring buffer, even a standard Go channel with a maximum size is better than the mutex. It just looked suspicious that serialization and other caching happen inside the lock, rather than the lock only protecting the shared memory. The latency issue might also be outside the decision log plugin, but it was easy to reproduce by just turning the plugin off and on.
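
A rough sketch of that idea, assuming nothing about OPA's internals: the request path only hands the raw event to a buffered channel, and a single consumer goroutine does the JSON encoding and uploading, so no serialization happens under a lock or on the hot path:

```go
package decisionbuffer

import (
	"context"
	"encoding/json"
	"log"
)

// logEvent is a stand-in for OPA's decision log event type.
type logEvent struct {
	DecisionID string      `json:"decision_id"`
	Result     interface{} `json:"result"`
}

// startConsumer drains the buffer in the background, so JSON encoding
// (and any chunking/compression) never happens on the request path.
func startConsumer(ctx context.Context, events <-chan logEvent, upload func([]byte)) {
	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case ev := <-events:
				b, err := json.Marshal(ev)
				if err != nil {
					log.Printf("encode decision log: %v", err)
					continue
				}
				upload(b)
			}
		}
	}()
}
```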

@Pushpalanka (Contributor) commented Aug 27, 2024

On a related note, sharing this additional information in case it helps.
This is from a 60s profile capture during a load test with a simple allow rule. It was observed that decision logging takes more than 50% of evaluation time.

[profile screenshot]

@ashutosh-narkar (Member)

We did a POC that implements a ring buffer based decision logger. Below are some end-to-end latency results. The tests use a very simple policy. The code is on this branch if y'all want to try it out. Any feedback would be great!

OPA v0.69.0

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 267.833µs
Latencies     [min, mean, 50, 90, 95, 99, max]  104.541µs, 689.407µs, 341.12µs, 504.51µs, 605.328µs, 14.288ms, 77.812ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

dev (ring buffer PoC branch)

$ echo 'POST http://localhost:8181/v1/data/foo/allow' | vegeta attack --duration=300s -rate=500 | tee results.bin | vegeta report
Requests      [total, rate, throughput]         150000, 500.00, 500.00
Duration      [total, attack, wait]             5m0s, 5m0s, 435.583µs
Latencies     [min, mean, 50, 90, 95, 99, max]  115.75µs, 378.295µs, 352.412µs, 483.206µs, 524.698µs, 784.635µs, 20.485ms
Bytes In      [total, mean]                     23700000, 158.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:150000
Error Set:

@mjungsbluth (Contributor, Author)

@mefarazath can you try to incorporate the mentioned branch in the Skipper-based benchmarks you're currently running? Especially if you start measuring the p99.5 and p99.9 percentiles.

@mjungsbluth (Contributor, Author)

@ashutosh-narkar this looks very promising, especially the latency stability! In your tests, could you capture the p99.5 and p99.9 percentiles as well? This is where we could see the biggest difference, especially when pushing the instances to CPU saturation…

@mefarazath commented Nov 1, 2024

I tried the PoC using the Skipper OPA filter benchmarks outlined here, with both OPA v0.68.0 (the current version used in Skipper) and the PoC branch.

The 4 scenarios in the benchmark were:

  1. minimal -> Simple allow := true policy
  2. with-body -> allow with request body parameter
  3. jwt-validation -> allow with JWT validation
  4. policy-with-data-lookup -> allow with a datasource lookup (a PR is pending to add this to skipper repo)

Noticed a significant improvement in p99, p99.5, p99.9 (in addition to average/mean) compared to v0.68.0.
So this looks promising indeed :)

$ benchstat v0.68.0_1s.txt fix_1s.txt
goos: darwin
goarch: arm64
pkg: github.com/zalando/skipper/filters/openpolicyagent/opaauthorizerequest
cpu: Apple M1 Max
                                                                     │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                                     │     sec/op     │    sec/op     vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                           54.83µ ±  3%   12.26µ ±  2%  -77.64% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P50-10                       118.9µ ±  3%   102.1µ ±  3%  -14.15% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99-10                     15321.6µ ±  1%   455.7µ ±  1%  -97.03% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.5-10                   15716.0µ ±  2%   547.4µ ±  1%  -96.52% (p=0.000 n=10)
AuthorizeRequest/minimal/with-decision-logs/P99.9-10                   16828.7µ ± 15%   891.5µ ±  6%  -94.70% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                         56.94µ ±  2%   25.92µ ±  9%  -54.48% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P50-10                     166.4µ ±  1%   235.6µ ±  5%  +41.63% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99-10                   14108.5µ ±  2%   767.0µ ± 13%  -94.56% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.5-10                 14611.9µ ±  2%   937.6µ ± 15%  -93.58% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs/P99.9-10                  15.808m ±  3%   1.629m ± 17%  -89.70% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10                    72.04µ ±  3%   22.39µ ±  3%  -68.92% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P50-10                215.6µ ±  3%   191.8µ ±  6%  -11.05% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99-10              15629.4µ ±  1%   661.1µ ±  4%  -95.77% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.5-10            16062.0µ ±  2%   787.9µ ±  5%  -95.09% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs/P99.9-10             20.174m ±  4%   1.336m ±  8%  -93.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10           72.18µ ±  4%   30.76µ ±  2%  -57.38% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P50-10       274.1µ ±  7%   274.3µ ±  2%        ~ (p=0.565 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99-10     14347.6µ ±  1%   809.3µ ±  4%  -94.36% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.5-10   15226.4µ ±  2%   944.0µ ±  4%  -93.80% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs/P99.9-10    18.926m ±  4%   1.452m ± 12%  -92.33% (p=0.000 n=10)
geomean                                                                  2.168m         307.1µ        -85.83%

                                                               │ v0.68.0_1s.txt │              fix_1s.txt              │
                                                               │      B/op      │     B/op      vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                     79.42Ki ± 0%   54.52Ki ± 0%  -31.35% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                   88.54Ki ± 0%   62.43Ki ± 0%  -29.49% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10             119.18Ki ± 0%   89.60Ki ± 0%  -24.82% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10    123.89Ki ± 0%   96.63Ki ± 0%  -22.00% (p=0.000 n=10)
geomean                                                            100.9Ki        73.68Ki       -27.01%

                                                               │ v0.68.0_1s.txt │             fix_1s.txt              │
                                                               │   allocs/op    │  allocs/op   vs base                │
AuthorizeRequest/minimal/with-decision-logs-10                      1.339k ± 0%   1.055k ± 0%  -21.18% (p=0.000 n=10)
AuthorizeRequest/with-body/with-decision-logs-10                    1.399k ± 0%   1.096k ± 0%  -21.66% (p=0.000 n=10)
AuthorizeRequest/jwt-validation/with-decision-logs-10               1.806k ± 0%   1.520k ± 0%  -15.86% (p=0.000 n=10)
AuthorizeRequest/policy-with-data-lookup/with-decision-logs-10      2.529k ± 0%   2.217k ± 0%  -12.34% (p=0.000 n=10)
geomean                                                             1.710k        1.405k       -17.85%

Dumping my benchstat results here in case anyone is interested.

Note: Percentile measurement was introduced in the skipper benchmarks with a PR (yet to be merged).

@adetunjiakintundeakinde commented Nov 29, 2024

@ashutosh-narkar I wanted to say that I noticed a significant improvement in latencies and decision logging when using the ring-buffer PoC. I used a policy and ran the system at 2.5k RPS (requests per second) over a 2-minute test.

Without ring buffer implementation

Requests [total, rate, throughput] 359308, 2847.43, 82.54
Duration [total, attack, wait] 2m34s, 2m6s, 27.404s
Latencies [min, mean, 50, 90, 95, 99, max] 122.375µs, 37.885s, 30.575s, 1m0s, 1m1s, 1m17s, 2m2s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 3.53%
Status Codes [code:count] 0:346631 200:12677

Using the ring buffer implementation

Requests [total, rate, throughput] 360001, 3000.01, 2999.96
Duration [total, attack, wait] 2m0s, 2m0s, 1.542ms
Latencies [min, mean, 50, 90, 95, 99, max] 226.833µs, 13.918ms, 1.353ms, 7.088ms, 18.205ms, 362.479ms, 3.015s
Bytes In [total, mean] 0, 0.00
Bytes Out [total, mean] 0, 0.00
Success [ratio] 100.00%
Status Codes [code:count] 0:1 200:360000

@sspaink (Contributor) commented Mar 17, 2025

@mjungsbluth a PR has been made to add the new event buffer: #7446
It behaves the same way as the POC with some implementation differences and adds a way to configure the new buffer.

If you have time could you give the PR a look and make sure it will work for you? Any extra review/testing would be greatly appreciated. Thanks!

@mefarazath

> @mjungsbluth a PR has been made to add the new event buffer: #7446 It behaves the same way as the POC with some implementation differences and adds a way to configure the new buffer.
>
> If you have time could you give the PR a look and make sure it will work for you? Any extra review/testing would be greatly appreciated. Thanks!

I did a quick run pointing to the PR using the benchmark. The results look promising (old = buffer type 'size', new = buffer type 'event').

goos: darwin
goarch: arm64
pkg: github.com/zalando/skipper/filters/openpolicyagent/opaauthorizerequest
cpu: Apple M1 Max
                                                    │      old      │                 new                  │
                                                    │    sec/op     │    sec/op     vs base                │
WithPercentiles/minimal-with-decision-logs-10          52.354µ ± 1%   8.887µ ±  6%  -83.03% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P50-10      112.02µ ± 2%   59.31µ ±  0%  -47.05% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P95-10       702.5µ ± 2%   220.5µ ± 14%  -68.61% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P99-10     15469.8µ ± 1%   543.8µ ± 12%  -96.48% (p=0.000 n=10)
WithPercentiles/minimal-with-decision-logs/P99.9-10    16.143m ± 1%   1.499m ± 17%  -90.72% (p=0.000 n=10)
geomean                                                 1.006m        156.8µ        -84.41%

                                              │     old      │                 new                  │
                                              │     B/op     │     B/op      vs base                │
WithPercentiles/minimal-with-decision-logs-10   68.50Ki ± 0%   44.69Ki ± 0%  -34.76% (p=0.000 n=10)

                                              │     old     │                new                 │
                                              │  allocs/op  │ allocs/op   vs base                │
WithPercentiles/minimal-with-decision-logs-10   1084.0 ± 0%   815.0 ± 0%  -24.82% (p=0.000 n=10)

In benchmark tests, to avoid the impact of decision log upload, I usually set reporting.min_delay_seconds to a high value (like 300 secs). This worked fine with buffer_type size.
With buffer_type event, I noticed that unless I set buffer_size_limit_events to a higher value, the benchmark would hang (I am sure this has something to do with the way my benchmark test is set up).

  				"reporting": {
					"buffer_type": "event",
					"buffer_size_limit_events": 1000000,	
					"min_delay_seconds": 300,
					"max_delay_seconds": 600				
				}

@sspaink what is the expected behaviour if buffer_size_limit_events is exceeded before min_delay_seconds is reached (i.e. the limit for events is exceeded before the decision upload happens)?

@sspaink (Contributor) commented Mar 18, 2025

@mefarazath once buffer_size_limit_events is reached, the oldest event is dropped to make room for the new incoming event. The reasoning behind this default is that it matches the behaviour of the current buffer when buffer_size_limit_bytes is set, which also drops events when the limit is reached. The current default size is too low; it is only 100 at the moment.
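
For anyone reasoning about that behaviour, the drop-oldest semantics can be sketched with a buffered channel (illustrative only, assuming a single producer; this is not the actual code in #7446):

```go
package decisionbuffer

// event is a stand-in for an encoded decision-log event.
type event []byte

// push adds ev to a bounded channel, evicting the oldest buffered event
// when the channel is full, so the newest events are always retained.
func push(buf chan event, ev event) {
	for {
		select {
		case buf <- ev:
			return
		default:
			// Buffer full: drop the oldest event and try again.
			select {
			case <-buf:
			default:
			}
		}
	}
}
```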

I created a separate ticket to track adding support to never drop events, which will be in a separate PR: #7454
