Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter #6028

Merged: 2 commits into uber:master, May 24, 2024

Conversation

@Groxx (Contributor) commented on May 18, 2024:

This commit largely exists to break this out for separate review, as it's relatively simple and isolated.

In a later PR, this will be used by the "limiter"-side logic to track and enforce usage of a ratelimit. At a high level it does only a few things:

  1. Tracks calls to Allow() so we can find out how much a limit is actually used (so we can distribute the allowances fairly)
  2. Updates ratelimit rates when requested
  3. Automatically falls back to a different ratelimit if the system seems unhealthy

All of which is done in an atomic-like way so concurrent operations do not block each other, as true precision is not necessary. It only has to be "good enough", and performance is more important than precision.
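
A minimal sketch of that shape, with hypothetical names (the real type lives in common/quotas/global/collection/internal/limiter.go and differs in detail):

```go
package internal

import (
	"sync/atomic"

	"golang.org/x/time/rate"
)

// usageTrackingLimiter is a hypothetical sketch of the behavior described above:
// count Allow() outcomes, accept rate updates, and fall back to a secondary
// limiter when updates stop arriving.
type usageTrackingLimiter struct {
	primary  *rate.Limiter
	fallback *rate.Limiter

	allowed atomic.Int64 // requests allowed since the last Collect()
	denied  atomic.Int64 // requests denied since the last Collect()

	// consecutive failed/missed updates; past a threshold, the fallback is used.
	failedUpdates atomic.Int64
}

// assumed threshold for "the system seems unhealthy" in this sketch
const maxFailedUpdates = 3

// Allow counts usage and delegates to whichever limiter is currently trusted.
func (l *usageTrackingLimiter) Allow() bool {
	lim := l.primary
	if l.failedUpdates.Load() > maxFailedUpdates {
		lim = l.fallback // primary data is too stale, use the local fallback
	}
	allowed := lim.Allow()
	if allowed {
		l.allowed.Add(1)
	} else {
		l.denied.Add(1)
	}
	return allowed
}

// Update applies a freshly computed rate and marks the limiter healthy again.
func (l *usageTrackingLimiter) Update(lim rate.Limit) {
	l.failedUpdates.Store(0)
	l.primary.SetLimit(lim)
	l.primary.SetBurst(int(lim)) // limit == burst, as noted above
}

// FailedUpdate records that an expected update did not arrive.
func (l *usageTrackingLimiter) FailedUpdate() {
	l.failedUpdates.Add(1)
}

// Collect returns and resets usage counts, and reports whether the fallback is in use.
func (l *usageTrackingLimiter) Collect() (allowed, denied int64, usingFallback bool) {
	return l.allowed.Swap(0), l.denied.Swap(0), l.failedUpdates.Load() > maxFailedUpdates
}
```

Per-call atomics rather than a single lock matches the "good enough over precise" tradeoff described in the last paragraph.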

@Groxx Groxx changed the title Global ratelimiter helper: usage-tracking fallback-capable rate.Limit Global ratelimiter helper: usage-tracking fallback-capable rate.Limiter May 18, 2024
@Groxx Groxx marked this pull request as ready for review May 18, 2024 21:21

codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.32%. Comparing base (02c7efb) to head (6369bc2).
Report is 24 commits behind head on master.

Additional details and impacted files
Files Coverage Δ
...ommon/quotas/global/collection/internal/limiter.go 100.00% <100.00%> (ø)

... and 21 files with indirect coverage changes


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@Groxx (Contributor, Author) commented on May 21, 2024:

This might change if we go with #6030, e.g. it could be implemented as a wrapper, and then it could collect accurate reserve->cancel metrics too.

I'm not sure that's currently necessary - I believe all our reserve->cancel logic uses per-host limiters as the first level, rather than something intended to be global. But that does seem like something we will eventually change, so it'd be risky to leave it without at least documenting.
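
As a rough sketch of that wrapper idea (hypothetical names, not #6030's actual design), a wrapper can return its own reservation type so cancellations become observable:

```go
package internal

import (
	"sync/atomic"

	"golang.org/x/time/rate"
)

// countedLimiter is a hypothetical wrapper showing how reserve->cancel pairs
// could be observed, which plain Allow() tracking cannot see.
type countedLimiter struct {
	wrapped  *rate.Limiter
	canceled atomic.Int64 // reservations that were returned unused
}

// countedReservation wraps *rate.Reservation so Cancel() can be counted.
type countedReservation struct {
	*rate.Reservation
	canceled *atomic.Int64
}

func (l *countedLimiter) Reserve() *countedReservation {
	return &countedReservation{
		Reservation: l.wrapped.Reserve(),
		canceled:    &l.canceled,
	}
}

// Cancel returns the token to the wrapped limiter and records that it went unused.
func (r *countedReservation) Cancel() {
	r.Reservation.Cancel()
	r.canceled.Add(1)
}
```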

// - this limit is legitimately unused and no data exists for it
// - ring re-sharding led to losing track of the limit, and it is now owned by a host with insufficient data
//
// To mitigate the impact of the second case, "insufficient data" responses from aggregating hosts
Contributor:

Perhaps it's worth adding some metrics tracking when we fall back.

@Groxx (Contributor, Author) replied:

Yeah, I'm planning on metrics. That'll be outside these objects though, and it's why Collect() returns the "use fallback" bool.

Contributor:

If there is a transition from non-fallback to fallback or vice versa, the counter will be inaccurate.

@Groxx (Contributor, Author) replied on May 24, 2024:

If you mean the limiting behavior will be inaccurate, e.g. it'll allow / deny more requests than it should:

Yes, but we don't have any great way to resolve that. *rate.Limiter doesn't allow directly manipulating its tokens, and there isn't even a true "correct" behavior here, because it implies we don't know what correct would be - we haven't been able to sync with other instances.

It'll only be incorrect for one "time to fill the ratelimiter's burstable tokens" period though, which is currently 1 second (limit == burst in basically all of our uses). Generally speaking that'd mean ~2x overage at worst, assuming all hosts switched at the same time.


I could try to empty the tokens when switching, which would be safer for us because it wouldn't allow exceeding either limit, but we don't have an efficient API for that - ideally it'd get the token count and e.g. binary-search Allow(n) to empty most or all of them, since it's racing with other calls. Right now it'd need on the order of thousands of calls per limiter to drain, which is definitely excessive.

The #6030 wrapper could do this relatively precisely and efficiently, as it can lock everything while draining. If we think 1 second of overage is worth preventing, I suspect it'd be relatively easy to add a "drain the limiter" API for this kind of purpose. I'm... ~50% "yes, worth it"? Can definitely be convinced to build that.
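
A rough sketch of that "get the token count and retry Allow(n)" idea, using only the public API (hypothetical helper, a simple retry loop rather than a binary search; it is still racy as described above):

```go
package internal

import (
	"time"

	"golang.org/x/time/rate"
)

// drainApprox is a hypothetical helper: approximately empty a limiter using only
// the public rate.Limiter API. It races with concurrent Allow() calls, so it may
// leave a few tokens behind or retry several times before succeeding.
func drainApprox(l *rate.Limiter) {
	for {
		n := int(l.Tokens())
		if n <= 0 {
			return // nothing meaningful left to take
		}
		// Try to take everything we saw; if a concurrent caller took some tokens
		// first, AllowN fails and we re-read the (smaller) count and retry.
		if l.AllowN(time.Now(), n) {
			return
		}
	}
}
```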

@Groxx (Contributor, Author) added on May 24, 2024:

tl;dr from an in-person discussion, clarifying some of the surroundings here:

A layer above this, which holds all these individual tracking-and-fallback-capable limiters, will periodically call Collect(), and it'll emit a counter for "using fallback" so we can monitor how often fallbacks are used.

but "uses fallback" should only occur in practice:

  • temporarily after startup / when a key is first used (no data, must use fallback)
    • this might be worth tracking separately; the failed-update counter can show when this is happening (negative value)
  • when the global limiter is disabled and rejects all requests (falling back is desired here)
  • when something cataclysmic is happening and global limit data is being lost rapidly, e.g. many host crashes -> the ring changes rapidly -> hosts do not return weights before crashing too (falling back is desirable here too, as the old values are too old to trust)

So the sustained existence of fallback-using limiters when we expect the global ratelimiter to be working is a concern, but individual "failed update" events or temporary use are not.

The exact metrics emitted will come in a later diff; that all happens outside of this object.
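
For illustration, a hypothetical sketch of that collection loop, reusing the usageTrackingLimiter shape from the earlier sketch (real metric names and plumbing are in the later diff, not here):

```go
// reportUsage is a hypothetical sketch of the periodic collection described above:
// gather usage from every tracked limiter and emit one counter for how many are
// currently relying on their fallback.
func reportUsage(limiters map[string]*usageTrackingLimiter, emit func(name string, value int64)) {
	var usingFallback int64
	for key, lim := range limiters {
		allowed, denied, fellBack := lim.Collect()
		emit("limiter.allowed."+key, allowed)
		emit("limiter.denied."+key, denied)
		if fellBack {
			usingFallback++
		}
	}
	// Sustained non-zero values are the concerning signal; transient use after
	// startup or during large ring changes is expected.
	emit("limiters.using_fallback", usingFallback)
}
```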

@Groxx Groxx merged commit 0aac302 into uber:master May 24, 2024
20 checks passed
@Groxx Groxx deleted the ratelimiter-limiter branch May 24, 2024 22:21