Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add task isolation leaking #6544

Merged
merged 1 commit into from
Dec 13, 2024
Merged

Conversation

natemort
Copy link
Member

@natemort natemort commented Dec 5, 2024

What changed?
Add taskIsolationDuration, the time during which tasks will only be routed to pollers belonging to the same isolation group as them*.

Additionally replace the hardcoded constants of 1 minute and 10s for startup flexibilility and recent poller lookback respectively with a single value: TaskIsolationPollerWindow.

For the sake of measuring impact we now include the original partition group within the partition config prior to dispatching the task. This value isn't persisted within the tasks table and is only necessary so that a task that is forwarded after abandoning isolation will emit the correct metric. No behavior depends on this value, it is only used in metrics.

Add integration tests for tasklist isolation, as well as unit tests for getIsolationGroupForTask.

Update the matching simulator to include this new value.

Why?

  • Adding a failsafe for tasklist isolation to ensure latency doesn't exceed a configurable threshold.

How did you test it?

  • Unit, integration, and simulation tests

Potential risks

  • This is a major change to tasklist isolation, but it's guarded behind a feature flag. Within Uber we've broadly disabled this feature and we're unaware of any major usage outside of Uber.

Release notes

Documentation Changes

service/matching/tasklist/task.go Show resolved Hide resolved
service/matching/tasklist/task_list_manager_test.go Outdated Show resolved Hide resolved
service/matching/tasklist/task_list_manager.go Outdated Show resolved Hide resolved
service/matching/tasklist/task_list_manager_test.go Outdated Show resolved Hide resolved
@natemort natemort force-pushed the isolation_leak branch 2 times, most recently from 42d3c1c to 75c58b0 Compare December 12, 2024 20:51
Add taskIsolationDuration, the time during which tasks will only be routed to pollers belonging to the same isolation group as them*.

Additionally replace the hardcoded constants of 1 minute and 10s for startup flexibilility and recent poller lookback respectively with a single value: TaskIsolationPollerWindow.

For the sake of measuring impact we now include the original partition group within the partition config prior to dispatching the task. This value isn't persisted within the tasks table and is only necessary so that a task that is forwarded after abandoning isolation will emit the correct metric. No behavior depends on this value, it is only used in metrics.

Add integration tests for tasklist isolation, as well as unit tests for getIsolationGroupForTask.

Update the matching simulator to include this new value.
@Shaddoll Shaddoll merged commit 6f41871 into cadence-workflow:master Dec 13, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants