
Process events instantly and consistently, stop skipping the events due to "batching" #844

Draft
wants to merge 4 commits into
base: main

Conversation

@nolar nolar (Owner) commented Oct 3, 2021

TL;DR: Process the events/changes ASAP, with no artificial delays (not even 0.1s); stop skipping some of the events under high load; prevent double-execution when 3rd parties also patch the resources.

Background

A long time ago, at the dawn of Kopf (5a0b2da, #42, #43), the events arriving for a resource object were packed into "batches": only the last event that arrived within a 0.1-second window was processed; all preceding events were ignored.

Originally, this was done to address an issue in the official Kubernetes client (which was later replaced by pykube-ng, which in turn was later replaced by raw HTTP queries via aiohttp).

Besides, though not documented, this short time window ensured consistency in some rare cases when a resource was patched by 3rd parties while being processed by Kopf: once Kopf performed its patch, it almost instantly got both the event from the 3rd party and the event from itself, and processed only the latest state, which carried its own annotations.
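For context, here is a minimal sketch of that debounce-like batching idea (purely illustrative, not Kopf's actual code; the queue and the processing routine are placeholders):

```python
import asyncio

BATCH_WINDOW = 0.1  # seconds; the default batching window

async def process_event(event: dict) -> None:
    ...  # placeholder: run the handlers for the given (latest) state

async def batching_worker(queue: "asyncio.Queue[dict]") -> None:
    # Keep replacing the event with newer arrivals until the window passes
    # with no new events; only the last state of the batch gets processed.
    while True:
        event = await queue.get()
        while True:
            try:
                event = await asyncio.wait_for(queue.get(), timeout=BATCH_WINDOW)
            except asyncio.TimeoutError:
                break  # the window expired: `event` now holds the latest state
        await process_event(event)
```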

Problems

The time-based approach to consistency led to several negative effects, noticeable on slow networks between the operator and the apiservers (e.g. when operators run outside the cluster) or under high load (e.g. with too many resources or too many modifications of a few resources).

  • @on.event handlers were not executed for valuable but intermediate events — because they were packed into batches and discarded in favour of the latest event only.
  • All handlers were delayed by some time (0.1 seconds by default; configurable) instead of being processed instantly.
  • Under a high intensity of changes, a batch could grow too large and last too long: if every event arrived within 0.1 seconds after the previous one, this could lead either to significant delays in processing or to no processing at all (if the stream of intense changes never paused or stopped).
  • Double-processing/double-execution of handlers when 3rd parties also patched the resource (including the handlers themselves, if they patched it not via Kopf's patching machinery) — explained below.

These effects were reported and investigated in #729, and directly or indirectly mentioned in #718, #732, and #784 (comment). (This PR supersedes and closes #829.)

Besides, code-wise:

  • The batching added complexity to the code and to the tests.

Double-processing

In the mentioned cases (slow networks and/or high load), the version patched by Kopf could arrive later than 0.1 seconds after the version patched by the 3rd party. As a result, the batch was not formed and these events were processed separately. In turn, since the intermediate state of the 3rd-party version did not contain Kopf's annotations about the successful processing of the resource, the handlers were re-executed (double-executed). Only after Kopf's patched version arrived did the operator go idle as expected.

Here is how it happened on the timeline, visually, with an artificial delay of 3 seconds:

         /-- kubectl creates an object (state a=s0)
         | ... sleep 1s
         |    /-- kubectl patches the spec.field with the patch "p1", creates state "b"=s0+p1
         |    |      /-- Kopf patches with annotations (state c=s0+p1+p2)
         |    |      |    /-- Kopf patches with annotations (the same state d=s0+p1+p2+p3, d==c)
         ↓    ↓      |    |
----+-//-aaaaabbbbbbbcccccdddddddddddddddddd--> which state is stored in kubernetes
         ↓    ↓      ↑↓   ↑↓
         |    |      ||   |\----3s----\
         |    |      |\---+3s----\    |
         |    \----3s+---\|      |    |
         \----3s----\|   ||      |    |
                    ↓↑   ↓↑      ↓    ↓
----+-//------------aaaaabbbbbbbbcccccdddddd--> which state is seen by the operator
    ↓               ↓↑   ↓↑      ↓    ↓
    |               ||   ||      |    \-- Kopf gets the state "d"=s0+p1+p2+p3, sees the annotations, goes idle.
    |               ||   ||      \-- Kopf gets the state "c"=s0+p1+p2, sees the annotations, goes idle.
    |               ||   ||
    |               ||   |\-- Kopf reacts, executes handlers (2ND TIME), adds annotations with a patch (p3)
    |               ||   \-- Kopf gets the state "b"=s0+p1 with NO annotations of "p2" yet.
    |               ||       !BUG!: "c"=s0+p1+p2 is not seen yet, though "c"/"p2" exists by now!
    |               ||
    |               |\-- Kopf reacts, executes handlers (1ST TIME), adds annotations with a patch (p2)
    |               \-- Kopf gets a watch-event (state a)
    \-- Kopf starts watching the resource

Solution

The proposed solution introduces a wait for consistency of the resource after it is patched: since the PATCH operation returns the patched resource, we can take its resource version and expect this version in the watch-stream. All states arriving before the expected version are considered inconsistent and are not processed, at least not by the high-level state-dependent handlers.
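For illustration, this is roughly how the expected version can be taken from the PATCH response (a sketch with aiohttp, which Kopf uses for HTTP; the function name, URL handling, and patch type are assumptions, not this PR's exact code):

```python
import aiohttp

async def patch_and_remember_version(session: aiohttp.ClientSession,
                                     url: str, patch: dict) -> str:
    # Apply the patch and return the resourceVersion of the patched object:
    # the watch-stream must deliver this version before the object can be
    # considered consistent again.
    async with session.patch(
        url,
        json=patch,
        headers={'Content-Type': 'application/merge-patch+json'},
    ) as response:
        response.raise_for_status()
        body = await response.json()
    return body['metadata']['resourceVersion']
```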

This is how it looks on the timeline with the same artificial delay of 3 seconds:

         /-- kubectl creates an object (state a=s0)
         | ... sleep 1s
         |    /-- kubectl patches the spec.field with the patch "p1", creates state "b"=s0+p1
         |    |      /-- Kopf patches with annotations (state c=s0+p1+p2)
         ↓    ↓      |
----+-//-aaaaabbbbbbbccccccccccccccccccc-> which state is stored in kubernetes
         ↓    ↓      ↑↓
         |    |      |\----3s----\
         |    \----3s+---\       |
         \----3s----\|   |       |
                    ↓↑   ↓       ↓ 
----+-//------------aaaaabbbbbbbbcccccc-> which state is seen by the operator
    ↓               ↓↑⇶⇶⇶⇶⇶⇶⇶⇶⇶⇶⇶⇶↓  Kopf's own patch "p2" enables the consistency expectation for 5s OR version "c"
    |               ||   |       |
    |               ||   |       \-- Kopf gets a consistent state "c"=s0+p1+p2 as expected, thus goes idle.
    |               ||   |
    |               ||   \-- Kopf executes ONLY the low-level handlers over the state "b"=s0+p1.
    |               ||   \~~~~~~⨳ inconsistency mode: wait until a new event (then discard it) OR timeout (then process it) 
    |               ||
    |               |\-- Kopf reacts, executes handlers, adds annotations with a patch (p2)
    |               \-- Kopf gets a watch-event (state a)
    \-- Kopf starts watching the resource

If the expected version does not arrive within some reasonable time window (5-10 seconds), assume that it will not arrive at all and reset the consistency waiting as if consistency were reached (even if it was not). This is a rare, mostly impossible case, and the timeout is needed only as a safe-guard: it is better to double-process the resource and cause side effects than to cease its processing forever.
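A rough sketch of this waiting logic with the timeout safe-guard, assuming an asyncio queue of raw watch-events (the names and the structure are illustrative, not this PR's actual implementation):

```python
import asyncio

CONSISTENCY_TIMEOUT = 5.0  # seconds; only a safe-guard, as described above

async def wait_for_consistency(events: "asyncio.Queue[dict]",
                               inconsistent_event: dict,
                               expected_version: str) -> dict:
    # Hold back the high-level handlers while the seen state precedes Kopf's
    # own patch: either the expected resourceVersion arrives (consistency is
    # reached), or the timeout expires and the latest known state is
    # processed anyway (better than ceasing the processing forever).
    loop = asyncio.get_running_loop()
    deadline = loop.time() + CONSISTENCY_TIMEOUT
    event = inconsistent_event
    while event['object']['metadata']['resourceVersion'] != expected_version:
        try:
            event = await asyncio.wait_for(
                events.get(), timeout=max(deadline - loop.time(), 0))
        except asyncio.TimeoutError:
            break  # the expected version never arrived: proceed with what we have
    return event
```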

Time-based batching is removed completely, as it is outdated and adds no benefit.

User effects

As a result of this fix, all mentioned problems are addressed:

  • Low-level events are never skipped: each and every event is handled via @on.event(), indexed, and passed to daemons.
  • High-level handlers are never double-executed — because only the consistent states are processed, 3rd-party patches without Kopf's annotations do not trigger these handlers (see the example after this list).
  • The processing happens instantly with no artificial delay for batching.
  • Batches never expand infinitely (because there are no batches anymore).
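To make the distinction concrete, a minimal operator sketch (assuming a hypothetical kopfexamples custom resource; the handler bodies are placeholders):

```python
import kopf

# Low-level handler: with this change, it sees each and every watch-event.
@kopf.on.event('kopfexamples')
def on_any_event(event, **kwargs):
    print(f"Raw event: {event.get('type')}")

# High-level handler: it runs only on consistent states, i.e. not on the
# intermediate 3rd-party patches that lack Kopf's own annotations yet.
@kopf.on.update('kopfexamples')
def on_update(spec, diff, **kwargs):
    print(f"Changed: {diff}")
```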

TODOs

  • Verify that the solution is correct in principle; explore the edge cases manually.
    • Multiple change-detecting handlers, with several patches in a row.
    • Combined change-detecting and low-level handlers (without patches).
    • Combined change-detecting and low-level handlers (with patches).
    • Indexing of resources with long inconsistency.
    • Live body updates in daemons/timers.
  • Adjust tests: remove the time-based ones, add the resource-version-based ones.
  • Adjust the docs where applicable.

@nolar nolar added the enhancement (New feature or request) and refactoring (Code cleanup without new features) labels on Oct 3, 2021
lgtm-com bot commented Oct 3, 2021

This pull request introduces 3 alerts when merging e856f7d into c1745b7 - view on LGTM.com

new alerts:

  • 3 for Unused import

lgtm-com bot commented Oct 3, 2021

This pull request introduces 3 alerts when merging 051496e into c1745b7 - view on LGTM.com

new alerts:

  • 3 for Unused import

paxbit commented Feb 17, 2022

Hi @nolar,

could you please give a short update on this PR? Is it still on track?

Cheers,
paxbit

lgtm-com bot commented Aug 7, 2022

This pull request introduces 1 alert when merging 1e66e5f into 7b45690 - view on LGTM.com

new alerts:

  • 1 for Unused import
