draft interfaces for replication admission control v2 #124484

Draft: sumeerbhola wants to merge 144 commits into master from rac_v2

Conversation

sumeerbhola (Collaborator):

This is extremely rough, and will likely be hard to understand.
Meant for folks working on replication AC v2 prototype.

blathers-crl bot commented May 21, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola force-pushed the rac_v2 branch 2 times, most recently from 7223dca to 519e123 on May 22, 2024 at 15:05
blathers-crl bot commented May 23, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Comment on lines 202 to 206
// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
// StateReplicate
pav-kv (Collaborator) commented:

Are you saying that we can't tolerate an ABA problem in which we go out and back into StateReplicate and can't tell if the node has always been in this state or not?

// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
pav-kv (Collaborator) commented:

Is it "any" rather than "all"?

Comment on lines 211 to 215
// should not miss any transitions. If stale messages received in a step can
// cause transitions out and back without observation, we can add a monotonic
// counter for each follower inside Raft (this is just local state at the
// leader), which will be incremented on every state transition and expose
// that via the RaftInterface.
pav-kv (Collaborator) commented:

One such counter is the Term of the raft instance. But it still doesn't cover cases when one term goes out/in StateReplicate.

Comment on lines 248 to 249
// nextUpperBound is populated iff in StateReplicate. All entries >=
// nextUpperBound have not yet had MsgApps constructed.
pav-kv (Collaborator) commented:

In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).

There can be in-flight MsgApps at higher indices, and we may even get acks for them (beyond the Next index). E.g. if we just transitioned out / back in StateReplicate and lost some tracking information.

Comment on lines 254 to 259
// notice such transitions both in HandleRaftEvent and SetReplicas. We
// should *not* construct MsgApps for a StateReplicate follower that is
// disconnected -- there is no timeliness guarantee on how long a follower
// will stay in StateReplicate despite it being down, and by sending such a
// follower MsgApps that are not being received we are defeating flow
// control (since we will have advanced nextUpperBound).
Collaborator commented:

Do we intend to stay in StateReplicate if the follower is disconnected? Today, there is a ReportUnreachable call that puts the flow to StateProbe. If this method is not called, raft will be sending MsgApps until it reaches max-inflight limits and stalls (there won't be more than one probe because a disconnected follower does not reply to heartbeats).

I'm guessing we're moving control over "raft will be sending MsgApps until it reaches max-inflight limits and stalls" to AC. So we may choose not to send any MsgApps, not call ReportUnreachable, and stay in StateReplicate. However, when we reconnect, there is a non-zero chance of a reject and needing to probe.

But I guess we won't need any special handling for this disconnected state because we will unmount the "connected stream", and won't be sending any messages as a result.

FollowerTransportConnected(storeID roachpb.StoreID) bool
// HighestEntryIndex is the highest index assigned in the log, and produced
// in Ready.Entries(). If there have been no entries produced, since this
// replica became the leader, this is the commit index.
pav-kv (Collaborator) commented:

A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?

A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.

Do we need to tie this definition to Ready.Entries()? We can get the size of the log more directly, by accessing raft.raftLog.lastEntryID(). Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?

Comment on lines +282 to +629
// entry that causes maxSize to be equaled or exceeded. This implies at
// least one entry will be returned in the MsgApp on success.
pav-kv (Collaborator) commented:

Today, various raft log internals return a prefix exceeding the maxSize only if it's a single entry; otherwise it's always <= maxSize, and the offender entry gets dropped from the response.

On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.

Is this nuance critical here, or is it an implementation detail?

// least one entry will be returned in the MsgApp on success.
//
// Returns raft.ErrCompacted error if log truncated. If no error,
// nextUpperBound is advanced to be equal to end. If raft.ErrCompacted is
pav-kv (Collaborator) commented:

It's probably advanced to the index following the last returned entry, which is not always end if maxSize is exceeded.

@sumeerbhola requested review from kvoli and pav-kv on May 30, 2024 at 15:29
@sumeerbhola (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 205 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Is it "any" rather than "all"?

I think we are saying the same thing -- it was meant as the "for all" / "for any" universal quantification, though it was probably a misuse of set-comprehension notation. Removed now.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 206 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Are you saying that we can't tolerate an ABA problem in which we go out and back into StateReplicate and can't tell if the node has always been in this state or not?

I've dropped all this language for a few reasons:

  • We want to keep a replicaSendStream briefly around in StateProbe, given that a single nack causes a transition to StateProbe. So we need to be able to tolerate a regression in indexToSend.
  • You brought up the old ack case ("There can be in-flight MsgApps at higher indices, and we may even get acks for them"), which means indexToSend can advance while in StateReplicate.
  • This requirement of observing all state transitions is too strong, and is unnecessary given we need to handle the previous bullets.

So the code should just react to the state that we are currently in, and make any adjustments needed based on the current (Match, Next).
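
To make this concrete, here is a minimal, hypothetical sketch of that reconciliation (the FollowerStateInfo fields and the replicaSendStream shown here are illustrative stand-ins, not the PR's actual types):

```go
package rac

// FollowerStateInfo is an illustrative snapshot of what the leader knows
// about a follower when handling a Ready/tick.
type FollowerStateInfo struct {
	InStateReplicate bool
	Match            uint64 // highest index known to be durably replicated
	Next             uint64 // next index raft intends to send
}

// replicaSendStream is an illustrative subset of the send-stream state.
type replicaSendStream struct {
	indexToSend uint64 // next index to construct a MsgApp for
}

// reconcile adjusts indexToSend from the current (Match, Next) rather than
// relying on having observed every state transition. A regression
// (Next < indexToSend) can follow a brief excursion to StateProbe; an
// advance (Match+1 > indexToSend) can follow acks for in-flight MsgApps we
// were not tracking.
func (s *replicaSendStream) reconcile(info FollowerStateInfo) {
	if !info.InStateReplicate {
		return
	}
	if info.Next < s.indexToSend {
		s.indexToSend = info.Next
	}
	if m := info.Match + 1; m > s.indexToSend {
		s.indexToSend = m
	}
}
```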


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 215 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

One such counter is the Term of the raft instance. But it still doesn't cover cases when one term goes out/in StateReplicate.

Dropped this part.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 249 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).

There can be in-flight MsgApps at higher indices, and we may even get acks for them (beyond the Next index). E.g. if we just transitioned out / back in StateReplicate and lost some tracking information.

Done

@sumeerbhola (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 259 at r18 (raw file):

Do we intend to stay in StateReplicate if the follower is disconnected? Today, there is a ReportUnreachable call that puts the flow to StateProbe.

We will keep that functionality, in that either the RaftTransport queue will overflow, or the circuit breaker will trip (as discussed in https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1716468360084809) even if nothing is being sent, and the transition to StateProbe will happen. At that point we will not be sending any more MsgApps, even if there are flow tokens available, because the replicaSendStream will be closed.

The code below needs some updating to reflect this. Specifically, replicateSoftDisconnected is unnecessary, and I will remove it.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 274 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?

A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.

Do we need to tie this definition to Ready.Entries()? We can get the size of the log more directly, by accessing raft.raftLog.lastEntryID(). Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?

Good point. Changed to

	// LastEntryIndex is the highest index assigned in the log.
	LastEntryIndex() uint64
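
For reference, a minimal sketch of what such an accessor might look like on the replica side (the replicaRaft type and lastIndex field are hypothetical; the only point illustrated is returning the last log index under the replica mutex with the unlock deferred, as a later commit in this PR also notes):

```go
package rac

import "sync"

// replicaRaft is a hypothetical stand-in for the replica state backing
// RaftInterface.
type replicaRaft struct {
	mu        sync.RWMutex
	lastIndex uint64 // highest log index assigned so far, maintained by raft handling
}

// LastEntryIndex is the highest index assigned in the log.
func (r *replicaRaft) LastEntryIndex() uint64 {
	r.mu.RLock()
	defer r.mu.RUnlock() // deferred so the read lock is always released
	return r.lastIndex
}
```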

pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 283 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Today, various raft log internals return a prefix exceeding the maxSize only if it's a single entry; otherwise it's always <= maxSize, and the offender entry gets dropped from the response.

On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.

Is this nuance critical here, or is it an implementation detail?

I think the nuance is important here: for throttled stores, the tokens are precious and we'd rather go slightly over than underutilize the tokens a replicaSendStream has grabbed and stashed in sendQueue.deductedForScheduler. It will return them if it underutilizes, but there is a delay until it returns them, during which no one is using these available tokens.

Your point about wasting a read makes me even more convinced that we should do this.
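
A minimal sketch of the size rule being agreed on here (the entry type and helper are illustrative, not the PR's actual MsgApp construction): entries are accumulated until the entry that causes maxSize to be equaled or exceeded is included, so at least one entry is always returned for a non-empty input.

```go
package rac

// entry is a minimal stand-in for a raft log entry; only its encoded size
// matters for this sketch.
type entry struct {
	index uint64
	size  int64
}

// selectEntries returns the prefix of ents to put in a single MsgApp. The
// entry that causes maxSize to be equaled or exceeded is included rather
// than dropped, so precious send tokens are slightly overshot instead of
// being left unused (and no read is wasted on a dropped entry).
func selectEntries(ents []entry, maxSize int64) []entry {
	var total int64
	for i, e := range ents {
		total += e.size
		if total >= maxSize {
			return ents[:i+1]
		}
	}
	return ents
}
```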


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 286 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

It's probably advanced to the index following the last returned entry, which is not always end if maxSize is exceeded.

Ah yes. This was a mistake -- thanks. Fixed.

allRequiredAvailable := true
// Check whether the handle has available tokens, if not we keep the
// select case and keep on waiting.
// TODO(kvoli): Currently we are only trying to deduct and signal the next
Collaborator commented:

Chatting w/ @sumeerbhola:

We should unblock the next waiter immediately. Utilize the normal back-pressure the goroutine scheduling is giving us.

Refresh is only being closed when the replica set changes. Otherwise, we expect this to successfully wait or get cancelled.

Worst case burst of GOMAXPROCS acquiring tokens when signaled.

TODO(kvoli): update code to try deducting immediately after being signaled on a handle's channel
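
A minimal sketch of the intended pattern (the tokenHandle type and its fields are hypothetical, not the PR's API): block on the handle's channel and, immediately after being signaled, try to deduct, accepting a worst-case burst of roughly GOMAXPROCS goroutines waking at once.

```go
package rac

import "context"

// tokenHandle is a hypothetical per-stream handle that is signaled when
// tokens may have become available.
type tokenHandle struct {
	signalCh  chan struct{}
	refreshCh chan struct{} // closed only when the replica set changes
	tryDeduct func(tokens int64) bool
}

// waitAndDeduct blocks until it deducts the requested tokens, the replica
// set changes, or the context is cancelled.
func waitAndDeduct(
	ctx context.Context, h *tokenHandle, tokens int64,
) (deducted bool, refreshed bool, err error) {
	for {
		select {
		case <-h.signalCh:
			// Try to deduct immediately after being signaled; if another
			// goroutine won the race, loop and wait for the next signal.
			if h.tryDeduct(tokens) {
				return true, false, nil
			}
		case <-h.refreshCh:
			return false, true, nil
		case <-ctx.Done():
			return false, false, ctx.Err()
		}
	}
}
```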

sumeerbhola and others added 12 commits July 31, 2024 15:31
… quorum

The existing implementation allowed for sequentially waiting on all
buckets (btw, these should not be called buckets, since this is not a
token bucket). We now need to wait for a quorum in some cases.

A combination of bucket.TokensAvailable and WaitForHandlesAndChannels
is used for such waiting (a sketch of this quorum wait appears after this group of commits).
Introduce a basic implementation of the StoreStreamTokenWatcher
interface. The implementation will notify callers when there are tokens
available for a `WorkClass,SendTokenCounter` pair. The caller should
then attempt deducting tokens.

Epic: none
Release note: None
WaitForEval is the new waiting interface.
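
As referenced in the quorum-wait commit above, here is a minimal sketch (the handle type, tokensAvailable, and signalCh are illustrative stand-ins for bucket.TokensAvailable and WaitForHandlesAndChannels; only the shape matters: count handles whose tokens are already available, then wait on the channels of the rest until a quorum is reached):

```go
package rac

import (
	"context"
	"reflect"
)

// handle is an illustrative per-replica waiting handle.
type handle struct {
	tokensAvailable func() bool     // non-blocking check, like bucket.TokensAvailable
	signalCh        <-chan struct{} // signaled when tokens may have become available
}

// waitForQuorum returns once at least quorum handles report tokens
// available, or the context is cancelled.
func waitForQuorum(ctx context.Context, handles []*handle, quorum int) error {
	for {
		available := 0
		var pending []*handle
		for _, h := range handles {
			if h.tokensAvailable() {
				available++
			} else {
				pending = append(pending, h)
			}
		}
		if available >= quorum {
			return nil
		}
		// Wait on all pending handles' channels (plus cancellation), then
		// re-check availability when any of them fires.
		cases := make([]reflect.SelectCase, 0, len(pending)+1)
		for _, h := range pending {
			cases = append(cases, reflect.SelectCase{
				Dir: reflect.SelectRecv, Chan: reflect.ValueOf(h.signalCh),
			})
		}
		cases = append(cases, reflect.SelectCase{
			Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ctx.Done()),
		})
		if chosen, _, _ := reflect.Select(cases); chosen == len(pending) {
			return ctx.Err()
		}
	}
}
```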
sumeerbhola and others added 25 commits July 31, 2024 19:37
LastEntryIndex would not unlock the replica mu after read locking. Defer
the unlock. Also clean up the replica mu locking in
flow_control_replica_integration handleRaftEvent.

Epic: none
Release note: None
…l cases

since send-queue has been popped from
Epic: none
Release note: None
Epic: none
Release note: None
Also include the state of the send stream and replica state info.

Epic: none
Release note: None
This commit adds a set of roachtests and a framework to measure the
impact to latency of various perturbations to the system.

The roachtest names all have the format
perturbation/[full|dev|metamorphic]/[perturbationType]

The four perturbationType variations added in this initial commit are:
restart
partition
add-node
decommission

Epic: CRDB-37514

Release note: None
Since we only consume 4MB of tokens when scheduled, there may be
remaining tokens of regular work class, but we may have already
popped all the regular work class entries. In this case we lower
bound the inherited priority to be RaftNormalPri instead of
incorrectly fataling.
since the entries popped have unknown size and are not being tracked.
If we don't do this, we will leak eval tokens.
Disable the quota pool and maximum inflight messages/bytes by setting a
large value. Replication flow control now performs this function.

NOTE: This is a hack; we should deprecate and integrate RAC into these
functions explicitly.

Epic: none
Release note: None
The `Tracker.String()` method includes every tracked entry, which can number
in the thousands. Print the stream instead, e.g. `t1/s1`.

Epic: none
Release note: None
The admission control perturbation latency tests assert that the pXX
latency does not increase by more than some percentage, both during and
after the perturbation, when compared to the baseline before the
perturbation.

When the latency being measured is small, the relative increases can be
much larger, failing the test despite a similar user quality of service.

Ignore latency results below 10ms when checking whether a change is
acceptable, i.e. latency results less than 10ms are considered
acceptable regardless of the baseline.

Epic: CRDB-37514
Release note: None
Send a MsgAppResp only if the admitted indices are in the context of a
new leader, i.e. the last accepted term == raft.Term.

Epic: none
Release note: none
Avoid concurrently iterating and deleting entries in the scheduled
replica map.

Epic: none
Release note: None
The final validation output was not consistent as it relied on iterating
over a map.

Sort the map keys first, to allow a consistent output order

Epic: none
Release note: None
Previously, it was possible to transfer a lease to a replica which was
behind, or which would likely struggle to keep up once it became the
leaseholder.

Introduce a new `RangeController` method `ValidLeaseTarget`, which
determines whether the given replica would be a suitable leaseholder, based
on the state of its send queue, eval/send tokens, and stream state.

Epic: none
Release note: None
Avoid stores up-replicating away from the target node, reducing the
backlog of work.

Epic: none
Release note: None
Include the delta (returned-deducted) as well as the deducted tokens
when logging streams which are blocked.

Epic: none
Release note: None
There is some justification in a code comment.
For readability, include spaces around the '-' separating returned and
deducted stream tokens.

Epic: none
Release note: None
To more easily determine which stream type is blocked (no tokens).

Epic: none
Release note: None
When printing token values, they were first converted to an unsigned
integer, which caused erroneous output. Introduce a pretty-print
function for tokens, which accounts for negative tokens and formats them
correctly (a sketch appears after this group of commits).

Epic: none
Release note: None
We would previously log on every occurrence of a non-leaseholder replica
not processing through the replica queue due to requiring a lease. This
occurs semi-frequently in a dynamic cluster with leaseholder changes.

Mute the needs-lease logging; other queue errors will still be logged.

Epic: none
Release note: None
Required to pass linter to build.

Epic: none
Release note: None
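
The token pretty-printing fix referenced earlier in this commit group amounts to formatting a signed byte quantity without an unsigned conversion; a minimal sketch (the Tokens type and function name here are illustrative, not the PR's helper):

```go
package rac

import "fmt"

// Tokens is a signed count of flow tokens, in bytes; deductions can drive
// it negative.
type Tokens int64

// prettyTokens formats a token count as a signed, human-readable byte
// quantity, e.g. "-1.5 MiB". Converting to an unsigned integer first (the
// old behavior) turns negative values into huge positive ones.
func prettyTokens(t Tokens) string {
	sign, v := "", int64(t)
	if v < 0 {
		sign, v = "-", -v
	}
	const (
		kib = 1 << 10
		mib = 1 << 20
		gib = 1 << 30
	)
	switch {
	case v >= gib:
		return fmt.Sprintf("%s%.1f GiB", sign, float64(v)/float64(gib))
	case v >= mib:
		return fmt.Sprintf("%s%.1f MiB", sign, float64(v)/float64(mib))
	case v >= kib:
		return fmt.Sprintf("%s%.1f KiB", sign, float64(v)/float64(kib))
	default:
		return fmt.Sprintf("%s%d B", sign, v)
	}
}
```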
kvoli and others added 2 commits August 1, 2024 10:37
Remove re-entrant deadlock within the setting change handler for v2
tokens.

Epic: none
Release note: None