draft interfaces for replication admission control v2 #124484
base: master
Conversation
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
// StateReplicate
Are you saying that we can't tolerate an ABA problem in which we go out of and back into `StateReplicate`, and can't tell if the node has always been in this state or not?
// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
Is it "any" rather than "all"?
// should not miss any transitions. If stale messages received in a step can
// cause transitions out and back without observation, we can add a monotonic
// counter for each follower inside Raft (this is just local state at the
// leader), which will be incremented on every state transition and expose
// that via the RaftInterface.
One such counter is the `Term` of the raft instance. But it still doesn't cover cases when one term goes out/in `StateReplicate`.
// nextUpperBound is populated iff in StateReplicate. All entries >=
// nextUpperBound have not yet had MsgApps constructed.
In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).
There can be in-flight `MsgApp`s at higher indices, and we may even get acks for them (beyond the `Next` index). E.g. if we just transitioned out of / back into `StateReplicate` and lost some tracking information.
// notice such transitions both in HandleRaftEvent and SetReplicas. We
// should *not* construct MsgApps for a StateReplicate follower that is
// disconnected -- there is no timeliness guarantee on how long a follower
// will stay in StateReplicate despite it being down, and by sending such a
// follower MsgApps that are not being received we are defeating flow
// control (since we will have advanced nextUpperBound).
Do we intend to stay in `StateReplicate` if the follower is disconnected? Today, there is a `ReportUnreachable` call that puts the flow to `StateProbe`. If this method is not called, raft will be sending `MsgApp`s until it reaches max-inflight limits and stalls (there won't be more than one probe because a disconnected follower does not reply to heartbeats).
I'm guessing we're moving control over "raft will be sending `MsgApp`s until it reaches max-inflight limits and stalls" to AC. So we may choose not to send any `MsgApp`s, not call `ReportUnreachable`, and stay in `StateReplicate`. However, when we reconnect, there is a non-zero chance of a reject and needing to probe.
But I guess we won't need any special handling for this disconnected state, because we will unmount the "connected stream" and won't be sending any messages as a result.
FollowerTransportConnected(storeID roachpb.StoreID) bool
// HighestEntryIndex is the highest index assigned in the log, and produced
// in Ready.Entries(). If there have been no entries produced, since this
// replica became the leader, this is the commit index.
A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?
A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.
Do we need to tie this definition to `Ready.Entries()`? We can get the size of the log more directly, by accessing `raft.raftLog.lastEntryID()`. Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?
// entry that causes maxSize to be equaled or exceeded. This implies at
// least one entry will be returned in the MsgApp on success.
Today, various raft log internals return a prefix exceeding the `maxSize` only if it's a single entry; otherwise it's always `<= maxSize`, and the offending entry gets dropped from the response.
On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.
Is this nuance critical here, or is it an implementation detail?
// least one entry will be returned in the MsgApp on success.
//
// Returns raft.ErrCompacted error if log truncated. If no error,
// nextUpperBound is advanced to be equal to end. If raft.ErrCompacted is
It's probably advanced to the index following the last returned entry, which is not always `end` if `maxSize` is exceeded.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 205 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Is it "any" rather than "all"?
I think we are saying the same thing -- it was meant as universal quantification ("for all" / "for any"), though it was probably a misuse of set-comprehension notation. Removed now.
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 206 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Are you saying that we can't tolerate an ABA problem in which we go out and back into `StateReplicate` and can't tell if the node has always been in this state or not?
I've dropped all this language for a few reasons:
- We want to keep a replicaSendStream briefly around in `StateProbe`, given that a single nack causes a transition to `StateProbe`. So we need to be able to tolerate a regression in `indexToSend`.
- You brought up the old ack case ("There can be in-flight `MsgApp`s at higher indices, and we may even get acks for them"), which means `indexToSend` can advance while in StateReplicate.
- This requirement of observing all state transitions is too strong, and is unnecessary given we need to handle the previous bullets.
So the code should just react to the state that we are currently in, and make any adjustments needed based on the current (Match, Next).
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 215 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
One such counter is the `Term` of the raft instance. But it still doesn't cover cases when one term goes out/in `StateReplicate`.
Dropped this part.
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 249 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).
There can be in-flight `MsgApp`s at higher indices, and we may even get acks for them (beyond the `Next` index). E.g. if we just transitioned out of / back into `StateReplicate` and lost some tracking information.
Done
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 259 at r18 (raw file):
Do we intend to stay in `StateReplicate` if the follower is disconnected? Today, there is a `ReportUnreachable` call that puts the flow to `StateProbe`.
We will keep that functionality: either the RaftTransport queue will overflow, or the circuit breaker will trip (as discussed in https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1716468360084809) even if nothing is being sent, and the transition to StateProbe will happen. At that point we will not be sending any more MsgApps even if there are flow tokens available, because the `replicaSendStream` will be closed.
The code below needs some updating to reflect this reality. Specifically, the `replicateSoftDisconnected` is unnecessary, and I will remove it.
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 274 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?
A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.
Do we need to tie this definition to `Ready.Entries()`? We can get the size of the log more directly, by accessing `raft.raftLog.lastEntryID()`. Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?
Good point. Changed to
// LastEntryIndex is the highest index assigned in the log.
LastEntryIndex() uint64
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 283 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
Today, various raft log internals return a prefix exceeding the `maxSize` only if it's a single entry; otherwise it's always `<= maxSize`, and the offending entry gets dropped from the response.
On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.
Is this nuance critical here, or is it an implementation detail?
I think the nuance is important here, in that for throttled stores the tokens are precious, and we'd rather go over slightly than under-utilize the tokens a `replicaSendStream` has grabbed and stashed in `sendQueue.deductedForScheduler`. It will return them if it under-utilizes, but there is a delay incurred until it returns them, during which no one is using these available tokens.
Your point about wasting a read makes me even more convinced that we should do this.
pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go
line 286 at r18 (raw file):
Previously, pav-kv (Pavel Kalinnikov) wrote…
It's probably advanced to the index following the last returned entry, which is not always `end` if `maxSize` is exceeded.
Ah yes. This was a mistake -- thanks. Fixed.
allRequiredAvailable := true
// Check whether the handle has available tokens, if not we keep the
// select case and keep on waiting.
// TODO(kvoli): Currently we are only trying to deduct and signal the next
Chatting w/ @sumeerbhola:
- We should unblock the next waiter immediately. Utilize the normal back-pressure the goroutine scheduling is giving us.
- Refresh is only being closed when the replica set changes. Otherwise, we expect this to successfully wait or get cancelled.
- Worst case: a burst of GOMAXPROCS goroutines acquiring tokens when signaled.
TODO(kvoli): update code to try deducting immediately after being signaled on a handle's channel.
… quorum The existing implementation allowed for sequentially waiting on all buckets (btw, these should not be called buckets, since this is not a token bucket). We now need to wait for a quorum in some cases. A combination of bucket.TokensAvailable and WaitForHandlesAndChannels is used for such waiting.
Introduce a basic implementation of the StoreStreamTokenWatcher interface. The implementation will notify callers when there are tokens available for a `WorkClass,SendTokenCounter` pair. The caller should then attempt deducting tokens. Epic: none Release note: None
WaitForEval is the new waiting interface.
LastEntryIndex would not unlock the replica mu after read locking. Defer the unlock. Also clean up the replica mu locking in flow_control_replica_integration handleRaftEvent. Epic: none Release note: None
…l cases since send-queue has been popped from
Epic: none Release note: None
Epic: none Release note: None
Also include the state of the send stream and replica state info. Epic: none Release note: None
This commit adds a set of roachtests and a framework to measure the impact to latency of various perturbations to the system. The roachtest names all have the format perturbation/[full|dev|metamorphic]/[perturbationType]. The four perturbationType variations added in this initial commit are: restart, partition, add-node, decommission. Epic: CRDB-37514 Release note: None
Since we only consume 4MB of tokens when scheduled, there may be remaining tokens of regular work class, but we may have already popped all the regular work class entries. In this case we lower bound the inherited priority to be RaftNormalPri instead of incorrectly fataling.
since the entries popped have unknown size and are not being tracked. If we don't do this, we will leak eval tokens.
Disable the quota pool and maximum inflight messages/bytes by setting a large value. Replication flow control now performs this function. NOTE: This is a hack, we should deprecate and integrate RAC into these functions explicitly. Epic: none Release note: None
The `Tracker.String()` method includes every tracked entry, which can be in the thousands+. Print the stream instead e.g. `t1/s1`. Epic: none Release note: None
The admission control perturbation latency tests assert that the pXX latency does not increase by more than some % when perturbing, and post perturbation, when compared to the baseline before perturbation. When the latency being measured is small, the relative increases can be much larger, failing the test, despite a similar user quality of service. Ignore latency results below 10ms when checking if a change is acceptable. i.e. latency results less than 10ms are considered acceptable regardless of the baseline. Epic: CRDB-37514 Release note: None
Send a MsgAppResp only if the admitted indices are in the context of a new leader, i.e. the last accepted term == raft.Term. Epic: none Release note: none
Avoid concurrently iterating and deleting entries in the scheduled replica map. Epic: none Release note: None
Epic: none Release note: None
The final validation output was not consistent as it relied on iterating over a map. Sort the map keys first, to allow a consistent output order Epic: none Release note: None
Previously, it was possible to transfer a lease to a replica which was behind, or likely to struggle keeping up once becoming the leaseholder. Introduce a new `RangeController` method `ValidLeaseTarget`, which determines if the given replica would be a suitable leaseholder, based off the state of its send queue, eval/send tokens and stream state. Epic: none Release note: None
Avoid stores up-replicating away from the target node, reducing the backlog of work. Epic: none Release note: None
Include the delta (returned-deducted) as well as the deducted tokens when logging streams which are blocked. Epic: none Release note: None
There is some justification in a code comment.
For readability, include spaces between '-' of returned and deducted stream tokens. Epic: none Release note: None
To more easily determine which stream type is blocked (no tokens). Epic: none Release note: None
When printing token values, they were first converted to an unsigned integer, which causes erroneous output. Introduce a pretty print function for tokens, which accounts for negative tokens and formats correctly. Epic: none Release note: None
We would previously log on every occurrence of a non-leaseholder replica not processing through the replica queue, due to requiring a lease. This occurs semi-frequently in a dynamic cluster, with leaseholder changes. Mute needs lease logging, other queue errors will still be logged. Epic: none Release note: None
Required to pass linter to build. Epic: none Release note: None
Remove re-entrant deadlock within the setting change handler for v2 tokens. Epic: none Release note: None
This is extremely rough, and will likely be hard to understand.
Meant for folks working on replication AC v2 prototype.