draft interfaces for replication admission control v2 #124484

Draft: sumeerbhola wants to merge 144 commits into master from rac_v2

Conversation

sumeerbhola (Collaborator):

This is extremely rough, and will likely be hard to understand.
Meant for folks working on replication AC v2 prototype.

blathers-crl bot commented May 21, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity (Member) commented:

This change is Reviewable

@sumeerbhola force-pushed the rac_v2 branch 2 times, most recently from 7223dca to 519e123 on May 22, 2024 at 15:05
blathers-crl bot commented May 23, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

Comment on lines 202 to 206
// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
// StateReplicate
pav-kv (Collaborator) commented:

Are you saying that we can't tolerate an ABA problem in which we go out and back into StateReplicate and can't tell if the node has always been in this state or not?

// Ready must be called on every tick/ready of Raft since we cannot tolerate
// state transitions from
//
// StateReplicate => {state in all states: state != StateReplicate} =>
pav-kv (Collaborator) commented:

Is it "any" rather than "all"?

Comment on lines 211 to 215
// should not miss any transitions. If stale messages received in a step can
// cause transitions out and back without observation, we can add a monotonic
// counter for each follower inside Raft (this is just local state at the
// leader), which will be incremented on every state transition and expose
// that via the RaftInterface.
pav-kv (Collaborator) commented:

One such counter is the Term of the raft instance. But it still doesn't cover cases when one term goes out/in StateReplicate.

Comment on lines 248 to 249
// nextUpperBound is populated iff in StateReplicate. All entries >=
// nextUpperBound have not yet had MsgApps constructed.
pav-kv (Collaborator) commented:

In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).

There can be in-flight MsgApps at higher indices, and we may even get acks for them (beyond the Next index). E.g. if we just transitioned out / back in StateReplicate and lost some tracking information.

Comment on lines 254 to 259
// notice such transitions both in HandleRaftEvent and SetReplicas. We
// should *not* construct MsgApps for a StateReplicate follower that is
// disconnected -- there is no timeliness guarantee on how long a follower
// will stay in StateReplicate despite it being down, and by sending such a
// follower MsgApps that are not being received we are defeating flow
// control (since we will have advanced nextUpperBound).
Collaborator commented:

Do we intend to stay in StateReplicate if the follower is disconnected? Today, there is a ReportUnreachable call that puts the flow to StateProbe. If this method is not called, raft will be sending MsgApps until it reaches max-inflight limits and stalls (there won't be more than one probe because a disconnected follower does not reply to heartbeats).

I'm guessing we're moving control over "raft will be sending MsgApps until it reaches max-inflight limits and stalls" to AC. So we may choose not to send any MsgApps, not call ReportUnreachable, and stay in StateReplicate. However, when we reconnect, there is a non-zero chance of a reject and needing to probe.

But I guess we won't need any special handling for this disconnected state because we will unmount the "connected stream", and won't be sending any messages as a result.

FollowerTransportConnected(storeID roachpb.StoreID) bool
// HighestEntryIndex is the highest index assigned in the log, and produced
// in Ready.Entries(). If there have been no entries produced, since this
// replica became the leader, this is the commit index.
pav-kv (Collaborator) commented:

A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?

A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.

Do we need to tie this definition to Ready.Entries()? We can get the size of the log more directly, by accessing raft.raftLog.lastEntryID(). Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?

Comment on lines +282 to +629
// entry that causes maxSize to be equaled or exceeded. This implies at
// least one entry will be returned in the MsgApp on success.
pav-kv (Collaborator) commented:

Today, various raft log internals return a prefix exceeding the maxSize only if it's a single entry; otherwise it's always <= maxSize, and the offender entry gets dropped from the response.

On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.

Is this nuance critical here, or is it an implementation detail?

// least one entry will be returned in the MsgApp on success.
//
// Returns raft.ErrCompacted error if log truncated. If no error,
// nextUpperBound is advanced to be equal to end. If raft.ErrCompacted is
pav-kv (Collaborator) commented:

It's probably advanced to the index following the last returned entry, which is not always end if maxSize is exceeded.

@sumeerbhola requested review from kvoli and pav-kv on May 30, 2024 at 15:29
@sumeerbhola (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 205 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Is it "any" rather than "all"?

I think we are saying the same thing -- it was meant as the "for all" / "for any" universal quantification, though it was probably a misuse of set-comprehension notation. Removed now.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 206 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Are you saying that we can't tolerate an ABA problem in which we go out and back into StateReplicate and can't tell if the node has always been in this state or not?

I've dropped all this language for a few reasons:

  • We want to keep a replicaSendStream briefly around in StateProbe, given that a single nack causes a transition to StateProbe. So we need to be able to tolerate a regression in indexToSend.
  • You brought up the old ack case ("There can be in-flight MsgApps at higher indices, and we may even get acks for them"), which means indexToSend can advance while in StateReplicate.
  • This requirement of observing all state transitions is too strong, and is unnecessary given we need to handle the previous bullets.

So the code should just react to the state that we are currently in, and make any adjustments needed based on the current (Match, Next).
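
To make this concrete, here is a minimal, hypothetical sketch of that reconciliation (the FollowerStateInfo fields and the replicaSendStream shown here are illustrative stand-ins, not the PR's actual types):

```go
package rac

// FollowerStateInfo is an illustrative snapshot of what the leader knows
// about a follower when handling a Ready/tick.
type FollowerStateInfo struct {
	InStateReplicate bool
	Match            uint64 // highest index known to be durably replicated
	Next             uint64 // next index raft intends to send
}

// replicaSendStream is an illustrative subset of the send-stream state.
type replicaSendStream struct {
	indexToSend uint64 // next index to construct a MsgApp for
}

// reconcile adjusts indexToSend from the current (Match, Next) rather than
// relying on having observed every state transition. A regression
// (Next < indexToSend) can follow a brief excursion to StateProbe; an
// advance (Match+1 > indexToSend) can follow acks for in-flight MsgApps we
// were not tracking.
func (s *replicaSendStream) reconcile(info FollowerStateInfo) {
	if !info.InStateReplicate {
		return
	}
	if info.Next < s.indexToSend {
		s.indexToSend = info.Next
	}
	if m := info.Match + 1; m > s.indexToSend {
		s.indexToSend = m
	}
}
```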


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 215 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

One such counter is the Term of the raft instance. But it still doesn't cover cases when one term goes out/in StateReplicate.

Dropped this part.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 249 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

In case it impacts correctness reasoning somewhere, "have not yet had MsgApps constructed" has an implicit asterisk: "by this connected stream / StateReplicate" (at some concrete in-memory "epoch" if we number them).

There can be in-flight MsgApps at higher indices, and we may even get acks for them (beyond the Next index). E.g. if we just transitioned out / back in StateReplicate and lost some tracking information.

Done

@sumeerbhola (Collaborator, Author) left a comment:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @kvoli and @pav-kv)


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 259 at r18 (raw file):

Do we intend to stay in StateReplicate if the follower is disconnected? Today, there is a ReportUnreachable call that puts the flow to StateProbe.

We will keep that functionality, in that either the RaftTransport queue will overflow, or the circuit breaker will trip (as discussed in https://cockroachlabs.slack.com/archives/C06UFBJ743F/p1716468360084809) even if nothing is being sent, and the transition to StateProbe will happen. At that point we will not be sending any more MsgApps, even if there are flow tokens available, because the replicaSendStream will be closed.

The code below needs some updating to reflect this. Specifically, replicateSoftDisconnected is unnecessary, and I will remove it.


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 274 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

A newly elected leader can have a suffix of entries which are not yet committed. Should we initialize to the log size, rather than to the commit index?

A thing about the commit index on the new leader: it doesn't know it before it commits a dummy entry at its term. So, by the time the leader learns the commit index, it has already appended an entry to its log.

Do we need to tie this definition to Ready.Entries()? We can get the size of the log more directly, by accessing raft.raftLog.lastEntryID(). Would it suffice to say that this returns the last index in the log at all times, and not mention the "becoming leader" case?

Good point. Changed to

	// LastEntryIndex is the highest index assigned in the log.
	LastEntryIndex() uint64
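
For reference, a minimal sketch of what such an accessor might look like on the replica side (the replicaRaft type and lastIndex field are hypothetical; the only point illustrated is returning the last log index under the replica mutex with the unlock deferred, as a later commit in this PR also notes):

```go
package rac

import "sync"

// replicaRaft is a hypothetical stand-in for the replica state backing
// RaftInterface.
type replicaRaft struct {
	mu        sync.RWMutex
	lastIndex uint64 // highest log index assigned so far, maintained by raft handling
}

// LastEntryIndex is the highest index assigned in the log.
func (r *replicaRaft) LastEntryIndex() uint64 {
	r.mu.RLock()
	defer r.mu.RUnlock() // deferred so the read lock is always released
	return r.lastIndex
}
```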

pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 283 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

Today, various raft log internals return a prefix exceeding the maxSize only if it's a single entry; otherwise it's always <= maxSize, and the offender entry gets dropped from the response.

On a few occasions, I considered this logic unfortunate (we waste a read) and corner-casey, and contemplated simplifying it into the one you described: allow exceeding the size always.

Is this nuance critical here, or is it an implementation detail?

I think the nuance is important here: for throttled stores, the tokens are precious and we'd rather go slightly over than underutilize the tokens a replicaSendStream has grabbed and stashed in sendQueue.deductedForScheduler. It will return them if it underutilizes, but there is a delay until it returns them, during which no one is using these available tokens.

Your point about wasting a read makes me even more convinced that we should do this.
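
A minimal sketch of the size rule being agreed on here (the entry type and helper are illustrative, not the PR's actual MsgApp construction): entries are accumulated until the entry that causes maxSize to be equaled or exceeded is included, so at least one entry is always returned for a non-empty input.

```go
package rac

// entry is a minimal stand-in for a raft log entry; only its encoded size
// matters for this sketch.
type entry struct {
	index uint64
	size  int64
}

// selectEntries returns the prefix of ents to put in a single MsgApp. The
// entry that causes maxSize to be equaled or exceeded is included rather
// than dropped, so precious send tokens are slightly overshot instead of
// being left unused (and no read is wasted on a dropped entry).
func selectEntries(ents []entry, maxSize int64) []entry {
	var total int64
	for i, e := range ents {
		total += e.size
		if total >= maxSize {
			return ents[:i+1]
		}
	}
	return ents
}
```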


pkg/kv/kvserver/kvflowcontrol/kvflowconnectedstream/stream.go line 286 at r18 (raw file):

Previously, pav-kv (Pavel Kalinnikov) wrote…

It's probably advanced to the index following the last returned entry, which is not always end if maxSize is exceeded.

Ah yes. This was a mistake -- thanks. Fixed.

allRequiredAvailable := true
// Check whether the handle has available tokens, if not we keep the
// select case and keep on waiting.
// TODO(kvoli): Currently we are only trying to deduct and signal the next
Collaborator commented:

Chatting w/ @sumeerbhola:

We should unblock the next waiter immediately. Utilize the normal back-pressure the goroutine scheduling is giving us.

Refresh is only being closed when the replica set changes. Otherwise, we expect this to successfully wait or get cancelled.

Worst case burst of GOMAXPROCS acquiring tokens when signaled.

TODO(kvoli): update code to try deducting immediately after being signaled on a handle's channel
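
A minimal sketch of the intended pattern (the tokenHandle type and its fields are hypothetical, not the PR's API): block on the handle's channel and, immediately after being signaled, try to deduct, accepting a worst-case burst of roughly GOMAXPROCS goroutines waking at once.

```go
package rac

import "context"

// tokenHandle is a hypothetical per-stream handle that is signaled when
// tokens may have become available.
type tokenHandle struct {
	signalCh  chan struct{}
	refreshCh chan struct{} // closed only when the replica set changes
	tryDeduct func(tokens int64) bool
}

// waitAndDeduct blocks until it deducts the requested tokens, the replica
// set changes, or the context is cancelled.
func waitAndDeduct(
	ctx context.Context, h *tokenHandle, tokens int64,
) (deducted bool, refreshed bool, err error) {
	for {
		select {
		case <-h.signalCh:
			// Try to deduct immediately after being signaled; if another
			// goroutine won the race, loop and wait for the next signal.
			if h.tryDeduct(tokens) {
				return true, false, nil
			}
		case <-h.refreshCh:
			return false, true, nil
		case <-ctx.Done():
			return false, false, ctx.Err()
		}
	}
}
```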

sumeerbhola and others added 12 commits July 31, 2024 15:31
… quorum

The existing implementation allowed for sequentially waiting on all
buckets (btw, these should not be called buckets, since this is not a
token bucket). We now need to wait for a quorum in some cases.

A combination of bucket.TokensAvailable and WaitForHandlesAndChannels
is used for such waiting (a sketch of this quorum wait appears after this group of commits).
Introduce a basic implementation of the StoreStreamTokenWatcher
interface. The implementation will notify callers when there are tokens
available for a `WorkClass,SendTokenCounter` pair. The caller should
then attempt deducting tokens.

Epic: none
Release note: None
WaitForEval is the new waiting interface.
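
As referenced in the quorum-wait commit above, here is a minimal sketch (the handle type, tokensAvailable, and signalCh are illustrative stand-ins for bucket.TokensAvailable and WaitForHandlesAndChannels; only the shape matters: count handles whose tokens are already available, then wait on the channels of the rest until a quorum is reached):

```go
package rac

import (
	"context"
	"reflect"
)

// handle is an illustrative per-replica waiting handle.
type handle struct {
	tokensAvailable func() bool     // non-blocking check, like bucket.TokensAvailable
	signalCh        <-chan struct{} // signaled when tokens may have become available
}

// waitForQuorum returns once at least quorum handles report tokens
// available, or the context is cancelled.
func waitForQuorum(ctx context.Context, handles []*handle, quorum int) error {
	for {
		available := 0
		var pending []*handle
		for _, h := range handles {
			if h.tokensAvailable() {
				available++
			} else {
				pending = append(pending, h)
			}
		}
		if available >= quorum {
			return nil
		}
		// Wait on all pending handles' channels (plus cancellation), then
		// re-check availability when any of them fires.
		cases := make([]reflect.SelectCase, 0, len(pending)+1)
		for _, h := range pending {
			cases = append(cases, reflect.SelectCase{
				Dir: reflect.SelectRecv, Chan: reflect.ValueOf(h.signalCh),
			})
		}
		cases = append(cases, reflect.SelectCase{
			Dir: reflect.SelectRecv, Chan: reflect.ValueOf(ctx.Done()),
		})
		if chosen, _, _ := reflect.Select(cases); chosen == len(pending) {
			return ctx.Err()
		}
	}
}
```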
sumeerbhola and others added 25 commits July 31, 2024 19:37
LastEntryIndex would not unlock the replica mu after read locking. Defer
the unlock. Also clean up the replica mu locking in
flow_control_replica_integration handleRaftEvent.

Epic: none
Release note: None
…l cases

since send-queue has been popped from
Epic: none
Release note: None
Epic: none
Release note: None
Also include the state of the send stream and replica state info.

Epic: none
Release note: None
This commit adds a set of roachtests and a framework to measure the
impact to latency of various perturbations to the system.

The roachtest names all have the format
perturbation/[full|dev|metamorphic]/[perturbationType]

The four perturbationType variations added in this initial commit are:
restart
partition
add-node
decommission

Epic: CRDB-37514

Release note: None
Since we only consume 4MB of tokens when scheduled, there may be
remaining tokens of regular work class, but we may have already
popped all the regular work class entries. In this case we lower
bound the inherited priority to be RaftNormalPri instead of
incorrectly fataling.
since the entries popped have unknown size and are not being tracked.
If we don't do this, we will leak eval tokens.
Disable the quota pool and maximum inflight messages/bytes by setting a
large value. Replication flow control now performs this function.

NOTE: This is a hack; we should deprecate and integrate RAC into these
functions explicitly.

Epic: none
Release note: None
The `Tracker.String()` method includes every tracked entry, which can number
in the thousands. Print the stream instead, e.g. `t1/s1`.

Epic: none
Release note: None
The admission control perturbation latency tests assert that the pXX
latency does not increase by more than some percentage, both during and
after the perturbation, when compared to the baseline before the
perturbation.

When the latency being measured is small, the relative increases can be
much larger, failing the test despite a similar user quality of service.

Ignore latency results below 10ms when checking whether a change is
acceptable, i.e. latency results less than 10ms are considered
acceptable regardless of the baseline.

Epic: CRDB-37514
Release note: None
Send a MsgAppResp only if the admitted indices are in the context of a
new leader, i.e. the last accepted term == raft.Term.

Epic: none
Release note: none
Avoid concurrently iterating and deleting entries in the scheduled
replica map.

Epic: none
Release note: None
The final validation output was not consistent as it relied on iterating
over a map.

Sort the map keys first, to allow a consistent output order

Epic: none
Release note: None
Previously, it was possible to transfer a lease to a replica which was
behind, or which would likely struggle to keep up once it became the
leaseholder.

Introduce a new `RangeController` method `ValidLeaseTarget`, which
determines whether the given replica would be a suitable leaseholder, based
on the state of its send queue, eval/send tokens, and stream state.

Epic: none
Release note: None
Avoid stores up-replicating away from the target node, reducing the
backlog of work.

Epic: none
Release note: None
Include the delta (returned-deducted) as well as the deducted tokens
when logging streams which are blocked.

Epic: none
Release note: None
There is some justification in a code comment.
For readability, include spaces around the '-' separating returned and
deducted stream tokens.

Epic: none
Release note: None
To more easily determine which stream type is blocked (no tokens).

Epic: none
Release note: None
When printing token values, they were first converted to an unsigned
integer, which caused erroneous output. Introduce a pretty-print
function for tokens, which accounts for negative tokens and formats them
correctly (a sketch appears after this group of commits).

Epic: none
Release note: None
We would previously log on every occurrence of a non-leaseholder replica
not processing through the replica queue due to requiring a lease. This
occurs semi-frequently in a dynamic cluster with leaseholder changes.

Mute the needs-lease logging; other queue errors will still be logged.

Epic: none
Release note: None
Required to pass linter to build.

Epic: none
Release note: None
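
The token pretty-printing fix referenced earlier in this commit group amounts to formatting a signed byte quantity without an unsigned conversion; a minimal sketch (the Tokens type and function name here are illustrative, not the PR's helper):

```go
package rac

import "fmt"

// Tokens is a signed count of flow tokens, in bytes; deductions can drive
// it negative.
type Tokens int64

// prettyTokens formats a token count as a signed, human-readable byte
// quantity, e.g. "-1.5 MiB". Converting to an unsigned integer first (the
// old behavior) turns negative values into huge positive ones.
func prettyTokens(t Tokens) string {
	sign, v := "", int64(t)
	if v < 0 {
		sign, v = "-", -v
	}
	const (
		kib = 1 << 10
		mib = 1 << 20
		gib = 1 << 30
	)
	switch {
	case v >= gib:
		return fmt.Sprintf("%s%.1f GiB", sign, float64(v)/float64(gib))
	case v >= mib:
		return fmt.Sprintf("%s%.1f MiB", sign, float64(v)/float64(mib))
	case v >= kib:
		return fmt.Sprintf("%s%.1f KiB", sign, float64(v)/float64(kib))
	default:
		return fmt.Sprintf("%s%d B", sign, v)
	}
}
```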
kvoli and others added 2 commits August 1, 2024 10:37
Remove re-entrant deadlock within the setting change handler for v2
tokens.

Epic: none
Release note: None