
fix: client kill preempts in atomic section on shutdown #5283


Merged
kostasrim merged 4 commits into main from kpr5 on Jun 18, 2025

Conversation

kostasrim (Contributor):

The problem is that TlsSocket::Shutdown is preemptive, because it flushes its buffer to the socket. This violates the FiberAtomicGuard that is active while traversing the connection list on each shard. To fix this, we move the shutdown call to a separate fiber.
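Roughly, the shape of the fix looks like this (a minimal sketch of the pattern, not the actual patch: it is a fragment meant to live inside ClientKill, `ShutdownSelf` is a hypothetical stand-in for the connection's real shutdown path, and the container choice for `kill_list` is assumed):

```cpp
// While the connection list is traversed, a FiberAtomicGuard is active, so the
// traversal callback may only collect weak references -- nothing in it may preempt.
std::deque<facade::Connection::WeakRef> kill_list;
// ... traversal fills kill_list via Borrow(); filtering logic omitted ...

// The preemptive part -- TlsSocket::Shutdown flushing its buffer to the socket --
// runs afterwards, from a dedicated fiber.
util::fb2::Fiber killer([&kill_list] {
  while (!kill_list.empty()) {
    facade::Connection::WeakRef ref = std::move(kill_list.front());
    kill_list.pop_front();
    if (facade::Connection* conn = ref.Get())
      conn->ShutdownSelf();  // hypothetical name; preempting here is safe
  }
});
killer.Join();
```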

@kostasrim kostasrim self-assigned this Jun 12, 2025
facade::Connection::WeakRef ref = std::move(kill_list.front());
kill_list.pop_front();
facade::Connection* conn = ref.Get();
// TODO think how to handle migration for eval. See RequestAsyncMigration
kostasrim (Contributor, Author):

Need to think what we can do here

@@ -535,14 +535,39 @@ void ClientKill(CmdArgList args, absl::Span<facade::Listener*> listeners, SinkRe

const bool is_admin_request = cntx->conn()->IsPrivileged();

std::vector<util::fb2::Fiber> fibers(pp->size() * listeners.size());
kostasrim (Contributor, Author):

IMO we could also have a single fiber per shard that sleeps and wakes up when we push work to its queue, but I have no preference.
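For context, that per-shard worker-fiber variant might look roughly like this (a sketch only, and not the direction ultimately taken; it assumes fiber-aware `Mutex`/`CondVarAny` primitives in `util::fb2` and again uses the hypothetical `ShutdownSelf`):

```cpp
// One such worker per shard; it sleeps until kill requests are pushed to its queue.
struct KillQueue {
  util::fb2::Mutex mu;        // assumed fiber-aware synchronization primitives
  util::fb2::CondVarAny cv;
  std::deque<facade::Connection::WeakRef> pending;
  bool stop = false;
};

void KillWorkerFiber(KillQueue* q) {
  std::unique_lock lk(q->mu);
  while (!q->stop) {
    q->cv.wait(lk, [q] { return q->stop || !q->pending.empty(); });
    while (!q->pending.empty()) {
      facade::Connection::WeakRef ref = std::move(q->pending.front());
      q->pending.pop_front();
      lk.unlock();  // do not hold the lock across a preemptive shutdown
      if (facade::Connection* conn = ref.Get())
        conn->ShutdownSelf();  // hypothetical name
      lk.lock();
    }
  }
}
```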

romange (Collaborator), Jun 18, 2025:

I do not understand the reason for all the complexity here and below.
I would imagine having an array per thread: you just add all the matching connections to the array in their owning thread.

Then the next step is to run pool->AwaitFiberOnAll, which accesses the array in its own thread and sequentially shuts down the connections.
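Sketched out, the suggested shape is something like this (assumptions: the traversal callback signature, `Borrow()` returning a `WeakRef`, and the hypothetical `ShutdownSelf`; the `AwaitFiberOnAll` callback shape matches the lambda used later in this diff):

```cpp
// One bucket per proactor thread, filled in each connection's owning thread.
std::vector<std::vector<facade::Connection::WeakRef>> thread_connections(pp->size());

auto traverse_cb = [&](unsigned thread_index, util::Connection* conn) {
  // Runs in the owning thread under the atomic guard: apply the CLIENT KILL
  // filters (omitted) and only collect a weak reference -- no preemption here.
  auto* dfly_conn = static_cast<facade::Connection*>(conn);
  thread_connections[thread_index].push_back(dfly_conn->Borrow());
};
for (auto* listener : listeners)
  listener->TraverseConnections(traverse_cb);

// Each thread then drains its own bucket and shuts the connections down
// sequentially; the preemptive shutdown now happens outside any atomic section.
pp->AwaitFiberOnAll([&](unsigned idx, ProactorBase* p) {
  for (auto& ref : thread_connections[idx]) {
    facade::Connection* conn = ref.Get();
    if (conn && conn->socket()->proactor()->GetPoolIndex() == p->GetPoolIndex())
      conn->ShutdownSelf();  // hypothetical name
  }
});
```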

kostasrim (Contributor, Author):

Yeah, I agree it was overkill.

There is a small issue -- please see my other comment.

@kostasrim kostasrim requested a review from romange June 18, 2025 07:12
@@ -499,7 +499,7 @@ void ClientId(CmdArgList args, SinkReplyBuilder* builder, ConnectionContext* cnt
}

void ClientKill(CmdArgList args, absl::Span<facade::Listener*> listeners, SinkReplyBuilder* builder,
-                ConnectionContext* cntx) {
+                ConnectionContext* cntx, util::ProactorPool* pp) {
romange (Collaborator):

you do not need to pass it - it's accessible via shard_set->pool()

Comment on lines +563 to +564
facade::Connection* conn = tcon.Get();
if (conn && conn->socket()->proactor()->GetPoolIndex() == p->GetPoolIndex()) {
kostasrim (Contributor, Author):

@romange

The problem here is that we might migrate a connection between preemptions of Shutdown. Calling tcon.Get() on a connection that has migrated can become a data race.

Maybe the solution is to simply disable migrations for each connection we are about to kill? WDYT?

kostasrim (Contributor, Author), Jun 18, 2025:

Now that I am thinking about it, we should disable migrations when we call Borrow(). If a connection migrates after the borrow, the socket_->proactor()->GetPoolIndex() value we cached within ConnectionWeakRef is no longer valid. The DCHECK() within Get() won't protect us either, because it uses the cached, no-longer-valid proactor index.
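A condensed, hypothetical illustration of the hazard being described (this is not the real ConnectionWeakRef, only the caching behavior the comment refers to):

```cpp
// Hypothetical, condensed version of a weak reference that caches the owning
// proactor's pool index at Borrow() time.
class WeakRefIllustration {
 public:
  explicit WeakRefIllustration(facade::Connection* conn)
      : conn_(conn),
        cached_pool_index_(conn->socket()->proactor()->GetPoolIndex()) {}

  facade::Connection* Get() const {
    // If the connection migrated to another proactor after Borrow(), the cached
    // index is stale, so this check no longer proves we are on the owning thread.
    DCHECK_EQ(cached_pool_index_, ProactorBase::me()->GetPoolIndex());
    return conn_;
  }

 private:
  facade::Connection* conn_;
  unsigned cached_pool_index_;
};
```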

romange (Collaborator):

We run with migrate_connections=false in production, so it's not an issue.
I think your checks are good enough for this.

kostasrim (Contributor, Author):

sounds good

listener->TraverseConnections(cb);
}
auto cb = [&](unsigned idx, ProactorBase* p) mutable {
// Step 1 aggregate the per thread connections from all listeners
romange (Collaborator):

If you do both steps locally in-thread, there is no reason to declare thread_connections as a global object outside. Instead, just define a vector<facade::Connection::WeakRef> here.

kostasrim (Contributor, Author):

Yes, I thought about that. My issue was that we need to pass this thread-local vector to traverse_cb. This implies either defining traverse_cb within this lambda, which makes it harder to read, or binding it via [&local] { traverse_cb(local, ...); }.

I am fine with either of these -- I will do whatever you find more readable :)
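For reference, the two wirings being weighed look roughly like this (a sketch; `CollectMatching` is a hypothetical stand-in for the filtering logic that fills the per-thread vector, and the callback signatures are assumed):

```cpp
// Hypothetical helper holding the existing filtering logic.
void CollectMatching(std::vector<facade::Connection::WeakRef>& out,
                     unsigned thread_index, util::Connection* conn);

auto per_thread = [&](unsigned idx, ProactorBase* p) mutable {
  std::vector<facade::Connection::WeakRef> local;  // stays local to this thread

  // Option 1: write the whole traversal callback inline, next to `local`.
  auto traverse_cb = [&](unsigned tid, util::Connection* conn) {
    // ... filtering logic written out here ...
  };

  // Option 2: keep the logic in a named callable and only bind `local`.
  auto bound_cb = [&local](unsigned tid, util::Connection* conn) {
    CollectMatching(local, tid, conn);
  };

  // Either callback is then passed to the listeners' traversal (elided).
};
```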

romange (Collaborator):

I am fine with either option (slight preference for the first).

kostasrim (Contributor, Author):

Sounds good, I will follow up.

@kostasrim kostasrim requested a review from romange June 18, 2025 11:50
@kostasrim kostasrim enabled auto-merge (squash) June 18, 2025 12:42
@kostasrim kostasrim merged commit 3e3298e into main Jun 18, 2025
10 checks passed
@kostasrim kostasrim deleted the kpr5 branch June 18, 2025 12:46