fix(kad): correctly handle `NoKnownPeers` error when bootstrap #5349

stormshield-frb · 2024-04-30T13:32:15Z

Description

After testing master, we encountered a bug due to #4838 when doing automatic or periodic bootstrap if the node has no known peers.

Since it failed immediately, I thought there was no need to call the bootstrap_status.on_started method. But skipping that call means the periodic timer inside bootstrap_status is never reset, so the node gets stuck trying to bootstrap every time poll is called on kad::Behaviour.
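
A minimal sketch of the failure mode described above, not the actual libp2p-kad code. `PeriodicTimer`, `is_due` and `attempts_with_empty_table` are hypothetical names, and the real status type is futures-based. The only assumption carried over from the description is that the timer stays expired until `on_started` moves the deadline.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified stand-in for the periodic part of the bootstrap status.
struct PeriodicTimer {
    interval: Duration,
    deadline: Instant,
}

impl PeriodicTimer {
    fn new(interval: Duration, now: Instant) -> Self {
        Self { interval, deadline: now + interval }
    }

    /// True once the deadline has passed. The deadline is not moved here,
    /// so this keeps returning true until `on_started` is called.
    fn is_due(&self, now: Instant) -> bool {
        now >= self.deadline
    }

    /// Called when a bootstrap starts: pushes the deadline one interval ahead.
    fn on_started(&mut self, now: Instant) {
        self.deadline = now + self.interval;
    }
}

/// Counts how many bootstrap attempts `polls` consecutive polls trigger when
/// every attempt fails immediately because the routing table is empty.
fn attempts_with_empty_table(polls: u32, reset_on_failure: bool) -> u32 {
    let start = Instant::now();
    let interval = Duration::from_secs(300);
    let mut timer = PeriodicTimer::new(interval, start);
    // Jump to just after the first deadline; all polls happen at that instant.
    let now = start + interval + Duration::from_secs(1);
    let mut attempts = 0;
    for _ in 0..polls {
        if timer.is_due(now) {
            attempts += 1; // the bootstrap fails right away: no known peers
            if reset_on_failure {
                timer.on_started(now);
            }
        }
    }
    attempts
}
```

With `reset_on_failure` set to `false`, every poll retries the bootstrap, which is the stuck behaviour reported here. With `true`, only the first poll does.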

Notes & open questions

N/A

Change checklist

I have performed a self-review of my own code
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
A changelog entry has been made in the appropriate crates

protocols/kad/CHANGELOG.md

guillaumemichel · 2024-04-30T14:33:44Z

This looks a bit hacky. Wouldn't it be better to modify the bootstrap Status instead (e.g. poll_next_bootstrap)?

stormshield-frb · 2024-05-02T07:39:21Z

> This looks a bit hacky. Wouldn't it be better to modify the bootstrap Status instead (e.g. poll_next_bootstrap)?

I'm not sure I understand what you mean. on_started and on_finished are intended for that purpose.

Even if I updated the Status directly, we could not remove on_started completely, since the end user can still trigger a bootstrap manually. We could not remove on_finished at all, since there is currently no way to detect that a bootstrap has finished other than by looking at query_finished or query_timeout. And since a bootstrap can also fail immediately, we have to handle that case there as well.

I agree that it would feel better to have a passive way to learn that a bootstrap did start or finish but I don't see how to implement that in a reasonably simple manner.

The reason we need to know whether a bootstrap has started or finished is that we don't want to cascade bootstrap requests. When a bootstrap is triggered (whether automatic, periodic or manual), we reset the automatic and periodic timers to their initial values.
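
The bookkeeping described in this comment could look roughly like the sketch below. It is an illustration under assumptions, not the crate's implementation: `BootstrapTracker`, its fields and `may_auto_trigger` are hypothetical, and the timers are reduced to a reset counter.

```rust
/// Hypothetical tracker for running bootstraps (not the libp2p-kad type).
#[derive(Default)]
struct BootstrapTracker {
    /// Number of bootstraps currently running, whatever triggered them.
    in_progress: usize,
    /// How often the automatic and periodic timers were reset (stand-in for real timers).
    timer_resets: usize,
}

impl BootstrapTracker {
    /// A bootstrap started (automatic, periodic or manual): count it and reset the timers.
    fn on_started(&mut self) {
        self.in_progress += 1;
        self.timer_resets += 1;
    }

    /// A bootstrap finished, successfully or not.
    fn on_finished(&mut self) {
        self.in_progress = self.in_progress.saturating_sub(1);
    }

    /// An elapsed timer may only trigger a new bootstrap when none is running,
    /// which prevents bootstrap requests from cascading.
    fn may_auto_trigger(&self, timer_elapsed: bool) -> bool {
        timer_elapsed && self.in_progress == 0
    }
}
```

In this sketch a manual bootstrap goes through `on_started` as well, so it also postpones the next automatic attempt.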

protocols/kad/src/behaviour.rs

guillaumemichel · 2024-06-11T12:21:34Z

Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.

If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.
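
A rough sketch of the fail-fast check argued for here, under stated assumptions: `NoKnownPeers` and `QueryId` are local stand-ins defined in the snippet, the routing table is reduced to a slice of peer names, and the `bootstrap` signature is illustrative rather than the crate's.

```rust
/// Local stand-in for the error named in the PR title; not imported from libp2p.
#[derive(Debug, PartialEq)]
struct NoKnownPeers;

/// Hypothetical query handle.
#[derive(Debug, PartialEq)]
struct QueryId(u32);

/// Fails fast when the routing table is empty, so the caller gets an explicit
/// error instead of a query that finishes immediately with no results.
fn bootstrap(routing_table: &[&str], next_id: u32) -> Result<QueryId, NoKnownPeers> {
    if routing_table.is_empty() {
        return Err(NoKnownPeers);
    }
    // With at least one known peer, a query for the local key would be started here.
    Ok(QueryId(next_id))
}
```

The explicit `Err` is what lets the caller tell "nothing to ask" apart from a query that ran and found nothing.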

protocols/kad/src/behaviour.rs

stormshield-frb · 2024-06-11T12:40:09Z

> Not checking whether the routing table is empty would take us back to before this commit. IMO it is good that this check was introduced, allowing the bootstrap to fail fast.
>
> If the check isn't performed, a new query for self is created, and the first time next() is called, the query state is immediately set to Finished. IMO it would be better to have an error message saying that the bootstrap failed because the routing table is empty, rather than looking for an empty OutboundQueryProgressed.

I'm not sure about that. If a check is needed when the routing table is empty, why is it not done for the other queries like get_closest_peers, get_record or get_providers? In those cases, a Query is always started, even though it is possible that no actual query is sent on the wire. Sure, get_record and get_providers can get information from their local cache, but get_closest_peers cannot.

guillaumemichel · 2024-06-11T12:55:42Z

I agree that the empty routing table check should be consistent with other queries, and it would be even better if the check was unified (e.g. when creating the query?). But this can be left for a follow-up PR.

mergify · 2024-06-12T18:12:15Z

This pull request has merge conflicts. Could you please resolve them @stormshield-frb? 🙏

guillaumemichel · 2024-06-18T11:20:43Z

@stormshield-frb what do you think of b548fc7? IMO it is slightly cleaner: if the routing table is empty, we simply reset the timers, without having to increase and then decrease the count of bootstrap requests or wake the waker.

If you disagree, we can revert to the last commit.
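
The approach described in this comment can be sketched as follows. This is a simplified illustration and not the code from b548fc7: `Status`, `handle_trigger` and the single `deadline` field are assumptions, and the waker is left out entirely.

```rust
use std::time::{Duration, Instant};

/// Hypothetical, simplified bootstrap status (not the libp2p-kad type).
struct Status {
    interval: Duration,
    deadline: Instant,
    in_progress: usize,
}

impl Status {
    /// Only moves the deadline; the request count is left untouched.
    fn reset_timers(&mut self, now: Instant) {
        self.deadline = now + self.interval;
    }

    /// A bootstrap query really started: count it and reset the timers.
    fn on_started(&mut self, now: Instant) {
        self.in_progress += 1;
        self.reset_timers(now);
    }

    /// A bootstrap query finished.
    fn on_finished(&mut self) {
        self.in_progress = self.in_progress.saturating_sub(1);
    }
}

/// Handles an automatic or periodic trigger. Returns true if a query was started.
fn handle_trigger(status: &mut Status, routing_table_len: usize, now: Instant) -> bool {
    if routing_table_len == 0 {
        // Nothing to query: reset the timers so the next poll does not retry,
        // without the increment/decrement round trip of on_started/on_finished.
        status.reset_timers(now);
        return false;
    }
    status.on_started(now);
    true
}
```

The empty-table path never touches `in_progress`, so there is no matching `on_finished` call to remember.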

jxs

LGTM, thanks François, and Gui for the review!

dariusc93 reviewed Apr 30, 2024

protocols/kad/CHANGELOG.md (outdated, resolved)

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 2 times, most recently from 67e9aca to ec9898f · May 2, 2024 07:42

guillaumemichel reviewed May 2, 2024

protocols/kad/src/behaviour.rs (outdated, resolved)

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 2 times, most recently from f2d3e48 to f0ee433 · June 5, 2024 10:36

stormshield-frb mentioned this pull request Jun 5, 2024

Kademlia bootstrap gets stuck forever in some cases #5432

Closed

guillaumemichel mentioned this pull request Jun 6, 2024

Bootstrap node can't add the connected node into peers using examples/distributed-key-value-store in public network #5445

Closed

nazar-pc mentioned this pull request Jun 6, 2024

DSN sync can get stuck indefinitely subspace/subspace#2729

Closed

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch from f0ee433 to 94c11d9 · June 10, 2024 10:08

guillaumemichel reviewed Jun 11, 2024

protocols/kad/src/behaviour.rs (outdated, resolved)

fix(kad): always trigger a query when bootstrapping

08148ad

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch 3 times, most recently from 345424b to 2922bde · June 13, 2024 10:11

just addressing the problem

f1dfb2b

stormshield-frb force-pushed the fix/automatic-bootstrap-bug branch from 2922bde to f1dfb2b · June 13, 2024 10:12

jxs added this to the v0.54.0 milestone Jun 13, 2024

adding reset_timers fn

b548fc7

guillaumemichel approved these changes Jun 18, 2024

jxs approved these changes Jun 18, 2024

jxs added the send-it label Jun 18, 2024

Merge branch 'master' into fix/automatic-bootstrap-bug

3354d78

mergify bot merged commit 32e917f into libp2p:master Jun 18, 2024
72 checks passed
