gossip backpressure leads to unresponsive states and / or ∞ discovery cycle #47
I have been playing around a little more with iroh-gossip, trying to get an understanding of the APIs and their capabilities / limitations, and I have a question about the gossip messages. I see that the little swarm of 20 nodes arranges neatly into clusters. I haven't figured out yet how to tell "real connections" apart from "passive" address book entries. But my question is about the messages. Every single node should receive 10 messages, yet I see quite a variation in the number of messages and especially in the ordering of the messages received. When I send all messages from a single node (the one that subscribed to the topic / created the ticket), things look a bit more homogeneous, but still some nodes receive only very few or no messages at all. When I send the messages from randomly selected nodes, things look very different from node to node in terms of numbers and ordering. The data is based on the discovery_stream and the gossip receiver event loop of every node. Are there any examples implemented in Rust only that replicate such scenarios, with a bunch of nodes and gossip messages being sent around?
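For reference, a rough sketch of the per-node broadcast/receive logic, loosely modelled on the chat example that ships with the iroh-gossip repository. Exact method and event-variant names differ between iroh-gossip versions (this assumes the ~0.3x API), and the setup of the `Endpoint` / `Router` that accepts the gossip ALPN is omitted, so treat it as a sketch rather than a verified program:

```rust
use anyhow::Result;
use futures_lite::StreamExt;
use iroh::NodeId;
use iroh_gossip::{
    net::{Event, Gossip, GossipEvent},
    proto::TopicId,
};

/// Join a topic, broadcast `n` numbered messages, then count what comes back.
async fn run_node(gossip: Gossip, topic_id: TopicId, bootstrap: Vec<NodeId>, n: u32) -> Result<()> {
    // On the node that creates the topic `bootstrap` can be empty; the other
    // nodes pass the node ids taken from the ticket.
    let topic = gossip.subscribe_and_join(topic_id, bootstrap).await?;
    let (sender, mut receiver) = topic.split();

    for i in 0..n {
        sender.broadcast(i.to_be_bytes().to_vec().into()).await?;
    }

    // Count received broadcasts; every other node should eventually see `n`
    // messages from this sender, but gossip does not guarantee ordering.
    let mut received = 0u64;
    while let Some(event) = receiver.try_next().await? {
        if let Event::Gossip(GossipEvent::Received(msg)) = event {
            received += 1;
            println!("received {} bytes (total {received})", msg.content.len());
        }
    }
    Ok(())
}
```

Since gossip is best-effort and unordered, the thing to check per node is whether the received count eventually converges to the expected total, not the order of arrival.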
Hi, do you mind describing in a bit more detail what happens? How do you observe "10, 5 not so much"? What are the symptoms, and what behaviour do you expect? Also, what is your network topology? Please describe your test and the observed problems in as much detail as possible so we know what to look for.
Hi flub. I'm not really sure what is happening, but it's basically an "uncontrolled" discovery cycle after N messages, with many magicsock disco events (I think the logs are the best description of the behaviour). Sometimes that loop blocks the terminal too (happened 1-2 times). What I expect, in particular when starting all nodes on a "trivial" loopback network interface, is a "constrained" number of discovery events (it should somehow correlate with the number of nodes). All nodes are started on the same network, i.e. the same machine (I haven't ventured into any distributed setups yet). I did a little test setup on GitHub; I can adjust it to whatever makes sense to test. I have very little detailed knowledge about iroh, so I'm not really sure what makes sense to test in detail. https://github.com/adiibanez/iroh_ex/actions/runs/14074228206/job/39414287222. I see similar behaviour if I change the code to the commented version below. I'm starting to suspect that there is something going on related to tokio?
Could you point to line numbers in the logs you posted for the "uncontrolled discovery cycle"?
@flub To summarize: this is supposed to be 100k messages.
Line numbers:
iroh_test_5_nodes_100k_messages.log:305046:2025-03-25T16:39:08.690946Z INFO iroh_ex: 💬 FROM: a9cc5ffeed MSG: 40074
iroh_test_10_nodes_100k_messages.log:438808:2025-03-25T16:41:21.578386Z INFO iroh_ex: 💬 FROM: e542d760bb MSG: 2886
Let me know if I can provide more specific logs / tests.
Apologies, I'm not sure I'm understanding your questions correctly. I assume you mean that after this line an "uncontrolled discovery cycle" is happening? But I don't see such a thing. Connections are being kept alive between 5 nodes, sure. But time is ticking along nicely and the connections look healthy to me.
I'm not really familiar with the standard behaviours and what is considered a normal lifecycle of endpoint / node / connection, etc. But I can tell with some degree of confidence that what I'm seeing is very likely not supposed to be happening. Are we on the same page that 100k messages are supposed to be delivered to all peers in the gossip topic? Not only 41k messages or whatever the logs show. I currently even wait a few seconds between connecting the nodes and sending out any messages, and wait another few seconds after all the messages have been sent. As mentioned, I'm happy to adjust the tests to whatever makes sense or helps identify the root of this behaviour.

2025-03-26T10:20:37.301704Z INFO iroh_ex: 💬 FROM: b8b5230cc3 MSG: MSG:41915

This is supposed to continue up to 100k messages. Instead, it stops processing any gossip messages and does a whole lot of the following, continuously (until the processes are killed or the test times out and kills them):

2025-03-26T10:20:39.169913Z WARN ep{me=f2b8b196c0}:magicsock:poll_recv:disco_in{node=02eac76f11 src=Udp(192.168.64.1:57639)}:handle_pong{m=Pong { tx_id: TransactionId(0x348F8E47B9FDE6CBDE5607AC), ping_observed_addr: Udp(192.168.64.1:59074) } src=Udp(192.168.64.1:57639)}: iroh::magicsock::node_map::node_state: received pong with unknown transaction id tx=348f8e47b9fde6cbde5607ac

The longest I kept it going in that state was ~10-15 min; iroh, or my way of using iroh, never recovered from that "discovery loop". The zip file contains megabytes of this type of activity.
@flub Not directly related to the issue, but somehow in the context of my attempt at understanding the constraints of broadcasting messages to a number of nodes. I completely abandoned sending many messages and am currently only using 10 nodes and 10 messages, sent from a randomly selected node with a random delay. I wait 2 seconds after connecting the nodes before any messages are sent. Whether I don't wait at all after sending those messages or wait 1 sec / 30 sec makes no difference: so far I have never seen all nodes receiving 10 messages. I see 6, 7, 8, 9, 10 messages on each of the nodes, but never 10 messages on all 10 nodes. There is a data structure at the end of the logs showing that behaviour. If I send all messages from the very same single node (e.g. the one that "creates" the ticket), I do see 10 messages on 10 nodes. Sometimes I have to wait 10-30 sec; a few times I saw 10 messages without any waiting. While waiting 1-30 sec I also see a lot of the very same discovery messages as in the "unresponsive state" described above. That's why I post this comment and upload the logs. Is what I'm describing to be expected, or is there something else going on in my integration of iroh?
I can say that the elixir binding iroh_ex has been stabilized at least to a degree that makes it usable, while still a bit rough. There is a livebook in case that has any relevance: https://github.com/adiibanez/iroh_ex. It's a bit "hacky", but it helped me figure out what was going on instead of staring at logfiles. The relevant parts, i.e. the UI elements, are at the end of the livebook. Regarding sending 100k messages vs. only seeing 40k being processed and then stopping / locking up: I was really hammering away in those scenarios where things became "unresponsive", basically with unbounded concurrency from the elixir side while sending out the messages. In the meantime I'm limiting that to the number of nodes. I haven't done any further testing in that regard, nor tested more than 100 nodes / 10k messages, but things seem to start getting funky / unresponsive at roughly 2x node count concurrency. Maybe some type of backpressure / overload protection is required? There were a number of issues in my rust / elixir integration too; some of those are solved by getting out of the way of the iroh event receiver loop as much as possible, i.e. not doing anything other than forwarding messages to elixir via a dedicated tokio mpsc channel (see the sketch below). Not quite sure why that event loop is so delicate. I thought I had rust concurrency figured out to a mid-level degree from earlier projects, but obviously I didn't. One of the conclusions after this rust concurrency experience was looking for actor-like approaches: https://github.com/slawlor/ractor ... modeled after erlang / elixir ;) https://slawlor.github.io/ractor/assets/rustconf2024_presentation.pdf I'll keep stabilizing the lib, but I could use some insider input on a few topics: Could I get some information about what the expected swarm shape looks like? For 50-100 nodes I mostly see 2-5 peers per node, sometimes a few nodes with only 1 peer, and very rarely a few nodes not connecting at all. Is there something like a connection ramp-up time that has to be respected before starting to stream away with messages? Currently I send these types of events to elixir:
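A minimal sketch of the forwarding pattern mentioned above: the gossip receiver loop does nothing except push events into a bounded tokio mpsc channel, and a separate task does the slow work (e.g. crossing into elixir). `GossipMsg` and `handle_in_elixir` are hypothetical placeholders; only the tokio primitives are real API:

```rust
use tokio::sync::mpsc;

// Hypothetical message type crossing from the gossip loop to the elixir side.
#[derive(Debug)]
struct GossipMsg {
    from: String,
    payload: Vec<u8>,
}

// Hypothetical stand-in for the slow path (e.g. handing the message to elixir).
async fn handle_in_elixir(msg: GossipMsg) {
    println!("forwarded {} bytes from {}", msg.payload.len(), msg.from);
}

#[tokio::main]
async fn main() {
    // Bounded channel: if the consumer falls behind, `send` waits instead of
    // letting the receiver loop buffer without limit.
    let (tx, mut rx) = mpsc::channel::<GossipMsg>(1024);

    // Consumer task: drains the channel and does all the heavy work.
    let consumer = tokio::spawn(async move {
        while let Some(msg) = rx.recv().await {
            handle_in_elixir(msg).await;
        }
    });

    // Stand-in for the gossip receiver loop: forward the event and go straight
    // back to polling the gossip stream; nothing else happens in this loop.
    for i in 0..10u32 {
        let msg = GossipMsg { from: "peer".into(), payload: i.to_be_bytes().to_vec() };
        if tx.send(msg).await.is_err() {
            break; // consumer task is gone
        }
    }
    drop(tx); // closing the channel ends the consumer loop
    consumer.await.unwrap();
}
```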
iroh-gossip is based on PlumTree, which constructs a minimal spanning tree when messages are broadcast. Due to the way the protocol works, it's most efficient if the node that's broadcasting stays the same as much as possible (but can still tolerate arbitrary nodes broadcasting).
Nodes not connecting at all shouldn't happen. We generally aim for a 5 connection limit, as iroh-gossip is supposed to work on mobile devices by default, but you can configure that limit. Yeah, there will be a rampup time, while HyParView messages propagate around the network to establish the active/passive views. That said, there's absolutely a good chance that we're losing more messages than we have to due to some bug. E.g. if some nodes end up with 0 connections to the rest of the swarm for longer periods, that'd be weird I think.
@matheus23 thanks for the explanations, very helpful. And good to know about the PlumTree broadcasting-node efficiency. I'm not aiming at 100% gossip reliability. My use case is sending sensor data for collaboration / visualization (most sensors don't have a high sampling rate, some are quite high like ECG / IMU, i.e. as high as feasible without introducing issues). Some basic "dynamic" batch size strategies are already in place. My questions are more about what I need to do, from the elixir node "orchestrator" side, to give iroh the best possible conditions for forming the swarm. In other words, I'm trying not to actively step on its toes regarding rust / tokio / event loops, as I might have been doing earlier. Any suggestions regarding the described peer counts, i.e. potential reasons why some nodes only use 1-2 peers while others use 3-5? I observe a "swarm forming" phase with changes, but after that it remains static. Should that be a continuous process? Maybe I'm still dropping some event loop unintentionally? Does waiting 1-3 seconds sound reasonable? Or should I rather do some basic calculation based on a certain number of nodes having e.g. N peers established? Or not bother at all and just go ahead with streaming data? The tests I'm doing with quick "swarm" iterations might not be representative of what will be going on with real devices. I'm just not really familiar with the behaviours of "open" swarms, to be honest, or with how iroh handles "queued up" messages internally. That's why I'm asking. The 0-1 peer node scenario is very rare; I just grabbed the opportunity to take a screenshot.
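One way to avoid guessing at a fixed 1-3 second ramp-up: gate the first broadcast on observed `NeighborUp` events from the gossip receiver instead of a sleep. A sketch, again assuming the ~0.3x iroh-gossip type and event names (`GossipReceiver`, `Event::Gossip(GossipEvent::NeighborUp(..))`), which differ between versions:

```rust
use anyhow::Result;
use futures_lite::StreamExt;
use iroh_gossip::net::{Event, GossipEvent, GossipReceiver};

/// Consume gossip events until at least `min_neighbors` NeighborUp events have
/// been seen, then return so the caller can start broadcasting.
async fn wait_for_neighbors(receiver: &mut GossipReceiver, min_neighbors: usize) -> Result<()> {
    let mut neighbors = 0;
    while neighbors < min_neighbors {
        match receiver.try_next().await? {
            Some(Event::Gossip(GossipEvent::NeighborUp(node_id))) => {
                neighbors += 1;
                println!("neighbor up: {node_id} ({neighbors}/{min_neighbors})");
            }
            Some(_other) => {} // ignore everything else during warm-up
            None => anyhow::bail!("gossip stream ended before the swarm formed"),
        }
    }
    Ok(())
}
```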
@b5 As discussed, here are a bunch of logs.
The test can be configured with whatever settings are required. 100 nodes work just fine; 10 or 5, not so much.
https://github.com/adiibanez/iroh_ex/blob/main/test/iroh_ex_test.exs#L6
This is quite an artificial test, not really reflecting what I need iroh to do.
Nevertheless, I would be interested in understanding the constraints when operating with small swarm sizes and lots of small messages, e.g. a bunch of high-sampling-rate sensors (see the batching sketch below).
iroh_backpressure_disco.zip
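For the small-swarm / many-small-messages case, one plain-tokio backpressure sketch (the batching approach referenced above): coalesce readings into batches and cap the number of in-flight broadcasts with a semaphore, so the producer side is slowed down instead of queueing without bound. `broadcast_batch` is a hypothetical stand-in for the real gossip send call:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};
use tokio::time::{interval, Duration};

// Hypothetical stand-in for the real gossip send call.
async fn broadcast_batch(batch: Vec<Vec<u8>>) {
    println!("broadcasting batch of {} readings", batch.len());
}

/// Coalesce small sensor readings into batches and cap in-flight broadcasts.
async fn batch_and_send(mut readings: mpsc::Receiver<Vec<u8>>) {
    let max_batch = 64; // flush once this many readings are buffered
    let max_in_flight = Arc::new(Semaphore::new(4)); // at most 4 concurrent broadcasts
    let mut tick = interval(Duration::from_millis(50)); // ...or flush on a deadline
    let mut buf: Vec<Vec<u8>> = Vec::new();

    loop {
        tokio::select! {
            maybe = readings.recv() => match maybe {
                Some(reading) => {
                    buf.push(reading);
                    if buf.len() < max_batch { continue; }
                }
                None => break, // sensor stream ended
            },
            _ = tick.tick() => {
                if buf.is_empty() { continue; }
            }
        }
        let batch = std::mem::take(&mut buf);
        // Waits here when 4 broadcasts are already in flight: this is the
        // backpressure that slows the producer side down.
        let permit = max_in_flight.clone().acquire_owned().await.unwrap();
        tokio::spawn(async move {
            broadcast_batch(batch).await;
            drop(permit);
        });
    }
    if !buf.is_empty() {
        broadcast_batch(buf).await; // flush whatever is left
    }
}

#[tokio::main]
async fn main() {
    let (tx, rx) = mpsc::channel(256);
    let worker = tokio::spawn(batch_and_send(rx));
    for i in 0u32..1000 {
        tx.send(i.to_be_bytes().to_vec()).await.unwrap();
    }
    drop(tx); // closing the channel lets the worker drain and finish
    worker.await.unwrap();
}
```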