Link behavior with multiple RNodes #567
-
I've been experimenting with the link example code provided in the Reticulum package, and modified it with my almost nonexistent Python skills to act as a server and client(s) over RNodes. I understand that RNodes (being LoRa) have bandwidth constraints, but I'm now wondering how flow control works in this case. I can run my 'link server' on a PC connected to a TCP transport with a few clients polling that server, and everything works well. My server code sends announces periodically (every 60 s or so), and each client grabs that announce and establishes a link with the server. The clients then transmit a packet over the link, wait for an ack from the server, and loop with a 10 s delay between packets. This works well over TCP transport, but with RNodes I run into issues. Question 1: How does the RNode interface handle flow control / channel-busy situations? If I run parallel clients (3 of them) against 1 server, links time out and terminate pretty randomly. I've been monitoring my 4 RNodes (1 server, 3 clients) with an SDR, but I can't determine whether there is any channel-busy monitoring on the RNodes, or how I should handle this situation. Do I have to manage delays myself, or how should the RNode interface be used in this case? Question 2: The same question applies to announces and the clients' connection attempts after an announce. What happens if several clients hear an announce and try to connect almost immediately afterwards? Pictured here: just two clients towards one server:
Replies: 5 comments 14 replies
-
Update: I focused first on the 'announce to connection' step between one 'link server' and three 'link clients'. Maybe I'm missing some RNS fundamentals here, but going from announce to connect randomly fails for me between multiple RNodes. Question: Could someone comment on how this should be implemented, or is my approach totally wrong?
-
Okay, some of this may seem simple, and I'm sorry if you already know much of what I'm talking about, but I find it best to start from the ground up.

First, the RNode uses CSMA (Carrier Sense Multiple Access), meaning only a single radio can transmit at once. That means the first thing we need to do is look at bandwidth usage. Depending on your parameters, you can expect data rates similar to 80s and 90s modems: SF8BW250CR5 gets you around 6250 baud (bits per second), while SF11BW125CR5 will get you around 537 baud. Calculator here. Anything below 500 will likely fail in use, especially the way you're expecting.

I'm also going to say you're announcing way too fast. Usually once per day is enough, and I'm usually unhappy if I announce more than once every six hours, or once every hour for super-specialty cases, like the EchoBot for people on the Testnet, just so they don't see an empty net (less of an issue than it was a year ago). Now, your net, your rules (so long as it's not linked to anyone else's), but not only will that potentially trigger flood control, you're going to be very disappointed in what happens on your own net.

Section 4.6.3 in the manual has example sizes for packets, bearing in mind any data contained therein is in excess of that. The announce is 167 bytes plus the app data (and likely also the ratchet in the newest version of RNS), so we'll use 200 bytes for planning numbers. For the fast settings shown, this makes each announce around 214 ms. On the slow, long-range settings, it's 2.9 seconds. Link establishment takes multiple packets of around 100 bytes, each taking around 107-1500 ms, which makes multiple radios attempting to respond within a second problematic. Even with CSMA operating effectively, they'll keep delaying their broadcasts because the band is full of other traffic. You can watch the waterfall display on the radio to see the state of the airwaves, and if it's just one solid block of white, there's no room to talk on the air.
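The rough airtime figures above can be sanity-checked with the standard Semtech LoRa time-on-air formula. A small sketch follows; the exact results depend on preamble length, header mode and low-data-rate optimization, so they won't match the back-of-envelope numbers above exactly, but they land in the same ballpark:

```python
import math

def lora_airtime(payload_bytes, sf, bw_hz, cr_denom=5,
                 preamble_syms=8, explicit_header=True, crc=True):
    """Approximate LoRa time-on-air in seconds, per the Semtech
    SX127x datasheet formula. Treat the result as an estimate."""
    t_sym = (2 ** sf) / bw_hz
    # Low data rate optimization is typically enabled when t_sym > 16 ms
    de = 1 if t_sym > 0.016 else 0
    ih = 0 if explicit_header else 1
    cr = cr_denom - 4  # CR5 -> 1, CR8 -> 4
    n_payload = 8 + max(
        math.ceil((8 * payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * ih)
                  / (4 * (sf - 2 * de))) * (cr + 4), 0)
    t_preamble = (preamble_syms + 4.25) * t_sym
    return t_preamble + n_payload * t_sym

# A ~200 byte announce at the fast and slow settings discussed above:
fast = lora_airtime(200, sf=8, bw_hz=250_000)   # roughly a quarter second
slow = lora_airtime(200, sf=11, bw_hz=125_000)  # several seconds
```

The point is the ratio: the same announce costs more than an order of magnitude more airtime on the long-range settings.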
Providing a delay in seconds before responding to the announce (which you shouldn't do, but more on that later) based on a random hash (I'd choose the destination and just select, say, 1-16 seconds based on the last byte) may help you overcome collision issues.

Also, there's a built-in limit of 2% of the bandwidth allocated for announces. Based on the speed settings above, this comes out to 940 and 75 bytes per minute for announces. This means either the announcer is taking up around a quarter of the allocated announce time, or it's using all of it and blocking itself and all other announces for two minutes. During that period nobody will receive any keys, making all new nodes unreachable. I believe your problem is in there, but some follow-ups.

There's no reason to announce that quickly. An announce will be cached on any system that hears it, and if you know a destination but not its keys, any transport node will provide its copy on request. While rapid announces may initially seem like a good idea for new nodes on a network, a node only needs to know a destination or see an announce once every week or so, and it'll operate just fine. We've covered the traffic implications of announcing too rapidly.

And you really shouldn't set up links on announce. It's just a bad idea, for reasons you're learning. You can open links when you need them to transmit or receive data, and can happily shut them down and establish new ones as needed. Doing this even drastically improves security, as each new link uses new symmetric keys. I'm not entirely sure of your use case, but links should be created when they need to be made. I think this would solve most of your problems, unless every node is trying to send data all the time starting immediately, which is another bandwidth issue.

Since you have an SDR, I suggest setting up a single client and server, then monitoring the airwaves.
Check the duty cycle with the settings you're using and determine whether your current model is scalable to the number of nodes you'd like to use. So, my suggestions:
If none of these seem to be the problem, let us know and we'll try to troubleshoot, but I think it's an issue of thinking of Reticulum the way you think of TCP, which gets people into trouble. Especially over RNode, it's a very quiet system that doesn't require constant network-level handholding. Of course, if your data needs exceed what the hardware can do, that's another issue that can't be resolved in software.
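As a rough way to do that scalability check, channel occupancy can be estimated from the data rates quoted earlier. A sketch, ignoring preambles, headers and CSMA backoff (so it's a lower bound); the 100-byte packet and ~60-byte ack sizes here are assumptions for illustration:

```python
def duty_cycle(num_clients, packet_bytes, ack_bytes,
               interval_s, data_rate_bps):
    """Very rough channel-occupancy estimate: total airtime per
    polling interval divided by the interval length."""
    per_exchange = (packet_bytes + ack_bytes) * 8 / data_rate_bps
    return num_clients * per_exchange / interval_s

# Three clients sending 100-byte packets (plus assumed ~60-byte acks)
# every 10 s, at the two data rates discussed above:
fast = duty_cycle(3, 100, 60, 10, 6250)  # a few percent of the channel
slow = duty_cycle(3, 100, 60, 10, 537)   # the majority of the channel
```

At the slow settings, three clients polling every 10 seconds already occupy most of the channel before retries and announces are even counted, which is consistent with links timing out.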
-
Tested a bit more, and I believe this is something related to the CSMA functionality. I created a simplified test where I separated the announce-to-connect step into a manual process and just focused on sending from three client nodes to one server node simultaneously:
Any thoughts on this?
-
Guys, this was the science I was hoping for. Markqvist's comment, "statistically speaking, the nature of the setup provides that randomness itself", is exactly what I was thinking. My desktop development setup is far from that, and it brought up this issue, probably without a real cause in the first place. However, this setup of mine (4 RNodes) fails to work if I choose to connect directly in response to an announce, so maybe I'll implement a random delay based on faragher's comment to overcome this. Another thing I found out yesterday: when operating 3 link connections to the same node, if I choose to send within ~2 seconds of another node sending, it times out. I'll try to produce some detailed information about that later on.
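A minimal sketch of that random-delay idea, assuming the delay is derived from the last byte of a destination hash as faragher suggested. The client names and hashes here are hypothetical stand-ins; real RNS destination hashes would be used instead:

```python
import hashlib

def announce_response_delay(destination_hash: bytes,
                            min_s: int = 1, max_s: int = 16) -> int:
    """Deterministic per-client delay (in seconds) derived from the
    last byte of the destination hash, so clients hearing the same
    announce don't all try to establish a link at the same instant."""
    span = max_s - min_s + 1
    return min_s + (destination_hash[-1] % span)

# Hypothetical 16-byte destination hashes for three clients:
delays = {}
for name in (b"client-a", b"client-b", b"client-c"):
    dest_hash = hashlib.sha256(name).digest()[:16]
    delays[name] = announce_response_delay(dest_hash)
    # ...sleep delays[name] seconds here before establishing the link
```

Since each node hashes to a different value, the responses spread out over the 1-16 s window instead of colliding in the first slot.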
-
Thanks a lot for all the great input everyone, and welcome to the discussion @GUVWAF, appreciate your input and specificity!
This is reasonable, and how I originally intended it when I rewrote the CSMA to work for SX126x as well, but as it turned out, practical reality was quite a bit different from the theoretical performance of the system outlined in the datasheet. In a real-world noisy channel, there are a lot of false CAD detections from the modem, and raising the DCD status in the firmware on a single event leads to substantially increased latency, and in many cases worse real-world CSMA performance. While I would have loved to optimise it into a more ideal solution, the method of waiting 3 milliseconds and then checking the modem CAD status again was arrived at experimentally as a reliable way to filter out false CAD events, without sacrificing too much performance or incurring too much extra latency. It's not ideal, but the best practical solution I could come up with within the time limits I was working under.
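The 3 ms re-check described above can be sketched roughly as follows. This is Python for illustration only, not the actual firmware (which is C++); `read_cad` is an assumed callable standing in for reading the modem's CAD status:

```python
import time

def channel_busy(read_cad, recheck_delay_s=0.003):
    """Debounced carrier detection as described above: a single CAD
    event from the modem may be a false positive, so DCD is only
    raised if CAD is still asserted ~3 ms later."""
    if not read_cad():
        return False          # no carrier detected at all
    time.sleep(recheck_delay_s)
    return read_cad()         # True only if the event persists

# A fake modem that reports one spurious CAD event, then goes quiet:
glitch = iter([True, False])
assert channel_busy(lambda: next(glitch, False)) is False
```

The trade-off is exactly as stated: every genuine carrier detection now costs an extra delay, but one-off false positives no longer stall the transmitter.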
Yes, but only if P is high enough that simultaneous transmission is likely to occur! I think it's safe to say, though, that the current P value and slope can bear adjustment. There was actually a typo in the P value calculation that made pMax even higher than it should have been, further compounding the problem. I'm adjusting and improving this slightly now, but finding a balance between latency and collision probability is not easy, especially given the wide range of modulation characteristics, and thereby on-air symbol times, of LoRa in general.

To illustrate more visually how the P-curves affect latency and collision probability, consider the following two examples plotting the compound probability of transmission after Y number of CSMA slots, for different P-curves (the X-axis being airtime, the Y-axis being the number of slots passed, and the Z-axis being the probability of transmission): Now, collision probability is obviously not plotted here directly, but you should get the idea.

One of the main contributors to latency in this system is of course the CSMA slot time, and as @GUVWAF noticed, it is currently fixed at 50 ms. If you look at this line of the source code, you will see that the intention was actually to set the slot time dynamically based on the LoRa symbol time of the configured modulation parameters, but it was ultimately disabled and fixed at 50 ms, since (apparently without much system to it) it would perform very badly for some SF/BW combinations. Again, I didn't have enough time to fully diagnose this and design a slot time calculation that actually worked well in all cases, so I had to settle on a simpler approach that worked reasonably well in almost all cases. If we can create a way of calculating ideal slot times for all possible SF/BW combinations, that will be a first big step towards optimising the CSMA performance and latency.
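The compound probability being plotted is easy to compute. A sketch, under the simplifying assumption of a constant per-slot persistence P (the actual firmware varies P with airtime, as the curves show):

```python
def p_transmitted_by_slot(p, n_slots):
    """Compound probability that a p-persistent CSMA station has
    transmitted after n_slots clear slots: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n_slots

def expected_wait_slots(p):
    """Mean number of clear slots waited before transmitting
    (geometric distribution)."""
    return 1.0 / p

# With the fixed 50 ms slot, lowering P reduces the chance that two
# stations pick the same slot, but inflates latency proportionally:
slot_s = 0.050
latencies = {p: expected_wait_slots(p) * slot_s for p in (0.5, 0.25, 0.1)}
```

This makes the tension concrete: at P = 0.1 the average access delay is already half a second per transmission attempt, which is why the slot time itself matters so much.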
As with all things CSMA, I think careful modelling of the system is very important, but as experience has shown, validating those models in practical, real-world setups is just as important, because things don't always act exactly like the theory and datasheets would have us presume!

I will probably release an update of the firmware relatively soon, with some conservative improvements to the CSMA. As you may have noticed in the above graphs, I added a backoff speed parameter, which slews the P-curve forward in time, so the drop-off occurs more rapidly. I will probably down-adjust the default P-values as well, which should help a bit too.

All of that being said, those changes will not fundamentally solve the problems, and to really get somewhere here, we need a reliable way of automatically calculating optimal slot times for different modulation configurations, since this will allow much more room for "modulating" the P-curves without incurring too much latency. Adding a contention window is also potentially a very good approach, but again, it has a much greater effect on latency when your slot time is measured in tens of milliseconds, not microseconds as in 802.11.
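One plausible shape for such a slot-time calculation, sketched under the assumption that a slot should span a fixed number of symbol times, clamped to a sane range. These constants are purely illustrative, not the firmware's actual values:

```python
def lora_symbol_time(sf, bw_hz):
    """LoRa symbol duration in seconds: 2^SF / BW."""
    return (2 ** sf) / bw_hz

def csma_slot_time(sf, bw_hz, symbols_per_slot=12,
                   min_s=0.010, max_s=0.250):
    """Hypothetical dynamic CSMA slot time: a fixed number of symbol
    durations, clamped so that very fast modulations don't produce
    uselessly short slots and very slow ones don't explode latency."""
    slot = symbols_per_slot * lora_symbol_time(sf, bw_hz)
    return min(max(slot, min_s), max_s)

# Fast settings yield short slots, slow settings long ones:
fast_slot = csma_slot_time(8, 250_000)   # on the order of 10 ms
slow_slot = csma_slot_time(11, 125_000)  # on the order of 200 ms
```

Whether a scheme like this performs well across all SF/BW combinations is exactly the open question raised above; the sketch only shows the kind of calculation being discussed.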
After giving this a proper think-over, there most probably isn't anything strange going on with the CSMA itself - it's working as it was designed to do, it's just that I had forgotten one quite important, but deliberate, assumption that the algorithm makes:
That in clear-channel conditions, it is statistically quite improbable that distinct devices will receive an outbound packet from their host at almost the exact same time (within a millisecond or two). This is a reasonable assumption for how networks normally function, but in your case, you directly circumvented that assumption by simply synchronizing all transmitters almost perfectly ;)
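A toy simulation of that assumption, using an assumed ~3 ms carrier-detection window: when all transmitters are handed packets at the same instant, none can sense the others before keying up, whereas naturally jittered traffic rarely collides. The model and its constants are illustrative only:

```python
import random

def collision_rate(offsets_fn, trials=2000, nodes=3, cad_s=0.003):
    """Toy model: each node starts transmitting at some offset; a
    node can only sense the channel busy if another transmission
    began more than cad_s earlier. Counts the fraction of trials in
    which any two start times fall within the CAD window."""
    rng = random.Random(42)  # fixed seed for reproducibility
    collisions = 0
    for _ in range(trials):
        starts = sorted(offsets_fn(rng, nodes))
        if any(b - a < cad_s for a, b in zip(starts, starts[1:])):
            collisions += 1
    return collisions / trials

# All nodes handed a packet at the same instant (the synchronized test):
synced = collision_rate(lambda rng, n: [0.0] * n)
# Nodes handed packets at random moments over a 2 s window:
jittered = collision_rate(lambda rng, n: [rng.uniform(0, 2.0) for _ in range(n)])
```

Under this model the synchronized case collides on every trial, while the jittered case collides only on the rare occasions two start times land within a few milliseconds of each other, which is the statistical randomness referred to above.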
In clear-channel conditions, the CSMA P-…