Link behavior with multiple RNodes #567
-
I've been experimenting with the link example code provided in the Reticulum package, and modified it with my almost nonexistent Python skills to act as a server and client(s) over RNodes. I understand that RNodes (being LoRa) have bandwidth constraints, but I'm now wondering how flow control works in this case. I can run my 'link server' on a PC connected to a TCP transport with a few clients polling that server, and everything works well. My server code sends announces periodically (every 60 s or so), and each client grabs that announce and establishes a link with the server. The clients then transmit a packet over the link, wait for an ack from the server, and loop with a 10 s delay between packets. This works well over TCP transport, but with RNodes I run into issues. Question 1: How does the RNode interface handle flow control / channel-busy situations? If I run parallel clients (3 of them) against 1 server, links time out and terminate pretty randomly. I've been monitoring my 4 RNodes (1 server, 3 clients) with an SDR, but I can't determine whether there is any channel-busy monitoring on the RNodes, or how I should handle this situation. Do I have to manage delays myself, or how should the RNode interface be used in this case? Question 2: The same question applies to announces and the clients' connection attempts after an announce. What happens if several clients hear an announce and try to connect almost immediately afterwards? Pictured here: just two clients towards one server:
Replies: 5 comments 14 replies
-
Update: I focused first on the 'announce to connection' step between one 'link server' and three 'link clients'. Maybe I'm missing some RNS fundamentals here, but going from announce to connect randomly fails for me between multiple RNodes. Question: Could someone comment on how this should be implemented, or is my approach totally wrong?
-
Okay, some of this may seem simple, and I'm sorry if you already know much of what I'm talking about, but I find it best to start from the ground up.

First, the RNode uses CSMA (Carrier Sense Multiple Access), meaning only a single radio can transmit at once. That means the first thing we need to do is look at bandwidth usage. Depending on your parameters, you can expect data rates similar to 80s and 90s modems: SF8BW250CR5 gets you around 6250 baud (bits per second), while SF11BW125CR5 will get you around 537 baud. Calculator here. Anything below 500 will likely fail in use, especially the way you're expecting.

I'm also going to say you're announcing way too fast. Usually once per day is enough, and I'm usually unhappy if I announce more than once every six hours, or once every hour for super-specialty cases, like the EchoBot for people on the Testnet, just so they don't see an empty net (less of an issue than it was a year ago). Now, your net, your rules (so long as it's not linked to anyone else's), but not only will that potentially trigger flood control, you're going to be very disappointed in what happens on your own net.

Section 4.6.3 in the manual has example sizes for packets, bearing in mind any data contained therein is in excess of that. The announce is 167 bytes plus the app data (and likely also the ratchet in the newest version of RNS), so we'll use 200 bytes for planning numbers. For the fast settings shown, this makes each announce around 214 ms. On the slow, long-range settings, it's 2.9 seconds. Link establishment takes multiple packets of around 100 bytes, each taking around 107-1500 ms, which makes multiple radios attempting to respond within a second problematic. Even with CSMA operating effectively, they'll keep delaying their broadcasts because the band is full of other traffic. You can watch the waterfall display on the radio to see the state of the airwaves, and if it's just one solid block of white, there's no room to talk on the air.
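The rough airtime figures above can be sanity-checked with the standard Semtech LoRa time-on-air formula. A small sketch follows; the exact results depend on preamble length, header mode and low-data-rate optimization, so they won't match the back-of-envelope numbers above exactly, but they land in the same ballpark:

```python
import math

def lora_airtime(payload_bytes, sf, bw_hz, cr_denom=5,
                 preamble_syms=8, explicit_header=True, crc=True):
    """Approximate LoRa time-on-air in seconds, per the Semtech
    SX127x datasheet formula. Treat the result as an estimate."""
    t_sym = (2 ** sf) / bw_hz
    # Low data rate optimization is typically enabled when t_sym > 16 ms
    de = 1 if t_sym > 0.016 else 0
    ih = 0 if explicit_header else 1
    cr = cr_denom - 4  # CR5 -> 1, CR8 -> 4
    n_payload = 8 + max(
        math.ceil((8 * payload_bytes - 4 * sf + 28 + 16 * int(crc) - 20 * ih)
                  / (4 * (sf - 2 * de))) * (cr + 4), 0)
    t_preamble = (preamble_syms + 4.25) * t_sym
    return t_preamble + n_payload * t_sym

# A ~200 byte announce at the fast and slow settings discussed above:
fast = lora_airtime(200, sf=8, bw_hz=250_000)   # roughly a quarter second
slow = lora_airtime(200, sf=11, bw_hz=125_000)  # several seconds
```

The point is the ratio: the same announce costs more than an order of magnitude more airtime on the long-range settings.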
Providing a delay in seconds before responding to the announce (which you shouldn't do, but more on that later) based on a random hash (I'd choose the destination and just select, say, 1-16 seconds based on the last byte) may help you overcome collision issues.

Also, there's a built-in limit of 2% of the bandwidth allocated for announces. Based on the speed settings above, this comes out to 940 and 75 bytes per minute for announces. This means either the announcer is taking up around a quarter of the allocated announce time, or it's using all of it and blocking itself and all other announces for two minutes. During that period nobody will receive any keys, making all new nodes unreachable. I believe your problem is in there, but some follow-ups.

There's no reason to announce that quickly. An announce will be cached on any system that hears it, and if you know a destination but not its keys, any transport node will provide its copy on request. While rapid announces may initially seem like a good idea for new nodes on a network, a node only needs to know a destination or see an announce once every week or so, and it'll operate just fine. We've covered the traffic implications of announcing too rapidly.

And you really shouldn't set up links on announce. It's just a bad idea, for reasons you're learning. You can open links when you need them to transmit or receive data, and can happily shut them down and establish new ones as needed. Doing this even drastically improves security, as each new link uses new symmetric keys. I'm not entirely sure of your use case, but links should be created when they need to be made. I think this would solve most of your problems, unless every node is trying to send data all the time starting immediately, which is another bandwidth issue.

Since you have an SDR, I suggest setting up a single client and server, then monitoring the airwaves.
Check the duty cycle with the settings you're using and determine whether your current model is scalable to the number of nodes you'd like to use. So, my suggestions:
If none of these seem to be the problem, let us know and we'll try to troubleshoot, but I think it's an issue of thinking of Reticulum the way you think of TCP, which gets people into trouble. Especially over RNode, it's a very quiet system that doesn't require constant network-level handholding. Of course, if your data needs exceed what the hardware can do, that's another issue that can't be resolved in software.
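As a rough way to do that scalability check, channel occupancy can be estimated from the data rates quoted earlier. A sketch, ignoring preambles, headers and CSMA backoff (so it's a lower bound); the 100-byte packet and ~60-byte ack sizes here are assumptions for illustration:

```python
def duty_cycle(num_clients, packet_bytes, ack_bytes,
               interval_s, data_rate_bps):
    """Very rough channel-occupancy estimate: total airtime per
    polling interval divided by the interval length."""
    per_exchange = (packet_bytes + ack_bytes) * 8 / data_rate_bps
    return num_clients * per_exchange / interval_s

# Three clients sending 100-byte packets (plus assumed ~60-byte acks)
# every 10 s, at the two data rates discussed above:
fast = duty_cycle(3, 100, 60, 10, 6250)  # a few percent of the channel
slow = duty_cycle(3, 100, 60, 10, 537)   # the majority of the channel
```

At the slow settings, three clients polling every 10 seconds already occupy most of the channel before retries and announces are even counted, which is consistent with links timing out.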
-
Tested a bit more, and I believe this is something related to the CSMA functionality. I created a simplified test where I separated the announce-to-connect step into a manual process and just focused on sending from three client nodes to one server node simultaneously:
Any thoughts on this?
-
Guys, this was the science I was hoping for. Markqvist's comment, "statistically speaking, the nature of the setup provides that randomness itself", is exactly what I was thinking. My desktop development setup is far from that, and it brought up this issue, probably without a real cause in the first place. However, this setup of mine (4 RNodes) fails to work if I choose to connect directly in response to an announce, so maybe I'll implement a random delay based on faragher's comment to overcome this. Another thing I found out yesterday: when operating 3 link connections to the same node, if I choose to send within ~2 seconds of another node sending, it times out. I'll try to produce some detailed information about that later on.
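A minimal sketch of that random-delay idea, assuming the delay is derived from the last byte of a destination hash as faragher suggested. The client names and hashes here are hypothetical stand-ins; real RNS destination hashes would be used instead:

```python
import hashlib

def announce_response_delay(destination_hash: bytes,
                            min_s: int = 1, max_s: int = 16) -> int:
    """Deterministic per-client delay (in seconds) derived from the
    last byte of the destination hash, so clients hearing the same
    announce don't all try to establish a link at the same instant."""
    span = max_s - min_s + 1
    return min_s + (destination_hash[-1] % span)

# Hypothetical 16-byte destination hashes for three clients:
delays = {}
for name in (b"client-a", b"client-b", b"client-c"):
    dest_hash = hashlib.sha256(name).digest()[:16]
    delays[name] = announce_response_delay(dest_hash)
    # ...sleep delays[name] seconds here before establishing the link
```

Since each node hashes to a different value, the responses spread out over the 1-16 s window instead of colliding in the first slot.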
-
Thanks a lot for all the great input everyone, and welcome to the discussion @GUVWAF, appreciate your input and specificity!
This is reasonable, and how I originally intended it when I rewrote the CSMA to work for SX126x as well, but as it turned out, practical reality was quite a bit different from the theoretical performance of the system outlined in the datasheet. In a real-world noisy channel, there are a lot of false CAD detections from the modem, and raising the DCD status in the firmware on a single event leads to substantially increased latency, and in many cases worse real-world CSMA performance. While I would have loved to optimise it into a more ideal solution, the method of waiting 3 milliseconds and then checking the modem CAD status again was arrived at experimentally as a reliable way to filter out false CAD events, without sacrificing too much performance or incurring too much extra latency. It's not ideal, but the best practical solution I could come up with within the time limits I was working under.
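The 3 ms re-check described above can be sketched roughly as follows. This is Python for illustration only, not the actual firmware (which is C++); `read_cad` is an assumed callable standing in for reading the modem's CAD status:

```python
import time

def channel_busy(read_cad, recheck_delay_s=0.003):
    """Debounced carrier detection as described above: a single CAD
    event from the modem may be a false positive, so DCD is only
    raised if CAD is still asserted ~3 ms later."""
    if not read_cad():
        return False          # no carrier detected at all
    time.sleep(recheck_delay_s)
    return read_cad()         # True only if the event persists

# A fake modem that reports one spurious CAD event, then goes quiet:
glitch = iter([True, False])
assert channel_busy(lambda: next(glitch, False)) is False
```

The trade-off is exactly as stated: every genuine carrier detection now costs an extra delay, but one-off false positives no longer stall the transmitter.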
Yes, but only if P is high enough that simultaneous transmission is likely to occur! I think it's safe to say, though, that the current P value and slope can bear adjustment. There was actually a typo in the P value calculation that made pMax even higher than it should have been, further compounding the problem. I'm adjusting and improving this slightly now, but finding a balance between latency and collision probability is not easy, especially given the wide range of modulation characteristics, and thereby on-air symbol times, of LoRa in general.

To illustrate more visually how the P-curves affect latency and collision probability, consider the following two examples plotting the compound probability of transmission after Y number of CSMA slots, for different P-curves (the X-axis being airtime, the Y-axis being the number of slots passed, and the Z-axis being the probability of transmission): Now, collision probability is obviously not plotted here directly, but you should get the idea.

One of the main contributors to latency in this system is of course the CSMA slot time, and as @GUVWAF noticed, it is currently fixed at 50 ms. If you look at this line of the source code, you will see that the intention was actually to set the slot time dynamically based on the LoRa symbol time of the configured modulation parameters, but it was ultimately disabled and fixed at 50 ms, since (apparently without much system to it) it would perform very badly for some SF/BW combinations. Again, I didn't have enough time to fully diagnose this and design a slot time calculation that actually worked well in all cases, so I had to settle on a simpler approach that worked reasonably well in almost all cases. If we can create a way of calculating ideal slot times for all possible SF/BW combinations, that will be a first big step towards optimising the CSMA performance and latency.
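The compound probability being plotted is easy to compute. A sketch, under the simplifying assumption of a constant per-slot persistence P (the actual firmware varies P with airtime, as the curves show):

```python
def p_transmitted_by_slot(p, n_slots):
    """Compound probability that a p-persistent CSMA station has
    transmitted after n_slots clear slots: 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n_slots

def expected_wait_slots(p):
    """Mean number of clear slots waited before transmitting
    (geometric distribution)."""
    return 1.0 / p

# With the fixed 50 ms slot, lowering P reduces the chance that two
# stations pick the same slot, but inflates latency proportionally:
slot_s = 0.050
latencies = {p: expected_wait_slots(p) * slot_s for p in (0.5, 0.25, 0.1)}
```

This makes the tension concrete: at P = 0.1 the average access delay is already half a second per transmission attempt, which is why the slot time itself matters so much.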
As with all things CSMA, I think careful modelling of the system is very important, but as experience has shown, validating those models in practical, real-world setups is just as important, because things don't always act exactly like the theory and datasheets would have us presume!

I will probably release an update of the firmware relatively soon, with some conservative improvements to the CSMA. As you may have noticed in the above graphs, I added a backoff speed parameter, which slews the P-curve forward in time, so the drop-off occurs more rapidly. I will probably down-adjust the default P-values as well, which should help a bit too.

All of that being said, those changes will not fundamentally solve the problems, and to really get somewhere here, we need a reliable way of automatically calculating optimal slot times for different modulation configurations, since this will allow much more room for "modulating" the P-curves without incurring too much latency. Adding a contention window is also potentially a very good approach, but again, it has a much greater effect on latency when your slot time is measured in tens of milliseconds, not microseconds as in 802.11.
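One plausible shape for such a slot-time calculation, sketched under the assumption that a slot should span a fixed number of symbol times, clamped to a sane range. These constants are purely illustrative, not the firmware's actual values:

```python
def lora_symbol_time(sf, bw_hz):
    """LoRa symbol duration in seconds: 2^SF / BW."""
    return (2 ** sf) / bw_hz

def csma_slot_time(sf, bw_hz, symbols_per_slot=12,
                   min_s=0.010, max_s=0.250):
    """Hypothetical dynamic CSMA slot time: a fixed number of symbol
    durations, clamped so that very fast modulations don't produce
    uselessly short slots and very slow ones don't explode latency."""
    slot = symbols_per_slot * lora_symbol_time(sf, bw_hz)
    return min(max(slot, min_s), max_s)

# Fast settings yield short slots, slow settings long ones:
fast_slot = csma_slot_time(8, 250_000)   # on the order of 10 ms
slow_slot = csma_slot_time(11, 125_000)  # on the order of 200 ms
```

Whether a scheme like this performs well across all SF/BW combinations is exactly the open question raised above; the sketch only shows the kind of calculation being discussed.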
After giving this a proper think-over, there most probably isn't anything strange going on with the CSMA itself - it's working as it was designed to do, it's just that I had forgotten one quite important, but deliberate, assumption that the algorithm makes:
That in clear-channel conditions, it is statistically quite improbable that distinct devices will receive an outbound packet from their host at almost the exact same time (within a millisecond or two). This is a reasonable assumption for how networks normally function, but in your case, you directly circumvented that assumption by simply synchronizing all transmitters almost perfectly ;)
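A toy simulation of that assumption, using an assumed ~3 ms carrier-detection window: when all transmitters are handed packets at the same instant, none can sense the others before keying up, whereas naturally jittered traffic rarely collides. The model and its constants are illustrative only:

```python
import random

def collision_rate(offsets_fn, trials=2000, nodes=3, cad_s=0.003):
    """Toy model: each node starts transmitting at some offset; a
    node can only sense the channel busy if another transmission
    began more than cad_s earlier. Counts the fraction of trials in
    which any two start times fall within the CAD window."""
    rng = random.Random(42)  # fixed seed for reproducibility
    collisions = 0
    for _ in range(trials):
        starts = sorted(offsets_fn(rng, nodes))
        if any(b - a < cad_s for a, b in zip(starts, starts[1:])):
            collisions += 1
    return collisions / trials

# All nodes handed a packet at the same instant (the synchronized test):
synced = collision_rate(lambda rng, n: [0.0] * n)
# Nodes handed packets at random moments over a 2 s window:
jittered = collision_rate(lambda rng, n: [rng.uniform(0, 2.0) for _ in range(n)])
```

Under this model the synchronized case collides on every trial, while the jittered case collides only on the rare occasions two start times land within a few milliseconds of each other, which is the statistical randomness referred to above.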
In clear-channel conditions, the CSMA P-…