There is a race condition in the handling of the DTLS handshake, at least in the latest version, v0.11.0.
Specifically, it is in the `tokio::select!` in `read_and_buffer()` at `dtls/src/conn/mod.rs`, line 832.
Read below for further details.
## What I did
I created a simple program that connects two peers over WebRTC and sends messages via a data channel — the most basic WebRTC chat app.
I implemented it three times: in JS for web browsers, in JS for Node, and in Rust using this WebRTC crate.
You can find the code on my GitHub: https://github.com/szatanjl/rtc-share-poc.
browser/node <-> browser/node works without any issues in any combination.
Connecting a web browser with the Rust implementation, however, does not work.
It succeeds maybe 1 out of 10 times. The WebRTC connection itself always works and the connectionState is "connected", but the data channel rarely opens and `ondatachannel` is rarely called.
I debugged the issue and concluded that there is a race condition in this crate's code.
I created a patch adding `println!()` statements to showcase what is wrong (patch 0001-prints.patch).
### Good
When everything works and the data channel opens, I see logs like:
```
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake done RX
-- Chat open
```
As you can see from the logs, the handshake succeeds and we open our data channel ("chat").
We "luckily" hit the `_ = ctx.handshake_done_rx.recv() => {` branch of the `tokio::select!` in `read_and_buffer()`.
### Bad
When it does not work, I see something like this:
```
(...)
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake TX
```
As you can see from the logs, the handshake has succeeded, but we do not open the data channel.
We get "unlucky" and hit the `_ = ctx.handshake_tx.send(done_tx) => {` branch of the `tokio::select!` in `read_and_buffer()`.
Because of this we land in the infinite while loop, which never receives the `done_rx.recv()` message.
## Cause
I am fairly sure this has nothing to do with the network.
I investigated it with Wireshark, and all the traffic is correct.
Web browsers (Firefox and Chromium) behave correctly, send all the necessary packets, and nothing is lost.
What happens is a race condition in that select: we should never land in the while loop, since the handshake is already done.
## Workarounds
I can see two workarounds for this issue.
### Workaround 1
The first is to "repeat" the `_ = ctx.handshake_done_rx.recv() => {` branch in the `tokio::select!` inside the while loop and terminate the loop once the handshake is done.
The loop is then no longer infinite, and even if we get "unlucky" in the first select it still works out.
See 0002-workaround-1.patch.
### Workaround 2
The second option is to use a `biased` select and switch the order of the futures.
This way we should never get "unlucky": `handshake_done_rx.recv()` is always prioritized over `handshake_tx.send()`.
I am less certain of the correctness of this one, but it has worked in my tests so far and never produced the race.
See 0002-workaround-2.patch.
## Proper solution
I believe the proper solution is to never put `mpsc::Sender::send()` in the `tokio::select!` in the first place, i.e. to get rid of the select entirely.
Unfortunately, I don't know exactly what the code would have to look like then.
I find it hard to fully grasp what is happening in the code; it is a bit too complex for me, so I wouldn't want to propose what a proper solution should look like.
Hopefully you can figure that out with your better knowledge of the codebase.
I suspect the author assumed that `send()` blocks until the other end of the channel receives, but that is not the case.
Quoting the tokio docs on `mpsc::Sender::send()`: "Sends a value, waiting until there is capacity."
`send()` does not wait for the other end to receive; it only waits until there is capacity in the channel, which in this case means it does not wait at all.
All of this means we are just rolling dice: `tokio::select!` can always choose the `handshake_tx.send()` branch even though we don't want it to, because the handshake is already done.
## Final words
I am attaching the three patches mentioned above.
If you have any questions, I'll be happy to help.
- 0001-prints.patch.txt
- 0002-workaround-1.patch.txt
- 0002-workaround-2.patch.txt