[BUG] dTLS handshake race condition #628

Open · szatanjl opened this issue Nov 8, 2024 · 0 comments

szatanjl commented Nov 8, 2024

TL;DR

There is a race condition in the handling of the DTLS handshake, at least in the latest version, v0.11.0.
Specifically, it is in the tokio::select! in read_and_buffer() (dtls/src/conn/mod.rs, line 832).

Read below for further details.

What I did

I created a simple program that connects two peers over WebRTC and sends messages via a data channel.
It is the most basic WebRTC chat app.
I implemented it three times: in JS for web browsers, in JS for Node, and in Rust using this WebRTC crate.

You can find the code on my GitHub: https://github.com/szatanjl/rtc-share-poc.

Browser/node <-> browser/node works without any issues in any combination.

Trying to connect a web browser with the Rust implementation, however, does not work.
It works maybe 1 out of 10 times. The WebRTC connection always succeeds and the connectionState is "connected", but the data channel rarely opens and ondatachannel is rarely called.

I debugged the issue and concluded that there is a race condition in this crate's code.

I created a patch with additional println!() statements to show what is wrong (patch 0001-prints.patch).

Good

When everything works and the data channel opens up, I see something like this in the logs:

TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake done RX
-- Chat open

As you can see from the logs, the handshake completes and succeeds, and we open our data channel ("chat").
We "luckily" take the _ = ctx.handshake_done_rx.recv() => { branch of the tokio::select! in read_and_buffer().

Bad

When it doesn't work, though, I see something like this:

(...)
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake TX
TEST handshake RX 1
TEST handshake done TX
-- RTC state: peer: connected, ice: connected, gathering: complete, signal: stable
TEST handshake TX

As you can see from the logs, the handshake completes and succeeds, but we do not open the data channel.
We get "unlucky" and take the _ = ctx.handshake_tx.send(done_tx) => { branch of the tokio::select! in read_and_buffer().
Because of this we land in the infinite while loop, since done_rx.recv() never receives a message.

Cause

I am pretty sure this has nothing to do with the network.
I investigated it with Wireshark and all the traffic is correct.
Web browsers (Firefox or Chromium) behave correctly, send all the necessary packets, and nothing is lost.

What happens is that there is a race condition in that select: we should never land in the while loop, since the handshake is already done.
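
To make this concrete, here is a standalone sketch that models the mechanics as I understand them. This is not the crate's code; the channel names just mirror the ones in read_and_buffer(), and the capacities are my assumption:

```rust
// Standalone model of the race, assuming tokio = { version = "1", features = ["full"] }.
use std::time::Duration;
use tokio::sync::mpsc;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    for i in 0..10 {
        // Stands in for ctx.handshake_done_rx: the "handshake done"
        // notification has already been sent, so recv() is ready.
        let (done_notify_tx, mut handshake_done_rx) = mpsc::channel::<()>(1);
        done_notify_tx.send(()).await.unwrap();

        // Stands in for ctx.handshake_tx: nobody will pick up done_tx
        // anymore, but the channel has spare capacity, so send() is
        // ready as well.
        let (handshake_tx, _keep_alive_rx) = mpsc::channel::<mpsc::Sender<()>>(1);
        let (done_tx, mut done_rx) = mpsc::channel::<()>(1);

        // Both branches are ready, so select! picks one at random.
        tokio::select! {
            _ = handshake_done_rx.recv() => {
                println!("run {i}: lucky, saw handshake done");
            }
            _ = handshake_tx.send(done_tx) => {
                // In the crate this wait is unbounded; the timeout is
                // only here so the demo terminates.
                match timeout(Duration::from_millis(100), done_rx.recv()).await {
                    Ok(_) => println!("run {i}: received done"),
                    Err(_) => println!("run {i}: unlucky, stuck waiting on done_rx"),
                }
            }
        }
    }
}
```

Running this prints a mix of "lucky" and "unlucky" runs, showing that select! really does just roll the dice between the two ready branches.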

Workarounds

I can see two workarounds for this issue.

Workaround 1

The first is to "repeat" the _ = ctx.handshake_done_rx.recv() => { branch inside the tokio::select! inside the while loop, and to terminate that loop if the handshake is done.

This way the loop is no longer infinite, and even if we get "unlucky" in the first select it still works out.

See 0002-workaround-1.patch.
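
For illustration only, the shape of the change is roughly this (simplified from the patch, reusing the existing ctx and done_rx variables; the real loop has more in it):

```rust
_ = ctx.handshake_tx.send(done_tx) => {
    loop {
        tokio::select! {
            _ = done_rx.recv() => break,
            // Repeated done check: lets the loop exit even when we
            // took this branch after the handshake already finished.
            _ = ctx.handshake_done_rx.recv() => break,
        }
    }
}
```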

Workaround 2

The second option is to use a biased select and switch the order of the futures.
This way we should never get "unlucky", because handshake_done_rx.recv() is always polled before handshake_tx.send().

I am less certain of the correctness of this one, but it has worked so far in my tests and never produced the race condition.

See 0002-workaround-2.patch.
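
Again for illustration, the select would become something like:

```rust
tokio::select! {
    // `biased` makes select! poll branches top to bottom instead of
    // in random order, so the done check always wins when it is ready.
    biased;

    _ = ctx.handshake_done_rx.recv() => {
        // handshake already finished, nothing more to wait for
    }
    _ = ctx.handshake_tx.send(done_tx) => {
        // hand done_tx to the handshake task and wait, as before
    }
}
```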

Proper solution

I believe the proper solution is to never put mpsc::Sender::send() in the tokio::select! in the first place, and so to get rid of that select entirely.
Unfortunately I don't know exactly what that would have to look like in the code.
I find it hard to fully grasp what is happening there; it is a bit too complex for me, so I don't want to propose what a proper solution should look like.
Hopefully you can figure that out with better knowledge of the codebase.

I think the author may have incorrectly assumed that send() blocks until the other end of the channel receives, but that is not the case.

Quoting the Tokio docs on mpsc::Sender::send(): "Sends a value, waiting until there is capacity."
send() doesn't wait for the other end to receive; it only waits until there is capacity in the channel, which in this case means it doesn't wait at all.
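
You can see this with a tiny standalone example (again, not crate code):

```rust
use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // Capacity of 1 and a live receiver, but recv() is never called.
    let (tx, _rx) = mpsc::channel::<u32>(1);
    // Returns immediately: there is capacity, so send() does not wait
    // for the value to actually be received.
    tx.send(1).await.unwrap();
    println!("send() completed with no receive");
}
```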

All of this means that we are just rolling dice.
tokio::select! can always choose the handshake_tx.send() branch even though we don't want it to, because the handshake is already done.

Final words

I am attaching the three patches mentioned above.
In case of any questions, I'll try to help.

0001-prints.patch.txt
0002-workaround-1.patch.txt
0002-workaround-2.patch.txt
