Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement DTLS restart #1846

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Implement DTLS restart #1846

wants to merge 2 commits into from

Conversation

Antonito
Copy link
Member

@Antonito Antonito commented Jun 26, 2021

pion/srtp#148 is needed before being able to merge

The unit test needs pion/transport#141

Fixes #1636

I'm not sure about if t.state != DTLSTransportStateNew && t.state != DTLSTransportStateClosed, one other option would be to set the state to New before restarting, but I don't want to trigger a StateChanged event – which could be wrong.
Turns out that's what libwebrtc does

Also the condition on which DTLS is restarted might be too permissive – but it doesn't seem to break any test so far.

@codecov
Copy link

codecov bot commented Jun 26, 2021

Codecov Report

Merging #1846 (ce2fd9b) into master (1765e9b) will decrease coverage by 6.91%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1846      +/-   ##
==========================================
- Coverage   76.89%   69.98%   -6.92%     
==========================================
  Files          87       64      -23     
  Lines        8937     3258    -5679     
==========================================
- Hits         6872     2280    -4592     
+ Misses       1647      860     -787     
+ Partials      418      118     -300     
Flag Coverage Δ
go ?
wasm 69.98% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
constants.go 0.00% <0.00%> (-100.00%) ⬇️
icecandidatepair.go 0.00% <0.00%> (-81.82%) ⬇️
rtpcodec.go 13.33% <0.00%> (-80.00%) ⬇️
stats.go 0.00% <0.00%> (-75.00%) ⬇️
track_local.go 0.00% <0.00%> (-66.67%) ⬇️
internal/mux/muxfunc.go 9.09% <0.00%> (-63.64%) ⬇️
icemux.go 0.00% <0.00%> (-54.55%) ⬇️
atomicbool.go 57.14% <0.00%> (-42.86%) ⬇️
sdpsemantics.go 0.00% <0.00%> (-34.49%) ⬇️
internal/mux/endpoint.go 24.13% <0.00%> (-34.49%) ⬇️
... and 36 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1765e9b...ce2fd9b. Read the comment docs.

@Sean-Der
Copy link
Member

@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)

I think we could make the test a lot simpler.

  • Create two PeerConnections and connect them.
  • Send one RTP packet
  • Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
  • Send one RTP packet

I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!

@Antonito
Copy link
Member Author

@Antonito This is absolutely amazing work, your PRs are the best I have seen in months. Gives me the energy to get back into it :)

I think we could make the test a lot simpler.

  • Create two PeerConnections and connect them.
  • Send one RTP packet
  • Create a new PeerConnection and generate an offer, and send it to one of the PeerConnections in the first flow.
  • Send one RTP packet

I think that gets us the coverage we need and keeps the LoC low still? Push back if I am wrong!

So, I tried to rewrite the unit test in a simpler way, and it uncovered several problems. I'm not sure if there are caused by the DTLS restart or were already there (I'm thinking the later, for most of them).

  • Sending a single RTP packet doesn't trigger an OnTrack event on the other peer. Sending two in a row doesn't either, but sending 1 / waiting a few ms / sendign another one works.
  • When ran multiple times in a row, the test sometimes logs pc ERROR: 2021/06/28 13:30:33 Incoming unhandled RTP ssrc(1595441956), OnTrack will not be fired. single media section has an explicit SSRC. I initially thought this was a difference in behavior with the test I already pushed (since with your method, the first peer connection doesn't go to disconnected or failed before the other one connects), but turns out it's also happening with this one – I just had lucky runs. It doesn't seem to be a race condition, tsan didn't detect anything.
  • I found out weird behaviors with DataChannels
    • I added a call to CreateDataChannel, behavior unchanged (as expected)
    • I then set an empty OnDataChannel callback -> this causes the test to sometimes hang (again I'm not doing anything more with DataChannels here)
    • I then tried to send data from ClientB and read with ClientA1 and ClientA2, sometime it hangs, but when it doesn't:
      • As expected, ClientA1 can read
      • After ClientA2 connects, ClientB tries to send data – which causes a io: read/write on closed pipe error log
      • ClientA2 can't read, since ClientB can't send anything (this is most likely an issue with DTLS restart)

@Antonito
Copy link
Member Author

I rewrote the unit test without pion/transport, and split it in two very similar tests, for debugging purpose:

  • TestPeerConnection_DTLS_Restart_MediaOnly
  • TestPeerConnection_DTLS_Restart_MediaAndDataChannel

Before merging, we should only keep TestPeerConnection_DTLS_Restart_MediaAndDataChannel.

SCTP restart is now implemented and DataChannels are re-opened, if needed.

There seems to be some kind of race condition, but I really don't think it's related to the stuff this PR modifies – I think the unit tests just show an already existing problem.
It's what I was describing above, and now have a bit more validation of what's really happening.

The TestPeerConnection_DTLS_Restart_MediaOnly behaves correctly, in go test. To make it work with tsan, I had to add an extra:

triggerMedia()

I think this is a consequence of the first point I was developing in my previous message (about RTP packets).

TestPeerConnection_DTLS_Restart_MediaAndDataChannel behaves correctly with go test, but hangs with tsan. I also tried to add a few extra RTP packets, but it doesn't change any thing. Somehow, the hang seems to be caused by the simple fact of calling CreateDataChannel and OnDataChannel.

@Antonito Antonito force-pushed the fix/issue-#1636 branch 3 times, most recently from bb27b83 to 2e55618 Compare March 4, 2022 07:21
@Antonito
Copy link
Member Author

Antonito commented Mar 4, 2022

Following up discussion with @davidzhao, I updated both this PR and pion/srtp#148.

In order to get some progress on this, I think we should first focus on merging pion/srtp#148 – we need to figure out @Sean-Der 's concerns about rolloverCounter and lastSequenceNumber. I was careful and really don't think this one introduces any security issue, but wouldn't mind a second opinion :)

As for this PR, it requires a bit more work, there are still issues with TestPeerConnection_DTLS_Restart_MediaAndDataChannel, and I'm far less confident about security issues here. Unfortunately I don't have much time to dedicate to this for now, I'd gladly welcome any help 😁

@davidzhao
Copy link
Member

Thanks for picking this back up @Antonito! We'd love to collaborate on this. Will have a look in a few weeks.

@edaniels
Copy link
Member

hey @Antonito, do you still want to work together to get this in?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
4 participants