end devices fail to respond a/aq during SRP registration #9416
Not sure if there's been any fundamental change in either the Matter SDK or OpenThread for this. I do often see the first SRP registration fail and then have to wait for the backoff before the retry. I have not dug into why that is. In the referenced discussion, it's unclear why @abtink felt waiting on /aq was necessary. Even without a direct link, the response should still come back? Maybe our initial timeout on /aq is too short?
I believe the Address Query issue should be addressed by: @abtink, thoughts?
Thanks @jwhui. Yes, the PR above would help (no need to perform an address query to find the SRP server's address if the address can be resolved from the related Netdata entry).
Thank you everyone! That PR looks great, I will test it right away.
I chatted a bit more with @AlanLCollins out of band; it sounds like the first a/aq fails 9 times out of 10. The optimization above will obviate the query, but that only hides what still seems like a potentially fundamental problem?
Actually, more than that: the original issue was the a/aq from device --> BR for the sake of SRP registration, but I think Alan's case is actually the reverse:
@ndyck14 Do you have any suggestions? The address query uses multicast, which is not as reliable as unicast. I recall some other ideas about having the BR update its own address cache entries from a received SRP registration (basically an enhanced snoop optimization that adds the registered addresses of the SRP client to the SRP server device's address cache table). This way we may avoid doing an address query from the BR. This is technically possible to implement but not that simple (it may require some cross-layer interactions).
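To make the snoop idea above concrete, here is a minimal sketch of an address cache that learns EID-to-RLOC16 mappings from accepted SRP registrations. It is a toy model, not OpenThread code: `AddressCache`, `snoop_srp_registration`, `resolve`, and the sample addresses are all hypothetical, and it assumes the SRP server layer can learn the sender's RLOC16 from the lower layers (the cross-layer part that makes the real change non-trivial).

```python
from dataclasses import dataclass, field


@dataclass
class AddressCache:
    """Toy EID-to-RLOC16 cache; not OpenThread's actual data structure."""
    entries: dict = field(default_factory=dict)
    queries_sent: int = 0

    def snoop_srp_registration(self, registered_eids, sender_rloc16):
        # Hypothetical hook, called when the SRP server accepts an update.
        # Each registered address is mapped to the RLOC16 the update came
        # from, so a later lookup does not need a multicast query.
        for eid in registered_eids:
            self.entries[eid] = sender_rloc16

    def resolve(self, eid):
        # Return the cached RLOC16, or None after counting the multicast
        # address query that would otherwise have to be sent.
        rloc16 = self.entries.get(eid)
        if rloc16 is None:
            self.queries_sent += 1
        return rloc16


cache = AddressCache()
cache.snoop_srp_registration(["fd00::1234"], sender_rloc16=0x4C01)
hit = cache.resolve("fd00::1234")   # learned from the SRP registration
miss = cache.resolve("fd00::9999")  # never registered, would need a/aq
```

In this model only the second lookup would trigger a query, which is the saving the optimization is after.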
@ndyck14 @AlanLCollins, can we get any more visibility into what/where messages are getting dropped? Do we know if it is the Address Query or the Address Notification that is being dropped? Can we get a packet capture? It shouldn't be failing 90% of the time.
Agreed on the 90% failure. I think this is in @AlanLCollins's court.
a/aq is the message getting dropped so the EID-to-RLOC mapping fails and CASE session times out. |
The previous assessment of 90% was incorrect. After analyzing the logs carefully (see attachment below), we realized that other unicast messages (MLE Parent Response, Discovery Response, etc.) were also having trouble completing until multiple retries allowed the flow to continue. We realized we had room for improvement at the platform level and fixed it. The occurrence of the a/aq failure should drop back to our previous number (10%, i.e. 1 in 10).
Reference: 25 back-to-back Matter commissioning attempts, 22 failed at a/aq timeout:
We are analyzing options to increase the robustness of the a/aq flow. Two options are on the table:
@AlanLCollins, thanks for sharing the pcap. Just to be sure we are looking at the same place, are you able to provide the packet number range for the 22nd iteration?
22 attempts failed in that .pcap. A couple of examples are packets 2311 and 3790. The a/aq does not complete, so the Parent does not continue with Matter CASE formation (Sigma1 message).
We continue to see sporadic reports and instances of this. Has anyone seen anything recent that could point to a better understanding or a fix?
@abtink did we end up doing the SRP table snooping?
We temporarily increased the MPL router retries to work around this issue (kMplRouterDataMessageTimerExpirations from 2 to 4). We will start a characterization effort to understand the impact on user experience in large-scale, high-density networks. However, I'd prefer to bring the MPL retries back down and find the real root cause of the original issue, which is associated with a potential race condition between SRP registrations and a/aq responses in the accessory device.
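As a rough back-of-the-envelope for why more MPL retries mask the problem, the sketch below estimates the chance that every copy of a multicast message is lost. It rests on two assumptions that are not taken from OpenThread: each timer expiration produces exactly one extra transmission on top of the initial one, and each transmission is lost independently with the same probability. The function name `all_copies_lost` and the 50% loss figure are purely illustrative.

```python
def all_copies_lost(loss_per_tx: float, expirations: int) -> float:
    """Probability that no copy of a multicast message gets through.

    Assumes one initial transmission plus one retransmission per timer
    expiration, with independent loss on each transmission.
    """
    transmissions = 1 + expirations
    return loss_per_tx ** transmissions


# Illustrative 50% per-transmission loss, before and after the workaround.
before = all_copies_lost(0.5, expirations=2)  # 0.125
after = all_copies_lost(0.5, expirations=4)   # 0.03125
```

Under this simple model the extra retries cut the failure rate but do not remove it, and if the receiver is unresponsive for a stretch of time (correlated loss rather than independent loss) the real benefit would be smaller still.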
Looking at the PCAP from #9416 (comment), it looks like there are some other issues that are causing the Address Query to fail. After sending an SRP Update, the joining device seems to be generally unresponsive for some time. Note all the SRP Response retransmissions, and the same again just before the parent sends the Address Query message. Given the unreliability of communication from the parent to the joining device, I don't expect this to be a very usable link in general. @alan-eero @ndyck14, thoughts?
@alan-eero, we merged the optimization to help address this issue: Can you provide feedback on whether this addresses the issue above?
Thanks @jwhui
@pragun01 , we already do something similar here: openthread/src/core/thread/mle_router.cpp Line 1333 in 51ab865
Changing the cache entry upon receiving the Address Solicit message can be problematic if the Address Response message cannot be delivered.
Thanks @jwhui. The cached address mapping is removed here; I agree we should use replace here as well. openthread/src/core/thread/mle_router.cpp Line 3687 in 51ab865
@pragun01 , as noted in #8987, the concern was:
But maybe we should have a check to only remove the entry if the device was not a direct child. @abtink, thoughts?
Sounds okay to me to do so. However, this optimization is tailored to a specific topology/situation where the BR (SRP server) is the leader and the device is a direct child. If the topology differs, address queries would still be needed.
Submitted: Again, I want to mention that this is a tailored optimization for a very specific topology.
Thanks @abtink
This issue is related to the conversation: #8202
During Matter commissioning, the accessory device needs to register the SRP record quickly so that Operational discovery succeeds. About 10% of CASE session creations fail due to an address query (a/aq) timeout. The backoff retry mechanism of /aq would recover, but it degrades the user experience from the Matter application's perspective.
This issue becomes more critical on devices that register more than one SRP record (not only for Matter) right after they join the network.
Has the community found more information beyond the discussion in #8202?
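To illustrate why relying on the backoff retry hurts commissioning time, here is a minimal sketch of a doubling retry schedule. The function `retry_delays` and the 15 s initial and 120 s maximum delays are assumptions chosen for illustration; check the actual address query configuration of your OpenThread build before drawing conclusions.

```python
def retry_delays(initial_delay_s: int, max_delay_s: int, attempts: int) -> list:
    """Delays (in seconds) before each retry, doubling up to a cap."""
    delays = []
    delay = initial_delay_s
    for _ in range(attempts):
        delays.append(delay)
        delay = min(delay * 2, max_delay_s)
    return delays


# Assumed values: 15 s initial retry delay, capped at 120 s.
schedule = retry_delays(initial_delay_s=15, max_delay_s=120, attempts=5)
# schedule == [15, 30, 60, 120, 120]
```

With numbers like these, even a single lost a/aq adds a delay of the order of the initial retry interval before CASE can proceed, which is long relative to a commissioning flow.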
CC: @ndyck14 , @gabekassel , @abtink