End devices fail to respond to a/aq during SRP registration #9416

Open

AlanLCollins opened this issue Sep 11, 2023 · 26 comments

@AlanLCollins
This issue is related to the conversation in #8202.
During Matter commissioning, the accessory device needs to register its SRP record quickly so that operational discovery succeeds. Roughly 10% of CASE session establishments fail due to an address query (a/aq) timeout. The backoff retry mechanism of a/aq eventually recovers, but the delay degrades the user experience from the Matter application's perspective.

This issue becomes more critical on devices that register additional SRP records (not only the Matter ones) right after they join the network.

Has the community found more information beyond the discussion in ticket #8202?

CC: @ndyck14, @gabekassel, @abtink

@ndyck14
Contributor

ndyck14 commented Sep 11, 2023

Not sure if there's been any fundamental change in either the Matter SDK or OpenThread for this.

I do often see that the first SRP registration fails and must wait for the backoff before retrying; I have not dug into why that is. In the referenced discussion, it's unclear why @abtink felt waiting on a/aq was necessary. Even without a direct link, the response should still come back. Maybe our initial timeout on a/aq is too short?

@jwhui added the comp:srp label Sep 12, 2023
@jwhui
Member

jwhui commented Sep 12, 2023

I believe the Address Query issue should be addressed by:

@abtink, thoughts?

@abtink
Member

abtink commented Sep 12, 2023

Thanks @jwhui.

Yes, the PR above would help: there would be no need to perform an address query to find the SRP server's address if the address can be resolved from the related Network Data entry.
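
To illustrate the path this optimizes: with the public OpenThread API below, the SRP client selects its server from the DNS/SRP service entries in Thread Network Data, which is the case where the server address can be resolved without an a/aq round trip. A minimal usage sketch (the auto-start call is the real public API; treating it as the relevant path here is my reading of the PR description above):

```cpp
#include <openthread/srp_client.h>

// Enable SRP client auto-start: the client picks its SRP server from the
// "DNS/SRP Service" entries advertised in Thread Network Data. With the
// optimization above, the server's address comes from that entry directly,
// so no multicast address query is needed before registering.
void EnableSrpClientAutoStart(otInstance *aInstance)
{
    otSrpClientEnableAutoStartMode(aInstance, /* aCallback */ nullptr, /* aContext */ nullptr);
}
```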

@AlanLCollins
Author

Thank you, everyone! That PR looks great; I will test it right away.
Cheers!

@ndyck14
Contributor

ndyck14 commented Sep 12, 2023

I chatted a bit more with @AlanLCollins out of band; it sounds like 9 out of 10 times the first a/aq fails. The optimization above will obviate the query, but that only hides what still seems like a potentially fundamental problem.

@ndyck14
Contributor

ndyck14 commented Sep 12, 2023

Actually, more than that: the original issue was an a/aq from the device to the BR for the sake of SRP registration, but I think Alan's case is actually the reverse:

> Yeah, I realize that. SRP completes OK; the problem is a/aq: the accessory does not respond to it, and it consistently happens when it overlaps with the SRP registration flow.

@abtink
Member

abtink commented Sep 12, 2023

@ndyck14 Do you have any suggestions?

The address query uses multicast, which is not as reliable as unicast.

I recall another idea: have the BR update its own address cache entries from a received SRP registration (basically an enhanced snoop optimization that adds the SRP client's registered addresses to the SRP server device's address cache table). This way we may avoid the address query from the BR altogether. This is technically possible to implement but not that simple (it may require some cross-layer interactions).
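
Roughly what that cross-layer hook could look like, sketched in OpenThread core style. This is illustrative only, not the actual implementation: the hook point `HandleSrpHostRegistered` is hypothetical, while `AddressResolver::UpdateSnoopedCacheEntry` is the existing snoop-optimization entry point:

```cpp
// Illustrative sketch: after the SRP server accepts an Update, feed each of
// the client's registered host addresses into the address cache, keyed to
// the RLOC16 the Update arrived from. Later traffic to these EIDs on this
// device then needs no multicast address query.
void HandleSrpHostRegistered(Instance &aInstance, const Srp::Server::Host &aHost, uint16_t aClientRloc16)
{
    uint8_t             numAddresses;
    const Ip6::Address *addresses = aHost.GetAddresses(numAddresses);

    for (uint8_t i = 0; i < numAddresses; i++)
    {
        aInstance.Get<AddressResolver>().UpdateSnoopedCacheEntry(addresses[i], aClientRloc16,
                                                                 aInstance.Get<Mle::Mle>().GetRloc16());
    }
}
```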

@jwhui
Member

jwhui commented Sep 13, 2023

@ndyck14 @AlanLCollins, can we get any more visibility into what/where messages are getting dropped? Do we know if it is the Address Query or the Address Notification that is being dropped? Can we get a packet capture? It shouldn't be failing 90% of the time.

@ndyck14
Contributor

ndyck14 commented Sep 13, 2023

Agreed that it shouldn't be failing 90% of the time. I think this is in @AlanLCollins's court.

@AlanLCollins
Author

The a/aq is the message getting dropped, so the EID-to-RLOC mapping fails and the CASE session times out.
I will provide sniffer logs early next week.

@AlanLCollins
Author

AlanLCollins commented Sep 19, 2023

The previous assessment of 90% was incorrect. After analyzing the logs carefully (see the attachment below), we realized that other unicast messages (MLE Parent Response, Discovery Response, etc.) were also struggling to complete until multiple retries allowed the flow to continue. We found and fixed an issue at the platform level. The occurrences of the a/aq failure should now be reduced back to our previous number (~10%, i.e., 1 in 10).

For reference, 25 back-to-back Matter commissioning runs, of which 22 failed at the a/aq timeout:
MatterCASEfailures_aq-timeout.zip
Network key = 26445c1f1b2291344ae70586db94bba5

We are analyzing options to increase the robustness of the a/aq flow. Two options are on the table:

  • Increase the retry timing for the a/aq broadcasts.
  • Abtin's suggestion above: use information from other layers (e.g., SRP registration / DNS Update messages) to establish the EID-to-RLOC mapping without sending an a/aq.

@jwhui
Member

jwhui commented Sep 21, 2023

@AlanLCollins, thanks for sharing the pcap. Just to be sure we are looking at the same place, are you able to provide the packet number range for the 22nd iteration?

@AlanLCollins
Author

22 attempts failed in that .pcap; a couple of examples are packets 2311 and 3790. The a/aq does not complete, so the parent does not continue with Matter CASE formation (Sigma1 message).
However, due to the platform issue I mentioned in my previous message, I don't think it's worth spending much time on that .pcap log. Our QA team is trying to reproduce with the platform fix and a cleaner environment.

@ndyck14
Contributor

ndyck14 commented Feb 23, 2024

We continue to see sporadic reports and instances of this. Has anyone seen anything recently that could point to a better understanding or a fix?

@ndyck14
Contributor

ndyck14 commented Feb 24, 2024

Should we consider relaxing the 15-second retry interval, @jwhui @abtink? We have no guarantee that an a/aq is received by a device, so if the first one fails, any application retries are futile (unless they are very slow, i.e., spaced beyond the 15-second delay).
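
For reference, that interval maps to OpenThread's address-query configuration constants, which can be overridden at build time. The values shown are the defaults as I understand them; verify against your OpenThread version before relying on them:

```cpp
// Example build-time overrides (e.g., in a platform's OpenThread core config
// header). The timeout is how long (in seconds) the requester waits for an
// Address Notification; between failed queries the retry delay starts at the
// initial value and backs off up to the maximum.
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_TIMEOUT 3             // default 3 s
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_INITIAL_RETRY_DELAY 5 // default 15 s; lowered here as an experiment
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_MAX_RETRY_DELAY 120   // default 120 s
```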

@ndyck14
Contributor

ndyck14 commented Feb 24, 2024

@abtink did we end up doing the SRP table snooping?

@alan-eero

We temporarily increased the MPL router retries to work around this issue (kMplRouterDataMessageTimerExpirations from 2 to 4). We will start a characterization effort to understand the impact on user experience in large-scale/high-density networks. However, I'd prefer to reduce the MPL retries back and find the real root cause of the original issue, which appears to be a race condition between SRP registrations and a/aq responses in the accessory device.
This issue was present in different products, but it was more common in products that register more/extra SRP records beyond the normal Matter _tcp and _udp ones.
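
For anyone reproducing that workaround, the constant lives in OpenThread's IPv6 MPL module; the snippet below sketches the local patch (the exact file and declaration style vary across OpenThread versions):

```cpp
// MPL (Multicast Protocol for Low-power and Lossy Networks) retransmission
// bound: how many times a router retransmits a buffered multicast data
// message. Raising it from the upstream default of 2 to 4 makes multicast
// a/aq delivery more likely, at the cost of extra multicast traffic.
static constexpr uint8_t kMplRouterDataMessageTimerExpirations = 4; // upstream default: 2
```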

@jwhui
Member

jwhui commented Feb 27, 2024

Looking at the PCAP from #9416 (comment), it looks like there are some other issues that are causing the Address Query to fail.

After sending an SRP Update, the joining device seems to be generally unresponsive for some time. Note all the SRP Response retransmissions.

[Screenshot: pcap excerpt showing repeated SRP Response retransmissions after the SRP Update]

And again just before the parent sends the Address Query message.

[Screenshot: pcap excerpt showing retransmissions just before the parent sends the Address Query]

Given the unreliability of communication from the parent to the joining device, I don't expect this to be a very usable link in general.

@alan-eero @ndyck14, thoughts?

@jwhui
Member

jwhui commented Mar 12, 2024

@alan-eero, we merged the optimization to help address this issue:

Can you provide feedback on whether this addresses the issue above?

@pragun01

pragun01 commented Mar 21, 2024

Thanks @jwhui.
The change below, which caches the EID-to-RLOC16 mapping during the SRP update, solved the problem to some extent: it covers the case where a child sends an SRP Update message while joining the network. But in scenarios where the RLOC16 changes, such as a REED being promoted to Router, the cache entry is removed from the address resolver, a multicast Address Query is needed to re-establish the mapping, and we hit the same problem due to unreliable multicast. During the promotion of a REED to Router, instead of removing the mapping from the cache, can we update the cached mapping with the new Router ID?

#9881

@jwhui
Member

jwhui commented Mar 21, 2024

@pragun01, we already do something similar here:

Get<AddressResolver>().ReplaceEntriesForRloc16(aRxInfo.mNeighbor->GetRloc16(), router->GetRloc16());

Changing the cache entry upon receiving the Address Solicit message can be problematic if the Address Solicit Response message cannot be delivered.

@pragun01

pragun01 commented Mar 21, 2024

Thanks @jwhui.

The cached address mapping is removed here; I agree we should use replace here as well.

Get<AddressResolver>().RemoveEntriesForRloc16(oldRloc16);
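
The proposal amounts to something like the following change at that call site (sketch only; `newRloc16` stands for the RLOC16 assigned in the Address Solicit Response, and whether this is safe is exactly the concern discussed below):

```cpp
// Proposed (sketch): rewrite the cached entries to the newly assigned RLOC16
// instead of dropping them, so no fresh multicast Address Query is needed.
Get<AddressResolver>().ReplaceEntriesForRloc16(oldRloc16, newRloc16);
```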

@jwhui
Member

jwhui commented Mar 21, 2024

@pragun01, as noted in #8987, the concern was:

  On leader, when we successfully reply to an "Address Solicit"
  message and assign a new RLOC16 to a node, we clear all entries
  associated with old RLOC16. We do not change to new RLOC16 since we
  cannot be sure that child will successfully receive the "Address
  Solicit" response.

But maybe we should have a check to only remove the entry if the device was not a direct child.
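
Something along these lines, as a sketch (`wasDirectChild` is a placeholder for however the leader would determine that the soliciting node was its own child; both resolver calls are the existing ones quoted above):

```cpp
// Sketch of the suggested check: if the node soliciting a new RLOC16 was our
// own (direct) child, delivery of the response is reasonably reliable, so
// rewrite its cache entries in place; otherwise keep the conservative removal.
if (wasDirectChild)
{
    Get<AddressResolver>().ReplaceEntriesForRloc16(oldRloc16, newRloc16);
}
else
{
    Get<AddressResolver>().RemoveEntriesForRloc16(oldRloc16);
}
```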

@abtink, thoughts?

@abtink
Member

abtink commented Mar 21, 2024

> But maybe we should have a check to only remove the entry if the device was not a direct child.

Sounds okay to me to do so. However, this optimization is tailored to a specific topology/situation where the BR (SRP server) is the leader and the device is a direct child. If the topology differs, address queries would still be needed.

@abtink
Member

abtink commented Mar 21, 2024

Submitted:

Again, I want to mention that this is a tailored optimization for a very specific topology.

@pragun01

Thanks @abtink
