End devices fail to respond to a/aq during SRP registration #9416

Open

AlanLCollins opened this issue Sep 11, 2023 · 26 comments

@AlanLCollins
This issue is related to the conversation in #8202.
During Matter commissioning, the accessory device needs to register its SRP record quickly so that operational discovery succeeds. Roughly 10% of CASE session establishments fail due to an address query (a/aq) timeout. The backoff retry mechanism of a/aq eventually recovers, but the delay degrades the user experience from the Matter application's perspective.

This issue becomes more critical on devices that register additional SRP records (not only the Matter ones) right after they join the network.

Has the community found more information beyond the discussion in ticket #8202?

CC: @ndyck14, @gabekassel, @abtink

@ndyck14
Contributor

ndyck14 commented Sep 11, 2023

Not sure if there's been any fundamental change in either the Matter SDK or OpenThread for this.

I do often see that the first SRP registration fails and must wait for the backoff before retrying; I have not dug into why that is. In the referenced discussion, it's unclear why @abtink felt waiting on a/aq was necessary. Even without a direct link, the response should still come back. Maybe our initial timeout on a/aq is too short?

@jwhui added the comp:srp label Sep 12, 2023
@jwhui
Member

jwhui commented Sep 12, 2023

I believe the Address Query issue should be addressed by:

@abtink, thoughts?

@abtink
Member

abtink commented Sep 12, 2023

Thanks @jwhui.

Yes, the PR above would help: there would be no need to perform an address query to find the SRP server's address if the address can be resolved from the related Network Data entry.
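
To illustrate the path this optimizes: with the public OpenThread API below, the SRP client selects its server from the DNS/SRP service entries in Thread Network Data, which is the case where the server address can be resolved without an a/aq round trip. A minimal usage sketch (the auto-start call is the real public API; treating it as the relevant path here is my reading of the PR description above):

```cpp
#include <openthread/srp_client.h>

// Enable SRP client auto-start: the client picks its SRP server from the
// "DNS/SRP Service" entries advertised in Thread Network Data. With the
// optimization above, the server's address comes from that entry directly,
// so no multicast address query is needed before registering.
void EnableSrpClientAutoStart(otInstance *aInstance)
{
    otSrpClientEnableAutoStartMode(aInstance, /* aCallback */ nullptr, /* aContext */ nullptr);
}
```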

@AlanLCollins
Author

Thank you, everyone! That PR looks great; I will test it right away.
Cheers!

@ndyck14
Contributor

ndyck14 commented Sep 12, 2023

I chatted a bit more with @AlanLCollins out of band; it sounds like 9 out of 10 times the first a/aq fails. The optimization above will obviate the query, but that only hides what still seems like a potentially fundamental problem.

@ndyck14
Contributor

ndyck14 commented Sep 12, 2023

Actually, more than that: the original issue was an a/aq from the device to the BR for the sake of SRP registration, but I think Alan's case is actually the reverse:

> Yeah, I realize that. SRP completes OK; the problem is a/aq: the accessory does not respond to it, and it consistently happens when it overlaps with the SRP registration flow.

@abtink
Member

abtink commented Sep 12, 2023

@ndyck14 Do you have any suggestions?

The address query uses multicast, which is not as reliable as unicast.

I recall another idea: have the BR update its own address cache entries from a received SRP registration (basically an enhanced snoop optimization that adds the SRP client's registered addresses to the SRP server device's address cache table). This way we may avoid the address query from the BR altogether. This is technically possible to implement but not that simple (it may require some cross-layer interactions).
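
Roughly what that cross-layer hook could look like, sketched in OpenThread core style. This is illustrative only, not the actual implementation: the hook point `HandleSrpHostRegistered` is hypothetical, while `AddressResolver::UpdateSnoopedCacheEntry` is the existing snoop-optimization entry point:

```cpp
// Illustrative sketch: after the SRP server accepts an Update, feed each of
// the client's registered host addresses into the address cache, keyed to
// the RLOC16 the Update arrived from. Later traffic to these EIDs on this
// device then needs no multicast address query.
void HandleSrpHostRegistered(Instance &aInstance, const Srp::Server::Host &aHost, uint16_t aClientRloc16)
{
    uint8_t             numAddresses;
    const Ip6::Address *addresses = aHost.GetAddresses(numAddresses);

    for (uint8_t i = 0; i < numAddresses; i++)
    {
        aInstance.Get<AddressResolver>().UpdateSnoopedCacheEntry(addresses[i], aClientRloc16,
                                                                 aInstance.Get<Mle::Mle>().GetRloc16());
    }
}
```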

@jwhui
Member

jwhui commented Sep 13, 2023

@ndyck14 @AlanLCollins, can we get any more visibility into what/where messages are getting dropped? Do we know if it is the Address Query or the Address Notification that is being dropped? Can we get a packet capture? It shouldn't be failing 90% of the time.

@ndyck14
Contributor

ndyck14 commented Sep 13, 2023

Agreed that it shouldn't be failing 90% of the time. I think this is in @AlanLCollins's court.

@AlanLCollins
Author

The a/aq is the message getting dropped, so the EID-to-RLOC mapping fails and the CASE session times out.
I will provide sniffer logs early next week.

@AlanLCollins
Author

AlanLCollins commented Sep 19, 2023

The previous assessment of 90% was incorrect. After analyzing the logs carefully (see the attachment below), we realized that other unicast messages (MLE Parent Response, Discovery Response, etc.) were also struggling to complete until multiple retries allowed the flow to continue. We found and fixed an issue at the platform level. The occurrences of the a/aq failure should now be reduced back to our previous number (~10%, i.e., 1 in 10).

For reference, 25 back-to-back Matter commissioning runs, of which 22 failed at the a/aq timeout:
MatterCASEfailures_aq-timeout.zip
Network key = 26445c1f1b2291344ae70586db94bba5

We are analyzing options to increase the robustness of the a/aq flow. Two options are on the table:

  • Increase the retry timing for the a/aq broadcasts.
  • Abtin's suggestion above: use information from other layers (e.g., SRP registration / DNS Update messages) to establish the EID-to-RLOC mapping without sending an a/aq.

@jwhui
Member

jwhui commented Sep 21, 2023

@AlanLCollins, thanks for sharing the pcap. Just to be sure we are looking at the same place, are you able to provide the packet number range for the 22nd iteration?

@AlanLCollins
Author

22 attempts failed in that .pcap; a couple of examples are packets 2311 and 3790. The a/aq does not complete, so the parent does not continue with Matter CASE formation (Sigma1 message).
However, due to the platform issue I mentioned in my previous message, I don't think it's worth spending much time on that .pcap log. Our QA team is trying to reproduce with the platform fix and a cleaner environment.

@ndyck14
Contributor

ndyck14 commented Feb 23, 2024

We continue to see sporadic reports and instances of this. Has anyone seen anything recently that could point to a better understanding or a fix?

@ndyck14
Contributor

ndyck14 commented Feb 24, 2024

Should we consider relaxing the 15-second retry interval, @jwhui @abtink? We have no guarantee that an a/aq is received by a device, so if the first one fails, any application retries are futile (unless they are very slow, i.e., spaced beyond the 15-second delay).
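
For reference, that interval maps to OpenThread's address-query configuration constants, which can be overridden at build time. The values shown are the defaults as I understand them; verify against your OpenThread version before relying on them:

```cpp
// Example build-time overrides (e.g., in a platform's OpenThread core config
// header). The timeout is how long (in seconds) the requester waits for an
// Address Notification; between failed queries the retry delay starts at the
// initial value and backs off up to the maximum.
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_TIMEOUT 3             // default 3 s
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_INITIAL_RETRY_DELAY 5 // default 15 s; lowered here as an experiment
#define OPENTHREAD_CONFIG_TMF_ADDRESS_QUERY_MAX_RETRY_DELAY 120   // default 120 s
```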

@ndyck14
Contributor

ndyck14 commented Feb 24, 2024

@abtink did we end up doing the SRP table snooping?

@alan-eero

We temporarily increased the MPL router retries to work around this issue (kMplRouterDataMessageTimerExpirations from 2 to 4). We will start a characterization effort to understand the impact on user experience in large-scale/high-density networks. However, I'd prefer to reduce the MPL retries back and find the real root cause of the original issue, which appears to be a race condition between SRP registrations and a/aq responses in the accessory device.
This issue was present in different products, but it was more common in products that register more/extra SRP records beyond the normal Matter _tcp and _udp ones.
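
For anyone reproducing that workaround, the constant lives in OpenThread's IPv6 MPL module; the snippet below sketches the local patch (the exact file and declaration style vary across OpenThread versions):

```cpp
// MPL (Multicast Protocol for Low-power and Lossy Networks) retransmission
// bound: how many times a router retransmits a buffered multicast data
// message. Raising it from the upstream default of 2 to 4 makes multicast
// a/aq delivery more likely, at the cost of extra multicast traffic.
static constexpr uint8_t kMplRouterDataMessageTimerExpirations = 4; // upstream default: 2
```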

@jwhui
Member

jwhui commented Feb 27, 2024

Looking at the PCAP from #9416 (comment), it looks like there are some other issues that are causing the Address Query to fail.

After sending an SRP Update, the joining device seems to be generally unresponsive for some time. Note all the SRP Response retransmissions.

[Screenshot: pcap excerpt showing repeated SRP Response retransmissions after the SRP Update]

And again just before the parent sends the Address Query message.

[Screenshot: pcap excerpt showing retransmissions just before the parent sends the Address Query]

Given the unreliability of communication from the parent to the joining device, I don't expect this to be a very usable link in general.

@alan-eero @ndyck14, thoughts?

@jwhui
Member

jwhui commented Mar 12, 2024

@alan-eero, we merged the optimization to help address this issue:

Can you provide feedback on whether this addresses the issue above?

@pragun01

pragun01 commented Mar 21, 2024

Thanks @jwhui.
The change below, which caches the EID-to-RLOC16 mapping during the SRP update, solved the problem to some extent: it covers the case where a child sends an SRP Update message while joining the network. But in scenarios where the RLOC16 changes, such as a REED being promoted to Router, the cache entry is removed from the address resolver, a multicast Address Query is needed to re-establish the mapping, and we hit the same problem due to unreliable multicast. During the promotion of a REED to Router, instead of removing the mapping from the cache, can we update the cached mapping with the new Router ID?

#9881

@jwhui
Member

jwhui commented Mar 21, 2024

@pragun01, we already do something similar here:

Get<AddressResolver>().ReplaceEntriesForRloc16(aRxInfo.mNeighbor->GetRloc16(), router->GetRloc16());

Changing the cache entry upon receiving the Address Solicit message can be problematic if the Address Solicit Response message cannot be delivered.

@pragun01

pragun01 commented Mar 21, 2024

Thanks @jwhui.

The cached address mapping is removed here; I agree we should use replace here as well.

Get<AddressResolver>().RemoveEntriesForRloc16(oldRloc16);
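
The proposal amounts to something like the following change at that call site (sketch only; `newRloc16` stands for the RLOC16 assigned in the Address Solicit Response, and whether this is safe is exactly the concern discussed below):

```cpp
// Proposed (sketch): rewrite the cached entries to the newly assigned RLOC16
// instead of dropping them, so no fresh multicast Address Query is needed.
Get<AddressResolver>().ReplaceEntriesForRloc16(oldRloc16, newRloc16);
```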

@jwhui
Member

jwhui commented Mar 21, 2024

@pragun01, as noted in #8987, the concern was:

  On leader, when we successfully reply to an "Address Solicit"
  message and assign a new RLOC16 to a node, we clear all entries
  associated with old RLOC16. We do not change to new RLOC16 since we
  cannot be sure that child will successfully receive the "Address
  Solicit" response.

But maybe we should have a check to only remove the entry if the device was not a direct child.
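
Something along these lines, as a sketch (`wasDirectChild` is a placeholder for however the leader would determine that the soliciting node was its own child; both resolver calls are the existing ones quoted above):

```cpp
// Sketch of the suggested check: if the node soliciting a new RLOC16 was our
// own (direct) child, delivery of the response is reasonably reliable, so
// rewrite its cache entries in place; otherwise keep the conservative removal.
if (wasDirectChild)
{
    Get<AddressResolver>().ReplaceEntriesForRloc16(oldRloc16, newRloc16);
}
else
{
    Get<AddressResolver>().RemoveEntriesForRloc16(oldRloc16);
}
```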

@abtink, thoughts?

@abtink
Member

abtink commented Mar 21, 2024

> But maybe we should have a check to only remove the entry if the device was not a direct child.

Sounds okay to me to do so. However, this optimization is tailored to a specific topology/situation where the BR (SRP server) is the leader and the device is a direct child. If the topology differs, address queries would still be needed.

@abtink
Member

abtink commented Mar 21, 2024

Submitted:

Again, I want to mention that this is a tailored optimization for a very specific topology.

@pragun01

Thanks @abtink
