There are some cases where OT (or Matter) loses track of matter services #9909

Open
ndyck14 opened this issue Mar 8, 2024 · 6 comments
Comments

@ndyck14
Contributor

ndyck14 commented Mar 8, 2024

We've observed cases where Matter mDNS services disappear unexpectedly.

To reproduce:

  1. Git commit id: e6df00d (https://github.com/SiliconLabs/matter/releases/tag/v2.2.0-1.2)
  2. IEEE 802.15.4 hardware platform: MG24
  3. Build steps
  4. Network topology: typically larger numbers of devices
  5. Usually with 3 fabrics on each device (e.g. Apple Home, which has 2, plus Home Assistant)

We've been unable to reproduce this reliably enough to capture it on a device with debug logs. We've only observed it indirectly, through logs from HPM sysdiagnose with the Home Network Diagnostics profile, which includes SRP calls.

Based on observation, we see some or all Matter services disappearing ~2 hours after boot. The Matter flow starts by advertising all operational nodes and a matterc record for a period of 3 minutes.

Independently, we have our own record type (ltpdu) that must always be advertised. Because of current deficiencies in the Matter SDK (project-chip/connectedhomeip#32507), we have a timer that checks every 2 s whether our service is still there by iterating over the known OT services:

const otSrpClientService *service_list = otSrpClientGetServices(chip::DeviceLayer::ThreadStackMgrImpl().OTInstance());

If we detect that our service is missing, we will add it immediately:

CHIP_ERROR error = chip::Dnssd::ChipDnssdPublishService(&SRPltpdu::service, nullptr, nullptr);
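
The presence check described above can be sketched as a walk over a singly linked list. This is only an illustration under stated assumptions: `Service` is a hypothetical stand-in that mirrors the `mName`, `mInstanceName` and `mNext` fields of `otSrpClientService`, and `IsServiceRegistered` is an invented helper name, not part of OpenThread or the Matter SDK.

```cpp
#include <cstring>

// Hypothetical stand-in for otSrpClientService: only the fields used here.
struct Service
{
    const char * mName;         // service name
    const char * mInstanceName; // service instance label
    const Service * mNext;      // next entry in the singly linked list
};

// Returns true if a service with the given name and instance is in the list.
// (Assumed helper for illustration; not an OpenThread or Matter API.)
bool IsServiceRegistered(const Service * head, const char * name, const char * instance)
{
    for (const Service * s = head; s != nullptr; s = s->mNext)
    {
        if (std::strcmp(s->mName, name) == 0 && std::strcmp(s->mInstanceName, instance) == 0)
        {
            return true;
        }
    }
    return false;
}
```

In the real code the list head would come from otSrpClientGetServices(), and a false result would trigger the re-add call shown above.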

Our working theory is that race conditions can result in some or all of the Matter services not getting re-added when matterc is removed. Observationally, these services are not actively removed by the device; they simply expire later at the SRP server (i.e. they are not refreshed).

It's unclear to us whether this indicates a deficiency in OT code, a deficiency in how we are using it, or otherwise a deficiency in the Matter SDK.

We are already refactoring our approach so that our own service is added to OT in the same function call that Matter uses to add its own, which may address our hypothesis. But it would be good to get some understanding from the OT community of whether this points to other hidden OT issues with SRP behaviour.

@abtink
Member

abtink commented Mar 8, 2024

If a service is not present in the list of services you get from OT otSrpClientGetServices, then no sw entity (Matter or in general any next-layer code) has called the API to add the service.

Once a service is successfully added using otSrpClientAddService(), it will be in the list until it is explicitly removed/cleared by an API call from the next layer.

@ndyck14
Contributor Author

ndyck14 commented Mar 8, 2024

OK, so the issue likely points to the CHIP stack, and to this:

CHIP_ERROR DnssdServer::AdvertiseOperational()
{
    VerifyOrDie(mFabricTable != nullptr);

    for (const FabricInfo & fabricInfo : *mFabricTable)
    {
...
        // Should we keep trying to advertise the other operational
        // identities on failure?
        ReturnErrorOnFailure(mdnsAdvertiser.Advertise(advertiseParameters));

If it fails on the first call, the rest are not tried. More generally, it can fail anywhere in the list of fabrics, and every fabric after the failing one is skipped.
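
One alternative to the early return is to attempt every fabric and report the first error afterwards. The sketch below shows only that pattern, under stated assumptions: `Error` is a hypothetical stand-in for CHIP_ERROR, fabrics are reduced to plain integers, and `AdvertiseAll` is an invented name. It is not the Matter SDK's code, and whether this behaviour is appropriate is a question for the CHIP maintainers.

```cpp
#include <functional>
#include <vector>

// Hypothetical error type standing in for CHIP_ERROR; 0 means success.
using Error = int;

// Calls advertise(fabric) for every fabric, even after a failure,
// and returns the first error seen (0 if every call succeeded).
Error AdvertiseAll(const std::vector<int> & fabrics, const std::function<Error(int)> & advertise)
{
    Error first = 0;
    for (int fabric : fabrics)
    {
        Error err = advertise(fabric);
        if (err != 0 && first == 0)
        {
            first = err; // remember the failure but keep going
        }
    }
    return first;
}
```

With this shape, a failure on the first fabric still surfaces to the caller, but the remaining fabrics get their advertise attempt.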

@ndyck14
Contributor Author

ndyck14 commented Mar 26, 2024

I suspect there is still an issue on the OT side, either a bug or inappropriate use by us and/or Matter.

Matter seems to remove all services and re-add the ones it cares about. When this issue occurs, the BR-side logs from Apple show that only the matterc service is actually removed by the Thread client; the remainder time out when their lease expires. This suggests that perhaps, if multiple services are removed all at once, OT only actively removes the last one from the SRP server?

We've otherwise avoided this behaviour by combining our own registration with Matter's, which seems to avoid the original failure to advertise a service in the for loop.

filters:
any:727E27068C33DD2A.local → F75 node, which is affected
any:E62BF306B0F70C4F.local → AEA node, which is unaffected

To navigate between SRP calls, search for srp_eval.

Both seem to restart at ~17:28:43. AEA seems to register first (perhaps resulting in a few retries in F75?). Matterc records are removed at ~17:31.

For AEA, when matterc is removed, ltpdu is also removed in what seems to be the same call. ltpdu is then re-added, so there are 2 SRP evals in total.

In F75, ltpdu is removed first, then matterc. Then ltpdu is re-added, so there are 3 calls.

In AEA, services are refreshed at 19:26, which is just under 2 hours after the original 17:28 registration.

In F75, the next refresh happens at 19:29, after the Matter records have expired. Only ltpdu is refreshed.
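
The timing above can be checked with a little arithmetic. This sketch assumes a lease of exactly 2 hours (the report only says "~2 hours") and assumes the refresh times fall at :00 seconds, since the logs quoted here give only minutes. `ToSeconds`, `kAssumedLeaseSeconds` and `RefreshBeatsLease` are illustrative names, not OpenThread identifiers.

```cpp
#include <cstdint>

// Seconds since midnight for a wall-clock time.
constexpr std::uint32_t ToSeconds(std::uint32_t h, std::uint32_t m, std::uint32_t s)
{
    return h * 3600 + m * 60 + s;
}

// Assumption: a 2-hour lease, matching the "~2 hours" expiry in the report.
constexpr std::uint32_t kAssumedLeaseSeconds = 2 * 3600;

// True if a refresh at refreshTime arrives before a lease that started
// at registeredTime runs out.
constexpr bool RefreshBeatsLease(std::uint32_t registeredTime, std::uint32_t refreshTime)
{
    return refreshTime < registeredTime + kAssumedLeaseSeconds;
}
```

Under those assumptions, a registration at 17:28:43 expires at 19:28:43, so the AEA refresh at 19:26 lands inside the lease while the F75 refresh at 19:29 lands just after it.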

I have sysdiagnose logs for any parties interested in the above analysis (file too large to upload)

@abtink
Member

abtink commented Mar 26, 2024

> OT only actively removes the last one from the SRP server?

A call to otSrpClientRemoveHostAndServices() will trigger the SRP client to send an SRP update message to the server asking for the host to be fully removed, which should also remove any previously registered services associated with this host (on the server).

@ndyck14
Contributor Author

ndyck14 commented Mar 26, 2024

Code snippet from CHIP:

for (typename SrpClient::Service & service : mSrpClient.mServices)
{
    if (service.IsUsed() && service.mIsInvalid)
    {
        ChipLogProgress(DeviceLayer, "removing srp service: %s.%s", service.mService.mInstanceName, service.mService.mName);
        error = MapOpenThreadError(otSrpClientRemoveService(mOTInst, &service.mService));
        SuccessOrExit(error);
    }
}

@ndyck14
Contributor Author

ndyck14 commented Mar 26, 2024

Anyway, the issue has been logged to CHIP, so this could probably be closed in the absence of stronger evidence/steps to reproduce.
