(Hopefully) fix the MQTT disconnections happening all the time #682

Closed


uschindler
Contributor

This PR adds better values for keep alives and socket timeouts:

    MQTTPubSubClient->setKeepAlive(30);
    MQTTPubSubClient->setSocketTimeout(120);

The keep alive tells the client to send a packet to the broker to tell it "yes I am there". If the broker does not get keep alive packets often enough, it will disconnect.
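
For a sense of the numbers: MQTT 3.1.1 allows the broker to drop a client once it has heard nothing from it for one and a half times the negotiated keep alive. A minimal sketch of that deadline follows. The helper names `brokerDeadlineMs` and `brokerWouldDisconnect` are made up for illustration and are not part of PubSubClient or Mosquitto.

```cpp
#include <cstdint>

// Hypothetical helper: the longest silence (in ms) a broker tolerates under
// the MQTT 3.1.1 rule of 1.5 x keep alive.
inline uint64_t brokerDeadlineMs(uint16_t keepAliveSeconds) {
    return static_cast<uint64_t>(keepAliveSeconds) * 1500ULL;
}

// Hypothetical helper: true if the broker is allowed to drop the client after
// `silenceMs` of silence. A keep alive of 0 switches the mechanism off.
inline bool brokerWouldDisconnect(uint16_t keepAliveSeconds, uint64_t silenceMs) {
    if (keepAliveSeconds == 0) return false;
    return silenceMs > brokerDeadlineMs(keepAliveSeconds);
}
```

With the keep alive of 30 seconds set above, the broker's deadline works out to 45 seconds of silence.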

Actually the socket timeout identical to keep alive looks strange to me. With the two values identical, the client will kill the connection if the server does not send a keep alive to the client in time, so I raised the socket timeout to 120 seconds. The default keep-alive value in Mosquitto is 60 seconds, and the minimum configurable value is 5 seconds:

MAN page of mosquitto:

   keepalive_interval seconds
      Set the number of seconds after which the bridge should send a ping if no other traffic has occurred. Defaults to 60. A minimum value of 5 seconds is allowed.

Basically the socket timeout needs to be much larger than the keep alive (socket timeout >> keep alive). That's what my knowledge of similar protocols tells me: I am developing large network protocols with Elasticsearch/Apache Lucene/Apache Solr, so that's my daily business.
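
The reasoning can be sketched with a toy model. This is only an assumption-laden illustration of the argument, not how PubSubClient actually reads from the socket: the peer is assumed to send something every `peerIntervalS` seconds, and the client is assumed to give up after `clientTimeoutS` seconds of silence. The function name `clientGivesUp` is hypothetical.

```cpp
#include <cstdint>

// Toy model (not PubSubClient code): steps through time one second at a time.
// The peer sends traffic every `peerIntervalS` seconds; the client gives up
// once it has seen no traffic for `clientTimeoutS` seconds.
// Returns true if the client gives up within `durationS` seconds.
inline bool clientGivesUp(uint32_t clientTimeoutS, uint32_t peerIntervalS,
                          uint32_t durationS) {
    uint32_t lastSeen = 0;  // time of the last traffic from the peer
    for (uint32_t now = 1; now <= durationS; ++now) {
        if (peerIntervalS != 0 && now % peerIntervalS == 0) {
            lastSeen = now;  // traffic arrives this second
        }
        if (now - lastSeen >= clientTimeoutS) {
            return true;  // silence lasted as long as the timeout
        }
    }
    return false;
}
```

In this model, `clientGivesUp(15, 60, 600)` reports a timeout, while `clientGivesUp(120, 60, 600)` does not.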

It also looks like one of the many keepalive bugs in the (now unmaintained) PubSubClient:

Those issues seem to be fixed by: knolleary/pubsubclient#802

As you have the code of PubSubClient 2.8 included, I also added the missing line in it.

To explain why it happens:

  • When the pubsub client (re-)connects to the MQTT broker, it forgets to reset the "pingOutstanding" variable to false. It sometimes does it (depending on what the server sends), but generally the state is undefined.
  • When BSB-LAN starts up, it seems to connect to MQTT twice (it looks like it reconnects after applying some config from EEPROM). If it is in the "pingOutstanding" state when it disconnects for the first time, it stays in that state. When entering the main loop, it waits for the ping and/or other messages and disconnects if there's no activity on the wire within 15 secs. It does not exit that state, because the server won't send a ping response if no ping was requested before. This effectively breaks the keep alive, as the client itself won't send any new pings anymore, so on longer latencies the server or client disconnects.

The fix is to initialize pingOutstanding=false on connection.
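
The sequence above can be replayed with a toy state machine. This is a simplified sketch of the behaviour as described here, not the PubSubClient source; `ToyMqttClient`, `survivesReconnect`, and all member names are hypothetical, and the 15 second keep alive is the library default mentioned above.

```cpp
#include <cstdint>

// Toy keep-alive state machine (NOT the PubSubClient source).
struct ToyMqttClient {
    bool resetPingOnConnect;      // true = apply the fix
    uint32_t keepAliveMs = 15000;
    bool connected = false;
    bool pingOutstanding = false;
    uint32_t lastActivityMs = 0;

    explicit ToyMqttClient(bool fix) : resetPingOnConnect(fix) {}

    void connect(uint32_t nowMs) {
        connected = true;
        lastActivityMs = nowMs;
        if (resetPingOnConnect) pingOutstanding = false;  // the one-line fix
    }

    void disconnect() { connected = false; }  // the flag survives the disconnect

    // Returns false if the client is (or just became) disconnected.
    bool loop(uint32_t nowMs) {
        if (!connected) return false;
        if (nowMs - lastActivityMs > keepAliveMs) {
            if (pingOutstanding) {   // the previous ping was never answered
                connected = false;
                return false;
            }
            pingOutstanding = true;  // send a ping and wait for the response
            lastActivityMs = nowMs;
        }
        return true;
    }
};

// Replays the startup sequence: connect, send a ping, drop the connection
// before the response arrives, reconnect, then reach the next keep-alive check.
inline bool survivesReconnect(bool fix) {
    ToyMqttClient c(fix);
    c.connect(0);
    c.loop(16000);         // idle longer than keep alive: ping sent
    c.disconnect();        // first connection ends with the ping unanswered
    c.connect(17000);      // second connection
    return c.loop(33000);  // next keep-alive check
}
```

Without the reset, the stale flag makes the second connection drop at its first keep-alive check instead of sending a fresh ping; with the reset, the client pings normally and stays connected.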

…ime on some devices (e.g., when there is lots of busy traffic on BSB)
@uschindler changed the title from "This PR (hopefully) fixes the MQTT disconnections happening all the t…" to "(Hopefully) fix the MQTT disconnections happening all the time" on Dec 6, 2024
@uschindler
Contributor Author

I am testing this PR in production. No disconnects seen yet, although the temperature in my house is rising due to heavy heater activity.

@fredlcore
Owner

Actually the socket timeout identical to keep alive looks strange to me.

But that's how it's defined in PubSubClient.h (15 seconds default).

@fredlcore
Owner

If there is a newer version of PubSubClient, I'd rather update to the full new version than patch bits and pieces of the code, but thanks for the pointer, and I'm glad if this will fix the problem.

@uschindler
Contributor Author

If there is a newer version of PubSubClient, I'd rather update to the full new version than patch bits and pieces of the code, but thanks for the pointer, and I'm glad if this will fix the problem.

Unfortunately there's no newer version. The maintainer has told us that he won't take care of the client anymore, except for very important (security) bugs: knolleary/pubsubclient#1045

@uschindler
Contributor Author

Give it a few days and I will proceed with testing. So far all looks fine. I will give you a ping if this solves the issue completely or at least makes the problem appear less often.

@uschindler
Contributor Author

Actually the socket timeout identical to keep alive looks strange to me.

But that's how it's defined in PubSubClient.h (15 seconds default).

Yes, I know, but that was already criticized in the issues of this project. Basically this low socket timeout makes it incompatible with the defaults used in Mosquitto (there it uses 60s to send pings to keep the connection alive). Of course you could lower that value in Mosquitto's config, but the better fix is to let the client wait longer for pings.

@fredlcore
Owner

Ok, understood, thanks!

@uschindler
Contributor Author

It looks like somebody else took over maintenance of the pubsubclient: https://github.com/thingsboard/pubsubclient. More details also here: https://registry.platformio.org/libraries/thingsboard/TBPubSubClient

Unfortunately this code still has the problem with the lost pings, so the bugfix presented here is still missing. But they fixed security issues.

Unfortunately I had one more short disconnection a few minutes ago. But not as many as before. I will try to update to the latest version of the above fork and give it a try, possibly later or over night.

@uschindler
Contributor Author

I merged with main branch.

@uschindler
Contributor Author

I am closing this because I made the update to 2.10 provided by https://github.com/thingsboard/pubsubclient

See separate PR.
