Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] BSB protocol device discovery does not work -- [FEATURE REQUEST] MQTT topic structure without category in path (optional?) #680

Closed
uschindler opened this issue Dec 4, 2024 · 78 comments

Comments

@uschindler
Copy link
Contributor

uschindler commented Dec 4, 2024

Hi,
I did not yet install the new version of BSB-LAN but reviewed the changes since the Git version of end October that's still installed locally.

The new Topic structure with the endpoints /status, /set, /inf,.. looks fine and makes it much easier to set values from Home Assistant (HASS), as you no longer need templates to form the message to set a value, it should work automagically. /poll is also great because I can setup a cronjob in Homeassistant to poll specific values with different intervals (normaly things like heater status and temperature are fetched 40 seconds for me, but stuff like actual heating curve due to adaption do not need to be fetched that often). So I can change this to send /poll MQTT messages with different intervals. Thats wonderful!

In addition communicating with the BSB-LAN only through the MQTT serializes the messages "in order". Sending messages in parallel through HTTP sometimes leads to crushes (TCP reset) as the BSB-LAN is single-threaded. When everything from HASS goes through MQTT all arrives at BSB-LAN in order in a single "thread".

But one thing annoys me: The new topic structure contains the category id, which is somehow redundant. The device ID is easy to handle with search/replace in the config of HASS. But category id is totally useless! It may be usefull for autodiscovery, but if you don't use auto-discovery and configure your entities by hand (I have about 20 including a climate control entity and water heater entity), it gets annoying to lookup all category ids which are different for different devices. This also makes it hard to post reuseable configurations, because the categories are different in each system, while the parameter numbers are reuseable and the same for most heaters (for common parameters like temperature, FA status,....).

Would it be possible to either revert addition of category id to MQTT messages or provide a compile time config which removes them from the string formatting when creating MQTT topic paths?

Once there is an easy option to remove them, I will upgrade and report back how the new endpoints work (/poll, /set, /status, /inf).

There's an additional bug which was observed during testing the master branch with discovery of BSB devices. See below for details!

@uschindler
Copy link
Contributor Author

P.S.: I was about to post my configuration of a HASS "climate" device that supports swithching on/off (where it handles to restore the previous state of system like "Reduziert", "Komfort", "Auto") when switching it back on. Background: Unfortunately HASS's climate have only on/of setting and a separate toggle for the "mode", while on BSB-LAN heaters you only have one toggle, where "Frostschutz" means off. So the on/off switch should switch to "Frostschutz" and when turning back on it should revert to the previous mode (Komfort/Reduced/Auto). -- Unfortunately the config is not easy to post on forums with the new topic structure because the category IDs are different for every system.

@fredlcore
Copy link
Owner

That's the thing if you introduce new features - some people want it this way, some people want it that way. The argument to filter MQTT-subscribed parameters based on categories rather than having to enter parameters one-by-one was the reason I introduced that feature. To me, this is a more frequent use case than re-using configurations across different systems. If the systems are really the same, the category IDs will also be the same. If they are not, I wouldn't want to encourage people re-using configurations in the first place as you may or may not know what you are doing. And the "may nots" are the ones who cause support-time ;). If it is any consolation, the category ID is ignored when setting parameters, so you can write to BSB-LAN/1/0/700 in order to write to parameter 700 on device ID 1.

@uschindler
Copy link
Contributor Author

I don't agree with the "filtering". If you want to filter you filter by parameter id. They are all pre-structured. You know that 7xx is the Heizkreis 1, so its easy to filter them by path in MQTT explorers. I know almost all of my parameter IDs (I knew them already before using BSB-LAN).
I am just mentioning this. I would agree if the categories are "user-readable" strings, but just the IDs makes discovery not more useful.

I know from the code that while setting the values it does not matter, but the bigger problem is setting up entities where you wait for changes on "/status", they need to have the correct category ID.

@fredlcore
Copy link
Owner

Well, heating circuit 1 is more than just parameter numbers 7xx, they go up to 900. And if you know the parameter IDs, you will probably also know the order of the categories and thus also the category IDs (at least in systems with just one heating system). If you are interested in the parameters of HC1, it's of course much faster to subscribe just to BSB-LAN/0/XX/ instead of subscribing to each parameter manually.

User-readable strings would also not solve the problem of portability, because they would be localized.
And a compile time flag would also not help in terms of portability, as users would have to know that any configuration they see on the internet would only work with this or that setting. And you won't believe how many e-mails I get where people complain about the one error message that is the first one explained in the "troubleshooting" section - which again only can come up if you haven't read the quickstart instructions properly. So I cannot introduce any features that bring in more ambiguity.

In any case, it's a one-time task, and since reading/monitoring - i.e. subscribing - parameters are the majority compared to those that you want to actively change regularly (maybe a handful?), the advantages of the present model are clear to me.

But I'll leave this issue open for a while. Let's see if there are more users who feel that this is a step back.

@fredlcore
Copy link
Owner

Ok, since it only affects two lines in code, I have now added the NO_MQTT_HIERARCHY compile option, which you can set using #define NO_MQTT_HIERARCHY in BSB_LAN_config.h. I won't advertise this feature for the above-mentioned reasons, so please do not post configurations that make use of this to a wider audience, as it will for sure increase my support load. Of course, if you can adjust your templates to work with the "standard" configuration, this is, of course, welcome.

@uschindler
Copy link
Contributor Author

uschindler commented Dec 4, 2024

OK thanks for this.

Actually this setting only helps for the /status case, because it again makes the structure inconsistent. So I agree that it is a support desaster.

I will try to update my code on the weekend and extract the category ID. Maybe you can subscribe to topics like "/BSB-LAN/0/+/712" instead (not sure if that works with HASS), when writing you could use "0" as suggested before..

@uschindler
Copy link
Contributor Author

@uschindler
Copy link
Contributor Author

Looks like you can use topics for subscribing in HASS: https://community.home-assistant.io/t/mqtt-topic-wildcard/389382/3

This makes it easy to ignore the category when subscribing for updates. For setting/informing you can use "0" and for polling the topic ID is enough.

@fredlcore
Copy link
Owner

fredlcore commented Dec 4, 2024

I missed one instance, so now it should have a consistent structure without the category ID. Can you test again?
And yes, such kind of subscriptions should be doable if you want to ignore the category ID on the level of MQTT subscription. Wouldn't that solve your problem even better?

@uschindler
Copy link
Contributor Author

I missed one instance, so now it should have a consistent structure without the category ID. Can you test again? And yes, such kind of subscriptions should be doable if you want to ignore the category ID on the level of MQTT subscription. Wouldn't that solve you problem even better?

Yes and no. For subscribing wildcards are a good idea, but for setting values you still need a special case. But as "0" as category works for setting values I am fine with that limitation.

@uschindler
Copy link
Contributor Author

I will give both variants a try later, I am a bit busy at moment.

@fredlcore
Copy link
Owner

So then I could/should revert the changes? If it means that you can create relocatable templates with the way you just described, then I'd rather remove it. If you'd still need the "flat" hierarchy, I'll keep it, with the risk that people will try it out and clog their MQTT topic hierarchy ;)...

@fredlcore
Copy link
Owner

Ok, thanks!

@uschindler
Copy link
Contributor Author

uschindler commented Dec 4, 2024

Hi, I tried to test it but after upgrading (before, all was working correct), it is unable to detect the BSB. I cleared the EPROM manually, still no success. It seems to get the correct device/type and variant, but it reports 0 and 0 for both. Therefore I can't get it online at all.

19:22:58.822 -> Scanning devices on the bus...
19:22:58.904 -> LAN->ALL  QINF      3D050064 
19:22:58.904 -> DC C2 7F 0B 01 3D 05 00 64 37 B4
19:22:59.037 -> HEIZ->LAN INF      053D0064 00 61 00 64 00 1A 7D B5 B4 0A
19:22:59.037 -> DC 80 42 15 02 05 3D 00 64 00 61 00 64 00 1A 7D B5 B4 0A 34 0E
19:23:04.323 -> DSP1->HEIZ QUR 8310.0 Uhrzeit - Kesseltemperatur-Istwert: 
19:23:04.323 -> DC 8A 00 0B 06 3D 0D 05 19 4F 8C
19:23:07.638 -> DC C2 80 0B 06 3D 05 00 01 C9 12 
19:23:07.638 -> Bus devices found:
19:23:07.638 -> 128/97/100/
19:23:07.638 -> Device family: 0
19:23:07.638 -> Device variant: 0
19:23:10.621 -> Start network services
19:23:10.621 -> Starting MDNS service with hostname BSB-LAN
19:23:10.621 -> Update Server started on port 8080.
19:23:10.621 -> Setup complete
19:23:10.621 -> Connecting to MQTT broker sirius.home on port 1883...
19:23:10.621 -> Client ID: BSB-LAN
19:23:10.678 -> Will topic: BSB-LAN/status
19:23:10.678 -> Connected to MQTT broker, updating will topic
19:23:10.678 -> Subscribed to topic 'BSB-LAN/#'
19:23:10.678 -> Published status 'online' to topic 'BSB-LAN/status'
19:23:38.497 -> MQTT connection lost
19:23:48.494 -> Connecting to MQTT broker sirius.home on port 1883...
19:23:48.494 -> Client ID: BSB-LAN
19:23:48.494 -> Will topic: BSB-LAN/status
19:23:48.494 -> Connected to MQTT broker, updating will topic
19:23:48.494 -> Subscribed to topic 'BSB-LAN/#'
19:23:48.494 -> Published status 'online' to topic 'BSB-LAN/status'
19:23:51.044 -> 60002 Ping!
19:23:51.044 -> Device family: 0
19:23:51.044 -> Device variant: 0

This repeats forever. Can you help. Unfortunately theres no way anymore to set family and variant anymore.

It shows correct in first line: 97/100 but then reports 0/0

19:23:07.638 -> Bus devices found:
19:23:07.638 -> 128/97/100/
19:23:07.638 -> Device family: 0
19:23:07.638 -> Device variant: 0

Uwe

@uschindler
Copy link
Contributor Author

I found the issue. The following line breaks everything in the logic. After disabling it, all works:

bus->setBusType(BUS_LPB, bus->getBusAddr(), save_destAddr);

This line of code changes the bus type for no reason to LBP which then breaks. On Web interface it also shows LBP on the config page.

So basically I got the master version running, now trying to configure MQTT! Should I open another issue about the logic problem setting to bus type LBP on discovery of bus devices?

@uschindler
Copy link
Contributor Author

There are more problems in the code. It boots up successfully, but every request for a parameter (via auto-logging or MQTT) leads to:

20:15:29.694 -> GET /K0!128 HTTP/1.1
20:15:29.769 -> 0 - Uhrzeit
20:15:32.912 -> DC C2 80 0B 06 3D 05 00 0B 68 58 
20:15:32.912 ->  query failed
20:15:35.980 -> DC C2 80 0B 06 3D 05 00 0B 68 58 
20:15:35.980 ->  query failed
20:15:39.069 -> DC C2 80 0B 06 3D 05 00 0B 68 58 
20:15:39.109 ->  query failed

I think I will revert to latest published version....

@uschindler
Copy link
Contributor Author

uschindler commented Dec 4, 2024

OK, released version 4.1 boots and works correct. As it did not have the newest MQTT topic structure (with status,....), I then tried several commits and found out that e66083d works for me.

If you want to fix the issues with discovery and startup (and possibly those where the query fails), you may contact me to get more debug information. It looks like the new code somehow mixes up device numbers and serial numbers in addition to suddenly setting the BUS type to PPS after scanning for devices. The newest version with the bug (after it started with commenting out the code as described above) reported device ID 128 in MQTT messages, while the working e66083d reports ID 0.

Some issues I also found while upgrading: If I did not delete all retained messages with mosquitto_sub --remove-retained -t 'BSB-LAN/#' before startup, the code crashed with a stack overflow after complaining that it received messages vom mosquito with wrong syntax. I also found out that the new code (like the old code) sometimes disconnects from MQTT server when it gets a bulk of many bus telegrams. This is a separate issue, if I have some time, I will report when this happens.

I will now try to use commit e66083d with wildcard topics in HASS.

@uschindler
Copy link
Contributor Author

uschindler commented Dec 4, 2024

Hi,
with wildcard topics it works fine. The most complex example is the "climate" entity with many settings:

climate:
  - name: Heizkreis
    unique_id: bsb-lan.heating
    availability_topic: "BSB-LAN/status"
    precision: 0.1

    action_topic: "BSB-LAN/0/+/8000/status"
    action_template: >
      {% set idx = value.split() | first | int %}
      {{ 'heating' if idx in [111,113,114,116,137,155] else 'idle' }}

    modes: [ "auto", "off" ]
    mode_state_topic: "BSB-LAN/0/+/700/status"
    mode_state_template: >
      {% set idx = value.split() | first | int %}
      {{ 'off' if idx == 0 else 'auto' }}
    mode_command_topic: "BSB-LAN/0/0/700/set"
    mode_command_template: >
      {{ '0' if value=='off' else states('sensor.heizkreis_letzte_non_off_betriebsart') | default('1') }}

    preset_modes: [ "Schutzbetrieb", "Automatik", "Reduziert", "Komfort" ]
    preset_mode_state_topic: "BSB-LAN/0/+/700/status"
    preset_mode_value_template: >
      {% set mapping = {0: 'Schutzbetrieb', 1: 'Automatik', 2: 'Reduziert', 3: 'Komfort'} %}
      {% set idx = value.split() | first | int %}
      {{ mapping[idx] }}
    preset_mode_command_topic: "BSB-LAN/0/0/700/set"
    preset_mode_command_template: >
      {% set mapping = {'Schutzbetrieb': 0, 'Automatik': 1, 'Reduziert': 2, 'Komfort': 3} %}
      {{ mapping[value] }}

    current_temperature_topic: "BSB-LAN/0/+/8740/status"
    temperature_command_topic: "BSB-LAN/0/0/710/set"
    temperature_state_topic: "BSB-LAN/0/+/710/status"
    temp_step: 0.1
    max_temp: 35
    min_temp: 10
    qos: 1

Please note for the on/off button it uses the last known state (another entity) which is not "Frostschutz".

Basically I think you can revert the two changesets. With wildcards it is not needed. The topics are still a bit async but much better than explicitly giving the category ids.

For getting a value it uses wildcard, for setting the fixed category=0:

    temperature_command_topic: "BSB-LAN/0/0/710/set"
    temperature_state_topic: "BSB-LAN/0/+/710/status"

The cool thing is: For getting/setting temperature, no template is needed, as the payload can be directly sent.

I will only improve my config to have a global variable for the device ID reused.

@uschindler
Copy link
Contributor Author

My running version is now: Version: 4.2.28-20241129140358

@fredlcore
Copy link
Owner

I found the issue. The following line breaks everything in the logic. After disabling it, all works:

bus->setBusType(BUS_LPB, bus->getBusAddr(), save_destAddr);

This line of code changes the bus type for no reason to LBP which then breaks. On Web interface it also shows LBP on the config page.

So basically I got the master version running, now trying to configure MQTT! Should I open another issue about the logic problem setting to bus type LBP on discovery of bus devices?

Thanks for pointing that out, that was a stupid mistake, but disabling the line leads to the other problems you encountered then. The correct line has to be:
bus->setBusType(bus->getBusType(), bus->getBusAddr(), save_destAddr);
I already uploaded a bugfix. Since I only have LPB and not BSB, could you quickly check if that solves the issues? Thanks

@fredlcore
Copy link
Owner

OK, released version 4.1 boots and works correct. As it did not have the newest MQTT topic structure (with status,....), I then tried several commits and found out that e66083d works for me.

If you want to fix the issues with discovery and startup (and possibly those where the query fails), you may contact me to get more debug information. It looks like the new code somehow mixes up device numbers and serial numbers in addition to suddenly setting the BUS type to PPS after scanning for devices. The newest version with the bug (after it started with commenting out the code as described above) reported device ID 128 in MQTT messages, while the working e66083d reports ID 0.

Some issues I also found while upgrading: If I did not delete all retained messages with mosquitto_sub --remove-retained -t 'BSB-LAN/#' before startup, the code crashed with a stack overflow after complaining that it received messages vom mosquito with wrong syntax. I also found out that the new code (like the old code) sometimes disconnects from MQTT server when it gets a bulk of many bus telegrams. This is a separate issue, if I have some time, I will report when this happens.

I will now try to use commit e66083d with wildcard topics in HASS.

As for the bulk of MQTT messages, this is probably related to a bug in mosquitto that I mentioned somewhere else. Can you try and install mosquitto version 2.0.15 which does not seem to have that issue?

As for the crash due to MQTT messages with wrong syntax, I will have to check, my mosquitto does not retain messages, so if you could find out more about that and open a separate issue, that would be great.

@fredlcore
Copy link
Owner

Actually, my mosquitto does retain messages, just not across restarts of mosquitto (I guess I have to enable persistence for it to do that?). What startup do you actually refer to, the one of mosquitto or of BSB-LAN?

@uschindler
Copy link
Contributor Author

Actually, my mosquitto does retain messages, just not across restarts of mosquitto (I guess I have to enable persistence for it to do that?). What startup do you actually refer to, the one of mosquitto or of BSB-LAN?

I did not restrat mosquitto at all as it runs as system service provided by Ubuntu 24.04. I will check how to downgrade it to maybe fix the "too many retained messages" problem, if that's causing the reconnect issue on many retained messages. Interestingly the problem also occurred after I deleted all retained messages, so basically the Mosquitto had an empty BSB-LAN topic.

With restart I mean restart of BSB-LAN. The bug is actually the following:

  • BSB-LAN connects to MQTT broker
  • BSB-LAN subscribes to the BSB-LAN/# topic
  • mosquitto sends all currently visible topics (everything that's visible to subscriber)
  • BSB-LAN parses the topic paths and figures out that the "old" topic names are invalid. It logs a message about that.
  • version e66083d then crashed with a stack overflow message. Unfortunately I have not saved the log output.

Thanks for pointing that out, that was a stupid mistake, but disabling the line leads to the other problems you encountered then. The correct line has to be:
bus->setBusType(bus->getBusType(), bus->getBusAddr(), save_destAddr);
I already uploaded a bugfix. Since I only have LPB and not BSB, could you quickly check if that solves the issues? Thanks

Working on it. Will report later.

@fredlcore
Copy link
Owner

That could be that it has to do with the old topic structure (which was only active for a week or so). Of course it would be good to catch the error, but if it no longer exists in the current version, then I guess we could live with that.
In any case, I restart BSB-LAN regularly and I can't observe any crashes.

The problem with messages getting dropped is unrelated to whether the broker already has stored many messages or not, the problem is described here:
eclipse-mosquitto/mosquitto#2887

@uschindler
Copy link
Contributor Author

That could be that it has to do with the old topic structure (which was only active for a week or so). Of course it would be good to catch the error, but if it no longer exists in the current version, then I guess we could live with that.
In any case, I restart BSB-LAN regularly and I can't observe any crashes.

I think the best is to tell users to execute the following command on update to get rid of all retained messages:

mosquitto_sub --remove-retained -t 'BSB-LAN/#'

(with their own configured topic root and user/pass if required). I think the issue with the crash happened when the 4.1 version I had installed before created topics without /status at end and then I switched over to the above commit.

It actually reports the issue, but it looks like the code then goes into a loop or calls itsself. I think you may be able to reproduce it by publishing some retained messages manually with the topic structure of 3.x and a few with 4.1 style (without /status).

Uwe

@fredlcore
Copy link
Owner

Can do, but for now, I'd rather focus on whether BSB-LAN works again without any problems with BSB (after the bugfix)...

@uschindler
Copy link
Contributor Author

Hi,
compiled new version, in the config page it shows again BSB, but it does not get the correct device information. The web interface shows no parameters.

In the telnet console it still gets telegrams about Brennerstatus, but still shows:

Device family: 0
Device variant: 0

Should I connect with serial console (luckily I added a USB cable coming out of the boiler on bottom....)?

@uschindler
Copy link
Contributor Author

I think the issue has to do with the change from device id to serial number. In the message about detected decice it shows 128 as device ID. Not sure if this is correct. It looks like later it does not find the device in the loop, which goes over all discovered devices.

@fredlcore
Copy link
Owner

Yes, a serial log would be helpful here. This probably happens in GedDevID function, where 128 is actually the broadcast address to get all devices on the bus. The code should reset it to the original destination at the end, which apparently it doesn't...

@uschindler
Copy link
Contributor Author

By the way, found and fixed the MQTT crash. Good old case of not handling a null pointer as a result of an operation :(...

This could be the reason why it crashed when some invalid MQTT topics of old style were reported by MQTT broker. Thanks! I can't replicate it easily, but this could be th reason for the strange behaviour after it got old retained messages.

@fredlcore
Copy link
Owner

Yes, pretty sure that's the same thing. Everything that was posted to BSB-LAN was assumed to contain an = sign at some point, because I assumed that with /poll, there would only be the global SET commands remaining, and then I didn't filter against anything else...

@uschindler
Copy link
Contributor Author

Regarding the mosquitto bug: I applied the workaround in eclipse-mosquitto/mosquitto#2887 (comment) (downgrading on Ubunto 24.04 is not possible, I have now the 2.0.20 installed through the mosquitto PPA) and see what happens.

@uschindler
Copy link
Contributor Author

Hi, the changes to Mosquitto config did not bring any changes. Basically the behaviour is very similar to that report recently on the FHEM forum: https://forum.fhem.de/index.php?msg=1326158

Like for this user, it often happens that on bootup the device connects multiple times to the MQTT server making it online/offline according to the BSB-LAN/status topic. Later on when the heater goes very active (e.g., during the morning when the "schnellaufheizung" starts), it also disconnects randomly. The device does not reboot, it just disconnects from MQTT and reconnects instantly. When having the telnet console open, you can follow the reconnect process but a bit later the telnet console also disconnects. So basically as the telnet console and MQTT disconnect at different times, it looks like it is not a network problem (otherwise both would disconnect at same time).

I will prepare my Chromebook and connect it to the USB port to get a long-term log with all debug options enabled.

My first guess is: Due to high activity on the heater and also the BSB-BUS somehow some messages seem to decode wrongly and the device enters a state where it "hangs". This causes the connections to drop shortly. It may be caused by the heater. But it does not seem to be a "too much heat" problem (ESP32 device does not get too warm or gets electric interference, because I added some shielding around my installation next to the LMU74 long time ago to prevent excessive heat from causing Display Unit flickering - a known problem on Brötje).

In the meantime, I will disable the "availability" topic in my config, so short millisecond downtimes don't flood the log.

Many thanks for the help here! If I find something why the device disconnects on high activity I will open new issue.

@fredlcore
Copy link
Owner

Hm, I don't really know under what conditions an MQTT connection is considered closed because in the code, we only actively disconnect from the broker if the board is reset or the configuration is changed. If the device does not reboot, e.g. through a crash, then I don't know when the broker considers the connection to be disconnected.
The reconnect then happens because every iteration of loop(), it is checked if the connection is active and if not, a reconnect is initiated. The debug log should then state "MQTT connection lost" if there was one established before.

If the broker considers a connection disconnected if it is not regularly "served" (probably through MQTTPubSubClient->loop();), then this could have many reasons, for example some of the "housekeeping jobs" that BSB-LAN has to take care of and that sometimes may take a few seconds, depending on the configuration. If in that moment, a MQTT message cannot be delivered, it would be logical to consider the connection disconnected, especially since the status "offline" comes from the will-message.
One way to test this would be to measure the time between two MQTTPubSubClient->loop(); calls and then see if there is something obvious. But due to the many iterations, it would clog the log quickly, so the error would have to be reproducable.

Anyway, yes, let me know if you find something, but in a different issue, as you suggested...

@uschindler
Copy link
Contributor Author

uschindler commented Dec 6, 2024

Hi yes, working on it. What you describe with "maintenance tasks" looks like the real issue. As I also said, the MQTT connection breaks first, but then also the telnet connection disconnects (a bit later). And if you look at webserver it also sometimes does not repond.

In addition what I observed: When we had the bug (as described here where the bus address was detected in wrong way), the MQTT connection also closed all the time (it reported "connection lost" after about 15 secs, which is the keepalive time in the MQTT client used). The problem was that the device tried to send messages on the BSB bus using the wrong source device id (128) and those requests timeouted as the client got no reply packet. As the timeouts sum up when multiple messages are to be sent on BSB bus, it prevents progress on the MQTT (keepalives) or telnet connection. To me it looks that especially on startup and when the heater is active, some BSB messages produce timeouts and therefore stall the main loop.

The problem of those ESP32/Arduino devices is that they don't support multiple threads. It is like Windows 3.1, you need to keep all tasks alive by switching between them on your own. Basically all actions waiting for network or bus would need to be done with some select()-like approach (send command and then do something else until the bus reports "got something new"). Which would be a major refactoring of BSB-LAN and not easily possible as the timing on the bus needs to be completely handled by the main CPU as it has no signal processor that can listen on BSB and create events.

I will keep you informed. Thanks anyways for the great project.

@uschindler
Copy link
Contributor Author

Maybe it would help to call the MQTT's loop() function when you are waiting for the BSB bus.

@fredlcore
Copy link
Owner

Hm, each bus message has a 3s timeout, and there are three retries, so it shouldn't take 15 seconds from one MQTT loop to another. The ESP32 has two cores, so generally, it would be possible to shift that to a different core, but that would only work on the ESP32, not on the Arduino Due. I'll see if I can call the function more often somewhere...

@fredlcore
Copy link
Owner

I have now added some more MQTT "housekeeping". If you want to run your tests, please try the most recent version from GitHub.

@uschindler
Copy link
Contributor Author

Sure. Compiling....

@uschindler
Copy link
Contributor Author

Feedback: With your recent change the BSB Adaptor only receives the first value of an MQTT poll request:

Dec 06 12:23:30 BSB-LAN/poll 8000,8003,8005,8310,8314,8830,8326,8328,8743,8744,8750
Dec 06 12:23:30 BSB-LAN/0/20/8000/status 114 - Heizbetrieb Komfort
Dec 06 12:24:00 BSB-LAN/poll 8000,8003,8005,8310,8314,8830,8326,8328,8743,8744,8750
Dec 06 12:24:00 BSB-LAN/0/20/8000/status 114 - Heizbetrieb Komfort
Dec 06 12:24:30 BSB-LAN/poll 8000,8003,8005,8310,8314,8830,8326,8328,8743,8744,8750
Dec 06 12:24:30 BSB-LAN/0/20/8000/status 114 - Heizbetrieb Komfort

So this new code seems to have side effects. Better revert it.

After restoring previous commit all fine again:

Dec 06 12:28:50 BSB-LAN/status online
Dec 06 12:29:00 BSB-LAN/poll 8000,8003,8005,8310,8314,8830,8326,8328,8743,8744,8750
Dec 06 12:29:00 BSB-LAN/0/20/8000/status 114 - Heizbetrieb Komfort
Dec 06 12:29:00 BSB-LAN/0/20/8003/status 99 - Geladen, Nenntemperatur
Dec 06 12:29:00 BSB-LAN/0/20/8005/status 0 - ---
Dec 06 12:29:00 BSB-LAN/0/21/8310/status 38.0
Dec 06 12:29:01 BSB-LAN/0/21/8314/status 38.2
Dec 06 12:29:01 BSB-LAN/0/22/8830/status 54.3
Dec 06 12:29:01 BSB-LAN/0/21/8326/status 0
Dec 06 12:29:01 BSB-LAN/0/21/8328/status 0
Dec 06 12:29:01 BSB-LAN/0/22/8743/status 38.0
Dec 06 12:29:02 BSB-LAN/0/22/8744/status 40.5
Dec 06 12:29:02 BSB-LAN/0/22/8750/status 97

@fredlcore
Copy link
Owner

If you send me logs, then please serial monitor logs. Anything else is of little help.
Before I revert, can you please try and go to line 3876:

       while (retry) {
          if (mqtt_broker_addr[0] && (LoggingMode & CF_LOGMODE_MQTT)) { // Do MQTT housekeeping here in case delays occur...
            mqtt_connect();        //Luposoft, connect to mqtt
            MQTTPubSubClient->loop();    //Luposoft: listen to incoming messages
          }

And first remove the mqtt_connect(); line and see if the problem persists, and if it does not, then remove the loop() line and see if it's then working alright again.
If the problem is the loop() line, as I assume, then I can't fix the issue during the queries. Probably loop() is using the same buffer for receiving as for sending to MQTT, so there will be a clash...

@fredlcore
Copy link
Owner

I just had a look at the loop() function: It uses a timeout, and if that's expired it simply and uncerimoniously closes the connection.
Can you add these lines in include/mqtt_handler.h after line 198:

    MQTTPubSubClient->setKeepAlive(30);
    MQTTPubSubClient->setSocketTimeout(30);

Maybe this brings an improvement. If not, then I don't know if there's anything else I can do...

@uschindler
Copy link
Contributor Author

uschindler commented Dec 6, 2024

Hi, you reverted it already. If I remove the mqtt_connect() call it works better. BUT: I also looked at the code: Calling loop() inside the BSB-LAN code is very risky, because the loop() will invoke the callback when it receives a message. As not all variables used there are local and also the status of the BSB implementation is not known, it may suddenly call another query to BSB while a new message arrives. One example where it breaks: I have seen, two polls behind each other will give partial results.

So basically, calling of loop should be as often as possible, but due to the way the code is written (many global variables), this causes havoc. So we should not do this!

I will give the socket timeout and keep alives a few updates to try out.

P.S.: The serial log does not give much information except "connection lost".

@uschindler
Copy link
Contributor Author

uschindler commented Dec 6, 2024

I now added:

    MQTTPubSubClient->setKeepAlive(15);
    MQTTPubSubClient->setSocketTimeout(180);

Setting keep alive to 30 is a bad idea, because it would make it more brittle. The keep alive tells the client to send a packet to the broker to tell it "yes I am there". If the broker does not get keep alive packets often enough, it will disconnect.

Actually the socket timeout identical to keep alive looks strange to me. I raised it to 180 seconds. This would mean that if the server does not send a keep alive to the client in time, the client will kill the connection. The default keep alive value in Mosquitto is 60 secs, minimum configurable is 5s:

MAN page:

   keepalive_interval seconds
      Set the number of seconds after which the bridge should send a ping if no other traffic has occurred. Defaults to 60. A minimum value of 5 seconds is allowed.

Basically the socket timeout needs >> keep-alives (thats what my knowledge about similar protocols tell me, I am devloping large network protocols with Elasticsearch/Apache Lucene/Apache Solr so that's my daily business).

@uschindler
Copy link
Contributor Author

I give it a few hours while adding heater pain and report back. If all works well I can provide a PR.

@uschindler
Copy link
Contributor Author

uschindler commented Dec 6, 2024

It also looks like one of the many keepalive bugs in the (now unmaintained) PubSubClient:

As you have the code of PubSubClient 2.8 included, I will patch it there. PR following

@uschindler
Copy link
Contributor Author

Here is the PR: #682

To me this looks fine after making it Xmas-warm in my office (turn heater to 30 °C). This also seems to fix the duplicate connects on startup of BSB-LAN.

Let's move over the discussion to the PR.

Now this is really closed!

@fredlcore
Copy link
Owner

P.S.: The serial log does not give much information except "connection lost".
You should let me decide on that. You don't know the code as well as I do, and even if there are things that may or may not look obvious to you, it may still tell me important things.

@fredlcore
Copy link
Owner

fredlcore commented Dec 6, 2024

Calling loop() inside the BSB-LAN code is very risky, because the loop() will invoke the callback when it receives a message.

What do you mean by that? Where else should it be called if not as part of the BSB-LAN code? And under what conditions would it be called

As not all variables used there are local and also the status of the BSB implementation is not known, it may suddenly call another query to BSB while a new message arrives.

What status is not known? When/how may it suddenly call another query to BSB? Please don't make assumptions without actually substantiating them. Like above, where you claim that the destination ID might be only above 0x80 if it's a broadcast message, these assumptions don't help. In the case above, it was immediately clear to me that there is no base for this claim, but with other things, I may have to examine it, and without any substantiation, this becomes difficult.

One example where it breaks: I have seen, two polls behind each other will give partial results.

Great. But without any logs, what should I do now? The logging function does much more than two polls right after each other, and there is no issue with that. So how do I replicate this situation?

@uschindler
Copy link
Contributor Author

P.S.: The serial log does not give much information except "connection lost".
You should let me decide on that. You don't know the code as well as I do, and even if there are things that may or may not look obvious to you, it may still tell me important things.

That's fine. I already figured out that the problem seems to not come from your code.

For an issue that happens once per hour it is very hard to record a lot of serial logs, so I did my programmer's preference and reading the code. I am working a lot in software security checking and revieweing PRs of large projects like Elasticsearch / Apache Lucene/Solr, so I am one of those few people that can follow the program workflow without actually ever executing it.

This is not meant bad, but I am not a fan of logfile debugging, because it does not help much. If you prefer that, I can provide log files, yes, but that's not my highest priority.

@fredlcore
Copy link
Owner

fredlcore commented Dec 6, 2024

Great that you work that way. If you fix all that by yourself, then that's perfectly fine. But if want action on my part, however, it would be polite to follow my requirements, to say the least. You don't know how many (real or not) self-proclaimed experts I've had over the years who thought they understood the code when in fact they didn't. I don't blame anyone for not "getting" the code as it has grown organically in many directions over the years, but that means if someone wants action on my part, they have to play by my rules.
P.S.: And yes, replicating an error can be tedious, I know. But if you are an expert programmer, you know that this is key to fixing bugs...

@uschindler
Copy link
Contributor Author

Calling loop() inside the BSB-LAN code is very risky, because the loop() will invoke the callback when it receives a message.

What do you mean by that? Where else should it be called if not as part of the BSB-LAN code? And under what conditions would it be called

See below!

As not all variables used there are local and also the status of the BSB implementation is not known, it may suddenly call another query to BSB while a new message arrives.

What status is not known? When/how may it suddenly call another query to BSB? Please don't make assumptions without actually substantiating them. Like above, where you claim that the destination ID might be only above 0x80 if it's a broadcast message, these assumptions don't help. In the case above, it was immediately clear to me that there is no base for this claim, but with other things, I may have to examine it, and without any substantiation, this becomes difficult.

Sorry for that, I was just guessing here and made a comment. Sometimes another eye on code helps. The wire protocoll is not documented, so based on the "magic value" 128 and the fact that the broadcast address is 0x7F, my assumption was that the maximum device ID is 126; 127 is the broadcast. So the 8th bit could be some special marker.

The issue with the "minus 0x80" I still disagree, but I let you do whatever you want. If you want to prevent issues when the BSB returns a device ID < 0x80 , then log a message and stop the code with a return false (or similar). Now it goes into an invalid state. This is why I suggested to use & 0x7F. YES, I may not know everything about the BSB wire protocol, but thats something which looked like a "strange code pattern".

One example where it breaks: I have seen, two polls behind each other will give partial results.

I tried to explain that. If you call the pubsub loop() method from within a BSB query (as far as I remeber you did this in your commit before each try, so also on first try), the PubSubClient might call the callback for new topics it got in the meantime. Due to global variables the following happens when two polls are published behind each other:

  • the first poll is received and your code prepares to fetch all stati one after each other
  • your code calls query function on BSB bus for the first topic's parameter
  • the query function calls pubsub loop() again (the code added above, now reverted)
  • in the meantime another poll request was published by Home Assistant
  • the pubsub loop() function calls your code to handle the poll received
  • The code then processes only the second poll request, because of the recursive call. It does it one by one till end.
  • the recursive call due to callback returns
  • The remaining entries of the first poll request are no longer executed correctly (I don't know the exact reason for that).

The problem is simply said: If you call the loop() from inside the BSB code that is executed in response to the callback called by the main loop(), it may recursively call itsself which goes wrong due to global variables and state on the BSB bus (an ongoing query).

This is why I said "revert, revert" (this is a common phrase used in the ASF when referring to me....)

Great. But without any logs, what should I do now? The logging function does much more than two polls right after each other, and there is no issue with that. So how do I replicate this situation?

I don't think you need to do anything at moment. The issue seems to be in the PubSubClient and a logic error with regards to when to send or wait for PING requests. So the code disconnects on its own once it gets into a bad state.

@uschindler
Copy link
Contributor Author

As you reverted the call to pubsub's loop() already, I don't think we need too much to handle the race. It should be avoided to execute the loop() while it is already executing a callback.

Based on the comments in the pubsubclient issue tracker (see issues above), many people already call the loop() several times per second and still observing the issue, so I don't think we need to hook it into more places.

@fredlcore
Copy link
Owner

Sorry, but you knew the code in question was reverted and then you made the unspecific statement that a) loop() shouldn't be called inside the BSB-LAN code (which would be all the code?) and b) you say that you have seen (generally?) "two polls behind each other will give partial results". Both of these statements are at least misleading. I take bug reports very seriously and spend time to understand and follow-up on them, so I ask everyone to be as clear and specific about it as possible.

As for the race condition, I didn't add it because of the loop() issue but because there may be a race condition regarding the destination device ID. I don't know whether the callbacks are execuded in parallel or in serial. If they are executed in serial, then they the added code wouldn't be necessary. If they are added in parallel, the destinaton device ID couly be changed (temporarily) from one call and while this is executed and the second call comes in, the second call would take the temporarily changed destination device ID as the general destination device ID and then later revert back to that one upon completion. That's why we need to make sure that there is serial execution.
If there is any documentation on how callbacks are executed on the ESP32 and Arduino Due, this would help understanding when they are executed and then determine whether a race condition could occur...

@fredlcore
Copy link
Owner

It seems that the callback is only called as part of the loop() function, so there can't be a race condition. Only recursive calls could be an issue, obviously...

@uschindler
Copy link
Contributor Author

Sorry, but you knew the code in question was reverted and then you made the unspecific statement that a) loop() shouldn't be called inside the BSB-LAN code (which would be all the code?) and b) you say that you have seen (generally?) "two polls behind each other will give partial results". Both of these statements are at least misleading. I take bug reports very seriously and spend time to understand and follow-up on them, so I ask everyone to be as clear and specific about it as possible.

Sorry for that, you were faster with reverting than I was with writing. I have seen the problem with the code and wrote why the call to loop() from withing the BSB code is not a good idea.

Later I wanted to describe exactly why the first approach with the call to the loop wasn't working. Because I took it seriously and tested your code!

As for the race condition, I didn't add it because of the loop() issue but because there may be a race condition regarding the destination device ID. I don't know whether the callbacks are execuded in parallel or in serial. If they are executed in serial, then they the added code wouldn't be necessary. If they are added in parallel, the destinaton device ID couly be changed (temporarily) from one call and while this is executed and the second call comes in, the second call would take the temporarily changed destination device ID as the general destination device ID and then later revert back to that one upon completion. That's why we need to make sure that there is serial execution. If there is any documentation on how callbacks are executed on the ESP32 and Arduino Due, this would help understanding when they are executed and then determine whether a race condition could occur...

Thats fine! Appreciate that you try to find the problem.

When you look at the pubsubclient code, the callback is directly called in the loop() function. The loop takes one message from the socket and if it is a published message it calls the callback. Theres no magic, its just calling the function behind the pointer synchronously.

@uschindler
Copy link
Contributor Author

In the current code all is fine, it only breaks when you call pubsub#loop() from inside the callback so it is recursive.

@fredlcore
Copy link
Owner

fredlcore commented Dec 6, 2024

why the call to loop() from withing the BSB code is not a good idea.

That statement is still wrong, it's still called within the BSB code, just no longer recursively as part of the query() function. But let's leave it at that.

@uschindler
Copy link
Contributor Author

why the call to loop() from withing the BSB code is not a good idea.

That statement is still wrong, it's still called within the BSB code, just no longer recursively as part of the query() function. But let's leave it at that.

Sorry for being imprecise: I meant inside the "BSB protocol handling code" or specifically in the query function.

P.S.: I know what the issue is of being the only maintainer of a project. I do https://github.com/policeman-tools/forbidden-apis which is used in many many open source projects for code quality checks. Sometimes you can't handle stupid issues/pull requests. I went over to just ignore them and keep my release schedule. This project is also more or less a "finished" tool with no new features, it just needs to be updated once every half year....

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants