nodes may not agree on leader consensus #885

Open
dansrogers opened this issue Aug 27, 2021 · 4 comments

@dansrogers

Describe the bug
The tests for leader conflicts do not account for out-of-order message delivery to multiple nodes.

More generally, there does not appear to be any guarantee that the nodes in a cluster agree on who the leader is. The only guarantee appears to be that each node believes there is exactly one leader. As far as I can tell, leadership is not enforced either: a client can ask any node for an answer and room-assistant will respond, even if that node is not the leader. Nor is the cluster as a whole required to converge on a single, coherent leader.

To reproduce
Add and remove nodes from the cluster in quick succession. Eventually the nodes' views of the leader will get out of sync. Restoring power after a power outage could create the same condition.

Additional context
Paxos, Raft, or another distributed consensus algorithm would be required to ensure that nodes agree on who the leader is.
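
To make the ordering problem concrete, here is a minimal toy model in Python. It is not room-assistant's election code: the `Announcement` type, the `term` field, and the node name "kitchen" are assumptions made purely for illustration. It compares a naive "last message wins" rule with a rule that keeps the announcement carrying the highest term, which is one ingredient of Raft. Term numbers alone are not a full consensus protocol; the sketch only shows why arrival order should not decide the leader.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Announcement:
    term: int    # hypothetical, monotonically increasing election round
    leader: str  # node that claims leadership in that round


def last_message_wins(messages):
    """Naive rule: believe whichever announcement arrived most recently."""
    leader = None
    for msg in messages:
        leader = msg.leader
    return leader


def highest_term_wins(messages):
    """Order-insensitive rule: keep the announcement with the highest term."""
    best = None
    for msg in messages:
        if best is None or msg.term > best.term:
            best = msg
    return best.leader if best else None


announcements = [Announcement(1, "hall"), Announcement(2, "kitchen")]

# Every delivery order a node might observe for the same two messages.
orders = list(permutations(announcements))

naive_views = {last_message_wins(order) for order in orders}
term_views = {highest_term_wins(order) for order in orders}

print(naive_views)  # two different leaders, depending on arrival order
print(term_views)   # a single leader, regardless of arrival order
```

With the naive rule, two nodes that receive the same announcements in a different order end up with different leaders, which is the situation described above.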

@mKeRix
Owner

mKeRix commented Aug 29, 2021

Thanks for reporting this. It's one of the things that has annoyed me for a while now about the underlying communication stack that room-assistant uses. I had planned to redo the whole clustering layer based on more standardized protocols (I was looking at ZeroMQ for communication, with Zyre for the clustering on top), but last time I checked, most libraries still left a lot of things open.

@mKeRix mKeRix added this to To do in room-assistant roadmap via automation Aug 29, 2021
@BelgarionNL

I have the same thing happening now: 1 of my 3 rpi0s simply refuses to join the cluster and follow the leader, as it were.

I have tried weight, quorum, etc. Nothing works.

The software is honestly too buggy to be used outside of a testing/development function, and I think that should be posted on the website in HUGE LETTERS.

It's a bit of a letdown, so I have reverted to motion sensors for now. I will keep testing new versions when they come out, since I really do appreciate the effort. Truly.

@JeroenTuinstra

Wanted to indicate that I am experiencing the same issue. Leadership of the cluster can move around a lot. I assigned weight 100 to one node and much lower weights to the rest. Still, 3 of the 4 nodes select a leader other than the one with weight 100.

It seems the weight doesn't actually do anything. The problem is that the one node that does select the highest-weight node as leader is that node itself.
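
One way a symptom like this could arise, offered only as an illustration and not as a diagnosis of room-assistant, is that each node picks the highest weight among the peers it currently knows about. The Python sketch below assumes that rule; `pick_leader`, the node names other than "hall", and the membership views are all hypothetical.

```python
def pick_leader(known_weights):
    """Pick the highest-weight node this node knows about; ties break by name."""
    return max(known_weights, key=lambda name: (known_weights[name], name))


# Configured weights: one node at 100, the rest much lower.
weights = {"hall": 100, "tvroom": 10, "kitchen": 5, "office": 1}

# Hypothetical membership views: three nodes have not heard from "hall".
views = {
    "hall": {"hall", "tvroom", "kitchen", "office"},
    "tvroom": {"tvroom", "kitchen", "office"},
    "kitchen": {"tvroom", "kitchen", "office"},
    "office": {"tvroom", "kitchen", "office"},
}

# Each node applies the same rule, but only to the peers in its own view.
beliefs = {
    node: pick_leader({name: weights[name] for name in peers})
    for node, peers in views.items()
}

for node, leader in beliefs.items():
    print(f"{node} believes the leader is {leader}")
```

Under these assumptions only the weight-100 node selects itself, while the other three agree on a different leader. The weights are applied correctly in each case; the views they are applied to differ.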

I also appreciate the effort that has gone into the software, but at the moment it is really buggy. The BLE app for the iPhone is really inconsistent with distance and reporting. That could, however, be related to the instability of the leader consensus.

@alastaid

Just for info, I have experienced the same. I got around it by 1) turning off auto discovery and hard-coding the nodes and weights in all my room-assistant nodes, and 2) running an automation in Home Assistant that checks the state of the cluster leader and, if it is not "hall", restarts the room_assistant add-on on my HA install, which is the cluster leader. I have found that maintaining the cluster leader is essential to accurate room presence. Also, as stated in another issue, I have given up on BLE and switched back to BluetoothClassic, as it is reliable with both iOS and Android all the time: not as fast or accurate, but reliable. Example automation for cluster reset:

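# Restart the room-assistant add-on when the reported cluster leader is not "hall".
- alias: tvroom_cluster_reset
  trigger:
    # Fires when this template becomes true, i.e. the leader is no longer "hall".
    platform: template
    value_template: '{{ (states.sensor.tvroom_cluster_leader.state != "hall") }}'
  condition:
    # Only act once switch.tvroommusic has been on for at least 3 minutes.
    - condition: state
      entity_id: switch.tvroommusic
      state: 'on'
      for:
        minutes: 3
    # Cooldown: more than 600 seconds since this automation's last_triggered time.
    - condition: template
      value_template: '{{ (as_timestamp(now()) - as_timestamp(state_attr("automation.tvroom_cluster_reset", "last_triggered"),0) | int > 600 ) }}'
  action:
    # Restart the add-on identified by its slug.
    - service: hassio.addon_restart
      data:
        addon: 6e66619d_room_assistant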