Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"agent.server.serf.lan: serf: query reply ID mismatch" triggered by widespread keyring queries #21059

Open
markblackman opened this issue May 7, 2024 · 1 comment

Comments

@markblackman
Copy link
Contributor

markblackman commented May 7, 2024

Overview of the Issue

When reading LAN and WAN keyrings (operator/keyring) from every node in a cluster (~300 node) which is also federated with several other similar cluster, we start seeing the following messages for the WAN and LAN gossip pools:

May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)


Reproduction Steps

Consul info for both Client and Server

Server info
agent:
	check_monitors = 16
	check_ttls = 1
	checks = 30
	services = 26
build:
	prerelease = 
	revision = 7ba4e1b+
	version = 1.13.5+3
	version_metadata = 
consul:
	acl = enabled
	bootstrap = false
	known_datacenters = 9
	leader = false
	leader_addr = 10.98.8.4:8498
	server = true
raft:
	applied_index = 2951624713
	commit_index = 2951624713
	fsm_pending = 0
	last_contact = 1.991887ms
	last_log_index = 2951624713
	last_log_term = 7256
	last_snapshot_index = 2951614510
	last_snapshot_term = 7256
	latest_configuration = [{Suffrage:Voter ID:9fbe7e25-0371-578d-ac8e-886b4dfa9a5d Address:10.98.8.4:8498} {Suffrage:Voter ID:542c9d7e-732c-5803-8e36-ba11168bdf99 Address:10.98.8.5:8498} {Suffrage:Voter ID:97e86b31-ae3e-52f7-8746-4876adbbf2c4 Address:10.98.9.44:8498} {Suffrage:Voter ID:5f909fab-fe67-5502-bf8a-f561c8c66e08 Address:10.98.10.65:8498} {Suffrage:Voter ID:8ce84c87-1b7c-5836-aa21-24e5aa087d78 Address:10.98.10.213:8498}]
	latest_configuration_index = 0
	num_peers = 4
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Follower
	term = 7256
runtime:
	arch = amd64
	cpu_count = 128
	goroutines = 9258
	max_procs = 128
	os = linux
	version = go1.21.3 X:boringcrypto
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 487
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 150611
	members = 223
	query_queue = 0
	query_time = 310159
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 144768
	members = 33
	query_queue = 0
	query_time = 231467

too much redaction required for the whole thing config.hcl, some highlights below

bind_addr = "0.0.0.0"
serf_wan = "[::]"
translate_wan_addrs = true

ports = {
        server = 8498
        serf_lan = 8499
        http = 8500

        dns = 8600
        serf_wan = 8399
        grpc = -1
}
Client info

client info and config nearly identical to server above

Operating system and Environment details

Linux, 6.6.23 kernel, Debian bookworm

Log Fragments

May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.117Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.129Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.507Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1765095158)
May 07 07:53:48 hostXYZ consul[30719]: 2024-05-07T07:53:48.980Z [WARN]  agent.server.serf.wan: serf: query reply ID mismatch (Local: 324682337, Response: 1978383755)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 441950090, Response: 991518040)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2126224088)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 2033316133)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1562544182)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1422278651)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1742329462)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 1842479929)
May 07 07:53:54 hostXYZ consul[30719]: 2024-05-07T07:53:54.312Z [WARN]  agent.server.serf.lan: serf: query reply ID mismatch (Local: 1790442446, Response: 406350588)
@markblackman
Copy link
Contributor Author

my guess is that the consul agents are seeing replies to queries they didn't initiate and are confused by that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant