Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature Request: support online UpdateCellInfo #17560

Open
tanjinx opened this issue Jan 16, 2025 · 1 comment
Open

Feature Request: support online UpdateCellInfo #17560

tanjinx opened this issue Jan 16, 2025 · 1 comment

Comments

@tanjinx
Copy link

tanjinx commented Jan 16, 2025

Feature Description

The vtctld command UpdateCellInfo can be quite destructive when the topo server address is updated in real time, which resulted in crash/panic on all vtgates in that cell, causing significant customer downtime. The panic stack:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x10321ed]

goroutine 1362732539 [running]:
github.com/hashicorp/consul/api.(*KV).getInternal(0x0, {0xc09cc65ad0?, 0x0?}, 0x2501bb0?, 0xc04062eb60?)
        github.com/hashicorp/consul/[email protected]/kv.go:130 +0x6d
github.com/hashicorp/consul/api.(*KV).Get(0x2501b40?, {0xc09cc65ad0?, 0x4103a5?}, 0xc06d0227a0?)
        github.com/hashicorp/consul/[email protected]/kv.go:69 +0x53
vitess.io/vitess/go/vt/topo/consultopo.(*Server).Watch.func1()
        vitess.io/vitess/go/vt/topo/consultopo/watch.go:98 +0x1b0
created by vitess.io/vitess/go/vt/topo/consultopo.(*Server).Watch in goroutine 1362735945
        vitess.io/vitess/go/vt/topo/consultopo/watch.go:69 +0x3a5

The main problem is in this code block, when a new watcher (either a shard watcher, or topo watcher) came in, it found the cell server address is changed, and then closed all the cached connections which are still being used by other watchers:
In vt/topo/server.go:

		// Client exists in cache.
		// Let's verify that it is the same cell as we are looking for.
		// The cell name can be re-used with a different ServerAddress and/or Root
		// in which case we should get a new connection and update the cache
		if ci.ServerAddress == cc.cellInfo.ServerAddress && ci.Root == cc.cellInfo.Root {
			return cc.conn, nil
		}
		// Close the cached connection, we don't need it anymore
		if cc.conn != nil {
			cc.conn.Close()
		}

Supporting onlineUpdateCellInfo will be very useful in the context of topo server migrations. Although the crash stack is on the Consul client, the problem also exists on other topo servers.

Use Case(s)

Support topo server migration without customer downtime, which can be server DNS/url change or migration between supported topo server types.

@tanjinx tanjinx added the Needs Triage This issue needs to be correctly labelled and triaged label Jan 16, 2025
@deepthi
Copy link
Member

deepthi commented Jan 16, 2025

I'll just note that

migration between supported topo server types

is not possible using UpdateCellInfo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: To do
Development

No branches or pull requests

3 participants