Feature Request: vtgate handles replicas with stalled disk
#17610
Comments
Really, MySQL is the problem here: it should be smarter, but I'm guessing we'll have to be smarter instead 😄
I found some weak signals. Two are in global status variables, which can still be queried in this scenario:

    mysql> show global status like 'Innodb_data_pending_writes'\G
    *************************** 1. row ***************************
    Variable_name: Innodb_data_pending_writes
    Value: 1
    1 row in set (0.00 sec)

And:

    mysql> show global status like 'Innodb_os_log_pending_writes'\G
    *************************** 1. row ***************************
    Variable_name: Innodb_os_log_pending_writes
    Value: 1
    1 row in set (0.00 sec)

But those being non-zero alone doesn't really mean a stall; they are just two of the few queryable signals I've found.

Another is that queries to the replication applier worker tables hang, but that's not really something to rely on:

    mysql> select * from performance_schema.replication_applier_status;
    (hangs forever)
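As a rough illustration of how these weak signals might be used, the sketch below flags a possible stall only when a pending-write counter stays non-zero across several consecutive samples, since a single non-zero reading is normal. This is not Vitess code: the function name and the `min_samples` knob are hypothetical, and the samples are assumed to have been read from SHOW GLOBAL STATUS already and converted to integers.

```python
from typing import Mapping, Sequence

# The two status variables named in the comment above.
PENDING_WRITE_VARS = (
    "Innodb_data_pending_writes",
    "Innodb_os_log_pending_writes",
)


def pending_writes_stuck(
    samples: Sequence[Mapping[str, int]],
    min_samples: int = 3,  # hypothetical tuning knob, not a MySQL/Vitess setting
) -> bool:
    """Return True if any pending-write counter was non-zero in each of the
    last `min_samples` samples.

    `samples` is ordered oldest-first; each entry maps a status variable
    name to its integer value.
    """
    if len(samples) < min_samples:
        return False  # not enough history to say anything
    recent = samples[-min_samples:]
    return any(
        all(sample.get(name, 0) > 0 for sample in recent)
        for name in PENDING_WRITE_VARS
    )
```

Even with several samples this remains a hint rather than proof: a busy but healthy server can also show pending writes on consecutive polls.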
Repro steps:
|
I think that, given #17470 already added the ability to detect stalled disks, we should just use that information to mark the tablet not serving. This would make vtgate not route queries to the vttablet until it has recovered from the stalled disk (at which point the replication lag would also correct its value, and we might still not send it any queries until the replication lag reduces).
@timvaillancourt I've made the change I suggested ☝ in #17624. Please take a look when you have time and let me know if this would address your use case.
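The routing behaviour described in the suggestion can be pictured with a small model. The sketch below is not the change made in #17624 and does not use Vitess APIs: `TabletHealth`, `report_health`, `routable`, and the `max_lag_seconds` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class TabletHealth:
    alias: str
    serving: bool
    replication_lag_seconds: int


def report_health(alias: str, disk_stalled: bool, replication_lag_seconds: int) -> TabletHealth:
    """Tablet side: a detected disk stall forces the tablet to not serving."""
    return TabletHealth(
        alias=alias,
        serving=not disk_stalled,
        replication_lag_seconds=replication_lag_seconds,
    )


def routable(tablets: Sequence[TabletHealth], max_lag_seconds: int = 30) -> List[str]:
    """Gateway side: route only to serving tablets whose lag is acceptable."""
    return [
        t.alias
        for t in tablets
        if t.serving and t.replication_lag_seconds <= max_lag_seconds
    ]
```

In this model a frozen replica is excluded even though it reports zero lag, and after the disk recovers it stays excluded until its (now accurate) lag drops below the threshold.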
Feature Description
This feature request hopes to mitigate the user impact caused by a REPLICA/RDONLY with a stalled MySQL datadir disk.

In at least Vitess 19 (w/MySQL 8.0.36), when the disk storing the MySQL datadir is stalled (can be simulated with fsfreeze --freeze /mount/for/mysql/here), vtgate continues to believe the stalled replica is healthy, and the health stream updates sent to vtgate do not reflect that there is any problem. In reality, some/all application queries to the underlying mysqld essentially hang, so vtgate is sending traffic to a blackhole.

This sort of makes sense, because the health stream stats don't contain too many metrics to infer the health of mysqld itself, and the updates continue to send when a disk is stalled, so the health stream never "times out". ReplicationLagSeconds is one metric that could infer health, but it turns out that even if --enable-heartbeat is enabled, this lag value comes purely from SHOW REPLICA STATUS.
Interestingly, on a totally-datadir-stalled mysqld (in a shard with live writes), Seconds_Behind_Source never increases, so in health stats we see ReplicationLagSeconds: 0 😱. SHOW REPLICA STATUS output:

And Relay_Log_File and Relay_Log_Pos have no movement - which I believe is the reason Seconds_Behind_Source remains zero; the SQL thread is 0 seconds behind the relay logs, which have stopped receiving updates due to the stalled disk.

And when you un-freeze the disk, suddenly mysqld realizes it's behind. SHOW REPLICA STATUS:

I consider vtgate seeing ReplicationLagSeconds: 0 here a bug. This bug could potentially be solved by considering the sidecar-based heartbeat (when --enable-heartbeat is set); this value will increase in staleness if it stops receiving updates from replication workers. If there is no objection, I would like to update the logic that gathers ReplicationLagSeconds to use the sidecar-based heartbeat when --enable-heartbeat is set.

Regardless of the accuracy of ReplicationLagSeconds, I feel vtgate should know sooner that a replica is unusable in this scenario. Waiting for the replication lag to grow extremely high before ignoring a stalled replica still causes impact. Ideally we'd include a metric in the health stream updates that tells vtgate more about the health of the underlying mysqld, or perhaps just a HealthError (an existing health stats field).

Some questions/braindump/ideas:
[...] mysqld. Should we repurpose that for replicas too?
[...] mysqld can stall?
[...] --enable-heartbeat is enabled and we know the heartbeat interval is say 1s, no movement in relay logs for that duration would indicate a stall? 🤔

Deferred/shot-down ideas:
[...] mysqld is a no-go, REPLICAs have read_only = ON and sql_log_bin = 0 feels hacky

cc @GuptaManan100 as this may involve the stalled-disk PR
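Two of the ideas in this description lend themselves to a small sketch: deriving lag from heartbeat staleness instead of Seconds_Behind_Source, and treating unchanged relay log coordinates as a stall hint. The Python below is illustrative only and is not Vitess code. The function names, the parameters, and the assumption that the last heartbeat timestamp is available as a Unix time are all hypothetical.

```python
from typing import Mapping, Optional


def replication_lag_seconds(
    seconds_behind_source: int,
    heartbeat_enabled: bool,
    last_heartbeat_unix: Optional[float],
    now_unix: float,
) -> int:
    """Prefer heartbeat staleness when heartbeats are enabled.

    On a stalled replica the last applied heartbeat stops advancing, so this
    value keeps growing even while Seconds_Behind_Source stays at 0.
    """
    if heartbeat_enabled and last_heartbeat_unix is not None:
        return max(0, int(now_unix - last_heartbeat_unix))
    return seconds_behind_source  # fall back to SHOW REPLICA STATUS


def relay_log_stalled(
    previous: Mapping[str, object],
    current: Mapping[str, object],
    elapsed_seconds: float,
    heartbeat_interval_seconds: float = 1.0,
) -> bool:
    """Return True if the relay log coordinates did not move between two
    SHOW REPLICA STATUS samples taken at least one heartbeat interval apart.

    With heartbeats being written on the primary, some movement is expected
    within each interval, so no movement hints at a stall.
    """
    if elapsed_seconds < heartbeat_interval_seconds:
        return False  # samples too close together to conclude anything
    keys = ("Relay_Log_File", "Relay_Log_Pos")
    return all(previous[key] == current[key] for key in keys)
```

For example, a replica frozen 45 seconds ago would report 45 from the first function while Seconds_Behind_Source still reads 0. The relay log check only makes sense in a shard where heartbeats (or other writes) are guaranteed within the interval.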
Use Case(s)
Users who would like vtgate to stop sending traffic (that will probably fail) to a REPLICA/RDONLY with a stalled MySQL datadir disk.