-
-
Notifications
You must be signed in to change notification settings - Fork 1.8k
MDEV-32830 I. refactor XA binlogging for better integration with BGC/… #4041
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…replication/recovery This commit is the part I of the series of four that addresses MDEV-31949 in two main directions which are xa parallel slave performance and xa transaction crash-recovery. This one improves upon MDEV-742 design's XA binlogging to facilitate to the crash-recovery (the actual binlog-based recovery is coming in the part IV of MDEV-33168 et al). With the refactoring changes, when binlog is ON, handling of execution of a XA transaction, including binlogging, is made conceptually uniform with the normal BEGIN-COMMIT transaction. That is At XA-PREPARE the transaction first is prepared in engines and after that accumulated replication events are written to the binary log, naturally without any completion event as it's unknown yet. When later XA-"COMPLETE" that is XA-COMMIT and XA-PREPARE follows up, the binary logging of respective Query event takes place first. One can perceive such scheme as if a normal transaction logging is split in the middle into two parts (and nothing really happens in between of them but time passed by). And after the second chunk is sent to binlog the transaction gets committed (or rolled back) in engine. With binlog is enabled both phases' loggings go through binlog-group-commit, where XA-PREPARE "sub-transaction" merely groups for binary logging so skips the engine action while XA-"COMPLETE" does both, that is the logging and an ordered "complete". This behavior is also consistent between completions from the native and external connections. Being a participant of binlog-group-commit designates either XA phase is recoverable (not implemented here) from active binlogs determined by binlog-checkpoint. For the latter specifically this patch removes custom unlogging of XA-prepare group. See entry.need_unlog= 0 et al in MYSQL_BIN_LOG::write_transaction_to_binlog(). In addition to the above a corner case of engine read-only XA transaction is addressed. Previously it was streamlined with logging an empty XA-PREPARE group of binlog events concluded by XA-"COMPLETE" query-event. Now when a preparing XA transaction is found to have only read-only engine branches or none it is marked for rollback as XA_RDONLY optimization: - nothing gets logged at the prepare time an XA_RDONLY note is generated and - it's rolled back at disconnect For XA-COMPLETE to tell whether the prepare phase was logged or not the XID state object is extended with a boolean flag which is a part internal interface for recovery implementation. The flag is normally raised by XA-prepare at flushing to binlog and also at binlog recovery (will be done so it's fully implemented). Notable changes: sql/handler.cc - ha_prepare() a. is ensured to execute binlog_hton::prepare() as the last XA's branch for preparing; b. engine read-only is marked in the xid state to rollback and ER_XA_RDONLY *note* is generated. - conversely ha_rollback_trans() executes binlog hton::rollback() as first branch (the commit method was already equipped to do so) - ditto to the external completion of XA via ha_commit_or_rollback_by_xid(); the function is made a sort of recursive. It may be first be invoked on a top level to take on the binlog hton "completion" to be called from its stack once again now having is_xap_binlogged() false, so to carry out the engine commit. - xarecover_handlerton() now only simulates successful find of the user xid in binlog. sql/log.cc - binlog_commit,rollback() et al are simplified and cleaned up (like binlog_complete_by_xid() introduction). In particular binlog_{commit,rollback}() are rendered to retain just a single piece of XA footprint in either. The methods recognize naturally empty transaction caches at XA completion to proceed anyway into binlog-group-commit thickness; the binlog_commit's binlog_commit_flush_trx_cache() decides which type of transaction and which XA phase is being handled so a proper group event closure is computed. - MYSQL_BIN_LOG::trx_group_commit_with_engines() takes care to raise or drop XID::binlogged flag via xid_cache_update_xa_binlog_state(). - the new run_xa_complete_ordered() encapsulates XA specifics at execution of the engine ordered commit. It's defined with asserts due to MDEV-32455. It's not done as TC_LOG::member because of the scope of this work is limited. The new function mirrors the logic of the normal run_commit_ordered() in that it skips engine completion for those that lack hton::commit_ordered() in favor of doing that on the top level of ha_commit,rollback_trans(). sql/log_event_server.cc - use the transaction cache at XA-PREPARE handling (specifically when XA-END Query event is logged). sql/xa.cc - XID_cache_element extended with is-binlogged meaning flag and few rating functions added to use by binlogging and recovery (xid_cache_update_xa_binlog_state()); - trans_xa_commit,rollback() external action branches are converted into calls of a new largely common function. sql/xa.h - xid_cache_insert(XID *xid, bool is_binlogged) parameter list is extended for xa binlog recovery.
Affected binlog,main,rpl suites tests/results due to the empty XA's rollback optimization.
|
…_ordered .. .. otherwise it asserts expecting a non-zero value of ha_info::m_ht in the function's loop. However since it's a rollback operation such state is possible as a top-level resets the ha_info:s of transaction in a loop so some of them can get reset before binlog_rollback() is called. Such transaction that is "spawned" to binlog-and-rollback (because may contain unrollable side-effects) should not have visited TC_LOG::run_prepare_ordered() in the first place. And it did so 'cos of incorrect cache_mngr->using_xa value. This is fixed now.
65b4ee9
to
2478c28
Compare
MDEV-32830 base patch missed out supporting XA-COMPLETE via hton::"complete"_by_xid with multiple engines some of them are and some are not defined ordered_commit(). The base patch fails with an assert in a path that should have been unreachable in the first place. The assert path revealed a second commit-by-xid capable engine (Spider) was running commit_by_xid() as if it is ordered_commit()-type of while Spider does not define ordered_commit. For engines featuring the ordered-commit undefined and commit-by-xid defined commit_by_xid() is considered as a general so cannot be invoked out of binlog-group-commit (run_prepare_ordered()). It would be a full commit similarly to the normal transaction case be invoked after binlog-group-commit engine loop is done^\footnote{% also specified in p.5 low-level design points of MDEV-32830 description}. For that this patch - extends commit_or_rollback_xa_engine() with a hint for xarollback_handlerton() argument that enables to understand which phase of XA completion it's being run. - elaborates in xacommit_handlerton(). Specifically to fix this bug, not invoke a defined hton->complete_by_xid when the method is to run as ordered-"complete" by engine lacking the ordered_commit.
…lling back.. XA in XA_ROLLBACK_ONLY state. The assert was too strict having not anticipated to face a XA-PREPARE ending in rolling back. That's a correct behaviour of read-only disconnecting XA. Fixed to relax the assert.
2478c28
to
4331ea8
Compare
A ha_rollback_trans() DBUG_ASSERT(thd->lex->sql_command != SQLCOM_XA_ROLLBACK || thd->transaction->xid_state.is_explicit_XA()) expected an XA sql_command match a XID state. However Spider engine can visit this function even when Spider is not the xa transaction participant (sic!) through XA ROLLBACK 'xid' from an external connection. The assert is refined, at least temporarily to satisfy MDEV-32830 testing, to let the function run an internal Spider's transaction.
CACHE INDEX t PARTITION after XA-PREPARE also this a line previously assert for another "exceptional" sql command: ... This is really a variety of MDEV-32455. The assert did not expect another use case of implicit rollback with XA transaction while the sql command should have been (I think) rejected with XAER_RMFAIL: The command cannot be executed when global transaction is in the ACTIVE state The assert is refined and its another instance removed being redundant.
An observed assert in rpl_group_info::unmark_start_commit DBUG_ASSERT(!gco->next_gco || gco->next_gco->wait_count > e->count_committing_event_groups || allow_unmark_after_complete); was overreaction. The reason that execution got to that point was MDEV-33921 underfixed that issue for the slave parallel mode. That's done now. The former test is extended to hit the assert by the (buggy) BASE of this commit.
This is a huge patch, >7k lines.
In line with what we also discussed some weeks ago on the replication
meetings, I'll start by making sure things around the patch are clear, ie.
description and buildbot/testing.
As I understand, this is a follow-up to an earlier review of a related
patch, but that was now quite a long time ago, so forgive me for asking for
re-iterating any discussions/explanations that were also given back then.
There seem to be a lot of failures in the buildbot on the brach?
https://buildbot.mariadb.org/#/grid?branch=bb-12.1-MDEV-32830_xa
https://buildbot.mariadb.org/#/grid?branch=refs%2Fpull%2F4041%2Fhead
andrelkin ***@***.***> writes:
This one improves upon MDEV-742 design's XA binlogging to
facilitate to the crash-recovery (the actual binlog-based recovery
is coming in the part IV of MDEV-33168 et al).
Does this mean that this patch does not make XA + binlog crash-safe on the
master?
Are there any other major user-visible behaviour changes in this patch, or
is it only internal code-refactoring?
With the refactoring changes, when binlog is ON, handling of execution of
a XA transaction, including binlogging, is made conceptually uniform
with the normal BEGIN-COMMIT transaction.
So what are the motivations for doing this?
Is it mainly to make the code clearer? Are there other benefits?
Performance? Facilitating later improvements (and if so, which)? Others?
That is At XA-PREPARE the transaction first is prepared in engines and
after that accumulated replication events are written to the binary
log, naturally without any completion event as it's unknown yet.
When later XA-"COMPLETE" that is XA-COMMIT and XA-PREPARE follows up,
I don't understand what it means that XA-COMPLETE is XA-COMMIT and XA-PREPARE?
the binary logging of respective Query event takes place first.
I don't understand what it means "logging of respective Query event takes
place first". Respective to what, is this the Query event with the query
string "XA COMMIT '<name>'" or "XA PREPARE '<name>'"? What is "first"
relative to, what comes after?
One can perceive such scheme as if a normal transaction logging
is split in the middle into two parts (and nothing really happens
in between of them but time passed by). And after the second chunk is sent
to binlog the transaction gets committed (or rolled back) in engine.
With binlog is enabled both phases' loggings go through binlog-group-commit,
where XA-PREPARE "sub-transaction" merely groups for binary logging
so skips the engine action while XA-"COMPLETE" does both, that is the logging
and an ordered "complete".
Can you please describe more concretely what happens during the binlog group
commit for XA transactions? What are the exact steps taken, which major
functions in server layer and in engine are being called in what order?
For example, of the the primary purposes of the binlog group commit is to
ensure that binlog and engine contains the transactions in the same order,
that is the purpose of the commit_ordered() call. This ensures for example
that a non-locking mariabackup will be consistent with a specific GTID in
the binlog. Does this patch similarly ensure that XA PREPARE happens in the
engine in the same order as it appears in the binlog? What about XA COMMIT
or XA ROLLBACK? If so, how is this ensured?
Another primary motivation is on the (parallel) slave to ensure that commits
on the slave happen in the same order as in the master's binlog. Does this
patch ensure the same property for the XA PREPARE/COMMIT/ROLLBACK? If so, how?
What is the behavior for cross-engine XA transactions (eg. XA transaction
that modify two or more transactional engines)? Is it handled in this patch,
or if not is there a design for how it could be handled?
This behavior is also consistent between
completions from the native and external connections.
What does this mean, completions from native and external connections?
Being a participant of binlog-group-commit designates either XA phase
is recoverable (not implemented here) from active binlogs determined
by binlog-checkpoint.
For the latter specifically this patch removes custom unlogging of XA-prepare
group. See
entry.need_unlog= 0
et al in MYSQL_BIN_LOG::write_transaction_to_binlog().
The purpose of the binlog checkpoints is to save one fsync() at the end of
the two-phase commit between engine(s) and binlog. A binlog checkpoint means
that any (normal) transaction in that binlog has now been fsync()'ed to the
InnoDB redo log.
What is the corresponding mechanism for user XA in this patch? Please
describe the exact semantics of binlog checkpoints wrt. XA
PREPARE/COMMIT/ROLLBACK, and how the algorithm will be that makes each XA
phase recoverable.
What was the purpose of the custom unlogging that is removed in this patch,
and why is it no longer necessary?
Now when a preparing XA transaction is found to have only read-only engine
branches or none it is marked for rollback as XA_RDONLY optimization:
- nothing gets logged at the prepare time an XA_RDONLY note is generated and
- it's rolled back at disconnect
Where is the XA_RDONLY note generated? Just internally, in the XA hash
holding the current state of the XA transaction?
Then what happens if the server crashes after the read-only XA PREPARE, what
will be the state of the transaction once the server comes up?
For XA-COMPLETE to tell whether the prepare phase was logged or
not the XID state object is extended with a boolean flag which is a
part internal interface for recovery implementation.
The flag is normally raised by XA-prepare at flushing to binlog and
also at binlog recovery (will be done so it's fully implemented).
Maybe this was intended to answer the previous question, but I don't
understand or do not see any concrete description of how it works.
Notable changes:
<snip>
- Kristian.
|
'Failed test nm' and explicit
while the symptoms do not really worrying;
caused by ^. I need to make some test
Right, it's a prerequisite. Ultimately the binlog-based crash-recovery
The main goal to achieve recovery from binlog the normal transaction's rollback way.
Typo, sorry. XA-COMPLETE is XA-COMMIT or XA-ROLLBACK of course.
So the respective adjective attaches to the completion type.
is written first (to match Xid_log_event in the normal case).
Here is a gdb stack of commit from the native (defined as the connection that
The external (defined as a connection from which a prepared disconnected xa is
As you can guess from the above stack and a mentioned equivalency of
The parallel slave execution of XA-COMPLETE is governed similarly to the normal To the XA-PREPARE terminated group of events, it's a transaction from the
Of course it's unrecoverable yet waiting for MDEV-21777.
Without crashes this this refactoring part I must be able to handle such load, However now, see mdev-36802_multiple_xa_engine, I've had to delve deeper to
Sure, and that's not deny what was stated in what you quote.
I note that 'cos XA-COMPLETE has no specifics beyond normal transaction commit,
To the XA-PREPARE, the difference is p.3 skipped.
I think that should be much more clear now as the strategy is, to reiterate,
Right.
The answer is a read-only XA would be gone already after disconnect from the server. This is a MDEV-33168 objective to whom a piece of logics around
this refactoring provides with. At post-crash restart the recovery would
Right. A lighter "recovery" (still recovery from the user pov) is reconnect with |
andrelkin ***@***.***> writes:
>> the binary logging of respective Query event takes place first.
> I don't understand what it means "logging of respective Query event
> takes place first". Respective to what
So the respective adjective attaches to the completion type.
E.g
> "XA COMMIT '<name>'"
is written first (to match Xid_log_event in the normal case).
Aha, so I think you mean that the XA COMMIT Query event is written to the
binlog before the XA COMMIT is executed in the engine(s)?
And is this changed in the patch, eg. in the existing code in 10.5+, is the
XA COMMIT Query event written to the binlog before or after the XA COMMIT is
executed in the engine(s)?
> Can you please describe more concretely what happens during the
> binlog group commit for XA transactions? What are the exact steps
> taken, which major functions in server layer and in engine are being
> called in what order?
Here is a gdb stack of commit from the native (defined as the connection that
executes all phase of the xa) connection that it logically compatible to the
normal trx:
```
innobase_commit_ordered
run_xa_complete_ordered
TC_LOG::run_commit_ordered
...
binlog_commit
commit_one_phase_2
ha_commit_one_phase
trans_xa_commit
```
Hm, so you omitted part ("...") of the stack trace. So I'm wondering if the
"..." includes the binlog group commit, queue_for_group_commit() and/or
trx_group_commit_leader()?
Since you wrote this in the description:
"With binlog is enabled both phases' loggings go through binlog-group-commit"
The call path uses ha_commit_one_phase(), which I guess is because the XA
COMMIT or XA ROLLBACK does not do a two-phase commit with the binlog (it
cannot, as it is already prepared), but it can still take part in the binlog
group commit, just without the log_and_order() / unlog() part.
> For example, of the the primary purposes of the binlog group commit
> is to ensure that binlog and engine contains the transactions in the
> same order, that is the purpose of the commit_ordered() call. This
> ensures for example that a non-locking mariabackup will be
> consistent with a specific GTID in the binlog. Does this patch
> similarly ensure that XA PREPARE happens in the engine in the same
> order as it appears in the binlog? What about XA COMMIT or XA
> ROLLBACK? If so, how is this ensured?
To the XA-PREPARE, the ordering of prepared XA:s in Engine is as random as it is
on normal transactions. This is sufficient to achieve the same normal
trx' logics of backup restoration: when an xa transaction is present
in backup and is missed in binlog it's to be rolled back, otherwise it
remains alive (and of course in Prepared state, differently from the
normal case).
This I do not understand.
I understand that during *crash recovery*, an XA PREPARE will be kept or
rolled back depending on what is in the binlog. IIUC this will be
implemented not in this patch but in a follow-up.
But when restoring a *backup*, we do not have the binlog available, right?
I am considering an example where we have XA PREPARE 't1' and XA PREPARE 't2'.
Suppose we have the following sequence of execution on a (parallel) slave:
t2 prepare in engine
t2 release backup lock in queue_for_group_commit()
(*)
t1 prepare in engine
t1 release backup lock in queue for group commit()
leader binlogs t1, t2
The binlog order here (on master and slave both) is 't1' followed by 't2'.
If we do a mariabackup at point (*), then it seems there is no way to
provision another slave from that backup, as there is no valid GTID position
corresponding to it. A position after XA PREPARE 't2' will skip replicating
XA PREPARE 't1'. A position before XA PREPARE 't1' will fail on duplicate XA
PREPARE 't2'.
So did I miss something, and there is a mechanism to ensure that a backup of
a slave can be used to provision another slave? If so, what mechanism?
Or there is no such mechanism in this patch, but will be in a follow-up
patch? If so, what mechanism?
Or this is a limitation, and a mariabackup of a (parallel) slave can not
always be used to provision a new slave? (This is after all already the case
for mysqldump).
Note that this is not a critisism of the patch at this point one way or the
other, it is an attempt to understand what is the intention of the patch to
be able to do the review. This is a huge patch, >8k lines, very difficult to
review correctly. It is very important during a complex review to understand
what the purpose of the patch is. Otherwise whenever something looks odd,
the reviewer is left to guess if it is a bug in the patch, if it is a
deliberate limitation of the patch, or if it is merely mis-understanding on
the part of the reviewer.
- Kristian.
|
Sure. In the recovery sense this binlog event is (going to be via MDEV-33168)
Does include.
Actually I naively thought binlog is around. However there's a method
so t2 is XA-PREPAREd in the backup it is still not in the list. Thanks for the case. This is worth of another ticket to report. As to your thorough approach, as I completely with you especially having pretty constructive feedback that you're always prolific with! |
…replication/recovery
This commit is the part I of the series of four that addresses
MDEV-31949 in two main directions which are xa parallel slave
performance and xa transaction crash-recovery.
This one improves upon MDEV-742 design's XA binlogging to
facilitate to the crash-recovery (the actual binlog-based recovery
is coming in the part IV of MDEV-33168 et al).
With the refactoring changes, when binlog is ON, handling of execution of
a XA transaction, including binlogging, is made conceptually uniform
with the normal BEGIN-COMMIT transaction.
That is At XA-PREPARE the transaction first is prepared in engines and
after that accumulated replication events are written to the binary
log, naturally without any completion event as it's unknown yet.
When later XA-"COMPLETE" that is XA-COMMIT and XA-PREPARE follows up,
the binary logging of respective Query event takes place first.
One can perceive such scheme as if a normal transaction logging
is split in the middle into two parts (and nothing really happens
in between of them but time passed by). And after the second chunk is sent
to binlog the transaction gets committed (or rolled back) in engine.
With binlog is enabled both phases' loggings go through binlog-group-commit,
where XA-PREPARE "sub-transaction" merely groups for binary logging
so skips the engine action while XA-"COMPLETE" does both, that is the logging
and an ordered "complete". This behavior is also consistent between
completions from the native and external connections.
Being a participant of binlog-group-commit designates either XA phase
is recoverable (not implemented here) from active binlogs determined
by binlog-checkpoint.
For the latter specifically this patch removes custom unlogging of XA-prepare
group. See
entry.need_unlog= 0
et al in MYSQL_BIN_LOG::write_transaction_to_binlog().
In addition to the above a corner case of engine read-only XA transaction
is addressed. Previously it was streamlined with logging an empty XA-PREPARE
group of binlog events concluded by XA-"COMPLETE" query-event.
Now when a preparing XA transaction is found to have only read-only engine
branches or none it is marked for rollback as XA_RDONLY optimization:
For XA-COMPLETE to tell whether the prepare phase was logged or
not the XID state object is extended with a boolean flag which is a
part internal interface for recovery implementation.
The flag is normally raised by XA-prepare at flushing to binlog and
also at binlog recovery (will be done so it's fully implemented).
Notable changes:
sql/handler.cc
ha_prepare()
a. is ensured to execute binlog_hton::prepare() as
the last XA's branch for preparing;
b. engine read-only is marked in the xid state to rollback and
ER_XA_RDONLY note is generated.
conversely ha_rollback_trans() executes binlog hton::rollback() as first
branch (the commit method was already equipped to do so)
ditto to the external completion of XA via ha_commit_or_rollback_by_xid();
the function is made a sort of recursive. It may be first be invoked
on a top level to take on the binlog hton "completion" to be
called from its stack once again now having is_xap_binlogged() false,
so to carry out the engine commit.
xarecover_handlerton() now only simulates successful find of the user
xid in binlog.
sql/log.cc
binlog_commit,rollback() et al are simplified and cleaned up (like
binlog_complete_by_xid() introduction).
In particular binlog_{commit,rollback}() are rendered to retain
just a single piece of XA footprint in either. The methods recognize
naturally empty transaction caches at XA completion to proceed anyway
into binlog-group-commit thickness;
the binlog_commit's binlog_commit_flush_trx_cache() decides which
type of transaction and which XA phase is being handled so a proper
group event closure is computed.
MYSQL_BIN_LOG::trx_group_commit_with_engines() takes care to
raise or drop XID::binlogged flag via xid_cache_update_xa_binlog_state().
the new run_xa_complete_ordered() encapsulates XA specifics at
execution of the engine ordered commit. It's defined with asserts due to
MDEV-32455.
It's not done as TC_LOG::member because of the scope of this work is
limited. The new function mirrors the logic of the normal run_commit_ordered()
in that it skips engine completion for those that lack
hton::commit_ordered() in favor of doing that on the top level
of ha_commit,rollback_trans().
sql/log_event_server.cc
use the transaction cache at XA-PREPARE handling (specifically
when XA-END Query event is logged).
sql/xa.cc
XID_cache_element extended with is-binlogged meaning flag and
few rating functions added to use by binlogging and recovery
(xid_cache_update_xa_binlog_state());
trans_xa_commit,rollback() external action branches are converted
into calls of a new largely common function.
sql/xa.h
xid_cache_insert(XID *xid, bool is_binlogged) parameter list is
extended for xa binlog recovery.<!--
Thank you for contributing to the MariaDB Server repository!
You can help us review your changes faster by filling in this template <3
If you have any questions related to MariaDB or you just want to hang out and meet other community members, please join us on https://mariadb.zulipchat.com/ .
-->
Description
TODO: fill description here
Release Notes
TODO: What should the release notes say about this change?
Include any changed system variables, status variables or behaviour. Optionally list any https://mariadb.com/kb/ pages that need changing.
How can this PR be tested?
TODO: modify the automated test suite to verify that the PR causes MariaDB to behave as intended.
Consult the documentation on "Writing good test cases".
If the changes are not amenable to automated testing, please explain why not and carefully describe how to test manually.
Basing the PR against the correct MariaDB version
main
branch.PR quality check