Skip to content

Commit 8b95d25

Browse files
committed
HA: Abort Transaction Commit after Timeout
A strange HA behavior was reported in #787, resulting in both instances being active. The logs contained an entry of the previous active instance exiting the HA.realize() method successfully after 1m9s. This, however, should not be possible as the method's context is deadlined to a minute after the heartbeat was received. However, as it turns out, executing COMMIT on a database transaction is not bound to the transaction's context, allowing to survive longer. To mitigate this, another context watch was introduced. Doing so allows directly handing over, while the other instance can now take over due to the expired heartbeat in the database.
1 parent c2d8bd6 commit 8b95d25

File tree

1 file changed

+36
-2
lines changed

1 file changed

+36
-2
lines changed

pkg/icingadb/ha.go

Lines changed: 36 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -267,6 +267,9 @@ func (h *HA) controller() {
267267

268268
// realize a HA cycle triggered by a heartbeat event.
269269
//
270+
// The context passed is expected to have a deadline, otherwise the method will panic. This deadline is strictly
271+
// enforced to abort the realization logic the moment the context expires.
272+
//
270273
// shouldLogRoutineEvents indicates if recurrent events should be logged.
271274
//
272275
// The internal, retryable function always fetches the last received heartbeat's timestamp instead of reusing the one
@@ -388,8 +391,39 @@ func (h *HA) realize(
388391
}
389392
}
390393

391-
if err := tx.Commit(); err != nil {
392-
return errors.Wrap(err, "can't commit transaction")
394+
// In general, cancellation does not work for COMMIT and ROLLBACK. Some database drivers may support a
395+
// context-based abort, but only if the DBMS allows it. This was also discussed in the initial issue about
396+
// context support to Go's sql package: https://github.com/golang/go/issues/15123#issuecomment-245882486
397+
//
398+
// This paragraph is implementation knowledge, not covered by the API specification. Go's sql.Tx.Commit() -
399+
// which is not being overridden by sqlx.Tx - performs a preflight check on the context before handing over
400+
// to the driver's Commit() method. Drivers may behave differently. For example, the used
401+
// github.com/go-sql-driver/mysql package calls its internal exec() method with a COMMIT query, writing and
402+
// reading packets without honoring the context.
403+
//
404+
// In a nutshell, one cannot expect a Tx.Commit() call to be covered by the transaction context. For this
405+
// reason, the following Commit() call has been moved to its own goroutine, which communicates back via a
406+
// channel selected along with the context. If the context ends before Commit(), this retryable function
407+
// returns with a non-retryable error.
408+
//
409+
// However, while the COMMIT continues in the background, it may still succeed. In this case, the state of
410+
// the database does not match the state of Icinga DB, specifically the database says that this instance is
411+
// active while this instance thinks otherwise. Fortunately, this mismatch is not critical because when this
412+
// function is re-entered, the initial SELECT query would be empty for this Icinga DB node and imply the
413+
// presence of another active instance for the other node. Effectively, this could result in a single HA
414+
// cycle with no active node. Afterwards, either this instance takes over due to the false impression that
415+
// no other node is active, or the other instances does so as the inserted heartbeat has already expired.
416+
// Not great, not terrible.
417+
commitErrCh := make(chan error, 1)
418+
go func() { commitErrCh <- tx.Commit() }()
419+
420+
select {
421+
case err := <-commitErrCh:
422+
if err != nil {
423+
return errors.Wrap(err, "can't commit transaction")
424+
}
425+
case <-ctx.Done():
426+
return ctx.Err()
393427
}
394428

395429
return nil

0 commit comments

Comments
 (0)