Closed
Conversation
Allow the setting of a pg_rewind interval and timeout so that the retry behaviour of pg_rewind can be customised. The default values, if these are not set in the cluster spec, are 0s, which results in the current default behaviour of only trying pg_rewind once if usePgrewind is set to true.
Force-pushed from 679da0c to fc51965
mos3abof reviewed Aug 20, 2019
Force-pushed from fc51965 to 94f09db
dyson commented Aug 20, 2019
intervalTick := time.Tick(interval + jitter)

Rewind:
	for true {
Author
Using for true instead of just for seems to be the convention in the codebase.
dyson commented Aug 20, 2019
db.Spec.RequestTimeout = *clusterSpec.RequestTimeout
db.Spec.MaxStandbys = *clusterSpec.MaxStandbys
db.Spec.UsePgrewind = *clusterSpec.UsePgrewind
db.Spec.PgrewindInterval = *clusterSpec.PgrewindInterval
Author
The case of these new config options is a little sad but just following the convention used elsewhere.
Force-pushed from 94f09db to bb68edb
GC has implemented cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. The old master tries to resync against the new sync while it is rebooting, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
Force-pushed from bb68edb to 2ecac75
Make the forcing of a checkpoint on the primary before performing a pg_rewind a cluster spec option so that it can be turned on or off.
Add pgrewindInterval, pgrewindTimeout, and pgrewindCheckpoint to the cluster spec documentation.
Force-pushed from 2ecac75 to 821e93c
| usePgrewind | try to use pg_rewind for faster instance resynchronization | no | bool | false |
| pgrewindInterval | interval to wait before the next pg_rewind attempt | no | string (duration) | 0s |
| pgrewindTimeout | time after which we stop trying to pg_rewind | no | string (duration) | 0s |
| pgrewindCheckpoint | force a checkpoint on the primary before performing pg_rewind | no | bool | false |
This is named differently to the PR to add this configuration parameter: sorintlab#644
I think maybe we should leave the removal of this hardcoding out of this PR?
lawrencejones
added a commit
that referenced
this pull request
Aug 27, 2019
1. #14 Retry pg_rewind for up to 5m after the first attempt. This allows us to handle the followed database booting for a moment before falling back to the more expensive pg_basebackup. This comes as a crappier alternative to [1], where we were creating proper configuration values. We want to try this before we attempt to create a more mature setup, as we intend to leave that work for whenever we upstream our cascading replication.
lawrencejones
added a commit
that referenced
this pull request
Aug 27, 2019
1. #14 Retry pg_rewind for up to 5m after the first attempt. This allows us to handle the followed database booting for a moment before falling back to the more expensive pg_basebackup. This comes as a crappier alternative to [1], where we were creating proper configuration values. We want to try this before we attempt to create a more mature setup, as we intend to leave that work for whenever we upstream our cascading replication.

We've decided to hardcode the retry for the moment in this change: #16 We intend to come back to this PR when we upstream the cascading replication changes, which will force some type of solution for the rewind failure handling.
VSpike
pushed a commit
that referenced
this pull request
Mar 21, 2022
We want to implement cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. This is important for our backup project and will also be useful for other use cases in the future.
With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. The old master tries to resync against the new sync while it is rebooting, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
When implementing the pg_rewind retry functionality there were a number of considerations and design decisions:
We think exponential backoff is not an ideal strategy for scheduling the retries because we have a reasonable expectation that the first tries may fail. With exponential backoff we would retry many times at the start (when we may be failing) and fewer times towards the end of the timeout period, when we expect to succeed. Instead we set a pg_rewind interval and timeout so that we can distribute the retries evenly across the timeout period. We also introduce jitter in case many clients are trying at the same time. This could be revisited later if our use case changes and many nodes trying to pg_rewind from a single node cause performance issues; even then, we hope that appropriate configuration of the interval and timeout (and perhaps the jitter) would be sufficient.
During the retry period (especially if it is long) the host we need to pg_rewind from could change from underneath us, and all subsequent pg_rewind retries would never succeed. We re-read the host to sync from before each retry to prevent this.
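A sketch of that detail (both callbacks are hypothetical stand-ins for the keeper's logic): the host to rewind from is resolved again before every attempt, so a failover during the retry window is picked up.

```go
package main

import "fmt"

// rewindWithFreshHost illustrates the point above: rather than
// capturing the followed host once before the loop, we call
// resolveHost before every attempt so that a change of primary
// during the retry period is noticed on the next try.
func rewindWithFreshHost(maxAttempts int, resolveHost func() string, rewind func(host string) bool) bool {
	for i := 0; i < maxAttempts; i++ {
		host := resolveHost() // re-read the host to sync from on each try
		if rewind(host) {
			return true
		}
	}
	return false
}

func main() {
	// Simulate the followed host changing mid-retry.
	hosts := []string{"old-sync", "old-sync", "new-primary"}
	call := 0
	resolve := func() string { h := hosts[call]; call++; return h }
	// Rewinding only succeeds against the new primary.
	ok := rewindWithFreshHost(5, resolve, func(h string) bool { return h == "new-primary" })
	fmt.Println(ok, call) // true 3
}
```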
During the retry period an operator (or some other process) may cause the keeper to restart. We could keep track of where in the timeout period the retries are, but this would add complexity without any compelling reason that we could think of. If the keeper is restarted, the retry logic starts again from scratch.
There is the possibility that a keeper may not be able to rewind for a known reason. In this case, an operator might want to force a pg_basebackup. With the current implementation this would involve patching the cluster spec so that the pg_rewind timeout or interval is set to 0s. The keeper would then need to be restarted. Once the pg_basebackup has started, the cluster spec can be patched back. This has the disadvantage that if any other keepers restart during this period, they are likely to also do a pg_basebackup. For the moment we think this won't come up often enough to cause much risk, but it is something to bear in mind.
Jira issue: https://gocardless.atlassian.net/browse/CI-54