Closed
Conversation
Allow the setting of a pg_rewind interval and timeout so that the retry behaviour of pg_rewind can be customised. The default values, if these are not set in the cluster spec, are 0s, which results in the current default behaviour of only trying pg_rewind once if usePgrewind is set to true.
Force-pushed from 679da0c to fc51965
mos3abof reviewed Aug 20, 2019
Force-pushed from fc51965 to 94f09db
dyson commented Aug 20, 2019
intervalTick := time.Tick(interval + jitter)

Rewind:
	for true {
Author
Using for true instead of just for seems to be the convention in the codebase.
dyson commented Aug 20, 2019
db.Spec.RequestTimeout = *clusterSpec.RequestTimeout
db.Spec.MaxStandbys = *clusterSpec.MaxStandbys
db.Spec.UsePgrewind = *clusterSpec.UsePgrewind
db.Spec.PgrewindInterval = *clusterSpec.PgrewindInterval
Author
The case of these new config options is a little sad but just following the convention used elsewhere.
Force-pushed from 94f09db to bb68edb
GC has implemented cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. The old master tries to resync against the new sync while it is rebooting, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
Force-pushed from bb68edb to 2ecac75
Make the forcing of a checkpoint on the primary before performing a pg_rewind a cluster spec option so that it can be turned on or off.
Add pgrewindInterval, pgrewindTimeout, and pgrewindCheckpoint to the cluster spec documentation.
Force-pushed from 2ecac75 to 821e93c
| usePgrewind | try to use pg_rewind for faster instance resynchronization | no | bool | false |
| pgrewindInterval | interval to wait before the next pg_rewind attempt | no | string (duration) | 0s |
| pgrewindTimeout | time after which we stop trying to pg_rewind | no | string (duration) | 0s |
| pgrewindCheckpoint | force a checkpoint on the primary before performing pg_rewind | no | bool | false |
This is named differently to the PR to add this configuration parameter: sorintlab#644
I think maybe we should leave the removal of this hardcoding out of this PR?
lawrencejones
added a commit
that referenced
this pull request
Aug 27, 2019
1. #14 Retry pg_rewind for up to 5m after the first attempt. This allows us to handle the followed database booting for a moment before falling back to the more expensive pg_basebackup. This comes as a crappier alternative to [1], where we were creating proper configuration values. We want to try this before we attempt to create a more mature setup, as we intend to leave that work for whenever we upstream our cascading replication.
lawrencejones
added a commit
that referenced
this pull request
Aug 27, 2019
1. #14 Retry pg_rewind for up to 5m after the first attempt. This allows us to handle the followed database booting for a moment before falling back to the more expensive pg_basebackup. This comes as a crappier alternative to [1], where we were creating proper configuration values. We want to try this before we attempt to create a more mature setup, as we intend to leave that work for whenever we upstream our cascading replication.

We've decided to hardcode the retry for the moment in this change: #16 We intend to come back to this PR when we upstream the cascading replication changes, which will force some type of solution for the rewind failure handling.
VSpike
pushed a commit
that referenced
this pull request
Mar 21, 2022
We want to implement cascading replication, which requires retrying pg_rewind to avoid expensive pg_basebackups. This is important for our backup project and will also be useful for other use cases in the future.
With cascading replication, the old sync is rebooted to repoint its recovery at the new primary. The old master tries to resync against the new sync while it is rebooting, which causes the resync to fail. This falls back to pg_basebackup, meaning the old master doesn't recover for several hours.
When implementing the pg_rewind retry functionality there were a number of considerations and design decisions:
We think exponential backoff is not an ideal strategy for scheduling the retries because we have a reasonable expectation that the first tries may fail. With exponential backoff we would retry many times at the start (when we may be failing) and fewer times towards the end of the timeout period, when we expect to succeed. Instead we set a pg_rewind interval and timeout so that we can distribute the retries evenly across the timeout period. We also introduce jitter in case many clients are trying at the same time. This could be revisited later if our use case changes and many nodes trying to pg_rewind from a single node cause performance issues; even then, we hope that appropriate configuration of the interval and timeout (and perhaps the jitter) would be sufficient.
During the retry period (especially if it is long) the host we need to pg_rewind from could change from underneath us, and all subsequent pg_rewind retries would never succeed. We re-read the host to sync from before each retry to prevent this.
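A sketch of that detail (both callbacks are hypothetical stand-ins for the keeper's logic): the host to rewind from is resolved again before every attempt, so a failover during the retry window is picked up.

```go
package main

import "fmt"

// rewindWithFreshHost illustrates the point above: rather than
// capturing the followed host once before the loop, we call
// resolveHost before every attempt so that a change of primary
// during the retry period is noticed on the next try.
func rewindWithFreshHost(maxAttempts int, resolveHost func() string, rewind func(host string) bool) bool {
	for i := 0; i < maxAttempts; i++ {
		host := resolveHost() // re-read the host to sync from on each try
		if rewind(host) {
			return true
		}
	}
	return false
}

func main() {
	// Simulate the followed host changing mid-retry.
	hosts := []string{"old-sync", "old-sync", "new-primary"}
	call := 0
	resolve := func() string { h := hosts[call]; call++; return h }
	// Rewinding only succeeds against the new primary.
	ok := rewindWithFreshHost(5, resolve, func(h string) bool { return h == "new-primary" })
	fmt.Println(ok, call) // true 3
}
```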
During the retry period an operator (or some other process) may cause the keeper to restart. We could keep track of where in the timeout period the retries are, but this would add complexity without any compelling reason that we could think of. If the keeper is restarted, the retry logic starts again from scratch.
There is the possibility that a keeper may not be able to rewind for a known reason. In this case, an operator might want to force a pg_basebackup. With the current implementation this would involve patching the cluster spec so that the pg_rewind timeout or interval is set to 0s. The keeper would then need to be restarted. Once the pg_basebackup has started, the cluster spec can be patched back. This has the disadvantage that if any other keepers restart during this period, they are likely to also do a pg_basebackup. For the moment we think this won't come up often enough to cause much risk, but it is something to bear in mind.
Jira issue: https://gocardless.atlassian.net/browse/CI-54