
Missing tests - recovery tests #1766

Open
steve-chavez opened this issue Mar 3, 2021 · 17 comments
Labels: ci (Related to CI setup), hygiene (cleanup or refactoring)

Comments

@steve-chavez
Member

steve-chavez commented Mar 3, 2021

Currently recovery tests are done manually, it'd be great to have them as automated tests.

These are the main scenarios:

(the connection recovery worker is referred to as just "worker")

1. postgrest started with a pg connection, then pg becomes unavailable

  • worker starts only after a request to postgrest; postgrest should respond with a 503 {"details":"no connection to the server\n","message":"Database client error. Retrying the connection."}
    • if db-channel-enabled=true, worker starts immediately, not necessary to prove this in tests though.
  • pg becomes available, postgrest succeeds reconnecting, reloads the schema cache and responds with 200
  • if db-load-guc-config=true, it should also re-read the in-db config.
    • test with an ALTER ROLE postgrest_test_authenticator SET pgrst.db_schemas = 'public'; and try a GET /public_consumers which should give a 404 if the in-db config isn't re-read.

2. unavailable pg, postgrest started

  • worker starts immediately, postgrest should respond with a 503 {"message":"Database connection lost. Retrying the connection."}
    • Bug: if db-channel-enabled=true, postgrest doesn't reply and curl gives Connection refused. This must be because of the mvarConnectionStatus MVar; it doesn't happen on scenario 1, though.
  • pg becomes available, postgrest succeeds reconnecting, reloads the schema cache and responds with 200
  • if db-load-guc-config=true, it should also re-read the in-db config.
    • Same test as 1

3. SIGUSR1 - NOTIFY reload schema

  • when these are done, no running requests using pg connections must be interrupted
  • when postgrest has a pg connection, both SIGUSR1 and NOTIFY will reload the schema cache
    • if db-load-guc-config=true, it should also re-read the in-db config.
    • ensure SIGUSR1 starts the worker when db-channel-enabled=true (it locked up before and the worker was not starting, so this must be ensured)
  • when postgrest loses the connection, and db-channel-enabled=false(only SIGUSR1)
    • SIGUSR1 starts the worker, and only one can run at a time. This is ensured by refIsWorkerOn and can be confirmed by sending several SIGUSR1 signals and noting only one Attempting to reconnect to the database in 1 seconds... message. If refIsWorkerOn is removed, there will be several Attempting to reconnect to the database in 1 seconds... messages.
      • Not sure how to test this, maybe count the number of threads?
    • pg becomes available, postgrest succeeds reconnecting, reloads the schema cache and responds with 200
  • when postgrest loses the connection, and db-channel-enabled=true
    • ensure the listener recovers, e.g. doing a NOTIFY 'reload cache/load config' should work after recovery.
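The expected request/response behavior in these scenarios could be pinned down with a tiny toy model (purely illustrative; FakePostgrest and its methods are hypothetical names, the messages mirror the bullets above):

```python
class FakePostgrest:
    """Toy model of the recovery behavior described above, NOT the real server."""

    def __init__(self, pg_up):
        self.pg_up = pg_up
        self.schema_cache_loaded = pg_up

    def request(self):
        # While the connection is down, requests get a 503 with a retry message.
        if not self.pg_up:
            return 503, {"message": "Database connection lost. Retrying the connection."}
        return 200, None

    def pg_becomes_available(self):
        # postgrest succeeds reconnecting and reloads the schema cache.
        self.pg_up = True
        self.schema_cache_loaded = True


srv = FakePostgrest(pg_up=False)      # scenario 2: unavailable pg, postgrest started
assert srv.request()[0] == 503        # worker retrying, client sees 503
srv.pg_becomes_available()            # pg comes back
assert srv.request() == (200, None)   # reconnected, schema cache reloaded
```

The real io tests would replace the fake with an actual postgrest process plus a way to break the connection, but the assertions would follow this shape.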
@steve-chavez
Member Author

For the admin port health check (#2092), we'd also need to test:

  • Starting with a down connection, /health must reply with 503
  • Starting with a healthy connection, /health must reply with 200; then a disconnection happens and /health must reply with 503; once the connection is up again, /health must reply with 200 again.

With db-channel-enabled as True and False for both cases.
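That matrix can be written down as a small helper mapping connection states to the status codes /health must return (an illustrative sketch of the expectations, not a test against the actual admin port):

```python
def expected_health_codes(connection_states):
    """Map a sequence of connection states (True = up) to required /health codes."""
    return [200 if up else 503 for up in connection_states]

# Healthy start, then a disconnection, then recovery:
assert expected_health_codes([True, False, True]) == [200, 503, 200]
# Starting with a down connection:
assert expected_health_codes([False]) == [503]
```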

@wolfgangwalther
Member

Dropping an idea for my own future reference here:

To allow breaking / unbreaking the pgrst <-> pg connection, we can create an individual symlink to the pg socket for each test-case - and then rename that accordingly. If renamed to something else this will break the connection. This will allow us to keep the PG server up - should be a lot faster than starting and stopping all the time. And it will not prevent us from running the io tests in parallel down the road.
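A sketch of that symlink trick in Python (assuming the test fixture knows the real socket path; the helper names are made up):

```python
import os

def link_socket(real_socket, link_path):
    """Point a per-test-case symlink at the real pg socket."""
    os.symlink(real_socket, link_path)

def break_connection(link_path):
    """Renaming the symlink makes new connections fail while pg itself stays up."""
    os.rename(link_path, link_path + ".broken")

def unbreak_connection(link_path):
    """Rename it back to allow connections again."""
    os.rename(link_path + ".broken", link_path)
```

Since each test case owns its own link, renaming one doesn't affect other tests, which is what keeps parallel io tests possible.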

@steve-chavez
Member Author

At startup, when the schema cache load fails (e.g. because of a statement_timeout=1 or REVOKEd privileges on pg_catalog), we try to reload it in an endless loop.

This is done here:

SCOnRetry ->
work

It's not simple to remove the autoreload of the schema cache because the connection can be lost in the connection worker (due to a DNS error, a "too many clients" error from pg, etc.).

If we remove the autoreload we might have a regression like #1685.

Somehow joining the schema cache loading process with the connection worker and retrying it with exponential backoff (instead of an endless loop) could be a solution.
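Exponential backoff here would just mean doubling a capped retry delay instead of looping at a fixed interval; a minimal sketch (the base/cap values are made up):

```python
def backoff_delays(base=1, cap=32, attempts=6):
    """Delays (in seconds) between schema cache reload retries: 1, 2, 4, ... capped."""
    return [min(cap, base * 2 ** n) for n in range(attempts)]
```

For example, `backoff_delays()` yields `[1, 2, 4, 8, 16, 32]`; once the cap is reached the worker keeps retrying at the cap rather than hammering pg every second.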

@steve-chavez
Member Author

Optional: A recovery test for EMFILE could be added as well: #2158 (comment)

steve-chavez changed the title from "Add recovery tests" to "Add recovery tests and pool protection tests" on Jul 11, 2022
steve-chavez pinned this issue on Jul 11, 2022
@steve-chavez
Member Author

Metrics(#2129) is required to test #1094 and #2364 (pool protection).

Related PostgREST/postgrest-docs#557


steve-chavez changed the title from "Add recovery tests and pool protection tests" to "Add recovery tests" on Aug 4, 2022
steve-chavez changed the title from "Add recovery tests" to "Missing tests - recovery tests - postgres 15" on Aug 16, 2022

steve-chavez changed the title from "Missing tests - recovery tests - postgres 15" to "Missing tests - recovery tests" on Feb 1, 2023
@steve-chavez
Member Author

We also need tests for when we abort the recovery procedure, namely checkIsFatal:

checkIsFatal :: SQL.UsageError -> Maybe Text
checkIsFatal (SQL.ConnectionUsageError e)
  | isAuthFailureMessage = Just $ toS failureMessage
  | otherwise = Nothing
  where
    isAuthFailureMessage =
      ("FATAL: password authentication failed" `isInfixOf` failureMessage) ||
      ("no password supplied" `isInfixOf` failureMessage)
    failureMessage = BS.unpack $ fromMaybe mempty e
checkIsFatal (SQL.SessionUsageError (SQL.QueryError _ _ (SQL.ResultError serverError))) =
  case serverError of
    -- Check for a syntax error (42601 is the pg code). This would mean the error is on our part somehow, so we treat it as fatal.
    SQL.ServerError "42601" _ _ _ _
      -> Just "Hint: This is probably a bug in PostgREST, please report it at https://github.com/PostgREST/postgrest/issues"
    -- Check for a "prepared statement <name> already exists" error (Code 42P05: duplicate_prepared_statement).
    -- This would mean that a connection pooler in transaction mode is being used
    -- while prepared statements are enabled in the PostgREST configuration,
    -- both of which are incompatible with each other.
    SQL.ServerError "42P05" _ _ _ _
      -> Just "Hint: If you are using connection poolers in transaction mode, try setting db-prepared-statements to false."
    -- Check for a "transaction blocks not allowed in statement pooling mode" error (Code 08P01: protocol_violation).
    -- This would mean that a connection pooler in statement mode is being used which is not supported in PostgREST.
    SQL.ServerError "08P01" "transaction blocks not allowed in statement pooling mode" _ _ _
      -> Just "Hint: Connection poolers in statement mode are not supported."
    _ -> Nothing
checkIsFatal _ = Nothing

@steve-chavez
Member Author

Pending refactor

The recovery logic can now live entirely inside AppState; this would make it more understandable/manageable.

So right now we pass connectionWorker to the main app:

postgrest :: AppConfig -> AppState.AppState -> IO () -> Wai.Application
postgrest conf appState connWorker =

And then we activate connectionWorker based on a 503 response (Response.isServiceUnavailable checks this):

when (Response.isServiceUnavailable response) connWorker
resp <- do
  delay <- AppState.getRetryNextIn appState
  return $ Response.addRetryHint delay response
respond resp

Instead, this logic could be inside usePool (without the 503 indirection):

usePool :: AppState -> SQL.Session a -> IO (Either SQL.UsageError a)
usePool AppState{..} x = do
  res <- SQL.use statePool x
  whenLeft res (\case
    SQL.AcquisitionTimeoutUsageError -> debounceLogAcquisitionTimeout -- this can happen rapidly for many requests, so we debounce
    _ -> pure ())
  return res

We'd just have to catch SQL exceptions that we map to 503 there. Like:

pgErrorStatus _ (SQL.ConnectionUsageError _) = HTTP.status503

I hesitate to refactor this right now, while we don't have tests.

@steve-chavez
Member Author

steve-chavez commented Jun 7, 2023

Looking at the libpq haskell lib, it has a reset function that says:

This might be useful for error recovery if a working connection is lost.

That makes me wonder if we could have the connection recovery off-core. Say in a hasql-pool-retry library.

The ideal interface for us would expose a useRetry wrapper like:

useRetry :: Pool -> IO () -> IO () -> Session a -> IO (Either WrappedUsageError a)
useRetry pool retryAction successAction sess = Hasql.use pool -- ...

On retryAction we could log the failed retries like we do now. hasql-pool-retry would internally do a reset here (or even a simple SELECT 1 could be good for starters).

On successAction we could reload our schema cache + config like we do now. This means that hasql-pool-retry acquired the connection.

WrappedUsageError is a wrapper over UsageError, which would contain an additional InRecoveryError x, where x is the time until the next retry. With this we could inform clients that we're retrying the connection like we do now.

This seems generally useful outside of PostgREST to me. @robx WDYT?


The checkIsFatal we do could also be represented by a FailedRecoveryError; we could use it for killing the main thread as we do now.


Having a wait time (which we could equate to db-pool-acquisition-timeout) for getting a connection after a loss would be great too. This way our requests would be resilient to a pg_terminate_backend (which is needed for read replicas, #2781).

This would be very interesting because it's somewhat similar to pgbouncer pause/resume. Later on it could be used as a way to scale to multiple databases (#2798).

steve-chavez pinned this issue on Jun 8, 2023
@robx
Contributor

robx commented Jun 20, 2023

Sorry, I have been offline for a bit and missed this. Catching up these days.

Regarding the concrete question about reset (PQreset in libpq):

I don't think that's going to be particularly useful for us. All it does is close the underlying connection and open a new one. This would only save allocating a new connection object (and the pool management overhead); but that should be insignificant compared to actually establishing a new connection to the postgres server.


I think I like the idea of a generic useRetry, although at the time of writing I'm a bit fuzzy on how the retrying would work. Are there errors where it makes sense for PostgREST to retry internally (with some backoff strategy), and others where we defer this to the client? I'm not really clear on what useRetry would do, particularly when successAction is run. I'm imagining something roughly like:

useRetry pool retryState retryAction session = do
  res <- use pool session
  case checkRetriable retryState res of
    Retry retryState' -> do
      retryAction retryState'
      backoffSleep retryState'
      useRetry pool retryState' retryAction session
    RanOutOfRetries ->
      return $ RetryFailed res -- maybe augmented with some retry details
    NonRetriableError ->
      return $ Error res
    Success -> do
      successAction
      return $ Success res

But then why not leave running successAction to the caller?

@steve-chavez
Member Author

Are there errors where it makes sense for PostgREST to retry internally (with some backoff strategy), and others where we defer this to the client?

Yes, for example when the password changes upstream (retrying is no use) - then the user would have to edit the database connection string anyway. We also have some extra conditions on checkIsFatal for stopping retrying. Hm, maybe useRetry could also accept a list of ResultError for knowing when to stop?

But then why not leave running successAction to the caller?

Hm, yeah. I think that could work too. The interface was just an idea.


In case it helps, I've documented the recovery process here.

@steve-chavez
Member Author

steve-chavez commented Jun 25, 2023

@robx Thinking more about it, we can have a much simpler interface. Just:

useRetry :: Pool -> Session a -> IO (Either WrappedUsageError a)
useRetry pool sess = Hasql.use pool -- ...

On retryAction we could log the failed retries like we do now.

No retryAction. We don't really need to do this; it would be enough to log to stderr like we do for the acquisition timeout:

usePool :: AppState -> SQL.Session a -> IO (Either SQL.UsageError a)
usePool AppState{..} x = do
  res <- SQL.use statePool x
  whenLeft res (\case
    SQL.AcquisitionTimeoutUsageError -> debounceLogAcquisitionTimeout -- this can happen rapidly for many requests, so we debounce
    _ -> pure ())
  return res

The resulting WrappedUsageError should be enough for this.

On successAction we could reload our schema cache + config like we do now. This means that hasql-pool-retry acquired the connection.

No successAction. We can change the logic of the current "connection worker" to do the schema cache reloading by using useRetry.

Hm, maybe useRetry could also accept a list of ResultError for knowing when to stop?

That would also be unnecessary since we can also act on the WrappedUsageError when it happens.

So really the main goal is to have useRetry wait like we do for the acquisition timeout now, with the difference that it would wait for the db to be reachable. It might even fit in hasql-pool itself (but no problem if we do it in hasql-pool-retry).
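In other words, useRetry would behave like use plus a bounded wait for the db to come back. A Python sketch of that shape (ConnectionError stands in for the connection-level UsageError, and the deadline plays the role of db-pool-acquisition-timeout; none of this is the real hasql API):

```python
import time

def use_retry(session, deadline, base_delay=0.01):
    """Run `session`, retrying with doubled delays until `deadline` seconds pass."""
    start = time.monotonic()
    delay = base_delay
    while True:
        try:
            return session()
        except ConnectionError:
            # Out of time: surface the error, like a wrapped UsageError would.
            if time.monotonic() - start + delay > deadline:
                raise
            time.sleep(delay)
            delay *= 2
```

A request arriving during a brief pg_terminate_backend would then just block a little instead of failing outright.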

@steve-chavez
Member Author

steve-chavez commented Jun 25, 2023

Also, I was thinking this timeout should be equal to the acquisition timeout, but maybe it can be another configurable timeout. I see HikariCP has an initializationFailTimeout, which is similar to what we want to do here.

@steve-chavez
Member Author

The simplest initial test case, I think, would be having useRetry be resilient to a pg_terminate_backend (#2781 (comment)). After initializing the pool, use just fails; useRetry would wait and succeed.

Then we would cover other cases like a socket error as Wolfgang mentioned above.

@steve-chavez
Member Author

useRetry :: Pool -> Session a -> IO (Either WrappedUsageError a)
useRetry pool sess = Hasql.use pool -- ...
No successAction. We can change the logic of the current "connection worker" to do the schema cache reloading by using useRetry.

Hm, I forgot about one thing. Say we lose the connection and at that time the user runs migrations on the db; event trigger notifications won't fire for us. useRetry then recovers the connection and we can serve requests again, but our schema cache is stale. This is why it's important to know when the pool reestablished a connection.

So maybe:

useRetry :: Pool -> Session a -> IO (Either WrappedUsageError (a, Bool))

The Bool would indicate that the connection was recovered; with that we can reload the schema cache. Maybe that's preferable to a successAction.
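The shape of that return value, sketched in Python (again just illustrative, with ConnectionError standing in for the connection-level UsageError):

```python
def use_retry(session, max_retries=5):
    """Run `session`; return (result, recovered). recovered=True means at least
    one reconnection attempt happened, so the schema cache may be stale."""
    retries = 0
    while True:
        try:
            return session(), retries > 0
        except ConnectionError:
            retries += 1
            if retries > max_retries:
                raise
```

The caller reloads the schema cache whenever recovered is True, which covers migrations that ran while the connection was down.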

@steve-chavez
Member Author

steve-chavez commented May 7, 2024

To allow breaking / unbreaking the pgrst <-> pg connection, we can create an individual symlink to the pg socket for each test-case - and then rename that accordingly.

#1766 (comment)

Related to the above, I just tried moving the socket file:

/run/user/1000/postgrest/postgrest-with-postgresql-16-FRk/socket$ mv .s.PGSQL.5432 ..
/run/user/1000/postgrest/postgrest-with-postgresql-16-FRk/socket$ mv ../.s.PGSQL.5432 .

And it does not break the connection if it's already established; doing curl localhost:3000/items keeps working. But if the pool max idle time/lifetime is reached, then the creation of new pool connections will fail.

The listener doesn't fail either.


So it looks like if we want to add io tests for this, we also need to wait for the pool lifetime (which looks prone to CI errors) or else find another way to immediately break connections.
