
IO tests depend too heavily on timing / sleeping #3424

Open
wolfgangwalther opened this issue Apr 20, 2024 · 32 comments
Labels
hygiene cleanup or refactoring

Comments

@wolfgangwalther
Member

I just ran the IO tests on a heavily overloaded system.. and had 32 tests failing, because of some timings related issues. Those pass just fine without load. This is annoying, because I can't continue working on PostgREST while I have other heavy tasks (rebuilding the world for nixpkgs...) running. This is also potentially an issue on much slower systems.

@wolfgangwalther wolfgangwalther added the hygiene cleanup or refactoring label Apr 20, 2024
@steve-chavez
Member

Maybe we can separate those tests (test_io_timing.py), so they're not always run. Like what we do with postgrest-test-memory.
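A minimal sketch of what that separation could look like, assuming a pytest marker; the "timing" marker name and the placeholder test are made up for illustration:

# Hypothetical sketch: register a "timing" marker (e.g. in pytest.ini or
# conftest.py) so timing-sensitive tests can be deselected on loaded machines:
#
#   [pytest]
#   markers =
#       timing: tests that rely on sleeps or wall-clock behaviour
import pytest


@pytest.mark.timing
def test_something_timing_sensitive():
    "Placeholder for a test that depends on sleeps/timeouts."
    assert True

# Skip the timing tests while working on a loaded machine:
#   pytest test/io -m "not timing"
# Run only the timing tests, e.g. in a dedicated job:
#   pytest test/io -m timing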

@wolfgangwalther
Member Author

This seems like a related issue in CI: https://github.com/PostgREST/postgrest/actions/runs/8788477073/job/24116006025?pr=3427

We definitely want to run them in CI, though, so separating them wouldn't help, I guess.

@steve-chavez
Member

steve-chavez commented Apr 26, 2024

Hm, perhaps some of the pool timeout errors could be solved by using the new /metrics endpoint instead of checking the logs.

Specifically https://postgrest.org/en/latest/references/observability.html#pgrst-db-pool-timeouts-total
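A hedged sketch of how a test could wait on that counter instead of parsing logs; the helper names and the admin base URL are assumptions, and the metric name follows the docs anchor linked above:

import re
import time

import requests


def pool_timeouts_total(admin_url):
    "Read pgrst_db_pool_timeouts_total from the Prometheus text output of /metrics."
    body = requests.get(f"{admin_url}/metrics").text
    match = re.search(r"^pgrst_db_pool_timeouts_total\s+([0-9.]+)$", body, re.M)
    return float(match.group(1)) if match else 0.0


def wait_for_pool_timeout(admin_url, baseline, timeout=5):
    "Wait until the counter increases past the baseline, instead of sleeping blindly."
    deadline = time.time() + timeout
    while time.time() < deadline:
        if pool_timeouts_total(admin_url) > baseline:
            return
        time.sleep(0.1)
    raise AssertionError("no pool timeout was reported via /metrics")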

@steve-chavez
Member

Additionally, I think the acquisition timeout can be less than a second (https://hackage.haskell.org/package/hasql-pool-1.1/docs/Hasql-Pool-Config.html#v:acquisitionTimeout), but I'm not sure if our config allows that right now.

@steve-chavez
Member

I've also noticed that sometimes the io tests get stuck: https://github.com/PostgREST/postgrest/actions/runs/8857677646/job/24325423347?pr=3383

I had that happen just once locally after #3229. That should be solved by using metrics, I think.

@steve-chavez
Member

steve-chavez commented Apr 30, 2024

The io tests got stuck again: https://github.com/PostgREST/postgrest/actions/runs/8857677646/job/24325423347?pr=3383

We do some waiting when reading the logs:

def read_stdout(self, nlines=1):
    "Wait for line(s) on standard output."
    output = []
    for _ in range(10):
        l = self.process.stdout.readline()
        if l:
            output.append(l.decode())
            if len(output) >= nlines:
                break
        time.sleep(0.1)
    return output

But it's capped at 1 second... it's odd that it still gets stuck.

Edit 1: https://stackoverflow.com/questions/24640079/process-stdout-readline-hangs-how-to-use-it-properly

Edit 2: Seems flushing worked #3465
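For reference, the usual workaround from that Stack Overflow thread is to drain stdout in a background thread and read it through a queue with a real timeout; a hedged sketch, not the actual helper in the test suite:

import queue
import threading
import time


class LogReader:
    "Pump a subprocess stdout into a queue so readline() can never block the test."

    def __init__(self, stream):
        self.lines = queue.Queue()
        threading.Thread(target=self._pump, args=(stream,), daemon=True).start()

    def _pump(self, stream):
        # readline() may still block here, but only this helper thread is affected
        for raw in iter(stream.readline, b""):
            self.lines.put(raw.decode())

    def read(self, nlines=1, timeout=1.0):
        "Return up to nlines, waiting at most `timeout` seconds in total."
        out, deadline = [], time.monotonic() + timeout
        while len(out) < nlines:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                break
            try:
                out.append(self.lines.get(timeout=remaining))
            except queue.Empty:
                break
        return out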


This looks to be the test that fails:

postgrest/test/io/test_io.py

Lines 1361 to 1389 in 7e5fd31

@pytest.mark.parametrize("level", ["crit", "error", "warn", "info", "debug"])
def test_db_error_logging_to_stderr(level, defaultenv, metapostgrest):
    "verify that DB errors are logged to stderr"
    role = "timeout_authenticator"
    set_statement_timeout(metapostgrest, role, 500)
    env = {
        **defaultenv,
        "PGUSER": role,
        "PGRST_DB_ANON_ROLE": role,
        "PGRST_LOG_LEVEL": level,
    }
    with run(env=env) as postgrest:
        response = postgrest.session.get("/rpc/sleep?seconds=1")
        assert response.status_code == 500
        # ensure the message appears on the logs
        output = sorted(postgrest.read_stdout(nlines=4))
        if level == "crit":
            assert len(output) == 0
        elif level in ["info", "debug"]:
            assert " 500 " in output[0]
            assert "canceling statement due to statement timeout" in output[3]
        else:
            assert " 500 " in output[0]
            assert "canceling statement due to statement timeout" in output[1]

@wolfgangwalther
Member Author

I looked into this a bit - it seems like the main challenge for IO tests is that it's impossible to determine whether PostgREST is currently in the process of reloading the config / schema cache - or has already finished doing so and failed. Or in other words: We only return either 200 or 503 from /ready, but nothing else.

I think it would be helpful if we'd return more fine-grained status codes:

  • 200 when ready
  • 503 Service Unavailable while PostgREST is in the process of doing something: During connection to the database, during config reload, during schema cache reload.
  • 500 Internal Server Error as soon as one of the above fails.

This should allow us to replace the following with calls to some variation of wait_until_ready:

def sleep_until_postgrest_scache_reload():
    "Sleep until schema cache reload"
    time.sleep(0.3)


def sleep_until_postgrest_config_reload():
    "Sleep until config reload"
    time.sleep(0.2)


def sleep_until_postgrest_full_reload():
    "Sleep until schema cache plus config reload"
    time.sleep(0.3)

It would also be better for observability, I guess?
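A hedged sketch of the wait_until_ready variant, assuming the fine-grained 200/503/500 behavior proposed above (which is not what /ready returns today); the helper name and admin base URL are illustrative:

import time

import requests


def wait_until_ready(admin_url, timeout=10):
    "Poll the admin /ready endpoint until it reports 200, failing fast on 500."
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            status = requests.get(f"{admin_url}/ready").status_code
        except requests.ConnectionError:
            status = None  # server not listening yet
        if status == 200:
            return
        if status == 500:
            raise AssertionError("PostgREST failed to load config / schema cache")
        time.sleep(0.05)
    raise TimeoutError("PostgREST did not become ready in time")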

@steve-chavez
Member

@wolfgangwalther Great idea! I agree.

@wolfgangwalther
Member Author

wolfgangwalther commented May 7, 2024

This is happening on almost every commit I merge to main now - they just get stuck entirely somewhere in the IO tests. It would be great if you could look at implementing my proposal above, @steve-chavez. I am not really familiar with those parts of the code.

@steve-chavez
Member

@wolfgangwalther Looking into this now. The admin API has another bug too (connection drop not detected), so I will fix that first.

@steve-chavez
Member

steve-chavez commented May 7, 2024

  • 200 when ready
  • 503 Service Unavailable while PostgREST is in the process of doing something: During connection to the database, during config reload, during schema cache reload.
  • 500 Internal Server Error as soon as one of the above fails.

Looking more closely at the above proposal, 500 will only happen when the API socket is unreachable. In all other cases we always retry and the status will be 503. This also implies that the /live endpoint will now have to reply with a 500 status to be consistent. Thus, this will be a breaking change.

This change turned out to be a bit more complex than I thought.

@steve-chavez
Member

steve-chavez commented May 7, 2024

during config reload

Not sure if we should conflate 503 with config reloading. During config reloading we're operational and serving requests normally, unlike schema cache reloading (which makes requests wait) and connection reloading (which makes requests fail).

Perhaps we can use another 2xx for config reloading?

@steve-chavez
Member

Thus, this will be a breaking change.

To avoid the breaking change, and also the problem of responding with 503 while we're operational: how about adding a custom health check endpoint? There we could have the more fine-grained behavior.

I'm thinking of a /loaded endpoint. Wouldn't that be better, since this endpoint would only be for development/testing purposes anyway?

@wolfgangwalther WDYT?

@wolfgangwalther
Member Author

For all other cases we always retry and the status will be 503.

Do we not have any kind of back-off there? Are we really doing an endless loop of immediately retrying again?

This also implies that the /live endpoint will now have to reply with a 500 status to be consistent. Thus, this will be a breaking change.

503 is considered a "temporary" problem, so I would argue that this is actually a bugfix for /live. Also, did we really promise "/live fails with 503 exactly", or is the promise our API gives rather like "/live gives you 4xx or 5xx if something is wrong"? I doubt that anyone is using the live endpoint with an exact match on the error code. So I don't think we need to treat this as a breaking change.

Not sure if we should conflate 503 with config reloading. During config reloading we're operational and serving requests normally, unlike schema cache reloading (which makes requests wait) and connection reloading (which makes requests fail).

Right, we should not try to put other stuff into the ready endpoint that is clearly not related.

To avoid the breaking change, and also the problem of responding with 503 while we're operational: how about adding a custom health check endpoint? There we could have the more fine-grained behavior.

I'm thinking of a /loaded endpoint. Wouldn't that be better, since this endpoint would only be for development/testing purposes anyway?

At this stage it might be easier to just add better log parsing to the IO tests, because all this information (config reload started, config reload finished, etc.) is available through the log, I think. This should still allow us to remove the quite random "wait for X seconds" cases I mentioned above.

TLDR: I think we don't need to change /ready and/or /live for the purpose of IO tests. If you still want to do the improvement regarding "schema cache reload in progress" - cool.
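A hedged sketch of that log-parsing idea, built on the existing read_stdout helper; the marker string and helper name are illustrative, not the real log format:

import time


def wait_for_log_line(postgrest, needle, timeout=5):
    "Read stdout until a line containing `needle` appears, instead of sleeping a fixed amount."
    deadline = time.time() + timeout
    while time.time() < deadline:
        for line in postgrest.read_stdout(nlines=10):
            if needle in line:
                return line
        time.sleep(0.1)
    raise TimeoutError(f"log line containing {needle!r} never appeared")

# e.g. sleep_until_postgrest_scache_reload() could become something like:
#   wait_for_log_line(postgrest, "Schema cache loaded")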

@steve-chavez
Member

Do we not have any kind of back-off there? Are we really doing an endless loop of immediately retrying again?

Yes, we do have a backoff.

503 is considered a "temporary" problem, so I would argue that this is actually a bugfix for /live.
...So I don't think we need to treat this as a breaking change.

Agree.

TLDR: I think we don't need to change /ready and/or /live for the purpose of IO tests. If you still want to do the improvement regarding "schema cache reload in progress" - cool.

Cool. Will finish that on #3490.

At this stage it might be easier to just add better log parsing to the IO tests, because all this information (config reload started, config reload finished, etc.) is available through the log, I think. This should still allow us to remove the quite random "wait for X seconds" cases I mentioned above.

Parsing the log is still prone to failure though; implementing a /loaded would be easier. I also wonder if there's another way to signal we're done reloading state...

@steve-chavez
Member

So check this out: Kubernetes has individual health checks.

Since the admin server already has /schema_cache and /config, how about adding /schema_cache/ready and /config/ready?

@wolfgangwalther
Member Author

So check this out: Kubernetes has individual health checks.

Since the admin server already has /schema_cache and /config, how about adding /schema_cache/ready and /config/ready?

Hm, but that's only because k8s has different components which work independently of each other. That's not the case for us. It's not that kubernetes could consume separate ready endpoints in any way. You'd still need only one ready endpoint to actually make use of it.

@wolfgangwalther
Member Author

Parsing the log is still prone to failure though

What kinds of failures do you have in mind?

@steve-chavez
Member

steve-chavez commented May 8, 2024

These individual health checks should not be consumed by machines but can be helpful for a human operator to debug a system

https://kubernetes.io/docs/reference/using-api/health-checks/#individual-health-checks

I don't see why we can't adopt that and have /ready/schema_cache and /ready/config; it looks like a great idea/design. Our purpose aligns well with what's described above.

What kinds of failures do you have in mind?

The error message can change, and reading the log can hang (#3465, and that PR didn't solve it, it still happens: https://github.com/PostgREST/postgrest/actions/runs/8972240697/job/24639793119).

@wolfgangwalther
Member Author

Do we not have any kind of back-off there? Are we really doing an endless loop of immediately retrying again?

Yes, we do have a backoff.

Are you sure the backoff works, for example, when the schema cache query fails due to a timeout, like in the test_admin_ready_includes_schema_cache_state test?

I can reproduce the hanging IO tests with this test-case, like this:

  • increase sleep_until_postgrest_scache_reload() to something like 2 seconds
  • run postgrest-test-io -vv -k test_admin_ready_includes_schema_cache_state

This immediately hangs forever.

When I then add a timeout=1 to the postgrest.admin.get('/ready') call, I get a failing test and some log output. This shortened output (I took out everything except these lines) looks like this:

08/May/2024:21:06:46 +0200: Attempting to connect to the database...
08/May/2024:21:06:46 +0200: Attempting to connect to the database...
08/May/2024:21:06:46 +0200: Attempting to connect to the database...
(the same line repeats 33 times in total, all with the exact same timestamp)

Why are the timestamps not increasing here? This doesn't look like a backoff.

This doesn't explain the hanging test yet; just an observation so far.

@wolfgangwalther
Member Author

Are you sure the backoff works, for example, when the schema cache query fails due to a timeout, like in the test_admin_ready_includes_schema_cache_state test?

This indicates that there is no backoff involved and instead the reload of the schema cache is tried 10x / s:

debWorker <-
  let decisecond = 100000 in
    mkDebounce defaultDebounceSettings
      { debounceAction = internalConnectionWorker appState
      , debounceFreq = decisecond
      , debounceEdge = leadingEdge -- runs the worker at the start and the end
      }

This does not match my observation, however, because I had 33 attempts to connect with the exact same timestamp (second precision). So either the display of the timestamp is broken, or this is repeated much more than just 10x / s.

I assume the 10x / s debounce only affects the "happy path" inside internalConnectionWorker. But once any path involving killThread <mainthread> is taken... the main thread is restarted immediately, without any delay or backoff?

@steve-chavez
Member

TLDR: I think we don't need to change /ready and/or /live for the purpose of IO tests. If you still want to do the improvement regarding "schema cache reload in progress" - cool.

Merged that on #3490.

I tried replacing sleep_until_postgrest_scache_reload(), but it's used in some places where it's actually waiting for connection recovery to finish, not the schema cache reload. I'm thinking we not only need a /ready/config but also a /ready/connection, to make testing independent of timings.

Overall, this part of the code is too hard to refactor/modify because we don't have tests for it (#1766; my latest attempt at making progress was in #1766 (comment)). It has effectively been at the "make it work" stage for many years now.

@steve-chavez
Member

This indicates that there is no backoff involved and instead the reload of the schema cache is tried 10x / s:

Right, the backoff is only for the connection recovery (establishConnection); if that succeeds, then work is repeated and it goes straight to loadSchemaCache:

internalConnectionWorker :: AppState -> IO ()
internalConnectionWorker appState@AppState{stateObserver=observer} = work
  where
    work = do
      AppConfig{..} <- getConfig appState
      observer DBConnectAttemptObs
      connStatus <- establishConnection appState
      case connStatus of
        ConnFatalFail reason ->
          -- Fatal error when connecting
          observer (ExitFatalObs reason) >> killThread (getMainThreadId appState)
        ConnPending ->
          unless configDbPoolAutomaticRecovery
            $ observer ExitDBNoRecoveryObs >> killThread (getMainThreadId appState)
        ConnEstablished -> do
          -- Proceed with initialization
          when configDbChannelEnabled $
            signalListener appState
          actualPgVersion <- getPgVersion appState
          observer (DBConnectedObs $ pgvFullName actualPgVersion)
          -- this could fail because the connection drops, but loadSchemaCache will pick up the error and retry again
          -- We cannot retry after it fails immediately, because db-pre-config could have user errors. We just log the error and continue.
          when configDbConfig $ reReadConfig False appState
          scStatus <- loadSchemaCache appState
          case scStatus of
            SCLoaded ->
              -- do nothing and proceed if the load was successful
              return ()
            SCPending ->
              -- retry reloading the schema cache
              work
            SCFatalFail ->
              -- die if our schema cache query has an error
              killThread $ getMainThreadId appState

I remember I kept it that way because the schema cache concept is tied to connections too (they have their own cache #2620). But it could be improved.

If we really want to improve this part we need the testing utilities (#1766), otherwise any attempt at refactoring will cause regressions.

@steve-chavez
Member

> 08/May/2024:21:06:46 +0200: Attempting to connect to the database...
> 08/May/2024:21:06:46 +0200: Attempting to connect to the database...
> 08/May/2024:21:06:46 +0200: Attempting to connect to the database...
> ...

I can reproduce the above by just doing alter role postgrest_test_authenticator set statement_timeout to '10'. To add a way to test the failure, I think we could add a metric pgrst_schema_cache_failed_loads_total and check if it increases rapidly.

Additionally, since the connection recovery process is so tied to the schema cache, I think I can use the same backoff for both queries (pg version for the connection, schema cache idem). Otherwise, if we have a dedicated backoff for the schema cache query, we wouldn't know whether it's because the connection is down or because there was a statement timeout.

@wolfgangwalther
Member Author

There is another failure of the IO tests after the latest commit to main (changelog): https://github.com/PostgREST/postgrest/actions/runs/9045664975/job/24855643429

This seems to be just a simple timeout when starting up postgrest in that test. Quite random, though.

@steve-chavez
Member

steve-chavez commented May 11, 2024

From the log lines, that failure looks like the request was hitting the API server instead of the Admin server.

::1 - postgrest_test_anonymous [11/May/2024:18:35:17 +0000] "GET /ready HTTP/1.1" 404 - "" "python-requests/2.31.0"
11/May/2024:18:35:17 +0000: Connection ec7abfbd-8046-4c73-b9b7-33c131add6cd is used
11/May/2024:18:35:17 +0000: Connection ec7abfbd-8046-4c73-b9b7-33c131add6cd is available

Because:

  • The admin server doesn't log anything when its endpoints are hit.
  • A pool connection is used.

So maybe an error in the Python code? Maybe freeport() returned the API server port somehow.

@wolfgangwalther
Member Author

The log says:

 11/May/2024:18:35:17 +0000: Admin server listening on port 48177
11/May/2024:18:35:17 +0000: Listening on port 48177

So surely quite random. I assume freeport() is called twice and then collides (very low probability).

More interesting is the fact that PostgREST doesn't seem to fail when both ports are set to the same value?

When I try this locally, I always get postgrest: Network.Socket.bind: resource busy (Address already in use). Maybe it's related to IPv4/IPv6 somehow? The access log also gives ::1 as the remote IP.
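If the freeport() collision hypothesis holds, one way to rule it out would be to allocate both ports while holding both sockets open, so the OS cannot hand out the same number twice; a hedged sketch, not the actual helper from the test suite:

import socket
from contextlib import ExitStack


def two_free_ports(host="localhost"):
    "Bind two sockets at once so the API and admin ports are guaranteed to differ."
    with ExitStack() as stack:
        socks = [stack.enter_context(socket.socket()) for _ in range(2)]
        for s in socks:
            s.bind((host, 0))
        ports = [s.getsockname()[1] for s in socks]
    # both sockets are closed here; a race with other processes remains,
    # but the two ports can no longer collide with each other
    return ports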

@wolfgangwalther
Member Author

In fact... I can reproduce this:

PGRST_SERVER_PORT=3000 PGRST_SERVER_HOST="localhost" PGRST_ADMIN_SERVER_PORT=3000 postgrest-with-postgresql-16 postgrest-run

This only works when setting host to localhost explicitly. Then I get:

11/May/2024:21:30:27 +0200: Admin server listening on port 3000
11/May/2024:21:30:27 +0200: Listening on port 3000

And then both are listening, on different IPv4/IPv6 addresses:

curl --fail http://127.0.0.1:3000/ready
curl --fail http://[::1]:3000/ready

both return different results...

But since we do http://localhost:<port> in the io tests for both, we end up on the wrong server.
