
Postgres running out of memory (12+ GB) when processing large room → Panic with error Closed (#123)

Open
JeWe37 opened this issue Jul 23, 2023 · 5 comments
Labels: bug (Something isn't working)

Comments

JeWe37 commented Jul 23, 2023

Running with matrix-docker-ansible-deploy, I get the following error:

Jul 23 00:00:07 jendrikserver systemd[1]: Starting Synapse State Auto Compressor...
Jul 23 00:00:07 jendrikserver matrix-synapse-auto-compressor[270871]: 9dc6498df3811da4e068dbc8454a5dd9880b41a50e8b177ec0e694b50bd81871
Jul 23 00:00:07 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:00:07Z INFO  synapse_auto_compressor] synapse_auto_compressor started
Jul 23 00:00:08 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:00:08Z INFO  synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 500
Jul 23 00:05:54 jendrikserver matrix-synapse-auto-compressor[270885]: [2023-07-23T00:05:54Z ERROR panic] thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Closed, cause: None }': src/database.rs:443

I believe this corresponds to this version of the code, which is now this line and doesn't seem to have changed.
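For context, here's a minimal sketch of the kind of pattern that seems to produce that panic, assuming the compressor unwraps the result of a query made with the Rust `postgres` crate (the function and query below are illustrative, not the actual compressor code):

```rust
use postgres::{Client, Error, NoTls};

// Illustrative only: if Postgres restarts (or the backend is killed) while a
// query is in flight, the query returns Err(Error { kind: Closed, .. }), and
// a bare unwrap() on that Result produces exactly the panic in the log above.
fn count_state_rows(client: &mut Client, groups: &[i64]) -> Result<usize, Error> {
    let rows = client.query(
        "SELECT state_group FROM state_groups_state WHERE state_group = ANY($1)",
        &[&groups],
    )?;
    Ok(rows.len())
}

fn main() {
    let mut client =
        Client::connect("host=localhost user=synapse dbname=synapse", NoTls)
            .expect("failed to connect");

    // Panics with `called Result::unwrap() on an Err value: Error { kind: Closed, .. }`
    // if the server goes away mid-query.
    let n = count_state_rows(&mut client, &[1, 2, 3]).unwrap();
    println!("{n} rows");
}
```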

The room in question (!YTvKGNlinIzlkMTVRl:matrix.org) is the Element room.

I'm guessing you will need some more information; how could I provide that?

reivilibre (Contributor) commented:

Do you get this error every time?

Hard to say, since the error is rather nondescript, but is there any reason you might expect your database connection to get closed? (Maybe the database was restarted for some reason?)

How are you connecting to your database: TCP or a UNIX socket? Not saying it will necessarily help, but you could try a UNIX socket connection instead of TCP if you're not using one already (in case it's some TCP timeout or similar).
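If it helps to sanity-check the socket route, here's a minimal sketch of a UNIX socket connection with the Rust `postgres` crate (the socket path, user, and database name are placeholders, not your actual deployment values):

```rust
use postgres::{Config, NoTls};

fn main() -> Result<(), postgres::Error> {
    // With rust-postgres, a host value that is a filesystem path is treated
    // as a Unix socket directory rather than a TCP hostname. All values
    // below are placeholders; adjust them for your deployment.
    let mut client = Config::new()
        .host("/var/run/postgresql") // socket directory, not a TCP host
        .user("synapse")
        .dbname("synapse")
        .connect(NoTls)?;

    // Simple liveness check over the socket.
    let row = client.query_one("SELECT 1::INT4", &[])?;
    let one: i32 = row.get(0);
    println!("connected via Unix socket, got {one}");
    Ok(())
}
```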

JeWe37 (Author) commented Jul 24, 2023

The error happens on every run. This seems to be where matrix-docker-ansible-deploy sets the connection string: https://github.com/spantaleev/matrix-docker-ansible-deploy/blob/856a328e96c679bc7d51760d00708d921a08cb17/roles/custom/matrix-synapse-auto-compressor/defaults/main.yml#L45

I have no real idea why the connection might be getting closed. If I look at the Postgres logs, I see this:

Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] DETAIL:  Failed process was running:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 SELECT target.prev_state_group, source.prev_state_group, state.type, state.state_key, state.event_id
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 FROM state_group_edges AS target
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 LEFT JOIN state_group_edges AS source ON (target.prev_state_group = source.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 LEFT JOIN state_groups_state AS state ON (target.prev_state_group = state.state_group)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:                 WHERE target.prev_state_group = ANY($1)
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]:
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  terminating any other active server processes
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.114 UTC [37002] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.119 UTC [37003] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.123 UTC [37004] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.127 UTC [37005] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.133 UTC [37006] FATAL:  the database system is in recovery mode
Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:04.137 UTC [37007] FATAL:  the database system is in recovery mode
Jul 24 00:05:05 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:05.091 UTC [1] LOG:  all server processes terminated; reinitializing
Jul 24 00:05:10 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:10.884 UTC [37008] LOG:  database system was interrupted; last known up at 2023-07-23 23:57:25 UTC

which to me looks like it's crashing Postgres? Going by the timings in the auto compressor's logs, it's also definitely not a timeout:

Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563082]: 71f93083a03dd2c8be5a90deae601d4c5547d009268ba24f3dba937f4899a7ce
Jul 24 00:00:02 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:02Z INFO  synapse_auto_compressor] synapse_auto_compressor started
Jul 24 00:00:03 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:00:03Z INFO  synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 500
Jul 24 00:05:04 jendrikserver matrix-synapse-auto-compressor[563103]: [2023-07-24T00:05:04Z ERROR panic] thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { kind: Closed, cause: None }': src/database.rs:443
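For reference, here's a rough sketch of issuing that logged query per chunk with the Rust `postgres` crate (the SQL is copied from the Postgres log above; the surrounding function is illustrative, not the compressor's actual code):

```rust
use postgres::{Client, Error};

// Illustrative only: run the logged query for one chunk of state groups.
// The chunk bounds the size of the $1 array (500 groups per the compressor
// log), but Postgres itself can still use a lot of memory joining
// state_group_edges against state_groups_state for a very large room.
fn fetch_edges_for_chunk(client: &mut Client, chunk: &[i64]) -> Result<usize, Error> {
    let rows = client.query(
        "SELECT target.prev_state_group, source.prev_state_group,
                state.type, state.state_key, state.event_id
         FROM state_group_edges AS target
         LEFT JOIN state_group_edges AS source
             ON (target.prev_state_group = source.state_group)
         LEFT JOIN state_groups_state AS state
             ON (target.prev_state_group = state.state_group)
         WHERE target.prev_state_group = ANY($1)",
        &[&chunk],
    )?;
    Ok(rows.len())
}
```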

DMRobertson (Contributor) commented Jul 24, 2023

Jul 24 00:05:04 jendrikserver matrix-postgres[157976]: 2023-07-24 00:05:03.996 UTC [1] LOG:  background worker "parallel worker" (PID 36969) was terminated by signal 9: Killed

This sounds like some other mechanism killed a Postgres process because it ate up too much RAM (the OOM killer?). Check your system logs for clues.

JeWe37 (Author) commented Jul 24, 2023

It was indeed the OOM killer. Processing that room seems to be eating 12+ GB of RAM. Is that expected behavior for large rooms like that?

reivilibre changed the title from "Panic with error closed" to "Postgres running out of memory (12+ GB) when processing large room → Panic with error Closed" on Jul 24, 2023
reivilibre (Contributor) commented:

It doesn't sound like intended behaviour, at the very least!

reivilibre added the bug (Something isn't working) label on Jul 24, 2023