"This slot has been invalidated because it exceeded the maximum reserved size" error #1089

Open
djbutler opened this issue Mar 23, 2024 · 11 comments

djbutler commented Mar 23, 2024

My Electric service crashed, and when I try to restart it with docker compose, I get this:

supabase-electric  | 18:42:35.329 pid=<0.2487.0> [info] Starting ElectricSQL 0.9.2 in direct_writes mode.
supabase-electric  | 18:42:35.332 pid=<0.2343.0> [notice]     :alarm_handler: {:set, {:system_memory_high_watermark, []}}
supabase-electric  | 18:42:35.341 pid=<0.2488.0> [info] Running Electric.Plug.Router with Bandit 1.1.3 at :::5133 (http)
supabase-electric  | 18:42:35.360 pid=<0.2803.0> origin=postgres_1 [info] Electric.Replication.Postgres.Client.with_conn(%{database: ~c"postgres", host: ~c"db", ip_addr: ~c"172.19.0.4", ipv6: true, nulls: [nil, :null, :undefined], password: ~c"******", port: 5432, ssl: true, ssl_opts: [server_name_indication: ~c"db"], timeout: 5000, username: ~c"postgres"})
supabase-electric  | 18:42:35.504 pid=<0.2803.0> origin=postgres_1 [info] Successfully initialized origin postgres_1 at extension version 
supabase-electric  | 18:42:35.505 pid=<0.2808.0> [info] Starting Proxy server listening on port 65432
supabase-electric  | 18:42:35.505 pid=<0.2809.0> pg_producer=postgres_1 [info] Starting Elixir.Electric.Postgres.Extension.SchemaCache for postgres_1
supabase-electric  | 18:42:35.506 pid=<0.2809.0> pg_producer=postgres_1 [warning] SchemaCache "postgres_1" registered as the global instance
supabase-electric  | 18:42:35.506 pid=<0.2812.0> [info] Starting replication from postgres_1
supabase-electric  | 18:42:35.506 pid=<0.2812.0> [info] Electric.Replication.Postgres.LogicalReplicationProducer.init(%{database: ~c"postgres", host: ~c"db", ip_addr: ~c"172.19.0.4", ipv6: true, nulls: [nil, :null, :undefined], password: ~c"******", port: 5432, replication: ~c"database", ssl: true, ssl_opts: [server_name_indication: ~c"db"], timeout: 5000, username: ~c"postgres"})
supabase-electric  | ▓ ┌────────────────────────────────────────────────────────┐
supabase-electric  | ▓ │  MODULE ERROR: Electric.Replication.PostgresConnector  │
supabase-electric  | ▓ ┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
supabase-electric  | ▓ 
supabase-electric  | ▓ Failed to start child :postgres_producer:
supabase-electric  | ▓   {:bad_return_value, {:error, {:error, :error, "55000", :object_not_in_prerequisite_state, "cannot read from logical replication slot \"electric_replication_out_postgres\"", [detail: "This slot has been invalidated because it exceeded the maximum reserved size.", file: "walsender.c", line: "1269", routine: "StartLogicalReplication", severity: "ERROR"]}}}
supabase-electric  | ▓ 
supabase-electric  | ▓ Please file a new issue on GitHub[1], including the contents of this error.
supabase-electric  | ▓ 
supabase-electric  | ▓ [1]: https://github.com/electric-sql/electric/issues
supabase-electric  | 18:42:35.581 pid=<0.2803.0> origin=postgres_1 [error] PostgresConnectorSup failed to start child :postgres_producer with reason: {:bad_return_value, {:error, {:error, :error, "55000", :object_not_in_prerequisite_state, "cannot read from logical replication slot \"electric_replication_out_postgres\"", [detail: "This slot has been invalidated because it exceeded the maximum reserved size.", file: "walsender.c", line: "1269", routine: "StartLogicalReplication", severity: "ERROR"]}}}.
supabase-electric  | 
supabase-electric  | ••• Shutting down •••
supabase-electric  | 
supabase-electric  | [os_mon] memory supervisor port (memsup): Erlang has closed
supabase-electric  | [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
supabase-electric exited with code 1

djbutler commented

This appears to have been caused by me accidentally sending a very large number of SQL queries to the Electric migration proxy rather than to my Postgres database (where I intended to send them). Easy mistake to make... might be worth handling this case more gracefully.

hongsw-aibizon commented Mar 28, 2024

@djbutler How did you get things back to normal?
I also encountered the same error while using the createMany function.

djbutler commented Mar 28, 2024 via email

djbutler commented Mar 28, 2024 via email

alco (Member) commented Apr 24, 2024

Hey @djbutler. Thanks for reporting this!

By default, Postgres does not put a limit on replication slot sizes. Are you using a hosted database, or did you set the relevant configuration option in your Postgres config yourself?

On the issue of disk usage growth caused by the replication slot Electric creates: this is, in fact, a hairy problem. You'll find more relevant info in #1083. TL;DR: any write to Postgres will cause its disk usage to grow, even if the write itself does not touch electrified tables. In the next release of Electric, there will be a new configuration option to limit the maximum amount of disk space Electric's replication slot can retain.
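
For anyone who wants to keep an eye on this in the meantime, here is a minimal sketch (assuming direct SQL access to the database; this is plain Postgres, not Electric-specific tooling) that shows how much WAL each replication slot is currently holding back:

-- restart_lsn is the oldest WAL position the slot still needs, so its
-- distance from the current WAL insert position is the WAL being retained.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;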

djbutler commented

Thanks for the explanation. Yeah, this would be great to fix, since it's come up a few times for me.

This db is self-hosted on EC2 - it's a Supabase docker image. I don't know if any of these look relevant:

name | setting | unit | category | short_desc
auto_explain.log_wal | off | | Customized Options | Log WAL usage.
max_slot_wal_keep_size | 1024 | MB | Replication / Sending Servers | Sets the maximum WAL size that can be reserved by replication slots.
max_wal_senders | 10 | | Replication / Sending Servers | Sets the maximum number of simultaneously running WAL sender processes.
max_wal_size | 1024 | MB | Write-Ahead Log / Checkpoints | Sets the WAL size that triggers a checkpoint.
min_wal_size | 80 | MB | Write-Ahead Log / Checkpoints | Sets the minimum size to shrink the WAL to.
track_wal_io_timing | off | | Statistics / Cumulative Query and Index Statistics | Collects timing statistics for WAL I/O activity.
wal_block_size | 8192 | | Preset Options | Shows the block size in the write ahead log.
wal_buffers | 512 | 8kB | Write-Ahead Log / Settings | Sets the number of disk-page buffers in shared memory for WAL.
wal_compression | off | | Write-Ahead Log / Settings | Compresses full-page writes written in WAL file with specified method.
wal_consistency_checking | | | Developer Options | Sets the WAL resource managers for which WAL consistency checks are done.
wal_decode_buffer_size | 524288 | B | Write-Ahead Log / Recovery | Buffer size for reading ahead in the WAL during recovery.
wal_init_zero | on | | Write-Ahead Log / Settings | Writes zeroes to new WAL files before first use.
wal_keep_size | 0 | MB | Replication / Sending Servers | Sets the size of WAL files held for standby servers.
wal_level | logical | | Write-Ahead Log / Settings | Sets the level of information written to the WAL.
wal_log_hints | off | | Write-Ahead Log / Settings | Writes full pages to WAL when first modified after a checkpoint, even for a non-critical modification.
wal_receiver_create_temp_slot | off | | Replication / Standby Servers | Sets whether a WAL receiver should create a temporary replication slot if no permanent slot is configured.
wal_receiver_status_interval | 10 | s | Replication / Standby Servers | Sets the maximum interval between WAL receiver status reports to the sending server.
wal_receiver_timeout | 60000 | ms | Replication / Standby Servers | Sets the maximum wait time to receive data from the sending server.
wal_recycle | on | | Write-Ahead Log / Settings | Recycles WAL files by renaming them.
wal_retrieve_retry_interval | 5000 | ms | Replication / Standby Servers | Sets the time to wait before retrying to retrieve WAL after a failed attempt.
wal_segment_size | 16777216 | B | Preset Options | Shows the size of write ahead log segments.
wal_sender_timeout | 60000 | ms | Replication / Sending Servers | Sets the maximum time to wait for WAL replication.
wal_skip_threshold | 2048 | kB | Write-Ahead Log / Settings | Minimum size of new file to fsync instead of writing WAL.
wal_sync_method | fdatasync | | Write-Ahead Log / Settings | Selects the method used for forcing WAL updates to disk.
wal_writer_delay | 200 | ms | Write-Ahead Log / Settings | Time between WAL flushes performed in the WAL writer.
wal_writer_flush_after | 128 | 8kB | Write-Ahead Log / Settings | Amount of WAL written out by WAL writer that triggers a flush.

ALTER SYSTEM SET max_slot_wal_keep_size = '5GB'; doesn't throw any errors, but also doesn't seem to have any effect. If you have any pointers for how to clear my WAL and increase the limit, that would be amazing - otherwise I'll just try to troubleshoot it myself.
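
A quick way to check whether the new value has actually been picked up (a minimal sketch, assuming superuser access to the database from psql):

-- Ask Postgres to re-read its configuration files (ALTER SYSTEM writes to
-- postgresql.auto.conf), then check the live value and whether the change
-- is still waiting on a server restart.
SELECT pg_reload_conf();

SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';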

alco (Member) commented Apr 30, 2024

@djbutler Thanks for sharing your config!

The max_slot_wal_keep_size setting is the reason you got the error about the replication slot getting invalidated. By default, its value is set to -1 in Postgres, so replication slots can retain as much WAL as there is available disk space. Changing the value of this setting won't have any effect until the Postgres instance is restarted, per the docs:

This parameter can only be set in the postgresql.conf file or on the server command line.

Even then, I guess once a replication slot has been invalidated, it can no longer be used. You can drop Electric's slot with SELECT pg_drop_replication_slot('electric_replication_out_<db name>') and restart Electric to let it create a new one. You can also query the database for all replication slots to double-check that the name is right:

SELECT * FROM pg_replication_slots;
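
On Postgres 13 and later, the wal_status column also tells you whether a slot has been invalidated; a sketch of the recovery steps, using the slot name from the error log above:

-- A slot whose wal_status is 'lost' has been invalidated and can't be reused.
SELECT slot_name, active, wal_status FROM pg_replication_slots;

-- Drop the invalidated slot, then restart Electric so it creates a fresh one.
SELECT pg_drop_replication_slot('electric_replication_out_postgres');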

In the upcoming release of Electric, we're introducing a configuration parameter to set the upper limit on the amount of WAL that Electric can retain; see the preview here.

alco self-assigned this Apr 30, 2024

djbutler commented Apr 30, 2024

Thanks @alco, I appreciate the detailed explanation. One quick question about the preview you linked to - it says:

Setting this to a low value may lead to clients having to discard their local copy of the server state and restart their replication state from scratch.

So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?

Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it? That would be pretty bad - it could lead to data loss. I assume it's just a performance issue - the client will have to download the server state again, but it won't lose pending local transactions, right?

alco (Member) commented May 1, 2024

So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?

Electric does not compute diffs currently. It can stream transactions from Postgres' WAL and, provided there's no gap between what it has already sent to a given client and what it can stream from Postgres, it will keep streaming new transactions to the client.

But if the replication slot is removed, Postgres will discard old WAL segments, and so, upon creating a new replication slot, Electric will only be able to resume streaming from the latest state in Postgres. If it detects that there's a gap between the last transaction the client received and the earliest transaction Electric can stream from Postgres, it will ask the client to reset its local state, forcing it to request the latest snapshot of the shape data from the database before going back to transaction-streaming mode.

Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped does it?

The idea is that the client's local updates will survive the reset. This is a gray area currently, since we don't yet have sufficient test coverage for this edge case, but we'll get there eventually.

djbutler commented May 1, 2024

Okay got it, thanks!
