"This slot has been invalidated because it exceeded the maximum reserved size" error #1089

Open
djbutler opened this issue Mar 23, 2024 · 11 comments

djbutler commented Mar 23, 2024

My Electric service crashed, and when I try to restart it with docker compose, I get this:

supabase-electric  | 18:42:35.329 pid=<0.2487.0> [info] Starting ElectricSQL 0.9.2 in direct_writes mode.
supabase-electric  | 18:42:35.332 pid=<0.2343.0> [notice]     :alarm_handler: {:set, {:system_memory_high_watermark, []}}
supabase-electric  | 18:42:35.341 pid=<0.2488.0> [info] Running Electric.Plug.Router with Bandit 1.1.3 at :::5133 (http)
supabase-electric  | 18:42:35.360 pid=<0.2803.0> origin=postgres_1 [info] Electric.Replication.Postgres.Client.with_conn(%{database: ~c"postgres", host: ~c"db", ip_addr: ~c"172.19.0.4", ipv6: true, nulls: [nil, :null, :undefined], password: ~c"******", port: 5432, ssl: true, ssl_opts: [server_name_indication: ~c"db"], timeout: 5000, username: ~c"postgres"})
supabase-electric  | 18:42:35.504 pid=<0.2803.0> origin=postgres_1 [info] Successfully initialized origin postgres_1 at extension version 
supabase-electric  | 18:42:35.505 pid=<0.2808.0> [info] Starting Proxy server listening on port 65432
supabase-electric  | 18:42:35.505 pid=<0.2809.0> pg_producer=postgres_1 [info] Starting Elixir.Electric.Postgres.Extension.SchemaCache for postgres_1
supabase-electric  | 18:42:35.506 pid=<0.2809.0> pg_producer=postgres_1 [warning] SchemaCache "postgres_1" registered as the global instance
supabase-electric  | 18:42:35.506 pid=<0.2812.0> [info] Starting replication from postgres_1
supabase-electric  | 18:42:35.506 pid=<0.2812.0> [info] Electric.Replication.Postgres.LogicalReplicationProducer.init(%{database: ~c"postgres", host: ~c"db", ip_addr: ~c"172.19.0.4", ipv6: true, nulls: [nil, :null, :undefined], password: ~c"******", port: 5432, replication: ~c"database", ssl: true, ssl_opts: [server_name_indication: ~c"db"], timeout: 5000, username: ~c"postgres"})
supabase-electric  | ▓ ┌────────────────────────────────────────────────────────┐
supabase-electric  | ▓ │  MODULE ERROR: Electric.Replication.PostgresConnector  │
supabase-electric  | ▓ ┕━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┙
supabase-electric  | ▓ 
supabase-electric  | ▓ Failed to start child :postgres_producer:
supabase-electric  | ▓   {:bad_return_value, {:error, {:error, :error, "55000", :object_not_in_prerequisite_state, "cannot read from logical replication slot \"electric_replication_out_postgres\"", [detail: "This slot has been invalidated because it exceeded the maximum reserved size.", file: "walsender.c", line: "1269", routine: "StartLogicalReplication", severity: "ERROR"]}}}
supabase-electric  | ▓ 
supabase-electric  | ▓ Please file a new issue on GitHub[1], including the contents of this error.
supabase-electric  | ▓ 
supabase-electric  | ▓ [1]: https://github.com/electric-sql/electric/issues
supabase-electric  | 18:42:35.581 pid=<0.2803.0> origin=postgres_1 [error] PostgresConnectorSup failed to start child :postgres_producer with reason: {:bad_return_value, {:error, {:error, :error, "55000", :object_not_in_prerequisite_state, "cannot read from logical replication slot \"electric_replication_out_postgres\"", [detail: "This slot has been invalidated because it exceeded the maximum reserved size.", file: "walsender.c", line: "1269", routine: "StartLogicalReplication", severity: "ERROR"]}}}.
supabase-electric  | 
supabase-electric  | ••• Shutting down •••
supabase-electric  | 
supabase-electric  | [os_mon] memory supervisor port (memsup): Erlang has closed
supabase-electric  | [os_mon] cpu supervisor port (cpu_sup): Erlang has closed
supabase-electric exited with code 1

djbutler commented

This appears to have been caused by me accidentally sending a very large number of SQL queries to the Electric migration proxy rather than to my Postgres database (where I intended to send them). Easy mistake to make... might be worth handling this case more gracefully.

hongsw-aibizon commented Mar 28, 2024

@djbutler How did you get things back to normal?
I also encountered the same error while using the createMany function.

djbutler commented Mar 28, 2024 via email

djbutler commented Mar 28, 2024 via email

alco (Member) commented Apr 24, 2024

Hey @djbutler. Thanks for reporting this!

By default, Postgres does not put a limit on replication slot sizes. Are you using a hosted database, or did you set the relevant configuration option in your Postgres config yourself?

On the issue of disk usage growth caused by the replication slot Electric creates: this is, in fact, a hairy problem. You'll find more relevant info in #1083. TL;DR: any write to Postgres will cause its disk usage to grow, even if the write itself does not touch electrified tables. In the next release of Electric, there will be a new configuration option to limit the maximum amount of disk space Electric's replication slot can retain.
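
For anyone who wants to keep an eye on this in the meantime, here is a minimal sketch (assuming direct SQL access to the database; this is plain Postgres, not Electric-specific tooling) that shows how much WAL each replication slot is currently holding back:

-- restart_lsn is the oldest WAL position the slot still needs, so its
-- distance from the current WAL insert position is the WAL being retained.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;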

djbutler commented

Thanks for the explanation. Yeah, this would be great to fix, since it's come up a few times for me.

This db is self-hosted on EC2 - it's a Supabase docker image. I don't know if any of these look relevant:

name | setting | unit | category | short_desc
auto_explain.log_wal | off | | Customized Options | Log WAL usage.
max_slot_wal_keep_size | 1024 | MB | Replication / Sending Servers | Sets the maximum WAL size that can be reserved by replication slots.
max_wal_senders | 10 | | Replication / Sending Servers | Sets the maximum number of simultaneously running WAL sender processes.
max_wal_size | 1024 | MB | Write-Ahead Log / Checkpoints | Sets the WAL size that triggers a checkpoint.
min_wal_size | 80 | MB | Write-Ahead Log / Checkpoints | Sets the minimum size to shrink the WAL to.
track_wal_io_timing | off | | Statistics / Cumulative Query and Index Statistics | Collects timing statistics for WAL I/O activity.
wal_block_size | 8192 | | Preset Options | Shows the block size in the write ahead log.
wal_buffers | 512 | 8kB | Write-Ahead Log / Settings | Sets the number of disk-page buffers in shared memory for WAL.
wal_compression | off | | Write-Ahead Log / Settings | Compresses full-page writes written in WAL file with specified method.
wal_consistency_checking | | | Developer Options | Sets the WAL resource managers for which WAL consistency checks are done.
wal_decode_buffer_size | 524288 | B | Write-Ahead Log / Recovery | Buffer size for reading ahead in the WAL during recovery.
wal_init_zero | on | | Write-Ahead Log / Settings | Writes zeroes to new WAL files before first use.
wal_keep_size | 0 | MB | Replication / Sending Servers | Sets the size of WAL files held for standby servers.
wal_level | logical | | Write-Ahead Log / Settings | Sets the level of information written to the WAL.
wal_log_hints | off | | Write-Ahead Log / Settings | Writes full pages to WAL when first modified after a checkpoint, even for a non-critical modification.
wal_receiver_create_temp_slot | off | | Replication / Standby Servers | Sets whether a WAL receiver should create a temporary replication slot if no permanent slot is configured.
wal_receiver_status_interval | 10 | s | Replication / Standby Servers | Sets the maximum interval between WAL receiver status reports to the sending server.
wal_receiver_timeout | 60000 | ms | Replication / Standby Servers | Sets the maximum wait time to receive data from the sending server.
wal_recycle | on | | Write-Ahead Log / Settings | Recycles WAL files by renaming them.
wal_retrieve_retry_interval | 5000 | ms | Replication / Standby Servers | Sets the time to wait before retrying to retrieve WAL after a failed attempt.
wal_segment_size | 16777216 | B | Preset Options | Shows the size of write ahead log segments.
wal_sender_timeout | 60000 | ms | Replication / Sending Servers | Sets the maximum time to wait for WAL replication.
wal_skip_threshold | 2048 | kB | Write-Ahead Log / Settings | Minimum size of new file to fsync instead of writing WAL.
wal_sync_method | fdatasync | | Write-Ahead Log / Settings | Selects the method used for forcing WAL updates to disk.
wal_writer_delay | 200 | ms | Write-Ahead Log / Settings | Time between WAL flushes performed in the WAL writer.
wal_writer_flush_after | 128 | 8kB | Write-Ahead Log / Settings | Amount of WAL written out by WAL writer that triggers a flush.

ALTER SYSTEM SET max_slot_wal_keep_size = '5GB'; doesn't throw any errors, but also doesn't seem to have any effect. If you have any pointers for how to clear my WAL and increase the limit, that would be amazing - otherwise I'll just try to troubleshoot it myself.
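
A quick way to check whether the new value has actually been picked up (a minimal sketch, assuming superuser access to the database from psql):

-- Ask Postgres to re-read its configuration files (ALTER SYSTEM writes to
-- postgresql.auto.conf), then check the live value and whether the change
-- is still waiting on a server restart.
SELECT pg_reload_conf();

SELECT name, setting, unit, pending_restart
FROM pg_settings
WHERE name = 'max_slot_wal_keep_size';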

alco (Member) commented Apr 30, 2024

@djbutler Thanks for sharing your config!

The max_slot_wal_keep_size setting is the reason you got the error about the replication slot getting invalidated. By default, its value is set to -1 in Postgres, so replication slots can retain as much WAL as there is available disk space. Changing the value of this setting won't have any effect until the Postgres instance is restarted, per the docs:

This parameter can only be set in the postgresql.conf file or on the server command line.

Even then, I guess once a replication slot has been invalidated, it can no longer be used. You can drop Electric's slot with SELECT pg_drop_replication_slot('electric_replication_out_<db name>') and restart Electric to let it create a new one. You can also query the database for all replication slots to double-check that the name is right:

SELECT * FROM pg_replication_slots;
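
On Postgres 13 and later, the wal_status column also tells you whether a slot has been invalidated; a sketch of the recovery steps, using the slot name from the error log above:

-- A slot whose wal_status is 'lost' has been invalidated and can't be reused.
SELECT slot_name, active, wal_status FROM pg_replication_slots;

-- Drop the invalidated slot, then restart Electric so it creates a fresh one.
SELECT pg_drop_replication_slot('electric_replication_out_postgres');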

In the upcoming release of Electric, we're introducing a configuration parameter to set the upper limit on the amount of WAL that Electric can retain; see the preview here.

alco self-assigned this Apr 30, 2024

djbutler commented Apr 30, 2024

Thanks @alco, I appreciate the detailed explanation. One quick question about the preview you linked to - it says:

Setting this to a low value may lead to clients having to discard their local copy of the server state and restart their replication state from scratch.

So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?

Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped, does it? That would be pretty bad - it could lead to data loss. I assume it's just a performance issue - the client will have to download the server state again, but it won't lose pending local transactions, right?

alco (Member) commented May 1, 2024

So if I understand correctly, the local copy will be discarded because the "diff" information between the client's copy and the server's newer copy has been thrown away?

Electric does not compute diffs currently. It can stream transactions from Postgres' WAL and, provided there's no gap between what it has already sent to a given client and what it can stream from Postgres, it will keep streaming new transactions to the client.

But if the replication slot is removed, Postgres will discard old WAL segments, and so, upon creating a new replication slot, Electric will only be able to resume streaming from the latest state in Postgres. If it detects that there's a gap between the last transaction the client received and the earliest transaction Electric can stream from Postgres, it will ask the client to reset its local state, forcing it to request the latest snapshot of the shape data from the database before going back to transaction-streaming mode.

Also, discarding the local copy of the server state doesn't cause pending client-side transactions to be dropped does it?

The idea is that the client's local updates will survive the reset. This is a gray area currently, since we don't yet have sufficient test coverage for this edge case, but we'll get there eventually.

djbutler commented May 1, 2024

Okay got it, thanks!
