Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

synapse_auto_compressor panics on start after using the history delete api to remove remote events #78

Open
MTRNord opened this issue Nov 24, 2021 · 15 comments
Labels
bug Something isn't working

Comments

@MTRNord
Copy link

MTRNord commented Nov 24, 2021

Describe the bug

Nov 25 00:25:08 nordgedanken.dev systemd[1]: Starting State Compress Synapse...
Nov 25 00:25:08 nordgedanken.dev synapse_auto_compressor[3600348]: [2021-11-24T23:25:08Z INFO  synapse_auto_compressor] synapse_auto_compressor started
Nov 25 00:25:08 nordgedanken.dev synapse_auto_compressor[3600348]: [2021-11-24T23:25:08Z INFO  synapse_auto_compressor::manager] Running compressor on room !OGEhHVWSdvArJzumhm:matrix.org with chunk size 1000
Nov 25 00:25:10 nordgedanken.dev synapse_auto_compressor[3600348]: [2021-11-24T23:25:10Z ERROR panic] thread 'main' panicked at 'Missing 13719240': src/lib.rs:666
Nov 25 00:25:10 nordgedanken.dev systemd[1]: synapse-state-compress.service: Main process exited, code=exited, status=101/n/a
Nov 25 00:25:10 nordgedanken.dev systemd[1]: synapse-state-compress.service: Failed with result 'exit-code'.
Nov 25 00:25:10 nordgedanken.dev systemd[1]: Failed to start State Compress Synapse.

To Reproduce
Steps to reproduce the behavior:

  1. Run purge_history on a room
  2. start synapse_auto_compressor

Expected behavior
It runs or atleast ignores the invalid rooms

@DMRobertson
Copy link
Contributor

Could you run this with the environment variable RUST_BACKTRACE=1 set, to have a better idea of where the panic comes from?

@DMRobertson DMRobertson added the bug Something isn't working label Nov 25, 2021
@MTRNord
Copy link
Author

MTRNord commented Nov 25, 2021

While I would love to it seems that is all I am getting even with the env variable set and in a debug binary:

➜  synapse_auto_compressor git:(main) export RUST_BACKTRACE=1
➜  synapse_auto_compressor git:(main) RUST_BACKTRACE=1 synapse_auto_compressor -p postgresql:// -c 1000 -n 10000000
[2021-11-25T18:23:14Z INFO  synapse_auto_compressor] synapse_auto_compressor started
[2021-11-25T18:23:14Z INFO  synapse_auto_compressor::manager] Running compressor on room !QQpfJfZvqxbCfeDgCj:matrix.org with chunk size 1000
[2021-11-25T18:23:14Z ERROR panic] thread 'main' panicked at 'Missing 13719470': src/lib.rs:666

@DMRobertson
Copy link
Contributor

Thanks for trying that---I wonder what's happened there?

@maranda
Copy link

maranda commented Jan 24, 2022

I'm not sure if this is helpful, but deleting the relevant room entries into state_compressor_state and state_compressor_progress. At least made the auto compressor resume working for me.

@MTRNord
Copy link
Author

MTRNord commented Apr 18, 2022

By the way I now tried the workaround of deleting it from the table. However that just makes it fail on missing some other position :/

Also I now know that for me the affected room is TWIM !QQpfJfZvqxbCfeDgCj:matrix.org (its however now changing the rooms it fails at)

@maranda
Copy link

maranda commented Apr 18, 2022

Anytime that happens just repeat the previous step to delete references of the room from both tables and the compressor will resume.

You may have to do that multiple times.

@MTRNord
Copy link
Author

MTRNord commented Apr 18, 2022

Anytime that happens just repeat the previous step to delete references of the room from both tables and the compressor will resume.

You may have to do that multiple times.

Yeah but thats kinda annoying to do on an automated setup :) I will instead now have a look at a) changing to logging to get useful backtraces and b) take a look at handling this or atleast do the cleanup automatically instead of doing a panic here. :)

@MTRNord

This comment was marked as outdated.

@MTRNord
Copy link
Author

MTRNord commented Apr 18, 2022

(See hidden comment for minor more info)

Proper debug stack on where it crashes:

Apr 18 14:41:18 nordgedanken.dev synapse_auto_compressor[1461640]: 2022-04-18T12:41:18.278082Z  INFO synapse_auto_compressor::manager: Running compressor on room !zXfJBqSUvXySmsZMtB:jki.re with chunk size 1000
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]: The application panicked (crashed).
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]: Message:  Missing 13719624
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]: Location: src/lib.rs:664
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ BACKTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:                                 ⋮ 7 frames hidden ⋮
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:    8: synapse_compress_state::collapse_state_maps::h44a9bbb9616014eb
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/src/lib.rs:664
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        662 │ fn collapse_state_maps(map: &BTreeMap<i64, StateGroupEntry>, state_group: i64) -> StateMap<Atom> {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        663 │     if !map.contains_key(&state_group) {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        664 >         panic!("Missing {}", state_group);
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        665 │     }
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        666 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:    9: synapse_compress_state::compressor::Compressor::get_delta::he6c004a548606169
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/src/compressor.rs:267
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        265 │         let mut prev_state_map;
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        266 │         'outer: loop {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        267 >             prev_state_map = collapse_state_maps(self.original_state_map, prev_sg);
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        268 │             for (t, s) in prev_state_map.keys() {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        269 │                 if !state_map.contains_key(t, s) {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   10: synapse_compress_state::compressor::Compressor::create_new_tree::h46dc928a5f4d15ab
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/src/compressor.rs:228
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        226 │             } else {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        227 │                 self.stats.state_groups_changed += 1;
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        228 >                 self.get_delta(prev_state_group, state_group)
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        229 │             };
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        230 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   11: synapse_compress_state::compressor::Compressor::compress_from_save::hb4323518edfb5fa4
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/src/compressor.rs:169
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        167 │         };
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        168 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        169 >         compressor.create_new_tree();
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        170 │         compressor
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        171 │     }
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   12: synapse_compress_state::continue_run::hdaabfe372f1e3a79
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/src/lib.rs:570
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        568 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        569 │     // Now we actually call the compression algorithm.
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        570 >     let compressor = Compressor::compress_from_save(&state_group_map, level_info);
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        571 │     let new_state_group_map = &compressor.new_state_group_map;
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        572 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   13: synapse_auto_compressor::manager::run_compressor_on_room_chunk::h1a04765da5e58a53
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/synapse_auto_compressor/src/manager.rs:61
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:         59 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:         60 │     // run the compressor on this chunk
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:         61 >     let option_chunk_stats = continue_run(start, chunk_size, db_url, room_id, &level_info);
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:         62 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:         63 │     if option_chunk_stats.is_none() {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   14: synapse_auto_compressor::manager::compress_chunks_of_database::h12736a1fee37efa6
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/synapse_auto_compressor/src/manager.rs:171
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        169 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        170 │         let work_done =
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        171 >             run_compressor_on_room_chunk(db_url, &room_to_compress, chunk_size, default_levels)?;
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        172 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        173 │         if let Some(ref chunk_stats) = work_done {
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   15: synapse_auto_compressor::main::ha895a792595d0dd7
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /root/.cargo/git/checkouts/rust-synapse-compress-state-68477a18a6ccb974/95206e2/synapse_auto_compressor/src/main.rs:174
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        172 │     // call compress_largest_rooms with the arguments supplied
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        173 │     // panic if an error is produced
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        174 >     manager::compress_chunks_of_database(db_url, chunk_size, &default_levels.0, number_of_chunks)?;
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        175 │
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:        176 │     info!("synapse_auto_compressor finished");
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:   16: core::ops::function::FnOnce::call_once::hf6e65f2765cf1ff7
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:       at /rustc/ec77f252434a532fdb5699ae4f21a3072d211edd/library/core/src/ops/function.rs:248
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]:                                 ⋮ 15 frames hidden ⋮
Apr 18 14:42:05 nordgedanken.dev synapse_auto_compressor[1461640]: Run with COLORBT_SHOW_HIDDEN=1 environment variable to disable frame filtering.

This is using this forked version (replacing the logging functions mainly: https://github.com/MTRNord/rust-synapse-compress-state/tree/95206e295599973bc676c32ec4639852830ef4c0 )

@MTRNord
Copy link
Author

MTRNord commented Apr 18, 2022

Also, MTRNord@940aba6 contains experimental error handling for this. It will essentially just skip the room in case of this error entirely. It won't do any cleanup currently, but at least not panic. Furthermore, it's not tested and potentially breaking stuff at this time. Hence, the warning below.

The potentially incorrect and problematic changes are here: MTRNord@940aba6#diff-aecc1fa342e82a5188eb5d3fce02ecc1adc1c82ae413ed814a6f9f362d5cba55R229-R236

As I am not sure if it is safe to break there.

DO NOT USE THE ABOVE COMMIT! (It may very well break your synapse as it changes stuff in the actual logic of the algo)

Edit: There indeed was a bug, and it still would have committed the new (now broken) state to the db. This is fixed in MTRNord@427a37f

Edit:

Initial testing seems to be working after some choking:

INFO synapse_auto_compressor::manager: Running compressor on room !zXfJBqSUvXySmsZMtB:jki.re with chunk size 1000
ERROR synapse_compress_state::compressor: Failed to get delta for state group 13719847: Missing state_goup: 13719624
INFO synapse_compress_state: Checking that state maps match...
WARN synapse_compress_state: Invalid state found: expected state to match: Missing state_goup: 13719847
WARN synapse_auto_compressor::manager: The compressor tried to increase the number of rows in !zXfJBqSUvXySmsZMtB:jki.re between Some(13719624) and 13926124. Skipping...
INFO synapse_auto_compressor::manager: Running compressor on room !ping-v6:maunium.net with chunk size 1000
ERROR synapse_compress_state::compressor: Failed to get delta for state group 13719872: Missing state_goup: 13719793
INFO synapse_compress_state: Checking that state maps match...
WARN synapse_compress_state: Invalid state found: expected state to match: Missing state_goup: 13720144
WARN synapse_auto_compressor::manager: The compressor tried to increase the number of rows in !ping-v6:maunium.net between Some(13719793) and 13722699. Skipping...
INFO synapse_auto_compressor::manager: Running compressor on room !fudRWyBUmBnUrIuyXq:matrix.org with chunk size 1000
INFO synapse_compress_state: Checking that state maps match...
INFO synapse_compress_state: New state map matches old one

It now just prints the error and not modifies the state for that room.

@speatzle
Copy link

I just ran into the same Issue. Where the Compressor fails on a Room that got purged in the Past:

[2023-01-30T08:29:49Z INFO  synapse_auto_compressor] synapse_auto_compressor started
[2023-01-30T08:29:49Z INFO  synapse_auto_compressor::manager] Running compressor on room !YTvKGNlinIzlkMTVRl:matrix.org with chunk size 1000
[2023-01-30T08:30:31Z ERROR panic] thread 'main' panicked at 'Missing 88328': src/lib.rs:666

@FrankNagel
Copy link

I'm always calling the following script in preparation of a compressor run:

#!/bin/bash

# Clean up synapse_auto_compressor state tables.
#
# The compressor doesn't take into account
# - deleted rooms
# - state groups which got deleted either as unreferenced 
#   or due to retention time
#
# Procedure:
# 1. Delete progress related to deleted rooms
# 2. Delete progress for rooms where one of the referenced state groups
#    no longer exist
# 3. Replicate changes from state_compressor_state to state_compressor_progress

set -e -u -o pipefail

export PGHOST=$1
export PGDATABASE=$2
export PGUSER=$3
export PGPASSWORD=$4

SCRIPT=$(basename $0)

echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ') INFO ${SCRIPT}]" Starting

psql <<_EOF
BEGIN;

DELETE
FROM state_compressor_state AS scs
WHERE NOT EXISTS
    (SELECT *
     FROM rooms AS r
     WHERE r.room_id = scs.room_id);

DELETE
FROM state_compressor_state AS scs
WHERE scs.room_id in
    (SELECT DISTINCT room_id
     FROM state_compressor_state AS scs2
     WHERE scs2.current_head IS NOT NULL
       AND NOT EXISTS
         (SELECT *
          FROM state_groups AS sg
          WHERE sg.id = scs2.current_head));

DELETE
FROM state_compressor_progress AS scp
WHERE NOT EXISTS
    (SELECT *
     FROM state_compressor_state AS scs
     WHERE scs.room_id = scp.room_id);

COMMIT;
_EOF

echo "[$(date -u '+%Y-%m-%dT%H:%M:%SZ') INFO ${SCRIPT}]" Finished

@speatzle
Copy link

speatzle commented Feb 2, 2023

@FrankNagel running this script fixes my issues

@realtyem
Copy link

It would be great if this could be incorporated into this program, as a nice simple way to do this recovery after a purge.

@doobry-systemli
Copy link

Is there any plans to tackle this issue? It seems like a major usability bug to me that the tool always crashes after a room has been deleted. We ran into the same issues and #78 (comment) indeed fixes the issue (until next time a room got deleted).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants