RND: really big tables snapshotting #223

Open · BorisTyshkevich opened this issue Feb 20, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments

@BorisTyshkevich

When we need to replicate a very big table from Postgres/MySQL/etc. with 1B rows and 1 TB of data, a simple SELECT * from the table could be problematic and fail for various reasons. Do we have any checkpointing mechanism to continue loading without starting from the very beginning each time? How does it work?

Another problem is holding a lock on the server or table for too long during the initial load.

My old way (used with Debezium and the like) for syncing to ClickHouse was to start CDC from the current position, placing data into ReplacingMergeTree(version) with version = now(), and then run a full snapshot with version = 0. This approach proved to be more stable and doesn't hold locks in Postgres/MySQL.
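A minimal sketch of that layout on the ClickHouse side (the orders_sync table and its columns are hypothetical, assuming a numeric version column):

```sql
-- ReplacingMergeTree keeps, per sorting key, only the row with the
-- highest `version` once parts are merged in the background.
CREATE TABLE orders_sync
(
    id      UInt64,
    payload String,
    version UInt64              -- now() for CDC rows, 0 for snapshot rows
)
ENGINE = ReplacingMergeTree(version)
ORDER BY id;

-- The CDC stream, started from the current position, writes version = now():
INSERT INTO orders_sync VALUES (42, 'row updated via CDC', toUnixTimestamp(now()));

-- The full snapshot runs in parallel (or later) with version = 0, so a CDC
-- row for the same key always wins over its snapshot copy:
INSERT INTO orders_sync VALUES (42, 'row from snapshot', 0);
```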

How can I implement this with transfer? Is it possible to set version = 0 for SNAPSHOT_ONLY mode? Or maybe it has been done already?

Boris.

@laskoviymishka
Contributor

Probably related to #204. Need to test how it goes with that patch.

@work-vv

work-vv commented Feb 21, 2025

@BorisTyshkevich, @laskoviymishka Isn't this the same question you discussed not long ago? #209

@BorisTyshkevich
Author

@work-vv It's a different topic, not related to position management.

There are two problems when dealing with really big tables:

  • SELECT * FROM table could fail for various reasons, forcing us to start over from the very beginning
  • to stay consistent with ongoing updates, we have to hold a lock on the table for the duration of that huge, long-running select

This can't be solved in a general way for arbitrary DB-to-DB transfers. However, for a ClickHouse destination, when writing to ReplacingMergeTree (with a version column), the latter problem can be solved by copying the historical data after CDC starts.
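For illustration, reading with FINAL collapses the snapshot/CDC overlap at query time (reusing the hypothetical orders_sync table from above):

```sql
-- For each key, only the row with the highest version survives: a CDC update
-- (version = now()) shadows the version = 0 snapshot copy of the same row.
SELECT id, payload
FROM orders_sync FINAL
WHERE id = 42;
```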

Checkpointing is a more complicated topic. Iterating over the source table's PK would probably help; see the sketch below.
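A rough sketch of what PK-based checkpointing could look like on the source side (keyset pagination; the orders table, its columns, and the :last_id bind parameter are hypothetical):

```sql
-- Read the source table in PK-ordered chunks; after each chunk is committed
-- on the destination, persist the last id seen. On failure, resume from that
-- checkpoint instead of restarting the whole SELECT *.
SELECT id, payload
FROM orders
WHERE id > :last_id          -- :last_id = checkpoint from the previous chunk
ORDER BY id
LIMIT 100000;
```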

@work-vv

work-vv commented Feb 21, 2025

@BorisTyshkevich Looks like we're tackling the same problem in different ways. How about forgetting the table lock and transferring the historical data to ClickHouse at a comfortable time and pace, then syncing later with ./trcli replicate starting from a known binlog position? The gap will be overwritten with the correct data changes.

@BorisTyshkevich
Author

This manual approach is also valid, and I use it.

However, the zero-version method allows automating the process: you can run a StatefulSet (doing CDC) and a Job (doing the historical load) simultaneously.

My current task is creating a Helm chart with simple descriptions and actions, ready to be exposed in a UI. Manually saving and restoring the CDC position would be very hard to wrap in a user-friendly UI.

@laskoviymishka changed the title from "really big tables snapshotting" to "RND: really big tables snapshotting" on Feb 21, 2025
@laskoviymishka added the enhancement (New feature or request) label on Feb 21, 2025