RND: really big tables snapshotting #223

Open · BorisTyshkevich opened this issue Feb 20, 2025 · 5 comments
Labels: enhancement (New feature or request)

Comments

@BorisTyshkevich

When we need to replicate a very big table from Postgres/MySQL/etc. with 1B rows and 1 TB of data, a simple SELECT * from the table could be problematic and fail for various reasons. Do we have any checkpointing mechanism to continue loading without starting from the very beginning each time? How does it work?

Another problem is holding a lock on the server or table for too long during the initial load.

My old way (used with Debezium and the like) for syncing to ClickHouse was to start CDC from the current position, placing data into ReplacingMergeTree(version) with version = now(), and then run a full snapshot with version = 0. This approach proved to be more stable and doesn't hold locks in Postgres/MySQL.
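A minimal sketch of that layout on the ClickHouse side (the orders_sync table and its columns are hypothetical, assuming a numeric version column):

```sql
-- ReplacingMergeTree keeps, per sorting key, only the row with the
-- highest `version` once parts are merged in the background.
CREATE TABLE orders_sync
(
    id      UInt64,
    payload String,
    version UInt64              -- now() for CDC rows, 0 for snapshot rows
)
ENGINE = ReplacingMergeTree(version)
ORDER BY id;

-- The CDC stream, started from the current position, writes version = now():
INSERT INTO orders_sync VALUES (42, 'row updated via CDC', toUnixTimestamp(now()));

-- The full snapshot runs in parallel (or later) with version = 0, so a CDC
-- row for the same key always wins over its snapshot copy:
INSERT INTO orders_sync VALUES (42, 'row from snapshot', 0);
```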

How can I implement this with transfer? Is it possible to set version = 0 for SNAPSHOT_ONLY mode? Or maybe it has been done already?

Boris.

@laskoviymishka
Contributor

Probably related to #204. Need to test how it goes with that patch.

@work-vv

work-vv commented Feb 21, 2025

@BorisTyshkevich, @laskoviymishka Isn't this the same question you discussed not long ago? #209

@BorisTyshkevich
Author

@work-vv It's a different topic, not related to position management.

There are two problems when dealing with really big tables:

  • SELECT * FROM table could fail for various reasons, forcing us to start over from the very beginning
  • to stay consistent with ongoing updates, we have to hold a lock on the table for the duration of that huge, long-running select

This can't be solved in a general way for arbitrary DB-to-DB transfers. However, for a ClickHouse destination, when writing to ReplacingMergeTree (with a version column), the latter problem can be solved by copying the historical data after CDC starts.
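For illustration, reading with FINAL collapses the snapshot/CDC overlap at query time (reusing the hypothetical orders_sync table from above):

```sql
-- For each key, only the row with the highest version survives: a CDC update
-- (version = now()) shadows the version = 0 snapshot copy of the same row.
SELECT id, payload
FROM orders_sync FINAL
WHERE id = 42;
```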

Checkpointing is a more complicated topic. Iterating over the source table's PK would probably help; see the sketch below.
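A rough sketch of what PK-based checkpointing could look like on the source side (keyset pagination; the orders table, its columns, and the :last_id bind parameter are hypothetical):

```sql
-- Read the source table in PK-ordered chunks; after each chunk is committed
-- on the destination, persist the last id seen. On failure, resume from that
-- checkpoint instead of restarting the whole SELECT *.
SELECT id, payload
FROM orders
WHERE id > :last_id          -- :last_id = checkpoint from the previous chunk
ORDER BY id
LIMIT 100000;
```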

@work-vv

work-vv commented Feb 21, 2025

@BorisTyshkevich Looks like we're tackling the same problem in different ways. How about forgetting the table lock and transferring the historical data to ClickHouse at a comfortable time and pace, then syncing later with ./trcli replicate starting from a known binlog position? The gap will be overwritten with the correct data changes.

@BorisTyshkevich
Author

This manual approach is also valid, and I use it.

However, the zero-version method allows automating the process: you can run a StatefulSet (doing CDC) and a Job (doing the historical load) simultaneously.

My current task is creating a Helm chart with simple descriptions and actions, ready to be exposed in a UI. Manually saving and restoring the CDC position would be very hard to wrap in a user-friendly UI.

@laskoviymishka changed the title from "really big tables snapshotting" to "RND: really big tables snapshotting" on Feb 21, 2025
@laskoviymishka added the enhancement (New feature or request) label on Feb 21, 2025