RND: really big tables snapshotting #223
Probably related to #204
@BorisTyshkevich, @laskoviymishka Isn't this the same question you discussed not long ago? #209
@work-vv It's a different topic, not related to position management. There are two problems when dealing with really big tables: checkpointing the initial load so it can resume after a failure, and holding a lock on the source for too long during the snapshot.
This can't be solved in a general way for arbitrary DB-to-DB transfers. However, for a ClickHouse destination, when writing to ReplacingMergeTree (with a version column), the latter problem can be solved by copying the historic data after CDC starts. Checkpointing is a more complicated topic; probably, iterating over the src table's PK would help.
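The PK-iteration idea can be sketched as keyset pagination with a persisted resume point. This is a minimal illustration, not the transfer's actual implementation; `snapshot_with_checkpoint`, `fetch_page`, and `save_checkpoint` are hypothetical names introduced here.

```python
# Sketch of PK-based checkpointing for a large snapshot: instead of one
# big "SELECT *", page through the table ordered by primary key and
# durably record the last key seen, so a failed load can resume from the
# checkpoint instead of restarting from the very beginning.

def snapshot_with_checkpoint(fetch_page, save_checkpoint, last_pk=None, page_size=3):
    """fetch_page(after_pk, limit) -> rows ordered by pk; rows are (pk, data)."""
    while True:
        rows = fetch_page(last_pk, page_size)
        if not rows:
            break
        for pk, data in rows:
            yield pk, data
        last_pk = rows[-1][0]        # highest pk in this page
        save_checkpoint(last_pk)     # durable resume point

# Tiny in-memory stand-in for the source table.
TABLE = [(i, f"row-{i}") for i in range(1, 8)]

def fetch_page(after_pk, limit):
    # Equivalent of: SELECT pk, data FROM t WHERE pk > :after ORDER BY pk LIMIT :n
    start = 0 if after_pk is None else after_pk
    return [r for r in TABLE if r[0] > start][:limit]

checkpoints = []
loaded = list(snapshot_with_checkpoint(fetch_page, checkpoints.append))
# Resuming from a saved checkpoint of 3 re-reads only rows with pk > 3.
resumed = list(snapshot_with_checkpoint(fetch_page, lambda _: None, last_pk=3))
```

Each page is bounded by `WHERE pk > :after ... LIMIT :n`, so the query stays cheap on the source regardless of how far into the table the load has progressed.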
@BorisTyshkevich Looks like we're tackling the same problem in different ways. How about forgetting the table lock and transferring the historical data to ClickHouse at a convenient time and in a convenient way, then syncing it later with `./trcli replicate` starting from a known binlog position? The gap will be rewritten with the correct data changes.
This manual approach is also valid, and I use it. However, the zero-version method allows automating the process: you can run a StatefulSet (doing CDC) and a Job (doing the historical load) simultaneously. My current task is creating a Helm Chart with simple descriptions and actions ready to be placed in a UI. Manually saving and restoring the CDC position would be very complicated to expose in a user-friendly UI.
When we need to replicate a very big table from Postgres/MySQL/etc. with 1B rows and 1 TB of data, a simple
`SELECT * FROM the_table`
could be problematic and fail for various reasons. Do we have any checkpointing mechanism so loading can continue rather than restart from the very beginning each time? How does it work? Another problem is holding a lock on the server or table for too long during the initial load.
My old way (used with Debezium and such) of syncing to ClickHouse was to initiate CDC from the current position, placing data into ReplacingMergeTree(version) with version = now(), and then run a full snapshot with version = 0. That way proves more stable and doesn't create locks in Postgres/MySQL.
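Why the zero-version trick works can be shown with a minimal Python model of ReplacingMergeTree(version) deduplication (a sketch of the merge semantics, not ClickHouse code): on merge, the row with the highest version wins per key, so a CDC row with version = now() always beats a snapshot row with version = 0, regardless of which was inserted first.

```python
# Minimal model of ReplacingMergeTree(version) deduplication:
# keep the row with the highest version for each primary key.
# Snapshot rows get version=0, CDC rows get version=now(), so a
# change captured by CDC always wins over the historic copy even
# when the snapshot is loaded AFTER CDC has started.

def replacing_merge(rows):
    """rows: iterable of (key, version, value); keep max-version row per key."""
    best = {}
    for key, version, value in rows:
        if key not in best or version > best[key][0]:
            best[key] = (version, value)
    return {k: v for k, (v_ver, v) in ((k, t) for k, t in best.items()) for _, v in [t]} if False else {k: t[1] for k, t in best.items()}

cdc_rows = [
    (1, 1700000000, "updated-by-cdc"),   # change captured after CDC started
]
snapshot_rows = [
    (1, 0, "historic-copy"),             # same key, loaded later with version=0
    (2, 0, "historic-only"),             # key untouched since the snapshot
]

# Snapshot applied AFTER the CDC change: the CDC row still wins for key 1.
merged = replacing_merge(cdc_rows + snapshot_rows)
```

This is why no lock or saved position is needed for the historic load: the snapshot can never overwrite a newer CDC change, only fill in rows CDC hasn't touched.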
How can I implement this with the transfer? Is it possible to set version = 0 for SNAPSHOT_ONLY mode? Or maybe it has been done already?
Boris.