
The slave sync thread model is not reasonable #2637

Closed
cheniujh opened this issue May 7, 2024 · 1 comment · Fixed by #2638
Labels
☢️ Bug Something isn't working

Comments

@cheniujh
Collaborator

cheniujh commented May 7, 2024

Is this a regression?

No

Description

Currently, the thread model a Pika slave uses to consume the binlog works like this:
1. Pika reads sync-thread-num from the conf file and spawns sync-thread-num * 2 worker threads. The first half of these workers are chosen to apply the binlog; the second half are used to apply the writes to the DB.
2. To guarantee consumption order, each DB's binlog is always handled by the same worker. The worker is selected by hashing db_name to get a worker index, so one fixed worker is picked from the first half of the worker vector.
3. After a worker finishes applying the binlog, it hashes the key to get an index, picks a worker from the second half of the worker array, and submits an asynchronous WriteDB task to it.
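
A minimal C++ sketch of this selection scheme (illustrative only; the names here are hypothetical and not taken from Pika's code):

```cpp
#include <functional>
#include <string>
#include <vector>

struct Worker { /* owns a task queue and a worker thread */ };

size_t sync_thread_num = 6;                        // read from the conf file
std::vector<Worker> workers(2 * sync_thread_num);  // step 1: 2x workers

// Step 2: each DB is pinned to one fixed binlog worker in the first half.
Worker& BinlogWorkerFor(const std::string& db_name) {
  size_t idx = std::hash<std::string>{}(db_name) % sync_thread_num;
  return workers[idx];  // always the same worker for a given DB
}

// Step 3: the async WriteDB task goes to the second half, keyed by the row key.
Worker& WriteDBWorkerFor(const std::string& key) {
  size_t idx = std::hash<std::string>{}(key) % sync_thread_num;
  return workers[sync_thread_num + idx];
}
```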

The problems are:
Regarding point 1: users are not told that Pika internally multiplies sync-thread-num by 2, which is arguably inappropriate, and it also means they cannot precisely control the actual thread count. For example, to keep the WriteDB side from having too few threads, the default value of this config item is 6, so Pika ends up with 12 workers in total: the first 6 write the binlog and the last 6 write the DB. With a single DB, 5 of the first 6 workers are idle and will never be used.
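
To make the arithmetic concrete, here is an illustrative pika.conf excerpt (the comments describe the behavior reported above):

```
# pika.conf (illustrative excerpt)
sync-thread-num : 6
# Pika actually spawns 6 * 2 = 12 workers:
#   workers[0..5]   write the binlog (one fixed worker per DB)
#   workers[6..11]  run the async WriteDB tasks
# With a single DB, 5 of workers[0..5] are never used.
```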
Regarding point 2: hashing db_name to get the index is prone to skew. In an actual test with 8 DBs and sync-thread-num set to 8, the hash mapping came out as:
DB 1, 4, and 7 all bind to worker 3;
DB 3 and 6 bind to worker 0;
DB 0 binds to worker 2;
DB 2 binds to worker 7;
DB 5 binds to worker 6.
(Here "bind" means the DB schedules its WriteBinlog tasks only to that worker thread.)
Workers 1, 4, and 5 are completely idle, yet with 8 DBs and 8 workers dedicated to writing the binlog, each DB could perfectly well have its own worker.
This skew not only wastes resources; more importantly, if some DB is temporarily blocked by a WriteStall, the blockage can be amplified, because the shared worker also blocks the WriteBinlog tasks of the other DBs.
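
A quick way to check such a mapping yourself; note that the exact result depends on the standard library's std::hash implementation, and the db0..db7 naming is an assumption:

```cpp
#include <functional>
#include <iostream>
#include <string>

int main() {
  const size_t kSyncThreadNum = 8;  // sync-thread-num = 8, as in the test above
  for (int i = 0; i < 8; ++i) {
    // Hash each DB name onto the binlog workers, as described in step 2.
    std::string db_name = "db" + std::to_string(i);
    size_t worker = std::hash<std::string>{}(db_name) % kSyncThreadNum;
    std::cout << db_name << " -> worker " << worker << '\n';
  }
  return 0;
}
```

Any two names that collide modulo 8 land on the same worker, which is exactly the amplification risk described above.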

Please provide a link to a minimal reproduction of the bug

No response

Screenshots or videos

No response

Please provide the version you discovered this bug in (check about page for version information)

No response

Anything else?

No response

@cheniujh cheniujh added the ☢️ Bug Something isn't working label May 7, 2024
@cheniujh
Collaborator Author

cheniujh commented May 7, 2024

Description:

Currently, Pika's thread model for consuming the binlog on slave nodes is as follows:

  1. Pika reads sync-thread-num from the configuration file and spawns sync-thread-num * 2 worker threads. The first half of these workers are selected to apply the binlog, while the second half are used to apply the writes to the database (DB).
  2. To ensure the order of consumption, each DB's binlog is processed by the same worker. The worker selection strategy hashes the db_name to obtain a worker index, and a fixed worker is chosen from the first half of the worker vector.
  3. After a worker completes applying the binlog, it hashes the key to obtain an index and selects a worker from the second half of the worker array to submit an asynchronous WriteDB task.

The issues are:
Regarding point 1, users are not informed that Pika internally multiplies sync-thread-num by two, which is arguably inappropriate and also prevents them from precisely controlling the actual number of threads. For instance, to ensure that the number of WriteDB threads is not too small, the default value of this configuration item is 6, resulting in a total of 12 workers inside Pika: the first six write the binlog, and the last six write the DB. In a single-DB scenario, five of the first six workers remain idle and are never used.

Regarding point 2, using the db_name hash to obtain an index can lead to an imbalance. For example, in a test with 8 DBs and sync-thread-num set to 8, the hash mapping results in:

  • DB 1, 4, and 7 all binding to worker 3;
  • DB 3 and 6 binding to worker 0;
  • DB 0 binding to worker 2;
  • DB 2 binding to worker 7;
  • DB 5 binding to worker 6.

In this setup, workers 1, 4, and 5 are completely idle. Although there are 8 DBs and an equal number of workers dedicated to writing the binlog, each DB could simply have its own worker. This imbalance not only wastes resources but also risks amplifying a WriteStall: a stall in one DB blocks the binlog-writing tasks of every other DB sharing the same worker.
