Adjust io-threads number at runtime to fully utilize multiple threads and make CPU usage more efficient. #111
Conversation
Signed-off-by: Lipeng Zhu <[email protected]>
@valkey-io/core-team Could you help to review this patch?
@lipzhu I noticed there wasn't a test; we should at least add one for that.
@madolson I am thinking about how to add unit tests for io-threads. Since it is for performance purposes, I am uncertain how to simulate it in unit tests; can you provide some guidance?
Signed-off-by: Lipeng Zhu <[email protected]>
Ping @madolson.
    }
}

test "Enable io-threads" {
Can we add a test case that validates:
- the active thread count can increase to the max
- the active thread count can decay back to 1

Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks, but I think it will be worth the effort.
> Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks but I think it will be worth the effort

Can you provide more guidance on how to introduce additional DEBUG hooks? Locally, I used memtier_benchmark -c 25 -t 4 to simulate multiple clients/requests. I am not sure whether the TCL test framework has a similar API, or where I can verify correctness, e.g. by debug-printing io_threads_active_num?
Kindly ping @PingXie
I was thinking about adding some new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c but it doesn't look like you've completed it). After seeing #344, I think a real unit test would be preferable.
@PingXie Thanks for your effort to review this.

> I was thinking about adding some new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c but it doesn't look like you've completed it)

I just added a monitor for the DEBUG io-threads command; sorry, I didn't get your point about modifying the engine behavior, can you be more specific?

> After seeing #344, I think a real unit test would be preferable.

Do you want to add the unit test in this PR, or should we add it once the new test framework is merged?
> can you be more specific?

No worries. I was thinking of manipulating a data structure like clients_pending_write in a new DEBUG command to simulate different workloads. Now that we will have a true unit test framework, I don't think we should bother with this hack at all. Do you mind reverting the changes in debug.c?

> Do you want to add the unit test in this PR, or should we add it once the new test framework is merged?

I think @madolson is actively working on #344 and it is also close to completion. That said, I am fine with adding the tests in a separate PR; if you decide so, could you open an issue and assign it to yourself?
I like the idea of adding some DEBUG commands so we can properly do TCL tests of IO threading features. It probably doesn't make sense to play with the pending writes directly, but something like DEBUG set-active-io-thread-count could be a good DEBUG command, even outside of just TCL testing, so that we can force IO threads on and off. E.g. I remember debugging some TLS-related issues with IO threading that only happened when IO threads turned off, and it required me to run and stop valkey-benchmark repeatedly.
So I would suggest:
- Unit testing the "pending writes" to "active IO threads" logic
- E2E testing the whole IO thread flow with a DEBUG command that sends commands while adjusting the IO thread count up and down
But we can follow up on those improvements.
WDYT?
That is a good point. I think it would be good to address it separately though. Let's open a separate issue?
For this change, I am mostly interested in proving that the EMA logic is working properly (it can scale up and also scale back down) so I think a real unit test would be a lot less flaky to maintain in the long run.
@PingXie #396 #397 were created to track the debug commands and tests. @murphyjacob4 please feel free to add. BTW, I have no permission to assign the issues.
My understanding of this change is that it helps with the volatile workload case where request rates fluctuate quite a bit. This change doesn't allocate/activate more IO worker threads at runtime, but it allows the IO worker threads to stay active a bit longer using an Exponential Moving Average (EMA). This trades a bit of CPU efficiency for more consistent latency through these workload spikes/troughs, which can be significant. Overall I am on board with this change. I think the only thing I am looking forward to is a test that proves we can still activate all of the configured threads.
I don't think we need a (unit) test for performance. We will use release candidates to get feedback. What we really need is a correctness test, as I mentioned above.
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## unstable #111 +/- ##
============================================
+ Coverage 70.02% 70.47% +0.45%
============================================
Files 109 109
Lines 59957 59955 -2
============================================
+ Hits 41985 42254 +269
+ Misses 17972 17701 -271
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie I added an integration test to validate that the io_threads_active_num EMA logic works properly. Could you help to review again?
Signed-off-by: Lipeng Zhu <[email protected]>
LGTM. Some final touches and this PR is good to go.
tests/integration/io-threads.tcl
Outdated
regexp {io_threads_active:([0-9]+)} $server_info -> \
    io_threads_active
regexp {io_threads_maximum_num:([0-9]+)} $server_info -> \
    io_threads_maximum_num
regexp {io_threads_active_num:([0-9]+)} $server_info -> \
    io_threads_active_num
Suggested change (replace the three regexp calls with):

set io_threads_active [get_info_field $server_info io_threads_active]
set io_threads_maximum_num [get_info_field $server_info io_threads_maximum_num]
set io_threads_active_num [get_info_field $server_info io_threads_active_num]
Thanks. proc get_info_field is a common proc in the cluster module; I just moved it to support/util.tcl.
# Scale up io_threads_active_num by sending a large number of requests to the server.
set bench_pid [start_benchmark 10 50]
wait_for_condition 1000 500 {
This is quite a loop, but judging by your test runs, this test doesn't seem to increase the test execution time.
Another concern of mine is the potential flakiness of this test, since it has a non-deterministic dependency by nature.
That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.
> this test doesn't seem to increase the test execution time

It depends on the write speed; the condition of reaching a total count of 10000 should be met very fast.

> Another concern of mine is the potential flakiness of this test, since it has a non-deterministic dependency by nature.
> That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.

Got it.
Thanks @lipzhu!
Signed-off-by: Lipeng Zhu <[email protected]>
What is the next step? Is it ready to merge? @PingXie
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
LGTM, just two minor comments.
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@valkey-io/core-team Need approval for adding a single new metric. The real-world use of the metric is probably so an operator can understand how many of the CPUs are actually doing useful work when debugging.
For some context, we already have The config
The
Yes,
void startThreadedIO(void) {
    serverAssert(server.io_threads_active == 0);
    for (int j = 1; j < server.io_threads_num; j++) pthread_mutex_unlock(&io_threads_mutex[j]);
    server.io_threads_active = 1;
}
These changes will just cause merge conflicts for the Async IO threads feature. It's better that we don't merge it now.
I saw that the Async IO threads work removes startThreadedIO and uses a dynamic thread number too; this is similar to this patch. The difference is that Async IO threads keep the threshold based on the real-time event count, while this patch uses EMA to achieve more consistent latency through request spikes/troughs.
Maybe we could merge this first and then rebase Async IO threads on it? If needed, I could help with the rebase work.
@uriyage WDYT?
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
So we are still waiting for approval from @soloestoy @enjoy-binbin; could you help take a look at this when you are free?
LGTM, did not review the tests
I think it's better to merge Async IO threads first and then add this one afterwards, since this one is a smaller change.
@lipzhu Do you want to update this PR to handle the new IO threads, or shall we close this?
Sure, let's close this PR, as most of the logic is already covered by the new IO threads.
Description
This patch tries to fix an issue where, with an io-threads number configured, the IO threads are stopped according to the logic below, leaving them unused and causing longer latency. In the existing logic, the IO threads are activated by a hardcoded threshold value, which causes QPS regressions that vary with the configured io-threads number even for the same client requests.
With this patch, the active IO thread count is adjusted dynamically at runtime according to the real workload, to trade off between latency and CPU efficiency. I use a decay-rate formula based on clients_pending_write to keep throughput stable.
Benchmark Result
Test Environment
Test Steps
Start valkey-server
Start valkey-server with the config below:
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf
Test from throughput perspective
Start valkey-benchmark on localhost with the command below to measure SET/GET performance; results are listed below:
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 10
We can see that the IO threads are not used, because clients_pending_write < (server.io_threads_num*2); throughput is limited to a single thread even though more CPU resources were allocated.
NOTE: QPS will use relative values instead of absolute values.
CPU utilized:
perf stat -p `pidof valkey-server` sleep 5
Test from CPU efficiency perspective
If we increase the requests number by below commands to make origin io-threads work, throughput increased as expected, but all CPU are fully utilized. With this patch, we can make CPU more efficiency by reducing the active io-threads number.
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 15
One test case issue is related to #397