Adjust io-threads number at runtime to fully utilize multiple threads and make CPU usage more efficient. #111
Conversation
Signed-off-by: Lipeng Zhu <[email protected]>
@valkey-io/core-team Could you help to review this patch?
@lipzhu I noticed there wasn't a test; we should at least add one for that.
@madolson I am thinking about how to add unit tests for io-threads. Since it is for performance purposes, I am uncertain how to simulate it in unit tests; can you provide some guidance?
Signed-off-by: Lipeng Zhu <[email protected]>
Ping @madolson.
    }
}

test "Enable io-threads" {
Can we add a test case that validates:
- the active thread count can increase to the max
- the active thread count can decay back to 1

Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks, but I think it will be worth the effort.
> Writing a reliable test for this feature is tricky and we might need to introduce additional DEBUG hooks but I think it will be worth the effort

Can you provide more guidance on how to introduce additional DEBUG hooks? Locally, I used memtier_benchmark -c 25 -t 4 to simulate multiple clients/requests. I am not sure whether the TCL test framework has a similar API, or where I can verify correctness, e.g. by debug-printing io_threads_active_num?
Kindly ping @PingXie
I was thinking about adding some new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c but it doesn't look like you've completed it). After seeing #344, I think a real unit test would be preferable.
@PingXie Thanks for your effort to review this.

> I was thinking about adding some new DEBUG command to modify the engine behavior (it looks like you made a change in debug.c but it doesn't look like you've completed it)

I just added a monitor for the DEBUG io-threads command; sorry, I didn't get your point about modifying the engine behavior, can you be more specific?

> After seeing #344, I think a real unit test would be preferable.

Do you want to add the unit test in this PR, or should we add it once the new test framework is merged?
> can you be more specific?

No worries. I was thinking of manipulating a data structure like clients_pending_write in a new DEBUG command to simulate different workloads. Now that we will have a true unit test framework, I don't think we should bother with this hack at all. Do you mind reverting the changes in debug.c?

> Do you want to add the unit test in this PR, or should we add it once the new test framework is merged?

I think @madolson is actively working on #344 and it is also close to completion. That said, I am fine with adding the tests in a separate PR; if you decide so, could you open an issue and assign it to yourself?
I like the idea of adding some DEBUG commands so we can properly do TCL tests of IO threading features. It probably doesn't make sense to play with the pending writes directly, but something like DEBUG set-active-io-thread-count could be a good DEBUG command, even outside of just TCL testing, so that we can force IO threads on and off. E.g. I remember debugging some TLS-related issues with IO threading that only happened when IO threads turned off, and it required me to run and stop valkey-benchmark repeatedly.
So I would suggest:
- Unit testing the "pending writes" to "active IO threads" logic
- E2E testing the whole IO thread flow with a DEBUG command that sends commands while adjusting the IO thread count up and down
But we can follow up on those improvements.
WDYT?
That is a good point. I think it would be good to address it separately though. Let's open a separate issue?
For this change, I am mostly interested in proving that the EMA logic is working properly (it can scale up and also scale back down) so I think a real unit test would be a lot less flaky to maintain in the long run.
@PingXie #396 #397 were created to track the debug commands and tests. @murphyjacob4 please feel free to add. BTW, I have no permission to assign the issues.
My understanding of this change is that it helps with the volatile workload case where request rates fluctuate quite a bit. This change doesn't allocate/activate more IO worker threads at runtime, but it allows the IO worker threads to stay active a bit longer using an Exponential Moving Average (EMA). This trades a bit of CPU efficiency for more consistent latency through these workload spikes/troughs, which can be significant. Overall I am on board with this change. I think the only thing I am looking forward to is a test that proves we can still activate all of the configured threads.
I don't think we need a (unit) test for performance. We will use release candidates to get feedback. What we really need is a correctness test, as I mentioned above.
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@ Coverage Diff @@
## unstable #111 +/- ##
============================================
+ Coverage 70.02% 70.47% +0.45%
============================================
Files 109 109
Lines 59957 59955 -2
============================================
+ Hits 41985 42254 +269
+ Misses 17972 17701 -271
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@PingXie I added an integration test to validate that the io_threads_active_num EMA logic works properly. Could you help to review again?
Signed-off-by: Lipeng Zhu <[email protected]>
LGTM. Some final touches and this PR is good to go.
tests/integration/io-threads.tcl
Outdated
regexp {io_threads_active:([0-9]+)} $server_info -> \
    io_threads_active
regexp {io_threads_maximum_num:([0-9]+)} $server_info -> \
    io_threads_maximum_num
regexp {io_threads_active_num:([0-9]+)} $server_info -> \
    io_threads_active_num
Suggested change (replace the three regexp calls with):

set io_threads_active [get_info_field $server_info io_threads_active]
set io_threads_maximum_num [get_info_field $server_info io_threads_maximum_num]
set io_threads_active_num [get_info_field $server_info io_threads_active_num]
Thanks. proc get_info_field is a common proc in the cluster module; I just moved it to support/util.tcl.
# Scale up io_threads_active_num by sending a large number of requests to the server.
set bench_pid [start_benchmark 10 50]
wait_for_condition 1000 500 {
This is quite a loop, but judging by your test runs, this test doesn't seem to increase the test execution time.
Another concern of mine is the potential flakiness of this test, since it has a non-deterministic dependency by nature.
That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.
> this test doesn't seem to increase the test execution time

It depends on the write speed; the condition of reaching a total count of 10000 should be met very fast.

> Another concern of mine is the potential flakiness of this test, since it has a non-deterministic dependency by nature.
> That said, I am fine with keeping this test for now. If push comes to shove, we can remove it, since we will get coverage from #397 as well.

Got it.
Thanks @lipzhu!
Signed-off-by: Lipeng Zhu <[email protected]>
What is the next step? Is it ready to merge? @PingXie
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
LGTM, just two minor comments.
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
@valkey-io/core-team Need approval for adding a single new metric. The real-world use of the metric is probably so an operator can understand how many of the CPUs are actually doing useful work when debugging.
For some context, we already have The config
The
Yes,
void startThreadedIO(void) {
    serverAssert(server.io_threads_active == 0);
    for (int j = 1; j < server.io_threads_num; j++) pthread_mutex_unlock(&io_threads_mutex[j]);
    server.io_threads_active = 1;
}
These changes will just cause merge conflicts for the Async IO threads feature. It's better that we don't merge it now.
I saw that the Async IO threads work removes startThreadedIO and uses a dynamic thread number too; this is similar to this patch. The difference is that Async IO threads keep the threshold based on the real-time event count, while this patch uses EMA to achieve more consistent latency through request spikes/troughs.
Maybe we could merge this first and then rebase Async IO threads on it? If needed, I could help with the rebase work.
@uriyage WDYT?
Signed-off-by: Lipeng Zhu <[email protected]>
Signed-off-by: Lipeng Zhu <[email protected]>
So we are still waiting for approval from @soloestoy @enjoy-binbin; could you help take a look at this when you are free?
LGTM, did not review the tests
I think it's better to merge Async IO threads first and then add this one afterwards, since this one is a smaller change.
@lipzhu Do you want to update this PR to handle the new IO threads, or shall we close this?
Sure, let's close this PR, as most of the logic is already covered by the new IO threads.
Description
This patch tries to fix an issue where, with an io-threads number configured, the IO threads are stopped according to the logic below, leaving them unused and causing longer latency. In the existing logic, the IO threads are activated by a hardcoded threshold value, which causes QPS regressions that vary with the configured io-threads number even for the same client requests.
With this patch, the active IO thread count is adjusted dynamically at runtime according to the real workload, to trade off between latency and CPU efficiency. I use a decay-rate formula based on clients_pending_write to keep throughput stable.
Benchmark Result
Test Environment
Test Steps
Start valkey-server
Start valkey-server with the config below:
taskset -c 0-3 ~/src/valkey-server /tmp/valkey_1.conf
Test from throughput perspective
Start valkey-benchmark on localhost with the command below to measure SET/GET performance; results are listed below:
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 10
We can see that the IO threads are not used, because clients_pending_write < (server.io_threads_num*2); throughput is limited to a single thread even though more CPU resources were allocated.
NOTE: QPS will use relative values instead of absolute values.
CPU utilized:
perf stat -p `pidof valkey-server` sleep 5
Test from CPU efficiency perspective
If we increase the requests number by below commands to make origin io-threads work, throughput increased as expected, but all CPU are fully utilized. With this patch, we can make CPU more efficiency by reducing the active io-threads number.
taskset -c 4-7 ./src/valkey-benchmark -p 9001 -t set,get -d 128 -r 5000000 -n 10000000 --threads 2 -c 15
One test case issue is related to #397