Poor performance due to single thread and mutexes #1803
A few other findings:
Here is the profiling for a buffered gRPC call. The CPU is still underutilized and GC is triggered very often.
Thank you for your descriptive write-up of the issue. You are on point about the inefficient processing of transactions by the scheduler. Personally, I am migrating to a new project that plans to use Ligato more extensively, so I might actually do some optimizations in this area (I planned to optimize the allocation of strings in the kvscheduler).
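To illustrate the kind of string-allocation optimization mentioned above, here is a minimal, hypothetical sketch of key interning; the kvscheduler does not necessarily work this way, and all names are illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// internPool deduplicates frequently repeated key strings so that
// identical keys share one backing allocation instead of many.
// This is only an illustration of the idea, not vpp-agent code.
type internPool struct {
	mu   sync.RWMutex
	keys map[string]string
}

func newInternPool() *internPool {
	return &internPool{keys: make(map[string]string)}
}

// Intern returns a canonical copy of key, storing it only once.
func (p *internPool) Intern(key string) string {
	p.mu.RLock()
	if canon, ok := p.keys[key]; ok {
		p.mu.RUnlock()
		return canon
	}
	p.mu.RUnlock()

	p.mu.Lock()
	defer p.mu.Unlock()
	if canon, ok := p.keys[key]; ok {
		return canon
	}
	p.keys[key] = key
	return key
}

func main() {
	pool := newInternPool()
	a := pool.Intern("config/vpp/v2/interfaces/loop0")
	b := pool.Intern("config/vpp/v2/interfaces/loop0")
	fmt.Println(a == b) // true: both refer to the same interned string
}
```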
I've started code analysis to increase performance in a separate fork: https://github.com/1NCE-GmbH/vpp-agent. I'll describe my pre-investigation and would like to discuss VPP-agent implementation details and how to better optimize its workflow. First of all, here are the traces and profiler results: vpp-agent-profile.zip
During the analysis it became clear that the KV Scheduler is a single point of throttling. When launching 30000 operations via a gRPC channel, this one entity blocks all other threads while performing consumeTransactions() -> processTransaction() calls. Multi-threaded execution effectively degrades into single-threaded operation of the VPP-agent: all other gRPC calls are waiting on one single lock. Removing this particular mutex alone will not solve the issue, however, because those goroutines would then park on the channel. So the main target for optimization is the KV Scheduler. I'll try to refactor this module to utilize as much CPU time as possible, but I'd like to consult with the project owners and maintainers to make this process more robust. My questions to the maintainers of VPP-agent:
CC: @ondrej-fabry, @milanlenco
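For context, the bottleneck pattern described above looks roughly like the following simplified sketch. The names consumeTransactions, processTransaction, and txnQueue mirror the scheduler's terminology, but the code is illustrative, not the actual implementation: many producers enqueue onto one channel, and a single goroutine drains it under one lock.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type transaction struct{ id int }

// scheduler models the single-consumer pattern: many gRPC handlers
// feed one queue, one goroutine drains it, and a single mutex guards
// the shared graph.
type scheduler struct {
	txnQueue chan transaction
	graphMu  sync.Mutex
}

func (s *scheduler) consumeTransactions() {
	for txn := range s.txnQueue {
		s.processTransaction(txn)
	}
}

func (s *scheduler) processTransaction(txn transaction) {
	s.graphMu.Lock() // every transaction serializes here
	defer s.graphMu.Unlock()
	time.Sleep(time.Millisecond) // stand-in for graph updates and CRUD calls
	fmt.Println("processed", txn.id)
}

func main() {
	s := &scheduler{txnQueue: make(chan transaction, 100)}
	go s.consumeTransactions()

	// Many concurrent "gRPC handlers" enqueue, but throughput is
	// bounded by the single consumer regardless of GOMAXPROCS.
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			s.txnQueue <- transaction{id: id}
		}(i)
	}
	wg.Wait()
	time.Sleep(50 * time.Millisecond) // let the consumer drain the queue
}
```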
After solving the KV Scheduler issue I'll create follow-up issues regarding GC and memory usage, mutexes, and the txnQueue size. There is a possibility that some of them will be solved together with the Scheduler problems, so let's postpone creating new issues.
Started experimenting with a multithreaded approach.
To overcome this lock the Graph must be refactored. This can be done by moving mutex locking inside the API methods. It would also require decoupling Node from Graph: currently Nodes directly access the Graph during Targets re-configuration. Either the Target struct might contain direct pointers to other nodes, or the Graph itself should be responsible for Targets. Investigation is ongoing; any comment from the maintainers would help.
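A minimal sketch of what "moving mutex locking inside the API methods" could look like, assuming a simplified graph with nodes and targets. All types and names here are illustrative, not the actual kvscheduler graph package:

```go
package main

import "sync"

// node is a simplified graph node; it keeps only keys of its targets
// instead of direct pointers, so it never has to touch the graph itself.
type node struct {
	key     string
	targets []string // keys of dependent nodes, resolved by the graph
}

// graph owns the lock and resolves targets on behalf of nodes, so
// callers never hold an exclusive lock across a whole transaction,
// only for the duration of a single API call.
type graph struct {
	mu    sync.RWMutex
	nodes map[string]*node
}

func newGraph() *graph {
	return &graph{nodes: make(map[string]*node)}
}

// SetNode locks internally; the caller does not manage the mutex.
func (g *graph) SetNode(key string, targets []string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.nodes[key] = &node{key: key, targets: targets}
}

// Targets resolves target nodes under a read lock, allowing
// concurrent readers while writers are serialized.
func (g *graph) Targets(key string) []*node {
	g.mu.RLock()
	defer g.mu.RUnlock()
	n, ok := g.nodes[key]
	if !ok {
		return nil
	}
	out := make([]*node, 0, len(n.targets))
	for _, t := range n.targets {
		if tn, ok := g.nodes[t]; ok {
			out = append(out, tn)
		}
	}
	return out
}

func main() {
	g := newGraph()
	g.SetNode("route1", []string{"if1"})
	g.SetNode("if1", nil)
	_ = g.Targets("route1")
}
```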
Just want to note that Milan is no longer working with me, as he started working for another company and most likely will not reply here. However, before he left he was in the process of preparing a refactor of the KVScheduler; you could look into it to get some idea of the planned changes. There are currently no plans to work on this at the moment, but possibly with your help we could make these ready. Here is the most recent one, which was supposed to introduce the new descriptor API v2 that replaces the key parameter with a more flexible context; and here is a slightly older planned change, which is more related to your discoveries and was supposed to actually add support for parallel processing of transactions:
@ondrej-fabry, thank you for the response. I studied the provided branches. This one optimizes GC. I'm thinking of splitting the current issue into two steps: one for parallel txn processing, the other for GC and memory management. In the scope of this issue I'll focus on the Scheduler txn handling, and after that I'll create a follow-up for memory. This one does relate to the performance issue; however, it doesn't solve the root cause of exclusive Graph write locking. I'll continue the investigation into refactoring the Graph handling. To refactor the Graph, the following steps must be performed:
A small update. @ondrej-fabry, thanks for the feedback on my attempts to improve the Scheduler. I hope the next revisions of the VPP-agent will become more performant. If you would like to continue this work, then I encourage you to start with refactoring the Graph module.
After some profiling it seems that these 2 functions impact performance when pushing many operations:
The problem is that the VPP agent processes operations in a single thread with mutex/queue blocking. Instead, it should utilize cheap goroutines and sync.Pool memory.
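As an illustration of the suggested direction (not the actual vpp-agent code), here is a minimal sketch of a worker pool that processes operations on several goroutines and reuses per-operation buffers via sync.Pool; all names are hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"runtime"
	"sync"
)

// bufPool reuses scratch buffers between operations to reduce GC pressure.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

type operation struct{ key, value string }

func process(op operation) {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	// Stand-in for serialization / graph-update work.
	fmt.Fprintf(buf, "%s=%s", op.key, op.value)
}

func main() {
	ops := make(chan operation, 1024)

	// One worker per CPU instead of a single consumer goroutine.
	var wg sync.WaitGroup
	for i := 0; i < runtime.NumCPU(); i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for op := range ops {
				process(op)
			}
		}()
	}

	for i := 0; i < 30000; i++ {
		ops <- operation{key: fmt.Sprintf("key/%d", i), value: "val"}
	}
	close(ops)
	wg.Wait()
}
```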
To collect the trace I used the pprof tool in the main() function and pushed 30000 Update commands via a gRPC connection.
The trace will show an underutilized CPU and long wait times on the mentioned functions.
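The exact invocation isn't included above; a minimal way to reproduce this kind of profile from main(), assuming the standard runtime/pprof and runtime/trace packages, looks roughly like this:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
	"runtime/trace"
)

func main() {
	// CPU profile: analyze later with `go tool pprof cpu.out`.
	cpuFile, err := os.Create("cpu.out")
	if err != nil {
		log.Fatal(err)
	}
	defer cpuFile.Close()
	if err := pprof.StartCPUProfile(cpuFile); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	// Execution trace: analyze later with `go tool trace trace.out`,
	// which shows goroutine scheduling and blocking on mutexes/channels.
	traceFile, err := os.Create("trace.out")
	if err != nil {
		log.Fatal(err)
	}
	defer traceFile.Close()
	if err := trace.Start(traceFile); err != nil {
		log.Fatal(err)
	}
	defer trace.Stop()

	// ... run the agent and push the 30000 Update calls over gRPC here ...
}
```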