Patching memory leak using tracer #153
Conversation
A quick test against object lifecycles and tracer has confirmed that no object is being leaked, although the parent/child relationship is still kind of messed up. I'm working on fixing that right now.
Brian, Roger, and I had a discussion about the different approaches I can take to observing the state of the native rust layer of the library. A lot of solutions and ideas were explored.

Callbacks

The first approach was keeping the interface and the API consistent by introducing callbacks in the native layer which invoke javascript functions to perform and trigger tracing. This would have worked if not for a caveat: javascript can hook into rust functions by passing in parameters and getting return values back, but this is a one-way street. The rust bindings are used by the javascript layer, but rust cannot pull in javascript functions and execute them. This is because javascript uses an event loop to track what to run and when, while rust has a completely different control and logic flow, so the rust and javascript execution contexts are fundamentally disconnected. Technically, it's possible to run javascript code from rust, as NAPI exposes a way to do so, but the executed callbacks are meant to be fire-and-forget, not something we can wait on to get return values from. This isn't an issue for stopping a span, but the start-span callback returns a span identifier. Thus, the rust code would look like this.

```rust
// Simplified pseudocode: queue a javascript callback through a threadsafe
// function, then block on a channel until the callback's return value arrives.
pub fn javascript_callback(callback: napi::JsFunction) {
  // ...
  let rx = napi::JsReceiver::new();
  napi::call_javascript_function(&callback, napi::Threadsafe::Blocking);
  // Deadlock: the callback has only been queued on the event loop, and the
  // event loop is itself blocked waiting on this receiver.
  let result = rx.recv().unwrap();
}
```

This code is simplified, but it conveys the idea. The rust code is called as a single task inside the event loop. There, NAPI queues the javascript function and immediately waits for a response, but the javascript function hasn't executed yet, only been queued, so the program ends up deadlocking. This is a known limitation of working with javascript callbacks in the native layer, so this approach cannot be continued.

Unix sockets

The next idea I had was to use Unix sockets to communicate between the two different layers. Even disregarding the work that would go into structuring the API (which would basically be a repeat of the work done in ...)

First-party native integration

Another approach is to use first-party native integration. This would result in explicit bindings being available for native code, so we won't have to worry about weird interactions or limitations between them. This, however, introduces its own slew of problems.

Logic/IO separation

The first option was proposed by Roger. He mentioned that we can separate the code so that the logical part resides in the native rust layer, while the IO is handled by javascript events. This allows for a clear separation of concerns and treats javascript as a wrapper for the logic defined by the rust code. This approach would still require rust to interact with javascript, potentially in the same way as the callbacks approach, so a lot of design considerations need to be made.

Separate implementations

Brian's proposal involved implementing two separate layers for tracing. Basically, we keep javascript tracing as-is and implement a dedicated native tracer. Both would operate on their own separate loops and generate data separately. As we will be using ...

Both these approaches run into the same higher-level issue: ...

Fresh perspective

Roger and I had a discussion after sleeping on it. Here, Roger clarified the requirements and the problem statement. The purpose of this issue is to find and patch the memory leak. Tracer is just a tool we were going to use, but if the job can be done without it, then that would also mark the issue as completed. Thus, instead of working around or against the data and control flow defined by NAPI, we should work with it. The bindings of all objects need to be explicitly exposed to the javascript layer, so tracing can be applied directly in the javascript layer itself without worrying about interfacing. To perform rust-layer tracing, we can use dedicated rust profilers to gain observability. We will not get data in the same format as what tracer provides and what the visualiser expects, but it would help us solve the bug. Thus, we don't need to worry about symbols present in quiche which we cannot expose to javascript for manual tracing.

There's another problem that needs to be considered in the design phase. Lower-level languages like rust use a manual memory model, where memory is explicitly allocated and freed. Higher-level languages like javascript use a garbage collector to automatically free memory when it is no longer reachable by the code. As we will be exposing manually-managed heap objects to the garbage-collected javascript heap, there is a chance that the garbage collector has no idea of the actual object lifecycle. In native code, the heap buffer can simply be overwritten after the object it contained has been destroyed, but this is not propagated to the javascript layer, which can then read garbage data from an invalid memory location. Thus, we need to take care of this scenario in our implementation.
This approach is what I will be implementing now.
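To illustrate the lifecycle concern above, here is a hedged sketch of how a javascript wrapper could guard a natively-managed handle so it never reads from freed memory. All names here are hypothetical; this is not the actual js-quic binding.

```ts
// Hypothetical wrapper around an opaque native handle. The native side owns
// the memory; the wrapper refuses to touch it once it has been released.
class NativeHandleWrapper {
  protected destroyed: boolean = false;

  public constructor(protected readonly handle: unknown) {}

  public read(): Uint8Array {
    if (this.destroyed) {
      throw new Error('Handle was destroyed; refusing to read freed memory');
    }
    // The real binding would call into the native layer here.
    return new Uint8Array(0);
  }

  public destroy(): void {
    // Explicit release: the javascript side marks the handle dead at the same
    // moment the native side frees it, instead of relying on the GC.
    this.destroyed = true;
  }
}
```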
Tracer isn't viable for debugging this as of now. With tracer, I was able to confirm that the bug is not present in the javascript layer, but nothing else. So, in the end, I fell back on using typical memory leak detectors like ...

We can generally ignore ... I'm working on figuring out if the issue is in ...
The leak is clearly happening through QUIC's ...
To get to this output, this is what I did. First, I had to tell cargo to include debug symbols and prevent it from optimising the code, so that I get a full backtrace with symbols. I added this to my Cargo.toml:

```toml
[profile.release]
debug = true          # include debug information
strip = false         # keep symbols in the binary
opt-level = 0         # disable optimisations so frames aren't inlined away
lto = false
codegen-units = 1
incremental = false
```

Then, I cleaned and rebuilt the native library. I used a test script which creates a lot of streams, sends some data through each one, and closes it immediately.
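For reference, a hedged sketch of what such a memtest loop can look like, written against a hypothetical minimal client interface rather than js-quic's actual API (`QuicLikeClient`, `newStream`, and `destroy` are assumptions):

```ts
// Hypothetical minimal interface standing in for the real client API.
interface QuicLikeStream {
  writable: WritableStream<Uint8Array>;
  destroy(): Promise<void>;
}
interface QuicLikeClient {
  newStream(): Promise<QuicLikeStream>;
}

// Open a stream, write a little data, close it immediately, and repeat, so
// that any per-stream leak accumulates enough to show up in heaptrack.
async function memtestStreams(
  client: QuicLikeClient,
  iterations: number = 1000,
): Promise<void> {
  const payload = new Uint8Array(1024).fill(0xff);
  for (let i = 0; i < iterations; i++) {
    const stream = await client.newStream();
    const writer = stream.writable.getWriter();
    await writer.write(payload); // send some data through the stream
    await writer.close();        // then close it straight away
    await stream.destroy();      // assumed explicit cleanup call
  }
}
```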
These thousand iterations accumulate roughly 150 KiB of leaked memory. This information was acquired by running the program under heaptrack, which 'captures' the state of the heap while the program runs and then opens the result in a visualiser.
We can look at the caller/callee tree in this GUI visualiser, but it doesn't show enough information. To actually see the details, I had to print the heaptrack data.
Then, I had to relate the hash shown in the memory leak (here being ...) to the leaked allocation.

Also note that the number of leaked allocations is not consistent across runs. For example, one run could have 700 leaked allocations, another could have 500. However, the leaked memory is roughly the same across runs, so the amount of memory leaked is deterministic.
Good information so far. But for the code you're using for testing, we want to simulate graceful usage of the stream. So that means we don't want to cancel the readable. The steps should be ...
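A hedged sketch of what graceful usage along these lines might look like, reusing the same kind of hypothetical stream interface as above (the real steps and API may differ):

```ts
// Hypothetical stream shape; not the actual js-quic interface.
interface GracefulStream {
  writable: WritableStream<Uint8Array>;
  readable: ReadableStream<Uint8Array>;
  destroy(): Promise<void>;
}

// Finish writing, then drain the readable to completion instead of
// cancelling it, and only clean up once both sides have ended.
async function useStreamGracefully(
  stream: GracefulStream,
  payload: Uint8Array,
): Promise<void> {
  const writer = stream.writable.getWriter();
  await writer.write(payload);
  await writer.close(); // signal that we are done writing
  const reader = stream.readable.getReader();
  while (true) {
    const { done } = await reader.read(); // drain rather than cancel
    if (done) break;
  }
  await stream.destroy(); // assumed explicit cleanup call
}
```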
Okay, I've changed the test to do this instead, and it runs exactly the same, with the same amount of memory being leaked.
The leak explored here isn't accurate. Due to a mistake in the test itself, the streams were not being awaited properly, resulting in this leak. This is no longer the case. When checking the running memory usage using tools like ...

My guess is that most of the leaks shown here are occurring due to overhead from using quiche and aren't directly related to the issues we are encountering. It is a bit hard to see, but the tops of the spikes in the Consumed Memory graph over time (~40 seconds, 5000 streams) are slowly growing, indicating that there is a leak.

To actually get to the bottom of this, I will need to use the chrome heap analysis tool with node's debug mode enabled so I can inspect heap snapshots manually. I can also look into incorporating runtime debug symbols so I can get a relative frame of reference to a known symbol.
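As a possible complement to attaching the chrome inspector by hand, heap snapshots can also be written programmatically with Node's built-in `v8` module and then loaded into the inspector for diffing. A minimal sketch, where the interval and logging are arbitrary choices:

```ts
import * as v8 from 'v8';

// Periodically write heap snapshots that can be loaded into Chrome DevTools
// (Memory tab) and diffed against earlier snapshots.
function startHeapSnapshots(intervalMs: number = 60 * 60 * 1000): NodeJS.Timeout {
  return setInterval(() => {
    const file = v8.writeHeapSnapshot(); // returns the generated filename
    console.log(`Heap snapshot written to ${file}`);
  }, intervalMs);
}
```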
After the meeting with Brian, we have made a few discoveries. We first used the heapdump from the chrome inspector to inspect the diff between two heap states: one when the program had just started, and another when it had over 250 MiB of leaked memory. There, I discovered that I had forgotten to clean-install after removing the tracing code. The tracing events were being queued up in memory without a consumer (already a known issue in MatrixAI/js-logger#65), which was the cause of the observable spike in memory consumption. After we fixed that up, the memory no longer spiked for the memtest I had written. In the meantime, we made plenty of other discoveries.

Rust monkeypatching

Javascript lends itself neatly to monkeypatching by providing access to the node modules directory, which contains the source code of all upstream packages that we can locally modify for testing. This is usually not the case with compiled binaries because, well, they are compiled binaries. However, due to rust's nature, each package is compiled on the host system itself. This means that cargo pulls in the source code and compiles it instead of only including a precompiled binary. Thus, it is possible to monkeypatch rust code. There's an option in the Cargo.toml for this:

```toml
[patch.crates-io]
quiche = { path = "/home/aryanj/Projects/quiche/quiche" }
```

Adding this overrode the quiche dependency with my local copy.

Program freeze when sent data exceeds 10 MiB

For some reason, when the data sent through the QUIC streams hit 10 MiB cumulative, the program basically stopped responding. I'm not sure if this is a real bug or just an artifact of testing. It is unlikely to see this happen on the nodes themselves, but maybe the occasional seednode freeze (where they don't crash but are just stuck in a state where they don't accept new connections) could be attributed to this? It's too early to say without further investigation, which was put off in favour of working on the current issue.

After all these discoveries, and with the result showing no memory leaks on the stream side, we are forced to change our approach.

Broaden scope

Instead of focusing on streams, focus on everything in QUIC, from data packets to connections. We have proven that the leak isn't in QUIC streams, so check where else the leak could be. The next candidate is to test connections themselves by creating a connection, sending data back and forth, then closing the connection (a rough sketch of such a test follows this comment). However, if none of the tests reveal an issue, then it means that the leak isn't present in QUIC at all.

Heaptrack Polykey agent

As our primary hypothesis has been proven incorrect, we have no idea where the leak could be. So, instead of looking blindly, run a long-term polykey agent with the inspector enabled, and infrequently (every hour or so) take a heap dump using chrome's memory inspector. Then, backtrack from there to figure out what is consistently leaking memory over time.

Temporary agents running on an isolated network

Testnet is stable, as no nodes are connected to it due to the network segregation work. This means that if we are able to replicate the memory leak with two agents, then we can reasonably limit what operations might be causing it. We can combine this approach with occasionally heap-tracing the agents to figure out what is holding on to memory.

Small-scale network simulation

If all of the above steps fail to give us a reasonable direction, then I can run a seednode on my own system and make some of my agents connect directly to it. Then, I will be able to track an entire polykey network to figure out exactly what could be causing the leak, by having access to the entire network with all agents and clients.

Final Thoughts

The leak is not happening with streams, which was our first guess and the primary hypothesis. Thus, we need to step back and look over the entire codebase. To do this, we need to run polykey agents with the inspector enabled so we can trace memory dumps as the agent is running. Tracer would be of huge help here, but it is not mature enough to actually provide us with relevant and detailed data as of now, so we will be falling back to more classic forms of leak hunting. After looking at the code, we believe that a rewrite would benefit this library, especially refactoring to use observables. However, a QUIC refactor has a large timescale, and that time can be better spent working on other issues.
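Relating to the "Broaden scope" point above, a hedged sketch of a connection-level churn test, again against a hypothetical interface rather than the actual js-quic API:

```ts
// Hypothetical connection-level interface; the real names will differ.
interface QuicLikeConnection {
  send(data: Uint8Array): Promise<Uint8Array>; // request/response round trip
  close(): Promise<void>;
}
interface QuicLikeTransport {
  connect(host: string, port: number): Promise<QuicLikeConnection>;
}

// Create a connection, send data back and forth, then close it, repeatedly,
// so that any per-connection leak becomes visible over many iterations.
async function memtestConnections(
  transport: QuicLikeTransport,
  iterations: number = 1000,
): Promise<void> {
  const payload = new Uint8Array(1024).fill(0xaa);
  for (let i = 0; i < iterations; i++) {
    const connection = await transport.connect('127.0.0.1', 4433);
    await connection.send(payload);
    await connection.close();
  }
}
```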
I just want to add something from while we were going over it: I checked the metrics of the testnet, and it shows the same slow growth that flattens out over time, similar to how the mainnet used to act. This suggests that the problem has possibly existed for longer than expected and is only being triggered now by increased traffic. I suggested that we run some test nodes on the testnet and force them to generate a lot of network traffic as part of the script. We can then see if the testnet suffers the same memory leak. We'd expect it to, but while doing that we can use debug tools to try to see it.
socketCleanupFactory maintains a set of sockets it tracks, so it will hold sockets in memory. It's not needed here, since its only purpose is to clean up sockets during tests if the test fails. We want to track each connection and stream on the server side and make sure they finish before moving to the next step of the loop, as sketched below.
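A hedged sketch of that kind of server-side tracking; the `'close'` event name and the event-target shape are assumptions, not js-quic's actual events:

```ts
// Minimal event-target shape; the real server/connection/stream objects
// would stand in for this.
type EventTargetLike = {
  addEventListener(
    type: string,
    listener: (event: unknown) => void,
    options?: { once?: boolean },
  ): void;
};

// Record a promise that resolves when the tracked object closes.
function trackUntilClosed(tracked: Set<Promise<void>>, target: EventTargetLike): void {
  tracked.add(
    new Promise<void>((resolve) => {
      target.addEventListener('close', () => resolve(), { once: true });
    }),
  );
}

// In the test loop: call trackUntilClosed() for every server-side connection
// and stream as it appears, then `await Promise.all(tracked)` before the next
// iteration so nothing is left half-open.
```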
After testing QUIC connections, I did end up discovering a tiny memory leak in the native code. Using chrome's inspector revealed a leak of around 1 MiB per 20,000 connections. While this seems to be a real issue, it is too small to have had any significant impact on the seednodes' memory consumption, which is why I believe the issue does not lie with js-quic. I will keep this PR open for dumping any investigation results until I am able to narrow this down further.
I have launched two agents on testnet to trace this. Both nodes have each other added to their own node graph, so rediscovery should happen between them. I will reduce the network interval from an hour down to 10 minutes, and rediscovery from 3 days down to an hour. This should synthetically induce more traffic, which might help us figure this out faster.
After monitoring the two running agents on testnet for about half a day, the testnet memory is relatively stable. If anything, it has actually dropped by approximately 4% from the previous stable level, going down from ~46% to ~42%. Note that this is with the change of running the rediscovery code more frequently. I have also observed that, in both nodes' ...

Otherwise, the nodes are relatively stable. I have attached a debugger to both nodes and obtained a heap dump from when I first started the processes. I'll do another dump sometime tomorrow and see if any resource is obviously leaking. Otherwise, occasional ...

I'll collaborate with @brynblack to connect with the seednodes and try to figure out what could be causing these leaks. In the worst case, I might have to run a seednode locally as well, just to obtain full heap tracing over an entire network.
Unfortunately, my laptop crashed at some point during the overnight run, so I could not perform another heapdump to attempt tracing where the leak is coming from. Instead of spending more time on this, I will shift my focus to other, more important issues in Polykey. The memtest script created in this PR can be used later to check and review QUIC and its memory-tightness. There are other fixes in this PR that are partially unrelated to the tracing work and would benefit from being merged, so I'll make those changes as a standalone commit instead of doing anything with this PR.
Description
This PR will integrate the experimental tracer from js-logger to trace the memory leak so it can finally be patched. This should make the polykey seednodes stable and robust. It will also act as a live-fire exercise for the tracer.
Issues Fixed
Tasks
Final checklist