Patching memory leak using tracer #153

Closed
wants to merge 9 commits

Conversation

aryanjassal
Member

@aryanjassal aryanjassal commented May 21, 2025

Description

This PR will integrate the experimental tracer from js-logger to trace, and finally patch, the memory leak. This should make the Polykey seednodes stable and robust. This will also act as a live-fire exercise for the tracer.

Issues Fixed

Tasks

  • 1. Integrate tracer into js code
  • 2. Integrate tracer into native code
  • 3. Replicate memory leak with test/script
  • 4. Use tracer to visualise where the leak is coming from

Final checklist

  • Domain specific tests
  • Full tests
  • Updated inline-comment documentation
  • Lint fixed
  • Squash and rebased
  • Sanity check the final build

@aryanjassal aryanjassal self-assigned this May 21, 2025

linear bot commented May 21, 2025

ENG-626

@aryanjassal
Member Author

A quick test of object lifecycles with the tracer has confirmed that no object is being leaked, although the parent/child relationships are still somewhat messed up. I'm working on fixing that right now.

Member Author

Brian, Roger, and I had a discussion about the different approaches I can take to observing the state of the native Rust layer of the library. A lot of solutions and ideas were explored.

Callbacks

The first approach was keeping the interface and the API consistent by introducing callbacks in the native layer which invoke JavaScript functions to perform and trigger tracing. This would have worked if not for a caveat: JavaScript can hook into Rust functions by passing in parameters and getting return values back, but this is a one-way street. The Rust bindings are used by the JavaScript layer, but Rust cannot pull in JavaScript functions and execute them. This is because JavaScript uses an event loop to track what to run and when, while Rust has a completely different control and logic flow, so the Rust and JavaScript execution contexts are fundamentally disconnected.

Technically, it's possible to run JavaScript code from Rust, as NAPI exposes a way to do so, but the executed callbacks are meant to be fire-and-forget, not something we can wait on for return values. This isn't an issue for stopping a span, but the start-span callback returns a span identifier. Thus, the Rust code would look something like this.

// Simplified pseudocode: the exact NAPI calls differ, but the control flow is the point.
pub fn javascript_callback(callback: napi::JsFunction) {
  // ...
  // Queue the JS callback onto the event loop in blocking mode, getting back
  // a receiver for its eventual return value (the span identifier).
  let rx = call_javascript_function(&callback, ThreadsafeFunctionCallMode::Blocking);
  // The callback has only been queued, not run, so blocking on the result
  // here deadlocks the event loop.
  let span_id = rx.recv().unwrap();
}

This code is simplified, but it conveys the idea. The Rust code runs as a single task inside the event loop. There, NAPI queues the JavaScript function and immediately waits for a response, but the JavaScript function has only been queued, not executed, so the program ends up deadlocking. This is a known limitation of working with JavaScript callbacks in the native layer, so this approach cannot be continued.

Unix sockets

The next idea I had was to use Unix sockets to communicate between the two layers. Even disregarding the work that would go into structuring the API (which would basically be a repeat of the work done in js-rpc), the idea is still a bad one: the two layers are already linked in the same process, so reaching for IPC to communicate between them makes little sense.

First-party native integration

Another approach is to use first-party native integration. This would result in explicit bindings being available for native code, so we won't have to worry about weird interactions or limitations between them. This, however, introduces its own slew of problems.

Logic/IO separation

The first approach was proposed by Roger. He suggested that we separate the code so that the logical part resides in the native Rust layer while the IO is handled by JavaScript events. This allows for a clear separation of concerns and treats JavaScript as a wrapper for the logic defined by the Rust code. This approach would still require Rust to interact with JavaScript, potentially in the same way as the callbacks approach, so a lot of design considerations need to be made.

Separate implementations

Brian's proposal involved implementing two separate layers for tracing. Basically, we keep the JavaScript tracing as-is and implement a dedicated native tracer. Both would operate on their own separate loops and generate data separately. As we will be using IdSortable, we will be able to reconcile and merge the outputs of both into a single log file.
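
As a rough sketch of how that reconciliation could work (assuming both tracers emit records carrying lexicographically sortable IdSortable ids as strings; none of these names come from the actual tracer), merging is just an ordered interleave of the two logs:

// Hypothetical record shape: { id, ...payload }, where `id` is an IdSortable
// identifier, so lexicographic order matches creation order.
function mergeTraceLogs(jsEvents, nativeEvents) {
  return [...jsEvents, ...nativeEvents].sort((a, b) =>
    a.id < b.id ? -1 : a.id > b.id ? 1 : 0,
  );
}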


Both these approaches run into the same higher-level issue: IdSortable. It is provided via @matrixai/js-id, which is a JavaScript-only library. This means the Rust code would still need to pull data from JavaScript, so we would run into the same issue as before.

Fresh perspective

Roger and I had a discussion after sleeping on it. Here, Roger clarified the requirements and the problem statement. The purpose of this work is to find and patch the memory leak. The tracer is just a tool we were going to use, but if the leak can be patched without it, that would also mark the issue as completed. Thus, instead of working around or against the data and control flow defined by NAPI, we should work with it. The bindings of all objects need to be explicitly exposed to the JavaScript layer, so tracing can be applied directly in the JavaScript layer itself without worrying about interfacing.
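
As a rough illustration of what tracing purely in the JavaScript layer could look like (the traceNative helper and the tracer's startSpan/stopSpan interface are assumptions for this sketch, not the actual js-logger API), the exposed native methods can simply be wrapped on the JS side:

// Wrap every method of an exposed native object with span start/stop calls.
// `tracer` is assumed to expose startSpan(name) and stopSpan(span).
function traceNative(native, tracer) {
  return new Proxy(native, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== 'function') return value;
      return (...args) => {
        const span = tracer.startSpan(String(prop));
        try {
          return value.apply(target, args);
        } finally {
          tracer.stopSpan(span);
        }
      };
    },
  });
}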

To trace the Rust layer, we can use dedicated Rust profilers to gain observability. We will not get data in the same format as what the tracer provides and the visualiser expects, but it would still help us solve the bug. Thus, we don't need to worry about symbols present in quiche which we cannot expose to JavaScript for manual tracing.

There's another problem that needs to be considered in the design phase. Lower-level languages like Rust use a manual memory model, where memory is explicitly allocated and freed. Higher-level languages like JavaScript use a garbage collector to automatically free memory when it is no longer reachable by the code. As we will be exposing manually-managed heap objects to the garbage-collected JavaScript heap, there is a chance that the garbage collector has no idea of the actual object lifecycle. In native code, the heap buffer can simply be overwritten after the object it contained has been destroyed, but this is not propagated to the JavaScript layer, which can then read garbage data from an invalid memory location. Thus, we need to take care of this scenario in our implementation.
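
One way to guard against this on the JS side (a minimal sketch with assumed names, not the actual js-quic API) is to track destruction explicitly in the wrapper, so the binding is never asked to read memory the native side has already freed:

// Wraps a handle returned by the native (NAPI) layer; `read` and `free` are
// hypothetical methods standing in for whatever the binding actually exposes.
class NativeHandle {
  #inner;
  #destroyed = false;

  constructor(inner) {
    this.#inner = inner;
  }

  read() {
    // Refuse to touch native memory once the native object has been freed.
    if (this.#destroyed) throw new Error('native object already destroyed');
    return this.#inner.read();
  }

  destroy() {
    if (this.#destroyed) return;
    this.#destroyed = true;
    this.#inner.free();
  }
}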

This is the approach (exposing the bindings and tracing in the JavaScript layer) that I will be implementing now.

Member Author

The tracer isn't viable for debugging this as of now. With the tracer, I was able to confirm the bug is not present in the JavaScript layer, but nothing else. So, in the end, I fell back to typical memory leak detectors like valgrind or heaptrack. I ended up using heaptrack to trace the code, and have confirmed that there definitely is a memory leak in the Rust code.

heaptrack stats:
	allocations:          	6815926
	leaked allocations:   	618
	temporary allocations:	1492508
Heaptrack finished!

We can generally ignore the leaks scoped under v8:: as they are unlikely to be the cause of the memory leaks here. Of particular importance are the unresolved function and alloc::alloc::alloc, indicating that the leaked allocations originate in the Rust layer.

I'm working on figuring out if the issue is in js-quic or quiche, but given alloc::alloc::alloc is leaking, there might be either something leaking from quiche or js-quic mismanaging some state.

Member Author

547.82K peak memory consumed over 19017 calls from
alloc::alloc::alloc::hd6fb3bd214935af9
  at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:98
  in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
alloc::alloc::Global::alloc_impl::hf22c66c92301f94a
  at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:181
386.06K consumed over 2 calls from:
    _$LT$alloc..alloc..Global$u20$as$u20$core..alloc..Allocator$GT$::allocate::h72f444ab5665945f
      at /build/rustc-1.77.2-src/library/alloc/src/alloc.rs:241
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::alloc::inner::do_alloc::h05c1c9964173c76c
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/alloc.rs:15
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::new_uninitialized::h04185919e4520fe6
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1752
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::fallible_with_capacity::hc60b62db5a759de4
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1790
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::prepare_resize::h856fb8d96c0104dc
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:2869
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::resize_inner::h5734c6ebce825348
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:3065
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::raw::RawTableInner::reserve_rehash_inner::h02c7332d0072754b
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:2955
    hashbrown::raw::RawTable$LT$T$C$A$GT$::reserve_rehash::he043065fb3ccd23d
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1233
    hashbrown::raw::RawTable$LT$T$C$A$GT$::reserve::ha5daa5c8aa9f2525
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/raw/mod.rs:1181
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::map::HashMap$LT$K$C$V$C$S$C$A$GT$::reserve::h75404fbadb979caa
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/map.rs:1106
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    hashbrown::rustc_entry::_$LT$impl$u20$hashbrown..map..HashMap$LT$K$C$V$C$S$C$A$GT$$GT$::rustc_entry::h03f7fef4df72c2a6
      at /build/rustc-1.77.2-src/vendor/hashbrown/src/rustc_entry.rs:46
    std::collections::hash::map::HashMap$LT$K$C$V$C$S$GT$::entry::hbdc371e735fcb947
      at /build/rustc-1.77.2-src/library/std/src/collections/hash/map.rs:856
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::stream::StreamMap::get_or_create::h74d54b7613720d99
      at src/stream.rs:222
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::Connection::get_or_create_stream::h9c9dc6a8b8b2d6eb
      at src/lib.rs:6585
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quiche::Connection::stream_send::h3942e29571d828c2
      at src/lib.rs:4562
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::Connection::stream_send::he68a9132a1c70ae8
      at src/native/napi/connection.rs:400
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::_$u7b$$u7b$closure$u7d$$u7d$::_$u7b$$u7b$closure$u7d$$u7d$::h395f50a9ae562376
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    napi::tokio_runtime::within_runtime_if_available::hee2811d803013358
      at /home/aryanj/.cargo/registry/src/index.crates.io-6f17d22bba15001f/napi-2.10.1/src/tokio_runtime.rs:69
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::_$u7b$$u7b$closure$u7d$$u7d$::h629475faa68d95e1
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    core::result::Result$LT$T$C$E$GT$::and_then::h54bcef81d4270c5d
      at /build/rustc-1.77.2-src/library/core/src/result.rs:1316
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node
    quic::connection::__napi_impl_helper__Connection__1::__napi__stream_send::h6d1ebca11e523cd6
      at src/native/napi/connection.rs:186
      in /home/aryanj/Projects/js-quic/prebuild/quic-linux-x64.node

The leak is clearly happening through js-quic's stream_send, which propagates down into quiche. As I have tested sending thousands of messages through QUIC streams, the sends themselves are clearly not causing the issue. My current read, with this information, is that quiche is not closing the streams properly.

Member Author

To get to this output, here is what I did.

First, I had to tell cargo to include debug symbols and prevent it from optimising the code so I could get a full backtrace with symbols. I added this to my Cargo.toml to do that.

[profile.release]
debug = true
strip = false
opt-level = 0
lto = false
codegen-units = 1
incremental = false

Then, I cleaned and rebuilt the native library. I used a test script which creates a lot of streams, sends some data through each, and closes them immediately.

for (let i = 0; i < 1000; i++) {
  const stream = client.connection.newStream();
  const writer = stream.writable.getWriter();
  const reader = stream.readable.getReader();
  await writer.write(message);
  await reader.read();
  await writer.close();
  await reader.cancel();
}

These thousand iterations accumulate roughly 150 KiB of leaked memory. This information is acquired by the following command, which runs the program while capturing the state of the heap, and then opens the result in a visualiser.

heaptrack node ./scripts/memtest.mjs

We can look at the caller/callee tree in this GUI visualiser, but it doesn't show enough information. To get the full backtraces, I had to print the heaptrack data.

heaptrack_print heaptrack.node.<pid>.zst > heaptrack.txt
nvim heaptrack.txt

Then, I had to relate the hash shown in the memory leak (here hd6fb3bd214935af9) to the hash in the print dump. There, I was able to see a backtrace through the call stack, which is what is included in the previous comment.

Also note that the number of leaked allocations is not consistent across runs. For example, one run could have 700 leaked allocations, another could have 500. However, the leaked memory is roughly the same across runs, so the amount of memory leaked is effectively deterministic even though the allocation count is not.

@tegefaulkes
Contributor

Good information so far. But for the code you're using for testing, we want to simulate graceful usage of the stream. So that means we don't want to cancel the readable.

The steps should be

  1. new stream
  2. write message
  3. close writable
  4. read readable until completion
  5. repeat.

Member Author

Okay, I've changed the test to do this instead, and it runs the exact same with the same amount of memory being leaked.

  for (let i = 0; i < 1000; i++) {
    const stream = client.connection.newStream();
    const writer = stream.writable.getWriter();
    await writer.write(message);
    await writer.close();
    const reader = stream.readable.getReader();
    let finished = false;
    while (!finished) {
      finished = (await reader.read()).done;
    }
  }

@aryanjassal
Member Author

aryanjassal commented Jun 4, 2025

I've discovered another quick thing. There are two main allocation points resulting in the resource leaks. One of the alloc leaks is unaffected by the amount or size of data transmitted, while the other alloc does increase with the size/amount of data transmitted.

[Heaptrack of test with 22,000 transmitted bytes]

[Heaptrack of test with 0 transmitted bytes]

The memory leak for the alloc call on h39d... is consistently 49.7 KiB, but the other call, hd6f..., increases with larger messages. This means the leak is most likely occurring because the streams are not being closed or cleaned up properly. I believe Brian also suspected this outcome from the beginning.

@aryanjassal
Member Author

There is no way to explicitly close a quiche stream. It will automatically close if either:

  • Both sides sent FIN
  • Either side sent RESET_STREAM

From my exploration, it seems the code is handling that correctly, but obviously something is going wrong.

[Heaptrack result with reduced stream count and constant data]

Also, once I reduced the number of streams but kept the data constant (around 44 bytes), I got the map above. This confirms that one of the leaks is directly tied to the stream count and the other is directly tied to the amount of data, reinforcing the idea that the leak comes from quiche streams not being closed properly.

@aryanjassal
Member Author

aryanjassal commented Jun 4, 2025

The leak explored here isn't accurate. Due to a mistake in the test itself, the streams were not being awaited properly, which is what produced this leak. That is no longer the case. When checking the running memory usage with tools like btop (set to a short refresh interval like 100 ms), I could see the memory grow, but the heaptrack results consistently showed the same 'leaks'.

My guess is that most of the 'leaks' heaptrack still reports are overhead from using quiche and aren't directly related to the issue we are encountering.

[Heaptrack Consumed Memory graph over ~40 seconds and 5000 streams]

It is a bit hard to see, but the tops of the spikes in the Consumed Memory graph are slowly growing over time (~40 seconds, 5000 streams), indicating that there is a leak.

To actually get to the bottom of this, I will need to use Chrome's heap analysis tool with Node's debug mode enabled so I can inspect heap snapshots manually. I can also look into incorporating runtime debug symbols so I can get a relative frame of reference to a known symbol.
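
For reference, the heap snapshots don't have to be taken by hand in DevTools; a rough sketch of capturing them programmatically (using Node's built-in v8 module, nothing specific to this PR) looks like this:

// Write .heapsnapshot files before and after the workload; both can be
// loaded and diffed in the Memory tab of Chrome DevTools.
import { writeHeapSnapshot } from 'node:v8';

const before = writeHeapSnapshot();
// ... run the stream/connection workload here ...
const after = writeHeapSnapshot();
console.log(`compare ${before} against ${after}`);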

@aryanjassal
Member Author

After the meeting with Brian, we have made a few discoveries. We first used a heap dump from the Chrome inspector to diff two heap states: one taken when the program had just started, and another once it had over 250 MiB of leaked memory. There, I discovered that I had forgotten to do a clean install after removing the tracing code. This is already tracked in MatrixAI/js-logger#65: the tracer events were being queued up in memory without a consumer, which was the cause of the observable spike in memory consumption. After we fixed that up, the memory no longer spiked for the memtest I had written.
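
The failure mode is simple to picture (an illustration only, not js-logger's actual code): events get pushed into a queue that nothing ever drains, so every event stays reachable and shows up in the heap diff as growth.

// Producer keeps queueing trace events; with no consumer attached, the array
// grows for the lifetime of the process and looks exactly like a leak.
const queue = [];
function emitTraceEvent(event) {
  queue.push(event);
}
setInterval(() => emitTraceEvent({ timestamp: Date.now() }), 1);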

In the meantime, we made plenty of other discoveries.

Rust monkeypatching

JavaScript lends itself neatly to monkeypatching by providing access to the node_modules directory, which contains the source code of all upstream packages, so we can locally modify them for testing. This is usually not the case with compiled binaries as, well, they are compiled binaries. However, due to Rust's nature, each package is compiled on the host system itself. This means that cargo pulls in the source code and compiles it instead of only including a precompiled binary. Thus, it is possible to monkeypatch Rust code. There's an option in Cargo.toml which functions similarly to npm link.

[patch.crates-io]
quiche = { path = "/home/aryanj/Projects/quiche/quiche" }

Adding this overrode the quiche crate pulled by cargo to instead point to a version I had locally, meaning we could modify the Rust code manually, such as debug printing every time a stream was closed. So we did, and the debug statements provided strong evidence that every single stream was being closed, so streams were basically ruled out.

Program freeze when sent data exceeds 10 MiB

For some reason, when the data sent through the QUIC streams hit 10 MiB of cumulative capacity, the program basically stopped responding. btop claimed the program was 'Sleeping'. After adding debug prints for the number of bytes sent/received, those counters also stopped advancing after that limit was hit. We also added a setInterval with a console log to check whether node had completely deadlocked or only the QUIC code had stopped working, and the console logs kept firing even after the QUIC code froze up. This limit was discovered by sending a 1 MiB buffer over the stream 10 times, a 512 KiB buffer 20 times, and a 1 KiB buffer 10,000 times.
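
The liveness check mentioned above is trivial but worth recording (a sketch; the interval and message are arbitrary): if this keeps printing while the QUIC transfer has stalled, the Node event loop is still alive and only the QUIC side is stuck.

// Heartbeat on the JS event loop, independent of any QUIC activity.
const heartbeat = setInterval(() => {
  console.log('event loop alive', new Date().toISOString());
}, 1000);
// ... run the QUIC send/receive loop here ...
// clearInterval(heartbeat) once the test is done.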

I'm not sure if this is a real bug or just an artifact of testing. It is unlikely we would see this happen on the nodes themselves, but maybe the occasional seednode freezes, where they don't crash but are just stuck in a state where they don't accept new connections, could be attributed to this? It's too early to say without further investigation, which was put off in favour of working on the current issue.


After all these discoveries and the result with no memory leaks on the stream side, we are forced to change our approach.

Broaden scope

Instead of focusing on streams, focus on everything in QUIC. Going from data packets to connections, everything. We have proven that the leak isn't in QUIC streams, so check where else the leak could be. The next candidate is to test connections themselves by creating a connection, sending data back-and-forth, then closing the connection. However, if none of the tests actually reveal an issue, then it means that the leak isn't present in QUIC at all.
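
A connection-level memtest could mirror the stream script above; here is a sketch, where createClient and destroy are stand-in names rather than the actual js-quic API, and host, port, and message are placeholders:

// Open a connection, run one graceful stream over it, then tear the
// connection down, repeating to surface any per-connection leaks.
for (let i = 0; i < 1000; i++) {
  const client = await createClient({ host, port }); // hypothetical helper
  const stream = client.connection.newStream();
  const writer = stream.writable.getWriter();
  await writer.write(message);
  await writer.close();
  const reader = stream.readable.getReader();
  let done = false;
  while (!done) {
    done = (await reader.read()).done;
  }
  await client.destroy(); // hypothetical teardown of the connection/client
}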

Heaptrack Polykey agent

As our primary hypothesis has been proven incorrect, we have no idea where the leak could be. So, instead of looking blindly, run a long-term Polykey agent with the inspector enabled, and infrequently (every hour or so) take a heap dump using Chrome's memory inspector. Then, backtrack from there to figure out what is consistently leaking memory over time.
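
If attaching the inspector by hand every hour gets tedious, the same v8 module from earlier can take the dumps on a timer (a sketch; the interval and filenames are arbitrary):

// Dump a heap snapshot roughly every hour; the timestamped files can then be
// diffed pairwise in Chrome's Memory tab.
import { writeHeapSnapshot } from 'node:v8';

setInterval(() => {
  writeHeapSnapshot(`agent-${Date.now()}.heapsnapshot`);
}, 60 * 60 * 1000);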

Temporary agents running on an isolated network

Testnet is stable as no nodes are connected to it due to the network segregation work. This means that if we are able to replicate the memory leak with two agents, then we can reasonably narrow down what operations might be causing it. We can combine this approach with occasionally heap-tracing the agents to figure out what is holding on to memory.

Small-scale network simulation

If all of the above steps fail to give us a reasonable direction, then I can run a seednode on my own system and make some of my agents connect to it directly. Then, I will be able to track an entire Polykey network, with access to all agents and clients, to figure out exactly what could be causing the leak.

Final Thoughts

The leak is not happening in streams, which was our first guess and primary hypothesis. Thus, we need to step back and look over the entire codebase. To do this, we need to run Polykey agents with the inspector enabled so we can take memory dumps while the agent is running. The tracer would be of huge help here, but it is not yet mature enough to provide us with relevant and detailed data, so we will be falling back to more classic forms of leak hunting.

After looking at the code, we believe the library would benefit from a rewrite, especially refactoring it to use observables. However, a QUIC refactor has a large timescale, which would be better spent on other issues.

@tegefaulkes
Contributor

I just want to add, while we were going over it, I checked the metrics of the testnet and it shows the same slow growth that flattens out over time, similar to how the mainnet used to act. This suggests that the problem has possibly existed for longer than expected and it's being triggered now by increased traffic.

I suggested that we can run some test nodes on the testnet and force them to generate a lot of network traffic as part of the script. We can then see if the testnet suffers the same memory leak. We'd expect it to, but while doing that we can use debug tools to try and see it.

aryanjassal and others added 2 commits June 11, 2025 11:02
socketCleanupFactory maintains a set of sockets it tracks. So it will hold sockets in memory. It's not needed since its only purpose is to clean up sockets during tests if they fail the test.

We want to track each connection and stream on the server side and make sure they finish before moving to the next step of the loop.
@aryanjassal
Member Author

Broaden scope

Instead of focusing on streams, focus on everything in QUIC. Going from data packets to connections, everything. We have proven that the leak isn't in QUIC streams, so check where else the leak could be. The next candidate is to test connections themselves by creating a connection, sending data back-and-forth, then closing the connection. However, if none of the tests actually reveal an issue, then it means that the leak isn't present in QUIC at all.

After testing QUIC connections, I did end up discovering a tiny memory leak in the native code. However, the leak is too insubstantial to have caused any significant issues on the seednodes. Using Chrome's inspector revealed a leak of around 1 MiB per 20,000 connections. While this seems to be a real issue, it is too small to impact the seednodes' memory consumption in any significant way. This is why I believe the issue does not lie with js-quic. I will keep this PR open for dumping any investigation results until I am able to better narrow this down.

Temporary agents running on an isolated network

Testnet is stable as no nodes are connected to it due to the network segregation work. This means that if we are able to replicate the memory leak with two agents, then we can reasonably narrow down what operations might be causing it. We can combine this approach with occasionally heap-tracing the agents to figure out what is holding on to memory.

I have launched two agents on testnet to trace this. Both the nodes have each other added to their own node graph, so rediscovery should happen between them. I will reduce the network interval from an hour down to 10 minutes, and rediscovery from 3 days down to an hour. This should synthetically induce more traffic, which might help us figure this out faster.

@aryanjassal
Member Author

After monitoring the two agents running on testnet for about half a day, the testnet memory is relatively stable. In fact, it has actually dropped by approximately 4% from its previous stable level, going down from ~46% to ~42%. Note that this is with the changes making the rediscovery code run more frequently.

I have also observed that, in both nodes' node connections, the seednodes are frequently acquiring multiple connections. Eventually, they do trickle back down to maintaining a single connection, but this is some interesting behaviour I did observe.

Otherwise, the nodes are relatively stable. I have attached a debugger to both nodes and obtained a heap dump when I first started the processes. I'll do another dump sometime tomorrow and see if any resource is obviously leaking. Otherwise, occasional btop checks have shown the memory consumption sitting comfortably around the 200 to 300 MiB mark, sometimes hitting the 400s.

I'll collaborate with @brynblack to connect with the seednodes and try to figure out what could be causing these leaks. In the worst case, I might have to run a seednode locally as well, just to obtain full heap tracing over an entire network.

@aryanjassal
Member Author

aryanjassal commented Jun 12, 2025

Unfortunately, my laptop crashed at some point during the overnight run, so I could not perform another heapdump to attempt tracing where the leak is coming from. Instead of spending more time doing this, I will shift my focus to other, more important issues in Polykey. The memtest script created in this PR can be used later to check and review QUIC and its memory-tightness.

There are other fixes in this PR that are partially unrelated to the tracing work and would benefit from being merged, so I'll make those changes as a standalone commit instead of doing anything with this PR.

Successfully merging this pull request may close these issues.

Integrate tracer to detect memory leaks