Description
The following was mostly written by an LLM, but I ran the benchmark myself :)
1. Motivation: NIOFSDirectory is still relevant
In recent memory-constrained deployments (cgroup-limited containers with large indices), MMapDirectory triggered severe page-fault storms: pgmajfault rates spiked by an order of magnitude once the working set exceeded the cgroup limit, and query latency degraded sharply. Switching to NIOFSDirectory resolved the problem.
2. Problem: a JDK monitor caps NIOFSDirectory at ~4 threads
After moving more workloads onto NIOFSDirectory, we hit a hard scaling ceiling. The bottleneck is not the kernel but a synchronized block reached from sun.nio.ch.FileChannelImpl. Every positioned read registers the calling thread in a NativeThreadSet (so a concurrent close() can interrupt it via pthread_kill), and that registration takes a monitor shared by all readers of the channel, twice per read: once to add the thread and once to remove it.
// sun.nio.ch.FileChannelImpl
private int readInternal(ByteBuffer dst, long position) throws IOException {
    int n = 0;
    int ti = -1;
    try {
        beginBlocking();
        // ↓↓↓ contention point: monitor-protected, on every single read ↓↓↓
        ti = threads.add();
        if (!isOpen()) return -1;
        do {
            // ... Blocker.begin / IOUtil.read(fd, dst, position, ...) / Blocker.end ...
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti); // takes the same monitor again
        endBlocking(n > 0);
    }
}

// sun.nio.ch.NativeThreadSet: the monitor every reader fights for
int add() {
    long th = NativeThread.current();
    synchronized (this) { // ← global monitor per channel
        // ... grow array, find free slot, write thread handle ...
    }
}
Past ~4 threads, this monitor's cache-line bouncing dominates the cost of pread64 itself, and throughput stops scaling. This is structurally tied to the Channel.close() interruption contract and unlikely to be removed from the JDK in the near term.
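For reference, the access pattern that runs into this monitor is many threads issuing positional reads against one shared FileChannel. The following is a minimal sketch of that pattern using only public java.nio APIs; the class name, method names, and sizes are illustrative and are not taken from NIOFSDirectory or from the benchmark.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical demo class: N threads share one FileChannel and do positional reads.
final class SharedChannelReads {
  static final int BLOCK = 16 * 1024;

  /** Fills dst from the given block index; returns the number of bytes read. */
  static int readBlock(FileChannel ch, long blockIndex, ByteBuffer dst) throws IOException {
    dst.clear();
    long pos = blockIndex * BLOCK;
    while (dst.hasRemaining()) {
      // Positional read: does not touch the channel's own position,
      // but each call still goes through FileChannelImpl's bookkeeping.
      int n = ch.read(dst, pos + dst.position());
      if (n < 0) break; // end of file
    }
    return dst.position();
  }

  /** Runs `threads` readers against one shared channel; returns total bytes read. */
  static long run(Path file, int threads, int readsPerThread) throws Exception {
    long blocks = Files.size(file) / BLOCK; // file must hold at least one block
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      Callable<Long> task = () -> {
        ByteBuffer dst = ByteBuffer.allocateDirect(BLOCK);
        long total = 0;
        for (int i = 0; i < readsPerThread; i++) {
          total += readBlock(ch, ThreadLocalRandom.current().nextLong(blocks), dst);
        }
        return total;
      };
      List<Future<Long>> results = new ArrayList<>();
      for (int t = 0; t < threads; t++) {
        results.add(pool.submit(task));
      }
      long sum = 0;
      for (Future<Long> f : results) sum += f.get();
      return sum;
    } finally {
      pool.shutdown();
    }
  }
}
```

Calling `SharedChannelReads.run(path, 8, 100_000)` on a large file and timing it at different thread counts is a rough way to observe the plateau; it is not a substitute for a JMH benchmark.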
3. Benchmark: native pread(2) via Panama FFI scales 4× higher
JMH on Java 25, Linux x86_64, NVMe; 1 GiB file, 16 KiB random reads, 16 reads/op. Throughput in ops/ms (higher is better):
| Benchmark | 1 thr | 2 thr | 4 thr | 8 thr | 16 thr | 32 thr |
|---|---|---|---|---|---|---|
| ffiPread | 371.8 | 633.8 | 1104.5 | 1854.5 | 2838.1 | 2862.5 |
| fileChannelReadDirect | 358.9 | 428.1 | 683.4 | 637.3 | 737.0 | 737.4 |
| fileChannelReadHeap | 318.1 | 495.4 | 668.2 | 596.0 | 757.4 | 712.8 |
- 1 thread: FFI is ~4% faster than the direct-buffer FileChannel read (same syscall, less Java overhead).
- FileChannel plateaus at ~700 ops/ms from 4 threads onward; profiling shows the time going to NativeThreadSet's monitor.
- FFI keeps scaling through 16 threads (about 7.6× its single-thread throughput), then hits the hardware ceiling at 32.
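The scaling figures can be recomputed directly from the table. The sketch below only does arithmetic on the published ffiPread and fileChannelReadDirect rows; the class and method names are made up for illustration.

```java
// Hypothetical helper: derives speedup and per-thread efficiency from the table above.
final class ScalingRatios {
  static final int[] THREADS = {1, 2, 4, 8, 16, 32};
  // Throughput in ops/ms, copied from the ffiPread and fileChannelReadDirect rows.
  static final double[] FFI = {371.8, 633.8, 1104.5, 1854.5, 2838.1, 2862.5};
  static final double[] CHANNEL_DIRECT = {358.9, 428.1, 683.4, 637.3, 737.0, 737.4};

  /** Throughput at column i relative to the single-thread column. */
  static double speedup(double[] series, int i) {
    return series[i] / series[0];
  }

  /** Speedup divided by thread count; 1.0 would be perfectly linear scaling. */
  static double efficiency(double[] series, int i) {
    return speedup(series, i) / THREADS[i];
  }

  public static void main(String[] args) {
    for (int i = 0; i < THREADS.length; i++) {
      System.out.printf("%2d thr: ffi speedup %.2fx (efficiency %.2f), ffi/channel %.2fx%n",
          THREADS[i], speedup(FFI, i), efficiency(FFI, i), FFI[i] / CHANNEL_DIRECT[i]);
    }
  }
}
```

At 32 threads the ffi/channel ratio comes out at roughly 3.9×, which is where the "4×" in the section title comes from.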
4. Proposal: PreadDirectory
A new Directory that performs random reads via pread(2) through Panama FFI:
- POSIX → FFI pread. No NativeThreadSet, no monitor, stateless syscall.
- Non-POSIX → fallback to NIOFSDirectory. Behavior never worse than today;
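To make the two paths concrete, here is a rough sketch of a positional read that uses pread(2) through the java.lang.foreign API when the symbol can be resolved and falls back to FileChannel otherwise. It is not the proposed PreadDirectory implementation. Assumptions: Java 22+ (finalized FFM API), a 64-bit POSIX libc where ssize_t, size_t, and off_t are 64 bits wide and O_RDONLY is 0, and hypothetical names (PreadSketch, readAt, readAtWithChannel). For brevity it opens and closes the file on every call and does not read errno; a real Directory would keep the descriptor open per file.

```java
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;

// Hypothetical sketch, not the proposed PreadDirectory.
final class PreadSketch {
  // Assumed C signatures (64-bit POSIX):
  //   int open(const char *path, int flags, ...);
  //   ssize_t pread(int fd, void *buf, size_t count, off_t offset);
  //   int close(int fd);
  private static final MethodHandle OPEN = lookup("open",
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT),
      Linker.Option.firstVariadicArg(2)); // open() is variadic after `flags`
  private static final MethodHandle PREAD = lookup("pread",
      FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT, ValueLayout.ADDRESS,
          ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));
  private static final MethodHandle CLOSE = lookup("close",
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

  /** Returns a downcall handle, or null if the symbol or the linker is unavailable. */
  private static MethodHandle lookup(String name, FunctionDescriptor desc,
      Linker.Option... options) {
    try {
      Linker linker = Linker.nativeLinker();
      return linker.defaultLookup().find(name)
          .map(symbol -> linker.downcallHandle(symbol, desc, options))
          .orElse(null);
    } catch (RuntimeException e) {
      return null;
    }
  }

  static boolean nativeAvailable() {
    return OPEN != null && PREAD != null && CLOSE != null;
  }

  /** Reads up to `length` bytes at `position`; shorter only at end of file. */
  static byte[] readAt(Path file, long position, int length) throws IOException {
    if (!nativeAvailable()) {
      return readAtWithChannel(file, position, length); // non-POSIX fallback
    }
    try (Arena arena = Arena.ofConfined()) {
      MemorySegment path = arena.allocateFrom(file.toString());
      int fd = (int) OPEN.invokeExact(path, 0); // 0 == O_RDONLY (assumed)
      if (fd < 0) throw new IOException("open failed: " + file);
      try {
        MemorySegment buf = arena.allocate(length);
        long done = 0;
        while (done < length) {
          long n = (long) PREAD.invokeExact(fd, buf.asSlice(done), length - done,
              position + done);
          if (n < 0) throw new IOException("pread failed: " + file);
          if (n == 0) break; // end of file
          done += n;
        }
        return buf.asSlice(0, done).toArray(ValueLayout.JAVA_BYTE);
      } finally {
        int ignored = (int) CLOSE.invokeExact(fd);
      }
    } catch (IOException e) {
      throw e;
    } catch (Throwable t) {
      throw new IOException(t);
    }
  }

  /** Same contract as readAt, using a positional FileChannel read. */
  static byte[] readAtWithChannel(Path file, long position, int length) throws IOException {
    try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
      ByteBuffer dst = ByteBuffer.allocate(length);
      while (dst.hasRemaining()) {
        int n = ch.read(dst, position + dst.position());
        if (n < 0) break; // end of file
      }
      return Arrays.copyOf(dst.array(), dst.position());
    }
  }
}
```

Resolving the handles once in static initializers keeps the per-read path to a single downcall, and treating a failed lookup as "native unavailable" is one simple way to get the fallback behavior described above. On recent JDKs the downcall may print a native-access warning unless the JVM is started with `--enable-native-access`.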