
Introduce a pread Directory based on Panama-FFI? #16044

@gf2121

Description

The following was mostly written by an LLM, but the benchmarks were run by me :)

1. Motivation: NIOFSDirectory is still relevant

In recent memory-constrained deployments (cgroup-limited containers with large indices), MMapDirectory triggered severe page-fault storms: pgmajfault rates spiked by an order of magnitude once the working set exceeded the cgroup limit, and query latency degraded sharply. Switching to NIOFSDirectory resolved this for us.

2. Problem: a JDK monitor caps NIOFSDirectory at ~4 threads

After moving more workloads onto NIOFSDirectory, we hit a hard scaling ceiling. The bottleneck is not the kernel — it's a synchronized block in sun.nio.ch.FileChannelImpl. Every positioned read registers the calling thread into a NativeThreadSet (so a concurrent close() can interrupt it via pthread_kill), and that registration takes a global monitor on every read.

// sun.nio.ch.FileChannelImpl
private int readInternal(ByteBuffer dst, long position) throws IOException {
    int n = 0;
    int ti = -1;
    try {
        beginBlocking();
        // ↓↓↓ contention point — monitor-protected, on every single read ↓↓↓
        ti = threads.add();
        if (!isOpen()) return -1;
        do {
            // ... Blocker.begin / IOUtil.read(fd, dst, position, ...) / Blocker.end ...
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);   // takes the same monitor again
        endBlocking(n > 0);
    }
}
// sun.nio.ch.NativeThreadSet — the monitor every reader fights for
int add() {
    long th = NativeThread.current();
    synchronized (this) {                              // ← global monitor per channel
        // ... grow array, find free slot, write thread handle, return slot index ...
    }
}

Past ~4 threads, this monitor's cache-line bouncing dominates the cost of pread64 itself, and throughput stops scaling. This is structurally tied to the Channel.close() interruption contract and unlikely to be removed from the JDK in the near term.
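The contended path is easy to reproduce outside Lucene with a plain shared FileChannel: every positioned read below goes through FileChannelImpl.readInternal and therefore through the NativeThreadSet monitor. This is an illustrative sketch (class name and parameters are made up for the demo, not taken from the issue's benchmark harness):

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

public class SharedChannelReads {

    // All threads share ONE FileChannel; each ch.read(buf, pos) registers the
    // calling thread in the channel's NativeThreadSet under its monitor,
    // which is the contention point described above.
    static long run(Path file, int threads, int readsPerThread, int readSize) throws Exception {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ);
             ExecutorService pool = Executors.newFixedThreadPool(threads)) {
            long size = ch.size();
            List<Future<Long>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    ByteBuffer buf = ByteBuffer.allocateDirect(readSize);
                    long total = 0;
                    for (int i = 0; i < readsPerThread; i++) {
                        buf.clear();
                        long pos = ThreadLocalRandom.current().nextLong(size - readSize + 1);
                        total += ch.read(buf, pos);  // positioned read on the shared channel
                    }
                    return total;
                }));
            }
            long sum = 0;
            for (Future<Long> f : results) sum += f.get();
            return sum;  // total bytes read across all threads
        }
    }

    public static void main(String[] args) throws Exception {
        Path p = Files.createTempFile("shared-channel", ".bin");
        Files.write(p, new byte[1 << 20]);           // 1 MiB of zeros
        long sum = run(p, 4, 200, 4096);
        System.out.println(sum);                     // typically 4 * 200 * 4096 on a local fs
        Files.delete(p);
    }
}
```

Profiling a run like this at 8+ threads is enough to surface NativeThreadSet.add in the hot path.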

3. Benchmark: native pread(2) via Panama FFI scales 4× higher

JMH on Java 25, Linux x86_64, NVMe; 1 GiB file, 16 KiB random reads, 16 reads/op. Throughput in ops/ms (higher is better):

Benchmark              1 thr   2 thr   4 thr    8 thr    16 thr   32 thr
ffiPread               371.8   633.8   1104.5   1854.5   2838.1   2862.5
fileChannelReadDirect  358.9   428.1   683.4    637.3    737.0    737.4
fileChannelReadHeap    318.1   495.4   668.2    596.0    757.4    712.8
  • 1 thread: FFI is ~4% faster — same syscall, less Java overhead.
  • FileChannel plateaus at ~700 ops/ms from 4 threads onward; profiling shows time inside NativeThreadSet's monitor.
  • FFI scales near-linearly to 16 threads, then hits the hardware ceiling at 32.

4. Proposal: PreadDirectory

A new Directory that performs random reads via pread(2) through Panama FFI:

  • POSIX → FFI pread. No NativeThreadSet, no monitor, stateless syscall.
  • Non-POSIX → fallback to NIOFSDirectory. Behavior is never worse than today.
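A minimal sketch of what the POSIX path could look like with the java.lang.foreign (FFM) API on Java 22+. This is not the proposed Lucene code; PreadFFI, readAt, and the error handling are illustrative, and a real implementation would keep the fd open across reads and map errno properly. The point is the shape of the call: a stateless pread(2) downcall with no per-read monitor.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.nio.file.Files;
import java.nio.file.Path;

public class PreadFFI {
    private static final Linker LINKER = Linker.nativeLinker();

    private static MethodHandle handle(String symbol, FunctionDescriptor desc) {
        return LINKER.downcallHandle(LINKER.defaultLookup().find(symbol).orElseThrow(), desc);
    }

    // int open(const char *path, int flags)  -- variadic mode arg omitted for O_RDONLY
    private static final MethodHandle OPEN = handle("open",
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.JAVA_INT));
    // ssize_t pread(int fd, void *buf, size_t count, off_t offset)
    private static final MethodHandle PREAD = handle("pread",
            FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.JAVA_INT,
                    ValueLayout.ADDRESS, ValueLayout.JAVA_LONG, ValueLayout.JAVA_LONG));
    // int close(int fd)
    private static final MethodHandle CLOSE = handle("close",
            FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.JAVA_INT));

    /** Reads up to {@code len} bytes at {@code offset}: a stateless syscall, no shared monitor. */
    static byte[] readAt(Path file, long offset, int len) throws Throwable {
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment cPath = arena.allocateFrom(file.toString()); // NUL-terminated C string
            int fd = (int) OPEN.invokeExact(cPath, 0 /* O_RDONLY on Linux */);
            if (fd < 0) throw new java.io.IOException("open failed: " + file);
            try {
                MemorySegment buf = arena.allocate(len);
                long n = (long) PREAD.invokeExact(fd, buf, (long) len, offset);
                if (n < 0) throw new java.io.IOException("pread failed");
                return buf.asSlice(0, n).toArray(ValueLayout.JAVA_BYTE);
            } finally {
                int rc = (int) CLOSE.invokeExact(fd);
            }
        }
    }

    public static void main(String[] args) throws Throwable {
        Path p = Files.createTempFile("pread-demo", ".bin");
        Files.write(p, "hello pread".getBytes());
        System.out.println(new String(readAt(p, 6, 5)));  // prints "pread"
        Files.delete(p);
    }
}
```

Running on JDK 24+ needs --enable-native-access to silence the restricted-method warning. Unlike FileChannel, a close() racing with an in-flight read would need its own coordination here; that is the trade-off for dropping the NativeThreadSet bookkeeping.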
