
Reading a large file into memory fails with java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative #656

Open
megri opened this issue Feb 21, 2025 · 4 comments

megri commented Feb 21, 2025

Trying to read a JSON file of the shape Seq[Map[String, T]] using ujson.read(os.read.stream(path)) fails due to an internal read buffer overflow.

The file is 2.7 GiB in size.
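
For reference, the failing call looks roughly like this (a minimal sketch; the file name is a placeholder, not the actual path):

    // Placeholder path to a ~2.7 GiB JSON file shaped like an array of
    // objects (Seq[Map[String, T]] once decoded).
    val path = os.pwd / "big.json"

    // Streaming the file rather than slurping it does not help; the parse
    // fails with the ArrayIndexOutOfBoundsException shown below.
    val json: ujson.Value = ujson.read(os.read.stream(path))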

Stack trace:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative
        at java.base/java.lang.System.arraycopy(Native Method)
        at upickle.core.ByteBuilder.appendAll(ByteBuilder.scala:38)
        at upickle.core.BufferingByteParser.appendBytesToBuilder(BufferingByteParser.scala:164)
        at upickle.core.BufferingByteParser.appendBytesToBuilder$(BufferingByteParser.scala:24)
        at ujson.ByteParser.appendBytesToBuilder(ByteParser.scala:19)
        at ujson.ByteParser.parseStringToOutputBuilder(ByteParser.scala:638)
        at ujson.ByteParser.parseStringKey(ByteParser.scala:629)
        at ujson.ByteParser.parseNested(ByteParser.scala:390)
        at ujson.ByteParser.parseTopLevel0(ByteParser.scala:338)
        at ujson.ByteParser.parseTopLevel(ByteParser.scala:323)
        at ujson.ByteParser.parse(ByteParser.scala:72)
        at ujson.InputStreamParser$.transform(InputStreamParser.scala:26)
        at ujson.ReadableLowPri.ujson$ReadableLowPri$$anon$2$$_$transform$$anonfun$1(Readable.scala:38)
        at os.read$stream$$anon$1.readBytesThrough(ReadWriteOps.scala:253)
        at ujson.ReadableLowPri$$anon$2.transform(Readable.scala:38)
        at ujson.Readable$.transform(Readable.scala:15)
        at ujson.Readable$.transform(Readable.scala:14)
        at upickle.core.BufferedValue$.maybeSortKeysTransform(BufferedValue.scala:76)
        at ujson.package$.transform(package.scala:8)
        at ujson.package$.read$$anonfun$1(package.scala:15)
        at upickle.core.TraceVisitor$.withTrace(TraceVisitor.scala:18)
        at ujson.package$.read(package.scala:15)

lihaoyi (Member) commented Feb 21, 2025

The problem is that a lot of the parser code assumes it takes an index: Int, and with inputs larger than 2 GB that integer overflows (upickle/ujson/templates/ElemParser.scala, line 32 at fdb5a0d):

def checkSafeIndex(j: Int): Int = {

This can probably be fixed, but it might take some work:

  1. Naively converting the Int to a Long may hurt performance.
  2. Having a separate Long version of the API is possible, but would take some work to adjust the codegen templates.
  3. Making the parser not take an index: Int at all, or making the index relative to the current cursor, should also work, but would likely require a significant refactoring of the parser.

All three approaches would work if someone wants to take a crack at it.
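
To make the overflow concrete, here is a small standalone sketch (not upickle code; the numbers are illustrative) of how a byte position tracked as an Int wraps negative once the input passes Int.MaxValue, which is roughly 2 GiB:

    object IntIndexOverflow {
      def main(args: Array[String]): Unit = {
        // A byte offset ~2.7 GiB into the file, tracked correctly as a Long.
        val positionAsLong: Long = 2L * 1024 * 1024 * 1024 + 700L * 1024 * 1024

        // The same offset squeezed into an Int exceeds Int.MaxValue
        // (2147483647) and wraps around to a negative number.
        val positionAsInt: Int = positionAsLong.toInt

        println(positionAsLong) // 2881486848, a valid offset
        println(positionAsInt)  // negative: the Int has overflowed
      }
    }

Once a position (or a length derived from it) goes negative, calls like System.arraycopy reject it, which matches the "length -84 is negative" in the stack trace above.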

plokhotnyuk (Contributor) commented Feb 21, 2025

@megri You might find it helpful to consider using the scanJsonArrayFromStream method from jsoniter-scala. This approach allows for the efficient parsing of huge JSON arrays without requiring the entire input or parsed data to be held in memory. You can find more details about this method here.
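
A rough sketch of what that might look like, assuming jsoniter-scala-core and jsoniter-scala-macros are on the classpath; the file name, the element type (maps with string values), and the codec derivation are illustrative assumptions, not details from the original report:

    import com.github.plokhotnyuk.jsoniter_scala.core._
    import com.github.plokhotnyuk.jsoniter_scala.macros._
    import java.nio.file.{Files, Paths}

    // One array element; the report's data is an array of JSON objects,
    // modelled here as Map[String, String] purely for illustration.
    implicit val elemCodec: JsonValueCodec[Map[String, String]] =
      JsonCodecMaker.make[Map[String, String]]

    val in = Files.newInputStream(Paths.get("big.json")) // placeholder file name
    try {
      // Elements are decoded and handed to the callback one at a time;
      // returning true continues the scan, returning false stops it early.
      scanJsonArrayFromStream[Map[String, String]](in) { elem =>
        // process `elem` here without holding the whole array in memory
        true
      }
    } finally in.close()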

megri commented Feb 21, 2025

> @megri You might find it helpful to consider using the scanJsonArrayFromStream method from jsoniter-scala. This approach allows for the efficient parsing of huge JSON arrays without requiring the entire input or parsed data to be held in memory. You can find more details about this method here.

Fair enough, but I still think this is an issue in upickle/ujson ;)

megri commented Feb 21, 2025

> The problem is that a lot of the parser code assumes it takes an index: Int, and with inputs larger than 2 GB that integer overflows (upickle/ujson/templates/ElemParser.scala, line 32 at fdb5a0d):
>
>     def checkSafeIndex(j: Int): Int = {
>
> This can probably be fixed, but it might take some work:
>
>   1. Naively converting the Int to a Long may hurt performance.
>   2. Having a separate Long version of the API is possible, but would take some work to adjust the codegen templates.
>   3. Making the parser not take an index: Int at all, or making the index relative to the current cursor, should also work, but would likely require a significant refactoring of the parser.
>
> All three approaches would work if someone wants to take a crack at it.

I was looking around the library for some sort of "lazy" Visitor that would let me scan over the data without buffering it excessively, but from what I saw even the base parser assumes that a JSON array has a length. I wonder if unpickling performance would suffer greatly if no such assumption were made and the elements of objects/arrays were simply emitted to an underlying Visitor 🤔
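
For illustration, the closest streaming-style call available today seems to be transforming the input straight into a visitor that retains nothing (a sketch; the path is a placeholder, and on a >2 GiB file this would presumably still hit the same Int index overflow, since the limit is in the parser rather than the visitor):

    val path = os.pwd / "big.json" // placeholder path

    // upickle.core.NoOpVisitor discards everything it visits, so nothing is
    // buffered on the application side; the parse is driven purely for effect.
    ujson.transform(os.read.stream(path), upickle.core.NoOpVisitor)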
