
Reading a large file into memory fails with java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative #656

Open
megri opened this issue Feb 21, 2025 · 4 comments

megri commented Feb 21, 2025

Trying to read a JSON file of the shape Seq[Map[String, T]] using ujson.read(os.read.stream(path)) fails due to an internal read buffer overflow.

The file is 2.7 GiB in size.
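
For reference, the failing call looks roughly like this (a minimal sketch; the file name is a placeholder, not the actual path):

    // Placeholder path to a ~2.7 GiB JSON file shaped like an array of
    // objects (Seq[Map[String, T]] once decoded).
    val path = os.pwd / "big.json"

    // Streaming the file rather than slurping it does not help; the parse
    // fails with the ArrayIndexOutOfBoundsException shown below.
    val json: ujson.Value = ujson.read(os.read.stream(path))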

Stack trace:

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: arraycopy: length -84 is negative
        at java.base/java.lang.System.arraycopy(Native Method)
        at upickle.core.ByteBuilder.appendAll(ByteBuilder.scala:38)
        at upickle.core.BufferingByteParser.appendBytesToBuilder(BufferingByteParser.scala:164)
        at upickle.core.BufferingByteParser.appendBytesToBuilder$(BufferingByteParser.scala:24)
        at ujson.ByteParser.appendBytesToBuilder(ByteParser.scala:19)
        at ujson.ByteParser.parseStringToOutputBuilder(ByteParser.scala:638)
        at ujson.ByteParser.parseStringKey(ByteParser.scala:629)
        at ujson.ByteParser.parseNested(ByteParser.scala:390)
        at ujson.ByteParser.parseTopLevel0(ByteParser.scala:338)
        at ujson.ByteParser.parseTopLevel(ByteParser.scala:323)
        at ujson.ByteParser.parse(ByteParser.scala:72)
        at ujson.InputStreamParser$.transform(InputStreamParser.scala:26)
        at ujson.ReadableLowPri.ujson$ReadableLowPri$$anon$2$$_$transform$$anonfun$1(Readable.scala:38)
        at os.read$stream$$anon$1.readBytesThrough(ReadWriteOps.scala:253)
        at ujson.ReadableLowPri$$anon$2.transform(Readable.scala:38)
        at ujson.Readable$.transform(Readable.scala:15)
        at ujson.Readable$.transform(Readable.scala:14)
        at upickle.core.BufferedValue$.maybeSortKeysTransform(BufferedValue.scala:76)
        at ujson.package$.transform(package.scala:8)
        at ujson.package$.read$$anonfun$1(package.scala:15)
        at upickle.core.TraceVisitor$.withTrace(TraceVisitor.scala:18)
        at ujson.package$.read(package.scala:15)

lihaoyi (Member) commented Feb 21, 2025

The problem is that a lot of the parser code assumes it takes an index: Int, and with inputs larger than 2 GB that integer overflows (upickle/ujson/templates/ElemParser.scala, line 32 at fdb5a0d):

def checkSafeIndex(j: Int): Int = {

This can probably be fixed, but it might take some work:

  1. Naively converting the Int to a Long may hurt performance.
  2. Having a separate Long version of the API is possible, but would take some work to adjust the codegen templates.
  3. Making the parser not take an index: Int at all, or making the index relative to the current cursor, should also work, but would likely require a significant refactoring of the parser.

All three approaches would work if someone wants to take a crack at it.
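
To make the overflow concrete, here is a small standalone sketch (not upickle code; the numbers are illustrative) of how a byte position tracked as an Int wraps negative once the input passes Int.MaxValue, which is roughly 2 GiB:

    object IntIndexOverflow {
      def main(args: Array[String]): Unit = {
        // A byte offset ~2.7 GiB into the file, tracked correctly as a Long.
        val positionAsLong: Long = 2L * 1024 * 1024 * 1024 + 700L * 1024 * 1024

        // The same offset squeezed into an Int exceeds Int.MaxValue
        // (2147483647) and wraps around to a negative number.
        val positionAsInt: Int = positionAsLong.toInt

        println(positionAsLong) // 2881486848, a valid offset
        println(positionAsInt)  // negative: the Int has overflowed
      }
    }

Once a position (or a length derived from it) goes negative, calls like System.arraycopy reject it, which matches the "length -84 is negative" in the stack trace above.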

plokhotnyuk (Contributor) commented Feb 21, 2025

@megri You might find it helpful to consider using the scanJsonArrayFromStream method from jsoniter-scala. This approach allows for the efficient parsing of huge JSON arrays without requiring the entire input or parsed data to be held in memory. You can find more details about this method here.
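
A rough sketch of what that might look like, assuming jsoniter-scala-core and jsoniter-scala-macros are on the classpath; the file name, the element type (maps with string values), and the codec derivation are illustrative assumptions, not details from the original report:

    import com.github.plokhotnyuk.jsoniter_scala.core._
    import com.github.plokhotnyuk.jsoniter_scala.macros._
    import java.nio.file.{Files, Paths}

    // One array element; the report's data is an array of JSON objects,
    // modelled here as Map[String, String] purely for illustration.
    implicit val elemCodec: JsonValueCodec[Map[String, String]] =
      JsonCodecMaker.make[Map[String, String]]

    val in = Files.newInputStream(Paths.get("big.json")) // placeholder file name
    try {
      // Elements are decoded and handed to the callback one at a time;
      // returning true continues the scan, returning false stops it early.
      scanJsonArrayFromStream[Map[String, String]](in) { elem =>
        // process `elem` here without holding the whole array in memory
        true
      }
    } finally in.close()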

megri commented Feb 21, 2025

> @megri You might find it helpful to consider using the scanJsonArrayFromStream method from jsoniter-scala. This approach allows for the efficient parsing of huge JSON arrays without requiring the entire input or parsed data to be held in memory. You can find more details about this method here.

Fair enough, but I still think this is an issue in upickle/ujson ;)

megri commented Feb 21, 2025

> The problem is that a lot of the parser code assumes it takes an index: Int, and with inputs larger than 2 GB that integer overflows (upickle/ujson/templates/ElemParser.scala, line 32 at fdb5a0d):
>
>     def checkSafeIndex(j: Int): Int = {
>
> This can probably be fixed, but it might take some work:
>
>   1. Naively converting the Int to a Long may hurt performance.
>   2. Having a separate Long version of the API is possible, but would take some work to adjust the codegen templates.
>   3. Making the parser not take an index: Int at all, or making the index relative to the current cursor, should also work, but would likely require a significant refactoring of the parser.
>
> All three approaches would work if someone wants to take a crack at it.

I was looking around the library for some sort of "lazy" Visitor that would let me scan over the data without buffering it excessively, but from what I saw even the base parser assumes that a JSON array has a length. I wonder if unpickling performance would suffer greatly if no such assumption were made and the elements of objects/arrays were simply emitted to an underlying Visitor 🤔
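
For illustration, the closest streaming-style call available today seems to be transforming the input straight into a visitor that retains nothing (a sketch; the path is a placeholder, and on a >2 GiB file this would presumably still hit the same Int index overflow, since the limit is in the parser rather than the visitor):

    val path = os.pwd / "big.json" // placeholder path

    // upickle.core.NoOpVisitor discards everything it visits, so nothing is
    // buffered on the application side; the parse is driven purely for effect.
    ujson.transform(os.read.stream(path), upickle.core.NoOpVisitor)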
