[C++][Parquet] Fix endianness and test failures on s390x (big-endian) #48213

@k8ika0s

Describe the bug

Running the Apache Arrow 22.0.0 C++ Parquet test suites on s390x (big-endian) exposed multiple endianness-related correctness bugs, plus OOM tests that are fragile under some allocators:

  • Parquet encoders sometimes emit host-order bytes instead of canonical little-endian.
  • Decoders may double-swap values or interpret host-order statistics incorrectly.
  • INT96 timestamp handling mixes host-order and LE limbs, breaking comparison and decoding.
  • Page index min/max expectations are built from host-order bytes.
  • Float16 (half) serialization does not always produce canonical bytes.
  • BYTE_ARRAY and dictionary 4-byte length prefixes are encoded inconsistently across paths.
  • Bit-packing, VLQ, and ZigZag helpers assume little-endian layout and misbehave on big-endian.
  • Synthetic OOM tests can trigger allocator-level aborts on large-memory s390x systems (e.g. mimalloc) before Arrow can raise OutOfMemory.
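
One way to avoid emitting host-order bytes is to build the output with shifts rather than copying the in-memory representation. The sketch below is illustrative only: `StoreLittleEndian` and `LoadLittleEndian` are hypothetical names, not the helpers from the PR or an existing Arrow API, and they only handle unsigned integers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical helper: write `value` as little-endian bytes.
// Shifts operate on the numeric value, so the output is the same on
// little-endian and big-endian hosts.
template <typename T>
void StoreLittleEndian(T value, std::uint8_t* out) {
  static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");
  assert(out != nullptr);
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    out[i] = static_cast<std::uint8_t>(value >> (8 * i));
  }
}

// Hypothetical helper: read a little-endian value back, again without
// depending on host byte order.
template <typename T>
T LoadLittleEndian(const std::uint8_t* in) {
  static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");
  assert(in != nullptr);
  T value = 0;
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    value = static_cast<T>(value | (static_cast<T>(in[i]) << (8 * i)));
  }
  return value;
}
```

Because neither function inspects memory layout, no byte swap is applied after loading, which is one way to rule out the double-swap failure mode listed above.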

Together these can lead to:

  • Non-portable or corrupted Parquet pages and statistics on big-endian hosts.
  • Incorrect page index min/max and statistics, which can produce wrong sort order and incorrect filtering.
  • Mis-decoded INT96 timestamps.
  • Crashes in test OOM paths on s390x.
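
To make the INT96 case concrete: the 12-byte value stores nanoseconds within the day in the first 8 bytes and the Julian day number in the last 4, both little-endian. The sketch below decodes and encodes that layout byte by byte. `Int96Timestamp`, `DecodeInt96`, `EncodeInt96`, and `ToUnixSeconds` are hypothetical names for illustration, not Arrow's types or the PR's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical decoded form of a Parquet INT96 timestamp.
struct Int96Timestamp {
  std::uint32_t julian_day;    // Julian day number
  std::uint64_t nanos_of_day;  // nanoseconds since midnight
};

// Julian day number of 1970-01-01.
constexpr std::int64_t kJulianDayOfUnixEpoch = 2440588;

// Decode 12 little-endian bytes: bytes 0..7 are nanos, bytes 8..11 are the day.
Int96Timestamp DecodeInt96(const std::uint8_t* bytes) {
  assert(bytes != nullptr);
  std::uint64_t nanos = 0;
  for (int i = 0; i < 8; ++i) {
    nanos |= static_cast<std::uint64_t>(bytes[i]) << (8 * i);
  }
  std::uint32_t day = 0;
  for (int i = 0; i < 4; ++i) {
    day |= static_cast<std::uint32_t>(bytes[8 + i]) << (8 * i);
  }
  return {day, nanos};
}

// Encode back to the same 12-byte layout, independent of host byte order.
void EncodeInt96(const Int96Timestamp& ts, std::uint8_t* out) {
  assert(out != nullptr);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<std::uint8_t>(ts.nanos_of_day >> (8 * i));
  }
  for (int i = 0; i < 4; ++i) {
    out[8 + i] = static_cast<std::uint8_t>(ts.julian_day >> (8 * i));
  }
}

// Whole seconds since the Unix epoch (sub-second nanos are truncated).
std::int64_t ToUnixSeconds(const Int96Timestamp& ts) {
  const std::int64_t days =
      static_cast<std::int64_t>(ts.julian_day) - kJulianDayOfUnixEpoch;
  return days * 86400 +
         static_cast<std::int64_t>(ts.nanos_of_day / 1000000000ull);
}
```

Comparing decoded `(julian_day, nanos_of_day)` pairs rather than raw limbs keeps ordering independent of how the limbs happen to sit in host memory.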

To Reproduce

On an s390x (big-endian) system:

  1. Build Arrow C++ (22.0.0) with Parquet enabled.
  2. Run relevant gtest suites (e.g. parquet-reader/writer, parquet-arrow reader/writer, page index tests, chunker/Float16 tests, bit-utility tests, IPC message tests, and allocator/buffer OOM tests).
  3. Observe:
    • Parquet tests failing due to mismatched bytes (stats, page index, INT96, Float16, RLE prefixes, dictionary pages).
    • IPC byte-identical test comparing against a macOS/ARM snapshot that does not match s390x output.
    • OOM tests aborting when mimalloc is used on large-memory s390x environments.

Expected behavior

  • Parquet encoders/decoders should consistently read/write canonical little-endian bytes for all primitives and metadata, independently of host endianness.
  • INT96 timestamps, page index min/max, Float16, and binary/dictionary pages should round-trip correctly on s390x and match the expected canonical layout.
  • IPC “byte identical” tests should compare against a canonical snapshot that matches big-endian builds as well.
  • Synthetic OOM tests should exercise Arrow’s OutOfMemory behavior without provoking allocator-level hard aborts.
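
As an illustration of the expected round-trip behavior, PLAIN-encoded BYTE_ARRAY values are a 4-byte little-endian length followed by the raw bytes. The sketch below writes and reads that prefix explicitly; `AppendByteArray` and `ReadByteArray` are hypothetical names and not the Parquet C++ encoder API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: append one value as <4-byte LE length><bytes>.
void AppendByteArray(const std::string& value, std::vector<std::uint8_t>* out) {
  assert(out != nullptr);
  assert(value.size() <= 0xFFFFFFFFu);
  const std::uint32_t len = static_cast<std::uint32_t>(value.size());
  for (int i = 0; i < 4; ++i) {
    out->push_back(static_cast<std::uint8_t>(len >> (8 * i)));
  }
  out->insert(out->end(), value.begin(), value.end());
}

// Hypothetical helper: read one value starting at *offset and advance it.
// Returns false if the buffer is too short for the prefix or the payload.
bool ReadByteArray(const std::vector<std::uint8_t>& buf, std::size_t* offset,
                   std::string* value) {
  if (*offset > buf.size() || buf.size() - *offset < 4) return false;
  std::uint32_t len = 0;
  for (int i = 0; i < 4; ++i) {
    len |= static_cast<std::uint32_t>(buf[*offset + i]) << (8 * i);
  }
  const std::size_t start = *offset + 4;
  if (buf.size() - start < len) return false;
  value->assign(buf.begin() + start, buf.begin() + start + len);
  *offset = start + len;
  return true;
}
```

A test built this way can assert on the exact prefix bytes (for example `03 00 00 00` for a 3-byte value) and get the same answer on x86-64 and s390x.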

What this PR changes

The associated PR:

  • Introduces an ARROW_ENSURE_S390X_ENDIANNESS CMake option (default ON on s390x) so Parquet I/O always uses explicit little-endian conversions on big-endian hosts.
  • Adds parquet/endian_internal.h with portable helpers for encoding/decoding primitives (integers, floats, INT96, etc.) in little-endian format regardless of host architecture.
  • Updates Parquet encoders/decoders (Plain, ByteStreamSplit, Dictionary, DeltaBitPack, FLBA) to use these helpers, including Arrow array fast-paths.
  • Ensures level headers, RLE prefixes, and BYTE_ARRAY length prefixes are always written/read as little-endian.
  • Canonicalizes INT96 timestamp handling and test fixtures to little-endian limb layout.
  • Fixes page index and Float16 expectations so tests compare against canonical LE bytes.
  • Adjusts geospatial WKB parsing and related tests to respect a unified endianness shim.
  • Updates IPC byte-identical tests to use a canonical s390x snapshot and emit hex diffs on mismatch.
  • Clamps synthetic OOM allocations and skips mimalloc-specific OOM tests to avoid allocator aborts while still exercising Arrow’s OutOfMemory behavior.
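
For the variable-length integer paths mentioned above, a host-independent formulation works on values rather than on memory. The following is a minimal sketch of ZigZag plus ULEB128-style varint coding, assuming that is the variable-length format in question; the function names are hypothetical and are not taken from the PR or from Arrow's bit utilities.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map signed to unsigned so small magnitudes get small codes:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
std::uint64_t ZigZagEncode(std::int64_t v) {
  const std::uint64_t sign_mask = v < 0 ? ~std::uint64_t{0} : std::uint64_t{0};
  return (static_cast<std::uint64_t>(v) << 1) ^ sign_mask;
}

std::int64_t ZigZagDecode(std::uint64_t u) {
  // (0 - (u & 1)) is all ones when the low bit is set, else zero.
  return static_cast<std::int64_t>((u >> 1) ^ (std::uint64_t{0} - (u & 1)));
}

// Append `v` as a varint: 7 payload bits per byte, low group first,
// high bit set on every byte except the last.
void AppendVarint(std::uint64_t v, std::vector<std::uint8_t>* out) {
  assert(out != nullptr);
  while (v >= 0x80) {
    out->push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  out->push_back(static_cast<std::uint8_t>(v));
}

// Read a varint starting at *offset and advance it.
// Returns false on truncated input or more than 10 bytes.
bool ReadVarint(const std::vector<std::uint8_t>& buf, std::size_t* offset,
                std::uint64_t* value) {
  std::uint64_t result = 0;
  for (int shift = 0; shift < 64; shift += 7) {
    if (*offset >= buf.size()) return false;
    const std::uint8_t byte = buf[(*offset)++];
    result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *value = result;
      return true;
    }
  }
  return false;
}
```

Since every step is arithmetic on integers and single bytes, the encoded stream does not change with host endianness.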

Environment

  • Architecture: s390x (big-endian)
  • Component(s): C++, Parquet
  • Arrow version: 22.0.0 (and current main while developing the fix)
  • Allocators: default pool, mimalloc (where OOM tests were unstable)

Additional context

The goal is to make Arrow/Parquet fully correct and stable on s390x by enforcing canonical little-endian Parquet I/O and adjusting tests so that all suites pass on big-endian platforms without behavior differences or crashes.

Component(s)

C++, Parquet
