Describe the bug
Running Apache Arrow / Parquet 22.0.0 on s390x (big-endian) exposed multiple endianness-related correctness bugs and some allocator/test fragility:
- Parquet encoders sometimes emit host-order bytes instead of canonical little-endian.
- Decoders may double-swap values or interpret host-order statistics incorrectly.
- INT96 timestamp handling mixes host-order and LE limbs, breaking comparison and decoding.
- Page index min/max expectations are built from host-order bytes.
- Float16 (half) serialization does not always produce canonical bytes.
- BYTE_ARRAY and dictionary 4-byte length prefixes are encoded inconsistently across paths.
- Bit-packing, VLQ, and ZigZag helpers assume little-endian layout and misbehave on big-endian.
- Synthetic OOM tests can trigger allocator-level aborts on large-memory s390x systems (e.g. mimalloc) before Arrow can raise OutOfMemory.
Together these can lead to:
- Non-portable or corrupted Parquet pages and statistics on big-endian hosts.
- Incorrect page index min/max and statistics (wrong sort order, incorrect filtering).
- Mis-decoded INT96 timestamps.
- Crashes in test OOM paths on s390x.
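The first failure mode above (host-order bytes leaking into pages) can be illustrated with a 4-byte BYTE_ARRAY length prefix. This is a minimal sketch, not Arrow's actual code: the function names are hypothetical and only show why a raw `memcpy` of an integer is non-portable while shift-based byte extraction is not.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical names for illustration; these are not Arrow/Parquet functions.

// Host-order write: copies the in-memory representation verbatim.
// On a little-endian host, length 5 becomes 05 00 00 00.
// On s390x it becomes 00 00 00 05, which a conforming reader
// interprets as 0x05000000.
inline void WriteLengthPrefixHostOrder(uint32_t length, uint8_t* out) {
  std::memcpy(out, &length, sizeof(length));
}

// Endian-independent write: shifts select bytes by numeric significance,
// so the output is 05 00 00 00 on every host.
inline void WriteLengthPrefixLE(uint32_t length, uint8_t* out) {
  out[0] = static_cast<uint8_t>(length);
  out[1] = static_cast<uint8_t>(length >> 8);
  out[2] = static_cast<uint8_t>(length >> 16);
  out[3] = static_cast<uint8_t>(length >> 24);
}

// Endian-independent read, the inverse of WriteLengthPrefixLE.
inline uint32_t ReadLengthPrefixLE(const uint8_t* in) {
  return static_cast<uint32_t>(in[0]) |
         (static_cast<uint32_t>(in[1]) << 8) |
         (static_cast<uint32_t>(in[2]) << 16) |
         (static_cast<uint32_t>(in[3]) << 24);
}

// Small self-check: the LE path round-trips regardless of host byte order.
inline void CheckLengthPrefixRoundTrip() {
  uint8_t buf[4];
  WriteLengthPrefixLE(5, buf);
  assert(buf[0] == 5 && buf[3] == 0);
  assert(ReadLengthPrefixLE(buf) == 5);
}
```

A double swap arises the same way in reverse: if a decoder byte-swaps a value that was already read with a shift-based helper, the result is back in the wrong order.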
To Reproduce
On an s390x (big-endian) system:
- Build Arrow C++ (22.0.0) with Parquet enabled.
- Run relevant gtest suites (e.g. parquet-reader/writer, parquet-arrow reader/writer, page index tests, chunker/Float16 tests, bit-utility tests, IPC message tests, and allocator/buffer OOM tests).
- Observe:
  - Parquet tests failing due to mismatched bytes (stats, page index, INT96, Float16, RLE prefixes, dictionary pages).
  - IPC byte-identical test comparing against a macOS/ARM snapshot that does not match s390x output.
  - OOM tests aborting when mimalloc is used on large-memory s390x environments.
Expected behavior
- Parquet encoders/decoders should consistently read/write canonical little-endian bytes for all primitives and metadata, independently of host endianness.
- INT96 timestamps, page index min/max, Float16, and binary/dictionary pages should round-trip correctly on s390x and match the expected canonical layout.
- IPC “byte identical” tests should compare against a canonical snapshot that matches big-endian builds as well.
- Synthetic OOM tests should exercise Arrow’s OutOfMemory behavior without provoking allocator-level hard aborts.
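As an illustration of the INT96 expectation, the sketch below encodes and decodes a timestamp using the conventional 12-byte layout (8 bytes of nanoseconds within the day followed by 4 bytes of Julian day, both little-endian). The struct and function names are hypothetical and are not taken from the PR; the point is only that shift-based encoding gives the same bytes on every host and that ordering must compare decoded fields rather than raw limbs.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical types and helpers for illustration only.
struct Int96Timestamp {
  uint64_t nanos_of_day;  // nanoseconds since midnight
  uint32_t julian_day;    // Julian day number (2440588 is 1970-01-01)
};

// Writes 12 canonical bytes: nanos (8 bytes LE), then Julian day (4 bytes LE).
inline void EncodeInt96LE(const Int96Timestamp& ts, uint8_t* out) {
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<uint8_t>(ts.nanos_of_day >> (8 * i));
  }
  for (int i = 0; i < 4; ++i) {
    out[8 + i] = static_cast<uint8_t>(ts.julian_day >> (8 * i));
  }
}

// Reads the same 12-byte layout back, independent of host byte order.
inline Int96Timestamp DecodeInt96LE(const uint8_t* in) {
  Int96Timestamp ts{0, 0};
  for (int i = 0; i < 8; ++i) {
    ts.nanos_of_day |= static_cast<uint64_t>(in[i]) << (8 * i);
  }
  for (int i = 0; i < 4; ++i) {
    ts.julian_day |= static_cast<uint32_t>(in[8 + i]) << (8 * i);
  }
  return ts;
}

// Chronological ordering: compare the day first, then the time of day.
inline bool Int96Less(const Int96Timestamp& a, const Int96Timestamp& b) {
  if (a.julian_day != b.julian_day) return a.julian_day < b.julian_day;
  return a.nanos_of_day < b.nanos_of_day;
}

// Small self-check: encode/decode round-trips on any host.
inline void CheckInt96RoundTrip() {
  uint8_t buf[12];
  EncodeInt96LE({1, 2440588}, buf);
  Int96Timestamp back = DecodeInt96LE(buf);
  assert(back.nanos_of_day == 1 && back.julian_day == 2440588);
}
```

Comparing the 12 raw bytes lexicographically would not give chronological order, because the most significant field (the day) is stored last.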
What this PR changes
The associated PR:
- Introduces an `ARROW_ENSURE_S390X_ENDIANNESS` CMake option (default ON on s390x) so Parquet I/O always uses explicit little-endian conversions on big-endian hosts.
- Adds `parquet/endian_internal.h` with portable helpers for encoding/decoding primitives (integers, floats, INT96, etc.) in little-endian format regardless of host architecture.
- Updates Parquet encoders/decoders (Plain, ByteStreamSplit, Dictionary, DeltaBitPack, FLBA) to use these helpers, including Arrow array fast-paths.
- Ensures level headers, RLE prefixes, and BYTE_ARRAY length prefixes are always written/read as little-endian.
- Canonicalizes INT96 timestamp handling and test fixtures to little-endian limb layout.
- Fixes page index and Float16 expectations so tests compare against canonical LE bytes.
- Adjusts geospatial WKB parsing and related tests to respect a unified endianness shim.
- Updates IPC byte-identical tests to use a canonical s390x snapshot and emit hex diffs on mismatch.
- Clamps synthetic OOM allocations and skips mimalloc-specific OOM tests to avoid allocator aborts while still exercising Arrow’s OutOfMemory behavior.
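To give a sense of what "portable helpers" for primitives can look like, here is a minimal sketch. `StoreLE`, `LoadLE`, and `UIntOfSize` are hypothetical names and are not the contents of `parquet/endian_internal.h`; the sketch also assumes floating-point values share the host's integer byte order, which holds on s390x and on common little-endian platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; not the PR's actual API.

// Maps a byte width to the unsigned integer type of that width.
template <std::size_t N> struct UIntOfSize;
template <> struct UIntOfSize<1> { using type = uint8_t; };
template <> struct UIntOfSize<2> { using type = uint16_t; };
template <> struct UIntOfSize<4> { using type = uint32_t; };
template <> struct UIntOfSize<8> { using type = uint64_t; };

// Writes `value` as sizeof(T) little-endian bytes on any host.
template <typename T>
void StoreLE(T value, uint8_t* out) {
  using U = typename UIntOfSize<sizeof(T)>::type;
  U bits;
  std::memcpy(&bits, &value, sizeof(T));  // reinterpret as same-width integer
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    out[i] = static_cast<uint8_t>(bits >> (8 * i));  // least significant first
  }
}

// Reads sizeof(T) little-endian bytes into a host-order value.
template <typename T>
T LoadLE(const uint8_t* in) {
  using U = typename UIntOfSize<sizeof(T)>::type;
  U bits = 0;
  for (std::size_t i = 0; i < sizeof(T); ++i) {
    bits = static_cast<U>(bits | (static_cast<U>(in[i]) << (8 * i)));
  }
  T value;
  std::memcpy(&value, &bits, sizeof(T));
  return value;
}

// Small self-check covering an integer and a float.
inline void CheckStoreLoadLE() {
  uint8_t buf[8];
  StoreLE<uint32_t>(0x01020304u, buf);
  assert(buf[0] == 0x04 && buf[3] == 0x01);
  StoreLE<float>(1.0f, buf);
  assert(LoadLE<float>(buf) == 1.0f);
}
```

Because the helpers select bytes by shifting rather than by address, they need no `#ifdef` on host endianness; a production version would likely add a `memcpy` fast path on little-endian hosts.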
Environment
- Architecture: s390x (big-endian)
- Component(s): C++, Parquet
- Arrow version: 22.0.0 (and current `main` while developing the fix)
- Allocators: default pool, mimalloc (where OOM tests were unstable)
Additional context
The goal is to make Arrow/Parquet fully correct and stable on s390x by enforcing canonical little-endian Parquet I/O and adjusting tests so that all suites pass on big-endian platforms without behavior differences or crashes.