
Conversation


@imjalpreet imjalpreet commented Feb 16, 2025

Description

Upgrade to Hive 4.0.1

Depends on prestodb/presto-hive-apache#65 and prestodb/presto-hive-dwrf#12

Motivation and Context

#24435

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Upgrade to Hive 4.0.1

@imjalpreet imjalpreet self-assigned this Feb 16, 2025
@prestodb-ci prestodb-ci added the from:IBM (PR from IBM) label Feb 16, 2025
@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 3 times, most recently from 21c80f6 to 282e2c4 Compare April 24, 2025 11:46

ScrapCodes commented Apr 28, 2025

Looking closely at the reason why the tests are failing, beginning with the presto-orc module.
Progress so far:

  • The tests write content with Hive and read it back with Presto; in the case of timestamps, the values are read incorrectly by Presto. The reason is not clear yet.
    The content written by the tests is correct, because reading it with an external ORC reader such as the following Python script gives the correct result:
from pyarrow import orc

# Read the ORC file produced by the presto-orc test and print its contents
table2 = orc.read_table('/tmp/3420396529049254202/data.orc')
print(table2)
python read_orc.py 
pyarrow.Table
test: timestamp[ns]
----
test: [[1970-01-01 00:00:00.001000000,1970-01-01 00:00:00.003000000,1970-01-01 00:00:00.005000000,1970-01-01 00:00:00.007000000,1970-01-01 00:00:00.011000000,...,1970-01-01 00:00:00.001000000,1970-01-01 00:00:00.003000000,1970-01-01 00:00:00.005000000,1970-01-01 00:00:00.007000000,1970-01-01 00:00:00.011000000]]

Whereas Presto's ORC record reader reads it differently:
[Screenshot from 2025-04-28 17-35-06: values as read by the Presto ORC record reader]

  • These values, e.g. 21600001, show a 6-hour difference when converted to timestamps, e.g.

[Screenshot from 2025-04-28 17-47-33: the converted timestamp values showing the 6-hour difference]
This is the reason these tests fail (see the worked example at the end of this comment).

ORC files generated by the version of Hive in this PR and by master:
t.zip

  • Lastly, I ran the test above with the master version of the code, and the ORC file generated by the tests is read incorrectly by the Python script:
(.venv) prashant@prashant:/drive1/work/orc-reader$ python read_orc.py 
pyarrow.Table
test: timestamp[ns]
----
test: [[1969-12-31 16:59:59.001000000,1969-12-31 16:59:59.003000000,1969-12-31 16:59:59.005000000,1969-12-31 16:59:59.007000000,1969-12-31 16:59:59.011000000,...,1969-12-31 16:59:59.001000000,1969-12-31 16:59:59.003000000,1969-12-31 16:59:59.005000000,1969-12-31 16:59:59.007000000,1969-12-31 16:59:59.011000000]]

This is a clue that something has changed between the versions. Does this indicate that files written by the older version of Hive + ORC will give incorrect output?
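For reference, a minimal sketch of the arithmetic behind the 6-hour observation, assuming the raw value seen in the Presto reader is epoch milliseconds: 21600001 ms lands exactly 6 hours and 1 ms after the epoch.

from datetime import datetime, timedelta, timezone

# Assumption: the raw value seen in the Presto record reader is milliseconds since 1970-01-01 UTC.
raw_millis = 21600001

offset = timedelta(milliseconds=raw_millis)
print(offset)  # 6:00:00.001000 -> 6 hours + 1 ms

as_instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + offset
print(as_instant)  # 1970-01-01 06:00:00.001000+00:00, vs. the expected 1970-01-01 00:00:00.001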


ScrapCodes commented Apr 29, 2025

Another interesting find:
Reading the ORC file produced by Presto without the Hive 4 changes directly with Hive 3.0.0 produces incorrect results, whereas reading the file produced with the Hive 4 changes gives correct results.

prashant@prashant:/drive1/work/temp/apache-hive-3.0.0-bin$ HADOOP_HOME=../hadoop-3.3.6/ bin/hive --orcfiledump ../../orc-reader/data-without-hive4.orc 

Processing data file ../../orc-reader/data-without-hive4.orc [length: 931]
Structure for ../../orc-reader/data-without-hive4.orc
File Version: 0.12 with ORC_135
Rows: 30000
Compression: ZLIB
Compression size: 262144
Type: struct<test:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 30000 hasNull: false
    Column 1: count: 30000 hasNull: false min: 1969-12-31 16:00:00.001 max: 1969-12-31 16:00:00.017

File Statistics:
  Column 0: count: 30000 hasNull: false
  Column 1: count: 30000 hasNull: false min: 1969-12-31 16:00:00.001 max: 1969-12-31 16:00:00.017

Stripes:
  Stripe: offset: 3 data: 342 rows: 30000 tail: 78 index: 379
    Stream: column 0 section ROW_INDEX start: 3 length 17
    Stream: column 1 section ROW_INDEX start: 20 length 61
    Stream: column 1 section BLOOM_FILTER start: 81 length 175
    Stream: column 1 section BLOOM_FILTER_UTF8 start: 256 length 126
    Stream: column 1 section DATA start: 382 length 23
    Stream: column 1 section SECONDARY start: 405 length 319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 931 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________


There is a difference between the two files in the row indices, stream offsets, and stripe information, as well as the reported file version (a programmatic comparison of the two files is sketched after the second dump).

prashant@prashant:/drive1/work/temp/apache-hive-3.0.0-bin$ HADOOP_HOME=../hadoop-3.3.6/ bin/hive --orcfiledump ../../orc-reader/data.orc 
Processing data file ../../orc-reader/data.orc [length: 935]
Structure for ../../orc-reader/data.orc
File Version: 0.12 with FUTURE
Rows: 30000
Compression: ZLIB
Compression size: 262144
Type: struct<test:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 30000 hasNull: false
    Column 1: count: 30000 hasNull: false min: 1970-01-01 00:00:00.001 max: 1970-01-01 00:00:00.017

File Statistics:
  Column 0: count: 30000 hasNull: false
  Column 1: count: 30000 hasNull: false min: 1970-01-01 00:00:00.001 max: 1970-01-01 00:00:00.017

Stripes:
  Stripe: offset: 3 data: 342 rows: 30000 tail: 60 index: 392
    Stream: column 0 section ROW_INDEX start: 3 length 17
    Stream: column 1 section ROW_INDEX start: 20 length 57
    Stream: column 1 section BLOOM_FILTER start: 77 length 175
    Stream: column 1 section BLOOM_FILTER_UTF8 start: 252 length 143
    Stream: column 1 section DATA start: 395 length 23
    Stream: column 1 section SECONDARY start: 418 length 319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 935 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
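The same footer-level metadata can also be compared without Hive using pyarrow; a minimal sketch, assuming a recent pyarrow version that exposes these ORCFile properties (file names taken from the dumps above):

from pyarrow import orc

def summarize(path):
    # Print the footer-level fields that differ between the two dumps above.
    f = orc.ORCFile(path)
    print(path)
    print("  file version:    ", f.file_version)
    print("  writer / version:", f.writer, "/", f.writer_version)
    print("  software version:", f.software_version)
    print("  rows / stripes:  ", f.nrows, "/", f.nstripes)
    print("  compression:     ", f.compression)

summarize('data-without-hive4.orc')  # written without the Hive 4 changes
summarize('data.orc')                # written with the Hive 4 changes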

@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 4 times, most recently from 13d9947 to 5c04f64 Compare May 13, 2025 14:37
@prestodb-ci

@ethanyzhang imported this issue as lakehouse/presto #24571


ScrapCodes commented May 15, 2025

Test failures such as:
TestOrcReader>AbstractTestOrcReader.testLongDirect:158->AbstractTestOrcReader.testRoundTripNumeric:325 expected [1970-01-01 00:00:00.001] but found [1970-01-01 06:00:00.001]

lie in the way Presto reads and writes the data. Somehow, even before the data is interpreted as a timestamp type, i.e. while it is still a long, it has already been adjusted to the system timezone. Why this happens is not yet clear to me: when data written by Presto is read via an external ORC reader it has a 6-hour adjustment applied to it, and a similar thing happens when Presto reads data written by Hive.

There are no issues while reading other datatypes, e.g. Longs/Ints etc. The problem seems to be specific to timestamps only. @imjalpreet agrees with this.
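A minimal sketch of the suspected mechanism (my reading of the symptom, not Presto's actual code path): converting the wall-clock value 1970-01-01 00:00:00.001 to epoch milliseconds in a UTC-6 zone instead of UTC produces exactly the 6-hour shift seen in the failures.

from datetime import datetime, timedelta, timezone

# The wall-clock value the round-trip test expects.
wall_clock = datetime(1970, 1, 1, 0, 0, 0, 1000)

# UTC-6 is an assumption chosen to match the observed 6-hour shift, not a confirmed test setting.
utc_millis = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1000)
local_millis = int(wall_clock.replace(tzinfo=timezone(timedelta(hours=-6))).timestamp() * 1000)

print(utc_millis)    # 1
print(local_millis)  # 21600001 -> reads back as 1970-01-01 06:00:00.001 when treated as UTC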

@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 4 times, most recently from 5b896d6 to 6db405e Compare May 23, 2025 01:43
@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 3 times, most recently from 894b0aa to 5a2117a Compare June 2, 2025 21:47