
Conversation


@imjalpreet imjalpreet commented Feb 16, 2025

Description

Upgrade to Hive 4.0.1

Depends on prestodb/presto-hive-apache#65 and prestodb/presto-hive-dwrf#12

Motivation and Context

#24435

Impact

Test Plan

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with their default values), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

Hive Connector Changes
* Upgrade to Hive 4.0.1

@imjalpreet imjalpreet self-assigned this Feb 16, 2025
@prestodb-ci prestodb-ci added the from:IBM (PR from IBM) label Feb 16, 2025
@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 3 times, most recently from 21c80f6 to 282e2c4 Compare April 24, 2025 11:46

ScrapCodes commented Apr 28, 2025

Looking closely at the reason why the tests are failing, beginning with the presto-orc module.
Progress so far:

  • The tests write content with Hive and read it back with Presto; in the case of timestamps, the values are read incorrectly by Presto. The reason is not clear yet.
    The content written by the tests is correct, because reading it with an external ORC reader such as the following Python script gives the correct result:
from pyarrow import orc

# Read the ORC file produced by the presto-orc test and print its contents
table2 = orc.read_table('/tmp/3420396529049254202/data.orc')
print(table2)
python read_orc.py 
pyarrow.Table
test: timestamp[ns]
----
test: [[1970-01-01 00:00:00.001000000,1970-01-01 00:00:00.003000000,1970-01-01 00:00:00.005000000,1970-01-01 00:00:00.007000000,1970-01-01 00:00:00.011000000,...,1970-01-01 00:00:00.001000000,1970-01-01 00:00:00.003000000,1970-01-01 00:00:00.005000000,1970-01-01 00:00:00.007000000,1970-01-01 00:00:00.011000000]]

Whereas Presto's ORC record reader reads it differently:
[Screenshot from 2025-04-28 17-35-06: values as read by the Presto ORC record reader]

  • These values, e.g. 21600001, show a 6-hour difference when converted to timestamps, e.g.

[Screenshot from 2025-04-28 17-47-33: the converted timestamp values showing the 6-hour difference]
This is the reason these tests fail (see the worked example at the end of this comment).

ORC files generated by the version of Hive in this PR and by master:
t.zip

  • Lastly, I ran the test above with the master version of the code, and the ORC file generated by the tests is read incorrectly by the Python script:
(.venv) prashant@prashant:/drive1/work/orc-reader$ python read_orc.py 
pyarrow.Table
test: timestamp[ns]
----
test: [[1969-12-31 16:59:59.001000000,1969-12-31 16:59:59.003000000,1969-12-31 16:59:59.005000000,1969-12-31 16:59:59.007000000,1969-12-31 16:59:59.011000000,...,1969-12-31 16:59:59.001000000,1969-12-31 16:59:59.003000000,1969-12-31 16:59:59.005000000,1969-12-31 16:59:59.007000000,1969-12-31 16:59:59.011000000]]

This is a clue that something has changed between the versions. Does this indicate that files written by the older version of Hive + ORC will give incorrect output?
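For reference, a minimal sketch of the arithmetic behind the 6-hour observation, assuming the raw value seen in the Presto reader is epoch milliseconds: 21600001 ms lands exactly 6 hours and 1 ms after the epoch.

from datetime import datetime, timedelta, timezone

# Assumption: the raw value seen in the Presto record reader is milliseconds since 1970-01-01 UTC.
raw_millis = 21600001

offset = timedelta(milliseconds=raw_millis)
print(offset)  # 6:00:00.001000 -> 6 hours + 1 ms

as_instant = datetime(1970, 1, 1, tzinfo=timezone.utc) + offset
print(as_instant)  # 1970-01-01 06:00:00.001000+00:00, vs. the expected 1970-01-01 00:00:00.001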


ScrapCodes commented Apr 29, 2025

Another interesting find:
Reading the ORC file produced by Presto without the Hive 4 changes directly with Hive 3.0.0 produces incorrect results, whereas reading the file produced with the Hive 4 changes gives correct results.

prashant@prashant:/drive1/work/temp/apache-hive-3.0.0-bin$ HADOOP_HOME=../hadoop-3.3.6/ bin/hive --orcfiledump ../../orc-reader/data-without-hive4.orc 

Processing data file ../../orc-reader/data-without-hive4.orc [length: 931]
Structure for ../../orc-reader/data-without-hive4.orc
File Version: 0.12 with ORC_135
Rows: 30000
Compression: ZLIB
Compression size: 262144
Type: struct<test:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 30000 hasNull: false
    Column 1: count: 30000 hasNull: false min: 1969-12-31 16:00:00.001 max: 1969-12-31 16:00:00.017

File Statistics:
  Column 0: count: 30000 hasNull: false
  Column 1: count: 30000 hasNull: false min: 1969-12-31 16:00:00.001 max: 1969-12-31 16:00:00.017

Stripes:
  Stripe: offset: 3 data: 342 rows: 30000 tail: 78 index: 379
    Stream: column 0 section ROW_INDEX start: 3 length 17
    Stream: column 1 section ROW_INDEX start: 20 length 61
    Stream: column 1 section BLOOM_FILTER start: 81 length 175
    Stream: column 1 section BLOOM_FILTER_UTF8 start: 256 length 126
    Stream: column 1 section DATA start: 382 length 23
    Stream: column 1 section SECONDARY start: 405 length 319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 931 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________


There is a difference between the two files in the row indices, stream offsets, and stripe information, as well as the reported file version (a programmatic comparison of the two files is sketched after the second dump).

prashant@prashant:/drive1/work/temp/apache-hive-3.0.0-bin$ HADOOP_HOME=../hadoop-3.3.6/ bin/hive --orcfiledump ../../orc-reader/data.orc 
Processing data file ../../orc-reader/data.orc [length: 935]
Structure for ../../orc-reader/data.orc
File Version: 0.12 with FUTURE
Rows: 30000
Compression: ZLIB
Compression size: 262144
Type: struct<test:timestamp>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 30000 hasNull: false
    Column 1: count: 30000 hasNull: false min: 1970-01-01 00:00:00.001 max: 1970-01-01 00:00:00.017

File Statistics:
  Column 0: count: 30000 hasNull: false
  Column 1: count: 30000 hasNull: false min: 1970-01-01 00:00:00.001 max: 1970-01-01 00:00:00.017

Stripes:
  Stripe: offset: 3 data: 342 rows: 30000 tail: 60 index: 392
    Stream: column 0 section ROW_INDEX start: 3 length 17
    Stream: column 1 section ROW_INDEX start: 20 length 57
    Stream: column 1 section BLOOM_FILTER start: 77 length 175
    Stream: column 1 section BLOOM_FILTER_UTF8 start: 252 length 143
    Stream: column 1 section DATA start: 395 length 23
    Stream: column 1 section SECONDARY start: 418 length 319
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 935 bytes
Padding length: 0 bytes
Padding ratio: 0%
________________________________________________________________________________________________________________________
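The same footer-level metadata can also be compared without Hive using pyarrow; a minimal sketch, assuming a recent pyarrow version that exposes these ORCFile properties (file names taken from the dumps above):

from pyarrow import orc

def summarize(path):
    # Print the footer-level fields that differ between the two dumps above.
    f = orc.ORCFile(path)
    print(path)
    print("  file version:    ", f.file_version)
    print("  writer / version:", f.writer, "/", f.writer_version)
    print("  software version:", f.software_version)
    print("  rows / stripes:  ", f.nrows, "/", f.nstripes)
    print("  compression:     ", f.compression)

summarize('data-without-hive4.orc')  # written without the Hive 4 changes
summarize('data.orc')                # written with the Hive 4 changes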

@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 4 times, most recently from 13d9947 to 5c04f64 Compare May 13, 2025 14:37
@prestodb-ci

@ethanyzhang imported this issue as lakehouse/presto #24571


ScrapCodes commented May 15, 2025

Test failures such as:
TestOrcReader>AbstractTestOrcReader.testLongDirect:158->AbstractTestOrcReader.testRoundTripNumeric:325 expected [1970-01-01 00:00:00.001] but found [1970-01-01 06:00:00.001]

lie in the way Presto reads and writes the data. Somehow, even before the data is interpreted as a timestamp type, i.e. while it is still a long, it has already been adjusted to the system timezone. Why this happens is not yet clear to me: when data written by Presto is read via an external ORC reader it has a 6-hour adjustment applied to it, and a similar thing happens when Presto reads data written by Hive.

There are no issues while reading other datatypes, e.g. Longs/Ints etc. The problem seems to be specific to timestamps only. @imjalpreet agrees with this.
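A minimal sketch of the suspected mechanism (my reading of the symptom, not Presto's actual code path): converting the wall-clock value 1970-01-01 00:00:00.001 to epoch milliseconds in a UTC-6 zone instead of UTC produces exactly the 6-hour shift seen in the failures.

from datetime import datetime, timedelta, timezone

# The wall-clock value the round-trip test expects.
wall_clock = datetime(1970, 1, 1, 0, 0, 0, 1000)

# UTC-6 is an assumption chosen to match the observed 6-hour shift, not a confirmed test setting.
utc_millis = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1000)
local_millis = int(wall_clock.replace(tzinfo=timezone(timedelta(hours=-6))).timestamp() * 1000)

print(utc_millis)    # 1
print(local_millis)  # 21600001 -> reads back as 1970-01-01 06:00:00.001 when treated as UTC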

@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 4 times, most recently from 5b896d6 to 6db405e Compare May 23, 2025 01:43
@imjalpreet imjalpreet force-pushed the hive4-upgrade branch 3 times, most recently from 894b0aa to 5a2117a Compare June 2, 2025 21:47