Timestamp millis repair #14120
Conversation
case LONG:
if (oldSchema.getLogicalType() != newSchema.getLogicalType()) {
if (oldSchema.getLogicalType() instanceof LogicalTypes.TimestampMillis) {
if (skipLogicalTimestampEvolution || oldSchema.getLogicalType() == null || newSchema.getLogicalType() == null) {
I didn't get why we need this skipLogicalTimestampEvolution flag; shouldn't we always rewrite the field if the logical types mismatch?
Based on my understanding, AvroSchemaCompatibility#calculateCompatibility did not previously validate logical timestamp evolution, so evolving timestamp-micros to timestamp-millis could happen, which leads to precision loss; such schema evolution should not be allowed.
However, for handling the timestamp issue this PR addresses, the ingestion writer needs to rewrite the schema from timestamp-micros to timestamp-millis.
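To make the precision-loss concern concrete, here is a minimal self-contained sketch (not part of the PR) showing that the micros to millis conversion discards the sub-millisecond digits and that a round trip cannot recover them:

```java
public class TimestampPrecisionLossDemo {
  public static void main(String[] args) {
    // Epoch value in microseconds with non-zero sub-millisecond digits.
    long micros = 1_700_000_000_123_456L;

    // Evolving timestamp-micros -> timestamp-millis is an integer division by 1000,
    // which silently drops the last three digits.
    long millis = micros / 1000L;

    // Converting back to micros cannot recover what was lost.
    long roundTripped = millis * 1000L;

    System.out.println(micros);        // 1700000000123456
    System.out.println(millis);        // 1700000000123
    System.out.println(roundTripped);  // 1700000000123000 (456 microseconds lost)
  }
}
```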
.withDocumentation("Enables support for Schema Evolution feature");

public static final ConfigProperty<Boolean> SCHEMA_EVOLUTION_ALLOW_LOGICAL_EVOLUTION = ConfigProperty
.key("hoodie.schema.evolution.allow.logical.evolution")
Why do we need this flag? Evolving between timestamp-millis and timestamp-micros should always be feasible in schema evolution.
As mentioned in the other thread, timestamp-micros to timestamp-millis should not be allowed as it loses precision.
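For context, a simplified sketch of the kind of LONG-field check being discussed, using the Avro LogicalTypes API. This is not the PR's exact code, and allowLogicalEvolution is only an illustrative stand-in for the config flag above:

```java
import java.util.Objects;

import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class LogicalTimestampEvolutionCheck {

  // Decide whether evolving a LONG field from oldSchema to newSchema is acceptable.
  // A millis <-> micros change is rejected unless explicitly allowed, because the
  // micros -> millis direction loses precision.
  static boolean isCompatibleLongEvolution(Schema oldSchema, Schema newSchema, boolean allowLogicalEvolution) {
    LogicalType oldType = oldSchema.getLogicalType();
    LogicalType newType = newSchema.getLogicalType();

    boolean oldIsTimestamp = oldType instanceof LogicalTypes.TimestampMillis
        || oldType instanceof LogicalTypes.TimestampMicros;
    boolean newIsTimestamp = newType instanceof LogicalTypes.TimestampMillis
        || newType instanceof LogicalTypes.TimestampMicros;

    if (oldIsTimestamp && newIsTimestamp && !Objects.equals(oldType, newType)) {
      // timestamp-millis <-> timestamp-micros: only allow when explicitly opted in.
      return allowLogicalEvolution;
    }
    // Otherwise require the logical types to match (both null counts as a match).
    return Objects.equals(oldType, newType);
  }
}
```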
this.readRecords++;
if (this.promotedSchema.isPresent()) {
return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get());
return HoodieAvroUtils.rewriteRecordWithNewSchema(record, this.promotedSchema.get(), skipLogicalTimestampEvolution);
I thought we only had this problem for Parquet files; do Avro logs also have mismatched precision between the timestamp type and its values? The Avro schema in the log block header comes from the table schema, which should be correct, right?
if (isTimestampMicros(fileType) && isTimestampMillis(tableType)) {
columnsToMultiply.add(path);
} else if (isLong(fileType) && isLocalTimestampMillis(tableType)) {
columnsToMultiply.add(path);
Is this a new breaking case to handle?
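The isTimestampMicros / isTimestampMillis / isLocalTimestampMillis helpers are not shown in this excerpt; below is a minimal sketch, assuming they are thin wrappers over the Parquet schema API. The names and signatures here are illustrative, not necessarily the PR's:

```java
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimeUnit;
import org.apache.parquet.schema.LogicalTypeAnnotation.TimestampLogicalTypeAnnotation;
import org.apache.parquet.schema.PrimitiveType;
import org.apache.parquet.schema.Type;

public class ParquetTimestampChecks {

  // True if the type is an INT64 annotated as a timestamp in the given unit
  // with the given UTC-adjustment flag.
  static boolean isTimestamp(Type type, TimeUnit unit, boolean adjustedToUtc) {
    if (!type.isPrimitive()
        || type.asPrimitiveType().getPrimitiveTypeName() != PrimitiveType.PrimitiveTypeName.INT64) {
      return false;
    }
    LogicalTypeAnnotation annotation = type.getLogicalTypeAnnotation();
    if (!(annotation instanceof TimestampLogicalTypeAnnotation)) {
      return false;
    }
    TimestampLogicalTypeAnnotation ts = (TimestampLogicalTypeAnnotation) annotation;
    return ts.getUnit() == unit && ts.isAdjustedToUTC() == adjustedToUtc;
  }

  static boolean isTimestampMillis(Type type) {
    return isTimestamp(type, TimeUnit.MILLIS, true);
  }

  static boolean isTimestampMicros(Type type) {
    return isTimestamp(type, TimeUnit.MICROS, true);
  }

  // Local (not UTC-adjusted) timestamps, e.g. Spark's TimestampNTZ, map to isAdjustedToUTC = false.
  static boolean isLocalTimestampMillis(Type type) {
    return isTimestamp(type, TimeUnit.MILLIS, false);
  }
}
```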
 * @param tableSchema The Parquet schema from the table (target)
 * @return Set of column paths (e.g., "timestamp", "metadata.created_at") that need multiplication
 */
public static Set<String> findColumnsToMultiply(MessageType fileSchema, MessageType tableSchema) {
Could the Avro table schema be passed in for comparison instead of being converted to a Parquet MessageType first? The additional Avro-to-Parquet conversion introduces another layer of processing that can be error-prone (e.g., any change in that conversion logic could affect the mitigation in this PR).
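To illustrate the suggestion, here is a rough sketch of doing the same discovery directly against two Avro schemas, without the Avro-to-Parquet conversion step. This is not the PR's code; it only covers the timestamp-micros-in-file vs. timestamp-millis-in-table case shown above plus nullable unions, not arrays or maps:

```java
import java.util.HashSet;
import java.util.Set;

import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

public class AvroColumnsToMultiply {

  // Collect dotted paths of fields where the file schema says timestamp-micros
  // but the table schema says timestamp-millis.
  static Set<String> findColumnsToMultiply(Schema fileSchema, Schema tableSchema) {
    Set<String> result = new HashSet<>();
    visit(fileSchema, tableSchema, "", result);
    return result;
  }

  private static void visit(Schema file, Schema table, String path, Set<String> result) {
    file = unwrapNullable(file);
    table = unwrapNullable(table);
    if (file.getType() == Schema.Type.RECORD && table.getType() == Schema.Type.RECORD) {
      for (Schema.Field tableField : table.getFields()) {
        Schema.Field fileField = file.getField(tableField.name());
        if (fileField != null) {
          String childPath = path.isEmpty() ? tableField.name() : path + "." + tableField.name();
          visit(fileField.schema(), tableField.schema(), childPath, result);
        }
      }
    } else if (file.getType() == Schema.Type.LONG && table.getType() == Schema.Type.LONG) {
      LogicalType fileType = file.getLogicalType();
      LogicalType tableType = table.getLogicalType();
      if (fileType instanceof LogicalTypes.TimestampMicros
          && tableType instanceof LogicalTypes.TimestampMillis) {
        result.add(path);
      }
    }
  }

  // Treat ["null", X] unions as X for the purpose of this comparison.
  private static Schema unwrapNullable(Schema schema) {
    if (schema.getType() == Schema.Type.UNION) {
      for (Schema branch : schema.getTypes()) {
        if (branch.getType() != Schema.Type.NULL) {
          return branch;
        }
      }
    }
    return schema;
  }
}
```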
Cast(expr, dec, if (needTimeZone) timeZoneId else None)
case (StringType, DateType) =>
Cast(expr, DateType, if (needTimeZone) timeZoneId else None)
case (LongType, TimestampNTZType) => expr // @ethan I think we just want a no-op here?
Now I kind of get it. Is this because the local timestamp (TimestampNTZType) was written as a plain Long type in Parquet before? Also, there is no micros-in-schema vs. millis-in-values regression for TimestampNTZType in published Hudi releases, correct? If so, there is no need for conversion.
}

if (typeChangeInfos.isEmpty) {
if (typeChangeInfos.isEmpty && columnsToMultiply.isEmpty) {
Does this mean that the record-level projection overhead is only incurred if there are columns that need the multiplication applied?
})
}

def recursivelyApplyMultiplication(expr: Expression, columnPath: String, dataType: DataType): Expression = {
I'm wondering if we could change HoodieParquetReadSupport and add a read-support implementation for the Avro Parquet reader to handle the millis interpretation, which is one layer below the current approach. Would that incur less overhead than the projection?
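To make the read-support idea concrete, here is a rough sketch, not an existing Hudi class, of a delegating Parquet PrimitiveConverter that rescales INT64 values while the record is being materialized. Wiring it into HoodieParquetReadSupport and choosing which columns to wrap (e.g. from the columnsToMultiply set above) is not shown:

```java
import org.apache.parquet.io.api.PrimitiveConverter;

// Sketch only: wraps the converter for a single INT64 column and multiplies each
// value by a fixed factor as it is read, so no extra projection pass over the
// materialized rows is needed. Only addLong is overridden because this wrapper
// is meant for INT64 timestamp columns.
public class RescalingLongConverter extends PrimitiveConverter {

  private final PrimitiveConverter delegate;
  private final long factor;

  public RescalingLongConverter(PrimitiveConverter delegate, long factor) {
    this.delegate = delegate;
    this.factor = factor;
  }

  @Override
  public void addLong(long value) {
    // e.g. factor = 1000L for the columnsToMultiply case discussed above.
    delegate.addLong(value * factor);
  }
}
```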
Describe the issue this Pull Request addresses
Summary and Changelog
Impact
Risk Level
Documentation Update
Contributor's checklist