RFC: Table-Level Upstream Information Metadata in Apache Hudi #17513

@shangxinli

Feature Description

Summary

This feature introduces table-level lineage metadata in Apache Hudi. A table's lineage lists the direct upstream source tables from which it is derived, and it is stored as versioned table metadata.

Today, table lineage is often tracked externally or inferred heuristically, leading to inconsistency and loss of historical context. This proposal adds a simple, declarative, and deterministic lineage primitive directly to Hudi.

What is added

  • A new table metadata property that records upstream source tables
  • Lineage represented as a list of fully qualified catalog.database.table identifiers
  • Lineage is versioned implicitly as the table metadata evolves

Example:

hoodie.table.lineage.sources = [
  "hive.rawdata.kafka_events",
  "hive.rawdata.users"
]
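Since each entry is expected to be a catalog.database.table identifier, a writer could check entries before declaring them. The following is only a sketch: the class `LineageSourceId` and its methods are hypothetical and not part of Hudi or of this proposal, and it assumes that no name part contains a dot (quoted identifiers are not handled).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hudi or of this RFC.
final class LineageSourceId {
    final String catalog;
    final String database;
    final String table;

    private LineageSourceId(String catalog, String database, String table) {
        this.catalog = catalog;
        this.database = database;
        this.table = table;
    }

    // Parses "catalog.database.table"; rejects anything that is not
    // exactly three non-empty, dot-separated parts.
    static LineageSourceId parse(String raw) {
        if (raw == null) {
            throw new IllegalArgumentException("lineage source is null");
        }
        // Limit -1 keeps trailing empty parts, so "a.b.c." is rejected too.
        String[] parts = raw.trim().split("\\.", -1);
        if (parts.length != 3) {
            throw new IllegalArgumentException("expected catalog.database.table, got: " + raw);
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                throw new IllegalArgumentException("empty name part in: " + raw);
            }
        }
        return new LineageSourceId(parts[0], parts[1], parts[2]);
    }

    // Parses every entry, failing on the first invalid one.
    static List<LineageSourceId> parseAll(List<String> raws) {
        List<LineageSourceId> ids = new ArrayList<>();
        for (String raw : raws) {
            ids.add(parse(raw));
        }
        return ids;
    }

    @Override
    public String toString() {
        return catalog + "." + database + "." + table;
    }

    public static void main(String[] args) {
        System.out.println(parseAll(Arrays.asList(
            "hive.rawdata.kafka_events",
            "hive.rawdata.users")));
    }
}
```

Failing fast on a malformed entry keeps invalid identifiers out of table metadata, which matters if lineage is meant to be deterministic.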

Key design points

  • Table-level only (no partition or column lineage)
  • Previous-layer only (one hop)
  • Declared explicitly by writers
  • No inference or query engine dependency
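Because each table records only its direct upstream tables, a consumer that wants the full ancestry would have to follow the one-hop links itself. A minimal sketch of such a walk is below. The class `OneHopLineageWalk`, the in-memory map standing in for per-table lineage, and the table names `hive.reports.daily` and `hive.analytics.sessions` are all hypothetical; nothing here is proposed as part of Hudi.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical consumer-side helper; not part of Hudi or of this RFC.
final class OneHopLineageWalk {

    // Breadth-first walk over one-hop lineage records. Returns every table
    // reachable upstream of `start`, in the order first discovered.
    static Set<String> allUpstream(String start, Map<String, List<String>> directSources) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            String current = queue.poll();
            List<String> sources =
                directSources.getOrDefault(current, Collections.<String>emptyList());
            for (String source : sources) {
                // The `seen` check also guards against cycles.
                if (seen.add(source)) {
                    queue.add(source);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Stand-in for reading each table's declared lineage.
        Map<String, List<String>> lineage = new HashMap<>();
        lineage.put("hive.reports.daily", Arrays.asList("hive.analytics.sessions"));
        lineage.put("hive.analytics.sessions",
            Arrays.asList("hive.rawdata.kafka_events", "hive.rawdata.users"));
        System.out.println(allUpstream("hive.reports.daily", lineage));
    }
}
```

Keeping the stored record to one hop means each table's lineage stays small and local, while transitive questions are answered by composition outside the table format.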

User Experience

How users use this feature

  • Opt-in: existing tables and pipelines are unchanged
  • Writers declare lineage during table creation or initial ingestion
  • Normal incremental writes do not modify lineage

Usage examples

Declare lineage when creating or rebuilding a table:

setLineageSources(Arrays.asList(
  "hive.rawdata.kafka_events",
  "hive.rawdata.users"
));

Read lineage:

metaClient.getTableConfig().getLineageSources();
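The proposal does not specify how the list is encoded in the table properties. As one possible reading, the sketch below stores it as a single comma-separated value under the `hoodie.table.lineage.sources` key in a `java.util.Properties` object. The class `LineagePropertyCodec` and the comma-separated encoding are assumptions made for illustration, not the RFC's design.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Properties;

// Hypothetical codec for illustration; the on-disk encoding is an assumption.
final class LineagePropertyCodec {
    // Property name taken from the example in this RFC.
    static final String KEY = "hoodie.table.lineage.sources";

    // Stores the sources as one comma-separated property value.
    static void write(Properties props, List<String> sources) {
        props.setProperty(KEY, String.join(",", sources));
    }

    // Reads the sources back. A table that never declared lineage has no
    // such key and yields an empty list.
    static List<String> read(Properties props) {
        String raw = props.getProperty(KEY);
        if (raw == null || raw.trim().isEmpty()) {
            return Collections.emptyList();
        }
        List<String> sources = new ArrayList<>();
        for (String entry : raw.split(",")) {
            String trimmed = entry.trim();
            if (!trimmed.isEmpty()) {
                sources.add(trimmed);
            }
        }
        return Collections.unmodifiableList(sources);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println(read(props)); // no lineage declared yet
        write(props, Arrays.asList("hive.rawdata.kafka_events", "hive.rawdata.users"));
        System.out.println(read(props));
    }
}
```

Returning an empty list for a missing key is what would keep the feature opt-in under this encoding: readers of existing tables see no lineage rather than an error.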

What users do NOT need to do

  • No schema changes
  • No SQL or query changes
  • No engine upgrades
  • No new runtime dependencies

Hudi RFC Requirements

Non-Goals

  • Column-level lineage
  • Record-level lineage
  • Automatic inference
  • DAG management
  • Query planner changes

Backward Compatibility

  • Metadata is additive
  • Existing tables unaffected
  • No commit or file-format changes

Alternatives Considered

  • Commit-level lineage (rejected)
  • Engine-side inference (rejected)
  • External-only lineage systems (rejected)

Future Work

  • SQL / metadata table exposure
  • Visualization tooling
  • Integration with governance platforms

Labels

type:feature (New features and enhancements)
