[HUDI-9025] perf: improve append performance by reducing avro schema comparisons #12839
For the engine-specific record merger mode (e.g. Spark record):

For each record we spend CPU time on unnecessary work such as schema comparison. Even though we have a global `avro schema -> spark schema` cache, the `cache.get()` call eventually invokes `avro schema::equals`, which walks all columns to compare. In the append scenario our cache hits every time, yet each record still performs a full Avro schema comparison, which wastes a lot of CPU time.

In my perf results, `HoodieInternalRowUtils::getCachedSchema` costs almost 10% of CPU time during `doAppend`. Analyzing the results shows that essentially all the time in this function is spent comparing the Avro schema with the cache key. As the number of columns increases, the proportion of this cost grows.
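As an illustration of the cost, here is a minimal, self-contained sketch of why a map keyed by an Avro `Schema` pays a full structural `equals` even on a cache hit. The cache shape and class names here are assumptions for illustration; only `HoodieInternalRowUtils::getCachedSchema` and `doAppend` come from the profile above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaCacheCostDemo {
    // Global avro-schema -> converted-schema cache, keyed by the Avro Schema itself.
    // (The String value is a placeholder standing in for Spark's StructType.)
    private static final Map<Schema, String> CACHE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("rec").fields()
            .requiredString("a")
            .requiredInt("b")
            .endRecord();
        CACHE.put(schema, "convertedSchemaPlaceholder");

        // A structurally equal but distinct Schema instance, as the write path
        // may see on each record/batch.
        Schema sameShape = new Schema.Parser().parse(schema.toString());

        // The map first matches on hashCode, then confirms with Schema.equals,
        // which recursively compares every field -- this per-hit comparison is
        // what the profile shows inside getCachedSchema.
        String hit = CACHE.get(sameShape);
        System.out.println(hit); // cache hit, but only after a field-by-field equals
    }
}
```

Keying lookups by the engine-specific schema (or by reference identity) sidesteps this field-by-field comparison entirely, which is what the change below targets.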
Change Logs
Impact
Add a new interface for engine-specific records so they can use the engine-specific schema rather than the Avro schema (a hypothetical sketch follows below).
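A hypothetical sketch of such an interface; the interface and method names are illustrative, not the PR's actual API. The idea is that an engine-specific record exposes its engine-native schema, so the write path can key lookups on it instead of converting from, and comparing, Avro schemas per record:

```java
// Hypothetical sketch -- interface and method names are illustrative,
// not the actual API introduced by this PR.
// S is the engine-native schema type, e.g. Spark's StructType.
public interface EngineSpecificRecord<S> {
    /**
     * Engine-native schema of this record. Callers can cache or compare this
     * directly (often by reference), avoiding a full Avro Schema::equals
     * on the hot append path.
     */
    S getEngineSchema();
}
```

With something like this in place, the append path for Spark records could pass the engine schema straight through and skip the Avro-schema-keyed lookup on every record.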
Risk level (write none, low, medium or high below)
low
Documentation Update
none
Contributor's checklist