[HUDI-9025] perf: improve append performance by reducing avro schema comparisons #12839

Open
wants to merge 1 commit into
base: master

@TheR1sing3un (Member) commented Feb 14, 2025

For engine-specific record merger modes, i.e. Spark records:
For each record we spend CPU time on unnecessary work such as schema comparison. Even though there is a global Avro schema -> Spark schema cache, every cache.get() call eventually invokes the Avro Schema::equals method, which walks through all the columns to compare them. In the append scenario the cache hits every time, yet each record still triggers a full Avro schema comparison, which wastes a lot of CPU time.

In my profiling results:

HoodieInternalRowUtils::getCachedSchema accounts for almost 10% of CPU time during doAppend.
Analyzing the results further shows that essentially all of the time in this function is spent comparing the Avro schema against the cache key.
image

As the number of columns increases, this comparison takes up an even larger share of CPU time.
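
To illustrate the cost pattern described above, here is a minimal, self-contained sketch. It does not use Hudi or Avro code: `FakeSchema`, `toEngineSchema`, and `countComparisons` are hypothetical names, and the sketch assumes the lookup key is equal to, but not the same instance as, the key already stored in the cache, so that every hit falls through to a deep `equals`. It counts field-level comparisons when the engine schema is looked up per record versus resolved once and reused.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

public class SchemaCacheSketch {
    // Counts field-by-field comparisons performed inside FakeSchema.equals.
    static final AtomicLong FIELD_COMPARISONS = new AtomicLong();

    // Hypothetical stand-in for an Avro schema: equality walks every field.
    static final class FakeSchema {
        private final List<String> fields;
        private final int hash; // cached, so hashing is cheap; only equals is costly

        FakeSchema(List<String> fields) {
            this.fields = new ArrayList<>(fields);
            this.hash = this.fields.hashCode();
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof FakeSchema)) return false;
            FakeSchema other = (FakeSchema) o;
            if (fields.size() != other.fields.size()) return false;
            for (int i = 0; i < fields.size(); i++) {
                FIELD_COMPARISONS.incrementAndGet();
                if (!fields.get(i).equals(other.fields.get(i))) return false;
            }
            return true;
        }

        @Override
        public int hashCode() {
            return hash;
        }
    }

    // Hypothetical global cache from "Avro-like" schema to an engine schema string.
    static final Map<FakeSchema, String> CACHE = new ConcurrentHashMap<>();

    // A cache hit with an equal but distinct key instance triggers one deep equals.
    static String toEngineSchema(FakeSchema schema) {
        return CACHE.computeIfAbsent(schema, s -> "struct<" + String.join(",", s.fields) + ">");
    }

    // Returns how many field comparisons were needed to process `records` records.
    static long countComparisons(int records, int columns, boolean resolveOnce) {
        CACHE.clear();
        FIELD_COMPARISONS.set(0);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columns; i++) names.add("col" + i);
        toEngineSchema(new FakeSchema(names));           // warm the cache
        FakeSchema writerSchema = new FakeSchema(names); // equal, but a different instance
        if (resolveOnce) {
            // Resolve the engine schema once, then reuse it for every record.
            String engineSchema = toEngineSchema(writerSchema);
            for (int r = 0; r < records; r++) {
                if (engineSchema.isEmpty()) throw new IllegalStateException();
            }
        } else {
            // Look the schema up again for every record.
            for (int r = 0; r < records; r++) {
                toEngineSchema(writerSchema);
            }
        }
        return FIELD_COMPARISONS.get();
    }

    public static void main(String[] args) {
        System.out.println("per-record lookups: " + countComparisons(1000, 50, false));
        System.out.println("resolved once:      " + countComparisons(1000, 50, true));
    }
}
```

In this sketch the per-record path performs records × columns field comparisons, while the resolve-once path performs a single pass over the columns, which is why the overhead grows with the column count. How the actual change avoids the repeated comparison may differ from this simplification.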

Change Logs

  1. improve append performance by reducing comparisons

Impact

Adds a new interface for engine-specific records so that they can use the engine-specific schema rather than the Avro schema.

Risk level (write none, low, medium or high below)

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

1. improve append performance by reducing comparisons

Signed-off-by: TheR1sing3un <[email protected]>
@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Feb 14, 2025
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@TheR1sing3un changed the title from "[HUDI-9025] perf: improve append performance by reducing comparisons" to "[HUDI-9025] perf: improve append performance by reducing avro schema comparisons" Feb 14, 2025