[HUDI-9025] perf: improve append performance by reducing avro schema comparisons #12839
For the engine-specific record merger mode (e.g. Spark record):

For each record we spend CPU time on unnecessary work such as schema comparison. Even though we have a global `avro schema -> spark schema` cache, the `cache.get()` call eventually invokes `avro schema::equals`, which walks all columns to compare. In the append scenario our cache hits every time, yet each record still performs a full Avro schema comparison, which wastes a lot of CPU time.

In my perf results, `HoodieInternalRowUtils::getCachedSchema` costs almost 10% of CPU time during `doAppend`. Analyzing the results shows that essentially all the time in this function is spent comparing the Avro schema with the cache key. As the number of columns increases, the proportion of this cost grows.
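As an illustration of the cost, here is a minimal, self-contained sketch of why a map keyed by an Avro `Schema` pays a full structural `equals` even on a cache hit. The cache shape and class names here are assumptions for illustration; only `HoodieInternalRowUtils::getCachedSchema` and `doAppend` come from the profile above.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SchemaCacheCostDemo {
    // Global avro-schema -> converted-schema cache, keyed by the Avro Schema itself.
    // (The String value is a placeholder standing in for Spark's StructType.)
    private static final Map<Schema, String> CACHE = new ConcurrentHashMap<>();

    public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("rec").fields()
            .requiredString("a")
            .requiredInt("b")
            .endRecord();
        CACHE.put(schema, "convertedSchemaPlaceholder");

        // A structurally equal but distinct Schema instance, as the write path
        // may see on each record/batch.
        Schema sameShape = new Schema.Parser().parse(schema.toString());

        // The map first matches on hashCode, then confirms with Schema.equals,
        // which recursively compares every field -- this per-hit comparison is
        // what the profile shows inside getCachedSchema.
        String hit = CACHE.get(sameShape);
        System.out.println(hit); // cache hit, but only after a field-by-field equals
    }
}
```

Keying lookups by the engine-specific schema (or by reference identity) sidesteps this field-by-field comparison entirely, which is what the change below targets.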
Change Logs
Impact
Add a new interface for engine-specific records so they can use the engine-specific schema rather than the Avro schema (a hypothetical sketch follows below).
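A hypothetical sketch of such an interface; the interface and method names are illustrative, not the PR's actual API. The idea is that an engine-specific record exposes its engine-native schema, so the write path can key lookups on it instead of converting from, and comparing, Avro schemas per record:

```java
// Hypothetical sketch -- interface and method names are illustrative,
// not the actual API introduced by this PR.
// S is the engine-native schema type, e.g. Spark's StructType.
public interface EngineSpecificRecord<S> {
    /**
     * Engine-native schema of this record. Callers can cache or compare this
     * directly (often by reference), avoiding a full Avro Schema::equals
     * on the hot append path.
     */
    S getEngineSchema();
}
```

With something like this in place, the append path for Spark records could pass the engine schema straight through and skip the Avro-schema-keyed lookup on every record.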
Risk level (write none, low, medium or high below)
low
Documentation Update
none
Contributor's checklist