Describe the problem you faced

I am trying to perform an upsert to Hudi with a DataFrame of roughly 200 million records, and the operation takes about an hour to complete. The Hudi table has the record level index enabled, and I would like the upsert to finish in a matter of minutes rather than an hour. Please see the attached Spark UI screenshots, specifically jobs 32 and 37 and, under them, stages 70 and 98 respectively.

We are upserting about 181,309,872 records into the table. I see 23,360 occurrences of fileId in the commit file, which I assume is the number of files that Hudi modifies during this upsert operation.
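For context, the write roughly looks like the sketch below (simplified; the option names are taken from the Hudi docs, while the table name, key fields, and paths are placeholders rather than our exact production values):

```scala
// Simplified sketch of the upsert; assumes an active SparkSession `spark`.
import org.apache.spark.sql.SaveMode

val upserts = spark.read.parquet("s3://my-bucket/incoming/batch/") // ~181 M records

upserts.write.format("hudi").
  option("hoodie.table.name", "my_table").                             // placeholder
  option("hoodie.datasource.write.operation", "upsert").
  option("hoodie.datasource.write.table.type", "COPY_ON_WRITE").
  option("hoodie.datasource.write.recordkey.field", "record_id").      // placeholder
  option("hoodie.datasource.write.precombine.field", "updated_at").    // placeholder
  option("hoodie.datasource.write.partitionpath.field", "partition_col"). // placeholder
  // Record level index is enabled via the metadata table.
  option("hoodie.metadata.enable", "true").
  option("hoodie.metadata.record.index.enable", "true").
  option("hoodie.index.type", "RECORD_INDEX").
  mode(SaveMode.Append).
  save("s3://my-bucket/hudi/my_table")                                 // placeholder
```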
Questions
1. For stage 98, where does the task count of 20003 come from? None of my Spark configuration uses that value, and I would like to know whether I can configure it so that I can try increasing it and reduce the time spent updating the metadata table.
2. For stage 70, is there anything we can do to reduce the time it takes to write the data files? We have to use a copy-on-write table so that our consumers get the latest version as soon as the commit file lands. It would be good to understand whether there are any configurations we can tune to get better performance here.
3. Should we include any other settings in our upsert configuration to ensure that the upsert runs as efficiently as possible?
4. Another puzzling thing I observed is the size of the commit files for upsert operations: they are almost as large as the commit file created when the table was first populated with an INSERT. Do you know why that might be? See the following image for the sizes.
I'm trying to determine whether we have hit the lower bound on upsert time for this workload, or whether there is still room for improvement.
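For reference, these are the kinds of settings we have been experimenting with while looking into questions 1-3; the key names come from the Hudi configuration reference, but we have not confirmed which of them, if any, actually drive the task counts in stages 70 and 98:

```scala
// Candidate knobs (example values only, not our confirmed production settings).
val tuningOptions = Map(
  // Parallelism of the upsert shuffle on the data file write path.
  "hoodie.upsert.shuffle.parallelism" -> "1500",
  // Target and small-file sizing for the copy-on-write parquet files.
  "hoodie.parquet.max.file.size"    -> (128 * 1024 * 1024).toString,
  "hoodie.parquet.small.file.limit" -> (100 * 1024 * 1024).toString,
  // File group counts backing the record level index in the metadata table;
  // we are unsure whether these relate to the 20003 tasks observed in stage 98.
  "hoodie.metadata.record.index.min.filegroup.count" -> "10",
  "hoodie.metadata.record.index.max.filegroup.count" -> "10000"
)
```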
To Reproduce
1. Read the data.
2. Upsert into the Hudi table.
Expected behavior
The upsert operation would finish in minutes as opposed to taking an hour.
Environment Description
Hudi version : 0.15.0
Spark version : 3.4.1
Hive version :
Hadoop version : 2.7.5
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no
Spark Configuration
Hudi upsert configuration
We disable the timeline server and use DIRECT markers, as per this other support ticket.
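Concretely, those two settings look like this among our writer options:

```scala
// Disable the embedded timeline server and use direct markers, as described above.
val markerOptions = Map(
  "hoodie.embed.timeline.server" -> "false",
  "hoodie.write.markers.type"    -> "DIRECT"
)
```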
Spark UI screenshots with details
Details for Job 32
Details for Job 37
Details for Stage 70
Details for Stage 98
Spark Job View for the upsert operation
Hudi table partition with object count and size
We tried using a partition key that logically partitions the data; however, due to the nature of our data, the partitions are skewed, as you can see.
