Log files in Hudi MOR table are not getting deleted #12702
Comments
@xushiyan Please let us know how to share the .hoodie folder.
@koochiswathiTR You can zip it and send it through the Apache Hudi Slack. You can ping me (Aditya Goenka) or Ranga Reddy. If you are not able to share it, let us know and we can get on a call to debug.
@ad1happy2go Where can I ping you? Can I share it via Teams?
Are you on the Apache Hudi Slack community? You can share it there.
@ad1happy2go I am not able to upload the .hoodie folder here since the zip file is 2.3 GB, and I couldn't find the Apache Hudi channel in Slack.
@ad1happy2go I finally found the Hudi Slack, but I am not able to upload the file there either since it is a 2.3 GB zip.
@ad1happy2go I have sent you the .hoodie folder via Slack.
Thanks @koochiswathiTR. We are looking into it.
@ad1happy2go @xushiyan Any update on this?
@koochiswathiTR I have been trying to reach you since Friday on the Apache Hudi Slack. Did you get a chance to check? We can also connect on a call to debug this further.
@ad1happy2go Uploaded again, please check and confirm.
@ad1happy2go Sure, we can meet. Can you check the zip file once? What time will you be available?
@koochiswathiTR I checked the timeline files and, as you said, the clean plan references only log files and no parquet files. I pinged you on Slack with my available timings.
Tips before filing an issue
Have you gone through our FAQs?
Join the mailing list to engage in conversations and get faster support at [email protected].
If you have triaged this as a bug, then file an issue directly.
We use a MOR Hudi table. We read a Kinesis stream and process it on AWS EMR using Spark Streaming.
We use inline compaction and commit-based cleaning.
Although compaction and cleaning are running, we see that some of the Hudi log files generated during ingestion are never fully cleared.
We process with a batch interval of 5 minutes, and for every 5-minute batch we do one Hudi commit.
Below are our Hudi configs:
DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "guid",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "collectionName",
DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "operationTime",
DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key() -> "collectionName",
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[MultiPartKeysValueExtractor].getName,
DataSourceWriteOptions.HIVE_SYNC_MODE.key() -> "hms",
DataSourceWriteOptions.HIVE_USE_JDBC.key() -> "false",
HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key() -> CompactionTriggerStrategy.TIME_ELAPSED.name,
HoodieCompactionConfig.INLINE_COMPACT_TIME_DELTA_SECONDS.key() -> String.valueOf(60 * 6 * 60),
HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "228",
HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "229",
HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "252",
HoodieCompactionConfig.ASYNC_CLEAN.key() -> "false",
HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
HoodieMetricsConfig.TURN_METRICS_ON.key() -> "true",
HoodieMetricsConfig.METRICS_REPORTER_TYPE_VALUE.key() -> MetricsReporterType.DATADOG.name(),
HoodieMetricsDatadogConfig.API_SITE_VALUE.key() -> "US",
HoodieMetricsDatadogConfig.METRIC_PREFIX_VALUE.key() -> "XXXX.hudi",
HoodieMetricsDatadogConfig.API_KEY_SUPPLIER.key() -> "XXXXX.XXX.DatadogKeySupplier",
HoodieMetadataConfig.ENABLE.key() -> "false",
HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE.key() -> "false",
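For context on how long log files can be expected to linger under these configs, here is a back-of-the-envelope sketch (plain Python arithmetic, not Hudi APIs), assuming exactly one commit per 5-minute batch as described above. Log files written between compactions only become candidates for cleaning once a compaction folds them into a base file, and the KEEP_LATEST_COMMITS policy then retains history for another 228 commits on top of that.

```python
# Illustrative arithmetic only (no Hudi code): the time windows implied
# by the configs above, assuming one commit per 5-minute batch.

COMMIT_INTERVAL_MIN = 5           # one Hudi commit per 5-minute batch
COMPACT_DELTA_SEC = 60 * 6 * 60   # INLINE_COMPACT_TIME_DELTA_SECONDS = 21600
COMMITS_RETAINED = 228            # CLEANER_COMMITS_RETAINED

compaction_every_hours = COMPACT_DELTA_SEC / 3600
commits_between_compactions = COMPACT_DELTA_SEC / 60 / COMMIT_INTERVAL_MIN
retention_hours = COMMITS_RETAINED * COMMIT_INTERVAL_MIN / 60

print(f"compaction roughly every {compaction_every_hours:.0f} h "
      f"(~{commits_between_compactions:.0f} commits)")
# -> compaction roughly every 6 h (~72 commits)
print(f"cleaner retains {COMMITS_RETAINED} commits "
      f"~= {retention_hours:.0f} h of history")
# -> cleaner retains 228 commits ~= 19 h of history
```

So even when everything works as configured, a given log file can survive for roughly a day (up to 6 h waiting for compaction, plus ~19 h of cleaner retention); files older than that which still remain would point at a compaction or clean-planning problem.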
Snapshots attached.
If you want the .hoodie folder, please let us know and we will share it here.
Hudi version : 0.9
Spark version : 3.2.1
EMR Version : 6.7
Storage (HDFS/S3/GCS..) : S3
Running on Docker? (yes/no) : no