
Log files in Hudi MOR table are not getting deleted #12702

Open
koochiswathiTR opened this issue Jan 24, 2025 · 15 comments

@koochiswathiTR (Author)

We use a MOR Hudi table. We read a Kinesis stream and process it on AWS EMR using Spark Streaming.
We use inline compaction and commit-based cleaning.
Although compaction and cleaning are running, we see that some of the Hudi log files generated during ingestion are not getting cleared fully.
We process with a batch interval of 5 minutes, and we make one Hudi commit every 5 minutes.

Below are our Hudi configs:

DataSourceWriteOptions.TABLE_TYPE.key() -> DataSourceWriteOptions.MOR_TABLE_TYPE_OPT_VAL,
DataSourceWriteOptions.RECORDKEY_FIELD.key() -> "guid",
DataSourceWriteOptions.PARTITIONPATH_FIELD.key() -> "collectionName",
DataSourceWriteOptions.PRECOMBINE_FIELD.key() -> "operationTime",
DataSourceWriteOptions.HIVE_PARTITION_FIELDS.key() -> "collectionName",
DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS.key() -> classOf[MultiPartKeysValueExtractor].getName,
DataSourceWriteOptions.HIVE_SYNC_MODE.key() -> "hms",
DataSourceWriteOptions.HIVE_USE_JDBC.key() -> "false",
HoodieCompactionConfig.INLINE_COMPACT_TRIGGER_STRATEGY.key() -> CompactionTriggerStrategy.TIME_ELAPSED.name,
HoodieCompactionConfig.INLINE_COMPACT_TIME_DELTA_SECONDS.key() -> String.valueOf(60 * 6 * 60),
HoodieCompactionConfig.CLEANER_POLICY.key() -> HoodieCleaningPolicy.KEEP_LATEST_COMMITS.name(),
HoodieCompactionConfig.CLEANER_COMMITS_RETAINED.key() -> "228",
HoodieCompactionConfig.MIN_COMMITS_TO_KEEP.key() -> "229",
HoodieCompactionConfig.MAX_COMMITS_TO_KEEP.key() -> "252",
HoodieCompactionConfig.ASYNC_CLEAN.key() -> "false",
HoodieCompactionConfig.INLINE_COMPACT.key() -> "true",
HoodieMetricsConfig.TURN_METRICS_ON.key() -> "true",
HoodieMetricsConfig.METRICS_REPORTER_TYPE_VALUE.key() -> MetricsReporterType.DATADOG.name(),
HoodieMetricsDatadogConfig.API_SITE_VALUE.key() -> "US",
HoodieMetricsDatadogConfig.METRIC_PREFIX_VALUE.key() -> "XXXX.hudi",
HoodieMetricsDatadogConfig.API_KEY_SUPPLIER.key() -> "XXXXX.XXX.DatadogKeySupplier",
HoodieMetadataConfig.ENABLE.key() -> "false",
HoodieWriteConfig.ROLLBACK_USING_MARKERS_ENABLE.key() -> "false",
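
For reference, below is roughly how such options are applied in a Spark Structured Streaming job. It is only a minimal sketch: the table name, base path, checkpoint location, and the streamingDf / hudiOptions names are placeholders, not our actual values. With this cadence, each 5-minute micro-batch produces one deltacommit, INLINE_COMPACT_TIME_DELTA_SECONDS = 21600 schedules inline compaction roughly every 6 hours, and CLEANER_COMMITS_RETAINED = 228 corresponds to roughly 19 hours of commits.

import org.apache.spark.sql.{DataFrame, SaveMode}
import org.apache.spark.sql.streaming.Trigger

// Placeholders -- not the real table, bucket, or checkpoint names.
val tableName = "my_hudi_table"
val basePath  = "s3://my-bucket/my-hudi-table"

val hudiOptions: Map[String, String] = Map(
  "hoodie.table.name" -> tableName
  // ...plus the DataSourceWriteOptions / compaction / cleaner / metrics keys listed above.
)

// Placeholder for the DataFrame read from the Kinesis source.
val streamingDf: DataFrame = ???

// Explicit function type avoids the Scala 2.12 foreachBatch overload ambiguity.
val writeBatch: (DataFrame, Long) => Unit = (batchDf, _) => {
  batchDf.write
    .format("hudi")
    .options(hudiOptions)
    .mode(SaveMode.Append)
    .save(basePath)
}

// One Hudi deltacommit per 5-minute micro-batch.
streamingDf.writeStream
  .trigger(Trigger.ProcessingTime("5 minutes"))
  .foreachBatch(writeBatch)
  .option("checkpointLocation", "s3://my-bucket/checkpoints/my_hudi_table")
  .start()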

Snapshots attached (two images).

If you want the .hoodie folder, please let us know and we will share it here.

  • Hudi version : 0.9

  • Spark version : 3.2.1

  • EMR Version : 6.7

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : no

@koochiswathiTR (Author)

@xushiyan Please let us know how to share the .hoodie folder.

@ad1happy2go (Collaborator) commented Jan 27, 2025

@koochiswathiTR You can zip it and send it through the Apache Hudi Slack. You can ping me (Aditya Goenka) or Ranga Reddy.

If you are not able to share it, let us know; we can get on a call and debug.

@koochiswathiTR (Author)

@ad1happy2go Where can I ping you? Can I share it via Teams?

@ad1happy2go (Collaborator)

Are you "Apache Hudi" Slack community? you can share there

https://apache-hudi.slack.com/

@koochiswathiTR (Author)

I'm not able to find Apache Hudi on Slack (screenshot attached).

@koochiswathiTR (Author)

@ad1happy2go I am not able to upload the .hoodie folder here, as it is a 2.3 GB zip file, and I could not find the Apache Hudi channel in Slack.
If you can give us an S3 link to upload to, that would be good.

@koochiswathiTR (Author)

@ad1happy2go @xushiyan

@koochiswathiTR (Author)

@ad1happy2go I finally found the Hudi Slack, but I am not able to upload the file there as it is a 2.3 GB zip file.

@koochiswathiTR (Author)

@ad1happy2go I have shared the .hoodie folder with you via Slack.

@ad1happy2go (Collaborator)

Thanks @koochiswathiTR. We are looking into it.

@koochiswathiTR (Author)

@ad1happy2go @xushiyan Any update on this?

@ad1happy2go (Collaborator)

@koochiswathiTR I have been trying to reach you since Friday on the Apache Hudi Slack. Did you get a chance to check? We can also connect on a call to debug this further.

@koochiswathiTR (Author)

@ad1happy2go Uploaded it again; please check and confirm.

@koochiswathiTR (Author)

@ad1happy2go Sure, we can meet. Can you check the zip file first? What time will you be available?

@ad1happy2go (Collaborator)

@koochiswathiTR I checked the timeline files and noticed that, as you said, the clean plans contain only log files and no parquet files. I pinged you on Slack about my available timings.
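
For anyone who wants to check their own table, here is a minimal spark-shell sketch to list the clean and compaction instants under the .hoodie folder (the base path below is a placeholder, and spark is the spark-shell session):

import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder path -- substitute the real table location.
val hoodiePath = new Path("s3://my-bucket/my-hudi-table/.hoodie")
val fs: FileSystem = hoodiePath.getFileSystem(spark.sparkContext.hadoopConfiguration)

// Print clean instants and pending compaction instants from the active timeline;
// completed compactions appear as regular .commit files.
fs.listStatus(hoodiePath)
  .map(_.getPath.getName)
  .filter(name => name.contains(".clean") || name.contains(".compaction"))
  .sorted
  .foreach(println)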
