Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HUDI-8992] Deprecate all byte array usage in metadata deserialization path #12826

Open
wants to merge 12 commits into
base: master
Choose a base branch
from

Conversation

Davis-Zhang-Onehouse
Copy link
Contributor

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse commented Feb 11, 2025

Reviewers please start reviewing from commit with title "util changes done", the upstream PR is tracked in #12829

Change Logs

For all metadata deserialization now we change them to use input streams. A few minor place that requires more code refactoring we need to follow up separately.

Also fix the logic of checking if a given instant file is empty, previously we load file to byte array and check length, now we check file stats which is much light weight.

Impact

Allows for reading of large commit metadata without OOM.

Risk level (write none, low medium or high below)

low

Documentation Update

NA

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Feb 11, 2025
@github-actions github-actions bot added size:XL PR with lines of changes > 1000 and removed size:L PR with lines of changes in (300, 1000] labels Feb 12, 2025
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8992-migrate2 branch 12 times, most recently from bee0cdf to 9699515 Compare February 13, 2025 00:38
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse changed the title Hudi 8992 migrate2 [HUDI-8992] Deprecate all byte array usage in metadata deserialization path Feb 13, 2025
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8992-migrate2 branch 2 times, most recently from 082fd85 to 7b08e0b Compare February 13, 2025 02:47
@@ -394,4 +423,12 @@ public static <T extends SpecificRecordBase> byte[] convertCommitMetadataToJsonB
throw new HoodieIOException("Failed to convert to JSON.", e);
}
}

public static boolean isEmptyStream(InputStream inputStream) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like this method is unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

}
return JsonUtils.getObjectMapper().convertValue(replaceCommitMetadata, HoodieReplaceCommitMetadata.class);
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method below this one is now showing as unused, can it be cleaned up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

}

public static HoodieCleanerPlan deserializeCleanerPlan(byte[] bytes) throws IOException {
return deserializeAvroMetadata(bytes, HoodieCleanerPlan.class);
public static HoodieCleanMetadata deserializeHoodieCleanMetadataLegacy(byte[] bytes) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are only 2 callers of this and it looks like they can both be migrated to use the content stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

switch (actionType) {
case HoodieTimeline.CLEAN_ACTION: {
archivedMetaWrapper.setHoodieCleanMetadata(CleanerUtils.getCleanerMetadata(metaClient, instantDetails.get()));
archivedMetaWrapper.setHoodieCleanerPlan(CleanerUtils.getCleanerPlan(metaClient, planBytes.get()));
archivedMetaWrapper.setHoodieCleanerPlan(CleanerUtils.getCleanerPlanLegacy(metaClient, planBytes.get()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could replace all the "legacy" methods by wrapping the bytes in an input stream. This seems to be the main non-test class where these methods are used so we can also try to remove those methods after cleaning this up. This could help avoid confusion for future developers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -202,7 +202,7 @@ private Long getLatestCleanTimeTakenInMillis() throws IOException {
HoodieInstant clean = timeline.getReverseOrderedInstants().findFirst().orElse(null);
if (clean != null) {
HoodieCleanMetadata cleanMetadata =
TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantDetails(clean).get());
TimelineMetadataUtils.deserializeHoodieCleanMetadataLegacy(timeline.getInstantDetails(clean).get());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this

Suggested change
TimelineMetadataUtils.deserializeHoodieCleanMetadataLegacy(timeline.getInstantDetails(clean).get());
TimelineMetadataUtils.deserializeHoodieCleanMetadata(timeline.getInstantContentStream(clean));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


@Override
public boolean isEmpty(HoodieInstant instant) {
return TimelineUtils.isEmpty(metaClient, instant);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here, do we want to reference the readCommits?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied the same fix - keep the behavior the same as before.

Comment on lines 129 to 141
public static HoodieCleanMetadata getCleanerMetadata(HoodieTableMetaClient metaClient, Option<InputStream> details)
throws IOException {
CleanMetadataMigrator metadataMigrator = new CleanMetadataMigrator(metaClient);
HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(details);
return metadataMigrator.upgradeToLatest(cleanMetadata, cleanMetadata.getVersion());
}

public static HoodieCleanMetadata getCleanerMetadataFromInputStream(HoodieTableMetaClient metaClient, Option<InputStream> in)
throws IOException {
CleanMetadataMigrator metadataMigrator = new CleanMetadataMigrator(metaClient);
HoodieCleanMetadata cleanMetadata = TimelineMetadataUtils.deserializeHoodieCleanMetadata(in);
return metadataMigrator.upgradeToLatest(cleanMetadata, cleanMetadata.getVersion());
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These two methods look like they are the same?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh yes, fixed.

@@ -190,22 +190,21 @@ private static Option<HoodieRequestedReplaceMetadata> getRequestedReplaceMetadat
String action = factory instanceof InstantGeneratorV2 ? HoodieTimeline.CLUSTERING_ACTION : HoodieTimeline.REPLACE_COMMIT_ACTION;
requestedInstant = factory.createNewInstant(HoodieInstant.State.REQUESTED, action, pendingReplaceOrClusterInstant.requestedTime());
}
Option<byte[]> content = Option.empty();
if (timeline.isEmpty(requestedInstant)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move this to the error handling?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}
}

public static HoodieCompactionPlan getCompactionPlanFromInputStream(HoodieTableMetaClient metaClient, Option<InputStream> in) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems similar to the method at line 196

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

@@ -1094,7 +1094,7 @@ private static void reAddLogFilesFromRollbackPlan(HoodieTableMetaClient dataTabl
HoodieInstant requested = factory.getRollbackRequestedInstant(rollbackInstant);
try {
HoodieRollbackPlan rollbackPlan = TimelineMetadataUtils.deserializeAvroMetadata(
dataTableMetaClient.getActiveTimeline().readRollbackInfoAsBytes(requested).get(), HoodieRollbackPlan.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

readRollbackInfoAsBytes looks like it is no longer used in the codebase and can be removed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8992-migrate2 branch 9 times, most recently from 12cd738 to 2880aca Compare February 20, 2025 23:18
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8992-migrate2 branch 5 times, most recently from 680f56a to 17caf9f Compare February 21, 2025 17:07
@Davis-Zhang-Onehouse Davis-Zhang-Onehouse force-pushed the HUDI-8992-migrate2 branch 3 times, most recently from 670b377 to 81b6e74 Compare February 21, 2025 22:11
@@ -242,145 +220,6 @@ public String toString() {
}
}

static Stream<SchemaEvolutionTestCase> schemaEvolutionTestCases() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the test as it failed due to mock does not match the updated code change. Since we remove all those tests in #12781, I removed them here as well to unblock java CI.

@hudi-bot
Copy link

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size:XL PR with lines of changes > 1000
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants