@suryaprasanna
Contributor

Describe the issue this Pull Request addresses

This PR removes the deprecated ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN configuration and migrates all log scanning operations to use the ScanV2Internal API as the default implementation. The change simplifies the codebase by eliminating the dual-path scanning logic that was maintained for backward compatibility.

Summary and Changelog

Users will no longer need to configure ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN as the optimized log scanning is now the default behavior. This change streamlines the log reading path and removes approximately 436 lines of legacy code.

Changes:

  • Removed ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN configuration from HoodieReaderConfig and HoodieCompactionConfig
  • Removed conditional logic using enableOptimizedLogBlocksScan across log scanning components
  • Simplified AbstractHoodieLogRecordScanner and BaseHoodieLogRecordReader by removing legacy scan path
  • Updated HoodieMergedLogRecordReader, HoodieMergedLogRecordScanner, and HoodieUnMergedLogRecordScanner to use ScanV2Internal exclusively
  • Cleaned up references in metadata writers, clustering strategies, and test utilities
  • Updated Hive integration components to remove deprecated configuration checks
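The shape of this cleanup can be sketched roughly as follows. This is an illustration only, not Hudi's code: `LegacyDualPathScanner`, `SinglePathScanner`, `scanV1`, and `scanV2` are hypothetical names standing in for the flag-guarded branch and the ScanV2Internal-only path, and the string return values exist only to show which path ran.

```java
public class ScanPathSketch {

    interface Scanner {
        String scan();
    }

    // Hypothetical "before" shape: a config flag selects one of two scan paths.
    static final class LegacyDualPathScanner implements Scanner {
        private final boolean enableOptimizedLogBlocksScan;

        LegacyDualPathScanner(boolean enableOptimizedLogBlocksScan) {
            this.enableOptimizedLogBlocksScan = enableOptimizedLogBlocksScan;
        }

        @Override
        public String scan() {
            // The conditional that this PR removes.
            return enableOptimizedLogBlocksScan ? scanV2() : scanV1();
        }

        private String scanV1() {
            return "legacy";
        }

        private String scanV2() {
            return "v2";
        }
    }

    // Hypothetical "after" shape: no flag, one path.
    static final class SinglePathScanner implements Scanner {
        @Override
        public String scan() {
            return "v2";
        }
    }

    public static void main(String[] args) {
        System.out.println(new LegacyDualPathScanner(false).scan());
        System.out.println(new LegacyDualPathScanner(true).scan());
        System.out.println(new SinglePathScanner().scan());
    }
}
```

With the flag gone, callers that previously passed `false` end up on the same path as callers that passed `true`, which is the behavior change the review discussion focuses on.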

Impact

Breaking Change: The ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN configuration option has been removed. Users who explicitly set this configuration will need to remove it from their configurations. The new default behavior is equivalent to having this config enabled.
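For users who carry the option in a properties file, a small cleanup step along these lines may help. It is a minimal sketch using only `java.util.Properties`; the class name `RemovedConfigCleaner` is hypothetical, and the key string is an assumption (the PR only shows the key being built as `"hoodie" + HoodieMetadataConfig.OPTIMIZED_LOG_BLOCKS_SCAN`), so check it against your own configuration before relying on it.

```java
import java.util.Properties;

public class RemovedConfigCleaner {

    // ASSUMPTION: the resolved key string is not shown in the PR; verify before use.
    static final String ASSUMED_REMOVED_KEY = "hoodie.optimized.log.blocks.scan.enable";

    // Removes the given key if present and reports whether anything was dropped.
    static boolean stripRemovedKey(Properties props, String removedKey) {
        Object previous = props.remove(removedKey);
        if (previous != null) {
            System.err.println("Dropping removed config " + removedKey + "=" + previous);
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ASSUMED_REMOVED_KEY, "false");
        props.setProperty("hoodie.table.name", "example_table");

        boolean dropped = stripRemovedKey(props, ASSUMED_REMOVED_KEY);
        System.out.println("dropped=" + dropped + ", remaining=" + props);
    }
}
```

Logging the dropped value is a deliberate choice here: a previous value of `false` marks a job whose scan path changes with this PR.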

Performance: No performance impact expected as ScanV2Internal was already the recommended and optimized path. Users who had the config disabled will see performance improvements.

Risk Level

Low - The ScanV2Internal API has been available and tested for several releases. This change only removes the legacy fallback path. All existing tests pass with the new default behavior.

Documentation Update

  • Configuration documentation needs to be updated to remove references to ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN
  • Release notes should highlight this as a breaking change for users who explicitly disabled the optimization

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Dec 8, 2025
@suryaprasanna suryaprasanna force-pushed the remove-olderscan-method branch from 62a86ea to c952745 Compare December 8, 2025 19:26
@suryaprasanna suryaprasanna changed the title refactor!: migrate to ScanV2Internal API and remove ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN config refactor!: migrate to ScanV2Internal API and remove ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN config Dec 8, 2025
@vinothchandar
Member

@suryaprasanna since you have a ! in the title, what's the breaking change?

@suryaprasanna
Contributor Author

@vinothchandar No breaking change as such, I just want to be careful when refactoring this codepath. Let me remove the ! as it can cause confusion.

@suryaprasanna suryaprasanna changed the title from "refactor!: migrate to ScanV2Internal API and remove ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN config" to "refactor: migrate to ScanV2Internal API and remove ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN config" Dec 8, 2025
@hudi-bot
Collaborator

hudi-bot commented Dec 8, 2025

CI report:



public static final ConfigProperty<String> ENABLE_OPTIMIZED_LOG_BLOCKS_SCAN = ConfigProperty
.key("hoodie" + HoodieMetadataConfig.OPTIMIZED_LOG_BLOCKS_SCAN)
.defaultValue("false")
Contributor

This changes the behavior by using the optimized log block scan by default. I remember that the optimized log block scan has two passes which increase the number of FS calls, which can be noticeable on cloud storage. Is the overhead measured and negligible?
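One way to answer this kind of question is to count storage calls directly. The sketch below is a toy harness, not Hudi's implementation: `BlockStore`, `CountingStore`, `onePass`, and `twoPass` are all hypothetical, and the two strategies are deliberately simplistic stand-ins rather than the real access patterns of either scan path. The point is only the counting wrapper, which could be placed around a real storage client.

```java
import java.util.concurrent.atomic.AtomicLong;

public class FsCallCounter {

    // Hypothetical storage abstraction; each read() stands for one FS call.
    interface BlockStore {
        byte[] read(int blockIndex);
    }

    // Decorator that counts every call before delegating.
    static final class CountingStore implements BlockStore {
        private final BlockStore delegate;
        private final AtomicLong calls = new AtomicLong();

        CountingStore(BlockStore delegate) {
            this.delegate = delegate;
        }

        @Override
        public byte[] read(int blockIndex) {
            calls.incrementAndGet();
            return delegate.read(blockIndex);
        }

        long calls() {
            return calls.get();
        }
    }

    // Toy strategy: touch every block once.
    static void onePass(BlockStore store, int blockCount) {
        for (int i = 0; i < blockCount; i++) {
            store.read(i);
        }
    }

    // Toy strategy: touch every block twice (worst case for a two-pass scan).
    static void twoPass(BlockStore store, int blockCount) {
        for (int i = 0; i < blockCount; i++) {
            store.read(i);
        }
        for (int i = 0; i < blockCount; i++) {
            store.read(i);
        }
    }

    public static void main(String[] args) {
        BlockStore inMemory = index -> new byte[0];

        CountingStore single = new CountingStore(inMemory);
        onePass(single, 5);

        CountingStore twice = new CountingStore(inMemory);
        twoPass(twice, 5);

        System.out.println("one-pass calls: " + single.calls());
        System.out.println("two-pass calls: " + twice.calls());
    }
}
```

Running both scan paths against the same set of log files behind such a wrapper would give concrete call counts to compare on cloud storage.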
