Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: move SanityChecker into physical-optimizer crate #14083

Merged
merged 5 commits into from
Jan 16, 2025

Conversation

mnpw
Copy link
Contributor

@mnpw mnpw commented Jan 11, 2025

Which issue does this PR close?

Closes #14072.

Rationale for this change

From #14072

Historically DataFusion was one (very) large crate datafusion, and as it grew bigger we extracted various functionality into separate crates. This leads to both faster compile times (as the crates can be compiled in parallel) as well easier to navigate code (as the crates force a cleaner dependency separation)

What changes are included in this PR?

  • Move SanityChecker from datafusion crate to datafusion-physical-optimizer crate

Are these changes tested?

Ran cargo test

Are there any user-facing changes?

No

@github-actions github-actions bot added optimizer Optimizer rules core Core DataFusion crate labels Jan 11, 2025
@mnpw
Copy link
Contributor Author

mnpw commented Jan 11, 2025

@alamb for your consideration.

I don't like that datafusion-physical-optimizer needs to use datafusion crate as a dev-dependency. This was required as SanityChecker tests were tightly coupled with the helper functions in datafusion::physical_optimizer::test_utils. Perhaps we can consider moving these helper functions somewhere outside, say, to test-utils crate?

@alamb
Copy link
Contributor

alamb commented Jan 11, 2025

I don't like that datafusion-physical-optimizer needs to use datafusion crate as a dev-dependency.

Thanks @mnpw -- I agree this is not good and we shouldn't merge it in like that (and in fact I think it failed the circular dependency check)

I think one of the major sources of this dependency is the use of ParquetExec which is still defined in the core module...

Since datafusion already depends on phsyical-optimizer, what would you think about moving as many of the physical optimizer utils from

@alamb alamb marked this pull request as draft January 12, 2025 11:28
@alamb
Copy link
Contributor

alamb commented Jan 12, 2025

Converting to draft given the dependency situation we have uncovered

@mnpw
Copy link
Contributor Author

mnpw commented Jan 15, 2025

@alamb

Here's my understanding of the issue:

  • datafusion crate already depends on datafusion-physical-optimizer.
  • SanityChecker's tests depend on datafusion::datasources::{listing, physical_plan, stream} (via datafusion::physical_optimizer::test_utils).
  • Moving SanityChecker, with its tests, to datafusion-physical-optimizer makes datafusion-physical-optimizer depend on datafusion::datasources.
  • This creates a circular dependency between datafusion and datafusion-physical-optimizer.

In this case, best solution seems to be to move SanityChecker to datafusion-physical-optimizer while keeping its tests in datafusion/core/tests.

I have re-worked the PR and made the following changes:

  • Move SanityChecker to datafusion-physical-optimizer
  • Move SanityChecker's tests to datafusion/core/tests
  • Move independent componenets from datafusion::physical_optimizer::test_utils to datafusion-physical-optimizer::test_utils

@mnpw mnpw marked this pull request as ready for review January 15, 2025 14:50
datafusion-expr-common = { workspace = true, default-features = true }
datafusion-functions-aggregate = { workspace = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we are trying very hard to avoid this dependency -- specifically we are trying to avoid having the optimizer depend on any actual function implementations

The separation ensures that there is no special treatment for provided functions vs built in ones

I played around with this some morning and I found a way to remove it here:

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @mnpw -- this looks great. I had one improvement but we can do this as a follow on PR

@alamb
Copy link
Contributor

alamb commented Jan 15, 2025

I will update the datafusion-cli cargo file as well

@mnpw
Copy link
Contributor Author

mnpw commented Jan 15, 2025

Thanks @alamb for proactively reviewing and adding the fix commits. Much appreciated 🙏

@alamb
Copy link
Contributor

alamb commented Jan 15, 2025

Thanks @alamb for proactively reviewing and adding the fix commits. Much appreciated 🙏

Thank you for helping push this along ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate optimizer Optimizer rules
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Move SanityChecker into datafusion-physical-optimizer crate
2 participants