Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define extension API for user-defined invariants. #14029

Open
wiedld opened this issue Jan 6, 2025 · 5 comments
Open

Define extension API for user-defined invariants. #14029

wiedld opened this issue Jan 6, 2025 · 5 comments
Labels
enhancement New feature or request

Comments

@wiedld
Copy link
Contributor

wiedld commented Jan 6, 2025

Is your feature request related to a problem or challenge?

As part of the work to automatically check invariants for the logical and execution plans, we have provided infrastructure to run an invariant checker. This invariant checker runs at limited time points in order to not degrade planning performance; e.g. after all optimizations are completed. In debug mode, it runs these checks more often and can therefore help to quickly isolate at which point (e.g. which specific optimizer run) make the plan become invalid.

We want to also enable users to add their own invariants. Users are already able to add their own Logical and Execution plan extensions, as well as their own optimization runs which modify these plans. Therefore it may be useful for an invariant extension interface for user-defined invariants. e.g. If a change in Datafusion core's optimizer passes will cause a problem in a user-defined Logical plan extension, then the user could define an invariant based upon what their Logical plan extension requires.

Refer to specific examples in this conversation, for plan extensions which have their own invariants. For the example case of our own ProgressiveEval -- we require the input partition streams to have specific sort orders, non-overlapping column ranges, and no pushdown of the offset (issue) in order to provide the correct result. An invariant check, performed after each optimizer run (while in debug mode), would enable us to quickly isolate the problem during DF upgrade.

(We have several other, more complex, examples of how changes in the optimization of UNIONs has produced invalid plans for our SortPreservingMerge. So this is not a one-off example, the above is merely the simplest concrete example.)

Describe the solution you'd like

Take the existing invariant infrastructure provided as part of this issue, and provide extension points for users to define their own invariants.

Describe alternatives you've considered

  • Alternative 1: for a user-defined Execution plan extension, have a runtime check of invariants be performed.

    • Con: this detects problems after planning time, thereby increasing both time-until-error as well as resource utilization.
  • Alternative 2: for either Logical or Physical plan extensions, the user can define an optimization run which is intended to detect invariant violations which are in conflict with their plan extensions.

    • Pro: can detect invariant violation at planning time
    • Con: arguably more code code complexity:
      • in order to isolate exactly which plan mutation (Datafusion core change) caused the problem, it would need to be coded to run after each optimizer pass.

Additional context

No response

@wiedld wiedld added the enhancement New feature or request label Jan 6, 2025
@alamb
Copy link
Contributor

alamb commented Jan 6, 2025

Thanks @wiedld -- I don't fully understand the usecase

Take the existing invariant infrastructure provided as part of #13652 (comment), and provide extension points for users to define their own invariants.

Could you provide an example of such an invariant?

I normally think of "invariants" as some property that always holds true for a certain type of node (for example that LogicalPlan::Join always has 2 inputs). The invariants in this case are defined by the semantics of the node itself (so as a user I couldn't add a invariant that LogicalPlan::Join had 3 inputs)

It would perhaps make sense to provide a way to define invariants for UserDefinedLogicalNode and user provided implementations of ExecutionPlan

Defining an invarint check for ExecutionPlan I think would satisfy the usecase you mention above having specific rules for ProgressiveEval (a user defined extension node)

@alamb
Copy link
Contributor

alamb commented Jan 6, 2025

For example, if we added a function like this to the ExecutionPlan trait, as proposed in #13986 (comment) I think that would permit implementing the invariant check without any more work as I understand it

impl ExecutionPlan {
...
    fn check_node_invariants(&self, invariant_level: InvariantLevel) -> Result<()> 
      Ok(())
    }
...
}

@wiedld
Copy link
Contributor Author

wiedld commented Jan 6, 2025

For example, if we added a function like this to the ExecutionPlan trait, as proposed in #13986 (comment) I think that would permit implementing the invariant check without any more work as I understand it

Agreed. If this^^ is the final form of the execution plan invariants, then we don't need an additional interface for the execution plan invariant extensions. But that would still leave the logical plan invariant extensions for consideration.

If that WIP keeps the ExecutionPlan::check_node_invariants in it's final merge-able state, then we have half of this ticket completed.

@alamb
Copy link
Contributor

alamb commented Jan 7, 2025

But that would still leave the logical plan invariant extensions for consideration.

I would recommend adding a method to: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.UserDefinedLogicalNode.html

Perhaps even the same API:

pub trait UserDefinedLogicalNode { 
...
    fn check_invariants(&self, invariant_level: InvariantLevel) -> Result<()> 
      Ok(())
    }
...

🤔

Then you could extend LogicalPlan::check_invariants to call that method for any LogicalPlan::UserDefined

pub fn check_invariants(&self, check: InvariantLevel) -> Result<()> {

@eliaperantoni
Copy link

Another use case, as discussed in person with @wiedld and @crepererum, would be enforcing the invariant that all nodes that had Spans after the initial plan is produced, should still have them after the optimizations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants