feat: Distributed Procedure Support Part 1/X - core code base changes #26373

hantangwangd · 2025-10-21T06:20:14Z

Description

This PR is the first of several PRs that add distributed procedure support to Presto. It is split out of the original, complete PR: #22659.

The whole work in this PR includes the following parts:

Refactor the ProcedureRegistry/Procedure data structures to support creating and registering a DistributedProcedure. Also make ProcedureRegistry available in the presto-analyzer module and in connectors, so that distributed procedures can be recognized in CALL statements during the prepare and analyze stages.
Handle CALL statements on distributed procedures in the preparer stage. In this stage, we determine the type of the procedure named in the CALL statement and define a new query type, CALL_DISTRIBUTED_PROCEDURE, in BuiltInPreparedQuery. A CALL statement on a distributed procedure is then handled by SqlQueryExecutionFactory, and is created and executed as a SqlQueryExecution.
Analyze and plan the CALL statement based on the subtype of the distributed procedure. For the subtype TableDataRewriteDistributedProcedure, the resulting logical plan is:

OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode

Perform optimization, segmentation, grouped tagging, and local planning for the logical plan generated above. CallDistributedProcedureNode is handled similarly to TableWriterNode. In addition, a new optimizer, RewriteWriterTarget, is added and placed after all other optimization rules. Once optimization is complete, it updates the TableHandle held in TableFinishNode and CallDistributedProcedureNode based on the underlying TableScanNode, to account for possible filter pushdown.
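
To illustrate the target rewrite described above, here is a toy model. It is not code from this PR: `PlanNode`, `TableHandle`, `buildPlan`, `scanHandle`, and `rewriteTarget` are hypothetical stand-ins, and the sketch assumes that the only thing optimization changes is the handle on the scan node (for example, after a filter is pushed down).

```java
import java.util.Optional;

// Toy model only: every type here is a hypothetical stand-in, not a Presto planner class.
final class RewriteWriterTargetSketch
{
    static final class TableHandle
    {
        final String table;
        final String pushedDownFilter;

        TableHandle(String table, String pushedDownFilter)
        {
            this.table = table;
            this.pushedDownFilter = pushedDownFilter;
        }

        @Override
        public String toString()
        {
            return table + "[" + pushedDownFilter + "]";
        }
    }

    // A single-source plan node; only some node kinds carry a table handle.
    static final class PlanNode
    {
        final String name;
        final Optional<TableHandle> handle;
        final Optional<PlanNode> source;

        PlanNode(String name, Optional<TableHandle> handle, Optional<PlanNode> source)
        {
            this.name = name;
            this.handle = handle;
            this.source = source;
        }
    }

    // OutputNode <- TableFinishNode <- CallDistributedProcedureNode <- FilterNode <- TableScanNode
    static PlanNode buildPlan(TableHandle writerSideHandle, TableHandle scanSideHandle)
    {
        PlanNode scan = new PlanNode("TableScanNode", Optional.of(scanSideHandle), Optional.empty());
        PlanNode filter = new PlanNode("FilterNode", Optional.empty(), Optional.of(scan));
        PlanNode call = new PlanNode("CallDistributedProcedureNode", Optional.of(writerSideHandle), Optional.of(filter));
        PlanNode finish = new PlanNode("TableFinishNode", Optional.of(writerSideHandle), Optional.of(call));
        return new PlanNode("OutputNode", Optional.empty(), Optional.of(finish));
    }

    // Walk down to the scan node and return the handle it holds after optimization.
    static TableHandle scanHandle(PlanNode node)
    {
        if (node.name.equals("TableScanNode")) {
            return node.handle.get();
        }
        return scanHandle(node.source.get());
    }

    // Rebuild the chain, replacing the handle on the two writer-side nodes.
    static PlanNode rewriteTarget(PlanNode node, TableHandle target)
    {
        Optional<PlanNode> source = node.source.map(child -> rewriteTarget(child, target));
        boolean writerSide = node.name.equals("TableFinishNode")
                || node.name.equals("CallDistributedProcedureNode");
        return new PlanNode(node.name, writerSide ? Optional.of(target) : node.handle, source);
    }

    public static void main(String[] args)
    {
        TableHandle beforeOptimization = new TableHandle("orders", "");
        TableHandle afterPushdown = new TableHandle("orders", "ds = '2025-01-01'");
        PlanNode plan = buildPlan(beforeOptimization, afterPushdown);

        PlanNode rewritten = rewriteTarget(plan, scanHandle(plan));
        // The TableFinishNode now carries the same handle as the scan.
        System.out.println(rewritten.source.get().handle.get());
    }
}
```

Running the rewrite as a final pass means the writer-side nodes never have to track intermediate handle changes made by earlier rules.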

Motivation and Context

prestodb/rfcs#12

Impact

N/A

Test Plan

Add test cases for each phase touched by the procedure architecture expansion: creating and registering distributed procedures, and preparing, analyzing, logically planning, and optimizing CALL statements on distributed procedures.

Contributor checklist

Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
Documented new properties (with its default value), SQL syntax, functions, or other functionality.
If release notes are required, they follow the release notes guidelines.
Adequate tests were added if applicable.
CI passed.
If adding new dependencies, verified they have an OpenSSF Scorecard score of 5.0 or higher (or obtained explicit TSC approval for lower scores).

Release Notes

== RELEASE NOTES ==

General Changes
 * Upgrade the procedure architecture to support distributed execution of procedures

sourcery-ai

Sorry @hantangwangd, your pull request is larger than the review limit of 150000 diff characters

tdcmeehan

Mainly wondering if we can avoid the breaking change on Procedure.

tdcmeehan · 2025-10-31T17:29:58Z

...pi/src/main/java/com/facebook/presto/spi/procedure/TableDataRewriteDistributedProcedure.java

+                checkArgument(getArguments().get(i).getType().toString().equalsIgnoreCase("varchar"),
+                        format("Argument `%s` must be string type", SCHEMA));
+                schemaIndex = i;
+            }
+            else if (getArguments().get(i).getName().equals(TABLE_NAME)) {
+                checkArgument(getArguments().get(i).getType().toString().equalsIgnoreCase("varchar"),


Use StandardTypes

Suggested change

-                checkArgument(getArguments().get(i).getType().toString().equalsIgnoreCase("varchar"),
+                checkArgument(getArguments().get(i).getType().getBase().equals(VARCHAR),
                         format("Argument `%s` must be string type", SCHEMA));
                 schemaIndex = i;
             }
             else if (getArguments().get(i).getName().equals(TABLE_NAME)) {
-                checkArgument(getArguments().get(i).getType().toString().equalsIgnoreCase("varchar"),
+                checkArgument(getArguments().get(i).getType().getBase().equals(VARCHAR),

Thanks, fixed!
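
For readers following this thread, a small sketch of why comparing the base type name can differ from comparing the rendered signature. The `TypeSignature` class and `VARCHAR` constant below are hypothetical stand-ins rather than the Presto classes, and the sketch assumes a varchar with an explicit length renders as `varchar(n)`.

```java
// Hypothetical stand-ins: TypeSignature and VARCHAR only imitate the idea of a parsed
// type signature and a shared type-name constant; they are not Presto classes.
final class TypeCheckSketch
{
    static final String VARCHAR = "varchar";

    static final class TypeSignature
    {
        private final String base;
        private final Integer length; // null means no explicit length

        TypeSignature(String base, Integer length)
        {
            this.base = base;
            this.length = length;
        }

        String getBase()
        {
            return base;
        }

        @Override
        public String toString()
        {
            return length == null ? base : base + "(" + length + ")";
        }
    }

    // Original style of check: compares the full rendered signature.
    static boolean isVarcharByString(TypeSignature type)
    {
        return type.toString().equalsIgnoreCase("varchar");
    }

    // Suggested style of check: compares only the base type name against a constant.
    static boolean isVarcharByBase(TypeSignature type)
    {
        return type.getBase().equals(VARCHAR);
    }

    public static void main(String[] args)
    {
        TypeSignature bounded = new TypeSignature("varchar", 32);
        System.out.println("by string: " + isVarcharByString(bounded));
        System.out.println("by base:   " + isVarcharByBase(bounded));
    }
}
```

In this toy model the string comparison rejects the length-qualified type while the base comparison accepts it, and the constant removes the repeated string literal.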

tdcmeehan · 2025-10-31T17:30:51Z

presto-spi/src/main/java/com/facebook/presto/spi/procedure/Procedure.java

+    protected static void checkArgument(boolean assertion, String message)
    {
        if (!assertion) {
            throw new IllegalArgumentException(message);


I believe a generic IAE will get translated into an uncategorized Presto error. If so, better to use new PrestoException(INVALID_ARGUMENT, ...

Sure, I've changed the exception type and the corresponding tests.
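
A minimal sketch of the idea in this thread, using stand-in types. `ErrorCode`, `CategorizedException`, and `classify` are hypothetical names invented for illustration, and the "anything else is uncategorized" mapping is an assumption based on the reviewer's comment, not a statement about how the engine is implemented.

```java
// Hypothetical stand-ins: ErrorCode and CategorizedException imitate an engine exception
// that carries an error code; they are not Presto's actual classes.
final class CheckArgumentSketch
{
    enum ErrorCode
    {
        INVALID_ARGUMENTS,
        UNCATEGORIZED
    }

    static final class CategorizedException
            extends RuntimeException
    {
        final ErrorCode errorCode;

        CategorizedException(ErrorCode errorCode, String message)
        {
            super(message);
            this.errorCode = errorCode;
        }
    }

    // Throws an exception that already carries a user-facing error category.
    static void checkArgument(boolean assertion, String message)
    {
        if (!assertion) {
            throw new CategorizedException(ErrorCode.INVALID_ARGUMENTS, message);
        }
    }

    // Assumed engine-side behavior: exceptions without a code cannot be categorized.
    static ErrorCode classify(RuntimeException e)
    {
        if (e instanceof CategorizedException) {
            return ((CategorizedException) e).errorCode;
        }
        return ErrorCode.UNCATEGORIZED;
    }

    public static void main(String[] args)
    {
        try {
            checkArgument(false, "Argument `schema` must be string type");
        }
        catch (RuntimeException e) {
            System.out.println(classify(e) + ": " + e.getMessage());
        }
        System.out.println(classify(new IllegalArgumentException("plain IAE")));
    }
}
```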

tdcmeehan · 2025-10-31T17:35:11Z

presto-spi/src/main/java/com/facebook/presto/spi/procedure/Procedure.java

 import static java.util.stream.Collectors.joining;

-public class Procedure
+public abstract class Procedure


I wonder if we can avoid a breaking change here. What if we make Procedure extend an abstract type, for example, BaseProcedure, which DistributedProcedure can also extend from?

Great idea! I've renamed the current abstract parent class to BaseProcedure and reverted LocalProcedure back to Procedure. Please take a look when you have a chance. Thanks a lot!

@hantangwangd I believe there's still a breaking change here--all connectors will be required to migrate to return BaseProcedure. Wondering if it makes sense to keep getProcedures to return Procedure, but add a separate one getDistributedProcedures so folks can opt-in to these procedures without requiring a migration? We can consider a more generic API if a third type of procedure is added?

@tdcmeehan thanks for pointing this out. Yes, you are correct that there's still a breaking change here. I'll look into the feasibility of your suggestion. Also, it seems that moving the Argument class into BaseProcedure could break backward compatibility as well. I'll consider both points to figure out a reasonable solution.

Hi @tdcmeehan, I've done the following two things to keep full backward compatibility:

Retain the Connector.getProcedures() SPI method as you suggested, and add a new generic method to support other BaseProcedure subtypes such as DistributedProcedure.

Use generics for the Argument class hierarchy to enable shared common logic while maintaining backward compatibility.

After this refactoring, existing in-tree and out-of-tree connectors require no changes unless they intend to support distributed procedures. Please take a look when you get a chance. Thanks a lot!
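
One way to picture the generics approach described above is the sketch below. All shapes here are assumptions made for illustration (`BaseArgument`, the type parameter on `BaseProcedure`, and the constructor signatures are hypothetical); the actual classes in presto-spi may be organized differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical shapes only; the real classes in presto-spi may differ in every detail.
final class ProcedureHierarchySketch
{
    // Argument behavior shared by every procedure kind.
    abstract static class BaseArgument
    {
        private final String name;

        BaseArgument(String name)
        {
            this.name = name;
        }

        String getName()
        {
            return name;
        }
    }

    // Procedure behavior shared by every kind, parameterized by the concrete argument type.
    abstract static class BaseProcedure<T extends BaseArgument>
    {
        private final String name;
        private final List<T> arguments;

        BaseProcedure(String name, List<T> arguments)
        {
            Set<String> seen = new HashSet<>();
            for (T argument : arguments) {
                if (!seen.add(argument.getName())) {
                    throw new IllegalArgumentException("Duplicate argument name: " + argument.getName());
                }
            }
            this.name = name;
            this.arguments = Collections.unmodifiableList(new ArrayList<>(arguments));
        }

        String getName()
        {
            return name;
        }

        List<T> getArguments()
        {
            return arguments;
        }
    }

    // The existing type keeps its name and its nested Argument class,
    // so code written against Procedure.Argument keeps compiling.
    static class Procedure
            extends BaseProcedure<Procedure.Argument>
    {
        static class Argument
                extends BaseArgument
        {
            Argument(String name)
            {
                super(name);
            }
        }

        Procedure(String name, List<Argument> arguments)
        {
            super(name, arguments);
        }
    }

    // The new kind reuses the shared validation through the same base class.
    static class DistributedProcedure
            extends BaseProcedure<DistributedProcedure.Argument>
    {
        static class Argument
                extends BaseArgument
        {
            Argument(String name)
            {
                super(name);
            }
        }

        DistributedProcedure(String name, List<Argument> arguments)
        {
            super(name, arguments);
        }
    }

    public static void main(String[] args)
    {
        List<Procedure.Argument> arguments = new ArrayList<>();
        arguments.add(new Procedure.Argument("schema"));
        arguments.add(new Procedure.Argument("table_name"));
        Procedure local = new Procedure("example_procedure", arguments);
        System.out.println(local.getName() + " takes " + local.getArguments().size() + " arguments");
    }
}
```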

tdcmeehan

This looks good to me, just some remaining questions on the SPI changes.

tdcmeehan · 2025-11-10T21:56:06Z

presto-main-base/src/main/java/com/facebook/presto/connector/ConnectorManager.java

            requireNonNull(procedures, "Connector %s returned a null procedures set");
-            this.procedures = ImmutableSet.copyOf(procedures);
+            proceduresBuilder.addAll(procedures);
+            Set<DistributedProcedure> distributedProcedures = connector.getProcedures(DistributedProcedure.class);


Is there any particular reason to add a generic method here, instead of simply adding a new method, getDistributedProcedures, that does this as well? I'm thinking that adding the new method wouldn't require a deprecation cycle for the older getProcedures method; we would simply have two parallel methods that each return a different type of procedure.

We can consider a more generic API if a third type of procedure is added?

@tdcmeehan thanks for the review and suggestion. I might have misunderstood your comment above. I thought you were suggesting we add a more generic API, to support any future addition of a third type of procedure, rather than one that only supports DistributedProcedure.

Sure, I'll change it to the more straightforward and specific getDistributedProcedures as you suggested. If we need to support a third type of procedure in the future, we can simply add a dedicated API method for it.

Hi @tdcmeehan, the API method has been changed to getDistributedProcedures. Please take a look when you get a chance, thanks!
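
A sketch of the opt-in shape this thread converged on, using stand-in types. The assumption here is that both methods are `default` methods returning an empty set; `collectProcedures`, `LegacyConnector`, `RewritingConnector`, and the procedure names are hypothetical and exist only for illustration.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Stand-in types for illustration; not the actual presto-spi interfaces.
final class ConnectorSpiSketch
{
    static class BaseProcedure
    {
        final String name;

        BaseProcedure(String name)
        {
            this.name = name;
        }
    }

    static class Procedure
            extends BaseProcedure
    {
        Procedure(String name)
        {
            super(name);
        }
    }

    static class DistributedProcedure
            extends BaseProcedure
    {
        DistributedProcedure(String name)
        {
            super(name);
        }
    }

    interface Connector
    {
        // Existing method, unchanged: existing connectors keep compiling.
        default Set<Procedure> getProcedures()
        {
            return Collections.emptySet();
        }

        // New, parallel method: connectors opt in by overriding it.
        default Set<DistributedProcedure> getDistributedProcedures()
        {
            return Collections.emptySet();
        }
    }

    // Engine side: gather both kinds into one set for registration.
    static Set<BaseProcedure> collectProcedures(Connector connector)
    {
        Set<BaseProcedure> all = new LinkedHashSet<>();
        all.addAll(Objects.requireNonNull(connector.getProcedures(), "procedures is null"));
        all.addAll(Objects.requireNonNull(connector.getDistributedProcedures(), "distributedProcedures is null"));
        return Collections.unmodifiableSet(all);
    }

    // A connector written before the change: only overrides the old method.
    static final class LegacyConnector
            implements Connector
    {
        @Override
        public Set<Procedure> getProcedures()
        {
            return Collections.singleton(new Procedure("legacy_procedure"));
        }
    }

    // A connector that opts in to distributed procedures as well.
    static final class RewritingConnector
            implements Connector
    {
        @Override
        public Set<Procedure> getProcedures()
        {
            return Collections.singleton(new Procedure("local_procedure"));
        }

        @Override
        public Set<DistributedProcedure> getDistributedProcedures()
        {
            return Collections.singleton(new DistributedProcedure("rewrite_procedure"));
        }
    }

    public static void main(String[] args)
    {
        System.out.println(collectProcedures(new LegacyConnector()).size());
        System.out.println(collectProcedures(new RewritingConnector()).size());
    }
}
```

With two parallel methods, a third procedure kind would get its own method later rather than forcing a generic API now.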

tdcmeehan · 2025-11-11T16:46:42Z

@hantangwangd I just realized that this framework doesn't appear to support explicit access control. My thinking is there should be two levels, the first is, we can have an access control check on the procedure itself, similar to table level access. Secondly, for distributed procedures which write data to a table, we should probably have INSERT + DELETE permissions required. I think this can be done as a followup.

Separately, I believe we should add user-facing documentation to our website for this framework: instructions for how to create these distributed procedures. This PR is quite large, so I will leave it up to you whether to add them now or later, although I have a slight preference for adding them now.

hantangwangd · 2025-11-11T19:11:08Z

My thinking is there should be two levels, the first is, we can have an access control check on the procedure itself, similar to table level access. Secondly, for distributed procedures which write data to a table, we should probably have INSERT + DELETE permissions required. I think this can be done as a followup.

Thanks for the suggestion. Sure, I will add access control for the procedure framework in a follow-up PR.

I believe we should add user-facing documentation to our website for this framework. Instructions for how to create these distributed procedures.

Yes, I have been thinking about this documentation for several days. When I started thinking about how to write this document, I realized that distributed procedures may need to include two levels. The first level is how developers can define and extend a new subtype of distributed procedure (refer to TableDataRewriteDistributedProcedure). The second level is how developers can implement a concrete distributed procedure with a specific subtype for a particular connector (refer to RewriteDataFilesProcedure on Iceberg). Besides, we currently lack developer documentation for the original procedures as well. Therefore, my thought is that once we finalize the design for extending the procedure framework at both of these levels, I can add all these documents in a dedicated follow-up PR. Does this sound reasonable to you? Also, any suggestions for the document's content would be greatly appreciated!

tdcmeehan

This looks good to me. The abstractions match existing conventions faithfully. I'm looking forward to the extensive tests for Iceberg rewrites.

Refactor `Procedure` and `DistributedProcedure` into abstract classes. Use a subclass `TableDataRewriteDistributedProcedure` for table rewrite tasks, for example, merge small data files, sort table data, repartition table data etc. And introduce a new class `LocalProcedure` to represent the former coordinator-only procedures. Rename `IProcedureRegistry` to `ProcedureRegistry`, and accordingly rename previous `ProcedureRegistry` to `BuiltInProcedureRegistry`.

1. Retain the `Connector.getProcedures()` SPI method for backward compatibility. Add a new generic method to support other `BaseProcedure` subtypes such as `DistributedProcedure`.
2. Use generics for the `Argument` class hierarchy to enable shared logic while maintaining backward compatibility.

hantangwangd · 2025-11-15T02:15:20Z

@tdcmeehan thank you so much for your review throughout the process. I will add the developer documentation ASAP, and add access control for the procedure architecture in a follow-up PR.

sourcery-ai bot reviewed Oct 21, 2025

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch 2 times, most recently from 2e0cff9 to dbb5eb0 Compare October 21, 2025 09:26

hantangwangd marked this pull request as ready for review October 21, 2025 12:13

hantangwangd requested review from a team, ClarenceThreepwood, ZacBlanco, elharo, feilong-liu, jaystarshot and shrinidhijoshi as code owners October 21, 2025 12:13

sourcery-ai bot reviewed Oct 21, 2025

hantangwangd requested a review from tdcmeehan October 21, 2025 12:17

tdcmeehan self-assigned this Oct 22, 2025

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch 2 times, most recently from 02d3252 to 8bf8be6 Compare October 30, 2025 10:44

hantangwangd requested review from czentgr and unidevel as code owners October 30, 2025 10:44

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch from 8bf8be6 to 07a4fd9 Compare October 30, 2025 13:05

tdcmeehan reviewed Oct 31, 2025

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch from 07a4fd9 to bfa4bc0 Compare November 1, 2025 17:13

hantangwangd requested review from 7c00 and vinothchandar as code owners November 1, 2025 17:13

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch 5 times, most recently from f19a8f8 to d90f700 Compare November 2, 2025 03:05

hantangwangd mentioned this pull request Nov 2, 2025

The prestocpp-format-and-header-check is consistently failing for the PRs that trigger it #26510

Closed

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch 3 times, most recently from c57f657 to fe00517 Compare November 8, 2025 17:01

tdcmeehan reviewed Nov 10, 2025

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch from fe00517 to bb7dcd7 Compare November 11, 2025 02:54

tdcmeehan previously approved these changes Nov 14, 2025

hantangwangd added 14 commits November 15, 2025 09:57

Refactor ProcedureRegistry to support distributed procedure

239533e

And expose the procedure registry to the `presto-analyzer` and `connectors` module

Support and handle call distributed procedure statement in preparer

41b6c5b

Analyze and plan for call distributed procedure statement

e6e4901

Execute optimization, segmentation and local planning for call distributed procedure

101553d

Refactor the connector SPI to expose procedure registry to connectors

c638c9f

[native] Relevant changes of presto protocol for distributed procedure

60a0a16

Rename the types in the Procedure hierarchy to avoid a breaking change

f222e31

Rename abstract class `Procedure` to `BaseProcedure`, and then rename `LocalProcedure` back to `Procedure` to maintain backward compatibility

Address comments

a2911ab

Use `StandardTypes` to check the type of the procedure arguments. Throw a `PrestoException` with error code of `INVALID_ARGUMENTS` rather than an IAE.

Fix compiling failure caused by commit conflict

d355696

Maintain compatibility with custom connector-provided serialization

274cc6c

Resolve conflicts in auto-generated c++ protocol

4e544fa

Use a dedicated spi method in Connector for distributed procedure

1c42e20

hantangwangd dismissed tdcmeehan’s stale review via 1c42e20 November 15, 2025 02:04

hantangwangd force-pushed the support_call_distributed_procedure_part1 branch from bb7dcd7 to 1c42e20 Compare November 15, 2025 02:04

tdcmeehan approved these changes Nov 15, 2025

hantangwangd merged commit 2f8bbba into prestodb:master Nov 15, 2025
82 of 83 checks passed

hantangwangd deleted the support_call_distributed_procedure_part1 branch November 15, 2025 05:05

This was referenced Nov 23, 2025

Add developer documentation for Procedures #26679

Open

Add access control for Procedure architecture #26680

Open
