Cherry pick "orca support Foreign Scans" #839

jiaqizho · 2025-01-03T05:09:02Z

Fixes #ISSUE_Number

What does this PR do?

Type of Change

Bug fix (non-breaking change)
New feature (non-breaking change)
Breaking change (fix or feature with breaking changes)
Documentation update

Breaking Changes

Test Plan

Unit tests added/updated
Integration tests added/updated
Passed make installcheck
Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Followed contribution guide
Added/updated documentation
Reviewed code for security implications
Requested review from cloudberry committers

Additional Context

CI Skip Instructions

…d Oid conflict (#14411)

Oid is not guaranteed to be unique across the GPDB cluster. In rare cases, an index and a constraint may happen to have the same Oid. Orca, however, always assumes that an Oid is unique, even across different types of objects. We need to remove this risky assumption from Orca; otherwise, when Oids are duplicated, Orca may hit errors (and hence fall back to the planner) while loading Index/Check Constraints from the MD cache.

Before this patch, Index/Type/Relation/Operator/Func/Agg/Trigger/Constraint all shared the same MdId type, CMDIdGPDB::EmdidGPDB. Ideally we would use a separate MdId type for each type of object, but that is difficult to do without changing a massive number of mdp files. Given that most user-created objects are Relation/Index/Constraint, we decided to assign separate MdId types only for Relation, Index and Constraint. We also renamed EmdidGPDB to EmdidGeneral.

Ref: GPQP-93
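The Oid collision described in this commit message can be illustrated with a small sketch. This is not Orca code: `MdidKind`, `Mdid`, `build_cache`, the object names and the Oid value are all hypothetical, chosen only to show why a cache keyed by Oid alone loses an entry while a cache keyed by (kind, Oid) does not.

```python
from dataclasses import dataclass
from enum import Enum


class MdidKind(Enum):
    # Hypothetical names. The commit only states that Relation, Index and
    # Constraint get their own MdId types and everything else stays general.
    GENERAL = "general"
    RELATION = "relation"
    INDEX = "index"
    CONSTRAINT = "constraint"


@dataclass(frozen=True)
class Mdid:
    """Metadata id that carries the object kind alongside the Oid."""
    kind: MdidKind
    oid: int


def build_cache(objects, key):
    """Store objects in a dict under key(obj); later entries overwrite earlier ones."""
    cache = {}
    for obj in objects:
        cache[key(obj)] = obj
    return cache


# Two different catalog objects that happen to share the same (made-up) Oid.
objects = [
    {"kind": MdidKind.INDEX, "oid": 16400, "name": "t_idx"},
    {"kind": MdidKind.CONSTRAINT, "oid": 16400, "name": "t_check"},
]

# Keyed by Oid only: the constraint silently replaces the index.
by_oid = build_cache(objects, key=lambda o: o["oid"])

# Keyed by (kind, Oid): both objects survive.
by_mdid = build_cache(objects, key=lambda o: Mdid(o["kind"], o["oid"]))
```

With the Oid-only key, a lookup for the index would return the constraint, which is the kind of mismatch that could make a metadata load fail.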

These are similar changes to the PR pipeline. Currently this pipeline is failing when using the centos7 image.

Prior to this commit, while translating constant values for text-related domains over types like char, bpchar and name, ORCA was calling the incorrect hashing function. This led to a `data corrupt` error during Query-to-DXL translation in ORCA. This commit fixes the issue by checking the base type of such domain types and calling the corresponding hashing function. Fixes issue: #14155
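A minimal sketch of the idea behind this fix, assuming a toy type catalog: resolve a domain to its base type first, then choose the hashing function for that base type. The catalog, the domain names and the hasher labels below are hypothetical placeholders, not the functions ORCA actually calls.

```python
# Hypothetical catalog: type name -> base type name (None for a non-domain type).
TYPE_BASE = {
    "char": None,
    "bpchar": None,
    "name": None,
    "country_code": "bpchar",        # domain over bpchar
    "eu_country_code": "country_code",  # domain over another domain
}

# Placeholder labels standing in for per-type hashing functions.
HASHER_FOR_BASE = {
    "char": "hash_for_char",
    "bpchar": "hash_for_bpchar",
    "name": "hash_for_name",
}


def resolve_base_type(type_name):
    """Follow domain links until a non-domain (base) type is reached."""
    while TYPE_BASE[type_name] is not None:
        type_name = TYPE_BASE[type_name]
    return type_name


def hasher_for(type_name):
    """Pick the hasher by base type, so domains hash like their base type."""
    return HASHER_FOR_BASE[resolve_base_type(type_name)]
```

In this sketch, `hasher_for("eu_country_code")` resolves through both domain levels and returns the bpchar hasher label.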

We also needed to move the image declaration to the pipeline file since the image now has a password.

Issue: ORCA generates a plan with hash redistribution when data is highly skewed, making one segment the bottleneck in execution.

Root cause: Currently skew is only taken into account when the number of distinct values is less than the segment count. This oversimplifies the scenarios in which skew may arise.

Solution: We compute the skew ratio using sampled statistics. The sampling rate of each bucket is proportional to the bucket frequency, i.e., the higher the bucket frequency, the more datums we sample from that bucket. We employ a deterministic sampling algorithm by always starting with the lower bound, then the upper bound, the midpoint, one quarter, three quarters, one eighth, three eighths, etc. In normalized terms: 0, 1, 0.5, 0.25, 0.75, 0.125, 0.375, etc. The sampling series ensures discrete sampling.

The skew computation is controlled by the GUC optimizer_skew_factor. Its default value is 0, i.e., skew computation is turned off. Users can turn it on by setting the GUC to a value larger than 0. The GUC is called skew factor because the final skewness used for costing is the product of the skew factor (coefficient) and the skew ratio (variable). This GUC gives users additional control over plan motions (broadcast/gather vs. redistribution) if they choose to emphasize the data skew.

Implementation:
[CCostModelGPDB] -- Enable Histogram ComputeSkew to calculate the skew factor of relations to be joined. Implement a superlinear skew multiplier, to allow fine tuning when the skew factor is small and coarse tuning when the skew factor is big.
[ICostModel] -- Add root stats getter.
[CPhysical] -- Make helper function GetSkew public.
[CHistogram] -- Replace the RAND function with a deterministic algorithm in computing data skew. Collect samples from low-frequency buckets first to ensure that low-frequency buckets are sampled. Normalize the bucket frequencies in case they aren't normalized. Set the skew multiplier to sqrt(SAMPLE_SIZE) in case of zero variance.
[CBucket] -- Remove unused GetRandomSample.
[COptTasks, CHint, COptimizerConfig, dxltokens, CParseHandlerHint, dxltokens, guc_gp, guc, unsync_guc_name] -- Add the optimizer_skew_factor GUC.
    Default case: optimizer_skew_factor = 0. Calculate skew from # of segments over # of distinct values.
    General case: optimizer_skew_factor ∈ [1, 100]. First, calculate skew from 1000 samples. Then, multiply that by the multiplier. Keep the maximum of this value and the one calculated using the default algorithm.
[regress] -- Add skew test.
[HAWQ-TPCH-Stat-Derivation.mdp] -- Manually add object info to avoid cache lookup failure.
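The deterministic sampling series from the commit message (0, 1, 0.5, 0.25, 0.75, 0.125, 0.375, ...) can be sketched as follows. This only illustrates the series; it is not the CHistogram implementation, and the function names and the way a bucket is represented (a numeric lower and upper bound) are assumptions made for the example.

```python
from itertools import islice


def sample_positions():
    """Yield normalized positions 0, 1, 1/2, 1/4, 3/4, 1/8, 3/8, ... in [0, 1]."""
    yield 0.0  # lower bound first
    yield 1.0  # then upper bound
    denom = 2
    while True:
        # Odd numerators only: even ones were already produced at a coarser level.
        for num in range(1, denom, 2):
            yield num / denom
        denom *= 2


def sample_bucket(lower, upper, count):
    """Return `count` deterministic sample values from the bucket [lower, upper]."""
    width = upper - lower
    return [lower + pos * width for pos in islice(sample_positions(), count)]


first_seven = list(islice(sample_positions(), 7))
bucket_samples = sample_bucket(10, 20, 5)
```

Because the series is fixed, repeated runs over the same histogram produce the same samples, unlike a RAND-based approach. A caller following the commit's description would request more samples from buckets with higher frequency.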

#2481 and greenplum-db/gporca#186 removed PartOid from Orca, but the OidCol still exists. This patch completely removes PartOid and related stuff from Orca.

This commit is in preparation for adding foreign scan support in Orca:
- Rename ExternalScans to ForeignScans.
- Get rid of CMDRelationExternalGPDB. This had information specific to external scans, but Orca didn't use any of this information. Instead, we populate the necessary information in the translator. It's possible we may be able to do some optimizations such as predicate pushdown, using indexes, etc. in the future; however, the interface to the FDW API is through planner structures, so they'd have to be rewritten anyway. It's more likely that we keep this in the translator.

This adds support for performing simple scans on foreign tables in Orca. This does predicate pushdown and column projection if the FDWs support these optimizations, but it doesn't do more complex optimizations such as agg pushdown or remote joins.

There are 3 FDW functions that we call: GetForeignRelSize, GetForeignPaths, and GetForeignPlan. As far as I can tell, this is where the fdw_private field can be modified/populated. Note that we don't care about the result of these function calls; they are made strictly to populate fdw_private.

The tricky part of this commit was replicating some of the structures needed for the FDW API function calls. Since we have to call these functions, we needed to create "dummy" structures that still had the information we care about.

Support for scanning foreign partitions will be added in a subsequent commit.

…nal table

Previously, we added separate logic and replicated some of what was in the external table FDW, since Orca did not support native FDWs. Since we now support foreign tables, this separate logic/code path is no longer needed. This was also a good test case for foreign scan support in Orca.

leborchuk · 2025-01-05T18:27:34Z

I see that the original author was lost while cherry-picking greenplum-db/gpdb-archive@9cbe762. Maybe that is because Jingyu Wang is not a GitHub user. Could we fix it somehow?

I also do not see greenplum-db/gpdb-archive@995653f.

The description says that it is necessary for "Assign different Mdid types to Relation, Index and Constraint to avoid Oid conflict (#14411)".

my-ship-it · 2025-01-06T05:48:51Z

src/test/regress/expected/update_gp_optimizer.out

+               ->  Seq Scan on into_table_1_prt_2 into_table_2
+               ->  Seq Scan on into_table_1_prt_3 into_table_3
+               ->  Seq Scan on into_table_1_prt_4 into_table_4
+         ->  Hash


Why does the motion change here?

gpopt and others added 9 commits January 3, 2025 09:45

Update Orca test pipeline to use rhel8 (#14567)

9f6ad66

These are similar changes to the PR pipeline. Currently this pipeline is failing when using the centos7 image.

Update Orca explain pipeline for rhel8 changes (#14585)

d8ae3dd

We also needed to move the image declaration to the pipeline file since the image now has a password.

Remove table Oid for DML on partition table (#14623)

f7214db

#2481 and greenplum-db/gporca#186 removed PartOid from Orca, but the OidCol still exists. This patch completely removes PartOid and related stuff from Orca.

Add the REPLACE keyword to let cred-alert ignore

90d3f27

my-ship-it added the cherry-pick cherry-pick upstream commts label Jan 3, 2025

chrishajas and others added 2 commits January 3, 2025 13:24

FIX icw test from Foreign Scans

4afc943

jiaqizho force-pushed the cherry-pick-orca-in-path-order-6 branch from 62d5806 to 4afc943 Compare January 3, 2025 05:26

my-ship-it requested a review from yjhjstz January 6, 2025 03:28

my-ship-it reviewed Jan 6, 2025
