Cherry pick "orca support Foreign Scans" #839

Open
wants to merge 11 commits into
base: main

@jiaqizho (Contributor) commented Jan 3, 2025

Fixes #ISSUE_Number

What does this PR do?

Type of Change

  • Bug fix (non-breaking change)
  • New feature (non-breaking change)
  • Breaking change (fix or feature with breaking changes)
  • Documentation update

Breaking Changes

Test Plan

  • Unit tests added/updated
  • Integration tests added/updated
  • Passed make installcheck
  • Passed make -C src/test installcheck-cbdb-parallel

Impact

Performance:

User-facing changes:

Dependencies:

Checklist

Additional Context

CI Skip Instructions


gpopt and others added 9 commits January 3, 2025 09:45
…d Oid conflict (#14411)

Oid is not guaranteed to be unique across the GPDB cluster. In some rare
cases, indexes and constraints may happen to have the same Oid number. But Orca
always assumes that the Oid is unique even across different types of objects. We
need to get rid of this risky assumption in Orca; otherwise Orca may hit errors
(and hence fall back to the planner) when loading Index/Check Constraints from the MD
cache in case of Oid duplication.

Before this patch, Index/Type/Relation/Operator/Func/Agg/Trigger/Constraint all
shared the same MdId type, which is CMDIdGPDB::EmdidGPDB. Ideally we should use
a separate MdId type for each type of object, but it is difficult to do so
without changing a massive number of mdp files. Given that most user-created
objects are Relation/Index/Constraint, we decided to assign separate MdId types
only for Relation, Index and Constraint.

Also renamed EmdidGPDB to EmdidGeneral.
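
The collision this commit guards against can be illustrated with a small sketch. It is not Orca code: `MdidKind`, `Mdid`, and `build_md_cache` are hypothetical names, and the only assumption taken from the commit message is that a metadata id combines an object kind with the Oid.

```python
from enum import Enum
from typing import NamedTuple


class MdidKind(Enum):
    """Hypothetical stand-ins for the metadata id types named in the commit."""
    GENERAL = "general"        # types, operators, functions, aggregates, triggers
    RELATION = "relation"
    INDEX = "index"
    CONSTRAINT = "constraint"


class Mdid(NamedTuple):
    kind: MdidKind
    oid: int


def build_md_cache(objects):
    """Key each object by (kind, oid) so equal Oids of different kinds coexist."""
    cache = {}
    for kind, oid, payload in objects:
        key = Mdid(kind, oid)
        if key in cache:
            raise KeyError(f"duplicate metadata id: {key}")
        cache[key] = payload
    return cache


# An index and a check constraint that happen to share Oid 16384.
objects = [
    (MdidKind.INDEX, 16384, "index definition"),
    (MdidKind.CONSTRAINT, 16384, "check constraint definition"),
]
cache = build_md_cache(objects)
```

With a key made of the Oid alone, the second entry would overwrite or conflict with the first; with the kind included, both lookups succeed.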

Ref: GPQP-93

These are similar changes to the PR pipeline. Currently this pipeline is
failing when using the centos7 image.

Prior to this commit, while translating constant values for text-related
domains like char, bpchar and name, ORCA was calling the incorrect
hashing function. This led to a `data corrupt` error during Query to DXL
translation in ORCA. This commit fixes that issue by checking for the
base type of such domain types and calling the corresponding hashing
function.
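
The idea of resolving a domain to its base type before choosing a hash function can be sketched as follows. Everything here is illustrative: the `DOMAINS` catalog, the function names, and the CRC-based hashers are assumptions and not the functions ORCA or GPDB call.

```python
import zlib

# Hypothetical domain catalog: domain name -> type it is defined over.
DOMAINS = {"zip_code": "bpchar", "short_zip": "zip_code"}


def resolve_base_type(type_name, domains=DOMAINS):
    """Follow domain definitions until a non-domain type is reached."""
    seen = set()
    while type_name in domains:
        if type_name in seen:
            raise ValueError(f"cyclic domain definition at {type_name}")
        seen.add(type_name)
        type_name = domains[type_name]
    return type_name


def hash_text(value):
    # Stand-in hash for plain text values.
    return zlib.crc32(value.encode("utf-8"))


def hash_bpchar(value):
    # Stand-in rule: blank-padded values ignore trailing spaces.
    return zlib.crc32(value.rstrip(" ").encode("utf-8"))


HASHERS = {"text": hash_text, "bpchar": hash_bpchar}


def hash_constant(type_name, value):
    """Hash a constant with the function that belongs to its base type."""
    return HASHERS[resolve_base_type(type_name)](value)
```

In this sketch, a constant of the nested domain `short_zip` is hashed with the `bpchar` rule, so `hash_constant("short_zip", "ab  ")` agrees with `hash_constant("bpchar", "ab")`.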

Fixes issue: #14155

We also needed to move the image declaration to the pipeline file since the
image now has a password.

Issue: ORCA generates a plan with hash redistribution when data is highly skewed,
making one segment the bottleneck in execution.

Root cause: Currently skew is only taken into account when the number of
distinct values is less than the segment count. This oversimplifies the scenarios
where skew may arise.

Solution:
We compute the skew ratio using sampled statistics. The sampling rate of each bucket
is proportional to the bucket frequency, i.e., the higher the bucket frequency,
the more datums we sample from that bucket.
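
A minimal sketch of frequency-proportional allocation, assuming a fixed sample budget (the 1000 default mirrors the sample count mentioned later in this commit message). The function name `allocate_samples` is hypothetical, and the rounding rule is an assumption.

```python
def allocate_samples(bucket_frequencies, sample_size=1000):
    """Split a sample budget across buckets in proportion to their frequency."""
    total = sum(bucket_frequencies)
    if total <= 0:
        raise ValueError("bucket frequencies must sum to a positive value")
    # Dividing by the total normalizes frequencies that do not sum to 1.
    return [round(sample_size * freq / total) for freq in bucket_frequencies]


counts = allocate_samples([0.5, 0.25, 0.25])
```

Because each count is rounded independently, the counts may not add up to exactly `sample_size` for arbitrary inputs; a real implementation would need to decide how to handle the remainder.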

We employ a deterministic sampling algorithm by always starting with the lower
bound, then the upper bound, midpoint, one quarter, three quarters, one eighth, three
eighths, etc. In normalized terms, 0, 1, 0.5, 0.25, 0.75, 0.125, 0.375, etc. The
sampling series ensures discrete sampling.
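
The series above can be generated as sketched below. The continuation past 0.375 (0.625, 0.875, then sixteenths) is an assumption based on the "etc." in the text, and `sample_positions` and `sample_bucket` are hypothetical names rather than CHistogram functions.

```python
from fractions import Fraction
from itertools import islice


def sample_positions():
    """Yield 0, 1, then odd multiples of 1/2, 1/4, 1/8, ... in increasing order."""
    yield Fraction(0)
    yield Fraction(1)
    denominator = 2
    while True:
        for numerator in range(1, denominator, 2):
            yield Fraction(numerator, denominator)
        denominator *= 2


def sample_bucket(lower, upper, count):
    """Map the first `count` normalized positions onto the range [lower, upper]."""
    return [lower + (upper - lower) * p for p in islice(sample_positions(), count)]


first_seven = [float(p) for p in islice(sample_positions(), 7)]
```

Since the sequence has no random component, repeated runs over the same histogram pick the same sample points.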

The skew computation is controlled by the GUC optimizer_skew_factor. Its default
value is 0, i.e., skew computation is turned off. Users can turn it on by setting
the GUC to a value greater than 0. The GUC is called skew factor because the
final skewness used for costing is the product of the skew factor (coefficient)
and the skew ratio (variable). This GUC gives users additional control over plan
motions (broadcast/gather vs. redistribution) if they choose to emphasize the
data skew.

Implementation:
[CCostModelGPDB] -- Enable Histogram ComputeSkew to calculate the skew factor of
relations to be joined. Implement superlinear skew multiplier, to allow fine
tuning when the skew factor is small, and coarse tuning when the skew factor is
big.
[ICostModel] -- Add root stats getter
[CPhysical] -- Make helper function GetSkew public
[CHistogram] -- Replace RAND function with deterministic algorithm in computing
data skew. Collect sample from low frequency buckets first to ensure sampling of
low frequency buckets. Normalize the bucket frequency in case they aren't
normalized. Set skew multiplier to sqrt(SAMPLE_SIZE) in case of zero variance.
[CBucket] -- Remove unused GetRandomSample
[COptTasks, CHint, COptimizerConfig, dxltokens, CParseHandlerHint, dxltokens,
guc_gp, guc, unsync_guc_name] -- Add optimizer_skew_factor GUC. Default case:
optimizer_skew_factor = 0. Calculate skew from # of segments over # distinct
values. General case: optimizer_skew_factor ∈ [1, 100]. First, calculate skew
from 1000 samples. Then, multiply that by the multiplier. Keep the maximum
between this value and the one calculated using the default algorithm.
[regress] -- Add skew test
[HAWQ-TPCH-Stat-Derivation.mdp] -- Manually add object info to avoid cache
lookup failure
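
How the default and the general case combine can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the floor of 1.0 in the default case is an assumption, and the multiplier defaults to a plain linear placeholder because the commit message does not give the form of the superlinear multiplier.

```python
def default_skew(num_segments, num_distinct):
    """Default case: number of segments over number of distinct values."""
    if num_distinct <= 0:
        raise ValueError("number of distinct values must be positive")
    # Assumption: skew never drops below 1.0 (no skew).
    return max(1.0, num_segments / num_distinct)


def effective_skew(sampled_skew_ratio, skew_factor, num_segments, num_distinct,
                   multiplier=lambda factor: float(factor)):
    """Combine the sampled skew ratio with the default estimate."""
    baseline = default_skew(num_segments, num_distinct)
    if skew_factor == 0:
        # optimizer_skew_factor = 0: sampled skew is not used.
        return baseline
    # General case: scale the sampled ratio, then keep the larger estimate.
    return max(baseline, multiplier(skew_factor) * sampled_skew_ratio)
```

For example, with 4 segments and 2 distinct values the baseline is 2.0; a sampled ratio of 3.0 with a factor of 2 yields 6.0, while a sampled ratio of 1.5 with a factor of 1 leaves the baseline in place.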

#2481 and greenplum-db/gporca#186 removed PartOid from Orca, but the OidCol
still exists. This patch completely removes PartOid and related code from Orca.

This commit is in preparation for adding foreign scan support in Orca.

- Rename ExternalScans to ForeignScans
- Get rid of CMDRelationExternalGPDB. This had information specific to
external scans, but Orca didn't use any of this information. Instead,
we populate the necessary information in the translator. It's possible
we may be able to do some optimizations such as predicate pushdown,
using indexes, etc. in the future; however, the interface to the FDW API
is through planner structures so they'd have to be rewritten anyway.
It's more likely that we keep this in the translator.

This adds support for performing simple scans on foreign tables in Orca.
This does predicate pushdown and column projection if the FDWs support
these optimizations, but it doesn't do more complex optimizations such
as agg pushdown/remote joins.

There are 3 FDW functions that we call: GetForeignRelSize,
GetForeignPaths, and GetForeignPlan. As far as I can tell, this is where
the fdw_private field can be modified/populated. Note that we don't care
about the result of these function calls; we make them strictly to populate
fdw_private.

The tricky part of this commit was replicating some of the structures
needed for the FDW api function calls. Since we have to call these
functions, we needed to create "dummy" structures that still had the
information we care about.
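
The call sequence described above can be mimicked with a toy model. This is a Python stand-in, not the PostgreSQL C API: `ToyFdw`, its method names, the dictionary layout of `fdw_private`, and the dummy `root`/`baserel` objects are all hypothetical and only illustrate the order of the three calls and the use of minimal placeholder structures.

```python
from types import SimpleNamespace


class ToyFdw:
    """Toy wrapper whose callbacks fill in fdw_private step by step."""

    def get_foreign_rel_size(self, root, baserel, table_oid):
        baserel.rows = 1000  # size estimate, ignored by the caller below
        baserel.fdw_private = {
            "table_oid": table_oid,
            "remote_conds": list(baserel.restrictions),
        }

    def get_foreign_paths(self, root, baserel, table_oid):
        baserel.pathlist.append(SimpleNamespace(fdw_private=baserel.fdw_private))

    def get_foreign_plan(self, root, baserel, table_oid, best_path, tlist, quals):
        private = dict(best_path.fdw_private, columns=list(tlist))
        return SimpleNamespace(fdw_private=private)


def populate_fdw_private(fdw, table_oid, tlist, quals):
    """Call the three callbacks in order using dummy planner structures."""
    root = SimpleNamespace()  # dummy stand-in for planner state
    baserel = SimpleNamespace(restrictions=quals, pathlist=[],
                              fdw_private=None, rows=0)
    fdw.get_foreign_rel_size(root, baserel, table_oid)
    fdw.get_foreign_paths(root, baserel, table_oid)
    best_path = baserel.pathlist[0]
    scan = fdw.get_foreign_plan(root, baserel, table_oid, best_path, tlist, quals)
    # Only fdw_private is kept; the rest of the returned plan is discarded.
    return scan.fdw_private


private = populate_fdw_private(ToyFdw(), 16400, ["a", "b"], ["a > 1"])
```

The point of the sketch is that the caller discards the size estimate, the paths, and the returned plan node, and keeps only the private data the wrapper attached along the way.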

Support for scanning foreign partitions will be added in a subsequent
commit.
@my-ship-it my-ship-it added the cherry-pick cherry-pick upstream commts label Jan 3, 2025
chrishajas and others added 2 commits January 3, 2025 13:24
…nal table

Previously, we added separate logic and replicated some of what was in
the external table FDW since Orca did not support native FDWs. Since we
now support foreign tables, this separate logic/code path is no longer
needed.

This was also a good test case for foreign scan support in Orca.
@jiaqizho jiaqizho force-pushed the cherry-pick-orca-in-path-order-6 branch from 62d5806 to 4afc943 Compare January 3, 2025 05:26
@leborchuk (Contributor) commented

I see that the original author was lost while cherry-picking greenplum-db/gpdb-archive@9cbe762. Maybe that is because Jingyu Wang is not a GitHub user. Could we fix it somehow?

I also do not see greenplum-db/gpdb-archive@995653f.

Its description says that it is necessary for "Assign different Mdid types to Relation, Index and Constraint to avoid Oid conflict (#14411)".

@my-ship-it my-ship-it requested a review from yjhjstz January 6, 2025 03:28
-> Seq Scan on into_table_1_prt_2 into_table_2
-> Seq Scan on into_table_1_prt_3 into_table_3
-> Seq Scan on into_table_1_prt_4 into_table_4
-> Hash
Why does the motion change here?

7 participants