Cherry pick "orca support Foreign Scans" #839
base: main
Conversation
Assign different Mdid types to Relation, Index and Constraint to avoid Oid conflict (#14411)

Oid is not guaranteed to be unique across the GPDB cluster. In some rare cases, indexes and constraints may happen to have the same Oid number. But Orca has always assumed that an Oid is unique even across different types of objects. We need to get rid of this risky assumption, otherwise Orca may hit errors (and hence fall back to the planner) when loading Index/Check Constraints from the MD cache in case of Oid duplication.

Before this patch, Index/Type/Relation/Operator/Func/Agg/Trigger/Constraint all shared the same MdId type, CMDIdGPDB::EmdidGPDB. Ideally we would use a separate MdId type for each kind of object, but that is difficult to do without changing a massive number of mdp files. Given that most user-created objects are Relations, Indexes and Constraints, we decided to assign separate MdId types only for Relation, Index and Constraint. Also renamed EmdidGPDB to EmdidGeneral. Ref: GPQP-93
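A minimal sketch of the idea behind the fix. Only EmdidGeneral (the rename of EmdidGPDB) is confirmed by the commit message; the Relation/Index/Constraint enumerator names below are illustrative assumptions, not the actual Orca code. The point is that the MdId type becomes part of the cache key, so two objects that happen to share an Oid no longer collide:

```cpp
// Hypothetical sketch of the MdId type split described above.
// EmdidGeneral is the rename of EmdidGPDB; the other enumerator
// names are assumptions for illustration.
enum EMDIdType
{
	EmdidGeneral,          // still shared by Type/Operator/Func/Agg/Trigger
	EmdidRel,              // relations get their own MdId space
	EmdidInd,              // indexes get their own MdId space
	EmdidCheckConstraint   // constraints get their own MdId space
};

// Cache keys now carry the MdId type, so an index and a check
// constraint with the same Oid map to different MD cache entries.
struct MdCacheKey
{
	EMDIdType type;
	unsigned int oid;

	bool operator==(const MdCacheKey &other) const
	{
		return type == other.type && oid == other.oid;
	}
};
```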
These changes are similar to those made to the PR pipeline. Currently this pipeline fails when using the centos7 image. We also needed to move the image declaration into the pipeline file, since the image now requires a password.

Prior to this commit, while translating constant values for text-related domains such as char, bpchar and name, ORCA called the incorrect hashing function. This led to a `data corrupt` error during Query-to-DXL translation in ORCA. This commit fixes that issue by checking for the base type of such domain types and calling the corresponding hashing function. Fixes issue: #14155
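A minimal sketch of the fix's shape, assuming the standard PostgreSQL catalog helpers (`get_typtype` from lsyscache.h and `getBaseType` from parse_coerce.h); the wrapper function is hypothetical, not ORCA's actual translator code:

```cpp
// Illustrative sketch (not ORCA's actual code): when hashing a Const
// of a domain type, resolve the domain to its base type first, so
// that e.g. a domain over bpchar is hashed with bpchar's rules
// rather than those of an unrelated type.
extern "C" {
#include "postgres.h"
#include "parser/parse_coerce.h"   /* getBaseType() */
#include "utils/lsyscache.h"       /* get_typtype() */
#include "catalog/pg_type.h"       /* TYPTYPE_DOMAIN */
}

static Oid
ResolveHashType(Oid typid)
{
	/* Domains inherit hashing behavior from their base type. */
	if (get_typtype(typid) == TYPTYPE_DOMAIN)
		return getBaseType(typid);
	return typid;
}
```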
Issue: ORCA generates a plan with hash redistribution when data is highly skewed, making one segment the bottleneck in execution.

Root cause: Currently skew is only taken into account when the number of distinct values is less than the segment count. This oversimplifies the scenarios in which skew may arise.

Solution: We compute the skew ratio using sampled statistics. The sampling rate of each bucket is proportional to the bucket frequency, i.e., the higher the bucket frequency, the more datums we sample from that bucket. We employ a deterministic sampling algorithm by always starting with the lower bound, then the upper bound, the middle, the quarter, three quarters, one eighth, three eighths, etc. In normalized terms: 0, 1, 0.5, 0.25, 0.75, 0.125, 0.375, etc. This sampling series ensures discrete sampling (see the sketch after this message).

The skew computation is controlled by the GUC optimizer_skew_factor. Its default value is 0, i.e., skew computation is turned off. Users can turn it on by setting the GUC to a value greater than 0. The GUC is called a skew factor because the final skewness used for costing is the product of the skew factor (coefficient) and the skew ratio (variable). This GUC gives users additional control over plan motions (broadcast/gather vs. redistribution) if they choose to emphasize the data skew.

Implementation:
- [CCostModelGPDB] Enable histogram ComputeSkew to calculate the skew factor of relations to be joined. Implement a superlinear skew multiplier, to allow fine tuning when the skew factor is small and coarse tuning when the skew factor is big.
- [ICostModel] Add a root stats getter.
- [CPhysical] Make the helper function GetSkew public.
- [CHistogram] Replace the RAND function with a deterministic algorithm when computing data skew. Collect samples from low-frequency buckets first to ensure they are represented. Normalize the bucket frequencies in case they aren't normalized. Set the skew multiplier to sqrt(SAMPLE_SIZE) in case of zero variance.
- [CBucket] Remove unused GetRandomSample.
- [COptTasks, CHint, COptimizerConfig, dxltokens, CParseHandlerHint, guc_gp, guc, unsync_guc_name] Add the optimizer_skew_factor GUC. Default case: optimizer_skew_factor = 0; calculate skew from # of segments over # of distinct values. General case: optimizer_skew_factor ∈ [1, 100]; first calculate skew from 1000 samples, then multiply that by the multiplier, and keep the maximum of this value and the one calculated using the default algorithm.
- [regress] Add a skew test.
- [HAWQ-TPCH-Stat-Derivation.mdp] Manually add object info to avoid a cache lookup failure.
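A minimal sketch of the deterministic sampling series described above, reconstructed from the commit message (this is illustrative, not the actual CHistogram code): after the endpoints 0 and 1, it emits the odd binary fractions level by level, so repeated sampling never revisits a point.

```cpp
#include <iostream>
#include <vector>

// Generate the first n points of the deterministic sampling series:
// 0, 1, 1/2, 1/4, 3/4, 1/8, 3/8, 5/8, 7/8, ...
// Each normalized point would then be mapped into the [lower, upper]
// bounds of a histogram bucket.
std::vector<double> SamplingSeries(int n)
{
	std::vector<double> points;
	if (n > 0) points.push_back(0.0);
	if (n > 1) points.push_back(1.0);

	// Odd numerators over successive powers of two never repeat,
	// which keeps the sampling both discrete and deterministic.
	for (long denom = 2; (int) points.size() < n; denom *= 2)
		for (long num = 1; num < denom && (int) points.size() < n; num += 2)
			points.push_back((double) num / (double) denom);

	return points;
}

int main()
{
	for (double p : SamplingSeries(9))
		std::cout << p << ' ';  // 0 1 0.5 0.25 0.75 0.125 0.375 0.625 0.875
	std::cout << '\n';
}
```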
#2481 and greenplum-db/gporca#186 removed PartOid from Orca, but the OidCol still exists. This patch completely removes PartOid and related code from Orca.
This commit is in preparation for adding foreign scan support in Orca:
- Rename ExternalScans to ForeignScans.
- Get rid of CMDRelationExternalGPDB. This held information specific to external scans, but Orca didn't use any of it. Instead, we populate the necessary information in the translator. We may be able to do some optimizations such as predicate pushdown, using indexes, etc. in the future; however, the interface to the FDW API is through planner structures, so they would have to be rewritten anyway. It's more likely that we keep this in the translator.
This adds support for performing simple scans on foreign tables in Orca. It does predicate pushdown and column projection if the FDW supports these optimizations, but it doesn't do more complex optimizations such as aggregate pushdown or remote joins. There are three FDW functions that we call: GetForeignRelSize, GetForeignPaths, and GetForeignPlan. As far as I can tell, this is where the fdw_private field can be modified/populated. Note that we don't care about the result of these function calls; they are made strictly to populate fdw_private. The tricky part of this commit was replicating some of the structures needed for the FDW API function calls: since we have to call these functions, we needed to create "dummy" planner structures that still carry the information we care about (a sketch of the call sequence follows). Support for scanning foreign partitions will be added in a subsequent commit.
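A minimal sketch of that callback sequence, assuming the standard PostgreSQL FDW API (FdwRoutine and GetFdwRoutineByRelId from foreign/fdwapi.h); the wrapper function and the way the dummy PlannerInfo/RelOptInfo arrive here are illustrative, not Orca's actual translator code:

```cpp
extern "C" {
#include "postgres.h"
#include "foreign/fdwapi.h"    /* FdwRoutine, GetFdwRoutineByRelId() */
#include "nodes/pathnodes.h"   /* PlannerInfo, RelOptInfo, ForeignPath */
#include "nodes/plannodes.h"   /* ForeignScan */
}

// Illustrative sketch: drive the three FDW entry points purely so the
// FDW can populate fdw_private; the costs and paths themselves are
// discarded, since Orca has already chosen the plan shape.
static ForeignScan *
BuildForeignScanSketch(PlannerInfo *root, RelOptInfo *baserel,
                       Oid foreigntableid, List *tlist, List *scan_clauses)
{
	FdwRoutine *fdw = GetFdwRoutineByRelId(foreigntableid);

	/* Let the FDW size the relation and attach its private state. */
	fdw->GetForeignRelSize(root, baserel, foreigntableid);

	/* Generate access paths; only the side effects matter here. */
	fdw->GetForeignPaths(root, baserel, foreigntableid);

	/* Take the first generated path (illustrative; real code would
	 * pick an appropriate one) and let the FDW build the plan node,
	 * pushing down scan_clauses and projecting only tlist columns
	 * where the FDW supports it. */
	ForeignPath *path = (ForeignPath *) linitial(baserel->pathlist);
	return fdw->GetForeignPlan(root, baserel, foreigntableid,
	                           path, tlist, scan_clauses,
	                           NULL /* outer_plan */);
}
```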
…nal table

Previously, we added separate logic that replicated some of what was in the external table FDW, since Orca did not support native FDWs. Now that we support foreign tables, this separate logic/code path is no longer needed. This was also a good test case for foreign scan support in Orca.
62d5806 to 4afc943
I see that the original author was lost while cherry-picking greenplum-db/gpdb-archive@9cbe762 - maybe because Jingyu Wang is not a GitHub user - could we fix that somehow? I also don't see greenplum-db/gpdb-archive@995653f. Its description says it is necessary for "Assign different Mdid types to Relation, Index and Constraint to avoid Oid conflict (#14411)".
-> Seq Scan on into_table_1_prt_2 into_table_2
-> Seq Scan on into_table_1_prt_3 into_table_3
-> Seq Scan on into_table_1_prt_4 into_table_4
-> Hash
Why does the motion change here?
Fixes #ISSUE_Number
What does this PR do?
Type of Change
Breaking Changes
Test Plan
make installcheck
make -C src/test installcheck-cbdb-parallel
Impact
Performance:
User-facing changes:
Dependencies:
Checklist
Additional Context
CI Skip Instructions