Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add external node table #4105

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Add external node table #4105

wants to merge 3 commits into from

Conversation

andyfengHKU
Copy link
Contributor

@andyfengHKU andyfengHKU commented Aug 19, 2024

Description

This is our initial PR to support directly execute cypher on relational database. It contains logic for node table only. Major changes including

Nested catalog entry.

We add a new catalog entry type ExternalNodeTable which has a nested entry structure. At parent level, it maintains the logic view of properties which aligns with the columns in relational tables. At child level, it maintains another catalog entry which contains the primary key property only. This child entry aligns with our physical storage.

In the current design, we still need to materialize primary key and use it as join condition when we try to read a property that does not exist in storage.

Scan external table

When we run MATCH (a:label) where label points to an external relational table, we need to scan external relational table and the primary key column materialized in our storage and then perform a join on primary key.

Some sanity benchmark numbers are

Setup
LDBC10 Comment table storing in DuckDB database. 8 Threads.

DuckDB native scanning: 0.3s.
Kuzu scanning DuckDB: 2s.
Scanning external database: copy primary key (6s) + join (3s) = 9s.

Slower than DuckDB is expected as we need to first materialize DuckDB's result and then re-scan it. The major overhead is in us scanning DuckDB result which @acquamarin should see if we can further optimize this.

Another bottleneck is the copy of primary key. I'm fairly confident I can bring the time to ~2s with some optimization.

Fixes # (issue)

Contributor agreement

Copy link

Benchmark Result

Master commit hash: 87de7f9128ff9946b95e62bad69bd8d8053a176c
Branch commit hash: fa217e70eddcc0e885602cc88af766a210af9ff6

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 674.13 664.18 9.95 (1.50%)
aggregation q28 11628.45 11765.74 -137.29 (-1.17%)
filter q14 153.00 143.50 9.50 (6.62%)
filter q15 153.21 141.37 11.84 (8.37%)
filter q16 329.48 320.61 8.87 (2.77%)
filter q17 473.39 466.17 7.21 (1.55%)
filter q18 1953.89 1970.60 -16.70 (-0.85%)
fixed_size_expr_evaluator q07 564.07 556.76 7.31 (1.31%)
fixed_size_expr_evaluator q08 772.76 765.63 7.13 (0.93%)
fixed_size_expr_evaluator q09 773.93 764.40 9.53 (1.25%)
fixed_size_expr_evaluator q10 266.35 255.68 10.67 (4.17%)
fixed_size_expr_evaluator q11 260.79 250.99 9.81 (3.91%)
fixed_size_expr_evaluator q12 259.71 250.45 9.27 (3.70%)
fixed_size_expr_evaluator q13 1502.11 1478.49 23.62 (1.60%)
fixed_size_seq_scan q23 145.55 135.71 9.84 (7.25%)
join q31 11.61 12.63 -1.01 (-8.03%)
ldbc_snb_ic q35 775.55 772.53 3.02 (0.39%)
ldbc_snb_ic q36 51.92 48.53 3.39 (6.99%)
ldbc_snb_is q32 8.66 10.00 -1.34 (-13.42%)
ldbc_snb_is q33 16.73 19.10 -2.37 (-12.41%)
ldbc_snb_is q34 8.12 8.78 -0.66 (-7.47%)
multi-rel multi-rel-large-scan 2795.10 3118.05 -322.95 (-10.36%)
multi-rel multi-rel-lookup 74.58 45.70 28.88 (63.19%)
multi-rel multi-rel-small-scan 53.76 56.26 -2.50 (-4.44%)
order_by q25 158.70 149.55 9.15 (6.12%)
order_by q26 478.05 464.24 13.81 (2.98%)
order_by q27 1430.05 1408.37 21.68 (1.54%)
scan_after_filter q01 197.39 191.00 6.39 (3.35%)
scan_after_filter q02 189.61 178.79 10.82 (6.05%)
shortest_path_ldbc100 q39 112.64 114.59 -1.95 (-1.70%)
var_size_expr_evaluator q03 2098.64 2059.91 38.73 (1.88%)
var_size_expr_evaluator q04 2246.39 2271.61 -25.22 (-1.11%)
var_size_expr_evaluator q05 2646.17 2709.45 -63.28 (-2.34%)
var_size_expr_evaluator q06 1395.17 1398.01 -2.84 (-0.20%)
var_size_seq_scan q19 1492.84 1463.68 29.16 (1.99%)
var_size_seq_scan q20 3149.41 3098.81 50.60 (1.63%)
var_size_seq_scan q21 2459.69 2384.55 75.15 (3.15%)
var_size_seq_scan q22 135.84 129.81 6.03 (4.65%)

// Bind physical create node table info
auto pkDefinition = getDefinition(propertyDefinitions, extraInfo.pkName);
std::vector<PropertyDefinition> physicalPropertyDefinitions;
physicalPropertyDefinitions.push_back(pkDefinition.copy());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you need to copy here? getDefinition returns a copied one already.

@@ -124,18 +125,13 @@ std::shared_ptr<Expression> Binder::createPath(const std::string& pathName,
}

static std::vector<std::string> getPropertyNames(const std::vector<TableCatalogEntry*>& entries) {
std::vector<std::string> result;
std::unordered_set<std::string> propertyNamesSet;
auto distinctVector = DistinctVector<std::string>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The DistinctVector usage seems unnecessarily complicated here, why not just use a set and copy to a vector at the end?

@@ -260,7 +260,7 @@ void Planner::planNodeIDScan(uint32_t nodePos) {
auto newSubgraph = context.getEmptySubqueryGraph();
newSubgraph.addQueryNode(nodePos);
auto plan = std::make_unique<LogicalPlan>();
appendScanNodeTable(node->getInternalID(), node->getTableIDs(), {}, *plan);
appendScanNodeTable(node->getInternalID(), {}, node->getEntries(), *plan);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better add an inline comment to {}.

@@ -296,7 +296,7 @@ void Planner::planRelScan(uint32_t relPos) {
auto plan = std::make_unique<LogicalPlan>();
auto [boundNode, nbrNode] = getBoundAndNbrNodes(*rel, direction);
const auto extendDirection = getExtendDirection(*rel, *boundNode);
appendScanNodeTable(boundNode->getInternalID(), boundNode->getTableIDs(), {}, *plan);
appendScanNodeTable(boundNode->getInternalID(), {}, boundNode->getEntries(), *plan);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on adding an inline comment to {}.

Copy link

Benchmark Result

Master commit hash: cf1330d6d9eb64cfca1e2e1c3f38ec21b1e2113a
Branch commit hash: 7317ae29bd073138c87524aa121ea447528ad94d

Query Group Query Name Mean Time - Commit (ms) Mean Time - Master (ms) Diff
aggregation q24 671.30 664.38 6.93 (1.04%)
aggregation q28 11091.24 11131.84 -40.61 (-0.36%)
filter q14 152.07 141.89 10.18 (7.18%)
filter q15 152.12 143.51 8.61 (6.00%)
filter q16 339.24 329.23 10.01 (3.04%)
filter q17 475.77 464.14 11.63 (2.51%)
filter q18 1978.25 1989.96 -11.71 (-0.59%)
fixed_size_expr_evaluator q07 569.22 560.67 8.56 (1.53%)
fixed_size_expr_evaluator q08 777.93 779.56 -1.63 (-0.21%)
fixed_size_expr_evaluator q09 782.76 774.03 8.73 (1.13%)
fixed_size_expr_evaluator q10 264.53 259.92 4.60 (1.77%)
fixed_size_expr_evaluator q11 259.49 254.62 4.87 (1.91%)
fixed_size_expr_evaluator q12 257.90 253.28 4.62 (1.82%)
fixed_size_expr_evaluator q13 1493.24 1495.33 -2.09 (-0.14%)
fixed_size_seq_scan q23 147.46 136.59 10.87 (7.96%)
join q31 13.25 13.17 0.08 (0.61%)
ldbc_snb_ic q35 743.04 755.55 -12.52 (-1.66%)
ldbc_snb_ic q36 47.80 40.92 6.89 (16.83%)
ldbc_snb_is q32 10.46 9.89 0.57 (5.80%)
ldbc_snb_is q33 18.10 18.51 -0.41 (-2.24%)
ldbc_snb_is q34 8.82 9.05 -0.23 (-2.54%)
multi-rel multi-rel-large-scan 2732.98 2920.13 -187.16 (-6.41%)
multi-rel multi-rel-lookup 38.29 66.73 -28.44 (-42.62%)
multi-rel multi-rel-small-scan 50.63 48.80 1.83 (3.74%)
order_by q25 157.29 147.10 10.19 (6.93%)
order_by q26 476.13 466.22 9.91 (2.12%)
order_by q27 1438.58 1431.95 6.63 (0.46%)
scan_after_filter q01 197.72 191.78 5.94 (3.10%)
scan_after_filter q02 186.20 181.47 4.74 (2.61%)
shortest_path_ldbc100 q39 74.16 75.70 -1.54 (-2.04%)
var_size_expr_evaluator q03 2103.01 2079.95 23.06 (1.11%)
var_size_expr_evaluator q04 2278.73 2295.57 -16.84 (-0.73%)
var_size_expr_evaluator q05 2581.27 2615.85 -34.58 (-1.32%)
var_size_expr_evaluator q06 1418.66 1385.06 33.59 (2.43%)
var_size_seq_scan q19 1508.82 1480.47 28.34 (1.91%)
var_size_seq_scan q20 3253.31 3216.86 36.45 (1.13%)
var_size_seq_scan q21 2468.25 2449.41 18.84 (0.77%)
var_size_seq_scan q22 135.77 132.76 3.01 (2.27%)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants