Implemented simplify for the starts_with function to convert it into a LIKE expression. #14119

Merged (6 commits) on Jan 23, 2025

@jatin510 jatin510 commented Jan 14, 2025

Implemented simplify for the starts_with function to convert it into a LIKE expression

Which issue does this PR close?

Closes #14027.

Rationale for this change

Rewriting starts_with as a LIKE expression enables predicate pruning.

What changes are included in this PR?

Implemented simplify for the starts_with function to convert it into a LIKE expression, enabling predicate pruning optimization.

Are these changes tested?

Yes

Are there any user-facing changes?

No

@alamb alamb left a comment

Thank you @jatin510 -- this looks great

Can you please add some tests for this?

Specifically, I think you can use an EXPLAIN test in sqllogictests, something like the following. The EXPLAIN should show LIKE being used in the physical plan:

> create table t(x varchar) as values ('foo'), ('bar');
0 row(s) fetched.
Elapsed 0.013 seconds.

> explain select starts_with(x, 'fo') from t;
+---------------+----------------------------------------------------------------------------+
| plan_type     | plan                                                                       |
+---------------+----------------------------------------------------------------------------+
| logical_plan  | Projection: starts_with(t.x, Utf8("fo"))                                   |
|               |   TableScan: t projection=[x]                                              |
| physical_plan | ProjectionExec: expr=[starts_with(x@0, fo) as starts_with(t.x,Utf8("fo"))] |
|               |   MemoryExec: partitions=1, partition_sizes=[1]                            |
|               |                                                                            |
+---------------+----------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.003 seconds.

) -> Result<ExprSimplifyResult> {
    if let Expr::Literal(ScalarValue::Utf8(Some(pattern))) = &args[1] {
        // Convert starts_with(col, 'prefix') to col LIKE 'prefix%'
        let like_pattern = format!("{}%", pattern);

I think we probably need to escape any % that appears in the pattern (or avoid doing this replacement when such a pattern exists)
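
For illustration, the escaping suggested here could look roughly like the sketch below. It is not the code from this PR: the helper name `starts_with_to_like_pattern` is hypothetical, and it assumes backslash is the LIKE escape character. It also escapes `_`, which is a LIKE wildcard just like `%`.

```rust
/// Hypothetical helper (not from the PR): build a LIKE pattern that matches
/// exactly the strings starting with `prefix`.
/// Assumption: backslash is the escape character used by LIKE.
fn starts_with_to_like_pattern(prefix: &str) -> String {
    let mut pattern = String::with_capacity(prefix.len() + 1);
    for ch in prefix.chars() {
        // `%` and `_` are LIKE wildcards and `\` is the assumed escape
        // character, so each must be escaped to be matched literally.
        if matches!(ch, '%' | '_' | '\\') {
            pattern.push('\\');
        }
        pattern.push(ch);
    }
    // A trailing, unescaped `%` matches any suffix.
    pattern.push('%');
    pattern
}
```

Without the escaping, a call such as starts_with(x, '50%') would turn into x LIKE '50%%', which also matches values like '501' that do not start with the literal text 50%.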

@alamb alamb marked this pull request as draft January 16, 2025 20:23

alamb commented Jan 16, 2025

Marking as draft as I think this PR is no longer waiting on feedback. Please mark it as ready for review when it is ready for another look

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jan 19, 2025
@jatin510 jatin510 marked this pull request as ready for review January 19, 2025 16:09
@jatin510 jatin510 requested a review from alamb January 21, 2025 15:57

@alamb alamb left a comment

Thank you @jatin510 -- this looks really nice

cc @adriangb (this should support pruning on starts_with, though it might be good to add some tests for that explicitly)

I merged this PR up from main and added some more test coverage as I had it checked out and was messing with it

@@ -344,7 +344,7 @@ EXPLAIN SELECT
 FROM test;
 ----
 logical_plan
-01)Projection: starts_with(test.column1_utf8view, Utf8View("äöüß")) AS c1, starts_with(test.column1_utf8view, Utf8View("")) AS c2, starts_with(test.column1_utf8view, Utf8View(NULL)) AS c3, starts_with(Utf8View(NULL), test.column1_utf8view) AS c4
+01)Projection: test.column1_utf8view LIKE Utf8View("äöüß%") AS c1, CASE test.column1_utf8view IS NOT NULL WHEN Boolean(true) THEN Boolean(true) END AS c2, starts_with(test.column1_utf8view, Utf8View(NULL)) AS c3, starts_with(Utf8View(NULL), test.column1_utf8view) AS c4

this is actually pretty cool -- it figured out that STARTS_WITH(column1_utf8view, '') as c2, is true if column1_utf8view is NOT NULL
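
To spell out why that rewrite is sound, here is a small sketch that models SQL NULL with `Option`. The function names are made up for illustration and this is not DataFusion code. The first function is a three-valued starts_with, the second is the simplified CASE expression from the plan.

```rust
/// SQL-style starts_with: the result is NULL (None) if either argument is NULL.
fn starts_with_sql(value: Option<&str>, prefix: Option<&str>) -> Option<bool> {
    match (value, prefix) {
        (Some(v), Some(p)) => Some(v.starts_with(p)),
        _ => None,
    }
}

/// Models `CASE x IS NOT NULL WHEN true THEN true END`:
/// true for any non-NULL value, NULL otherwise (no ELSE branch).
fn case_is_not_null(value: Option<&str>) -> Option<bool> {
    if value.is_some() {
        Some(true)
    } else {
        None
    }
}
```

Every string starts with the empty string, so for an empty prefix the two functions agree on every input, including NULL.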

    _info: &dyn SimplifyInfo,
) -> Result<ExprSimplifyResult> {
    if let Expr::Literal(scalar_value) = &args[1] {
        // Convert starts_with(col, 'prefix') to col LIKE 'prefix%' with proper escaping

I double checked the escaping logic and I think this looks good to me.

alamb commented Jan 22, 2025

This is really sweet. You can see it working to prune parquet files here:

> copy (values ('foo'), ('bar'), ('baz')) to '/tmp/foo.parquet' STORED AS parquet;
+-------+
| count |
+-------+
| 3     |
+-------+
1 row(s) fetched.
Elapsed 0.010 seconds.

> explain select * from '/tmp/foo.parquet' where starts_with(column1, 'f');
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                  |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Filter: /tmp/foo.parquet.column1 LIKE Utf8View("f%")                                                                                                                                                                                                  |
|               |   TableScan: /tmp/foo.parquet projection=[column1], partial_filters=[/tmp/foo.parquet.column1 LIKE Utf8View("f%")]                                                                                                                                    |
| physical_plan | CoalesceBatchesExec: target_batch_size=8192                                                                                                                                                                                                           |
|               |   FilterExec: column1@0 LIKE f%                                                                                                                                                                                                                       |
|               |     RepartitionExec: partitioning=RoundRobinBatch(16), input_partitions=1                                                                                                                                                                             |
|               |       ParquetExec: file_groups={1 group: [[tmp/foo.parquet]]}, projection=[column1], predicate=column1@0 LIKE f%, pruning_predicate=column1_null_count@2 != column1_row_count@3 AND column1_min@0 <= g AND f <= column1_max@1, required_guarantees=[] |
|               |                                                                                                                                                                                                                                                       |
+---------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.019 seconds.

Specifically, the pruning_predicate clause column1_min@0 <= g AND f <= column1_max@1 shows it has translated the LIKE into a min/max range check on column1 🤯
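
As a rough sketch of the idea behind that range check (not DataFusion's actual pruning code; both function names are hypothetical), a prefix such as f can be turned into an upper bound g by incrementing its last character. A file can then be skipped whenever its min/max statistics fall entirely outside the range:

```rust
/// Hypothetical helper: smallest "next" string after all strings that start
/// with `prefix`, obtained by incrementing the last character ("f" -> "g").
/// Returns None when no such bound can be built (e.g. an empty prefix).
fn prefix_upper_bound(prefix: &str) -> Option<String> {
    let mut chars: Vec<char> = prefix.chars().collect();
    while let Some(last) = chars.pop() {
        // `char::from_u32` rejects surrogates and values past U+10FFFF.
        if let Some(next) = char::from_u32(last as u32 + 1) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // Could not increment this character: drop it and try the one
        // before it, which gives a looser but still valid bound.
    }
    None
}

/// Mirrors the shape of `column1_min <= g AND f <= column1_max`:
/// returns false only when no value in [min, max] can start with `prefix`.
fn container_may_match(min: &str, max: &str, prefix: &str) -> bool {
    let below_upper = match prefix_upper_bound(prefix) {
        Some(upper) => min <= upper.as_str(),
        None => true, // no upper bound available, so cannot prune on it
    };
    below_upper && prefix <= max
}
```

For the file written above (values foo, bar, baz, so min is bar and max is foo), the check passes for the prefix f and the file is read. A file whose values all sort before f, or all sort after g, would be pruned. Rust compares `str` values byte by byte, which for UTF-8 agrees with code point order.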

I will also add a test to this PR demonstrating this too

01)CoalesceBatchesExec: target_batch_size=8192
02)--FilterExec: column1@0 LIKE f%
03)----RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
04)------ParquetExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/sqllogictest/test_files/scratch/parquet/foo.parquet]]}, projection=[column1], predicate=column1@0 LIKE f%, pruning_predicate=column1_null_count@2 != column1_row_count@3 AND column1_min@0 <= g AND f <= column1_max@1, required_guarantees=[]

this is so cool!

@adriangb

This looks great!

@alamb alamb merged commit 49f95af into apache:main Jan 23, 2025
25 checks passed
Labels
functions sqllogictest SQL Logic Tests (.slt)
Development

Successfully merging this pull request may close these issues.

Support pruning on starts_with
3 participants