Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(starrocks): exp.Unnest transpilation #3966

Closed

Conversation

hellozepp
Copy link
Contributor

@hellozepp hellozepp commented Aug 24, 2024

Fixes #3962

This PR adds support for the following:

  • Default initialize exp.UNNEST to include "unnest" as the default table alias if not specified
  • Added expression type of exp.Inline
  • use arrays_zip to merge multiple Lateral views
  • Added some ut to validate unnest transpile

@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from 0267692 to 640411b Compare August 24, 2024 20:12
@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 24, 2024

I tried to add the logic of setting default value when there is no table alias, PTAL @georgesittas @VaggelisD

I need to transpile the starrocks sql:

SELECT student, score, unnest FROM tests CROSS JOIN LATERAL UNNEST(scores)

Expected spark sql:

SELECT student, score, unnest FROM tests LATERAL VIEW EXPLODE(scores) unnest AS unnest

@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from be0a7af to d2119e2 Compare August 24, 2024 20:32
@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from d2119e2 to 77b87d5 Compare August 24, 2024 20:37
@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 25, 2024

I found that when unnest has multiple outputs, it will be transpiled into multiple Lateral view statements (actually incorrect), as shown in the following SQL of starrocks:

SELECT id, t.type, t.scores FROM example_table CROSS JOIN LATERAL unnest(split(type, ";"), scores) AS t(type,scores)

Transpiled into spark, the output is:

SELECT id, t.type, t.scores FROM example_table LATERAL VIEW EXPLODE(SPLIT(type, CONCAT('\\Q', ';'))) t AS type LATERAL VIEW EXPLODE(scores) t AS scores

Incorrect reason:
The output of unnest is the combined value of multiple columns, and LATERAL VIEW will be expanded multiple times and cross joined.
We can use arrays_zip to merge multiple Lateral views.

In fact, our users often use the syntax combination LATERAL VIEW INLINE(arrays_zip(col1, col2)).

table function inline doc as following:

databricks: https://docs.databricks.com/en/sql/language-manual/functions/inline.html
spark: https://spark.apache.org/docs/latest/api/sql/#inline
hive:https://stackoverflow.com/questions/25088488/use-inlinearraystruct-struct-in-hive

Expected output:

SELECT id, t.type, t.score FROM example_table LATERAL VIEW INLINE(ARRAYS_ZIP(SPLIT(type, CONCAT('\\Q', ';')), scores)) t AS type, score

But function arrays_zip may not be supported by some engines, such as hive. I added a parameter unnest_using_arrays_zip to avoid hive being affected.

I added a commit. Will be open to any suggestions =)

@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch 2 times, most recently from 08b0bc1 to 6eac225 Compare August 26, 2024 07:02
use `arrays_zip` to merge multiple `Lateral view explode`
@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from 6eac225 to 8261ff5 Compare August 26, 2024 07:27
Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for the follow-up PR! Left a few comments in my first pass.

sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/dialects/starrocks.py Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
tests/dialects/test_starrocks.py Outdated Show resolved Hide resolved
sqlglot/expressions.py Outdated Show resolved Hide resolved
@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from b749827 to c1bfdba Compare August 26, 2024 14:26
fix some comment
@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from c1bfdba to 6fa1ae7 Compare August 26, 2024 14:28
@hellozepp
Copy link
Contributor Author

@VaggelisD Added new commit, PTAL

sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/dialects/starrocks.py Outdated Show resolved Hide resolved
@hellozepp
Copy link
Contributor Author

@VaggelisD Good evening, I made some change, please take a look at it when you have time

@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from 5759bfa to 6676b89 Compare August 27, 2024 06:51
fix some comment
@hellozepp hellozepp force-pushed the starrocks_unnest_without_alias branch from 6676b89 to 2b0179a Compare August 27, 2024 07:17
sqlglot/dialects/hive.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @hellozepp!

sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
tests/dialects/test_dialect.py Outdated Show resolved Hide resolved
tests/dialects/test_starrocks.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
sqlglot/dialects/starrocks.py Outdated Show resolved Hide resolved
sqlglot/dialects/hive.py Outdated Show resolved Hide resolved
sqlglot/dialects/spark2.py Outdated Show resolved Hide resolved
sqlglot/transforms.py Outdated Show resolved Hide resolved
hellozepp and others added 2 commits August 27, 2024 16:48
fix some comment
@hellozepp
Copy link
Contributor Author

@georgesittas @VaggelisD Added new commit, PTAL

Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @hellozepp - LGTM. Appreciate the patience & quick iterations.

We plan to get this in soon and take a stab at refactoring the hive transformation so that it actually does the right thing. One way to go about this is by simulating the INLINE & ARRAYS_ZIP with a subquery that joins the POSEXPLODEd arrays on their positions and then projects the corresponding columns so they can be accessed in the scope of the original UNNEST.

@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 29, 2024

@georgesittas Agree.
Thanks for the previous advice, it helped me a lot =)

@hellozepp hellozepp closed this by deleting the head repository Aug 29, 2024
@hellozepp
Copy link
Contributor Author

Sorry, I cleaned up my sqlglot repository because it was a mess. Wait a minute and I'll make another commit. There won't be any new code since our last commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect translation in table function unnest syntax
3 participants