Skip to content

fix(starrocks): exp.Unnest transpilation #3966

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 6 commits into from
Closed

fix(starrocks): exp.Unnest transpilation #3966

wants to merge 6 commits into from

Conversation

hellozepp
Copy link
Contributor

@hellozepp hellozepp commented Aug 24, 2024

Fixes #3962

This PR adds support for the following:

  • Default initialize exp.UNNEST to include "unnest" as the default table alias if not specified
  • Added expression type of exp.Inline
  • use arrays_zip to merge multiple Lateral views
  • Added some ut to validate unnest transpile

@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 24, 2024

I tried to add the logic of setting default value when there is no table alias, PTAL @georgesittas @VaggelisD

I need to transpile the starrocks sql:

SELECT student, score, unnest FROM tests CROSS JOIN LATERAL UNNEST(scores)

Expected spark sql:

SELECT student, score, unnest FROM tests LATERAL VIEW EXPLODE(scores) unnest AS unnest

@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 25, 2024

I found that when unnest has multiple outputs, it will be transpiled into multiple Lateral view statements (actually incorrect), as shown in the following SQL of starrocks:

SELECT id, t.type, t.scores FROM example_table CROSS JOIN LATERAL unnest(split(type, ";"), scores) AS t(type,scores)

Transpiled into spark, the output is:

SELECT id, t.type, t.scores FROM example_table LATERAL VIEW EXPLODE(SPLIT(type, CONCAT('\\Q', ';'))) t AS type LATERAL VIEW EXPLODE(scores) t AS scores

Incorrect reason:
The output of unnest is the combined value of multiple columns, and LATERAL VIEW will be expanded multiple times and cross joined.
We can use arrays_zip to merge multiple Lateral views.

In fact, our users often use the syntax combination LATERAL VIEW INLINE(arrays_zip(col1, col2)).

table function inline doc as following:

databricks: https://docs.databricks.com/en/sql/language-manual/functions/inline.html
spark: https://spark.apache.org/docs/latest/api/sql/#inline
hive:https://stackoverflow.com/questions/25088488/use-inlinearraystruct-struct-in-hive

Expected output:

SELECT id, t.type, t.score FROM example_table LATERAL VIEW INLINE(ARRAYS_ZIP(SPLIT(type, CONCAT('\\Q', ';')), scores)) t AS type, score

But function arrays_zip may not be supported by some engines, such as hive. I added a parameter unnest_using_arrays_zip to avoid hive being affected.

I added a commit. Will be open to any suggestions =)

use `arrays_zip` to merge multiple `Lateral view explode`
Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for the follow-up PR! Left a few comments in my first pass.

fix some comment
@hellozepp
Copy link
Contributor Author

@VaggelisD Added new commit, PTAL

@hellozepp
Copy link
Contributor Author

@VaggelisD Good evening, I made some change, please take a look at it when you have time

fix some comment
Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @hellozepp!

hellozepp and others added 2 commits August 27, 2024 16:48
fix some comment
@hellozepp
Copy link
Contributor Author

@georgesittas @VaggelisD Added new commit, PTAL

Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @hellozepp - LGTM. Appreciate the patience & quick iterations.

We plan to get this in soon and take a stab at refactoring the hive transformation so that it actually does the right thing. One way to go about this is by simulating the INLINE & ARRAYS_ZIP with a subquery that joins the POSEXPLODEd arrays on their positions and then projects the corresponding columns so they can be accessed in the scope of the original UNNEST.

@hellozepp
Copy link
Contributor Author

hellozepp commented Aug 29, 2024

@georgesittas Agree.
Thanks for the previous advice, it helped me a lot =)

@hellozepp hellozepp closed this by deleting the head repository Aug 29, 2024
@hellozepp
Copy link
Contributor Author

Sorry, I cleaned up my sqlglot repository because it was a mess. Wait a minute and I'll make another commit. There won't be any new code since our last commit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Incorrect translation in table function unnest syntax
3 participants