Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(optimizer): Fix merge_subqueries.py::rename_inner_sources() #4266

Merged
merged 2 commits into from
Oct 18, 2024

Conversation

VaggelisD
Copy link
Collaborator

Fixes #4245

Consider this minimum repro, for the optimizer rules [qualify, eliminate_subqueries, merge_subqueries]:

WITH tbl AS (select 1 as id)
SELECT
  id
FROM (
  SELECT OTBL.id
  FROM (
    SELECT OTBL.id
    FROM (
      SELECT OTBL.id
      FROM tbl AS OTBL
      LEFT OUTER JOIN tbl AS ITBL ON OTBL.id = ITBL.id
    ) AS OTBL
    LEFT OUTER JOIN tbl AS ITBL ON OTBL.id = ITBL.id
  ) AS OTBL
  LEFT OUTER JOIN tbl AS ITBL ON OTBL.id = ITBL.id
) AS ITBL

After eliminate_subqueries, the derived tables have been lifted up as CTEs:

WITH tbl AS (
  SELECT
    1 AS id
), otbl AS (
  SELECT
    otbl.id AS id
  FROM tbl AS otbl
  LEFT OUTER JOIN tbl AS itbl
    ON otbl.id = itbl.id
), otbl_2 AS (
  SELECT
    otbl.id AS id
  FROM otbl AS otbl
  LEFT OUTER JOIN tbl AS itbl
    ON otbl.id = itbl.id
), itbl AS (
  SELECT
    otbl.id AS id
  FROM otbl_2 AS otbl
  LEFT OUTER JOIN tbl AS itbl
    ON otbl.id = itbl.id
)
SELECT
  itbl.id AS id
FROM itbl AS itbl

The merge_ctes subrule of merge_subqueries will attempt to merge each inner scope/CTE to the outer scope,
which (ignoring the OptimizerError) would ideally generate the following:

WITH tbl AS (
  SELECT
    1 AS id
)
SELECT
  otbl.id AS id
FROM tbl AS otbl
LEFT OUTER JOIN tbl AS itbl_2
  ON otbl.id = itbl_2.id
LEFT OUTER JOIN tbl AS itbl_3
  ON otbl.id = itbl_3.id
LEFT OUTER JOIN tbl AS itbl
  ON otbl.id = itbl.id

To safely move parts from the inner scopes outwards the helper function _rename_inner_sources() is used, which attempts to rename the inner sources in case of name collisions. The bug in this procedure has to do with the fact that the taken set is built only from the outer sources. For the example above, this is what happens in the function call that leads to the OptimizerError:

Upon entering _rename_inner_sources(outer_scope, inner_scope, alias):
----------------------------
@param alias: otbl
@param outer_scope: Scope<SELECT otbl.id AS id FROM otbl_2 AS otbl LEFT JOIN tbl AS itbl ON itbl.id = otbl.id>
Outer selected sources / taken set: {'otbl', 'itbl'}

@param inner_scope: Scope<SELECT otbl.id AS id FROM tbl AS otbl LEFT JOIN tbl AS itbl_2 ON itbl_2.id = otbl.id LEFT JOIN tbl AS itbl ON itbl.id = otbl.id>
Inner selected sources: {'itbl_2', 'otbl', 'itbl'} 

conflicts:  (outer_sources ∩ inner_sources) - alias = {'itbl'}

Upon leaving
----------------------------
new_name = find_new_name(taken={'otbl', 'itbl'}, conflict='itbl') -> 'itbl_2'

Renamed inner scope: Scope<SELECT otbl.id AS id FROM tbl AS otbl LEFT JOIN tbl AS itbl_2 ON itbl_2.id = otbl.id LEFT JOIN tbl AS itbl_2 ON itbl_2.id = otbl.id> 

Notice how itbl_2 was already an inner source but was incorrectly generated again to replace ITBL as it wasn't a part of taken, leading to duplicate aliases.

This PR solves this by extending the taken set to be the union of outer & inner selected sources.

Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work fixing & documenting the issue here @VaggelisD, this shit is gnarly.

sqlglot/optimizer/merge_subqueries.py Outdated Show resolved Hide resolved
@georgesittas georgesittas merged commit 4543fb3 into main Oct 18, 2024
6 checks passed
@georgesittas georgesittas deleted the vaggelisd/merge_subq_fixes branch October 18, 2024 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimizer fails when trying to optimize a query with joins and reused alias
3 participants