Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/qualify columns and support unpivot #4867

Closed

Conversation

zeta9044
Copy link

Fix struct field lookup conversions for PIVOT/UNPIVOT aliases

Issue Description

The _convert_columns_to_dots function in the qualify module incorrectly converts column references to PIVOT/UNPIVOT aliases into Dot expressions. This causes incorrect query results when working with PIVOT/UNPIVOT operations.

Changes Made

This PR improves the handling of struct field lookups by preventing incorrect conversion of PIVOT/UNPIVOT aliases to Dot expressions:

  1. Added early return patterns to skip columns that don't need conversion:

    • Skip columns without a table reference
    • Skip columns where the table is in the current scope's sources
  2. Added PIVOT/UNPIVOT alias detection:

    • First check in the current scope for efficiency
    • Only traverse parent scopes if not found in current scope
    • Use robust type comparison between string and identifier objects
  3. Preserved the original conversion logic for actual struct field lookups

Testing

Added tests to verify that column references to PIVOT/UNPIVOT aliases are correctly handled and not converted to Dot expressions.
Add comprehensive tests for sqlglot qualify functionality with UNPIVOT operations

This commit adds unit tests that verify the correct handling of column qualification in various UNPIVOT scenarios:

  • UNPIVOT without explicit table aliases (using auto-generated aliases)
  • UNPIVOT with explicit table aliases
  • UNPIVOT with JOIN operations and type casting
  • Nested UNPIVOT operations with proper alias propagation
  • Complex queries with UNPIVOT, EXISTS subqueries, and multi-condition joins

These tests ensure that the qualify transformation correctly qualifies column references
across different query structures while preserving the semantic meaning of the original SQL.

Performance Considerations

This change optimizes performance by:

  • Using early returns to avoid unnecessary processing
  • Checking the current scope first before traversing parent scopes
  • Only traversing parent scopes when necessary

The impact on large, complex queries with many nested scopes should be positive, as it reduces unnecessary scope traversal.

Dependency for Lineage Module Enhancement

This fix is a prerequisite for adding full UNPIVOT support in the lineage.py module. The current incorrect conversion of PIVOT/UNPIVOT aliases prevents proper column lineage tracking when working with UNPIVOT operations. With this fix in place, we can proceed with implementing comprehensive UNPIVOT support in the lineage module, ensuring accurate column mapping and dependency tracking through these transformations.

Future Work

Following the acceptance of this PR, I plan to submit another PR that will provide integrated lineage support for both PIVOT and UNPIVOT operations. This upcoming enhancement will enable complete column lineage tracking through these complex transformations, allowing users to accurately trace data flow and dependencies in queries that involve pivoting and unpivoting operations.

kangyou.choi added 3 commits March 10, 2025 16:46
… field accesses

This commit fixes a bug in the `_convert_columns_to_dots` function where UNPIVOT result
columns were incorrectly treated as struct field accesses. The bug caused expressions like
`unpvt.employee_id` to be incorrectly transformed into `employee_sales.unpvt.employee_id`
in qualified SQL queries.

The fix adds logic to detect when a column's table reference is an UNPIVOT alias by
checking against the pivot operations in the current scope. When an UNPIVOT alias is
detected, the function now skips the struct field access conversion for these columns.

This ensures that UNPIVOT result columns remain correctly qualified with just their
UNPIVOT alias (e.g., `unpvt.employee_id`), rather than being incorrectly nested with
the source table name.
Improved the handling of struct field lookups by preventing incorrect conversion
of PIVOT/UNPIVOT aliases to Dot expressions. This patch:

- Adds early returns to skip columns that don't need conversion
- Checks for PIVOT/UNPIVOT aliases in both current and parent scopes
- Optimizes performance by checking current scope first before traversing parents
- Adds robust type comparison between string and identifier objects
- Preserves original conversion logic for actual struct field lookups

This fixes an issue where columns referencing PIVOT/UNPIVOT aliases were
incorrectly converted to Dot expressions, causing incorrect query results.
…T operations

This commit adds unit tests that verify the correct handling of column qualification in various UNPIVOT scenarios:
- UNPIVOT without explicit table aliases (using auto-generated aliases)
- UNPIVOT with explicit table aliases
- UNPIVOT with JOIN operations and type casting
- Nested UNPIVOT operations with proper alias propagation
- Complex queries with UNPIVOT, EXISTS subqueries, and multi-condition joins

These tests ensure that the qualify transformation correctly qualifies column references
across different query structures while preserving the semantic meaning of the original SQL.
@zeta9044 zeta9044 closed this Mar 11, 2025
@zeta9044 zeta9044 reopened this Mar 11, 2025
@zeta9044 zeta9044 closed this Mar 11, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants