Why am I unable to qualify this despite providing the schema? #3207
-
Hello guys,

TL;DR: why am I unable to qualify this query despite providing the schema?

Detail
The SQL example, which in the Python code below is assigned to the variable sql:
SELECT
M.col1 AS col1,
G.col2 AS col2,
Sum(M.metric1) AS metric1,
Sum( metric2 ) AS metric2, # Note that metric 2 here is not aliased and not able to be disambiguated
FROM `my-catalog.my_dataset.my_object1` M
LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` G
ON M.col1 = G.col1
GROUP BY col1,
col2
HAVING metric2 > 0

So first I run this:

ast = parse_one(sql, dialect='bigquery')
qualify(ast, identify=True, quote_identifiers=True).sql(dialect='bigquery', pretty=True)

And I get this error:
Question 1: Is there a reason for this / is it expected?

So I then try this instead:

my_schema = {'my-catalog.my_dataset.my_object1': {'col1': 'STRING',
                                                  'metric1': 'INT64',
                                                  'metric2': 'INT64'},
             'my-catalog.my_dataset.myobject2': {'col1': 'STRING', 'col2': 'INT64'}}
ast = parse_one(sql, dialect='bigquery')
qualify(ast, identify=True, quote_identifiers=True, schema=my_schema)

Again - I get the same error.

Question 2: What am I doing wrong here?
-
In reality this is basically "metric2", but the error reporting code looks something like f"... '{column}' ...", making it seem like this is a string. The intention is to delimit the name so it's separated from the rest of the text and it's clear what's not resolved.
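A toy illustration of that formatting (this is not sqlglot's actual error-reporting code, just a sketch of the f-string pattern described above):

column = "metric2"
# The single quotes around the name come from the message template, not from the
# value itself, which is why the unresolved column can look like a string literal.
print(f"Could not resolve column '{column}'")  # -> Could not resolve column 'metric2'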
Yes, it's expected. The optimizer can't possibly disambiguate the metric2 reference without additional information.
You're not nesting your schema properly. This is how it needs to be changed:

from sqlglot import parse_one
from sqlglot.optimizer.qualify import qualify
sql = """
SELECT
M.col1 AS col1,
G.col2 AS col2,
Sum(M.metric1) AS metric1,
Sum( metric2 ) AS metric2, # Note that metric 2 here is not aliased and not able to be disambiguated
FROM `my-catalog.my_dataset.my_object1` M
LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` G
ON M.col1 = G.col1
GROUP BY col1,
col2
HAVING metric2 > 0
"""
my_schema = {
'my-catalog': {
'my_dataset': {
'my_object1': {
'col1': 'STRING',
'metric1': 'INT64',
'metric2': 'INT64'
},
'myobject2': {
'col1': 'STRING',
'col2': 'INT64',
},
},
},
}
ast = parse_one(sql, dialect='bigquery')
qualified = qualify(ast, dialect="bigquery", schema=my_schema)
print(qualified.sql(dialect="bigquery", pretty=True))
# SELECT
# `m`.`col1` AS `col1`,
# `g`.`col2` AS `col2`,
# Sum(`m`.`metric1`) AS `metric1`,
# Sum(`m`.`metric2`) AS `metric2` /* Note that metric 2 here is not aliased and not able to be disambiguated */
# FROM `my-catalog.my_dataset.my_object1` AS `m`
# LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` AS `g`
# ON `m`.`col1` = `g`.`col1`
# GROUP BY
# `m`.`col1`,
# `g`.`col2`
# HAVING
#   Sum(`m`.`metric2`) > 0
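As a follow-up, the call from the question should work as well - the only change needed is the nesting of the schema. A minimal sketch reusing the question's identify/quote_identifiers arguments and the my_schema defined above:

ast = parse_one(sql, dialect='bigquery')
qualified = qualify(
    ast,
    dialect="bigquery",
    schema=my_schema,
    identify=True,
    quote_identifiers=True,
)
print(qualified.sql(dialect="bigquery", pretty=True))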
-
Hey @georgesittas, thank you very much for replying so fast!
Understood - I was expecting the error - just the string formatting threw me off!
Aha! Thank you very much for this! I promise I did look through the MappingSchema class within the schema.py source, where I saw this:

class MappingSchema(AbstractMappingSchema, Schema):
"""
Schema based on a nested mapping.
Args:
schema: Mapping in one of the following forms:
1. {table: {col: type}}
2. {db: {table: {col: type}}}
3. {catalog: {db: {table: {col: type}}}}
4. None - Tables will be added later
visible: Optional mapping of which columns in the schema are visible. If not provided, all columns
are assumed to be visible. The nesting should mirror that of the schema:
1. {table: set(*cols)}}
2. {db: {table: set(*cols)}}}
3. {catalog: {db: {table: set(*cols)}}}}
dialect: The dialect to be used for custom type mappings & parsing string arguments.
normalize: Whether to normalize identifier names according to the given dialect or not.
""" And also the included example of schema on the table of contents: import sqlglot
from sqlglot.optimizer import optimize
print(
optimize(
sqlglot.parse_one("""
SELECT A OR (B OR (C AND D))
FROM x
WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
"""),
schema={"x": {"A": "INT", "B": "INT", "C": "INT", "D": "INT", "Z": "STRING"}}
).sql(pretty=True)
But I (wrongly) thought that the table key could just be the fully qualified table name as it would be found within the raw query (if qualifying straight from it rather than qualifying an AST). Is there somewhere in the documentation where I could have found an example like the one you gave without asking? If so, could you let me know, so I can make sure I look harder next time before asking?

Thanks again for your help - I really appreciate it @georgesittas!
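For anyone who lands here later, here is a minimal sketch of the three nesting forms listed in that docstring (the catalog/db/table/column names are made up for illustration, and column_names is assumed to accept a dotted table reference, as the Schema API suggests):

from sqlglot.schema import MappingSchema

# Form 1: {table: {col: type}}
flat = MappingSchema({"t": {"a": "INT"}})

# Form 2: {db: {table: {col: type}}}
two_level = MappingSchema({"d": {"t": {"a": "INT"}}})

# Form 3: {catalog: {db: {table: {col: type}}}} - the shape used for the BigQuery schema above
three_level = MappingSchema({"c": {"d": {"t": {"a": "INT"}}}})

# Lookups resolve through the nesting levels rather than through a single dotted key.
print(three_level.column_names("c.d.t"))  # expected: ['a']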