Why am I unable to qualify this despite providing the schema? #3207
-
Hello guys,

TL;DR: why am I unable to qualify this query despite providing the schema?

Detail
The SQL example, which in the Python code below is assigned to the variable sql:
SELECT
M.col1 AS col1,
G.col2 AS col2,
Sum(M.metric1) AS metric1,
Sum( metric2 ) AS metric2, # Note that metric 2 here is not aliased and not able to be disambiguated
FROM `my-catalog.my_dataset.my_object1` M
LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` G
ON M.col1 = G.col1
GROUP BY col1,
col2
HAVING metric2 > 0

So first I run this:

ast = parse_one(sql, dialect='bigquery')
qualify(ast, identify=True, quote_identifiers=True).sql(dialect='bigquery', pretty=True)

And I get this error:
Question 1: Is there a reason for this / is it expected?

So I then try this instead:

my_schema = {'my-catalog.my_dataset.my_object1': {'col1': 'STRING',
                                                  'metric1': 'INT64',
                                                  'metric2': 'INT64'},
             'my-catalog.my_dataset.myobject2': {'col1': 'STRING', 'col2': 'INT64'}}
ast = parse_one(sql, dialect='bigquery')
qualify(ast, identify=True, quote_identifiers=True, schema=my_schema)

Again - I get the same error.

Question 2: What am I doing wrong here?
-
In reality this is basically "metric2", but the error reporting code looks something like f"... '{column}' ...", making it seem like this is a string. The intention is to delimit the name so it's separated from the rest of the text and it's clear what's not resolved.
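A toy illustration of that formatting (this is not sqlglot's actual error-reporting code, just a sketch of the f-string pattern described above):

column = "metric2"
# The single quotes around the name come from the message template, not from the
# value itself, which is why the unresolved column can look like a string literal.
print(f"Could not resolve column '{column}'")  # -> Could not resolve column 'metric2'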
Yes, it's expected. The optimizer can't possibly disambiguate the metric2 reference without additional information.
You're not nesting your schema properly. This is how it needs to be changed:

from sqlglot import parse_one
from sqlglot.optimizer.qualify import qualify
sql = """
SELECT
M.col1 AS col1,
G.col2 AS col2,
Sum(M.metric1) AS metric1,
Sum( metric2 ) AS metric2, # Note that metric 2 here is not aliased and not able to be disambiguated
FROM `my-catalog.my_dataset.my_object1` M
LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` G
ON M.col1 = G.col1
GROUP BY col1,
col2
HAVING metric2 > 0
"""
my_schema = {
'my-catalog': {
'my_dataset': {
'my_object1': {
'col1': 'STRING',
'metric1': 'INT64',
'metric2': 'INT64'
},
'myobject2': {
'col1': 'STRING',
'col2': 'INT64',
},
},
},
}
ast = parse_one(sql, dialect='bigquery')
qualified = qualify(ast, dialect="bigquery", schema=my_schema)
print(qualified.sql(dialect="bigquery", pretty=True))
# SELECT
# `m`.`col1` AS `col1`,
# `g`.`col2` AS `col2`,
# Sum(`m`.`metric1`) AS `metric1`,
# Sum(`m`.`metric2`) AS `metric2` /* Note that metric 2 here is not aliased and not able to be disambiguated */
# FROM `my-catalog.my_dataset.my_object1` AS `m`
# LEFT OUTER JOIN `my-catalog.my_dataset.myobject2` AS `g`
# ON `m`.`col1` = `g`.`col1`
# GROUP BY
# `m`.`col1`,
# `g`.`col2`
# HAVING
#   Sum(`m`.`metric2`) > 0
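As a follow-up, the call from the question should work as well - the only change needed is the nesting of the schema. A minimal sketch reusing the question's identify/quote_identifiers arguments and the my_schema defined above:

ast = parse_one(sql, dialect='bigquery')
qualified = qualify(
    ast,
    dialect="bigquery",
    schema=my_schema,
    identify=True,
    quote_identifiers=True,
)
print(qualified.sql(dialect="bigquery", pretty=True))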
-
Hey @georgesittas, thank you very much for replying so fast!
Understood - I was expecting the error - just the string formatting threw me off!
Aha! Thank you very much for this! I promise I did look through the MappingSchema class within the schema.py source, where I saw this:

class MappingSchema(AbstractMappingSchema, Schema):
"""
Schema based on a nested mapping.
Args:
schema: Mapping in one of the following forms:
1. {table: {col: type}}
2. {db: {table: {col: type}}}
3. {catalog: {db: {table: {col: type}}}}
4. None - Tables will be added later
visible: Optional mapping of which columns in the schema are visible. If not provided, all columns
are assumed to be visible. The nesting should mirror that of the schema:
1. {table: set(*cols)}}
2. {db: {table: set(*cols)}}}
3. {catalog: {db: {table: set(*cols)}}}}
dialect: The dialect to be used for custom type mappings & parsing string arguments.
normalize: Whether to normalize identifier names according to the given dialect or not.
""" And also the included example of schema on the table of contents: import sqlglot
from sqlglot.optimizer import optimize
print(
optimize(
sqlglot.parse_one("""
SELECT A OR (B OR (C AND D))
FROM x
WHERE Z = date '2021-01-01' + INTERVAL '1' month OR 1 = 0
"""),
schema={"x": {"A": "INT", "B": "INT", "C": "INT", "D": "INT", "Z": "STRING"}}
).sql(pretty=True)
But I (wrongly) thought that the table key could just be the fully qualified table name as it would be found within the raw query (if qualifying straight from it rather than qualifying an AST). Is there somewhere in the documentation where I could have found an example like the one you gave without asking? If so, could you let me know, so I can make sure I look harder next time before asking?

Thanks again for your help - I really appreciate it @georgesittas!
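For anyone who lands here later, here is a minimal sketch of the three nesting forms listed in that docstring (the catalog/db/table/column names are made up for illustration, and column_names is assumed to accept a dotted table reference, as the Schema API suggests):

from sqlglot.schema import MappingSchema

# Form 1: {table: {col: type}}
flat = MappingSchema({"t": {"a": "INT"}})

# Form 2: {db: {table: {col: type}}}
two_level = MappingSchema({"d": {"t": {"a": "INT"}}})

# Form 3: {catalog: {db: {table: {col: type}}}} - the shape used for the BigQuery schema above
three_level = MappingSchema({"c": {"d": {"t": {"a": "INT"}}}})

# Lookups resolve through the nesting levels rather than through a single dotted key.
print(three_level.column_names("c.d.t"))  # expected: ['a']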