You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
The rewriter currently does not apply any engine-specific name comparison rules, despite being guaranteed to have a NameCompare class available. The name comparison rules are currently applied only at symbol resolution time, which happens before and after rewriting.
This can lead to duplicate column names in the rewritten query. For example:
sample = Table("PUMS", "PUMS", [
Int('pid', is_key=True),
Int('"PiD"')
], 150)
meta = CollectionMetadata([sample], "csv")
query = 'SELECT COUNT (DISTINCT PID) AS foo, COUNT(DISTINCT "PiD") AS bar FROM PUMS.PUMS'
meta.compare = PostgresNameCompare()
reader = PostgresReader("localhost", "PUMS", "admin", "password")
private_reader = PrivateReader(reader, meta, 1.0)
inner, outer = private_reader.rewrite(query) # Fails!!
ne = outer.select.namedExpressions
assert(ne[0].expression.expression.name == 'keycount')
assert(ne[1].expression.expression.name != 'keycount')
In the above, the metadata provided by the data curator has a pid and a PiD, which can be separate columns according to Postgres rules. The query provided by the user, however, counts a PID column, which will resolve to pid at symbol resolution time, but results in duplicate columns in the rewriter. The rewriter produces 3 columns: A 'pid' (for the keycount), a PID (for the user query), and a PiD for the case-sensitive column. At symbol resolution time, this fails, because the rewritten query has two columns that resolve to pid.
This occurs because push_scope does not apply name comparison rules, and therefore does not know that COUNT (DISTINCT pid) and COUNT (DISTINCT PID) are counting the same thing. If name comparison rules were used, the column would only be queried once, and would be reused as needed.
Probably the best way to fix this would be to have the hash table in NameScope use the equality rules from the NameCompare. The hash table for the naming scope handles things like expressions, and the name-comparison rules would only need to apply to identifiers. This would allow us to keep the current contract of NameCompare.identifier_match, and would cover all cases, as long as the engine-specific rules are implemented faithfully.
Another fix could be to canonicalize names used in the rewriter. For example, when a user-provided query asks the rewriter to access a column named PID, the rewriter could scan the curator-provided metadata and look for the first column name that matches (e.g. pid), and then use the column name from the metadata in the rewritten query. This is probably a good idea even if the hash table implements name compare rules.
If the hash table doesn't implement name compare rules, canonicalization based on metadata would not cover all cases. For computed columns which don't map to TableColumn, the rewriter would need to use some heuristic like .lower() or .upper(). We could extend the NameCompare contract to include a canonicalize() method, to allow more fine-tuned canonicalization that honors escaping rules. However, this seems like a lot of additional complexity.
The text was updated successfully, but these errors were encountered:
The rewriter currently does not apply any engine-specific name comparison rules, despite being guaranteed to have a
NameCompare
class available. The name comparison rules are currently applied only at symbol resolution time, which happens before and after rewriting.This can lead to duplicate column names in the rewritten query. For example:
In the above, the metadata provided by the data curator has a
pid
and aPiD
, which can be separate columns according to Postgres rules. The query provided by the user, however, counts aPID
column, which will resolve topid
at symbol resolution time, but results in duplicate columns in the rewriter. The rewriter produces 3 columns: A 'pid' (for thekeycount
), aPID
(for the user query), and aPiD
for the case-sensitive column. At symbol resolution time, this fails, because the rewritten query has two columns that resolve topid
.This occurs because
push_scope
does not apply name comparison rules, and therefore does not know thatCOUNT (DISTINCT pid)
andCOUNT (DISTINCT PID)
are counting the same thing. If name comparison rules were used, the column would only be queried once, and would be reused as needed.Probably the best way to fix this would be to have the hash table in
NameScope
use the equality rules from theNameCompare
. The hash table for the naming scope handles things like expressions, and the name-comparison rules would only need to apply to identifiers. This would allow us to keep the current contract ofNameCompare.identifier_match
, and would cover all cases, as long as the engine-specific rules are implemented faithfully.Another fix could be to canonicalize names used in the rewriter. For example, when a user-provided query asks the rewriter to access a column named
PID
, the rewriter could scan the curator-provided metadata and look for the first column name that matches (e.g.pid
), and then use the column name from the metadata in the rewritten query. This is probably a good idea even if the hash table implements name compare rules.If the hash table doesn't implement name compare rules, canonicalization based on metadata would not cover all cases. For computed columns which don't map to
TableColumn
, the rewriter would need to use some heuristic like.lower()
or.upper()
. We could extend theNameCompare
contract to include acanonicalize()
method, to allow more fine-tuned canonicalization that honors escaping rules. However, this seems like a lot of additional complexity.The text was updated successfully, but these errors were encountered: