Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewriter should use NameCompare #370

Open
joshua-oss opened this issue May 31, 2021 · 0 comments
Open

Rewriter should use NameCompare #370

joshua-oss opened this issue May 31, 2021 · 0 comments

Comments

@joshua-oss
Copy link
Contributor

The rewriter currently does not apply any engine-specific name comparison rules, despite being guaranteed to have a NameCompare class available. The name comparison rules are currently applied only at symbol resolution time, which happens before and after rewriting.

This can lead to duplicate column names in the rewritten query. For example:

         sample = Table("PUMS", "PUMS", [
             Int('pid', is_key=True),
             Int('"PiD"')
         ], 150)
         meta = CollectionMetadata([sample], "csv")
         query = 'SELECT COUNT (DISTINCT PID) AS foo, COUNT(DISTINCT "PiD") AS bar FROM PUMS.PUMS'
         meta.compare = PostgresNameCompare()
         reader = PostgresReader("localhost", "PUMS", "admin", "password")
         private_reader = PrivateReader(reader, meta, 1.0)
         inner, outer = private_reader.rewrite(query)  # Fails!!
         ne = outer.select.namedExpressions
         assert(ne[0].expression.expression.name == 'keycount')
         assert(ne[1].expression.expression.name != 'keycount')

In the above, the metadata provided by the data curator has a pid and a PiD, which can be separate columns according to Postgres rules. The query provided by the user, however, counts a PID column, which will resolve to pid at symbol resolution time, but results in duplicate columns in the rewriter. The rewriter produces 3 columns: A 'pid' (for the keycount), a PID (for the user query), and a PiD for the case-sensitive column. At symbol resolution time, this fails, because the rewritten query has two columns that resolve to pid.

This occurs because push_scope does not apply name comparison rules, and therefore does not know that COUNT (DISTINCT pid) and COUNT (DISTINCT PID) are counting the same thing. If name comparison rules were used, the column would only be queried once, and would be reused as needed.

Probably the best way to fix this would be to have the hash table in NameScope use the equality rules from the NameCompare. The hash table for the naming scope handles things like expressions, and the name-comparison rules would only need to apply to identifiers. This would allow us to keep the current contract of NameCompare.identifier_match, and would cover all cases, as long as the engine-specific rules are implemented faithfully.

Another fix could be to canonicalize names used in the rewriter. For example, when a user-provided query asks the rewriter to access a column named PID, the rewriter could scan the curator-provided metadata and look for the first column name that matches (e.g. pid), and then use the column name from the metadata in the rewritten query. This is probably a good idea even if the hash table implements name compare rules.

If the hash table doesn't implement name compare rules, canonicalization based on metadata would not cover all cases. For computed columns which don't map to TableColumn, the rewriter would need to use some heuristic like .lower() or .upper(). We could extend the NameCompare contract to include a canonicalize() method, to allow more fine-tuned canonicalization that honors escaping rules. However, this seems like a lot of additional complexity.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant