Rewriter should use NameCompare #370

joshua-oss · 2021-05-31T03:11:56Z

The rewriter currently does not apply any engine-specific name comparison rules, despite being guaranteed to have a NameCompare class available. The name comparison rules are currently applied only at symbol resolution time, which happens before and after rewriting.

This can lead to duplicate column names in the rewritten query. For example:

         sample = Table("PUMS", "PUMS", [
             Int('pid', is_key=True),
             Int('"PiD"')
         ], 150)
         meta = CollectionMetadata([sample], "csv")
         query = 'SELECT COUNT (DISTINCT PID) AS foo, COUNT(DISTINCT "PiD") AS bar FROM PUMS.PUMS'
         meta.compare = PostgresNameCompare()
         reader = PostgresReader("localhost", "PUMS", "admin", "password")
         private_reader = PrivateReader(reader, meta, 1.0)
         inner, outer = private_reader.rewrite(query)  # Fails!!
         ne = outer.select.namedExpressions
         assert(ne[0].expression.expression.name == 'keycount')
         assert(ne[1].expression.expression.name != 'keycount')

In the above, the metadata provided by the data curator has a pid and a PiD, which can be separate columns according to Postgres rules. The query provided by the user, however, counts a PID column, which will resolve to pid at symbol resolution time, but results in duplicate columns in the rewriter. The rewriter produces 3 columns: A 'pid' (for the keycount), a PID (for the user query), and a PiD for the case-sensitive column. At symbol resolution time, this fails, because the rewritten query has two columns that resolve to pid.

This occurs because push_scope does not apply name comparison rules, and therefore does not know that COUNT (DISTINCT pid) and COUNT (DISTINCT PID) are counting the same thing. If name comparison rules were used, the column would only be queried once, and would be reused as needed.

Probably the best way to fix this would be to have the hash table in NameScope use the equality rules from the NameCompare. The hash table for the naming scope handles things like expressions, and the name-comparison rules would only need to apply to identifiers. This would allow us to keep the current contract of NameCompare.identifier_match, and would cover all cases, as long as the engine-specific rules are implemented faithfully.

Another fix could be to canonicalize names used in the rewriter. For example, when a user-provided query asks the rewriter to access a column named PID, the rewriter could scan the curator-provided metadata and look for the first column name that matches (e.g. pid), and then use the column name from the metadata in the rewritten query. This is probably a good idea even if the hash table implements name compare rules.

If the hash table doesn't implement name compare rules, canonicalization based on metadata would not cover all cases. For computed columns which don't map to TableColumn, the rewriter would need to use some heuristic like .lower() or .upper(). We could extend the NameCompare contract to include a canonicalize() method, to allow more fine-tuned canonicalization that honors escaping rules. However, this seems like a lot of additional complexity.

The text was updated successfully, but these errors were encountered:

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewriter should use NameCompare #370

Rewriter should use NameCompare #370

joshua-oss commented May 31, 2021

Rewriter should use NameCompare #370

Rewriter should use NameCompare #370

Comments

joshua-oss commented May 31, 2021