add support for UNICODE for sqlite, mysql, tsql, postgres, and oracle #4554
Hey @pruzko, thank you a lot for these PRs! Will go ahead and look at each one.
As a general rule, would you mind adding in the description any links/documentation that validate the decisions you've made for each dialect?
Do note that this PR only implements parsing for SQLite (implicitly done through exp.Unicode) and generation for the other dialects, but not the other way around. This means that users won't be able to convert Postgres's ASCII(x) to SQLite's UNICODE(x), because without Postgres parsing support ASCII(x) will be an exp.Anonymous function.
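To illustrate the parse/generate asymmetry described above, here is a toy model in plain Python. It is not sqlglot's implementation: the PARSERS and GENERATORS tables and the transpile_call helper are hypothetical names made up for this sketch, and the only assumption carried over from the review is that a function without a parser entry falls through as an anonymous call.

```python
# Toy model of dialect-aware transpilation; NOT sqlglot's real implementation.
# All names below (PARSERS, GENERATORS, transpile_call) are hypothetical.

# Function name -> canonical node, per source dialect.
PARSERS = {
    "sqlite": {"UNICODE": "Unicode"},
    "postgres": {},  # no ASCII entry, mirroring the state the review describes
}

# Canonical node -> function name, per target dialect.
GENERATORS = {
    "sqlite": {"Unicode": "UNICODE"},
    "postgres": {"Unicode": "ASCII"},
}


def transpile_call(name: str, arg: str, read: str, write: str) -> str:
    """Rewrite a single-argument function call from one dialect to another."""
    node = PARSERS[read].get(name.upper())
    if node is None:
        # Unknown function: treated as anonymous and emitted unchanged.
        return f"{name}({arg})"
    return f"{GENERATORS[write][node]}({arg})"


sqlite_to_pg = transpile_call("UNICODE", "x", read="sqlite", write="postgres")
pg_to_sqlite_before = transpile_call("ASCII", "x", read="postgres", write="sqlite")

# Adding a parser entry for Postgres makes the reverse direction work too.
PARSERS["postgres"]["ASCII"] = "Unicode"
pg_to_sqlite_after = transpile_call("ASCII", "x", read="postgres", write="sqlite")
```

In this model the SQLite to Postgres direction works from the start, while the Postgres to SQLite direction passes ASCII(x) through untouched until the parser entry is added.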
sqlglot/dialects/mysql.py
char_ord = exp.func("ord", char_utf)
return self.sql(char_ord)
We can simplify these 2 lines into return self.func("ORD", char_utf)
sqlglot/dialects/oracle.py
unistr_func = exp.func("UNISTR", expression.this)
unicode_func = exp.Anonymous(this="ASCII", expressions=[unistr_func])
return self.sql(unicode_func)
Ditto regarding self.func
Hi, I missed your comments. Sure, I'll address this soon.
I cleaned up the code and updated the description. I've also added "ASCII" to Postgres's parser. However, I'd also like to add parsing for MySQL (and Oracle) that would:
Do you already have something like this in the codebase, please?
Added support for UNICODE for #4233.
SQLGlot already supports exp.Chr, which converts integers to chars. It would be useful to have an inverse function that returns the integer representation of a char.
Here I propose Unicode code points as the default abstraction since they cover a wide range of symbols, are ASCII compatible, and are pretty much the industry standard. All databases have some function like this, be it ASCII(x), UNICODE(x), or ORD(x). The behavior varies a bit though. SQLite, TSQL, and Postgres return Unicode code points by default. Oracle's and MySQL's results depend on the encoding, hence they require a conversion to UTF.
The documentation is a bit of a mess, especially around the encoding requirements, but I tried this on all of the aforementioned engines and the behavior is consistent.
SQLite: unicode
TSQL: unicode
Postgres: ascii (returns unicode points for utf8 strings)
MySQL: ord is encoding dependent, so a utf32 conversion is necessary
Oracle: ascii also needs a conversion through unistr
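The encoding dependence noted for MySQL can be sketched in plain Python. This is an emulation under stated assumptions, not MySQL itself: it assumes an ORD-style function combines the encoded bytes of a character into one integer with the leftmost byte most significant, and that the utf32 conversion is big-endian. The helper name byte_value is made up for this example.

```python
def byte_value(ch: str, encoding: str) -> int:
    """Combine the encoded bytes of one character into a single integer,
    most significant byte first (assumed model of a byte-based ORD)."""
    return int.from_bytes(ch.encode(encoding), "big")


euro = "€"

# The Unicode code point, which is what the PR proposes as the common abstraction.
code_point = ord(euro)

# Under UTF-8 the character is three bytes (E2 82 AC), so the byte-based
# value differs from the code point.
utf8_value = byte_value(euro, "utf-8")

# Under big-endian UTF-32 the four bytes spell out the code point directly,
# which is why a utf32 conversion lines the two up.
utf32_value = byte_value(euro, "utf-32-be")

# ASCII characters are one byte in UTF-8, so both views agree for them.
ascii_value = byte_value("A", "utf-8")
```

Under these assumptions the UTF-8 value only matches the code point for single-byte (ASCII) characters, while the UTF-32 value matches it for every character.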