Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for UNICODE for sqlite, mysql, tsql, postgres, and oracle #4554

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

pruzko
Copy link
Contributor

@pruzko pruzko commented Jan 2, 2025

Added support for UNICODE for #4233

SQLGlot already supports exp.Chr that converts integers to chars. It would be useful to have an inverse function that returns an integer representations of chars.

Here I propose unicode points as the default abstraction since they cover a wide range of symbols, are ASCII compatible, and are pretty much the industry standard. All databases have some function like this, be it ASCII(x), UNICODE(x), or ORD(x). The behavior varies a bit though. SQLite, TSQL, and Postgres return unicode points by default. Oracle's and MySQL's results are dependent on the encoding, hence they require a conversion to UTF.

The documentation is a bit of a mess, especially around the encoding requirements, but I tried this on all of the aforementioned engines and the behavior is consistent.

SQLite: unicode
TSQL: unicode
Postgres: ascii (returns unicode points for utf8 strings)
MySQL: ord is encoding dependent, so a utf32 conversion is necessary
Oracle: ascii also needs a conversion through unistr

Copy link
Collaborator

@VaggelisD VaggelisD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pruzko, thank you a lot for these PRs! Will go ahead and look at each one.

As a general rule, would you mind adding in the description any links/documentation that validate the decisions you've made for each dialect?

Do note that this PR only implements parsing for SQLite (implicitly done through the exp.Unicode) and generation for the other dialects, but not the other way around; This means that users won't be able to convert Postgres's ASCII(x) to SQLite's UNICODE(x) because without Postgres parsing support, ASCII(x) will be an exp.Anonymous function.

Comment on lines 1265 to 1266
char_ord = exp.func("ord", char_utf)
return self.sql(char_ord)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can simplify these 2 lines into return self.func("ORD", char_utf)

Comment on lines 399 to 401
unistr_func = exp.func("UNISTR", expression.this)
unicode_func = exp.Anonymous(this="ASCII", expressions=[unistr_func])
return self.sql(unicode_func)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto regarding self.func

@pruzko
Copy link
Contributor Author

pruzko commented Jan 6, 2025

Hi, I missed your comments. Sure, I'll address this soon.

@pruzko
Copy link
Contributor Author

pruzko commented Jan 6, 2025

I cleaned up the code and updated the description. I've also added "ASCII" to Postres's parser.

However, I'd also like to add parsing for MySQL (and Oracle) that would:

  1. parse ord
  2. check if the first argument passed to ord is a conversion to utf8
  3. if yes, parse it into exp.Unicode and consume both ord and the conversion, else parse per normal

Do you already have something like this in the codebase please?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants