add support for UNICODE for sqlite, mysql, tsql, postgres, and oracle #4554

pruzko · 2025-01-02T09:39:47Z

Added support for UNICODE for #4233

SQLGlot already supports exp.Chr, which converts integers to characters. It would be useful to have an inverse function that returns the integer representation of a character.

Here I propose Unicode code points as the default abstraction since they cover a wide range of symbols, are ASCII-compatible, and are pretty much the industry standard. All databases have some function like this, be it ASCII(x), UNICODE(x), or ORD(x). The behavior varies a bit, though. SQLite, TSQL, and Postgres return Unicode code points by default. Oracle's and MySQL's results depend on the encoding, hence they require a conversion to a UTF encoding.

The documentation is a bit of a mess, especially around the encoding requirements, but I tried this on all of the aforementioned engines and the behavior is consistent.

SQLite: unicode
TSQL: unicode
Postgres: ascii (returns Unicode code points for UTF-8 strings)
MySQL: ord is encoding-dependent, so a utf32 conversion is necessary
Oracle: ascii also needs a conversion through unistr
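
To illustrate why the encoding matters, here is a minimal Python sketch (plain Python, not SQLGlot code). It assumes an engine whose ordinal function returns the numeric value of the character's bytes in the current encoding; that is a simplification made for illustration, not a documented guarantee of MySQL or Oracle, and `byte_value` is a hypothetical helper. Under that assumption, a UTF-32 encoding makes the byte value coincide with the Unicode code point, while UTF-8 only does so for ASCII characters.

```python
def code_point(char: str) -> int:
    """Unicode code point of a single character (what UNICODE(x) is meant to return)."""
    return ord(char)


def byte_value(char: str, encoding: str) -> int:
    """Integer formed by the character's bytes in `encoding`, read big-endian.

    Hypothetical stand-in for an encoding-dependent ordinal function.
    """
    return int.from_bytes(char.encode(encoding), "big")


# Columns: character, code point, UTF-8 byte value, UTF-32 byte value.
for ch in ("a", "é", "€"):
    print(ch, code_point(ch), byte_value(ch, "utf-8"), byte_value(ch, "utf-32-be"))
```

For "a" all three numbers agree (97). For "€" the code point is 8364, the UTF-8 byte value is a different number, and the UTF-32 byte value is 8364 again.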

VaggelisD

Hey @pruzko, thank you a lot for these PRs! Will go ahead and look at each one.

As a general rule, would you mind adding in the description any links/documentation that validate the decisions you've made for each dialect?

Do note that this PR only implements parsing for SQLite (implicitly done through exp.Unicode) and generation for the other dialects, but not the other way around. This means that users won't be able to convert Postgres's ASCII(x) to SQLite's UNICODE(x), because without Postgres parsing support, ASCII(x) will be parsed as an exp.Anonymous function.
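
The asymmetry described above can be shown with a toy model. This is not SQLGlot's implementation: `Node`, `PARSERS`, `GENERATORS`, and the three functions are hypothetical names invented for the sketch. It only shows that a function name with no parser entry stays opaque and is emitted unchanged.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # "Unicode" for a recognized function, "Anonymous" otherwise
    name: str  # function name as written in the source SQL
    arg: str


# Hypothetical per-dialect lookup tables, for illustration only.
PARSERS = {
    "sqlite": {"UNICODE": "Unicode"},
    "postgres": {},  # no parser entry for ASCII yet
}
GENERATORS = {
    "sqlite": {"Unicode": "UNICODE"},
    "postgres": {"Unicode": "ASCII"},
}


def parse(dialect: str, name: str, arg: str) -> Node:
    # Unknown function names fall back to an opaque "Anonymous" node.
    kind = PARSERS[dialect].get(name.upper(), "Anonymous")
    return Node(kind, name, arg)


def generate(dialect: str, node: Node) -> str:
    # Anonymous nodes are emitted verbatim because nothing is known about them.
    name = GENERATORS[dialect].get(node.kind, node.name)
    return f"{name}({node.arg})"


def transpile(read: str, write: str, name: str, arg: str) -> str:
    return generate(write, parse(read, name, arg))


print(transpile("sqlite", "postgres", "UNICODE", "x"))  # ASCII(x)
print(transpile("postgres", "sqlite", "ASCII", "x"))    # ASCII(x), left untranslated
```

Adding an `"ASCII": "Unicode"` entry to the toy `PARSERS["postgres"]` table makes the second call produce `UNICODE(x)`, which is the effect Postgres parsing support would have.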

VaggelisD · 2025-01-06T08:53:19Z

sqlglot/dialects/mysql.py

+            char_ord = exp.func("ord", char_utf)
+            return self.sql(char_ord)


We can simplify these 2 lines into return self.func("ORD", char_utf)

VaggelisD · 2025-01-06T08:54:17Z

sqlglot/dialects/oracle.py

+            unistr_func = exp.func("UNISTR", expression.this)
+            unicode_func = exp.Anonymous(this="ASCII", expressions=[unistr_func])
+            return self.sql(unicode_func)


Ditto regarding self.func

pruzko · 2025-01-06T09:41:09Z

Hi, I missed your comments. Sure, I'll address this soon.

pruzko · 2025-01-06T12:53:58Z

I cleaned up the code and updated the description. I've also added "ASCII" to Postgres's parser.

However, I'd also like to add parsing for MySQL (and Oracle) that would:

parse ord
check if the first argument passed to ord is a conversion to utf8
if yes, parse it into exp.Unicode and consume both ord and the conversion, else parse per normal

Do you already have something like this in the codebase please?
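
The parsing steps proposed above could be sketched roughly as follows. This is a toy AST in plain Python, not SQLGlot's parser API: `Column`, `Func`, `Unicode`, `is_utf_conversion`, and `fold_ord` are hypothetical names, and modelling the conversion as `Func("CONVERT", (x, charset))` is an assumption. The target charset is left as a parameter rather than fixed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Func:
    name: str
    args: tuple


@dataclass(frozen=True)
class Unicode:
    this: object  # the expression whose code point is requested


def is_utf_conversion(node: object, charset: str) -> bool:
    # Assumption: a charset conversion is modelled as Func("CONVERT", (expr, charset)).
    return (
        isinstance(node, Func)
        and node.name == "CONVERT"
        and len(node.args) == 2
        and node.args[1] == charset
    )


def fold_ord(node: object, charset: str) -> object:
    """Collapse ORD(CONVERT(x, charset)) into Unicode(x); leave anything else as is."""
    if isinstance(node, Func) and node.name == "ORD" and len(node.args) == 1:
        inner = node.args[0]
        if is_utf_conversion(inner, charset):
            # Consume both ORD and the conversion.
            return Unicode(inner.args[0])
    return node


x = Column("x")
wrapped = Func("ORD", (Func("CONVERT", (x, "utf32")),))
plain = Func("ORD", (x,))

print(fold_ord(wrapped, "utf32"))  # folded into Unicode(this=Column(name='x'))
print(fold_ord(plain, "utf32"))    # unchanged ORD call
```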

pruzko added 4 commits January 2, 2025 10:35

add support for UNICODE for sqlite, mysql, tsql, postgres, and oracle

9496d9f

UNICODE tests

fdcffa7

linter fix

c3480bd

oracle unicode fix

c2aa3a8

VaggelisD reviewed Jan 6, 2025

clean-up

310ff7a
