Add Wide String Constants Prefix #1965

NotsoanoNimus · 2025-02-13T01:50:32Z

Adds wide-string constants to the language, e.g. char[?] my_string = L"Each character is 16 bits wide! Escapes\t are understood!";

This is an analog to an existing (albeit obscure) C feature, which I referenced in my open issue: #1947

I did my best here to conform to code styling/requirements, but please feel free to clean things up or stress-test them.

@lerno I know you said you were going to make a builtin, but I wanted to try my hand at modifying the compiler and writing a test. Go easy on me. If you don't want this feature in the language, I understand, but this was great to get my hands into the compiler anyway!

NotsoanoNimus · 2025-02-13T02:00:07Z

One additional comment: I am aware that this is really only a 16-bit conversion abstraction for 8-bit characters right now (char into ushort). At the moment I’m not really sure how this would behave with Unicode or UTF-8, etc etc.

I would certainly be willing to help make the conversion for strings better and more friendly for those use-cases, preferably with some help from someone with a bit more knowledge about that.

Thanks.

lerno · 2025-02-13T20:03:36Z

Yes, as you know, I think this is better as a builtin function that does the conversion. The reason it can be a builtin is that it could be written as a macro. The code as written looks nice; however, I think most people would use L"..." to get an array of ushort instead. Alternatively, it's supposed to be the platform wide character, which I think is 32 bits on macOS and 16 bits on Windows.

NotsoanoNimus · 2025-02-14T17:36:36Z

While trying to fix the test cases, I think I might adjust the direction of this change to make it more explicit and universally-applicable, rather than having the expansion of the string depend upon the target platform and simply trying to cram in my wchar_t needs. I'm happy to volunteer this change.

Exactly as you mentioned, from the GNU C Language Manual:

The width of the data type wchar_t depends on the target platform, which makes this kind of wide string somewhat less useful than the newer kinds.

From this article...

// This will generally use some kind of Unicode encoding, but the
// exact encoding will be different on different platforms. On
// Windows, UTF-16. On Linux and Mac, UTF-32.
const wchar_t *wstr = L"γειά σου κόσμος";
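To see the platform dependence for yourself, a small C helper like the one below reports how wide wchar_t is for the toolchain you compile with. This is only a sketch; the name wchar_bits is made up for illustration and is not part of c3c or this PR.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: width of wchar_t in bits for the current toolchain.
 * Typically 16 when targeting Windows and 32 on Linux and macOS. */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}
```

Calling wchar_bits() from a Windows build and from a Linux or macOS build should show the difference the quoted comment describes.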

So I'm going to write in the u and U prefixes that are mentioned in the same article to create arrays of unsigned short and unsigned int types, respectively. I will do this by expanding the is_wide property by one more bit and making that field indicate the selected field width.

    00 - 'normal' string (do not use widestr code)
    01 - 'unsigned short' array (16-bit)
    10 - 'unsigned int' array (32-bit)
    11 - 'wchar_t' array of bytes (dependent on target - selects 01 [Windows] or 10 [Mac/Linux])
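A rough sketch of how that two-bit field could be modeled in C, assuming the encoding listed above. The enum and function names here are hypothetical; the actual field layout in c3c is not shown in this thread.

```c
#include <stdbool.h>

/* Hypothetical two-bit width selector, mirroring the table above. */
typedef enum {
    STR_WIDTH_NORMAL = 0, /* 00: plain char string, no widening */
    STR_WIDTH_16     = 1, /* 01: unsigned short elements */
    STR_WIDTH_32     = 2, /* 10: unsigned int elements */
    STR_WIDTH_WCHAR  = 3  /* 11: wchar_t, resolved per target */
} StrWidth;

/* Resolve the selector to an element width in bits.
 * target_is_windows is an assumed stand-in for the real target query. */
static unsigned str_width_bits(StrWidth w, bool target_is_windows)
{
    switch (w) {
    case STR_WIDTH_NORMAL: return 8;
    case STR_WIDTH_16:     return 16;
    case STR_WIDTH_32:     return 32;
    case STR_WIDTH_WCHAR:  return target_is_windows ? 16 : 32;
    }
    return 8; /* unreachable for valid selectors */
}
```

With this shape, the wchar_t case is just an alias that collapses to 01 or 10 once the target is known, so later stages only ever deal with a concrete width.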

Confusingly, the linked article above describes the u and U prefixes as a sort of UTF-16/UTF-32 encoding of the characters presented. I'm not sure how to handle this. For example, if a string is written in a mix of ASCII and emojis, I don't know how it's presented to the compiler. Is 😀12 presented as U+1F600 0x31 0x32 when read in, is it all UTF-8 as it comes through these functions, or something else entirely?

I guess I need to learn more about parsing these things and also find out what c3c already has in terms of Unicode support. If you have anything to add on for this quest, I'd appreciate it.

NotsoanoNimus · 2025-02-16T08:07:47Z

I took this as a challenge and really rolled with it. Added ushort[?] a = u"my string here";, uint[?] b = U"Another";, and ushort[?] c = L"test"; to the language as mentioned above. This assumes that the strings read by the c3c compiler are encoded as UTF-8, as they typically would be. UTF-16 users will need to convert their strings to UTF-8 (another potential builtin idea) before using this construct.
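For reference, here is a minimal C sketch of the kind of conversion this implies, assuming the input is valid, NUL-terminated UTF-8. It is not the code from this PR; the function names are hypothetical and there is no validation of malformed sequences.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence at s (assumed valid). Returns bytes consumed. */
static size_t utf8_decode(const unsigned char *s, uint32_t *cp)
{
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) {
        *cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0) {
        *cp = ((uint32_t)(s[0] & 0x0F) << 12)
            | ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
        return 3;
    }
    *cp = ((uint32_t)(s[0] & 0x07) << 18) | ((uint32_t)(s[1] & 0x3F) << 12)
        | ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
    return 4;
}

/* UTF-8 to UTF-32: one element per code point, plus a terminator.
 * Returns the number of elements written, terminator included. */
static size_t utf8_to_utf32(const char *in, uint32_t *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;
    while (*s) s += utf8_decode(s, &out[n++]);
    out[n++] = 0;
    return n;
}

/* UTF-8 to UTF-16: code points above U+FFFF become a surrogate pair.
 * Returns the number of elements written, terminator included. */
static size_t utf8_to_utf16(const char *in, uint16_t *out)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = 0;
    while (*s) {
        uint32_t cp;
        s += utf8_decode(s, &cp);
        if (cp >= 0x10000) {
            cp -= 0x10000;
            out[n++] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
            out[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
        } else {
            out[n++] = (uint16_t)cp;
        }
    }
    out[n++] = 0;
    return n;
}
```

Under these assumptions, the 😀12 example from earlier (UTF-8 bytes F0 9F 98 80 31 32) comes out as 0x1F600 0x31 0x32 0 in the 32-bit case and as 0xD83D 0xDE00 0x31 0x32 0 in the 16-bit case, so a 16-bit array can be longer than the number of visible characters.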

Despite the tests all passing on my local machine, there is one peculiar problem that remains after this change: the size of the data arrays remains at what a char array would have been, and I cannot figure out how to emit the right array size for the dynamic data type and carry over the entire const array's data.

For example, U"-" becomes a type uint[8] when it should simply be a uint[2] (one - and a null char). This is because converting the string "-" to its corresponding 32-bit raw array inflates its size to 8 bytes. But I don't want to emit 32 bytes when 8 would have done the job...
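To make the expected length concrete: the element count for the 32-bit case should be the number of code points plus one terminator, not the byte size of the converted buffer. A small C sketch of that count, assuming valid UTF-8 input (the function name is hypothetical and this is not c3c code):

```c
#include <stddef.h>

/* Number of uint elements a UTF-32 array needs for a UTF-8 string,
 * including the terminating zero. Counts every byte that is not a
 * UTF-8 continuation byte (10xxxxxx), i.e. one per code point. */
static size_t utf32_len_with_nul(const char *utf8)
{
    size_t n = 0;
    for (const unsigned char *s = (const unsigned char *)utf8; *s; s++) {
        if ((*s & 0xC0) != 0x80) n++;
    }
    return n + 1;
}
```

For "-" this gives 2, which matches the uint[2] wanted above. The 8 in uint[8] is that same count multiplied by the 4-byte element size, which suggests the byte length is being used where the element count is expected.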

You can see this problem whenever you don't use dynamic array sizes during variable creation/assignments. Any help on this part would be awesome.


NotsoanoNimus · 2025-03-01T23:40:04Z

@lerno I've adjusted this PR to use builtins rather than semantic/syntactical features in order to get wide strings. Now that it's done, I agree that it ends up being better since it's such a seldom-used feature.

I think there might be some residual issues with the length of resulting arrays and their types, but since my local tests are a bit wonky right now (running off the latest compiler build), I can't really trust them. As far as I can tell from other programs that are using the build, it seems fine -- so I'll wait on the tests/checks to run.

lerno · 2025-03-04T12:48:18Z

I couldn't update on your branch, so I created a new pull request with the updates I made and merged that one. Thank you for the contribution!

NotsoanoNimus added 3 commits February 12, 2025 19:59

add wide-string constants with the 'L' prefix (C-like)

8e09a4c

add test file for wide string constants

ff01b4b

clean up wide string test file

c650b6a

add full UTF-8 to wide-str language capabilities

fa264b6

NotsoanoNimus added 4 commits March 1, 2025 16:42

move wstr language constant expressions to builtins and adjust test cases

737b9c9

actually add updated test cases

e94343f

Merge branch 'master' into master

4ffa0a0

adjust wstr test cases

8b23de6

lerno mentioned this pull request Mar 4, 2025

add test file for wide string constants #2016

Merged

lerno closed this in #2016 Mar 4, 2025
