Add Wide String Constants Prefix #1965

Open
wants to merge 4 commits into
base: master

Conversation


@NotsoanoNimus NotsoanoNimus commented Feb 13, 2025

Adds wide-string constants to the language, e.g. char[?] my_string = L"Each character is 16 bits wide! Escapes\t are understood!";

This is an analog to an existing (albeit obscure) C feature, which I referenced in my open issue: #1947

I did my best here to conform to code styling/requirements, but please feel free to clean things up or stress-test them.

@lerno I know you said you were going to make a builtin, but I wanted to try my hand at modifying the compiler and writing a test. Go easy on me. If you don't want this feature in the language, I understand, but this was great to get my hands into the compiler anyway!

@NotsoanoNimus
Author

NotsoanoNimus commented Feb 13, 2025

One additional comment: I am aware that this is really only a 16-bit conversion abstraction for 8-bit characters right now (char into ushort). At the moment I’m not really sure how this would behave with Unicode or UTF-8, etc etc.

I would certainly be willing to help make the conversion for strings better and more friendly for those use-cases, preferably with some help from someone with a bit more knowledge about that.

Thanks.

@lerno
Collaborator

lerno commented Feb 13, 2025

Yes, as you know, I think this is better as a builtin function that does the conversion. It can be a builtin because it could be written as a macro. The code as written looks nice; however, I think most people would use L"..." to get an array of ushort instead. Alternatively, it is supposed to be the platform's wide character, which is 32 bits on macOS and 16 bits on Windows, I think.

@NotsoanoNimus
Author

While trying to fix the test cases, I decided I might change the direction of this PR to make it more explicit and universally applicable, rather than having the expansion of the string depend on the target platform just to cram in my wchar_t needs. I'm happy to volunteer this change.

Exactly as you mentioned, from the GNU C Language Manual:

The width of the data type wchar_t depends on the target platform, which makes this kind of wide string somewhat less useful than the newer kinds.

From this article...

// This will generally use some kind of Unicode encoding, but the
// exact encoding will be different on different platforms. On
// Windows, UTF-16. On Linux and Mac, UTF-32.
const wchar_t *wstr = L"γειά σου κόσμος";
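To check which width a particular toolchain actually picks for wchar_t, its size can be inspected directly. This is a minimal sketch in C; the helper name `wchar_bits` is made up for illustration.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width of the platform's wchar_t in bits.
 * Typically 16 on Windows and 32 on Linux/macOS, but the C standard
 * does not fix this value. */
static unsigned wchar_bits(void)
{
    return (unsigned)(sizeof(wchar_t) * CHAR_BIT);
}
```

Printing `wchar_bits()` on each target shows why an L"..." constant cannot have one fixed element type across platforms.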

So I'm going to write in the u and U prefixes that are mentioned in the same article to create arrays of unsigned short and unsigned int types, respectively. I will do this by expanding the is_wide property by one more bit and making that field indicate the selected field width.
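For reference, this is how the same two prefixes behave in C11, where they yield char16_t and char32_t arrays. It is only a sketch of the C side, not of the c3c change; the names `s16`, `s32` and the two length helpers are illustrative.

```c
#include <stddef.h>
#include <uchar.h>

/* C11 literals: u"..." is an array of char16_t, U"..." an array of char32_t. */
static const char16_t s16[] = u"hi";
static const char32_t s32[] = U"hi";

/* Element counts, including the terminating zero element. */
static size_t s16_len(void) { return sizeof s16 / sizeof s16[0]; }
static size_t s32_len(void) { return sizeof s32 / sizeof s32[0]; }
```

Both arrays have three elements ('h', 'i', and the terminator); only the element width differs.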

    00 - 'normal' string (do not use widestr code)
    01 - 'unsigned short' array (16-bit)
    10 - 'unsigned int' array (32-bit)
    11 - 'wchar_t' array (width depends on the target: selects 01 [Windows] or 10 [Mac/Linux])
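The two-bit encoding in the table above could be modeled roughly as follows. This is a sketch in C under stated assumptions: the enum, its member names, and the `target_is_windows` parameter are all hypothetical and are not taken from the c3c source.

```c
#include <stdbool.h>

/* Hypothetical names; the numeric values follow the two-bit table above. */
typedef enum {
    STR_WIDTH_NORMAL = 0, /* 00: plain char string */
    STR_WIDTH_16     = 1, /* 01: 16-bit elements */
    STR_WIDTH_32     = 2, /* 10: 32-bit elements */
    STR_WIDTH_WCHAR  = 3  /* 11: target-dependent, resolves to 01 or 10 */
} StrWidth;

/* Bytes per element. target_is_windows stands in for whatever target
 * query the compiler really uses (an assumption for this sketch). */
static unsigned str_width_bytes(StrWidth w, bool target_is_windows)
{
    switch (w) {
    case STR_WIDTH_16:    return 2;
    case STR_WIDTH_32:    return 4;
    case STR_WIDTH_WCHAR: return target_is_windows ? 2 : 4;
    case STR_WIDTH_NORMAL:
    default:              return 1;
    }
}
```

Resolving the 11 case to a concrete width early keeps the rest of the conversion code working with only two wide formats.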

Confusingly, the linked article above describes the u and U prefixes as a sort of UTF-16/32 encoding for the characters presented. I'm not sure how to handle this. For example, if a string is written in a mix of ASCII and emojis, I don't know how it's presented to the compiler. Is 😀12 presented as U+1F600 0x31 0x32 when 'read' in, is it all UTF-8 as it comes through these functions, or something else?
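If the source file is saved as UTF-8, the text 😀12 is stored as the bytes F0 9F 98 80 31 32, and turning that into code points takes a decoding step. Below is a minimal decoder sketch in C; `utf8_decode` is an illustrative helper, not a c3c function, and it deliberately skips checks for overlong forms and surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence starting at s (n bytes available) into *cp.
 * Returns the number of bytes consumed, or 0 on malformed input.
 * Sketch only: overlong encodings and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    if (n == 0) return 0;
    size_t len;
    uint32_t c = s[0];
    if (c < 0x80)                { len = 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; }
    else return 0;               /* stray continuation or invalid lead byte */
    if (len > n) return 0;       /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (s[i] & 0x3F);
    }
    *cp = c;
    return len;
}
```

Applied to the bytes above, the first call consumes four bytes and yields U+1F600; the next two calls each consume one byte and yield 0x31 and 0x32.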

I guess I need to learn more about parsing these things and also find out what c3c already has in terms of Unicode support. If you have anything to add on for this quest, I'd appreciate it.

@NotsoanoNimus
Author

NotsoanoNimus commented Feb 16, 2025

I took this as a challenge and really rolled with it. Added ushort[?] a = u"my string here";, uint[?] b = U"Another";, and ushort[?] c = L"test"; to the language as mentioned above. This assumes that the strings read by the c3c compiler are UTF-8 encoded, as they typically would be. Users with UTF-16 sources will need to convert their strings to UTF-8 (another potential builtin idea) before using this construct.
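For the 16-bit case, a code point above U+FFFF cannot fit in one element and has to be split into a surrogate pair. The sketch below shows that step in C; `utf16_encode` is an illustrative helper and makes no claim about how this PR implements the conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode code point as UTF-16 into out[0..1].
 * Returns the number of 16-bit units written, or 0 if cp is not a
 * valid scalar value (a surrogate or beyond U+10FFFF). */
static size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp > 0x10FFFF) return 0;
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;      /* fits in a single unit */
        return 1;
    }
    cp -= 0x10000;                  /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+1F600 this produces the pair 0xD83D 0xDE00, so a u"..." array can hold more elements than the string has characters.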

Despite the tests all passing on my local machine, there is one peculiar problem that remains after this change: the size of the data arrays remains at what a char array would have been, and I cannot figure out how to emit the right array size for the dynamic data type and carry over the entire const array's data.

For example, U"-" becomes type uint[8] when it should simply be uint[2] (one - and a null character). This is because converting the string "-" to its corresponding 32-bit raw array inflates its size to 8 bytes, and that byte count seems to end up being used as the element count. I don't want to emit 32 bytes when 8 would have done the job...
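The intended relationship is that the array length equals the byte size divided by the element width. The C analog below illustrates this for the same constant; it is only a sketch of the expected numbers, not a fix for the compiler, and the names `dash`, `dash_bytes`, and `dash_elems` are illustrative.

```c
#include <stddef.h>
#include <uchar.h>

/* The C analog of U"-": one '-' plus the terminating zero element. */
static const char32_t dash[] = U"-";

/* Total storage in bytes (8 where char32_t is 4 bytes wide). */
static size_t dash_bytes(void) { return sizeof dash; }

/* Element count: bytes divided by element width, which gives 2. */
static size_t dash_elems(void) { return sizeof dash / sizeof dash[0]; }
```

Using the byte size where the element count is expected would give the uint[8] described above instead of uint[2].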

You can see this problem whenever you don't use dynamic array sizes during variable creation/assignments. Any help on this part would be awesome.
