-
Notifications
You must be signed in to change notification settings - Fork 858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
clarify string descriptions #875
base: main
Are you sure you want to change the base?
Conversation
I don't believe these changes make the spec any any clearer. In fact I'd argue it's a clear regression in clarity; aggregating the string information for all strings and listing exceptions requires the reader to do the heavy lifting. Repeatedly listing it for each string type is redundant, yes, but expects less cognitive load from the reader and fits the flow of the document better. |
toml.md
Outdated
All strings must contain only valid UTF-8 encoded characters as is the case for | ||
the TOML document as a whole. Certain control characters are not allowed to | ||
occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E | ||
to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in | ||
literal strings or multi-line literal strings, those control characters can be | ||
described with escapes as specified below. Additional restrictions are described | ||
below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you do persist with this change, then I'd simplify this paragraph. There's just too much here.
All strings must contain only valid UTF-8 encoded characters as is the case for | |
the TOML document as a whole. Certain control characters are not allowed to | |
occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E | |
to U+001F, and U+007F. In basic strings and multi-line basic strings, but not in | |
literal strings or multi-line literal strings, those control characters can be | |
described with escapes as specified below. Additional restrictions are described | |
below. | |
Strings must contain only valid UTF-8 encoded characters. Certain control characters are not allowed to occur literally in any kind of string: U+0000 to U+0008, U+000B, U+000C, U+000E to U+001F, and U+007F. |
The point about the basic strings supporting escaped control characters is already covered in the "basic strings" section.
(An argument can be made that the list of disallowed characters should be represented as an actual bullet-point list, though that's a matter of taste, and beyond the scope of this PR since it wasn't that way to begin with.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think moving the "Any Unicode character may be used except [..]" one paragraph up makes sense. Now it just looks like it applies only to basic strings, rather than all strings.
Can then also remove the same text for multi-line strings and "Control characters other than tab are not permitted in a literal string" at the end of the "Multi-line literal strings" section.
I'd write it as something like:
There are four ways to express strings: basic, multi-line basic, literal, and multi-line literal. All strings must be encoded as valid UTF-8, and can contain any codepoint except control characters other than tab (U+0000 to U+0008, U+000A to U+001F, U+007F). Multi-line strings can also contain newlines (U+000A) and carriage returns (U+000D).
This way you have "what bytes/characters can be in a string?" in a single concise paragraph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arp242 I like that wording.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@arp242 Thanks, that is an improvement. Putting those details at the beginning rather than at the end of each section makes all the sections easier to understand. I have also reworded the last paragraph in strings to make it clear it is not a another part of the spec but just advice (paragraph starting "Because most control characters are not permitted...").
I would avoid the phrase "All strings must be encoded as valid UTF-8" since it suggests that strings can be encoded independently of the rest of a TOML document, which of course is not the case. The whole document is encoded as UTF-8; so, when looking at strings, we can only see them as series of Unicode codepoints, not of bytes. So, instead of
I'd propose
|
@ChristianSi Good idea! I also simplified the language to avoid the double negative, clarifying the situation with tab. This is what the paragraph looks like in this PR now:
@arp242 What do you think about this change to your rewording? |
multi-line literal. Strings can contain any valid Unicode codepoint except the | ||
following control characters: U+0000 to U+0008, U+000A to U+001F, and | ||
U+007F. Note that tab (U+0009) is allowed. Multi-line strings can also contain | ||
newlines (U+000A) and carriage returns (U+000D). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that saying that U+000A
and U+000D
are not allowed first1 and then adding an exception for multi-line strings is kind of a double negative (an exception of the previous exception)...
I would recommend restricting the code point ranges/enumeration to the ones that are allowed in all types of strings.
Then I would add a second (separated) statement specifically saying that "basic" and "literal" strings (single-line) don't allow newlines/carriage returns.
For example, something like:
Strings can contain any valid Unicode codepoint except the following control characters:
U+0000 to U+0008, U+000B, U+000C, U+000E, U+001F, and U+007F.
Note that tab (U+0009) is allowed.
Newlines (U+000A) and carriage returns (U+000D) are allowed in multi-line strings
but forbidden in basic and literal strings.
Footnotes
-
U+000A
andU+000D
are elements of the previously mentioned character ranges/enumeration ↩
Control characters other than tab are not permitted in a literal string. Thus, | ||
for binary data, it is recommended that you use Base64 or another suitable ASCII | ||
or UTF-8 encoding. The handling of that encoding will be application-specific. | ||
Because most control characters are not permitted even in literal and multi-line | ||
literal strings, these literal strings are not suited for representing blobs of | ||
binary data. It is recommended that you use Base64 or another suitable ASCII or | ||
UTF-8 encoding. The handling of that encoding will be application-specific. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have an alternative paragraph that expresses these same sentiments in #929.
Restructure the list of disallowed code points so the control characters that are disallowed
in all types of strings are listed at the beginning, and just provide the differences for the various types.
Clarify that backslash and quotation mark can occur literally, but only as part of an escape sequence.
Also put newlines in a couple of other places where the lines exceeded 80 characters.