Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merge KDL v2 #286

Open
wants to merge 98 commits into
base: main
Choose a base branch
from
Open

Merge KDL v2 #286

wants to merge 98 commits into from

Conversation

zkat
Copy link
Member

@zkat zkat commented Aug 28, 2022

Here it is! The long-awaited KDL v2, which is where we go ahead and make a handful of technically-breaking changes to address some corner cases we've run into over the past year while KDL has been getting implemented in a bunch of languages by various people.

I'd love to get feedback on what we have slated, and whether there's anything else we should definitely include when this goes out.

danini-the-panini and others added 4 commits August 28, 2022 12:59
Honestly, they're just too implementation-specific
As I read the grammar in the spec, `"//"` wouldn't parse as a single-line-comment as it requires as least one non-newline character after the slashes.
@zkat
Copy link
Member Author

zkat commented Aug 28, 2022

/cc @CAD97

@CAD97
Copy link
Contributor

CAD97 commented Aug 28, 2022

I have a slight preference for #241 over #204 personally, though only slight.

@@ -0,0 +1 @@
node "\\"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't there also be an output case for this, and one for forwardslash (node "/") then?

Seeing as the escaped form was removed

EDIT: mixing up my slashes again

@hkolbeck
Copy link
Member

hkolbeck commented Aug 29, 2022

I have a preference for #204, because the primary use case I can see for # in bare identifiers is hashtag-like which would be illegal under either, and it seems better to go with the simpler rule.

That preference is not terribly strong, though.

Edit: I misread, I'm fine with either

@CAD97
Copy link
Contributor

CAD97 commented Aug 29, 2022

the primary use case I can see for # in bare identifiers is hashtag-like

To clarify, #241 allows #ident as a bare ident, and both will of course still allow "r#ident" as a quoted ident.

Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

@lilyball
Copy link
Contributor

Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

#foo in CSS is special-casing the id attribute. KQL doesn't have an equivalent to HTML's id, and using #foo syntax in KQL to mean something else might be confusing given its meaning in CSS, so I don't find the argument against compelling.

My inclination is to prefer #241 as well, as I think being able to write hashtags is neat. It also allows for doing things like writing Nix flake references as bare words, e.g. nixpkgs#hello.

@Lucretiel
Copy link
Contributor

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@hkolbeck
Copy link
Member

I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does

- "x\
    y\
    z"

Translate to "xyz", "x y z", or "x\ny\nz"?

@zkat
Copy link
Member Author

zkat commented Aug 31, 2022

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

@Lucretiel
Copy link
Contributor

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

@Lucretiel
Copy link
Contributor

Lucretiel commented Aug 31, 2022

I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does

- "x\
    y\
    z"

Translate to "xyz", "x y z", or "x\ny\nz"?

I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (#213 (comment)) succinctly describes this.

@zkat
Copy link
Member Author

zkat commented Aug 31, 2022

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

yep!

I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (#213 (comment)) succinctly describes this.

Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where \<newline> is the same as <non-newline whitespace>

@CAD97
Copy link
Contributor

CAD97 commented Aug 31, 2022

Is this what Rust does?

[playground]

[src/main.rs:2] dbg!("\
    here\
    is\
    an\
    example\
    ") = "hereisanexample"

@hkolbeck
Copy link
Member

hkolbeck commented Aug 31, 2022

It's worth noting that bash behaves similarly as far as just dropping the newline, though it doesn't consume space afterward:

❯ echo foo\
… ❯ bar\
… ❯ baz
foobarbaz

With that I think xyz is the right output, and am +1 on including it in v2

Edit: Scratch that, I'm a space cadet:

❯ echo foo\
      bar\
      baz
foo bar baz

I'm more prone to emulating bash over rust, but I'm curious how others feel

@Lucretiel
Copy link
Contributor

Lucretiel commented Aug 31, 2022

Bash's behavior is concerned with syntactic whitespace (ie, allowing commands to spread over multiple lines with line continuations). It doesn't meaningfully behave in terms of consuming or not consuming specific whitespace so much as it extends a line to the next line while retaining the separation of tokens for a command. In your echo example, all that's happened is that the foo and bar and baz have correctly been passed as different arguments to echo; it's no different than:

> echo foo             bar \
   baz
foo bar baz

Kaydle has basically the same behavior with its own line continuation syntax, where you can use a \ to continue a single node into the next line. All these nodes are the same:

node 1 2 3
node 1    2   3
node 1\
  2\
  3

#213 is instead concerned with treatment of escaped whitespace in strings, where I think the plain consumption of unescaped whitespace makes the most sense

Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where <newline> is the same as

Rust does just consume all whitespace, regardless of type. The canonical way to add newlines to a whitespace-escaped string to to escape them:

assert_eq!(
    "line 1\n\
    line 2\n\
    line 3\n",

"line 1
line 2
line 3
"
);

Though more commonly I use it to stretch out long sentences with simple spaces:

assert_eq!(
    "This is a sentence with a \
    lot of words in it.",
    "This is a sentence with a lot of words in it."
);

JSON-IN-KDL.md Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the JiK spec mention that #inf, #-inf, and #nan are invalid values in a JiK context?
It is already implied, as JSON can't represent these values, but I think it would make sense to make this explicit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good for a warning, yeah.

Comment on lines +31 to +32
* `null`, `true`, and `false` are now `#null`, `#true`, and `#false`. Using
the unprefixed versions of these values is a syntax error.
Copy link

@chitoyuu chitoyuu Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry if this has been explained somewhere before, a 100+ comment discussion can be hard to follow.

Speaking purely from a user's perspective, this specific change feels a bit unnecessary. I assume it's being done to prevent ambiguity, but null, true and false are keywords common enough, that I don't think anyone with the slightest experience writing code would be surprised if they have special meanings unlike normal identifiers. If anything, being a Rust user, I would come into this expecting the exact opposite: that true means the boolean and #true means the raw identifier, similar to how it works in Rust sans the r.

Is there something I'm missing here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We decided this would be enough of a potential footgun since the change to allow unquoted strings. Programming languages like Rust have other defenses against this kind of confusion, but kdl would need to do something different (like this prefixing) to prevent, say, the kind of things you see happen in plain JavaScript

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation, and fair enough, if that's the point of balance you've decided on for your project.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
examples/ci.kdl Outdated Show resolved Hide resolved
SPEC.md Outdated Show resolved Hide resolved
SPEC.md Show resolved Hide resolved
@Bannerets
Copy link
Contributor

Bannerets commented Mar 5, 2024

How exactly do multi-line strings interact with whitespace escapes (\ )? Perhaps that should be clarified in the SPEC?

That is, are these allowed?

  "
  foo \
 bar
  baz
  "
  "
  foo
  bar\
  "

SPEC.md Outdated Show resolved Hide resolved
zkat and others added 4 commits March 5, 2024 12:45
Co-authored-by: Bannerets <[email protected]>
Co-authored-by: Bannerets <[email protected]>
Co-authored-by: Bannerets <[email protected]>
Co-authored-by: Bannerets <[email protected]>
@zkat
Copy link
Member Author

zkat commented Apr 1, 2024

@Bannerets I've made some changes to clarify this. wdyt?

@zkat
Copy link
Member Author

zkat commented Apr 1, 2024

🤔 maybe it should be the other way around: it might be easier to implement to have escapes process first, and then do multiline stuff. It might also make more sense, too.

But if you're using multiline escapes in multiline strings, you're asking for trouble anyway.

@Bannerets
Copy link
Contributor

I think interpreting escapes after dedenting seems intuitive, in the sense that the string is parsed exactly as if indentation were just not in the source code. Quickly checking multi-line strings in Elixir, it looks like the string is dedented before escapes, and, e.g., \n doesn't need any extra indentation space after it to work (Elixir also appends final newline). It may also look weird that \s can be used in place of indentation (and mixed with spaces, making the literal string not look evenly indented) in case it is interpreted first. Functions like python's textwrap.dedent can remove indentation only after the escaped string has been parsed, and it feels more like a deficiency. As far as the JavaScript proposal goes, IIRC it takes the String.raw... form of the string so escapes should not interpreted. In the end, I support escapes after dedenting.

The kdlua implementation expands escapes before dedenting, I think (link).

@zkat
Copy link
Member Author

zkat commented Apr 2, 2024

ahhh yes. I see your point. I'll change it back to be dedent-before-escape.

@zkat
Copy link
Member Author

zkat commented Apr 2, 2024

There, that's done :)

opening `"`, and a final newline plus whitespace preceding the closing `"`.
* SMALL EQUALS SIGN (`U+FE66`), FULLWIDTH EQUALS SIGN (`U+FF1D`), and HEAVY
EQUALS SIGN (`U+1F7F0`) are now treated the same as `=` and can be used for
properties (e.g. `お名前=☜(゚ヮ゚☜)`). They are also no longer valid in bare
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example doesn't look right: surely ( and ) aren't allowed here?

@tjol
Copy link
Contributor

tjol commented May 25, 2024

Implementing the multi-line string and whitespace escaping rules is proving quite subtle.

When processing a Multi-line String, implementations MUST resolve all whitespace escapes after dedenting the string.

This sounds simple enough: if there are newlines in the string, I check that the indentation is consistent and remove it. Then, I handle the various backslash escapes.

That should take care of illegal strings like

  "
  foo \
bar
  baz
  "

(from in the spec), and legal strings like

"
    Hello
    \
         World
    "

(which is equal to "Hello\nWorld", if I'm understanding this correctly)

This algorithm does not work for this example in the spec:

    "Hello\n\
    World"

Before considering the \ escaping the newline, this looks very much like a syntax error: there is a newline in the string, but there is no initial or final newline. I believe the formal grammar also prohibits this string.

The spec prose appears to have a solution to this conundrum: (emphasis mine)

When a Quoted or Raw String spans multiple lines with literal, non-escaped Newlines, it follows a special multi-line syntax ...

So if all newlines in the string are escaped, it is not a multi-line string? To my mind that should imply that escaped newlines are not newlines for the purposes of dedenting, and contradicts the rule that dedenting comes before backslash escapes. Overall, not very satisfying.

I would suggest:

  • removing “, non-escaped” from the definition of multi-line strings, requiring that every string that contains ANY literal newlines follows the multi-line-string rules. This matches what the formal grammar already says.
  • replacing the offending example with
"
Hello\n\
    World
"

Edit: I've added a PR - #391

@tjol
Copy link
Contributor

tjol commented May 27, 2024

I wonder if the more intuitive way for strings to work would be:

  1. remove escaped whitespace
  2. dedent multi-line string
  3. resolve other backslash escapes

This should have the same result as the #391 rule for all strings valid under that rule, but also accept more cases with escaped newlines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking This can only be done for the next major version of KDL
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet