Add XML Parsing with DocXmlElement
#331
Conversation
@suneettipirneni I finally got some time to look at this PR tonight -- it is really good! 👍👍 Very cool that you added lots of test coverage. I need a little more time to study the logic in NodeParser.ts, but overall this PR is on the right track.
According to the current design of the AST, I believe the constraint shown in tsdoc/src/nodes/DocPlainText.ts (lines 49 to 52 in d8ce4ae) applies here.

(I forget why this constraint was introduced. Maybe so that visitors that process …)
Yes, in fact HTML entities are an important feature for representing arbitrary Unicode characters in an ASCII source file. CommonMark allows them basically anywhere except for code spans, not just inside an XML element. But the problem is bigger than this PR. For example, how best to represent an HTML entity inside an attribute value? And should TSDoc allow entities inside other contexts such as …? The problem is not super hard, but a rigorous specification of escaping syntaxes was one of the details that I thought we should figure out before writing the TSDoc spec.
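To make the discussion concrete, here is a minimal sketch of entity decoding. This helper is hypothetical (it is not part of this PR or the tsdoc API); it only illustrates why numeric character references let arbitrary Unicode live in an ASCII source file:

```typescript
// Hypothetical helper (not part of this PR): decode numeric character
// references like "&#x48;" or "&#65;", plus a few named entities.
// Unknown entities are left untouched rather than reported as errors.
const NAMED_ENTITIES: Record<string, string> = {
  amp: "&", lt: "<", gt: ">", quot: '"', apos: "'"
};

function decodeEntities(text: string): string {
  return text.replace(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z]+);/g, (match, body: string) => {
    if (body.startsWith("#x") || body.startsWith("#X")) {
      return String.fromCodePoint(parseInt(body.slice(2), 16)); // hex reference
    }
    if (body.startsWith("#")) {
      return String.fromCodePoint(parseInt(body.slice(1), 10)); // decimal reference
    }
    return NAMED_ENTITIES[body] ?? match; // named entity, or leave as-is
  });
}
```

The open questions above (entities inside attribute values, entities in other contexts) are about *where* such decoding applies, not *how* it works.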
tsdoc/src/parser/NodeParser.ts

```ts
endTagNameStartMarker,
endTagnameEndMarker,
TSDocMessageId.XmlTagNameMismatch,
`Expecting closing tag name "${endTagNameExcerpt.toString()}" to match opening tag name "${startTagNameExcerpt.toString()}"`
```
In the TSDoc Playground, the representation of this error is not quite right:
For the purposes of syntax highlighting and roundtripping, the parser tries to ensure that every input character corresponds to some AST excerpt, without any characters getting discarded.
Some of the unit tests validate this property:
tsdoc/src/parser/__tests__/TestHelpers.ts (lines 177 to 180 in d8ce4ae)

```ts
private static _getTokenCoverageGapsSnapshot(parserContext: ParserContext): string[] {
  const tokenCoverageChecker: TokenCoverageChecker = new TokenCoverageChecker(parserContext);
  return tokenCoverageChecker.getGaps(parserContext.docComment).map((x) => x.toString());
}
```
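The invariant that TokenCoverageChecker validates can be sketched independently of the tsdoc codebase. The function below is illustrative (not the real API): given the half-open `[start, end)` ranges claimed by excerpts, it reports any input characters that no excerpt covers:

```typescript
interface ExcerptRange { start: number; end: number } // half-open [start, end)

// Illustrative gap finder: sort the claimed ranges and report every stretch
// of input characters that no excerpt covers. The real TokenCoverageChecker
// operates on tokens, but the invariant is the same: no coverage gaps.
function findCoverageGaps(inputLength: number, ranges: ExcerptRange[]): ExcerptRange[] {
  const sorted = [...ranges].sort((a, b) => a.start - b.start);
  const gaps: ExcerptRange[] = [];
  let pos = 0;
  for (const r of sorted) {
    if (r.start > pos) {
      gaps.push({ start: pos, end: r.start }); // uncovered stretch before this range
    }
    pos = Math.max(pos, r.end);
  }
  if (pos < inputLength) {
    gaps.push({ start: pos, end: inputLength }); // uncovered tail of the input
  }
  return gaps;
}
```

A roundtripping parser wants this function to return an empty array for every input.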
It raises a design question of how best to represent `</c>` in the AST:

1. If it's too small of a fragment to be meaningful, then we could represent it as `ErrorText="</c>"`.
2. For a richer representation, we could represent it as a loose `XmlElement` that is missing its opening excerpts. And maybe `<b>` would become a loose `XmlElement` that is missing its closing excerpts. (`<a>` does NOT need to be a child of `<b>` in this representation.)
3. An "intelligent" possibility would be to interpret `</c>` as the closing tag for `<b>`. So `<b>` is a normal nesting element, just accompanied by an error message.
I don't have a strong opinion, but I suspect that 3 is maybe overengineering, and the decision of 1-vs-2 depends on how good we want the syntax highlighting to be, and how much time we have to spend on this PR. 🙂
I'll try out option 2 and see if it's feasible; it seems like the most ideal solution in my opinion.
@octogonz I'm re-reading option 3, and just to be clear: what you're proposing for that option is `ErrorText=<b>` instead of setting the error text to `</c>`? If so, I think that may be easier than option 2.
To be clear, the parser has two ways of modeling syntax errors:

1. `DocErrorText`, which will get syntax highlighted as red characters, plus an accompanying error message in the logs
2. a normal AST node such as `DocXmlElement`, whose properties are perhaps only partially initialized, plus an accompanying error message in the logs

The advantage of 2 is that syntax highlighting can be applied to the recognized characters, rather than having the entire span be red text. And an API consumer can still access the recognized parts of the syntax, such as the tag name. The disadvantage of 2 is that if the AST node is too degenerate, then it is awkward for consumers of the API to handle. Or the AST node may be completely incorrect (for example, if we interpreted some stray characters `<y` as a `DocXmlElement`).
It's a judgement call. There isn't a clear best answer.
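As a sketch of what the two strategies look like from a consumer's perspective, consider the following. The types are simplified stand-ins, not the actual tsdoc API:

```typescript
// Simplified stand-ins for the real tsdoc node types.
interface DocErrorTextLike { kind: "ErrorText"; text: string }
interface DocXmlElementLike { kind: "XmlElement"; name: string; endTagName?: string }
type NodeLike = DocErrorTextLike | DocXmlElementLike;

// Strategy 1 gives the consumer an opaque red span; strategy 2 gives a
// partially initialized element whose recognized parts (e.g. the tag name)
// remain accessible even though the element is malformed.
function describeNode(node: NodeLike): string {
  switch (node.kind) {
    case "ErrorText":
      return `unrecognized text: ${node.text}`;
    case "XmlElement":
      return node.endTagName === undefined
        ? `element <${node.name}> (missing closing tag)`
        : `element <${node.name}>`;
  }
}
```

With strategy 1 the consumer learns nothing beyond "this span is bad"; with strategy 2 it can still report which tag name was involved, at the cost of handling a degenerate node.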
> just to be clear what you're proposing for that option is `ErrorText=<b>` instead of setting the error text to `</c>`? If so, I think that may be easier than option 2.
No, what I meant for option 3 was that the excerpts `<b>` and `</c>` would belong to the same `DocXmlElement`. I think it's not easier, because some judgement calls are needed for less obvious inputs such as `<a><b><c></c></a></b>` or `<a><b><c></c></a>`.
(If you want to save time, for an input like `<b><a></a></c>`, the least effort might be to generate `<b>` and `</c>` as `DocErrorText` -- it's not elegant, but we could always improve it later.)
I don't have a strong preference -- it's up to you.
Hi again @octogonz,
I actually went with option 3. The main reason I did this was because of the behaviors I observed when trying different XML parsers: most of them properly figure out what's supposed to be the end tag for a given start tag. As for cases where the error is non-obvious, I followed the error patterns I saw in the other parsers.
For example, `<a><b><c></c></a></b>` would yield the following AST:
```
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="b"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="c"
  * XmlStartTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="c"
  * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="b"
  * XmlEndTag_ClosingDelimiter=">"
```
Along with the following error in the log:

```
Expected closing tag "</b>" but received "</a>" instead.
```
A loose start tag will always try to find the next available unmatched end tag. If the end tag doesn't match, a log error will be emitted, and the AST for that particular element will be the start tag followed by the mismatched end tag. This means the parser is always biased towards valid start tags. If a start tag cannot be parsed properly, the current element cannot be a `DocXmlElement`.
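The matching rule described above can be approximated with a stack. This is a simplified sketch, not the actual NodeParser logic (the real parser's error recovery and AST construction are more involved, and it may report fewer errors for a given input):

```typescript
interface Tag { kind: "start" | "end" | "selfClosing"; name: string }

// Simplified sketch of the matching rule: start tags are pushed on a stack,
// and each end tag pairs with the most recent unmatched start tag, emitting
// an error when the names differ.
function checkTagMatching(tags: Tag[]): string[] {
  const errors: string[] = [];
  const openTags: string[] = [];
  for (const tag of tags) {
    if (tag.kind === "start") {
      openTags.push(tag.name);
    } else if (tag.kind === "end") {
      const open = openTags.pop();
      if (open === undefined) {
        errors.push(`Unexpected closing tag "</${tag.name}>".`);
      } else if (open !== tag.name) {
        errors.push(`Expected closing tag "</${open}>" but received "</${tag.name}>" instead.`);
      }
    }
    // selfClosing tags like <c/> need no matching end tag
  }
  return errors;
}
```

For the input `<a><b><c></c></a></b>` above, the first error this sketch produces is the same one shown in the log.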
Examples
`<a><b><c></c></a>`

Log:

```
Expected closing tag "</b>" but received "</a>" instead.
```

AST:
```
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="b"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="c"
  * XmlStartTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="c"
  * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"
```
Here `c` is a valid XML element; however, `b` is not, because the next closest end tag is `</a>`. `a` is also not a valid XML element, because the next closest closing tag `</c>` is already matched with `<c>`, and the next closing tag after `</c>` (`</a>`) is matched with `b`. This results in `a` not having a closing tag.

It can be represented visually like so:
On the other hand, with this:

`<a><b></c></a>`

Log:

```
Expected closing tag "</b>" but received "</c>" instead.
```

AST:
```
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="b"
  * XmlStartTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="c"
  * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"
```
`a` is a valid element, while `b` and `c` are parsed into a single invalid `DocXmlElement` with mismatched tag names.
`<a><b><c/></a>`

Log:

```
Expected closing tag "</b>" but received "</a>" instead.
```

AST:
```
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="b"
  * XmlStartTag_ClosingDelimiter=">"
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="c"
  * XmlStartTag_ClosingDelimiter="/>"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"
```
In this example, `a` is no longer considered a valid XML element. `<b>` matches with the next closest closing tag, `</a>`. `<a>` looks for another closing tag; it can't find one, so it becomes a loose XML start tag. `<c/>` is a valid self-closing element.

In cases where the element is not large enough to create a meaningful `DocXmlElement`, the text is simply treated as `DocErrorText`.
With this, I've been able to get a good amount of the syntax highlighting for loose nodes working in various situations. I look forward to hearing your thoughts and feedback on this.
When I was fixing the duplicate attribute problem in the parser, an idea popped into my head: I think we could improve the way we store attributes by using maps instead of arrays. From a developer perspective, if I want to access an attribute, I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using `Array.find()`.
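The idea can be sketched as a small lookup built on top of the existing array. The names below are hypothetical, not the real `DocXmlElement` surface:

```typescript
interface XmlAttributeLike { name: string; value: string }

// Hypothetical secondary API: build a Map once so consumers can look up an
// attribute by name without repeating Array.prototype.find() boilerplate.
class XmlAttributes {
  private readonly _byName: Map<string, string>;

  public constructor(attributes: ReadonlyArray<XmlAttributeLike>) {
    this._byName = new Map(attributes.map((a) => [a.name, a.value]));
  }

  public tryGetValue(name: string): string | undefined {
    return this._byName.get(name);
  }
}
```

Keeping the array as the primary model preserves attribute ordering for roundtripping, while the map serves as a convenience view.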
It's a helpful API, but I would suggest exposing it as a secondary API, something like … We could convert the main model to be …

In other words …
(let us know when this PR is ready for another look)
tsdoc/src/nodes/DocXmlElement.ts

```ts
private static _isValidXmlName(name: string): boolean {
  return /^[a-zA-Z_][a-zA-Z0-9_.-]*$/.test(name);
}
```
`.` should not be allowed in attribute names. At least, this wasn't permitted in the previous doc, and the parser doesn't accept it because it relies on `StringChecks.explainIfInvalidXmlName()`.

(The specs for XML/HTML/CommonMark do allow many other unusual characters, but for TSDoc we've taken a minimalist approach of only allowing syntaxes with a compelling justification. This is partly to make it more intuitive for humans to predict how TSDoc will get parsed, and partly to avoid difficult ambiguity/escaping edge cases caused by combining the grammars of JSDoc, XML, and markdown.)
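A hedged sketch of the minimalist validation being discussed (the regex below is only illustrative; the parser's actual rules live in `StringChecks.explainIfInvalidXmlName()`):

```typescript
// Illustrative minimalist check: a name starts with a letter or "_" and may
// contain letters, digits, "_" and "-", but no "." per the discussion above.
// The authoritative rules are in StringChecks.explainIfInvalidXmlName();
// this is only an approximation for illustration.
function isValidXmlAttributeName(name: string): boolean {
  return /^[a-zA-Z_][a-zA-Z0-9_-]*$/.test(name);
}
```

Compare this with the broader regex in the diff above, which additionally admits `.`.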
I will replace this RegExp with `StringChecks.explainIfInvalidXmlName()`.
@suneettipirneni I greatly appreciate your patience with this long review. 😇🙏 I think this feature is ready to merge, except for a couple of final questions:

1. Backwards compatibility

I will do this work and push it to your branch.

2. Nested content

The handling of nested HTML content has changed with this PR. The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim … Nonetheless, the example in the above screenshots is probably counterintuitive for end users. I want to think a bit about how to handle this without adding nontrivial additional work to your PR. If you have any ideas, let me know. @zelliott @adventure-yunfei @AlexJerabek @dzearing @chengcyber
🤔 Maybe what we need is a … "If your tool doesn't implement this XML element name, it's safe to process the element's content as if the element's tags were removed."

Here's a more complicated example:

```ts
/**
 * <notebox>
 * This is some <b>sample text</b>.
 * <table-of-contents>
 * <nav-node-name>My API Item</nav-node-name>
 * <category>ui-controls</category>
 * </table-of-contents>
 * </notebox>
 */
```

...which ideally might be rendered as:
In this example, …

In …
Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc itself? I'm under the impression that TSDoc doesn't contain any rendering logic itself. So if the above behavior is unintuitive, wouldn't that mean fixing how the TSDoc website handles certain XML elements? With the …
To some degree this option feels a bit out of place to me. My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions about the rendered representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a …

I also fear this field may be confusing for consumers of TSDoc, especially in cases where additional data like attributes play a key role. If an …

```ts
for (const node of nodes) {
  switch (node.kind) {
    case DocNodeKind.XmlElement:
      const xmlNode = node as DocXmlElement;
      if (xmlNode.contentContainer) {
        // Logic to resolve the inner plain text(s) and convert them to strings:
        const text = /* ... */;
        output.append(text);
        break;
      }
      // Otherwise render recognized XML elements
  }
}
```

While this code is ultimately a logic error by the programmer (you could argue that custom rendered elements should be checked first), it may be caused by the somewhat valid assumption that …

As for the playground, I think having it render common XML tags would be ideal. Or it could do what Monaco does and render any XML inner text as plain text, with the exception of elements like …
Yes. Here are some examples of various possible "renderers" of TSDoc:

Importantly, multiple different tools may need to process the same code comment, which is why interoperability between tools is a key requirement for TSDoc.

Yes. But TSDoc does provide a config file that allows users to define custom tags (both JSDoc tags and XML tags). This configuration enables the parser to report syntax errors about undefined tags, and (for JSDoc tags at least) it distinguishes "undefined" tags versus "unsupported" tags:

TSDoc's own standard tags are all "defined" by default, whereas most tools will only "support" a subset of them.

Well, the original question was: "If a tool doesn't implement an XML element, what should it do?" The simple, safe answer is to ignore the element and all of its content. But suppose I invent a custom tag …
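The two fallback policies being debated here can be sketched with simplified node types (not the tsdoc API; the names are illustrative). "unwrap" processes an unknown element's content as if its tags were removed, while "skip" ignores the element and all of its content:

```typescript
interface RenderElement { name: string; children: RenderNode[] }
type RenderNode = string | RenderElement;

// Sketch of the two fallback policies for unsupported elements:
// "unwrap" renders the content as if the unknown element's tags were removed;
// "skip" drops the element and everything inside it.
function render(nodes: RenderNode[], supported: Set<string>, policy: "unwrap" | "skip"): string {
  let output = "";
  for (const node of nodes) {
    if (typeof node === "string") {
      output += node;
    } else if (supported.has(node.name)) {
      output += `<${node.name}>` + render(node.children, supported, policy) + `</${node.name}>`;
    } else if (policy === "unwrap") {
      output += render(node.children, supported, policy); // tags removed, content kept
    }
    // policy === "skip": ignore the element and all of its content
  }
  return output;
}
```

For a tool that supports `<b>` but not `<notebox>`, "unwrap" preserves the sample text, whereas "skip" would silently drop it; which default is safer depends on the element's intended meaning.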
Closes #138
This PR adds the ability for tsdoc to parse XML elements. For example the following input:
Would yield this AST:
Behavior
Unlike the current version of the parser, start and end tags alone aren't considered valid components. Take the following input:

Currently this input will result in a `DocHtmlStartTag` for `<a>`, `DocPlainText` for `b`, and a `DocHtmlEndTag` for `</c>`. Each of these nodes viewed in isolation is valid syntax; however, when parsed together as a unit they represent invalid XML. Thus, with the new XML parser this example would be considered erroneous.

As for its representation within the AST, the given example would not have anything resembling a `DocXmlElement`; instead it would just be `DocErrorText`. Once again, this is because the input as a whole cannot be considered valid XML, and therefore it's not constructed in any way as an XML node.

That being said, I think my current implementation can be improved to provide more text even in erroneous code.
Some Possible Changes
To align with the XML spec, XML nodes can have one of two kinds of child nodes. However, in writing the parser I was unsure about a possible third case: `SoftBreak`. Softbreaks are valid children within XML elements. Currently softbreaks aren't parsed as individual nodes, but as part of a `PlainText` node.

For example, the input:
Would generate the following AST:
Alternatively the AST could be:
I'd love to hear feedback on which approach is preferred here.
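The second alternative (splitting softbreaks out of plain text) can be sketched like this, using hypothetical types rather than the tsdoc API:

```typescript
type InlineNode =
  | { kind: "PlainText"; text: string }
  | { kind: "SoftBreak" };

// Sketch of the alternative AST shape: rather than keeping a newline inside
// a single PlainText node, split the text so each line becomes a PlainText
// node and each newline becomes a distinct SoftBreak node.
function splitSoftBreaks(text: string): InlineNode[] {
  const nodes: InlineNode[] = [];
  const lines = text.split("\n");
  lines.forEach((line, index) => {
    if (line.length > 0) {
      nodes.push({ kind: "PlainText", text: line });
    }
    if (index < lines.length - 1) {
      nodes.push({ kind: "SoftBreak" }); // one SoftBreak per newline
    }
  });
  return nodes;
}
```

The tradeoff is that visitors processing text must now reassemble lines, while renderers that treat softbreaks specially (e.g. converting them to spaces or `<br/>`) get them for free.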
Additionally, XML has special escape characters. Should we parse something like the following:

As …

I would also appreciate feedback on this as well.