
Add XML Parsing with DocXmlElement #331

Open

wants to merge 32 commits into base: main

Conversation


@suneettipirneni suneettipirneni commented Aug 3, 2022

Closes #138

This PR adds the ability for tsdoc to parse XML elements. For example the following input:

/**
 * <warning>
 *  This is a <bold>destructive</bold> operation.
 * </warning>
 */

Would yield this AST:

- Comment
  - Section
    - Paragraph
      - XmlElement
        * XmlStartTag_OpeningDelimiter="<"
        * XmlElement_Name="warning"
        * XmlStartTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="this is a "
        - XmlElement
          * XmlStartTag_OpeningDelimiter="<"
          * XmlElement_Name="bold"
          * XmlStartTag_ClosingDelimiter=">"
          - PlainText
            * PlainText="destructive"
          * XmlEndTag_OpeningDelimiter="</"
          * XmlEndTag_Name="bold"
          * XmlEndTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="operation."
        * XmlEndTag_OpeningDelimiter="</"
        * XmlEndTag_Name="warning"
        * XmlEndTag_ClosingDelimiter=">"

Behavior

Unlike the current version of the parser, start and end tags alone aren't considered valid components. Take the following input:

<a>b</c>

Currently this input will result in a DocHtmlStartTag for <a>, DocPlainText for b, and DocHtmlEndTag for </c>. Each of these nodes, viewed in isolation, is valid syntax; however, when parsed together as a unit they represent invalid XML. Thus, with the new XML parser, this example would be considered erroneous:

Expecting closing tag name to match opening tag name, got "c" but expected "a"

As for its representation within the AST, the given example would not contain anything resembling a DocXmlElement; instead it would just be DocErrorText. Once again, this is because the input as a whole cannot be considered valid XML, so it is not constructed as an XML node in any way.

That being said, I think my current implementation can be improved to provide more text even in erroneous code.

Some Possible Changes

To align with the XML spec, XML elements can contain two kinds of child nodes:

  • PlainText
  • XmlElement

However, in writing the parser I was unsure about a possible third case: SoftBreak.

Soft breaks are valid children within XML elements. Currently soft breaks aren't parsed as individual nodes, but as part of a PlainText node.

For example, the input:

<a>b
</a>

Would generate the following AST:

- XmlElement
  - PlainText
    * PlainText="b\n"

Alternatively the AST could be:

- XmlElement
  - PlainText
    * PlainText="b"
  - SoftBreak
    * SoftBreak="\n"

I'd love to hear feedback on which approach is preferred here.

Additionally, XML has special escape characters. Should we parse something like the following:

<a>1 is &lt; 100</a>

As

- XmlElement
  - PlainText
    * PlainText="1 is < 100"

I would appreciate feedback on this as well.
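For illustration, here is a minimal sketch (not code from this PR) of what decoding the five predefined XML entities could look like:

// Minimal sketch: decode the five predefined XML entities into their
// literal characters. Names here are illustrative only.
const XML_ENTITIES: ReadonlyMap<string, string> = new Map([
  ['&lt;', '<'],
  ['&gt;', '>'],
  ['&amp;', '&'],
  ['&apos;', "'"],
  ['&quot;', '"']
]);

function decodeXmlEntities(text: string): string {
  return text.replace(/&(?:lt|gt|amp|apos|quot);/g, (entity) => XML_ENTITIES.get(entity) ?? entity);
}

// decodeXmlEntities('1 is &lt; 100') === '1 is < 100'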

@ghost

ghost commented Aug 3, 2022

CLA assistant check
All CLA requirements met.

@suneettipirneni suneettipirneni marked this pull request as ready for review August 9, 2022 13:43
@octogonz
Collaborator

@suneettipirneni I finally got some time to look at this PR tonight -- it is really good! 👍👍 Very cool that you added lots of test coverage. I need a little more time to study the logic in NodeParser.ts, but overall this PR is on the right track.

@octogonz
Collaborator

However, in writing the parser I was unsure about a possible third case: SoftBreak.

According to the current design of the AST, I believe that \n is mostly forbidden in DocPlainText:

if (DocPlainText._newlineCharacterRegExp.test(parameters.text)) {
  // Use DocSoftBreak to represent manual line splitting
  throw new Error('The DocPlainText content must not contain newline characters');
}

(I forget why this constraint was introduced. Maybe so that visitors that process DocPlainText do not have to worry about CRLF vs LF line ending differences?)
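For illustration, here is a minimal sketch (not code from this PR, though DocPlainText and DocSoftBreak are the real node classes) of how newline-containing text could be split to satisfy this constraint:

import { TSDocConfiguration, DocNode, DocPlainText, DocSoftBreak } from '@microsoft/tsdoc';

// Split text on line endings, emitting a DocSoftBreak for each newline so
// that no DocPlainText node ever contains "\n"
function splitTextIntoNodes(configuration: TSDocConfiguration, text: string): DocNode[] {
  const nodes: DocNode[] = [];
  const parts: string[] = text.split(/\r?\n/);
  for (let i = 0; i < parts.length; i++) {
    if (parts[i].length > 0) {
      nodes.push(new DocPlainText({ configuration, text: parts[i] }));
    }
    if (i < parts.length - 1) {
      nodes.push(new DocSoftBreak({ configuration }));
    }
  }
  return nodes;
}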

@octogonz
Collaborator

Additionally, XML has special escape characters. Should we parse something like the following:

Yes, in fact HTML entities are an important feature for representing arbitrary Unicode characters in an ASCII source file. CommonMark allows them basically anywhere except for code spans, not just inside an XML element.

But the problem is bigger than this PR. For example, how best to represent an HTML entity inside an attribute value? And should TSDoc allow entities inside other contexts such as {@link ____}?

The problem is not super hard, but a rigorous specification of escaping syntaxes was one of the details that I thought we should figure out before writing the TSDoc spec.

endTagNameStartMarker,
endTagnameEndMarker,
TSDocMessageId.XmlTagNameMismatch,
`Expecting closing tag name "${endTagNameExcerpt.toString()}" to match opening tag name "${startTagNameExcerpt.toString()}"`
Collaborator

In the TSDoc Playground, the representation of this error is not quite right:

[screenshot: TSDoc Playground rendering of this error]

For the purposes of syntax highlighting and roundtripping, the parser tries to ensure that every input character corresponds to some AST excerpt, without any characters getting discarded.

Some of the unit tests validate this property:

private static _getTokenCoverageGapsSnapshot(parserContext: ParserContext): string[] {
  const tokenCoverageChecker: TokenCoverageChecker = new TokenCoverageChecker(parserContext);
  return tokenCoverageChecker.getGaps(parserContext.docComment).map((x) => x.toString());
}

This raises a design question: how best to represent </c> in the AST?

  1. If it's too small of a fragment to be meaningful, then we could represent it as ErrorText="</c>".
  2. For a richer representation, we could represent it as a loose XmlElement that is missing its opening excerpts. And maybe <b> would become a loose XmlElement that is missing its closing excerpts. (<a> does NOT need to be a child of <b> in this representation.)
  3. An "intelligent" possibility would be to interpret </c> as the closing tag for <b>. So <b> is a normal nesting element, just accompanied by an error message.

I don't have a strong opinion, but I suspect that 3 is maybe overengineering, and the decision of 1-vs-2 depends on how good we want the syntax highlighting to be, and how much time we have to spend on this PR. 🙂

Author

I'll try out option 2 and see if it's feasible; it seems like the best solution in my opinion.

Author

@octogonz I'm re-reading option 3, and just to be clear: what you're proposing for that option is ErrorText=<b> instead of setting the error text to </c>? If so, I think that may be easier than option 2.

Collaborator

@octogonz octogonz Aug 12, 2022

To be clear, the parser has two ways of modeling syntax errors:

  1. DocErrorText which will get syntax highlighted as red characters, plus an accompanying error message in the logs
  2. a normal AST node such as DocXmlElement, whose properties are perhaps only partially initialized, plus an accompanying error message in the logs

The advantage of 2 is that syntax highlighting can be applied to the recognized characters, rather than having the entire span be red text. And an API consumer can still access the recognized parts of the syntax, such as the tag name. The disadvantage of 2 is that if the AST node is too degenerate, then it is awkward for consumers of the API to handle. Or the AST node may be completely incorrect (for example, if we interpreted some stray characters <y as a DocXmlElement).

It's a judgement call. There isn't a clear best answer.

just to be clear: what you're proposing for that option is ErrorText=<b> instead of setting the error text to </c>? If so, I think that may be easier than option 2.

No, what I meant for option 3 was that the excerpts <b> and </c> would belong to the same DocXmlElement. I think it's not easier, because some judgement calls are needed for less obvious inputs such as <a><b><c></c></a></b> or <a><b><c></c></a>.

(If you want to save time, for an input like <b><a></a></c>, the least effort might be to generate <b> and </c> as DocErrorText -- it's not elegant, but we could always improve it later.)

I don't have a strong preference -- it's up to you.

Author

Hi again @octogonz,

I actually went with option 3, mainly because of the behavior I observed when trying different XML parsers. Most of them properly figure out what's supposed to be the end tag for a given start tag. As for cases where the error is non-obvious, I followed the error patterns I saw in the other parsers.

For example,

<a><b><c></c></a></b>

Would yield the following AST:

- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter=">"
      * XmlEndTag_OpeningDelimiter="</"
      * XmlEndTag_Name="c"
      * XmlEndTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="b"
  * XmlEndTag_ClosingDelimiter=">"

Along with the following error in the log:
Expected closing tag "</b>" but received "</a>" instead.

A loose start tag will always try to match the next available unmatched end tag. If the end tag doesn't match, an error will be logged, and the AST for that particular element will consist of the start tag followed by the mismatched end tag. This means the parser is always biased towards valid start tags. If a start tag cannot be parsed properly, the current element cannot be a DocXmlElement.
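Roughly speaking, the matching strategy can be sketched with a stack of open start tags (a simplified illustration, not the PR's actual implementation):

interface TagToken {
  kind: 'start' | 'end' | 'selfClosing';
  name: string;
}

// Each end tag is matched against the nearest unmatched start tag; a name
// mismatch still pairs the two tags into one element, but logs an error.
function matchTags(tokens: TagToken[], log: string[]): void {
  const openTags: string[] = [];
  for (const token of tokens) {
    if (token.kind === 'start') {
      openTags.push(token.name);
    } else if (token.kind === 'end') {
      const expected: string | undefined = openTags.pop();
      if (expected === undefined) {
        log.push(`Unexpected closing tag "</${token.name}>"`);
      } else if (expected !== token.name) {
        log.push(`Expected closing tag "</${expected}>" but received "</${token.name}>" instead.`);
      }
    }
    // 'selfClosing' tags need no matching end tag
  }
}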

Examples

<a><b><c></c></a>

Log:
Expected closing tag "</b>" but received "</a>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter=">"
      * XmlEndTag_OpeningDelimiter="</"
      * XmlEndTag_Name="c"
      * XmlEndTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"

Here c is a valid XML element; however, b is not, because the next closest end tag is </a>. a is also not a valid XML element, because the next closest closing tag, </c>, is already matched with <c>, and the next closing tag after </c> (</a>) is matched with b. This results in a not having a closing tag.

It can be represented visually like so:
[diagram: tag matching for <a><b><c></c></a>]


On the other hand, with this:

<a><b></c></a>

Log:
Expected closing tag "</b>" but received "</c>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="c"
    * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"

a is a valid element, while b and c are parsed into a single invalid DocXmlElement with mismatched tag names.


<a><b><c/></a>

Log:
Expected closing tag "</b>" but received "</a>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter="/>"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"

In this example, a is no longer considered a valid XML element. <b> matches with the next closest closing tag, </a>. <a> then looks for another closing tag; since it can't find one, it becomes a loose XML start tag. <c/> is a valid self-closing element.


In cases where the element is not large enough to create a meaningful DocXmlElement, the text is simply treated as DocErrorText.

With this approach I've been able to get a good amount of the syntax highlighting for loose nodes working in various situations. I look forward to hearing your thoughts and feedback on this.

@suneettipirneni
Author

When I was fixing the duplicate attribute problem in the parser, an idea popped into my head: I think we could improve the way we store attributes by using maps instead of arrays. From a developer's perspective, if I want to access an attribute, I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find().

@octogonz
Collaborator

When I was fixing the duplicate attribute problem in the parser, an idea popped into my head: I think we could improve the way we store attributes by using maps instead of arrays. From a developer's perspective, if I want to access an attribute, I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find().

It's a helpful API, but I would suggest exposing it as a secondary API, something like DocXmlElement.tryGetXmlAttribute(name): XmlAttribute | undefined.

We could convert the main model to be xmlAttributes: Map<string, DocXmlAttribute>, but this has some downsides:

  • If we expose the Map directly, there's no way to validate items before they are added.
  • Even if it is a ReadonlyMap, often you also want to validate keys before they are looked up. (one example from this repo)
  • The order in which attributes appear can be syntactically useful to know, even if it has no semantic meaning. Map does not expose that very well.

In other words, Map is the best way to implement lookups, but it is often not the best class to expose as a public API contract.
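For example, here is a sketch of what the secondary lookup API could look like (tryGetXmlAttribute is the name suggested above; the attribute type is simplified for illustration):

interface DocXmlAttribute {
  readonly name: string;
  readonly value: string;
}

class DocXmlElement {
  private _xmlAttributesByName: Map<string, DocXmlAttribute> | undefined;

  public constructor(
    // The public contract stays an ordered, read-only array
    public readonly xmlAttributes: ReadonlyArray<DocXmlAttribute>
  ) {}

  // Secondary lookup API, backed by a lazily constructed Map
  public tryGetXmlAttribute(name: string): DocXmlAttribute | undefined {
    if (this._xmlAttributesByName === undefined) {
      this._xmlAttributesByName = new Map();
      for (const attribute of this.xmlAttributes) {
        // First occurrence wins; duplicates are reported by the parser
        if (!this._xmlAttributesByName.has(attribute.name)) {
          this._xmlAttributesByName.set(attribute.name, attribute);
        }
      }
    }
    return this._xmlAttributesByName.get(name);
  }
}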

@octogonz
Collaborator

(let us know when this PR is ready for another look)

private static _isValidXmlName(name: string): boolean {
  return /^[a-zA-Z_][a-zA-Z0-9_.-]*$/.test(name);
}
Collaborator

The "." character should not be allowed in attribute names. At least this wasn't permitted in the previous doc, and the parser doesn't accept it because it relies on StringChecks.explainIfInvalidXmlName().

(The specs for XML/HTML/CommonMark do allow many other unusual characters, but for TSDoc we've taken a minimalist approach of only allowing syntaxes with a compelling justification. This is partly to make it more intuitive for humans to predict how TSDoc will get parsed, and partly to avoid difficult ambiguity/escaping edge cases caused by combining grammars of JSDoc, XML, and markdown.)

Collaborator

I will replace this RegExp with StringChecks.explainIfInvalidXmlName()
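For reference, an explainIf-style check without "." might look like this sketch (the actual regex and message in StringChecks.explainIfInvalidXmlName() may differ):

const xmlNameRegExp: RegExp = /^[a-zA-Z_][a-zA-Z0-9_-]*$/; // note: no "."

function explainIfInvalidXmlName(name: string): string | undefined {
  if (!xmlNameRegExp.test(name)) {
    return 'An XML name must start with a letter or underscore and contain only letters, digits, hyphens, and underscores';
  }
  return undefined; // undefined means the name is valid
}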

@octogonz
Collaborator

octogonz commented Sep 14, 2022

@suneettipirneni I greatly appreciate your patience with this long review. 😇🙏

I think this feature is ready to merge, except for a couple final questions:

1. Backwards compatibility

  1. In testing, I realized that the upgraded tsdoc-config library cannot read old tsdoc.json files. This means that if several different tools are in use, mixing tsdoc-config versions would make it impossible to parse the file. The tools would all need to be upgraded together. I think we can make the upgrade experience a little easier by relaxing the schema/loader to accept the old setting names. (Then we can remove this backwards compatibility later, for example when we publish the 1.0.0 version of the package.)

I will do this work and push it to your branch.

2. Nested content

The handling of nested HTML content has changed with this PR.

Old behavior:
[screenshot: the nested content of the unrecognized element is rendered]

New behavior:
[screenshot: the nested content of the unrecognized element is not rendered]

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Nonetheless, the example in the above screenshots is probably counterintuitive for end users. I want to think a bit about how to handle this without adding nontrivial additional work to your PR. If you have any ideas, let me know.

@zelliott @adventure-yunfei @AlexJerabek @dzearing @chengcyber

@octogonz
Collaborator

octogonz commented Sep 14, 2022

The handling of nested HTML content has changed with this PR.

🤔 Maybe what we need is a tsdoc.json setting that can specify:

"If your tool doesn't implement this XML element name, it's safe to process the element's content as if the element's tags were removed."

Here's a more complicated example:

/**
 * <notebox>
 *   This is some <b>sample text</b>.
 *   <table-of-contents>
 *     <nav-node-name>My API Item</nav-node-name>
 *     <category>ui-controls</category>
 *   </table-of-contents>
 * </notebox>
 */

...which ideally might be rendered as:


[rendered result: a notebox containing "This is some sample text."]


In this example, <notebox> and <b> can be safely ignored, whereas <table-of-contents>, <nav-node-name>, and <category> are data properties whose contents should NOT be rendered.

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.
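A hypothetical tsdoc.json fragment for this idea might look like the following; note that the "xmlElements" field and "contentContainer" flag do not exist today and are purely illustrative:

{
  "$schema": "https://developer.microsoft.com/json-schemas/tsdoc/v0/tsdoc.schema.json",
  "xmlElements": {
    "notebox": { "contentContainer": true },
    "b": { "contentContainer": true },
    "table-of-contents": { "contentContainer": false },
    "nav-node-name": { "contentContainer": false },
    "category": { "contentContainer": false }
  }
}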

@zelliott

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc? I'm under the impression that TSDoc doesn't contain any rendering logic itself. So if the above behavior is unintuitive, wouldn't that mean fixing how the TSDoc website handles certain XML elements?

With the tsdoc.json setting proposal, I think you're suggesting that TSDoc can help inform a consumer which XML tags can have their contents rendered. There can be reasonable default settings (i.e. common HTML elements), but it can also be configurable. I suppose this is useful if the author of the TSDoc comments and the consumer rendering them are different people... is this a common use case?

@suneettipirneni
Author

suneettipirneni commented Sep 14, 2022

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.

To some degree this option feels a bit out of place to me. My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the rendered representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

I also fear this field may be confusing for consumers of TSDoc, especially in cases where additional data like attributes play a key role. If an <a> element is marked as a contentContainer and my custom markdown emitter automatically outputs the plain text of any content-container element, I could inadvertently skip over the href attribute.

for (const node of nodes) {
  switch (node.kind) {
    case DocNodeKind.XmlElement: {
      const xmlNode: DocXmlElement = node as DocXmlElement;
      if (xmlNode.contentContainer) {
        // Resolve the inner plain text(s) and convert them to a single
        // string (resolveInnerText is a hypothetical helper)
        const text: string = resolveInnerText(xmlNode);
        output.append(text);
        break;
      }

      // Otherwise render recognized XML elements
      break;
    }
  }
}

While this code is ultimately a logic error by the programmer (you could argue that custom-rendered elements should be checked first), it may be caused by the somewhat reasonable assumption that a is a contentContainer, since it only contains text as child elements. This could also become an issue if there's a lack of shared knowledge between the programmers writing the docs and the programmers creating emitters.

As for the playground, I think having it render common XML tags would be ideal. Or it could do what Monaco does and render any XML inner text as plain text, with the exception of elements like <style>.

@octogonz
Collaborator

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc?

Yes. Here are some examples of various possible "renderers" of TSDoc:

  • The TSDoc playground shown in those screenshots
  • An API reference website generated using API Documenter
  • A custom web app that displays API documentation by parsing TSDoc from .api.json files produced by API Extractor
  • It doesn't need to be a website: for example, a VS Code extension that shows rich tooltips by parsing TSDoc
  • It doesn't even need to render anything: for example, a lint rule that counts how many English words are in your TSDoc summary

Importantly, multiple different tools may need to process the same code comment, which is why interoperability between tools is a key requirement for TSDoc.

I'm under the impression that TSDoc doesn't contain any rendering logic itself.

Yes. But TSDoc does provide a config file that allows users to define custom tags (both JSDoc tags and XML tags). This configuration enables the parser to report syntax errors about undefined tags, and (for JSDoc tags at least) it distinguishes "undefined" tags versus "unsupported" tags:

  • undefined = not a recognized tag in the TSDoc dialect that was configured by the person who wrote the comment
  • unsupported = defined but not implemented by a particular tool that may be inspecting the file.

TSDoc's own standard tags are all "defined" by default, whereas most tools will only "support" a subset of them.
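For instance, today's tsdoc.json can already define a custom tag and declare whether a tool supports it, using the existing tagDefinitions and supportForTags settings (the tag name below is made up for illustration):

{
  "$schema": "https://developer.microsoft.com/json-schemas/tsdoc/v0/tsdoc.schema.json",
  "tagDefinitions": [
    { "tagName": "@myCustomTag", "syntaxKind": "modifier" }
  ],
  "supportForTags": {
    "@myCustomTag": true
  }
}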

My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

Well, the original question was:

"If a tool doesn't implement an XML element, what should it do?"

The simple, safe answer is to ignore the element and all of its content.

But suppose I invent a custom tag <glossary-word> and use it to mark up sentences like This is a <glossary-word>book</glossary-word>. If a tool doesn't support this tag, deleting the content would be a bad experience. Do we really need to release new versions of every tool to ensure that <glossary-word>'s content can be rendered safely?


Successfully merging this pull request may close these issues.

RFC: Replace DocHtmlStartTag and DocHtmlEndTag with DocXmlElement