
Add XML Parsing with DocXmlElement #331

Open

wants to merge 32 commits into base: main

Conversation


@suneettipirneni suneettipirneni commented Aug 3, 2022

Closes #138

This PR adds the ability for tsdoc to parse XML elements. For example the following input:

/**
 * <warning>
 *  This is a <bold>destructive</bold> operation.
 * </warning>
 */

Would yield this AST:

- Comment
  - Section
    - Paragraph
      - XmlElement
        * XmlStartTag_OpeningDelimiter="<"
        * XmlElement_Name="warning"
        * XmlStartTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="this is a "
        - XmlElement
          * XmlStartTag_OpeningDelimiter="<"
          * XmlElement_Name="bold"
          * XmlStartTag_ClosingDelimiter=">"
          - PlainText
            * PlainText="destructive"
          * XmlEndTag_OpeningDelimiter="</"
          * XmlEndTag_Name="bold"
          * XmlEndTag_ClosingDelimiter=">"
        - PlainText
          * PlainText="operation."
        * XmlEndTag_OpeningDelimiter="</"
        * XmlEndTag_Name="warning"
        * XmlEndTag_ClosingDelimiter=">"

Behavior

Unlike the current version of the parser, start and end tags alone aren't considered valid components. Take the following input:

<a>b</c>

Currently this input will result in a DocHtmlStartTag for <a>, DocPlainText for b, and DocHtmlEndTag for </c>. Each of these nodes, viewed in isolation, is valid syntax; however, when parsed together as a unit they represent invalid XML. Thus, with the new XML parser, this example would be considered erroneous:

Expecting closing tag name to match opening tag name, got "c" but expected "a"

As for its representation within the AST, the given example would not contain anything resembling a DocXmlElement; instead it would just be DocErrorText. Once again, this is because the input as a whole cannot be considered valid XML, so it is not constructed as an XML node in any way.

That being said, I think my current implementation can be improved to provide more text even in erroneous code.

Some Possible Changes

To align with the XML spec, XML elements can contain two kinds of child nodes:

  • PlainText
  • XmlElement

However, in writing the parser I was unsure about a possible third case: SoftBreak.

Soft breaks are valid children within XML elements. Currently soft breaks aren't parsed as individual nodes, but as part of a PlainText node.

For example, the input:

<a>b
</a>

Would generate the following AST:

- XmlElement
  - PlainText
    * PlainText="b\n"

Alternatively the AST could be:

- XmlElement
  - PlainText
    * PlainText="b"
  - SoftBreak
    * SoftBreak="\n"

I'd love to hear feedback on which approach is preferred here.

Additionally, XML has special escape characters. Should we parse something like the following:

<a>1 is &lt; 100</a>

As

- XmlElement
  - PlainText
    * PlainText="1 is < 100"

I would appreciate feedback on this as well.
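For illustration, here is a minimal sketch (not code from this PR) of what decoding the five predefined XML entities could look like:

// Minimal sketch: decode the five predefined XML entities into their
// literal characters. Names here are illustrative only.
const XML_ENTITIES: ReadonlyMap<string, string> = new Map([
  ['&lt;', '<'],
  ['&gt;', '>'],
  ['&amp;', '&'],
  ['&apos;', "'"],
  ['&quot;', '"']
]);

function decodeXmlEntities(text: string): string {
  return text.replace(/&(?:lt|gt|amp|apos|quot);/g, (entity) => XML_ENTITIES.get(entity) ?? entity);
}

// decodeXmlEntities('1 is &lt; 100') === '1 is < 100'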

@ghost

ghost commented Aug 3, 2022

CLA assistant check
All CLA requirements met.

@suneettipirneni suneettipirneni marked this pull request as ready for review August 9, 2022 13:43
@octogonz
Collaborator

@suneettipirneni I finally got some time to look at this PR tonight -- it is really good! 👍👍 Very cool that you added lots of test coverage. I need a little more time to study the logic in NodeParser.ts, but overall this PR is on the right track.

@octogonz
Collaborator

However, in writing the parser I was unsure about a possible third case: SoftBreak.

According to the current design of the AST, I believe that \n is mostly forbidden in DocPlainText:

if (DocPlainText._newlineCharacterRegExp.test(parameters.text)) {
  // Use DocSoftBreak to represent manual line splitting
  throw new Error('The DocPlainText content must not contain newline characters');
}

(I forget why this constraint was introduced. Maybe so that visitors that process DocPlainText do not have to worry about CRLF vs LF line ending differences?)
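For illustration, here is a minimal sketch (not code from this PR, though DocPlainText and DocSoftBreak are the real node classes) of how newline-containing text could be split to satisfy this constraint:

import { TSDocConfiguration, DocNode, DocPlainText, DocSoftBreak } from '@microsoft/tsdoc';

// Split text on line endings, emitting a DocSoftBreak for each newline so
// that no DocPlainText node ever contains "\n"
function splitTextIntoNodes(configuration: TSDocConfiguration, text: string): DocNode[] {
  const nodes: DocNode[] = [];
  const parts: string[] = text.split(/\r?\n/);
  for (let i = 0; i < parts.length; i++) {
    if (parts[i].length > 0) {
      nodes.push(new DocPlainText({ configuration, text: parts[i] }));
    }
    if (i < parts.length - 1) {
      nodes.push(new DocSoftBreak({ configuration }));
    }
  }
  return nodes;
}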

@octogonz
Collaborator

Additionally, XML has special escape characters. Should we parse something like the following:

Yes, in fact HTML entities are an important feature for representing arbitrary Unicode characters in an ASCII source file. CommonMark allows them basically anywhere except for code spans, not just inside an XML element.

But the problem is bigger than this PR. For example, how best to represent an HTML entity inside an attribute value? And should TSDoc allow entities inside other contexts such as {@link ____}?

The problem is not super hard, but a rigorous specification of escaping syntaxes was one of the details that I thought we should figure out before writing the TSDoc spec.

endTagNameStartMarker,
endTagnameEndMarker,
TSDocMessageId.XmlTagNameMismatch,
`Expecting closing tag name "${endTagNameExcerpt.toString()}" to match opening tag name "${startTagNameExcerpt.toString()}"`
Collaborator

In the TSDoc Playground, the representation of this error is not quite right:

[screenshot: TSDoc Playground rendering of this error]

For the purposes of syntax highlighting and roundtripping, the parser tries to ensure that every input character corresponds to some AST excerpt, without any characters getting discarded.

Some of the unit tests validate this property:

private static _getTokenCoverageGapsSnapshot(parserContext: ParserContext): string[] {
  const tokenCoverageChecker: TokenCoverageChecker = new TokenCoverageChecker(parserContext);
  return tokenCoverageChecker.getGaps(parserContext.docComment).map((x) => x.toString());
}

This raises a design question: how best to represent </c> in the AST?

  1. If it's too small of a fragment to be meaningful, then we could represent it as ErrorText="</c>".
  2. For a richer representation, we could represent it as a loose XmlElement that is missing its opening excerpts. And maybe <b> would become a loose XmlElement that is missing its closing excerpts. (<a> does NOT need to be a child of <b> in this representation.)
  3. An "intelligent" possibility would be to interpret </c> as the closing tag for <b>. So <b> is a normal nesting element, just accompanied by an error message.

I don't have a strong opinion, but I suspect that 3 is maybe overengineering, and the decision of 1-vs-2 depends on how good we want the syntax highlighting to be, and how much time we have to spend on this PR. 🙂

Author

I'll try out option 2 and see if it's feasible; it seems like the best solution in my opinion.

Author

@octogonz I'm re-reading option 3, and just to be clear: what you're proposing for that option is ErrorText=<b> instead of setting the error text to </c>? If so, I think that may be easier than option 2.

Collaborator

@octogonz octogonz Aug 12, 2022

To be clear, the parser has two ways of modeling syntax errors:

  1. DocErrorText which will get syntax highlighted as red characters, plus an accompanying error message in the logs
  2. a normal AST node such as DocXmlElement, whose properties are perhaps only partially initialized, plus an accompanying error message in the logs

The advantage of 2 is that syntax highlighting can be applied to the recognized characters, rather than having the entire span be red text. And an API consumer can still access the recognized parts of the syntax, such as the tag name. The disadvantage of 2 is that if the AST node is too degenerate, then it is awkward for consumers of the API to handle. Or the AST node may be completely incorrect (for example, if we interpreted some stray characters <y as a DocXmlElement).

It's a judgement call. There isn't a clear best answer.

just to be clear: what you're proposing for that option is ErrorText=<b> instead of setting the error text to </c>? If so, I think that may be easier than option 2.

No, what I meant for option 3 was that the excerpts <b> and </c> would belong to the same DocXmlElement. I think it's not easier, because some judgement calls are needed for less obvious inputs such as <a><b><c></c></a></b> or <a><b><c></c></a>.

(If you want to save time, for an input like <b><a></a></c>, the least effort might be to generate <b> and </c> as DocErrorText -- it's not elegant, but we could always improve it later.)

I don't have a strong preference -- it's up to you.

Author

Hi again @octogonz,

I actually went with option 3, mainly because of the behavior I observed when trying different XML parsers. Most of them properly figure out what's supposed to be the end tag for a given start tag. As for cases where the error is non-obvious, I followed the error patterns I saw in the other parsers.

For example,

<a><b><c></c></a></b>

Would yield the following AST:

- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter=">"
      * XmlEndTag_OpeningDelimiter="</"
      * XmlEndTag_Name="c"
      * XmlEndTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="b"
  * XmlEndTag_ClosingDelimiter=">"

Along with the following error in the log:
Expected closing tag "</b>" but received "</a>" instead.

A loose start tag will always try to match the next available unmatched end tag. If the end tag doesn't match, an error will be logged, and the AST for that particular element will consist of the start tag followed by the mismatched end tag. This means the parser is always biased towards valid start tags. If a start tag cannot be parsed properly, the current element cannot be a DocXmlElement.
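Roughly speaking, the matching strategy can be sketched with a stack of open start tags (a simplified illustration, not the PR's actual implementation):

interface TagToken {
  kind: 'start' | 'end' | 'selfClosing';
  name: string;
}

// Each end tag is matched against the nearest unmatched start tag; a name
// mismatch still pairs the two tags into one element, but logs an error.
function matchTags(tokens: TagToken[], log: string[]): void {
  const openTags: string[] = [];
  for (const token of tokens) {
    if (token.kind === 'start') {
      openTags.push(token.name);
    } else if (token.kind === 'end') {
      const expected: string | undefined = openTags.pop();
      if (expected === undefined) {
        log.push(`Unexpected closing tag "</${token.name}>"`);
      } else if (expected !== token.name) {
        log.push(`Expected closing tag "</${expected}>" but received "</${token.name}>" instead.`);
      }
    }
    // 'selfClosing' tags need no matching end tag
  }
}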

Examples

<a><b><c></c></a>

Log:
Expected closing tag "</b>" but received "</a>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter=">"
      * XmlEndTag_OpeningDelimiter="</"
      * XmlEndTag_Name="c"
      * XmlEndTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"

Here c is a valid XML element; however, b is not, because the next closest end tag is </a>. a is also not a valid XML element, because the next closest closing tag, </c>, is already matched with <c>, and the next closing tag after </c> (</a>) is matched with b. This results in a not having a closing tag.

It can be represented visually like so:
[diagram: tag matching for <a><b><c></c></a>]


On the other hand, with this:

<a><b></c></a>

Log:
Expected closing tag "</b>" but received "</c>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="c"
    * XmlEndTag_ClosingDelimiter=">"
  * XmlEndTag_OpeningDelimiter="</"
  * XmlEndTag_Name="a"
  * XmlEndTag_ClosingDelimiter=">"

a is a valid element, while b and c are parsed into a single invalid DocXmlElement with mismatched tag names.


<a><b><c/></a>

Log:
Expected closing tag "</b>" but received "</a>" instead.

AST:
- XmlElement
  * XmlStartTag_OpeningDelimiter="<"
  * XmlElement_Name="a"
  * XmlStartTag_ClosingDelimiter=">"
  - XmlElement
    * XmlStartTag_OpeningDelimiter="<"
    * XmlElement_Name="b"
    * XmlStartTag_ClosingDelimiter=">"
    - XmlElement
      * XmlStartTag_OpeningDelimiter="<"
      * XmlElement_Name="c"
      * XmlStartTag_ClosingDelimiter="/>"
    * XmlEndTag_OpeningDelimiter="</"
    * XmlEndTag_Name="a"
    * XmlEndTag_ClosingDelimiter=">"

In this example, a is no longer considered a valid XML element. <b> matches with the next closest closing tag, </a>. <a> then looks for another closing tag; since it can't find one, it becomes a loose XML start tag. <c/> is a valid self-closing element.


In cases where the element is not large enough to create a meaningful DocXmlElement, the text is simply treated as DocErrorText.

With this approach I've been able to get a good amount of the syntax highlighting for loose nodes working in various situations. I look forward to hearing your thoughts and feedback on this.

@suneettipirneni
Author

When I was fixing the duplicate attribute problem in the parser, an idea popped into my head: I think we could improve the way we store attributes by using maps instead of arrays. From a developer's perspective, if I want to access an attribute, I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find().

@octogonz
Collaborator

When I was fixing the duplicate attribute problem in the parser, an idea popped into my head: I think we could improve the way we store attributes by using maps instead of arrays. From a developer's perspective, if I want to access an attribute, I'd usually already know the key for the value I want. Storing attributes in a map would allow developers to skip the boilerplate of using Array.find().

It's a helpful API, but I would suggest exposing it as a secondary API, something like DocXmlElement.tryGetXmlAttribute(name): XmlAttribute | undefined.

We could convert the main model to be xmlAttributes: Map<string, DocXmlAttribute>, but this has some downsides:

  • If we expose the Map directly, there's no way to validate items before they are added.
  • Even if it is a ReadonlyMap, often you also want to validate keys before they are looked up. (one example from this repo)
  • The order in which attributes appear can be syntactically useful to know, even if it has no semantic meaning. Map does not expose that very well.

In other words, Map is the best way to implement lookups, but it is often not the best class to expose as a public API contract.
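For example, here is a sketch of what the secondary lookup API could look like (tryGetXmlAttribute is the name suggested above; the attribute type is simplified for illustration):

interface DocXmlAttribute {
  readonly name: string;
  readonly value: string;
}

class DocXmlElement {
  private _xmlAttributesByName: Map<string, DocXmlAttribute> | undefined;

  public constructor(
    // The public contract stays an ordered, read-only array
    public readonly xmlAttributes: ReadonlyArray<DocXmlAttribute>
  ) {}

  // Secondary lookup API, backed by a lazily constructed Map
  public tryGetXmlAttribute(name: string): DocXmlAttribute | undefined {
    if (this._xmlAttributesByName === undefined) {
      this._xmlAttributesByName = new Map();
      for (const attribute of this.xmlAttributes) {
        // First occurrence wins; duplicates are reported by the parser
        if (!this._xmlAttributesByName.has(attribute.name)) {
          this._xmlAttributesByName.set(attribute.name, attribute);
        }
      }
    }
    return this._xmlAttributesByName.get(name);
  }
}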

@octogonz
Collaborator

(let us know when this PR is ready for another look)

private static _isValidXmlName(name: string): boolean {
  return /^[a-zA-Z_][a-zA-Z0-9_.-]*$/.test(name);
}
Collaborator

The "." character should not be allowed in attribute names. At least this wasn't permitted in the previous doc, and the parser doesn't accept it because it relies on StringChecks.explainIfInvalidXmlName().

(The specs for XML/HTML/CommonMark do allow many other unusual characters, but for TSDoc we've taken a minimalist approach of only allowing syntaxes with a compelling justification. This is partly to make it more intuitive for humans to predict how TSDoc will get parsed, and partly to avoid difficult ambiguity/escaping edge cases caused by combining grammars of JSDoc, XML, and markdown.)

Collaborator

I will replace this RegExp with StringChecks.explainIfInvalidXmlName()
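For reference, an explainIf-style check without "." might look like this sketch (the actual regex and message in StringChecks.explainIfInvalidXmlName() may differ):

const xmlNameRegExp: RegExp = /^[a-zA-Z_][a-zA-Z0-9_-]*$/; // note: no "."

function explainIfInvalidXmlName(name: string): string | undefined {
  if (!xmlNameRegExp.test(name)) {
    return 'An XML name must start with a letter or underscore and contain only letters, digits, hyphens, and underscores';
  }
  return undefined; // undefined means the name is valid
}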

@octogonz
Collaborator

octogonz commented Sep 14, 2022

@suneettipirneni I greatly appreciate your patience with this long review. 😇🙏

I think this feature is ready to merge, except for a couple final questions:

1. Backwards compatibility

  1. In testing, I realized that the upgraded tsdoc-config library cannot read old tsdoc.json files. This means that if several different tools are in use, mixing tsdoc-config versions would make it impossible to parse the file. The tools would all need to be upgraded together. I think we can make the upgrade experience a little easier by relaxing the schema/loader to accept the old setting names. (Then we can remove this backwards compatibility later, for example when we publish the 1.0.0 version of the package.)

I will do this work and push it to your branch.

2. Nested content

The handling of nested HTML content has changed with this PR.

Old behavior:
[screenshot: the nested content of the unrecognized element is rendered]

New behavior:
[screenshot: the nested content of the unrecognized element is not rendered]

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Nonetheless, the example in the above screenshots is probably counterintuitive for end users. I want to think a bit about how to handle this without adding nontrivial additional work to your PR. If you have any ideas, let me know.

@zelliott @adventure-yunfei @AlexJerabek @dzearing @chengcyber

@octogonz
Collaborator

octogonz commented Sep 14, 2022

The handling of nested HTML content has changed with this PR.

🤔 Maybe what we need is a tsdoc.json setting that can specify:

"If your tool doesn't implement this XML element name, it's safe to process the element's content as if the element's tags were removed."

Here's a more complicated example:

/**
 * <notebox>
 *   This is some <b>sample text</b>.
 *   <table-of-contents>
 *     <nav-node-name>My API Item</nav-node-name>
 *     <category>ui-controls</category>
 *   </table-of-contents>
 * </notebox>
 */

...which ideally might be rendered as:


[rendered result: a notebox containing "This is some sample text."]


In this example, <notebox> and <b> can be safely ignored, whereas <table-of-contents>, <nav-node-name>, and <category> are data properties whose contents should NOT be rendered.

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.
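A hypothetical tsdoc.json fragment for this idea might look like the following; note that the "xmlElements" field and "contentContainer" flag do not exist today and are purely illustrative:

{
  "$schema": "https://developer.microsoft.com/json-schemas/tsdoc/v0/tsdoc.schema.json",
  "xmlElements": {
    "notebox": { "contentContainer": true },
    "b": { "contentContainer": true },
    "table-of-contents": { "contentContainer": false },
    "nav-node-name": { "contentContainer": false },
    "category": { "contentContainer": false }
  }
}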

@zelliott

The behavior makes sense, because the meaning of a custom tag is user-defined. The content may be suitable to render verbatim (<b>, <blockquote>) but maybe not (<style>, <meta>). A reasonable conservative policy is to not render the content unless the tag is recognized.

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc? I'm under the impression that TSDoc doesn't contain any rendering logic itself. So if the above behavior is unintuitive, wouldn't that mean fixing how the TSDoc website handles certain XML elements?

With the tsdoc.json setting proposal, I think you're suggesting that TSDoc can help inform a consumer which XML tags can have their contents rendered. There can be reasonable default settings (i.e. common HTML elements), but it can also be configurable. I suppose this is useful if the author of the TSDoc comments and the consumer rendering them are different people... is this a common use case?

@suneettipirneni
Author

suneettipirneni commented Sep 14, 2022

In tsdoc.json we would somehow define these element names and classify them as safe (e.g. contentContainer=true). And then our default settings could apply this classification to common HTML elements such as <a>, <b>, <i>, etc.

To some degree this option feels a bit out of place to me. My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the rendered representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

I also fear this field may be confusing for consumers of TSDoc, especially in cases where additional data like attributes play a key role. If an <a> element is marked as a contentContainer and my custom markdown emitter automatically outputs the plain text of any content-container element, I could inadvertently skip over the href attribute.

for (const node of nodes) {
  switch (node.kind) {
    case DocNodeKind.XmlElement: {
      const xmlNode: DocXmlElement = node as DocXmlElement;
      if (xmlNode.contentContainer) {
        // Resolve the inner plain text(s) and convert them to a single
        // string (resolveInnerText is a hypothetical helper)
        const text: string = resolveInnerText(xmlNode);
        output.append(text);
        break;
      }

      // Otherwise render recognized XML elements
      break;
    }
  }
}

While this code is ultimately a logic error by the programmer (you could argue that custom-rendered elements should be checked first), it may be caused by the somewhat reasonable assumption that a is a contentContainer, since it only contains text as child elements. This could also become an issue if there's a lack of shared knowledge between the programmers writing the docs and the programmers creating emitters.

As for the playground, I think having it render common XML tags would be ideal. Or it could do what Monaco does and render any XML inner text as plain text, with the exception of elements like <style>.

@octogonz
Collaborator

Am I correct in understanding that this is the behavior of the TSDoc website (i.e. whatever's actually rendering TSDoc), not TSDoc?

Yes. Here are some examples of various possible "renderers" of TSDoc:

  • The TSDoc playground shown in those screenshots
  • An API reference website generated using API Documenter
  • A custom web app that displays API documentation by parsing TSDoc from .api.json files produced by API Extractor
  • It doesn't need to be a website: for example, a VS Code extension that shows rich tooltips by parsing TSDoc
  • It doesn't even need to render anything: for example, a lint rule that counts how many English words are in your TSDoc summary

Importantly, multiple different tools may need to process the same code comment, which is why interoperability between tools is a key requirement for TSDoc.

I'm under the impression that TSDoc doesn't contain any rendering logic itself.

Yes. But TSDoc does provide a config file that allows users to define custom tags (both JSDoc tags and XML tags). This configuration enables the parser to report syntax errors about undefined tags, and (for JSDoc tags at least) it distinguishes "undefined" tags versus "unsupported" tags:

  • undefined = not a recognized tag in the TSDoc dialect that was configured by the person who wrote the comment
  • unsupported = defined but not implemented by a particular tool that may be inspecting the file.

TSDoc's own standard tags are all "defined" by default, whereas most tools will only "support" a subset of them.
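For instance, today's tsdoc.json can already define a custom tag and declare whether a tool supports it, using the existing tagDefinitions and supportForTags settings (the tag name below is made up for illustration):

{
  "$schema": "https://developer.microsoft.com/json-schemas/tsdoc/v0/tsdoc.schema.json",
  "tagDefinitions": [
    { "tagName": "@myCustomTag", "syntaxKind": "modifier" }
  ],
  "supportForTags": {
    "@myCustomTag": true
  }
}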

My main dilemma is whether tsdoc should be making these distinctions or the renderers themselves. Renderers already make many implementation-specific decisions on the representation of a given doc node. Shouldn't they also be responsible for making the distinction of what's a contentContainer node and what isn't?

Well, the original question was:

"If a tool doesn't implement an XML element, what should it do?"

The simple, safe answer is to ignore the element and all of its content.

But suppose I invent a custom tag <glossary-word> and use it to mark up sentences like This is a <glossary-word>book</glossary-word>. If a tool doesn't support this tag, deleting the content would be a bad experience. Do we really need to release new versions of every tool to ensure that <glossary-word>'s content can be rendered safely?


Successfully merging this pull request may close these issues.

RFC: Replace DocHtmlStartTag and DocHtmlEndTag with DocXmlElement