Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031
Comments
The following change would implement the correct (at least for HTML) behavior:
diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
|
\?(.*?)\? # Processing Instruction
|
- \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+) # Tag
+ (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+) # Tag
)>
+ |
+ <\/ (?![a-z]) ([^>]*) > # Invalid-first-character-of-tag-name error (bogus comment)
|
(<) # Runaway "<"
)??
@@ -101,12 +103,15 @@ sub parse {
my $xml = $self->xml;
my $current = my $tree = ['root'];
while ($html =~ /\G$TOKEN_RE/gcso) {
- my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+ my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
# Text (and runaway "<")
$text .= '<' if defined $runaway;
_node($current, 'text', html_unescape $text) if defined $text;
+ # Malformed end tag
+ $comment = $bogus_comment if length $bogus_comment;
+
# Tag
if (defined $tag) {
That is:
The problem is all the tests in
If tests are wrong then they should be fixed.
The problem is stuff like this:
subtest 'XML name characters' => sub {
my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
is $dom->at('Foo')->text, '<1a>foo</1a>', 'right text';
is "$dom", '<Foo><1a>foo</1a></Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
is $dom->at('Foo')->text, '<.a>foo</.a>', 'right text';
is "$dom", '<Foo><.a>foo</.a></Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
is $dom->at('Foo')->text, '<.>foo</.>', 'right text';
is "$dom", '<Foo><.>foo</.></Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
is $dom->at('Foo')->text, '<-a>foo</-a>', 'right text';
is "$dom", '<Foo><-a>foo</-a></Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
is $dom->at('Foo a1')->text, 'foo', 'right text';
is "$dom", '<Foo><a1>foo</a1></Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
is $dom->at('Foo')->text, '<a .b -c 1>foo', 'right text';
is "$dom", '<Foo><a .b -c 1>foo</Foo>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<😄 😄="😄">foo</😄>');
is $dom->at('😄')->text, 'foo', 'right text';
is "$dom", '<😄 😄="😄">foo</😄>', 'right result';
$dom = Mojo::DOM->new->xml(1)->parse('<こんにちは こんにちは="こんにちは">foo</こんにちは>');
is $dom->at('こんにちは')->text, 'foo', 'right text';
is "$dom", '<こんにちは こんにちは="こんにちは">foo</こんにちは>', 'right result';
};
It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/
PS: In HTML mode, the correct parse for <😄 😄="😄">foo</😄> would be &lt;😄 😄="😄">foo<!--😄-->
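To illustrate why names such as "<1a" or "<😄" are not start tags in HTML mode, here is a minimal sketch of the decision the HTML tokenizer makes right after a "<". The helper name classify_after_lt and its return labels are hypothetical, made up for this example; this is not Mojo::DOM code and it ignores everything past the first character.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper (not part of Mojo::DOM): given the input that follows
# a "<", name the branch the HTML "tag open" state would take.
sub classify_after_lt {
  my ($rest) = @_;
  return 'eof' if $rest eq '';                              # "<" is emitted as text
  my $char = substr $rest, 0, 1;
  return 'markup-declaration' if $char eq '!';              # comment, DOCTYPE, CDATA
  return 'end-tag-open'       if $char eq '/';
  return 'bogus-comment'      if $char eq '?';
  return 'tag-name'           if $char =~ /\A[A-Za-z]\z/;   # ASCII alpha only
  return 'text';    # invalid-first-character-of-tag-name: "<" stays literal
}

# "\x{1F604}" is the emoji used in the test above
for my $rest ('foo>', ' foo>', '1a>', "\x{1F604}>", '/p>', '?xml?>', '') {
  printf "<%s => %s\n", $rest, classify_after_lt($rest);
}
```

Only the first case in the loop reaches the tag name state; the space, digit, and emoji cases all fall through to literal text.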
I don't see the relation between
Recall:
Like space, 😄 is not
Consider this example:
<😄 title="<script>console.log('hi');</script>"></😄>
According to the HTML5 spec, this contains a script element because it parses like
&lt;😄 title="<script>console.log('hi');</script>"><!--😄-->
But Mojo::DOM doesn't see the
This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.
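The parse described above can be reproduced with a rough sketch of HTML-style tokenization. The function sketch_tokens is hypothetical and heavily simplified: it does not handle quoted ">" inside attributes, comments, DOCTYPEs, or the special script data state. It is only meant to show where the script start tag comes from, not how Mojo::DOM or a browser is implemented.

```perl
use strict;
use warnings;

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical, simplified tokenizer: a tag name must start with an ASCII
# letter, "</" followed by anything else becomes a bogus comment, and any
# other "<" is kept as literal text.
sub sketch_tokens {
  my ($html) = @_;
  my (@tokens, $text);
  $text = '';
  my $flush = sub { push @tokens, [text => $text] if length $text; $text = '' };

  pos($html) = 0;
  while (pos($html) < length $html) {
    if ($html =~ /\G<([A-Za-z][^\s\/>]*)[^>]*>/gc) {       # start tag
      $flush->();
      push @tokens, [start => lc $1];
    }
    elsif ($html =~ /\G<\/([A-Za-z][^\s\/>]*)[^>]*>/gc) {  # end tag
      $flush->();
      push @tokens, [end => lc $1];
    }
    elsif ($html =~ /\G<\/>/gc) { }                        # "</>" is dropped
    elsif ($html =~ /\G<\/([^>]*)>/gc) {                   # bogus comment
      $flush->();
      push @tokens, [comment => $1];
    }
    elsif ($html =~ /\G(.)/gcs) { $text .= $1 }            # literal character
  }
  $flush->();
  return \@tokens;
}

# "\x{1F604}" is the emoji from the example above
my $html = qq{<\x{1F604} title="<script>console.log('hi');</script>"></\x{1F604}>};
printf "%-8s %s\n", @$_ for @{sketch_tokens($html)};
```

Under these assumptions the first "<" and the attribute-looking prefix end up as text, the "<script>" inside the quotes is tokenized as a real start tag, and the closing "</😄>" becomes a comment.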
That is certainly an interesting case. 🤔
I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.
Steps to reproduce the behavior
Expected behavior
Test passes.
After seeing "<", we are in the tag open state. The semantically relevant characters that can follow are "!", "/", ASCII alpha, "?", and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the "<" character literally, as if "&lt;" had been seen.
Actual behavior