Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

mauke · 2023-01-29T21:10:58Z

Mojolicious version: 9.31
Perl version: v5.36.0
Operating system: Ubuntu 22.04.1 LTS

Steps to reproduce the behavior

#!/usr/bin/env perl
use v5.12.0;
use warnings;
use Test::More;
use Mojo::DOM;

my $dom = Mojo::DOM->new('if a < script then="<!--"> </script> <p>FAIL</p>-->');

is_deeply
    $dom->find('script, p')->map(sub { $_->to_string })->to_array,
    [],
    'fragment contains no tags, just a comment';

done_testing;

Expected behavior

Test passes.

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error. If the parser doesn't abort, it should treat the < character literally, as if < had been seen.

Actual behavior

not ok 1 - fragment contains no tags, just a comment
#   Failed test 'fragment contains no tags, just a comment'
#   at mojo-dom-bug-9.pl line 10.
#     Structures begin differing at:
#          $got->[0] = '<script then="&lt;!--"> </script>'
#     $expected->[0] = Does not exist
1..1
# Looks like you failed 1 test of 1.

The text was updated successfully, but these errors were encountered:

mauke · 2023-02-01T04:05:58Z

The following change would implement the correct (at least for HTML) behavior:

diff --git lib/Mojo/DOM/HTML.pm lib/Mojo/DOM/HTML.pm
index e10b1532d..81f54c014 100644
--- lib/Mojo/DOM/HTML.pm
+++ lib/Mojo/DOM/HTML.pm
@@ -36,8 +36,10 @@ my $TOKEN_RE = qr/
     |
       \?(.*?)\?                                                                # Processing Instruction
     |
-      \s*((?:\/\s*)?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)   # Tag
+      (\/?[^<>\s\/0-9.\-][^<>\s\/]*\s*(?:(?:$ATTR_RE){0,32766})*+)             # Tag
     )>
+  |
+    <\/ (?![a-z]) ([^>]*) >                                                    # Invalid-first-character-of-tag-name error (bogus comment)
   |
     (<)                                                                        # Runaway "<"
   )??
@@ -101,12 +103,15 @@ sub parse {
   my $xml     = $self->xml;
   my $current = my $tree = ['root'];
   while ($html =~ /\G$TOKEN_RE/gcso) {
-    my ($text, $doctype, $comment, $cdata, $pi, $tag, $runaway) = ($1, $2, $3, $4, $5, $6, $11);
+    my ($text, $doctype, $comment, $cdata, $pi, $tag, $bogus_comment, $runaway) = ($1, $2, $3, $4, $5, $6, $11, $12);
 
     # Text (and runaway "<")
     $text .= '<'                                 if defined $runaway;
     _node($current, 'text', html_unescape $text) if defined $text;
 
+    # Malformed end tag
+    $comment = $bogus_comment if length $bogus_comment;
+
     # Tag
     if (defined $tag) {

That is:

< foo> treated as < foo> (invalid-first-character-of-tag-name)
</ foo> treated as  (invalid-first-character-of-tag-name)
</> ignored (missing-end-tag-name)

The problem is all the tests in t/mojo/dom.t that rely on the previous behavior. There's so many, I'm not sure what to do about them.

kraih · 2023-02-01T10:05:09Z

If tests are wrong then they should be fixed.

mauke · 2023-02-02T10:25:22Z

The problem is stuff like this:

subtest 'XML name characters' => sub {
  my $dom = Mojo::DOM->new->xml(1)->parse('<Foo><1a>foo</1a></Foo>');
  is $dom->at('Foo')->text, '<1a>foo</1a>',                        'right text';
  is "$dom",                '<Foo>&lt;1a&gt;foo&lt;/1a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.a>foo</.a></Foo>');
  is $dom->at('Foo')->text, '<.a>foo</.a>',                        'right text';
  is "$dom",                '<Foo>&lt;.a&gt;foo&lt;/.a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><.>foo</.></Foo>');
  is $dom->at('Foo')->text, '<.>foo</.>',                        'right text';
  is "$dom",                '<Foo>&lt;.&gt;foo&lt;/.&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><-a>foo</-a></Foo>');
  is $dom->at('Foo')->text, '<-a>foo</-a>',                        'right text';
  is "$dom",                '<Foo>&lt;-a&gt;foo&lt;/-a&gt;</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a1>foo</a1></Foo>');
  is $dom->at('Foo a1')->text, 'foo',                     'right text';
  is "$dom",                   '<Foo><a1>foo</a1></Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<Foo><a .b -c 1>foo</a></Foo>');
  is $dom->at('Foo')->text, '<a .b -c 1>foo',                  'right text';
  is "$dom",                '<Foo>&lt;a .b -c 1&gt;foo</Foo>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<😄 😄="😄">foo</😄>');
  is $dom->at('😄')->text, 'foo',              'right text';
  is "$dom",              '<😄 😄="😄">foo</😄>', 'right result';

  $dom = Mojo::DOM->new->xml(1)->parse('<こんにちは こんにちは="こんにちは">foo</こんにちは>');
  is $dom->at('こんにちは')->text, 'foo',                              'right text';
  is "$dom",                  '<こんにちは こんにちは="こんにちは">foo</こんにちは>', 'right result';
};

It specifically tests for "incorrect" (for HTML) behavior of the parser. I don't know enough about XML to say whether this is correct for XML, but if so, you might need different tokenizers for HTML and XML. :-/

PS: In HTML mode, the correct parse for

<😄 😄="😄">foo</😄>

would be

lt;😄 😄="😄"&gt;foo<!--😄-->

kraih · 2023-02-02T10:42:34Z

I don't see the relation between < foo and <😄 😄="😄">. And i'm pretty sure the latter is valid XML, since @mojojs/dom is more strict about what code points to allow for names (based on the XML spec). A separate tokenizer is out of the question, but i don't see the harm in being a bit more relaxed with names anyway. Using the same character ranges as @mojojs/dom would of course be an option.

mauke · 2023-02-02T10:53:04Z

Recall:

After seeing <, we are in the tag open state. The semantically relevant characters that can follow are !, /, ASCII alpha, ?, and EOF. Anything else (including spaces) triggers an invalid-first-character-of-tag-name error.

Like space, 😄 is not !, /, ?, or ASCII alpha, so it is a parse error.

Consider this example:

<😄 title="<script>console.log('hi');</script>"></😄>

According to the HTML5 spec, this contains a script element because it parses like

&lt;😄 title=&quot;<script>console.log('hi');</script>&quot;&gt;<!--😄-->

But Mojo::DOM doesn't see the <script>.

This is essentially unfixable: Browsers will always see a different document structure than Mojo::DOM as long as tag names can start with non-ascii-alpha characters.

kraih · 2023-02-02T11:29:44Z

That is certainly an interesting case. 🤔

kraih · 2023-02-02T11:37:44Z

I think the HTML/XML overlap is where we draw the line with correctness, and this will remain a case we handle like it was XML. But we can still make other cases that do not conflict with XML more strict.

kraih added bug help wanted labels Jan 30, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

mauke commented Jan 29, 2023

mauke commented Feb 1, 2023

kraih commented Feb 1, 2023

mauke commented Feb 2, 2023 •

edited

Loading

kraih commented Feb 2, 2023

mauke commented Feb 2, 2023

kraih commented Feb 2, 2023

kraih commented Feb 2, 2023

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

Mojo::DOM treats "< foo" as start tag, phantom elements ensue #2031

Comments

mauke commented Jan 29, 2023

Steps to reproduce the behavior

Expected behavior

Actual behavior

mauke commented Feb 1, 2023

kraih commented Feb 1, 2023

mauke commented Feb 2, 2023 • edited Loading

kraih commented Feb 2, 2023

mauke commented Feb 2, 2023

kraih commented Feb 2, 2023

kraih commented Feb 2, 2023

mauke commented Feb 2, 2023 •

edited

Loading