Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: update tests to reflect new libxml2 HTML5 parsing behaviors #3308

Merged
merged 4 commits into from
Oct 8, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,7 @@ We've resolved many long-standing bugs in the various schema classes, validation
* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `XML::SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* `XML::SAX::Parser#parse_memory` and `#parse_file` now accept an optional `encoding` argument. When not provided, the parser will fall back to the encoding passed to the initializer, and then fall back to autodetection. [#3288] @flavorjones
* `XML::SAX::ParserContext.memory` now accepts an optional `encoding` argument. When not provided, the encoding will be autodetected. [#3288] @flavorjones
* `XML::DocumentFragment#parse_options` and `HTML4::DocumentFragment#parse_options` return the options used to parse the document fragment. @flavorjones
* [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
* [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
* [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones
Expand Down
1 change: 1 addition & 0 deletions lib/nokogiri/html4/document_fragment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -91,6 +91,7 @@ def initialize(document, tags = nil, ctx = nil, options = XML::ParseOptions::DEF
return self unless tags

options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
@parse_options = options
yield options if block_given?

if ctx
Expand Down
6 changes: 6 additions & 0 deletions lib/nokogiri/xml/document_fragment.rb
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,11 @@
module Nokogiri
module XML
class DocumentFragment < Nokogiri::XML::Node
# The options used to parse the document fragment. Returns the value of any options that were
# passed into the constructor as a parameter or set in a config block, else the default
# options for the specific subclass.
attr_reader :parse_options

####
# Create a Nokogiri::XML::DocumentFragment from +tags+
def self.parse(tags, options = ParseOptions::DEFAULT_XML, &block)
Expand All @@ -20,6 +25,7 @@ def initialize(document, tags = nil, ctx = nil, options = ParseOptions::DEFAULT_
return self unless tags

options = Nokogiri::XML::ParseOptions.new(options) if Integer === options
@parse_options = options
yield options if block_given?

children = if ctx
Expand Down
9 changes: 2 additions & 7 deletions test/html4/sax/test_document_error.rb
Original file line number Diff line number Diff line change
Expand Up @@ -20,15 +20,10 @@ def start_document
end

def test_warning_document_encounters_error_but_terminates_normally
# Probably I'm doing something wrong, but I can't make nekohtml report errors,
# despite setting http://cyberneko.org/html/features/report-errors.
# See https://nekohtml.sourceforge.net/settings.html for more info.
# I'd love some help here if someone finds this comment and cares enough to dig in.
skip_unless_libxml2("nekohtml sax parser does not seem to report errors?")

warning_parser = Nokogiri::HTML4::SAX::Parser.new(Nokogiri::SAX::TestCase::Doc.new)
warning_parser.parse("<html><body><<div att=")
refute_empty(warning_parser.document.errors, "error collector did not collect an error")

assert(warning_parser.document.end_document_called)
end
end
end
55 changes: 50 additions & 5 deletions test/html4/test_comments.rb
Original file line number Diff line number Diff line change
Expand Up @@ -96,7 +96,19 @@ class TestComment < Nokogiri::TestCase
let(:doc) { Nokogiri::HTML4(html) }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?
if Nokogiri.uses_libxml?(">= 2.14.0")
it "behaves as if the comment is closed immediately before the end of the input stream" do # COMPLIANT
assert_pattern do
subject => {
name: "div",
attributes: [{ name: "id", value: "under-test" }],
children: [
{ name: "comment", content: "start of unterminated comment" }
]
}
end
end
elsif Nokogiri.uses_libxml?
it "behaves as if the comment is unterminated and doesn't exist" do # NON-COMPLIANT
assert_equal 0, subject.children.length
assert_equal 1, doc.errors.length
Expand Down Expand Up @@ -132,8 +144,12 @@ class TestComment < Nokogiri::TestCase
assert_equal inner_div, subject.children[1]
assert_predicate subject.children[2], :comment?
assert_equal "bar", subject.children[2].content
assert_equal 1, doc.errors.length
assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
if Nokogiri.uses_libxml?(">= 2.14.0")
assert_empty doc.errors
else
assert_equal 1, doc.errors.length
assert_match(/Comment incorrectly closed/, doc.errors.first.to_s)
end
end
else # jruby, or libxml2 system lib less than 2.9.11
it "behaves as if the comment encompasses the inner div" do # NON-COMPLIANT
Expand Down Expand Up @@ -161,7 +177,22 @@ class TestComment < Nokogiri::TestCase
let(:body) { doc.at_css("body") }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?("= 2.9.14")
if Nokogiri.uses_libxml?(">= 2.14.0")
it "parses as comments" do # COMPLIANT
assert_pattern do
body.children => [
{
name: "div",
children: [
{ name: "comment", content: " comment <div id=do-i-exist" },
{ name: "text", content: "inner content" },
]
},
{ name: "text", content: "-->hello" },
]
end
end
elsif Nokogiri.uses_libxml?("= 2.9.14")
it "parses as PCDATA" do # NON-COMPLIANT
assert_equal 1, body.children.length
assert_equal subject, body.children.first
Expand Down Expand Up @@ -212,7 +243,21 @@ class TestComment < Nokogiri::TestCase
let(:body) { doc.at_css("body") }
let(:subject) { doc.at_css("div#under-test") }

if Nokogiri.uses_libxml?("= 2.9.14")
if Nokogiri.uses_libxml?(">= 2.14.0")
it "parses the <! tags as comments" do
assert_pattern do
body.children => [
{
name: "div", children: [
{ name: "comment", content: "[if foo]" },
{ name: "div", attributes: [{name: "id", value: "do-i-exist"}] },
{ name: "comment", content: "[endif]" },
]
}
]
end
end
elsif Nokogiri.uses_libxml?("= 2.9.14")
it "parses the <! tags as PCDATA" do
assert_equal(1, body.children.length)
assert_equal(subject, body.children.first)
Expand Down
21 changes: 11 additions & 10 deletions test/html4/test_document.rb
Original file line number Diff line number Diff line change
Expand Up @@ -363,10 +363,8 @@ def test_document_has_error
html = Nokogiri::HTML4(<<~HTML)
<html>
<body>
<div awesome="asdf>
<p>inside div tag</p>
</div>
<p>outside div tag</p>
<div>
</foo>
</body>
</html>
HTML
Expand Down Expand Up @@ -660,14 +658,15 @@ def test_capturing_nonparse_errors_during_document_clone

def test_capturing_nonparse_errors_during_node_copy_between_docs
# Errors should be emitted while parsing only, and should not change when moving nodes.
doc1 = Nokogiri::HTML4("<html><body><diva id='unique'>one</diva></body></html>")
doc2 = Nokogiri::HTML4("<html><body><dive id='unique'>two</dive></body></html>")
doc1 = Nokogiri::HTML4("<html><body><div id='unique'>one</foo1></body></html>")
doc2 = Nokogiri::HTML4("<html><body><div id='unique'>two</foo2></body></html>")
node1 = doc1.at_css("#unique")
node2 = doc2.at_css("#unique")
original_errors1 = doc1.errors.dup
original_errors2 = doc2.errors.dup
assert(original_errors1.any? { |e| e.to_s.include?("Tag diva invalid") }, "it should complain about the tag name")
assert(original_errors2.any? { |e| e.to_s.include?("Tag dive invalid") }, "it should complain about the tag name")

refute_empty(original_errors1)
refute_empty(original_errors2)

node1.add_child(node2)

Expand Down Expand Up @@ -734,6 +733,8 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
doc = Nokogiri::HTML4::Document.parse(html)
expected = if Nokogiri.jruby?
[Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::PI_NODE]
elsif Nokogiri.uses_libxml?(">= 2.14.0")
[Nokogiri::XML::Node::COMMENT_NODE, Nokogiri::XML::Node::COMMENT_NODE]
elsif Nokogiri.uses_libxml?(">= 2.10.0")
[Nokogiri::XML::Node::COMMENT_NODE]
else
Expand Down Expand Up @@ -802,7 +803,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
end

describe "read memory" do
let(:input) { "<html><body><div" }
let(:input) { "<html><body><div></foo>" }

describe "strict parsing" do
let(:parse_options) { html_strict }
Expand All @@ -824,7 +825,7 @@ def test_silencing_nonparse_errors_during_attribute_insertion_1262
end

describe "read io" do
let(:input) { StringIO.new("<html><body><div") }
let(:input) { StringIO.new("<html><body><div></foo>") }

describe "strict parsing" do
let(:parse_options) { html_strict }
Expand Down
2 changes: 1 addition & 1 deletion test/html4/test_document_encoding.rb
Original file line number Diff line number Diff line change
Expand Up @@ -148,7 +148,7 @@ def binopen(file)
end

describe "error handling" do
RAW = "<html><body><div"
RAW = "<html><body><div></foo>"

{ "read_memory" => RAW, "read_io" => StringIO.new(RAW) }.each do |flavor, input|
it "#{flavor} should handle errors" do
Expand Down
Loading