Skip to content

Commit

Permalink
fix, feat, docs: improve sax parser entity handling (#3265)
Browse files Browse the repository at this point in the history
**What problem is this PR intended to solve?**

#1926 described an issue wherein the SAX parser was not correctly
resolving and replacing internal entities, and was instead reporting an
error for each entity reference. This PR includes a fix for that
problem.

I've removed the unnecessary "SAX tuple" from the SAX implementation,
replacing it with the `_private` struct member that libxml2 makes
available. Then I set up the parser context structs so that we can use
libxml2's standard SAX callbacks where they're useful (which is how I
addressed the above issue).

This PR also introduces a new feature, a SAX handler callback
`Document#reference` which allows callers to get entity-specific name
and replacement text information (rather than relying on the
`Document#characters` callback). This can be used to solve the original
issue in #1926 with this code: searls/eiwa#11

The behavior of the SAX parser with respect to entities is complex
enough that I wrote up a short doc in the `XML::SAX::Document` docstring
with a table and explanation. I've also added warnings to remind users
that `#replace_entities` is not safe to set when parsing untrusted
documents.

In the Java implementation, I've fixed the `#replace_entities` option in
the SAX parser context and set it to the proper default (`false`),
fixing #614. I've also corrected the value of the URI argument to
`Document#start_element_namespace` which was a blank string when it
should have been `nil`.

I've added quite a bit of testing around the SAX parser's handling of
entities.

I added and clarified quite a bit of documentation around SAX parsing
generally. Exception messages have been clarified in a couple of places,
and made consistent between the C and Java implementations. This should
address questions asked in issues #1500 and #1284.

Finally, I cleaned up some of the C code that implements SAX parsing,
naming functions more explicitly (and moving towards some kind of
standard naming convention).

Closes #1926.
Closes #614.


**Have you included adequate test coverage?**

Yes!


**Does this change affect the behavior of either the C or the Java
implementations?**

Yes, but the implementations are much more consistent with each other
now.
  • Loading branch information
flavorjones authored Jul 2, 2024
2 parents 7d5ef68 + ca5480e commit 1486c81
Show file tree
Hide file tree
Showing 24 changed files with 1,253 additions and 729 deletions.
14 changes: 6 additions & 8 deletions .rubocop_todo.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
# This configuration was generated by
# `rubocop --auto-gen-config --exclude-limit 50`
# on 2024-02-21 16:15:49 UTC using RuboCop version 1.60.2.
# on 2024-07-02 02:15:28 UTC using RuboCop version 1.64.1.
# The point is for the user to remove these configuration records
# one by one as the offenses are removed from the code base.
# Note that changes in the inspected code, or installation of new
Expand Down Expand Up @@ -55,10 +55,9 @@ Lint/RedundantRequireStatement:
- 'lib/nokogiri/css/parser_extras.rb'
- 'test/helper.rb'

# Offense count: 6
# Offense count: 5
Lint/ReturnInVoidContext:
Exclude:
- 'lib/nokogiri/css/parser_extras.rb'
- 'lib/nokogiri/html4/document.rb'
- 'lib/nokogiri/html4/document_fragment.rb'
- 'lib/nokogiri/html5/document_fragment.rb'
Expand All @@ -70,7 +69,7 @@ Lint/SelfAssignment:
Exclude:
- 'test/xml/test_xpath.rb'

# Offense count: 17
# Offense count: 19
# This cop supports safe autocorrection (--autocorrect).
# Configuration parameters: EnforcedStyle.
# SupportedStyles: strict, consistent
Expand All @@ -80,7 +79,7 @@ Lint/SymbolConversion:
- 'test/html5/test_encoding.rb'
- 'test/html5/test_serialize.rb'

# Offense count: 86
# Offense count: 82
# This cop supports safe autocorrection (--autocorrect).
# Configuration parameters: EnforcedStyle.
# SupportedStyles: lowercase, uppercase
Expand All @@ -103,7 +102,6 @@ Naming/HeredocDelimiterCase:
- 'test/xml/test_element_decl.rb'
- 'test/xml/test_entity_decl.rb'
- 'test/xml/test_namespace.rb'
- 'test/xml/test_node_attributes.rb'
- 'test/xml/test_node_encoding.rb'
- 'test/xml/test_node_reparenting.rb'
- 'test/xml/test_node_set.rb'
Expand Down Expand Up @@ -151,7 +149,7 @@ Style/EmptyBlockParameter:

# Offense count: 14
# This cop supports safe autocorrection (--autocorrect).
# Configuration parameters: EnforcedStyle.
# Configuration parameters: AutoCorrect, EnforcedStyle.
# SupportedStyles: compact, expanded
Style/EmptyMethod:
Exclude:
Expand All @@ -160,7 +158,7 @@ Style/EmptyMethod:
- 'test/xml/test_node_set.rb'
- 'test/xml/test_xpath.rb'

# Offense count: 52
# Offense count: 53
# This cop supports safe autocorrection (--autocorrect).
Style/Encoding:
Enabled: false
Expand Down
7 changes: 7 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,19 +17,23 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA

### Added

* Introduce support for a new SAX callback `XML::SAX::Document#reference`, which is called to report some parsed XML entities when `SAX::ParserContext#replace_entities` is set to the default value `false`. This is necessary functionality for some applications that were previously relying on incorrect entity error reporting which has been fixed (see below). For more information, read the docs for `Nokogiri::XML::SAX::Document`. [#1926] @flavorjones
* [CRuby] `Nokogiri::HTML5::Builder` is similar to `HTML4::Builder` but returns an `HTML5::Document`. [#3119] @flavorjones
* [CRuby] Attributes in an HTML5 document can be serialized individually, something that has always been supported by the HTML4 serializer. [#3125, #3127] @flavorjones
* [CRuby] Introduce a compile-time option, `--disable-xml2-legacy`, to remove from libxml2 its dependencies on `zlib` and `liblzma` and disable implicit `HTTP` network requests. These all remain enabled by default, and are present in the precompiled native gems. This option is a precursor for removing these libraries in a future major release, but may be interesting for the security-minded who do not need features like automatic decompression and would like to remove these dependencies. You can read more and give feedback on these plans in #3168. [#3247] @flavorjones


### Improved

* Documentation has been improved for `CSS.xpath_for`. [#3224] @flavorjones
* Documentation for the SAX parsing classes has been greatly improved, including the complex entity-handling behavior. [#3265] @flavorjones
* `XML::Schema#read_memory` and `XML::RelaxNG#read_memory` are now Ruby methods that call `#from_document`. Previously these were native functions, but they were buggy on both CRuby and JRuby (but worse on JRuby) and so this is now useful, comparable in performance, and simpler code that is easier to maintain. [#2113, #2115] @flavorjones
* [CRuby] When compiling packaged libraries from source, allow users' `AR` and `LD` environment variables to set the archiver and linker commands, respectively. This augments the existing `CC` environment variable to set the compiler command. [#3165] @ziggythehamster
* [CRuby] The HTML5 parse methods accept a `:parse_noscript_content_as_text` keyword argument which will emulate the parsing behavior of a browser which has scripting enabled. [#3178, #3231] @stevecheckoway
* [CRuby] `HTML5::DocumentFragment.parse` and `.new` accept a `:context` keyword argument that is the parse context node or element name. Previously this could only be passed in as a positional argument to `.new` and not at all to `.parse`. @flavorjones
* [CRuby] The update to libxml v2.13 improves "in context" fragment parsing recovery. We removed our hacky workaround for recovery that led to silently-degraded functionality when parsing fragments with parse errors. Specifically, malformed XML fragments that used implicit namespace prefixes will now "link up" to the namespaces in the parent document or node, where previously they did not. [#2092] @flavorjones
* [CRuby] When multiple errors could be detected by the parser and there's no obvious document to save them in (for example, when parsing a document with the recovery parse option turned off), the libxml2 errors are aggregated into a single `Nokogiri::XML::SyntaxError`. Previously, only the last error recorded by libxml2 was raised, which might be misleading if it's merely a warning and not the fatal error preventing the operation. [#2562] @flavorjones
* [CRuby] The SAX parser context and handler implementation has been simplified and now takes advantage of some of libxml2's default SAX handlers for entities and DTD management. [#3265] @flavorjones


### Fixed
Expand All @@ -40,10 +44,13 @@ Nokogiri follows [Semantic Versioning](https://semver.org/), please see the [REA
* [CRuby] libgumbo (the HTML5 parser) treats reaching max-depth as EOF. This addresses a class of issues when the parser is interrupted in this way. [#3121] @stevecheckoway
* [CRuby] Update node GC lifecycle to avoid a potential memory leak with fragments in libxml 2.13.0 caused by changes in `xmlAddChild`. [#3156] @flavorjones
* [CRuby] libgumbo correctly prints nonstandard element names in error messages. [#3219] @stevecheckoway
* [CRuby] SAX parsing no longer registers errors when encountering external entity references. [#1926] @flavorjones
* [JRuby] Fixed some bugs in how `Node#attributes` handles attributes with namespaces. [#2677, #2679] @flavorjones
* [JRuby] Fix `Schema#validate` to only return the most recent Document's errors. Previously, if multiple documents were validated, this method returned the accumulated errors of all previous documents. [#1282] @flavorjones
* [JRuby] Fix `Schema#validate` to not clobber the `@errors` instance variable. [#1282] @flavorjones
* [JRuby] Empty documents fail schema validation as they should. [#783] @flavorjones
* [JRuby] SAX parsing now respects the `#replace_entities` attribute, which defaults to `false`. Previously this flag defaulted to `true` and was completely ignored. [#614] @flavorjones
* [JRuby] The SAX callback `Document#start_element_namespace` received a blank string for the URI when a namespace was not present. It now receives `nil` (as does the CRuby impl). [#3265] @flavorjones


### Changed
Expand Down
13 changes: 12 additions & 1 deletion CONTRIBUTING.md
Original file line number Diff line number Diff line change
Expand Up @@ -362,7 +362,18 @@ You can auto-format the C, Java, and Ruby code with `rake format`.

There are some pending Rubocop rules in `.rubocop_todo.yml`. If you'd like to fix them up, I will happily merge your pull request.

No, I don't want to debate any of the style choices.
For C code, naming is currently inconsistent, but I am generally moving towards some guidelines that will make stack traces more readable and usable:

- Public functions and functions bound to Ruby methods should start with `noko_` followed by the snake case class name.
- e.g., `noko_xml_sax_parser_context_...`
- Static functions (file scope) do not need the "noko" prefix, but should be named with the snake case class name.
- e.g., `xml_sax_parser_context_...`
- Ruby singleton methods should have `_s_` before the method name
- e.g., `noko_xml_sax_parser_context_s_io` for `Nokogiri::XML::SAX::ParserContext.io`
- Ruby instance methods should have `__` before the method name
- e.g., `noko_xml_sax_parser_context__line` for `Nokogiri::XML::SAX::ParserContext#line`
- Ruby attribute getters and setters should have `_get` or `_set` as a suffix
- e.g., `noko_xml_sax_parser_context__recovery_set` for `Nokogiri::XML::SAX::ParserContext#recovery=`


## How Continuous Integration ("CI") is configured
Expand Down
7 changes: 3 additions & 4 deletions ext/java/nokogiri/Html4SaxParserContext.java
Original file line number Diff line number Diff line change
Expand Up @@ -56,10 +56,9 @@ public class Html4SaxParserContext extends XmlSaxParserContext
SAXParser parser = new SAXParser();

try {
parser.setProperty(
"http://cyberneko.org/html/properties/names/elems", "lower");
parser.setProperty(
"http://cyberneko.org/html/properties/names/attrs", "lower");
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "lower");
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower");
parser.setFeature("http://cyberneko.org/html/features/report-errors", true);

// NekoHTML should not try to guess the encoding based on the meta
// tags or other information in the document. This is already
Expand Down
21 changes: 12 additions & 9 deletions ext/java/nokogiri/XmlSaxParserContext.java
Original file line number Diff line number Diff line change
Expand Up @@ -43,7 +43,7 @@ public class XmlSaxParserContext extends ParserContext

protected NokogiriHandler handler;
protected NokogiriErrorHandler errorHandler;
private boolean replaceEntities = true;
private boolean replaceEntities = false;
private boolean recovery = false;

public
Expand Down Expand Up @@ -222,9 +222,12 @@ public class XmlSaxParserContext extends ParserContext

/* TODO: how should we pass in parse options? */
ParserContext.Options options = defaultParseOptions(context);
if (replaceEntities) {
options.noEnt = true;
}

errorHandler = new NokogiriStrictErrorHandler(runtime, options.noError, options.noWarning);
handler = new NokogiriHandler(runtime, handlerRuby, errorHandler);
handler = new NokogiriHandler(runtime, handlerRuby, errorHandler, options.noEnt);

preParse(runtime, handlerRuby, handler);
parser.setContentHandler(handler);
Expand All @@ -233,6 +236,7 @@ public class XmlSaxParserContext extends ParserContext

try {
parser.setProperty("http://xml.org/sax/properties/lexical-handler", handler);
parser.setProperty("http://xml.org/sax/properties/declaration-handler", handler);
} catch (Exception ex) {
throw runtime.newRuntimeError("Problem while creating XML SAX Parser: " + ex.toString());
}
Expand All @@ -241,16 +245,15 @@ public class XmlSaxParserContext extends ParserContext
try {
do_parse();
} catch (SAXParseException ex) {
// A bad document (<foo><bar></foo>) should call the
// error handler instead of raising a SAX exception.

// However, an EMPTY document should raise a RuntimeError.
// This is a bit kludgy, but AFAIK SAX doesn't distinguish
// between empty and bad whereas Nokogiri does.
// An EMPTY document should raise a RuntimeError. This is a bit kludgy, but AFAIK SAX
// doesn't distinguish between empty and bad whereas Nokogiri does.
String message = ex.getMessage();
if (message != null && message.contains("Premature end of file.") && stringDataSize < 1) {
throw runtime.newRuntimeError("couldn't parse document: " + message);
throw runtime.newRuntimeError("input string cannot be empty");
}

// A bad document (<foo><bar></foo>) should call the
// error handler instead of raising a SAX exception.
handler.error(ex);
}
} catch (SAXException ex) {
Expand Down
74 changes: 59 additions & 15 deletions ext/java/nokogiri/internals/NokogiriHandler.java
Original file line number Diff line number Diff line change
Expand Up @@ -42,27 +42,26 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler

private Locator locator;
private boolean needEmptyAttrCheck;
private boolean replaceEntities;
private Set<String> entities = new HashSet<String>();

public
NokogiriHandler(Ruby runtime, IRubyObject object, NokogiriErrorHandler errorHandler)
NokogiriHandler(Ruby runtime,
IRubyObject object,
NokogiriErrorHandler errorHandler,
boolean replaceEntities)
{
assert object != null;
this.runtime = runtime;
this.attrClass = (RubyClass) runtime.getClassFromPath("Nokogiri::XML::SAX::Parser::Attribute");
this.object = object;
this.errorHandler = errorHandler;
this.replaceEntities = replaceEntities;
charactersBuilder = new StringBuilder();
String objectName = object.getMetaClass().getName();
if ("Nokogiri::HTML4::SAX::Parser".equals(objectName)) { needEmptyAttrCheck = true; }
}

@Override
public void
skippedEntity(String skippedEntity)
{
call("error", runtime.newString("Entity '" + skippedEntity + "' not defined\n"));
}

@Override
public void
setDocumentLocator(Locator locator)
Expand All @@ -88,7 +87,7 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler
public void
endDocument()
{
populateCharacters();
flushCharacters();
call("end_document");
}

Expand Down Expand Up @@ -161,12 +160,12 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler
}

if (localName == null || localName.isEmpty()) { localName = getLocalPart(qName); }
populateCharacters();
flushCharacters();
call("start_element_namespace",
stringOrNil(runtime, localName),
rubyAttr,
stringOrNil(runtime, getPrefix(qName)),
stringOrNil(runtime, uri),
uri.length() > 0 ? stringOrNil(runtime, uri) : runtime.getNil(),
rubyNSAttr);
}

Expand Down Expand Up @@ -204,7 +203,7 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler
public void
endElement(String uri, String localName, String qName)
{
populateCharacters();
flushCharacters();
call("end_element_namespace",
stringOrNil(runtime, localName),
stringOrNil(runtime, getPrefix(qName)),
Expand All @@ -218,19 +217,64 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler
charactersBuilder.append(ch, start, length);
}

@Override
public void
externalEntityDecl(java.lang.String name, java.lang.String publicId, java.lang.String systemId)
throws SAXException
{
entities.add(name);
}

@Override
public void
internalEntityDecl(java.lang.String name, java.lang.String value)
throws SAXException
{
entities.add(name);
}

@Override
public void
skippedEntity(String name)
{
call("error", runtime.newString("Entity '" + name + "' not defined\n"));
if (!replaceEntities) {
call("reference", runtime.newString(name), runtime.getNil());
}
}

@Override
public void
startEntity(String name)
{
flushCharacters();
}

@Override
public void
endEntity(String name)
{
IRubyObject content = charactersBuilder.length() > 0 ? runtime.newString(charactersBuilder.toString()) :
runtime.getNil();
if (!replaceEntities && entities.contains(name)) {
call("reference", runtime.newString(name), content);
}
flushCharacters();
}

@Override
public void
comment(char[] ch, int start, int length)
{
populateCharacters();
flushCharacters();
call("comment", runtime.newString(new String(ch, start, length)));
}

@Override
public void
startCDATA()
{
populateCharacters();
flushCharacters();
}

@Override
Expand Down Expand Up @@ -329,7 +373,7 @@ public class NokogiriHandler extends DefaultHandler2 implements XmlDeclHandler
}

protected void
populateCharacters()
flushCharacters()
{
if (charactersBuilder.length() > 0) {
call("characters", runtime.newString(charactersBuilder.toString()));
Expand Down
41 changes: 22 additions & 19 deletions ext/java/nokogiri/internals/ParserContext.java
Original file line number Diff line number Diff line change
Expand Up @@ -63,7 +63,7 @@ public abstract class ParserContext extends RubyObject
Ruby ruby = context.getRuntime();

if (!(data.respondsTo("read"))) {
throw ruby.newTypeError("must respond to :read");
throw ruby.newTypeError("argument expected to respond to :read");
}

source.setByteStream(new IOInputStream(data));
Expand All @@ -80,8 +80,11 @@ public abstract class ParserContext extends RubyObject

Ruby ruby = context.getRuntime();

if (data.isNil()) {
throw ruby.newTypeError("wrong argument type nil (expected String)");
}
if (!(data instanceof RubyString)) {
throw ruby.newTypeError("must be kind_of String");
throw ruby.newTypeError("wrong argument type " + data.getMetaClass() + " (expected String)");
}

RubyString stringData = (RubyString) data;
Expand Down Expand Up @@ -179,23 +182,23 @@ public static class Options
protected static final long NOCDATA = 16384;
protected static final long NOXINCNODE = 32768;

public final boolean strict;
public final boolean recover;
public final boolean noEnt;
public final boolean dtdLoad;
public final boolean dtdAttr;
public final boolean dtdValid;
public final boolean noError;
public final boolean noWarning;
public final boolean pedantic;
public final boolean noBlanks;
public final boolean sax1;
public final boolean xInclude;
public final boolean noNet;
public final boolean noDict;
public final boolean nsClean;
public final boolean noCdata;
public final boolean noXIncNode;
public boolean strict;
public boolean recover;
public boolean noEnt;
public boolean dtdLoad;
public boolean dtdAttr;
public boolean dtdValid;
public boolean noError;
public boolean noWarning;
public boolean pedantic;
public boolean noBlanks;
public boolean sax1;
public boolean xInclude;
public boolean noNet;
public boolean noDict;
public boolean nsClean;
public boolean noCdata;
public boolean noXIncNode;

protected static boolean
test(long options, long mask)
Expand Down
Loading

0 comments on commit 1486c81

Please sign in to comment.