JVM | Platform | Status |
---|---|---|
OpenJDK (Temurin) Current | Linux | |
OpenJDK (Temurin) LTS | Linux | |
OpenJDK (Temurin) Current | Windows | |
OpenJDK (Temurin) LTS | Windows | |
A library that handles the processing of, and extraction of data from, validated XML. Data is processed from a standard streaming XMLReader and so doesn't require holding the entire parsed AST in memory.
- Functional API for composing together handlers for elements.
- Stream-based: Documents do not require an expensive intermediate AST in memory.
- Type-safe: Build data structures during parsing without losing type-safety.
- No reflection: No need to worry about reflective serialization-based security issues.
- Integrated with jxe for transparent, hardened, validation-enforcing SAX parsers.
- Written in pure Java 17.
- OSGi ready.
- JPMS ready.
- ISC license.
- High-coverage automated test suite.
Java provides many different ways to process XML. If you are working with XML that has a well-defined schema (as you should be), it is desirable to have a system that allows for efficiently extracting data from the XML without any kind of manual validation.
One approach, included as part of the JDK, is to use the W3C DOM package. This is possibly the worst of all of the options as the API is error-prone and requires reading the entire document into memory, parsing it as an AST, and then creating your application-specific domain objects on top. It also, by default, does not preserve lexical information like line and column numbers, so if your application needs to display an error that relates to a specific element in the XML source document... You can't. There are some rather type-unsafe ways of getting lexical information into the resulting AST, as long as you are willing to write your own implementations of the low-level content handling components.
Another approach is to use XOM. This is better, but still requires an intermediate AST to hold the parsed document before the application can make use of it, and suffers from the same limitations with regard to lexical information. There are no known methods to get lexical information into XOM ASTs.
Another approach is to use xstream. This, at the time of writing, has issues with the Java module system, and requires a dangerous amount of core reflection to work. This is a fertile ground for serialization-based security vulnerabilities. It also loses lexical information.
Another approach is to use Jakarta XML Binding. This has the advantage of generating a complete set of Java domain objects from an existing XSD schema, and providing code to parse data into those structures. Unfortunately, doing so also requires core reflection, has unclear performance characteristics with regard to requiring an intermediate AST, and loses lexical information.
Another approach, included as part of the JDK, is to use the SAX API. This has the advantages of not requiring the entire document to be parsed into memory before use, not requiring core reflection (and therefore not being subject to serialization-based security issues), and making lexical information available through the Locator interface passed to ContentHandler.setDocumentLocator. Unfortunately, the API is difficult to work with: the user is presented with a series of events published to callbacks, and must make sense of the document structure themselves.
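To make the callback style concrete, the following is a minimal sketch that uses only the JDK's SAX classes. It records the name of each element together with the line number reported by the parser's `org.xml.sax.Locator`. The class and method names (`SaxLocatorSketch`, `RecordingHandler`, `elementLines`) are invented for illustration and are not part of blackthorne.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

final class SaxLocatorSketch
{
  /** Records each element start as "localName@line". Hypothetical helper. */
  static final class RecordingHandler extends DefaultHandler
  {
    final List<String> events = new ArrayList<>();
    private Locator locator;

    @Override
    public void setDocumentLocator(final Locator newLocator)
    {
      // The parser supplies the locator before delivering other events.
      this.locator = newLocator;
    }

    @Override
    public void startElement(
      final String uri,
      final String localName,
      final String qName,
      final Attributes attributes)
    {
      // The handler only sees a flat event; nesting must be tracked by hand.
      this.events.add(localName + "@" + this.locator.getLineNumber());
    }
  }

  static List<String> elementLines(final String xml) throws Exception
  {
    final var factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    final var handler = new RecordingHandler();
    factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
    return handler.events;
  }

  public static void main(final String[] args) throws Exception
  {
    final var xml =
      "<integers xmlns=\"urn:tests\">\n  <integer>23</integer>\n  <integer>42</integer>\n</integers>";
    System.out.println(elementLines(xml));
  }
}
```

Note that nothing in the handler knows which element is the parent of which; that bookkeeping is exactly the part left to the user.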
The blackthorne package provides a small API built on top of the SAX API that allows for extracting data from XML documents by composing together handler functions that can be used to build whatever data structures are required as the document is parsed. The API is fully namespace-aware, and assumes that documents will be validated against a schema during parsing (and so users are free of the burden of performing manual validation in their own code). This appears to be the best of all worlds: it is memory-efficient, type-safe, and secure, it preserves lexical information, and it provides an API that aims for minimal suffering.
$ mvn clean verify
The core of the blackthorne API is the BTElementHandlerType interface.
An instance of BTElementHandlerType is responsible for extracting data from the current XML element (perhaps from attributes), creating child instances of the BTElementHandlerType to handle XML child elements, receiving data from those child elements, and finally returning data to whatever is the parent handler of the current handler.
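The division of responsibilities described above can be illustrated with a small, JDK-only sketch that drives a stack of handlers from SAX events. This is not how blackthorne is implemented and it is deliberately untyped for brevity; `MiniHandler`, `MiniIntHandler`, `MiniIntsHandler`, and `parse` are hypothetical names, and the sketch ignores namespaces, attributes, and error reporting.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class HandlerStackSketch
{
  /** Hypothetical, untyped stand-in for an element handler. */
  interface MiniHandler
  {
    default Map<String, Supplier<MiniHandler>> childHandlers() { return Map.of(); }
    default void onCharacters(final String text) { }
    default void onChildValue(final Object value) { }
    Object onFinished();
  }

  /** Handles a leaf element containing an integer. */
  static final class MiniIntHandler implements MiniHandler
  {
    private final StringBuilder text = new StringBuilder();
    @Override public void onCharacters(final String s) { this.text.append(s); }
    @Override public Object onFinished() { return Integer.valueOf(this.text.toString().trim()); }
  }

  /** Handles an element whose children are "integer" elements. */
  static final class MiniIntsHandler implements MiniHandler
  {
    private final List<Object> values = new ArrayList<>();
    @Override public Map<String, Supplier<MiniHandler>> childHandlers()
    {
      return Map.of("integer", MiniIntHandler::new);
    }
    @Override public void onChildValue(final Object v) { this.values.add(v); }
    @Override public Object onFinished() { return this.values; }
  }

  static Object parse(final String xml, final String rootName, final Supplier<MiniHandler> root)
    throws Exception
  {
    final var stack = new ArrayDeque<MiniHandler>();
    final var result = new Object[1];
    final var events = new DefaultHandler() {
      @Override
      public void startElement(
        final String uri, final String local, final String qName, final Attributes attributes)
        throws SAXException
      {
        // Ask the current handler (or the root table) for a child handler.
        final Supplier<MiniHandler> constructor;
        if (stack.isEmpty()) {
          constructor = local.equals(rootName) ? root : null;
        } else {
          constructor = stack.peek().childHandlers().get(local);
        }
        if (constructor == null) {
          throw new SAXException("Unexpected element: " + local);
        }
        stack.push(constructor.get());
      }

      @Override
      public void characters(final char[] data, final int offset, final int length)
      {
        stack.peek().onCharacters(new String(data, offset, length));
      }

      @Override
      public void endElement(final String uri, final String local, final String qName)
      {
        // Pop the finished handler and pass its value to the parent.
        final var value = stack.pop().onFinished();
        if (stack.isEmpty()) { result[0] = value; } else { stack.peek().onChildValue(value); }
      }
    };
    final var factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    factory.newSAXParser().parse(new InputSource(new StringReader(xml)), events);
    return result[0];
  }
}
```

Calling `HandlerStackSketch.parse(xml, "integers", HandlerStackSketch.MiniIntsHandler::new)` on a document with an `integers` root yields a list of the integer values of its children.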
Consider an incredibly simple XSD schema:
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:ts="urn:tests"
            targetNamespace="urn:tests">
  <xsd:element name="integer">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer"/>
    </xsd:simpleType>
  </xsd:element>
</xsd:schema>
A document conforming to this schema might look like this:
<?xml version="1.0" encoding="UTF-8" ?>
<integer xmlns="urn:tests">23</integer>
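As a quick check that a document really does conform, the schema can be exercised with the JDK's own `javax.xml.validation` package. The following is a minimal sketch, independent of blackthorne and jxe; the class name `ValidateIntegerSketch` and the method `isValid` are made up for this example.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class ValidateIntegerSketch
{
  // The schema shown above, embedded as a string for the sake of the example.
  static final String SCHEMA = """
    <?xml version="1.0" encoding="UTF-8" ?>
    <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
                xmlns:ts="urn:tests"
                targetNamespace="urn:tests">
      <xsd:element name="integer">
        <xsd:simpleType>
          <xsd:restriction base="xsd:integer"/>
        </xsd:simpleType>
      </xsd:element>
    </xsd:schema>
    """;

  /** Returns true if the document validates against SCHEMA. */
  static boolean isValid(final String document) throws Exception
  {
    final var factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
    final var schema = factory.newSchema(new StreamSource(new StringReader(SCHEMA)));
    final var validator = schema.newValidator();
    try {
      // With no ErrorHandler set, validation errors are thrown as exceptions.
      validator.validate(new StreamSource(new StringReader(document)));
      return true;
    } catch (final SAXException e) {
      return false;
    }
  }

  public static void main(final String[] args) throws Exception
  {
    System.out.println(isValid("<integer xmlns=\"urn:tests\">23</integer>"));
    System.out.println(isValid("<integer xmlns=\"urn:tests\">twenty-three</integer>"));
  }
}
```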
An element handler that can handle a document like this, might look like this:
class IntHandler
  implements BTElementHandlerType<Object, BigInteger>
{
  private BigInteger result;
  IntHandler(
    final BTElementParsingContextType context)
  {
  }
  @Override
  public void onCharacters(
    final BTElementParsingContextType context,
    final char[] data,
    final int offset,
    final int length)
    throws SAXParseException
  {
    try {
      final var text = String.valueOf(data, offset, length).trim();
      this.result = new BigInteger(text);
    } catch (final Exception e) {
      throw context.parseException(e);
    }
  }
  @Override
  public BigInteger onElementFinished(
    final BTElementParsingContextType context)
  {
    return this.result;
  }
}
There are several things to note here:
- The BTElementHandlerType is parameterized by the types of values it can receive from child handlers, and the type of values it returns to its parent handler. Because the handler will never create a child handler, we can claim the handler accepts everything (because it will never receive anything) and so the type of child values is Object. The handler returns a value of type BigInteger.
- The handler implements the (optional) onCharacters method that will be called by the XML stream parser in order to allow the handler to process text nodes. This handler parses the given range of characters as an integer.
- The onElementFinished method returns the result parsed earlier.
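The range-to-integer step can be looked at in isolation. SAX parsers are permitted to deliver the text of a single element in more than one chunk, and nothing here states whether blackthorne coalesces chunks before calling onCharacters, so the sketch below assumes it might not and accumulates text defensively. `CharacterRangeSketch` is a hypothetical stand-alone class, not a blackthorne handler.

```java
import java.math.BigInteger;

final class CharacterRangeSketch
{
  private final StringBuilder text = new StringBuilder();

  /** Appends the range [offset, offset + length) of the parser's buffer. */
  void onCharacters(final char[] data, final int offset, final int length)
  {
    this.text.append(data, offset, length);
  }

  /** Parses everything received so far, ignoring surrounding whitespace. */
  BigInteger finish()
  {
    return new BigInteger(this.text.toString().trim());
  }

  public static void main(final String[] args)
  {
    final var sketch = new CharacterRangeSketch();
    final var first = "<integer>12".toCharArray();
    final var second = "34\n".toCharArray();
    // Only the characters inside the given ranges are consumed.
    sketch.onCharacters(first, 9, 2);
    sketch.onCharacters(second, 0, 3);
    System.out.println(sketch.finish());
  }
}
```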
The Blackthorne class contains numerous convenience methods that can be used to succinctly implement handlers such as the above without requiring entire class definitions. The above could be reduced to:
Blackthorne.forScalarFromString(..., BigInteger::new);
Some handlers will handle XML elements that have child elements. Consider the following schema:
<?xml version="1.0" encoding="UTF-8" ?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:ts="urn:tests"
            targetNamespace="urn:tests">
  <xsd:element name="integer">
    <xsd:simpleType>
      <xsd:restriction base="xsd:integer"/>
    </xsd:simpleType>
  </xsd:element>
  <xsd:element name="integers">
    <xsd:complexType>
      <xsd:sequence minOccurs="0" maxOccurs="unbounded">
        <xsd:element ref="ts:integer"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
A handler that can handle the integers element must be able to tell the blackthorne API that there will be integer child elements, and it must be prepared to receive child values. For example:
class IntsHandler
  implements BTElementHandlerType<BigInteger, List<BigInteger>>
{
  private ArrayList<BigInteger> results;
  IntsHandler(
    final BTElementParsingContextType context)
  {
    this.results = new ArrayList<BigInteger>();
  }
  @Override
  public Map<BTQualifiedName, BTElementHandlerConstructorType<?, BigInteger>>
  onChildHandlersRequested(
    final BTElementParsingContextType context)
  {
    return Map.ofEntries(
      Map.entry(elementName("integer"), IntHandler::new)
    );
  }
  @Override
  public void onChildValueProduced(
    final BTElementParsingContextType context,
    final BigInteger result)
  {
    this.results.add(result);
  }
  @Override
  public List<BigInteger> onElementFinished(
    final BTElementParsingContextType context)
  {
    return this.results;
  }
}
The onChildHandlersRequested method is implemented and returns a map of handler constructors. When the blackthorne API encounters a child element with name integer, it will instantiate an IntHandler to parse it. The onChildValueProduced method is implemented and will receive an integer value each time a child element is successfully processed.
Again, the Blackthorne class contains a method that can provide a simple implementation for monomorphic lists such as the above. The above can be reduced to:
Blackthorne.forListMono(
  elementName("integers"),
  elementName("integer"),
  IntHandler::new,
  DO_NOT_IGNORE_UNRECOGNIZED_ELEMENTS
);
Many parts of the blackthorne API accept a parameter of type BTIgnoreUnrecognizedElements. This can be used to ignore unrecognized elements in XML documents if the schema being used to validate the XML allows for extra elements (such as through the use of the any element). If the parameter is set to DO_NOT_IGNORE_UNRECOGNIZED_ELEMENTS, an error will be produced on unknown elements.
It is the responsibility of the programmer and schema designer to design documents that can be sensibly typed. A given element handler BTElementHandlerType<T, ...> can only receive child elements of type T. If the child elements of the handler can be of a set of different types, then it would be desirable (for the sake of type-safety and readability) for those child elements to have a common supertype that is not Object. The best way to do this, schema allowing, is to use a sealed interface:
sealed interface T { }
final class Child0 implements T {}
final class Child1 implements T {}
final class Child2 implements T {}
class IntsHandler
  implements BTElementHandlerType<T, ...>
{
  private ArrayList<T> results;
  IntsHandler(
    final BTElementParsingContextType context)
  {
    this.results = new ArrayList<T>();
  }
  @Override
  public Map<BTQualifiedName, BTElementHandlerConstructorType<?, T>>
  onChildHandlersRequested(
    final BTElementParsingContextType context)
  {
    return Map.ofEntries(
      Map.entry(elementName("child0"), Child0Handler::new),
      Map.entry(elementName("child1"), Child1Handler::new),
      Map.entry(elementName("child2"), Child2Handler::new)
    );
  }
  @Override
  public void onChildValueProduced(
    final BTElementParsingContextType context,
    final T result)
  {
    this.results.add(result);
  }
  @Override
  public List<T> onElementFinished(
    final BTElementParsingContextType context)
  {
    return this.results;
  }
}
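One benefit of the sealed supertype is that code consuming the received child values can distinguish the cases without falling back to Object. The following self-contained sketch shows this with Java 17 records and instanceof patterns; the record components (`value`, `text`) and the `describe` method are invented for illustration and do not correspond to anything in blackthorne.

```java
final class SealedChildrenSketch
{
  // The permitted implementations are fixed and known to the compiler.
  sealed interface T permits Child0, Child1, Child2 { }

  // Records are implicitly final, which satisfies the sealed contract.
  record Child0(int value) implements T { }
  record Child1(String text) implements T { }
  record Child2() implements T { }

  /** Describes a child value without ever handling a bare Object. */
  static String describe(final T child)
  {
    if (child instanceof Child0 c0) {
      return "child0:" + c0.value();
    }
    if (child instanceof Child1 c1) {
      return "child1:" + c1.text();
    }
    // Only Child2 remains, because T is sealed.
    return "child2";
  }

  public static void main(final String[] args)
  {
    System.out.println(describe(new Child0(7)));
    System.out.println(describe(new Child1("x")));
    System.out.println(describe(new Child2()));
  }
}
```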
Element handlers are functors. Given a function f of type B → C with a handler h that accepts child values of type A and produces values of type B, the Blackthorne.map function can produce a handler of type A → C:
Function<B, C> f;
BTElementHandlerType<A, B> h;
BTElementHandlerType<A, C> g = Blackthorne.map(h, f);
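The idea behind such a map function can be sketched with a much smaller, hypothetical handler interface. This is an illustration of the functor pattern only, not the implementation of Blackthorne.map; `MiniHandler`, `Sum`, and `map` are assumed names.

```java
import java.util.function.Function;

final class HandlerMapSketch
{
  /** Hypothetical handler: consumes child values of type A, produces a B. */
  interface MiniHandler<A, B>
  {
    void onChildValue(A value);
    B onFinished();
  }

  /** A handler that sums the integers it receives from its children. */
  static final class Sum implements MiniHandler<Integer, Integer>
  {
    private int total;
    @Override public void onChildValue(final Integer value) { this.total += value; }
    @Override public Integer onFinished() { return this.total; }
  }

  /** Wraps h so that its result is passed through f; child values are untouched. */
  static <A, B, C> MiniHandler<A, C> map(
    final MiniHandler<A, B> h,
    final Function<B, C> f)
  {
    return new MiniHandler<>() {
      @Override public void onChildValue(final A value) { h.onChildValue(value); }
      @Override public C onFinished() { return f.apply(h.onFinished()); }
    };
  }

  public static void main(final String[] args)
  {
    final MiniHandler<Integer, String> g = map(new Sum(), n -> "total=" + n);
    g.onChildValue(2);
    g.onChildValue(3);
    System.out.println(g.onFinished());
  }
}
```

The wrapper delegates everything to the original handler and only transforms the final result, which is why the child value type A is unchanged.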
The Blackthorne class contains a convenient parse method that takes a set of handlers (one per expected root element), and returns the sum of the parsed data from the tree of handlers:
var data = Blackthorne.parse(
  sourceURI,
  inputStream,
  PRESERVE_LEXICAL_INFORMATION,
  () -> createXMLReader(),
  handlers
);
The sourceURI is the URI of the input stream, used for lexical/diagnostic information. The inputStream is the input stream containing the document. The PRESERVE_LEXICAL_INFORMATION enumeration value instructs the Blackthorne API to leave the lexical information delivered by the stream parser untouched.
The handlers parameter is the map of handlers, and the lambda expression is a function that returns a value of type XMLReader on demand. It is recommended that jxe be used as the factory of XML readers as it configures the readers using secure defaults (preventing entity expansion attacks and other problems), and enforces that all incoming documents must validate according to one or more given schemas.
The type of the value returned by the parse function depends upon the common supertype of the return types of the passed handlers.
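For readers wondering what a function such as createXMLReader might contain when jxe is not used, the following is a rough, JDK-only sketch of a reader configured with restrictive defaults. It is an assumption about one reasonable configuration, not a description of what jxe does, and unlike jxe it performs no schema validation. The feature URIs are those recognised by the JDK's built-in parser; other parser implementations may reject them.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

final class HardenedReaderSketch
{
  /** Hypothetical factory method; the name mirrors the lambda in the example above. */
  static XMLReader createXMLReader() throws Exception
  {
    final var factory = SAXParserFactory.newInstance();
    factory.setNamespaceAware(true);
    // Enable the JDK's general processing limits.
    factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
    // Refuse documents that contain a DOCTYPE at all.
    factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
    // Do not resolve external entities.
    factory.setFeature("http://xml.org/sax/features/external-general-entities", false);
    factory.setFeature("http://xml.org/sax/features/external-parameter-entities", false);
    return factory.newSAXParser().getXMLReader();
  }

  /** Returns true if the reader parses the document without a SAX error. */
  static boolean accepts(final String xml) throws Exception
  {
    final var reader = createXMLReader();
    final var handler = new DefaultHandler();
    reader.setContentHandler(handler);
    reader.setErrorHandler(handler);
    try {
      reader.parse(new InputSource(new StringReader(xml)));
      return true;
    } catch (final SAXException e) {
      return false;
    }
  }

  public static void main(final String[] args) throws Exception
  {
    System.out.println(accepts("<a/>"));
    System.out.println(accepts("<!DOCTYPE a [<!ENTITY e \"x\">]><a>&e;</a>"));
  }
}
```

Because this method declares checked exceptions, it would need to be wrapped before being used in a lambda whose functional interface does not permit them.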
The blackthorne package provides a com.io7m.blackthorne.jxe module that contains a slightly more convenient API:
var data = BlackthorneJXE.parse(
  sourceURI,
  inputStream,
  handlers,
  Optional.empty(),
  XINCLUDE_DISABLED,
  PRESERVE_LEXICAL_INFORMATION,
  mappings
);
The sourceURI, inputStream, handlers, and PRESERVE_LEXICAL_INFORMATION arguments mean the same as in the example above. The XINCLUDE_DISABLED argument disables XInclude processing. The mappings parameter provides a set of schemas against which the inputStream will be validated. The Optional.empty() argument is the base directory used to resolve relative document names (which is irrelevant here, because XInclude is disabled).