A fast XML parser / scanner written in TypeScript
- Small, fast, zero dependencies
- Supports multiple runtimes (Node.js, Deno, Bun, browsers)
- SAX-based parser with a DOM-like event structure
- Supports standard XML features (element names, attributes, text, comments, CDATA, processing instructions, XML entities). No support for DTDs.
Given an XML string, the xml scanner will parse the XML and invoke the defined events on the specified paths.
Platform | Command |
---|---|
npm | npm install @continuit/xmlscanner |
deno | deno add jsr:@continuit/xmlscanner |
For other platforms see jsr.io for more information.
There are two basic types of XML parsers:
- SAX (event based)
- DOM (tree based)
This library implements a SAX parser on steroids, giving some of the ease of use of DOM parsers while keeping the speed of SAX parsers. The XML scanner scans XML strings. Streaming is not supported (yet).
The xml scanner is built around the concept of events and paths. There are several types of events (open tag, close tag, text, comment, etc.), each fired when the scanner encounters the corresponding part of the XML document. Paths define on which document elements the events are invoked; they are similar to XPath absolute path expressions. An event tree is filled with the events that need to be invoked on specific paths. Once defined, it can be used to scan XML strings.
Internally the xml scanner uses a tree structure with nodes of type XmlEvents. All fields in the XmlEvents object are optional, so you can and should define only the events you are interested in. An empty tree structure can be created by calling emptyXmlEvents().
The tree can be expanded by calling addElement(events, path). This returns the XmlEvents object that the path points to.
On the XmlEvents object you can define the events you are interested in. See the XmlEvents type for more information.
When the whole tree structure is set up, the xmlScanner(xml, events) function is called with the XML string and the tree structure.
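To make the idea of a path-addressed event tree concrete, here is a minimal sketch of how such a tree could be modeled. It is not the library's implementation: the names SketchNode, emptySketchNode and addSketchPath are hypothetical, and the real XmlEvents type may be shaped differently.

```typescript
// Hypothetical node shape, for illustration only.
// The library's actual XmlEvents type may look different.
type SketchNode = {
  tagopen?: () => void;
  text?: (text: string) => void;
  tagclose?: () => void;
  children: Map<string, SketchNode>;
};

// Create a node without any handlers or children.
function emptySketchNode(): SketchNode {
  return { children: new Map() };
}

// Walk the tree one path segment at a time, creating missing nodes,
// and return the node the full path points to.
function addSketchPath(root: SketchNode, path: string): SketchNode {
  let node = root;
  for (const segment of path.split('/')) {
    if (segment === '') continue; // tolerate leading or doubled slashes
    let child = node.children.get(segment);
    if (child === undefined) {
      child = emptySketchNode();
      node.children.set(segment, child);
    }
    node = child;
  }
  return node;
}
```

Because every handler is optional, a node that only exists to connect two path segments carries no handlers at all. Calling addSketchPath twice with the same path returns the same node, so handlers can be attached in several steps.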
The examples below use the following example XML document, testfile.xml:
<company>
  <employees>
    <employee>
      <id>1</id>
      <name>John Doe</name>
      <position>Software Engineer</position>
      <department>Engineering</department>
    </employee>
    ... many more employee records ...
  </employees>
</company>
This example extracts all the names from the XML document:
import { addElement, emptyXmlEvents, xmlScanner } from '@continuit/xmlscanner';

// The scan is synchronous, so the function does not need to be async.
function extractNames(xml: string): string[] {
  // This will collect all the names found in the XML
  const names: string[] = [];
  // Set up the xml scanner.
  // Listen for the text event on elements on the following path:
  // company/employees/employee/name
  const events = emptyXmlEvents();
  addElement(events, 'company/employees/employee/name').text = (text) => names.push(text);
  // Scan the XML and extract the names
  xmlScanner(xml, events);
  // Return the names
  return names;
}
A slightly more complex example extracts all the employees as objects.
When an employee element is opened, a new empty Employee object, activeEmployee, is created.
Since not all fields are available in each employee element, the activeEmployee object is of type Partial<Employee>.
Then, as the properties of the employee record are matched, they are filled in on the activeEmployee object.
When the employee element is closed, the activeEmployee object is pushed to the result array.
import { addElement, emptyXmlEvents, xmlScanner } from '@continuit/xmlscanner';

type Employee = {
  id: number;
  name: string;
  position: string;
  department: string;
};

// Not async: an async function would have to return Promise<Employee[]>.
function extractEmployees(xml: string): Employee[] {
  // This will collect all the employees found in the XML
  const result: Employee[] = [];
  // This will hold the current employee object while an employee record is being parsed
  let activeEmployee: Partial<Employee> = {};
  // Set up the xml scanner.
  const events = emptyXmlEvents();
  // Get the event object for the path company/employees/employee
  const employee = addElement(events, 'company/employees/employee');
  // Set up the event listeners for the employee object.
  employee.tagopen = () => activeEmployee = {};
  addElement(employee, 'name').text = (text) => activeEmployee.name = text;
  // Pass an explicit radix so the id is always parsed as decimal
  addElement(employee, 'id').text = (text) => activeEmployee.id = Number.parseInt(text, 10);
  addElement(employee, 'position').text = (text) => activeEmployee.position = text;
  addElement(employee, 'department').text = (text) => activeEmployee.department = text;
  // The cast assumes every employee record contains all four fields
  employee.tagclose = () => result.push(activeEmployee as Employee);
  // Scan the XML and extract the employees
  xmlScanner(xml, events);
  // Return the employees
  return result;
}
The xml scanner is designed to be as fast as possible.
Care has been taken to ensure that the library performs well. After trying out different approaches to fast parsing of the XML string, we settled on two basic, fast string operations: indexOf and charCodeAt.
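As an illustration of scanning with only indexOf and charCodeAt, the sketch below collects the names of opening tags without building tokens. It is a simplified stand-in rather than the library's scanner: the function name listOpenTagNames is hypothetical, and the sketch assumes well-formed input and processing instructions that contain no '<'.

```typescript
const SLASH = 47;    // '/'
const BANG = 33;     // '!'
const QUESTION = 63; // '?'
const GT = 62;       // '>'

// Collect the names of all opening (and self-closing) tags in document order.
function listOpenTagNames(xml: string): string[] {
  const names: string[] = [];
  let pos = xml.indexOf('<');
  while (pos !== -1) {
    const next = xml.charCodeAt(pos + 1);
    if (next === BANG) {
      // Comment, CDATA or other declaration: jump past its terminator
      // so a '<' inside it is not mistaken for a tag.
      let end: number;
      if (xml.startsWith('<!--', pos)) end = xml.indexOf('-->', pos + 4);
      else if (xml.startsWith('<![CDATA[', pos)) end = xml.indexOf(']]>', pos + 9);
      else end = xml.indexOf('>', pos + 2);
      if (end === -1) break; // unterminated construct: stop scanning
      pos = xml.indexOf('<', end + 1);
      continue;
    }
    if (next !== SLASH && next !== QUESTION) {
      // Opening tag: the name runs until whitespace, '/' or '>'.
      let i = pos + 1;
      while (i < xml.length) {
        const c = xml.charCodeAt(i);
        if (c === GT || c === SLASH || c === 32 || c === 9 || c === 10 || c === 13) break;
        i++;
      }
      // A substring is only allocated here, once a name is actually needed.
      if (i > pos + 1) names.push(xml.substring(pos + 1, i));
    }
    pos = xml.indexOf('<', pos + 1);
  }
  return names;
}
```

The loop compares character codes instead of one-character strings, and it lets indexOf skip over text content in a single call rather than visiting every character.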
In order to allow events to be defined on different paths, a tree structure is used. The tree is traversed while scanning the XML string, so when events need to be invoked the right node is already at hand. Since all events are defined on the nodes, memory usage is minimal.
No tokens are created while scanning; the scanner works directly on the XML string.
Only when an event that needs to pass content (like text and attribute events) is specified on a path is the content extracted and its entities unescaped. Element and attribute names are converted to strings for event lookup and tree traversal.
Since V8 and similar engines allocate substrings very quickly, there is only a minimal performance penalty, especially if there are no entities to unescape.
This is the reason strings are used as the source of the XML document instead of a Uint8Array.
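The cheap case described above, content without entities, can be sketched as an unescape function with a fast path. This is an illustrative sketch, not the library's code: the name unescapeXml is hypothetical, and it handles only the five predefined XML entities plus numeric character references.

```typescript
// The five entities predefined by XML. A Map avoids accidental hits on
// Object.prototype keys such as "constructor".
const NAMED_ENTITIES = new Map<string, string>([
  ['lt', '<'], ['gt', '>'], ['amp', '&'], ['quot', '"'], ['apos', "'"],
]);

function unescapeXml(raw: string): string {
  let amp = raw.indexOf('&');
  // Fast path: no '&' means nothing to unescape, so return the input as-is.
  if (amp === -1) return raw;
  let out = '';
  let last = 0; // start of the part of raw not yet copied to out
  while (amp !== -1) {
    const semi = raw.indexOf(';', amp + 1);
    if (semi === -1) break; // no terminator left: keep the rest verbatim
    const name = raw.substring(amp + 1, semi);
    let replacement: string | undefined;
    const numeric = /^#(?:x([0-9a-fA-F]+)|([0-9]+))$/.exec(name);
    if (numeric !== null) {
      const isHex = numeric[1] !== undefined;
      const digits = numeric[1] ?? numeric[2] ?? '';
      const code = Number.parseInt(digits, isHex ? 16 : 10);
      // Ignore values outside the Unicode range instead of throwing.
      if (code <= 0x10ffff) replacement = String.fromCodePoint(code);
    } else {
      replacement = NAMED_ENTITIES.get(name);
    }
    if (replacement !== undefined) {
      out += raw.substring(last, amp) + replacement;
      last = semi + 1;
      amp = raw.indexOf('&', semi + 1);
    } else {
      // Unknown entity: leave it untouched and look for the next '&'.
      amp = raw.indexOf('&', amp + 1);
    }
  }
  return out + raw.substring(last);
}
```

When the input contains no ampersand, the function returns the original string without allocating anything new, which matches the observation that text without entities is the cheapest case.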
The xml scanner currently does not support all XML features. Namespaces are not parsed, so if you have namespaces in your XML, you will need to handle those yourself.
<root xmlns:x="http://example.com/namespace">
  <x:child>text</x:child>
</root>
In the above case the path root/x:child will be matched, but if the namespace prefix is changed from x to y, the path root/y:child will not be matched.
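One possible way to handle prefixes yourself is to look up which prefix a document binds to a known namespace URI and build the path from it. The helper below is a hypothetical sketch that is not part of the library. It takes the first matching xmlns:prefix declaration anywhere in the string, ignores default namespaces and scoping, and does not skip comments or CDATA.

```typescript
// Return the first prefix declared for namespaceUri, or undefined if none is found.
function findNamespacePrefix(xml: string, namespaceUri: string): string | undefined {
  // Matches xmlns:prefix="uri" and xmlns:prefix='uri'.
  const declaration = /xmlns:([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
  let match: RegExpExecArray | null;
  while ((match = declaration.exec(xml)) !== null) {
    // Group 2 holds a double-quoted value, group 3 a single-quoted one.
    const value = match[2] ?? match[3];
    if (value === namespaceUri) return match[1];
  }
  return undefined;
}

// Build a path such as root/x:child from the prefix found in the document.
function childPathFor(xml: string, namespaceUri: string): string | undefined {
  const prefix = findNamespacePrefix(xml, namespaceUri);
  return prefix === undefined ? undefined : `root/${prefix}:child`;
}
```

The resulting path can then be registered on the event tree before scanning, so the same code matches the document whether it uses the prefix x or y.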
The xml scanner does not support DTDs. If your XML contains a DTD, a "not supported" error will be thrown.
The xml scanner does not support streaming. The entire XML document must be present in a string before it can be scanned.
There are already a number of XML parsers available for TypeScript, so why build another one?
The main reason is the need for a fast XML parser for XML files wrapped inside zip files, varying in size from 30MB to 200MB. No existing solution was both performant and easy to use without running out of memory.
The code quality is measured using unit tests and code coverage. See the Test report for more information.
This project is licensed under the MIT license. See the LICENSE file for details.