NOTE: I recommend fast-xml2js in place of this project.
It performs well and is extremely convenient to use for the xml-to-json conversion.
xml-to-es
was originally written to translate [David Lewis's Reuters collection in SGML](http://www.daviddlewis
.com/resources/testcollections/reuters21578/) intocleaned-up JSON for ElasticSearch.
It has been improved to translate XML into JSON, HTML, raw text. Output is to one file per XML document or one file N XML documents or one file for the output of all the input XML documents.
In version 0.2.0, xml-to-es
can accept a generator
argument that supports output to any kind of sink,
including a stream. There is an example of this in examples/db-config.js
.
Using examples/convert.js
, documents can be submitted as a comma-delimited list or as a directory name with
Translates XML (or SGML) documents into JSON documents suitable for ElasticSearch -- or into plain text suitable for
OpenNLP program input. Additionally, there is an HTML output option intended for use with Chiliad Discovery and an
open-ended option that allows such things as pushing the documents to a database.
There is also a module examples/indexFiles.js
and a mapping.json file that can be used as examples to submit
the JSON-ized files to ElasticSearch.
xml-to-es uses libxml-to-js (and the underlying libxmljs) and then massages the resulting JSON object to make it meaningful for search engines. Elasticsearch can handle nested JSON, but the nested JSON produced automatically from XML/SGML is almost never what you want for indexing or any other purpose. In addition, most XML is by nature very noisy. xml-to-es gives you some fine-grained control over what is produced in JSON.
A simple config file lets you
- preProcess the JSON to get it in shape for subsequent processing as described in the remaining bullet items
- un-nest the XML by promoting important elements to top level
- turn attributes (represented as '@' properties) into elements using promotion
- flatten arrays which are rendered as
[{ '#' : value1}, {'#' : value2},...]
by libxml*. - delete elements you don't need
- rename elements
XML/SGML structure in the wild is not always perfect, so xml-to-es handles some anomalies:
* missing closing tag
* missing opening tag
xml-to-es handles these by examining the XML/SGML input before it is JSON-ized.
It will not currently handle a sequence where the first document has a missing closing tag and the second document
has a missing opening tag. That would be possible but is left for future work as needed. (The input file
test/data/twoDocsNoSepTagsTest.xml
is available for
experimentation on that problem. Note: extension is xml
instead of sgm
to protect tests.)
With npm, just do:
npm install xml-to-es
Then, cd
to xml-to-es directory and run:
npm install
For github:
git clone http://github.com/imbroglioj/xml-to-es.git
The examples
and test
directories show a number of ways to use and control these modules.
convert.js
: useslib/xml-to-es.js
to convert XML/SGML to massaged JSONindexFiles.js
: using an index config file, will submit a set of JSON files for indexing to ElasticSearch (requires that ElasticSearch be running!)*-config.js
: various input and output configuration files to be used withconvert.js
andindexFiles.js
- If you want to copy
convert.js
orindexFiles.js
to top level to experiment with modifying them, you will have to- change
require(path.resolve(__dirname,'index.js'))
torequire('xml-to-es')
- install the
optimist
module (usingnpm
).
- change
Some JSON tweaks are provided using the config file input
property:
preProcess
: modify the JSON object (config.json) in any waypromote
: move a nested element/object-property to be a top level object propertydelete
: remove unneeded properties from the resulting JSONrename
: rename property keys in the JSONflatten
: Somewhat likepromote
, but typically used to remove noise from what should be an array. Some SGML kludges can create complicated object nesting in the JSON which can completely obscure the fact that we have an array as a property value. (To see a before example, temporarily remove theflatten
property from a copy oflewis-input-config.js
.) Once you identify the offending XML tag ('d' for the Reuters collection), xml-to-es will remove the extra tag and flatten the array value by removing place-holder property names like '#'.
The output-config must require
the input-config you want.
The output config file (examples: json-config.js
, db-config.js
, text-only-config.js
) give
examples of the output options.
- fmt: JSON|HTML or whatever formats you might add to Generation.js
- noFile: true if there is a user-supplied generator and it creates its own output sink
- fileExt: extension of the output file (the output file name is created from the input file name, the JSON id property and the output.fileExt). IF REGEX, terminate with "$"
- destDir: directory for output files
- docsPerFile:
- 0: put everything in one output file
- 1: 1 output file per XML/SGML input
- 100: 1 output file for every 100 XML/SGML input files
- leadChar, sepChar, trailChar: in case of multiple documents per output file, how to group them (
[ , ]
for JSON - generator: an object for supplying your own output handler (see also
examples/db-config
) {type: /* SAME as fmt value */, fn: if (noFile == true) fn: function(data, cb) else fn : function(stream, data, cb) setConfig: function(conf) that sets the closure var config to the actual current complete config } note: type value must match fmt property value due to internal issues. (If 'fmt' is undefined it will be set to the 'type' value)
node examples/convert.js FILES_OR_DIRECTORY --config CONFIG_FILE [--level LOGLEVEL]
- FILES_OR_DIRECTORY: a filename, directory name or comma-delimited list of filenames and/or directory names leading to files that containone or more XML (or SGML) documents. The config file may contain config.input.fileExt to control which files are read.
- CONFIG_FILE: a node.js module that controls the structure of the input and output
- LOGLEVEL: default is DEBUG
Example: node examples/convert.js /data/reuters --config examples/json.config --level WARN
Make sure you have ElasticSearch running and listening at the server and port listed in the index config file.
node examples/indexFiles.js FILES_OR_DIRECTORY --config INDEX_CONFIG
- FILES_OR_DIRECTORY: filename, directory name or comma-delimited list of file/directory names which designate one
or more files of JSON documents. Unlike ElasticSearch,
indexFiles.js
can handle files containing JSON arrays of documents.
NOTE: the 'id' field of each document will be submitted as the ID to ElasticSearch.
Example: node examples/indexFiles.js jsonFiles/ --config examples/index-es-config
In you have mocha installed globally, cd
to xml-to-es directory and run:
mocha --reporter spec