Standoff Tools - tools for handling standoff annotations

Standoff Tools (standoff-tools) offers generic services for building annotation pipelines that enrich XML, e.g. TEI-XML, using taggers for plain text analysis. It helps to bridge the land of XML hierarchies and the land of processing a stream of tokens.

In detail, Standoff Tools offers two services that are coordinated with each other:

  • Extractor E: extracts plain text from XML

  • Internalizer I: merges results back into XML so that the result is well-formed XML

Information flow in a standoff pipeline

Slides for the TEI 2022 conference

Requirements for the tagger

To use these services, the tagger for plain text analysis has to provide records with character offsets; a mere list of strings is not enough. For example, imagine a tagger for named entity recognition (NER) that returns CSV with one row for each name found, giving the offsets of the name's start and end characters and possibly other features such as persistent identifiers of the named entities.

start,end,string,id
1051,1055,Locke,...
1073,1082,Descartes,...
2033,2037,Locke,...
3451,3455,Wolff,...
...

CSV files suitable for Standoff Tools must provide at least the two columns start and end, or start and length.
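As an illustration of such a tagger, here is a minimal Python sketch that finds a fixed list of names with the standard `re` module and writes a CSV with the columns start, length and string. The function names `tag_names` and `to_csv` are hypothetical, and the use of 0-based character offsets into the text handed to the tagger is an assumption that should be checked against the tool's help output.

```python
import csv
import io
import re


def tag_names(text, names):
    """Yield one record per occurrence of any of the given names."""
    pattern = re.compile("|".join(re.escape(n) for n in names))
    for m in pattern.finditer(text):
        yield {
            "start": m.start(),              # assumption: 0-based offset
            "length": m.end() - m.start(),   # length avoids the inclusive/exclusive end question
            "string": m.group(0),
        }


def to_csv(records):
    """Serialize the records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["start", "length", "string"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


text = "Locke answered Descartes; Wolff read Locke."
print(to_csv(tag_names(text, ["Locke", "Descartes", "Wolff"])))
```

The sketch uses start and length rather than start and end, so it does not have to decide whether an end offset points at the last character or one past it.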

Many tools and libraries provide offset information: spaCy, ANTLR-based grammar parsers, Python's regex library, (WebLicht), ...

Features in the other columns of the CSV can be mapped to attribute values in the XML output of the internalizer I. For the element that wraps the portions of the document described in the CSV file, you can either define a fixed (constant) element name or name a column from which the element name is taken.

The spans described in the CSV may overlap each other.

Usage

A wiki is in preparation.

Internalizing StandOff Annotations, e.g. Web Annotations (OA)

The internalizer can also be used stand-alone to internalize manually produced standoff annotations into the source document. The result is well-formed XML even when the annotations overlap each other or overlap the internal markup of the source document. If an annotation starts inside an opening tag, a closing tag, a character reference, etc., it is silently repaired.
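To make the idea of well-formed output despite overlap concrete, here is a simplified Python sketch. It is not the tool's Haskell implementation. It assumes that offsets are 0-based positions in the XML source string, finds tags with a naive regular expression, and splits an annotation at every tag it crosses, so that each piece can be wrapped on its own. The names `split_range` and `internalize` and the element name `persName` are hypothetical choices for this illustration.

```python
import re
import xml.etree.ElementTree as ET

# Naive tag matcher: ignores comments, CDATA, and '>' inside attribute values.
TAG = re.compile(r"<[^>]*>")


def split_range(xml, start, end):
    """Return the sub-ranges of [start, end) that contain no markup."""
    pieces = []
    pos = start
    for m in TAG.finditer(xml):
        if m.end() <= pos:
            continue          # tag lies entirely before the current position
        if m.start() >= end:
            break             # tag lies entirely after the annotation
        if m.start() > pos:
            pieces.append((pos, m.start()))
        pos = max(pos, m.end())  # also repairs a start that falls inside a tag
    if pos < end:
        pieces.append((pos, end))
    return pieces


def internalize(xml, start, end, name="persName"):
    """Wrap every markup-free piece of the annotation in its own element."""
    out, last = [], 0
    for s, e in split_range(xml, start, end):
        out.append(xml[last:s])
        out.append(f"<{name}>{xml[s:e]}</{name}>")
        last = e
    out.append(xml[last:])
    return "".join(out)


source = "<p>John <hi>Locke</hi> wrote</p>"
# The annotation covers "John <hi>Locke" and so crosses the opening <hi> tag.
result = internalize(source, 3, 17)
ET.fromstring(result)  # raises ParseError if the result were not well-formed
print(result)
```

Splitting at every tag is the bluntest strategy that still guarantees well-formedness; an actual internalizer can keep fully enclosed elements inside a single split.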

Features

  • no language model introduced, e.g. the notion of word
  • library abstracts away XML and can be used for every hierarchical markup language
  • no TEI-specific knowledge in the code base, but can be added by config
  • can be used stand-alone for internalizing OA-based standoff annotations into the source document
  • standoff annotations may reference the source document using pairs of character offsets or pairs of start offset and length
  • offsets may be given as scalars or pairs of line and column numbers
  • define in a YAML config how tags are shrunk
  • mappings of annotation features (key-value pairs) to XML attributes defined in YAML
    • Special features of each split can be used to provide the internalized splits with a unique ID and with a pointer to the previous split, e.g. for TEI's @prev.
    • add prefixes to annotation features that go into attribute values, e.g. for making correct @xml:ids from UUIDs
  • define a constant element name for internalized splits or use an annotation feature to determine the element name
  • commands for inspecting the annotations
  • commands for inspecting the source document
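The relation between the two offset styles mentioned above (scalars versus pairs of line and column numbers) can be sketched in a few lines of Python. The helper names `line_starts` and `to_scalar` are hypothetical, and the counting conventions (1-based lines, 0-based columns, 0-based scalar offsets) are assumptions for this illustration, not documented behaviour of the tool.

```python
def line_starts(text):
    """Return the scalar offset at which each line of the text begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def to_scalar(starts, line, column):
    """Convert a 1-based line and a 0-based column into a 0-based offset."""
    return starts[line - 1] + column


doc = "<p>\nLocke\n</p>"
starts = line_starts(doc)
offset = to_scalar(starts, 2, 0)  # first character of the second line
print(offset, doc[offset:offset + 5])
```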

History

standoff-tools was first developed in 2015 in order to internalize assertive standoff annotations on TEI documents, which were produced with standoff-mode, a tagger for GNU Emacs. This tool works with annotation schemes defined in RDFS/OWL, lets you make discontinuous markup, relations of text runs, and free text comments. standoff-tools enabled us to visualize our annotations in a browser.

The aim since spring 2021 is to use standoff-tools in various annotation pipelines, either with human or machine-driven annotators, where annotations have to be internalized into the TEI source document.

Road-map

  • choose tag name from a feature
  • mute output of subtrees in shrunk text, e.g. for <tei:teiHeader> or <tei:rdg>
  • make it a webservice
  • add support for DTD and entity definition parsing

Installation

standoff-tools is written in the Haskell programming language. To compile and run it, you need stack, the Haskell build tool. After installing stack, clone this repository, cd into the working copy, and compile the program in a sandboxed environment:

git clone https://github.com/lueck/standoff-tools.git
cd standoff-tools
stack setup
stack build

To install it use:

stack install

If you want to try it first, without installation, you can use all the program's features by executing it through stack from the sandbox:

stack exec -- standoff --help

To run the tests, use stack test :unit-tests. There is also a test suite with real-world tests, which require TEI-P5 input files. If you want to run these tests too, don't hesitate to contact me to get the files.

Usage

stack build generates an executable named standoff, which offers several sub-commands. Run standoff with the --help option as follows:

standoff --help

You will see internalize in the list of available sub-commands. Each sub-command offers its own help message:

standoff internalize --help

Attribute Mappings

The parser for annotations given in CSV makes key-value pairs from the header names and the values in each row. The keys are mapped to a triple of XML prefix, XML name, and XML namespace. There are also special keys for each text range and split:

  • __standoff_special__splitId: The value is a concatenation of the id feature and the split number (for the first split, the id only). This can be used for xml:id.
  • __standoff_special__prevId: A pointer to the @xml:id of the previous split. It can be used in TEI's @prev.
  • __standoff_special__ns: It has the constant value "unknown" and can be used to set the namespace of the inserted element. Note that you can use a prefixed element name!

See mappings/som-tei.yaml for an example.
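How the split-related special keys relate to each other can be sketched in Python as follows. This is an illustration only: the function name `special_features`, the separator between id and split number, and the numbering of later splits starting at 1 are assumptions, and the actual values produced by standoff may be formatted differently.

```python
def special_features(ann_id, n_splits, sep="-"):
    """Build the split-related special features for each split of one annotation.

    Assumptions: `sep` joins id and split number, and the first split
    (which carries the bare id) has no previous split to point to.
    """
    features = []
    prev = None
    for n in range(n_splits):
        split_id = ann_id if n == 0 else f"{ann_id}{sep}{n}"
        features.append(
            {
                "__standoff_special__splitId": split_id,
                "__standoff_special__prevId": prev,
            }
        )
        prev = split_id
    return features


for feats in special_features("a1", 3):
    print(feats)
```

Each split after the first points back to its predecessor, which is what allows a chain of TEI @prev attributes to reconnect the pieces of one annotation.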

Implementation

If you are interested in the internalizer's implementation, which is based on position-based splitting instead of a look-ahead parser, have a look at Internalize.hs.

License

GPL V3
