ItsdbDerivations

Overview

The itsdb environment records information about derivations (the 'recipes' of linguistic analyses) in its database. Combined with the grammar originally used to derive each analysis, the derivation structure needs to provide complete information for re-building the analysis. In other words, the derivation can serve as an oracle to a process that one can conceptualize as deterministic parsing: the derivation records exactly which steps the original itsdb client processor took in producing its analysis. Deterministically re-building (or re-playing) an analysis will thus give rise to exactly the same structure as was associated with the original result.

In principle, the itsdb derivation format is applicable to any kind of processing client (be it the LKB, PET, TRALE, or the XLE) and all types of processing (e.g. parsing, generation, transfer, or translation). However, in practice (as of early 2009) only parsing and generation derivations produced by either the LKB or PET are fully supported.

This page documents the format used internally by itsdb to record derivations (this specification is sometimes half-jokingly referred to as Unified Derivation Format or UDF). This page was predominantly authored by StephanOepen, who jointly with UlrichCallmeier developed the original UDF 1.0 specification. Please do not make substantial changes unless you (a) are reasonably sure of the technical correctness of your revisions and (b) believe strongly that your changes are compatible with the general design and recommended use patterns for itsdb, and of course with the goals of this page.

An Example

The following is an example derivation taken from the WeScience treebank. This derivation is the result of parsing item WS01/10021300: Many terms are ambiguous.

  (root_strict
   (515 subjh 5.63927 0 4
    (511 bare_np 0.986543 0 2
     (510 adjn -0.115529 0 2
      (44 many_a1 -0.657932 0 1
       ("many" 41
        "token [ +FORM \"many\" +FROM \"0\" +TO \"3\" ... ]"))
      (509 noptcomp 0.277526 1 2
       (508 plur_noun_orule 0.274656 1 2
        (54 term_n1 0 1 2
         ("terms" 30
          "token [ +FORM \"terms\" +FROM \"5\" +TO \"9\" ... ]"))))))
    (514 hcomp 3.121 2 4
     (72 be_c_are -0.558293 2 3
      ("are" 32
       "token [ +FORM \"are\" +FROM \"11\" +TO \"13\" ... ]"))
     (513 hoptcomp 1.8935 3 4
      (512 punct_period_orule 0 3 4
       (80 generic_adj 0 3 4
        ("ambiguous." 40
         "token [ +FORM \"ambiguous.\" +FROM \"15\" +TO \"24\" ... ]")))))))

The derivation is a tree whose core is composed of identifiers for grammar entities, i.e. the names of grammar rules and lexical entries. In our example, subjh, bare_np, adjn, noptcomp, and so forth name grammar rules of the ERG. In contrast, many_a1, term_n1, be_c_are, and generic_adj are the identifiers of the lexical entries used in this derivation.
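As a minimal sketch of this distinction, assume the example derivation has been transcribed into nested Python lists, with leaves reduced to their token strings. This representation and the helper name `entities` are chosen here purely for illustration and are not part of UDF or itsdb. Nodes directly above the leaves are taken to name lexical entries, a one-field topmost node is taken to be the start symbol, and all remaining nodes are taken to name grammar rules.

```python
# The example derivation, transcribed by hand into nested lists
# (leaves abbreviated to the token string only).
derivation = [
    "root_strict",
    [515, "subjh", 5.63927, 0, 4,
     [511, "bare_np", 0.986543, 0, 2,
      [510, "adjn", -0.115529, 0, 2,
       [44, "many_a1", -0.657932, 0, 1, ["many"]],
       [509, "noptcomp", 0.277526, 1, 2,
        [508, "plur_noun_orule", 0.274656, 1, 2,
         [54, "term_n1", 0, 1, 2, ["terms"]],
        ],
       ],
      ],
     ],
     [514, "hcomp", 3.121, 2, 4,
      [72, "be_c_are", -0.558293, 2, 3, ["are"]],
      [513, "hoptcomp", 1.8935, 3, 4,
       [512, "punct_period_orule", 0, 3, 4,
        [80, "generic_adj", 0, 3, 4, ["ambiguous."]],
       ],
      ],
     ],
    ],
]


def is_leaf(node):
    """A leaf is a list that contains no nested lists."""
    return not any(isinstance(x, list) for x in node)


def entities(expr):
    """Return (rules, lexical_entries), each in pre-order."""
    rules, lexical = [], []

    def walk(node):
        if is_leaf(node):
            return
        fields = [x for x in node if not isinstance(x, list)]
        daughters = [x for x in node if isinstance(x, list)]
        if len(fields) == 1:
            pass  # root node: names the start symbol, not a grammar entity
        elif all(is_leaf(d) for d in daughters):
            lexical.append(fields[1])  # preterminal: lexical entry
        else:
            rules.append(fields[1])    # internal node: grammar rule
        for daughter in daughters:
            walk(daughter)

    walk(expr)
    return rules, lexical


rules, lexical = entities(derivation)
print(rules)
print(lexical)  # → ['many_a1', 'term_n1', 'be_c_are', 'generic_adj']
```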

All internal nodes but the topmost one name grammar rules, and all preterminal nodes name lexical entries. The topmost (or root) node is special, in that it identifies the start symbol ('root' feature structure) used to license the derivation. This element of the derivation tree is also optional, i.e. it may or may not be present. Finally, the terminal (or leaf) nodes of the tree correspond to the actual input to the parser, i.e. a sequence of input tokens. Note that UDF allows some format variation in recording leaf nodes, as token feature structures will only be available in some configurations, viz. when parsing in PET using the so-called chart mapping machinery (see the PetInput page for background). Thus, a legitimate variation of the example above would be to represent terminal nodes by just the token string, e.g.

  (...
   (54 term_n1 0 1 2
    ("terms")))

Further note that the example above abbreviates some of the information in the actual token feature structures.
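To make the bracketing and string quoting in the examples above concrete, the following is a minimal sketch of a reader that turns UDF text into nested Python lists. It is not the itsdb derivation reader; the function name `read_udf` and the list representation are assumptions made for illustration. All atoms are kept as strings, and quoted strings have their backslash escapes undone, so both leaf variants (with and without token feature structures) are read the same way.

```python
import re

# Tokens: an opening or closing parenthesis, a double-quoted string
# (possibly containing backslash escapes), or a bare atom.
TOKEN = re.compile(r'\(|\)|"(?:\\.|[^"\\])*"|[^\s()"]+')


def read_udf(text):
    """Read the first parenthesized expression in text into nested lists."""
    stack = []      # enclosing lists that are still open
    current = None  # list being filled; None while outside all parentheses
    for match in TOKEN.finditer(text):
        token = match.group()
        if token == "(":
            stack.append(current)
            current = []
        elif current is None:
            raise ValueError(f"unexpected {token!r} outside parentheses")
        elif token == ")":
            finished, current = current, stack.pop()
            if current is None:
                return finished  # the outermost expression is complete
            current.append(finished)
        elif token.startswith('"'):
            # Strip the surrounding quotes and undo backslash escapes.
            current.append(re.sub(r"\\(.)", r"\1", token[1:-1]))
        else:
            current.append(token)
    raise ValueError("unbalanced parentheses or empty input")


example = r'''
(54 term_n1 0 1 2
 ("terms" 30
  "token [ +FORM \"terms\" +FROM \"5\" +TO \"9\" ... ]"))
'''
tree = read_udf(example)
print(tree[:5])   # → ['54', 'term_n1', '0', '1', '2']
print(tree[5][0])  # → terms
```

Keeping every atom as a string defers the interpretation of fields (integers, scores, entity names) to a later step, which keeps the reader independent of the leaf format in use.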

Unified Derivation Format — More Formally

All regular nodes of derivation trees provide five fields of information, followed by one or more daughters, in the format

( id entity score start end daughter⁺ )

The id is an integer uniquely identifying the corresponding chart edge (or corresponding objects, in non-chart universes); when analyzing an ambiguous input, where a sub-structure may be shared across distinct analyses, it is expected that such shared nodes will have the same id across derivations (but only relative to one input, of course).

The entity field is the most important part of the information recorded in the derivation, naming the grammar entity that gave rise to this node.

The score field is a floating point number, recording the probabilistic score (or heuristic weight, or whatever) of the node, where applicable. In parsing with PET or the LKB, for example, the score field will contain the unnormalized MaxEnt score assigned to the underlying chart edge, i.e. the sum of all weights λ_i for features f_i present in the edge, including its daughters.

Finally, the start and end fields record the sub-string corresponding to each node, measured in inter-word positions, for example chart vertices. Strictly speaking, this information is redundant, as it could be derived from the nesting of nodes, relative to the sequence of preterminals.
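The redundancy can be illustrated with a small sketch that recomputes start and end purely from the nesting. It assumes a derivation fragment transcribed into nested Python lists (an illustrative representation, not part of UDF) and, importantly, that every preterminal occupies exactly one position in the preterminal sequence. The helper names `derive_spans` and `recorded_spans` are hypothetical.

```python
# The bare_np sub-tree of the example derivation, leaves abbreviated.
fragment = [
    511, "bare_np", 0.986543, 0, 2,
    [510, "adjn", -0.115529, 0, 2,
     [44, "many_a1", -0.657932, 0, 1, ["many"]],
     [509, "noptcomp", 0.277526, 1, 2,
      [508, "plur_noun_orule", 0.274656, 1, 2,
       [54, "term_n1", 0, 1, 2, ["terms"]],
      ],
     ],
    ],
]


def is_leaf(node):
    return not any(isinstance(x, list) for x in node)


def derive_spans(expr, offset=0):
    """Recompute (entity, start, end) per node from the nesting alone.

    Assumption: each preterminal covers exactly one position.
    """
    spans = []

    def walk(node, start):
        daughters = [x for x in node if isinstance(x, list)]
        slot = len(spans)
        spans.append(None)  # reserve the pre-order position of this node
        if all(is_leaf(d) for d in daughters):
            end = start + 1  # preterminal: one step in the sequence
        else:
            end = start
            for daughter in daughters:
                end = walk(daughter, end)  # daughters are adjacent
        spans[slot] = (node[1], start, end)
        return end

    walk(expr, offset)
    return spans


def recorded_spans(expr):
    """Collect the (entity, start, end) values as recorded in the nodes."""
    spans = []

    def walk(node):
        if is_leaf(node):
            return
        spans.append((node[1], node[3], node[4]))
        for daughter in (x for x in node if isinstance(x, list)):
            walk(daughter)

    walk(expr)
    return spans


print(derive_spans(fragment) == recorded_spans(fragment))  # → True
```

Neither function handles the optional one-field root node; the sketch expects a regular node at the top.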

For the purpose of recording the exact 'recipe' used in deriving an analysis, all fields but the entity are optional. However, the UDF syntax requires all five fields to be instantiated on each (non-top and non-leaf) node. By convention, numeric fields (especially the score) can be left underspecified by giving them the value '-1'.
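The node format described in this section can be sketched as a small set of Python types. This is an illustrative reading of the specification, not code from itsdb: the class names `Node` and `Leaf`, the functions `build` and `build_derivation`, and the nested-list input (atoms as strings or numbers, daughters as lists) are all assumptions. A node without daughters is treated as a leaf, a topmost node with a single field as the optional root, and every other node must carry exactly five fields.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Leaf:
    form: str          # the token string
    extras: List[str]  # any further leaf fields, e.g. a token feature structure


@dataclass
class Node:
    id: int
    entity: str
    score: float
    start: int
    end: int
    daughters: List[Union["Node", Leaf]]


def split(expr):
    """Separate the atomic fields of a node from its daughter lists."""
    fields = [x for x in expr if not isinstance(x, list)]
    daughters = [x for x in expr if isinstance(x, list)]
    return fields, daughters


def build(expr):
    fields, daughters = split(expr)
    if not daughters:
        return Leaf(fields[0], fields[1:])
    if len(fields) != 5:
        raise ValueError(f"expected five fields, found {len(fields)}: {fields}")
    ident, entity, score, start, end = fields
    return Node(int(ident), entity, float(score), int(start), int(end),
                [build(d) for d in daughters])


def build_derivation(expr):
    """Return (root, top), where root is None if no root node is present."""
    fields, daughters = split(expr)
    if len(fields) == 1 and len(daughters) == 1:
        return fields[0], build(daughters[0])
    return None, build(expr)


def is_unspecified(value):
    """True for numeric fields left underspecified by the '-1' convention."""
    return value == -1


root, top = build_derivation(
    ["root_strict", ["54", "term_n1", "0", "1", "2", ["terms"]]])
print(root, top.entity, top.start, top.end)  # → root_strict term_n1 1 2
```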

Known Bugs

As of early 2009, PET is known to sometimes 'forget' to include the root node in derivations returned to itsdb; as the top-most node (with only one field, which is how it is recognized) is optional in the syntax specification, this is only a problem of missing information. However, PET is also known to sometimes output root nodes (one field, naming a root feature structure, not a grammar rule or lexical entry) as internal nodes in derivation trees, which violates the UDF syntax. itsdb works around these divergences silently, i.e. the derivation reader will ignore such superfluous, internal nodes with only one field of information. The current theory is that both bugs are related to the same chart edge being used both as the root of one analysis, and simultaneously as an internal node in another derivation.
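The workaround described here can be sketched as follows. This is not the itsdb implementation, only one plausible way to ignore superfluous internal root nodes; the function names and the nested-list representation are assumptions, and the input below is constructed by hand to mimic the bug rather than taken from actual PET output.

```python
def is_root_like(node):
    """A node with a single field followed by exactly one daughter."""
    return (len(node) == 2
            and not isinstance(node[0], list)
            and isinstance(node[1], list))


def drop_internal_roots(expr):
    """Return a copy of expr with internal one-field nodes spliced out.

    The topmost node is never removed, so a legitimate root survives.
    """
    fields = [x for x in expr if not isinstance(x, list)]
    daughters = []
    for child in (x for x in expr if isinstance(x, list)):
        while is_root_like(child):
            child = child[1]  # skip the superfluous node, keep its daughter
        daughters.append(drop_internal_roots(child))
    return fields + daughters


# Hand-constructed input with a stray root node inside the tree.
buggy = [
    "root_strict",
    [514, "hcomp", 3.121, 2, 4,
     [72, "be_c_are", -0.558293, 2, 3, ["are"]],
     ["root_strict",
      [513, "hoptcomp", 1.8935, 3, 4,
       [512, "punct_period_orule", 0, 3, 4,
        [80, "generic_adj", 0, 3, 4, ["ambiguous."]],
       ],
      ],
     ],
    ],
]

cleaned = drop_internal_roots(buggy)
print(cleaned[0])        # → root_strict
print(cleaned[1][6][1])  # → hoptcomp
```

Leaves are unaffected, because a leaf never contains a nested list and so never matches `is_root_like`.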
