
SaarlandValence

LarsHellan edited this page Sep 4, 2013 · 4 revisions

DELPH-IN meeting 2013: SIG on cross-linguistic Valence repositories

Lars Hellan

DELPH-IN grammars typically have ‘complete’ verb-valence repositories with a shared design, though not necessarily with the same type inventories. The differences of course reflect linguistic differences, but also differences in notational systems.

If these repositories can be aligned, the result will be an interesting linguistic resource, useful for many purposes.

One way of ‘operationalizing’ such an alignment may be to create a multilingual valence database. The link below leads to a newly created one (technically constructed by Tore Bruland and me, using independently created resources), so far with two members, Norwegian and Ga:

http://regdili.idi.ntnu.no:8080/multilanguage_valence_demo/multivalence

The best strategy for creating an aligned repository is presumably to design a ‘neutral’ notation, to which all the systems can map in equal fashion, and which is friendlier to the ‘human eye’ than many of the grammar-internal codes; in particular, it need not be restricted to the TDL format. The demo tries to implement this idea. Aside from the demo itself, the enclosed files illustrate one attempt at finding such codes. The enclosed files are for Norwegian; those for Ga have the same design:

The file ‘SAS types No’ is an inventory of syntactic argument structures, in a format that is fairly easy to read and similar to what some of the existing monolingual valence banks use. It was first made for use in TypeCraft, by Dorothee and me, and slightly extended for use in the valence demo.

The file ‘funct types No’ supplements the SAS types file with traditional-style labels like ‘intransitive’, ‘intransitiveWithOblique’, etc. These carry less detailed information, but are perhaps more accessible to many people, and they are applicable across various styles of SAS files.

The file ‘sit types No’ is a rudimentary assortment of situation type labels, a few of them listing ‘-arity’, others trying to give classifications in terms of ‘content’. The latter reflect labels actually used in Norsource, but they are sparsely used there, so here they mainly provide a foretaste of how such a system could be built.

The file ‘sit types_with_hierarchy No’ presents the same types as above in a mildly hierarchically ordered manner, and contains in addition a larger list used in Hellan and Dakubu 2010, whose members are candidates for inclusion in later developments.

‘CorrList…’ shows the conversion from the grammar-internal types, listed in ‘Lex-types_No’, to the format of ‘SAS’, ‘funct’, ‘sit’ and aspect (the last so far minimally employed, since it is not revealed in lexical type labels in Norsource).
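To make the idea of such a correspondence list concrete, here is a minimal sketch in Python of what one row and a lookup might look like. It is not the format of the actual file: the class `CorrEntry`, the function `convert`, the lexical type names, the SAS codes ‘NP’ and ‘NP+PP’, and the labels ‘ditransitiveWithOblique’ and ‘transfer’ are all invented for illustration. Only ‘intransitive’, ‘intransitiveWithOblique’ and ‘NP+NP+PP’ are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorrEntry:
    """One row of a correspondence list (hypothetical layout)."""
    lex_type: str                 # grammar-internal lexical type label
    sas: str                      # syntactic argument structure code
    funct: str                    # traditional-style functional label
    sit: Optional[str] = None     # situation type, often left unspecified
    aspect: Optional[str] = None  # rarely recoverable from the type label


# Invented rows; the lexical type names are placeholders, not Norsource types.
CORR_LIST = [
    CorrEntry("v-intr_example", "NP", "intransitive"),
    CorrEntry("v-intr-obl_example", "NP+PP", "intransitiveWithOblique"),
    CorrEntry("v-ditr-obl_example", "NP+NP+PP", "ditransitiveWithOblique",
              sit="transfer"),
]


def convert(lex_type, corr_list=CORR_LIST):
    """Return the neutral codes recorded for a grammar-internal type."""
    for entry in corr_list:
        if entry.lex_type == lex_type:
            return entry
    raise KeyError(f"no correspondence entry for {lex_type!r}")
```

Keeping ‘sit’ and ‘aspect’ optional mirrors the situation described above, where these dimensions are only sparsely filled in.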

The demo is created from the conversions mentioned, together with the lexicon files of Norwegian. Similarly for Ga.

For ‘new’ languages, these types of files are all that is needed. (To create the input for the demo itself, only the Corr-list for the language is actually needed, together with its lexicon file(s). From the Corr-list, the other files can be generated. However, in order to consistently build up the Corr-list, it is helpful to have the other lists already compiled.)
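The dependency between the files can be sketched as follows, assuming (hypothetically) that a Corr-list is a list of rows of the form (lexical type, SAS code, functional label) and that a lexicon maps lemmas to lexical types. All names and data here are placeholders; the real files and the demo's actual input format may differ.

```python
# Hypothetical Corr-list rows: (lexical type, SAS code, functional label).
CORR_ROWS = [
    ("type_a", "NP", "intransitive"),
    ("type_b", "NP+PP", "intransitiveWithOblique"),
    ("type_c", "NP+PP", "intransitiveWithOblique"),
]

# Hypothetical lexicon: lemma -> lexical type.
LEXICON = {"lemma1": "type_a", "lemma2": "type_b", "lemma3": "type_c"}


def derive_inventories(rows):
    """Recover the SAS and funct inventories from the Corr-list alone."""
    sas_types = sorted({sas for _, sas, _ in rows})
    funct_types = sorted({funct for _, _, funct in rows})
    return sas_types, funct_types


def build_demo_input(rows, lexicon):
    """Join lexicon and Corr-list: one record per lemma with its neutral codes."""
    by_type = {lex_type: (sas, funct) for lex_type, sas, funct in rows}
    records = []
    for lemma, lex_type in sorted(lexicon.items()):
        sas, funct = by_type[lex_type]
        records.append(
            {"lemma": lemma, "lex_type": lex_type, "sas": sas, "funct": funct}
        )
    return records
```

In this sketch the inventories are simply the distinct values found in the Corr-list, which is why the Corr-list and the lexicon suffice as input.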

For ‘new’ languages, some code extensions will be called for. For Spanish, for instance, phenomena that immediately came up were ‘pro-drop’, clitic doubling, and the use of ‘a’ as a marker of objects. Case and free word order are also not encountered in Norwegian and Ga. Code for such parameters obviously has to be designed in connection with languages where they are substantially instantiated.

(For instance, in the present SAS code, ‘NP+NP+PP’ reflects fixed word order, in the sense that in its instantiations the order could not be different. To represent that the order of the NP and PP is arbitrary, one could perhaps write ‘… NP, PP’, but that would need to be considered more closely.)
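One possible reading of that tentative notation can be tried out in a few lines of Python. The interpretation is an assumption made only for this sketch: ‘+’ joins slots in fixed order, and a slot containing commas is a group whose members may occur in any order. The function names `parse_sas`, `allowed_orders` and `matches` are invented here.

```python
from itertools import permutations


def parse_sas(code):
    """Split a code into slots; a slot with commas is an unordered group."""
    return [
        tuple(part.strip() for part in slot.split(","))
        for slot in code.split("+")
    ]


def allowed_orders(code):
    """Enumerate every surface order of categories that the code admits."""
    orders = [()]
    for slot in parse_sas(code):
        orders = [
            prefix + perm
            for prefix in orders
            for perm in set(permutations(slot))
        ]
    return set(orders)


def matches(code, observed):
    """True if the observed category sequence is one of the admitted orders."""
    return tuple(observed) in allowed_orders(code)
```

Under this reading, `matches("NP+NP+PP", ["NP", "PP", "NP"])` is false, while `matches("NP+NP, PP", ["NP", "PP", "NP"])` is true. Whether commas should scope over single slots in this way is exactly the kind of question that would need closer consideration.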

To be done:

  • In the Ga demo, each lexical item has an example, since this lexicon file comes from a Toolbox project. In the Norwegian demo, there is only a representative example for the type in question, which is a bit bleak. It would be good if the slot Examples could be either populated, or occupied by hyperlinks to repositories with annotated examples (like TypeCraft), or even to parsers.

  • The format of the demo does of course not require that the lexical types and resources come from a grammar; that is just the enterprise considered at present. (Indeed, if filled in via some other route, a grammar might be partially constructible from the information there.) Among potentially interesting alternative provenance routes is the database of the ‘Leipzig Valency Classes project’ (acronym ‘ValPaL’), which will be available for free download from the coming October on; interoperability between the systems has not been explored yet. Toolbox projects are another type of ‘source’, also when not sifted via a grammar (as it is in the present case; Bender et al. 2012 describe another Toolbox-LKB conversion that could perhaps be tried in the present setting of ‘being sifted through a grammar’).

An additional thing that can be done, once valence codes of different grammars are mapped onto a common code:

  • Developing parallel testsuites, indexed according to the common code.
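As a rough illustration of what ‘indexed according to the common code’ could mean in practice, the following Python sketch groups test items by code and then by language. The item identifiers, the codes and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical test items: (language, common valence code, sentence id).
ITEMS = [
    ("Norwegian", "NP+PP", "no-001"),
    ("Ga", "NP+PP", "ga-014"),
    ("Norwegian", "NP", "no-002"),
]


def index_by_code(items):
    """Group items as code -> language -> list of sentence ids."""
    index = defaultdict(lambda: defaultdict(list))
    for language, code, sentence_id in items:
        index[code][language].append(sentence_id)
    return index


def parallel_codes(index, languages):
    """Codes for which every listed language has at least one test item."""
    return sorted(
        code
        for code, by_lang in index.items()
        if all(lang in by_lang for lang in languages)
    )
```

With such an index, the codes covered by all languages are the ones for which parallel test items exist, and the remaining codes show where a language still lacks coverage.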

Something that perhaps could be done:

  • Using the valence bank in MT, for instance via a strategy of first identifying possible meaning equivalents (via aligned WordNet, for instance), then checking whether any of them match in valence, and then, if so, doing ‘isomorphic’ transfer as a first choice.
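The three steps of that strategy can be sketched in Python as follows. This is only a toy rendering under stated assumptions: the dictionaries stand in for an aligned WordNet and for a target-language valence bank, and every lemma, code and function name is a placeholder rather than part of any existing system.

```python
# Hypothetical meaning equivalents, e.g. as obtained from an aligned WordNet.
EQUIVALENTS = {"src_verb": ["tgt_verb_1", "tgt_verb_2"]}

# Hypothetical target-language valence bank: lemma -> set of common codes.
TARGET_VALENCE = {
    "tgt_verb_1": {"NP+NP"},
    "tgt_verb_2": {"NP+PP", "NP"},
}


def rank_candidates(source_lemma, source_code):
    """Return (candidate, valence_match) pairs, valence matches first."""
    candidates = EQUIVALENTS.get(source_lemma, [])
    scored = [
        (cand, source_code in TARGET_VALENCE.get(cand, set()))
        for cand in candidates
    ]
    # Stable sort: matching candidates move to the front,
    # and the original order is otherwise kept.
    return sorted(scored, key=lambda pair: not pair[1])
```

A candidate flagged as a valence match would be the first choice for ‘isomorphic’ transfer; candidates without a match remain available as a fallback.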

Well, these are possibilities, among many more …

Lars 020913
