Skip to content

Commit

Permalink
Update overview.md
Browse files Browse the repository at this point in the history
  • Loading branch information
gcelano committed Aug 22, 2017
1 parent 51e7204 commit 485257e
Showing 1 changed file with 4 additions and 6 deletions.
10 changes: 4 additions & 6 deletions statistics/overview.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,11 @@
# Overview

The corpus consists of
* 24540015 tokens
* 1318860 sentences
* 25750588 tokens
* 1384550 sentences

All tokens have been automatically assigned a POS tag by the Mate tagger (accuracy 88%; the tagger guesses a POS tag even when the word form is not known)

15272085 tokens have a lemma: i.e., non empty ```<l/>```, with at least one ```<l1/>``` and/or at least one ```<l2/>```.
16002732 tokens have a lemma: i.e., non empty ```<l/>```, with at least one ```<l1/>``` and/or at least one ```<l2/>```.

9267127 tokens have no lemmas: i.e., empty ```<l/>``` (803 should be added to this number, which are tokens that exceptionally do not have an ```<l/>``` element: they invariably show an erroneous form "#"). In the majority of cases, lemmas are missing because both Morpheus and PerseusUnderPhilologic do not contain the corresponding word forms. In a few cases, however, they contain the word form but its morphological analysis does not correspond to the one automatically assigned by the tagger (which is therefore to be considered, most likely, wrong):
* of the 9267127 tokens with missing lemmas 657627 are unique word forms (see values in ```<f/>```): Morpheus does not know 580415 of these unique forms, while PerseusUnderPhilologic 555970. 545011 unique word forms are missing both in Morpheus AND PerseusUnderPhilologic. As a consequence,
112616 (= 657627 - 545011) are unique word forms that are present in Morpheus and/or PerseusUnderPhilologic, but whose lemmas have not been retrieved because their morphological analyses differ from that assigned by the tagger.
9747856 tokens have no lemmas: i.e., empty ```<l/>```. In the majority of cases, lemmas are missing because both Morpheus and PerseusUnderPhilologic do not contain the corresponding word forms. In a few cases, however, they contain the word form but its morphological analysis does not correspond to the one automatically assigned by the tagger (which is therefore to be considered, most likely, wrong).

0 comments on commit 485257e

Please sign in to comment.