A repository for converting popular NER datasets to CoNLL format.

mgupta1410/NER-data-standardization

Ontonotes 5.0

Ontonotes from LDC

  • Obtain Ontonotes 5.0 from Penn LDC Language Resources.
  • Log in to Penn LDC through your institutional login, or register at https://catalog.ldc.upenn.edu/signup
  • Request the resource at https://catalog.ldc.upenn.edu/LDC2013T19
  • A link to the resource will be emailed to you. Download and unpack the Ontonotes release.
  • Verify the directory structure by running the command -
$ tree -L 3 -d ontonotes-release-5.0/data/files/data/
ontonotes-release-5.0/data/files/data/
├── arabic
│   ├── annotations
│   │   └── nw
│   └── metadata
│       ├── frames
│       └── sense-inventories
├── chinese
│   ├── annotations
│   │   ├── bc
│   │   ├── bn
│   │   ├── mz
│   │   ├── nw
│   │   ├── tc
│   │   └── wb
│   └── metadata
│       ├── frames
│       └── sense-inventories
├── english
│   ├── annotations
│   │   ├── bc
│   │   ├── bn
│   │   ├── mz
│   │   ├── nw
│   │   ├── pt
│   │   ├── tc
│   │   └── wb
│   └── metadata
│       ├── context
│       ├── frames
│       └── sense-inventories
└── ontology
    └── sense-pools
    └── ...
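
The same check can be scripted rather than read off the tree output by eye. The following is a minimal sketch, not part of this repository: the function name missing_dirs and the EXPECTED_DIRS list are hypothetical, and the list is taken only from the listing above, so adjust it if your release is laid out differently.

```python
from pathlib import Path

# Sub-directories copied from the tree listing above (an assumption about
# your local copy; edit the list if your release differs).
EXPECTED_DIRS = [
    "arabic/annotations",
    "chinese/annotations",
    "english/annotations",
    "ontology",
]


def missing_dirs(data_root, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that are absent under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

Calling missing_dirs("ontonotes-release-5.0/data/files/data") should return an empty list when the release was unpacked as shown; any names it returns point to directories that are missing.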

CoNLL03-formatted Ontonotes

$ tree -L 7 -d conll-formatted-ontonotes-5.0-12
conll-formatted-ontonotes-5.0-12
└── conll-formatted-ontonotes-5.0
    └── data
        ├── conll-2012-test
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc
        │               ├── bn
        │               ├── mz
        │               ├── nw
        │               ├── pt
        │               ├── tc
        │               └── wb
        ├── development
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc
        │               ├── bn
        │               ├── mz
        │               ├── nw
        │               ├── pt
        │               ├── tc
        │               └── wb
        ├── test
        │   └── data
        │       └── english
        │           └── annotations
        │               ├── bc
        │               ├── bn
        │               ├── mz
        │               ├── nw
        │               ├── pt
        │               ├── tc
        │               └── wb
        └── train
            └── data
                └── english
                    └── annotations
                        ├── bc
                        ├── bn
                        ├── mz
                        ├── nw
                        ├── pt
                        ├── tc
                        └── wb
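
Before running any conversion, it can help to confirm that every split is present and actually contains skeleton files. The sketch below is illustrative only: count_skel_files and SPLITS are hypothetical names, the split names come from the listing above, and the "_skel" suffix is assumed from this README's description of the masked files.

```python
from pathlib import Path

# Split directory names as they appear in the tree listing above.
SPLITS = ["conll-2012-test", "development", "test", "train"]


def count_skel_files(conll_root, splits=SPLITS, suffix="_skel"):
    """Map each split name to the number of files under it ending in suffix.

    conll_root is the directory that contains the "data" folder.
    A split whose directory does not exist maps to None.
    """
    data_dir = Path(conll_root) / "data"
    counts = {}
    for split in splits:
        split_dir = data_dir / split
        if not split_dir.is_dir():
            counts[split] = None
            continue
        # Walk the split recursively and count regular files with the suffix.
        counts[split] = sum(
            1
            for p in split_dir.rglob("*")
            if p.is_file() and p.name.endswith(suffix)
        )
    return counts
```

For example, count_skel_files("conll-formatted-ontonotes-5.0-12/conll-formatted-ontonotes-5.0") returns one entry per split; a value of None or 0 suggests the archive was unpacked somewhere else or only partially.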

Recovering words

  • The original repo contains only the *_skel files, in which the original words are masked. The script skeleton2conll.sh produces *_gold files containing the unmasked words recovered from OntoNotes 5.0. Run the following command -
$ ./scripts/skeleton2conll.sh -D ontonotes-release-5.0/data/files/data conll-formatted-ontonotes-5.0-12/conll-formatted-ontonotes-5.0
  • The script assumes that Python 2.x is invoked as python2 on your system. If that is not the case, change the script accordingly.
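
Whether python2 is available can be checked before running the script. This is a small sketch using only the Python standard library; find_interpreter is a hypothetical helper, and the fallback name python2.7 is an assumption about how some systems name the interpreter, not something the script itself looks for.

```python
import shutil


def find_interpreter(candidates=("python2", "python2.7")):
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)  # None when the command is not on PATH
        if path:
            return path
    return None
```

If find_interpreter() returns None, no Python 2 interpreter was found under those names. If it returns a path for a name other than python2, that is the name to substitute in the script.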

Coming Soon!

TempEval-3.0 to CoNLL format

KBP 2018 to CoNLL format

Converting to BIO scheme
