GitHub - dereckson/extract-proper-nouns: Extract proper nouns from an English text with NLTK POS tagging

This script extracts proper nouns from an English text with NLTK.

Install dependencies
--------------------
* Install NLTK according to your OS (pkg install nltk on FreeBSD for example)
* Install numpy (pkg install py27-numpy)
* Download the needed NLTK resources with nltk.download():
** averaged_perceptron_tagger
** maxent_treebank_pos_tagger
** punkt
** treebank
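
As a minimal sketch, the four resources listed above could be fetched in one go from Python. The `download_resources` helper and its `download` parameter are hypothetical names introduced here for illustration; only `nltk.download()` comes from NLTK itself. The sketch assumes Python 3 and a working network connection.

```python
# The four resource names listed above.
RESOURCES = [
    "averaged_perceptron_tagger",
    "maxent_treebank_pos_tagger",
    "punkt",
    "treebank",
]


def download_resources(download=None):
    """Fetch each resource and return a {name: success} mapping.

    `download` is a hypothetical hook so the function can be tried
    without network access; by default it is nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so this module loads without NLTK

        download = nltk.download
    return {name: bool(download(name)) for name in RESOURCES}
```

Calling `download_resources()` with no argument goes through `nltk.download()` for each name; any entry mapped to `False` did not download and is worth retrying by hand.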

Source text
-----------
You need a plain-text copy of the text you want to extract proper nouns from.

Source English word list
------------------------
The expected format is a lowercase list with one substantive word per line.
The file should be named wordsEn.txt; to use another name, change it in the eliminate-common-nouns script.

Such a file was available at [SIL](http://web.archive.org/web/20141122213941/http://www-01.sil.org/linguistics/wordlists/english/).
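
If your word list comes from another source, it may need to be brought into the expected format first. The following is a rough sketch under stated assumptions: `normalize_word_list` and `write_word_list` are hypothetical helpers, not part of this repository, and the code targets Python 3.

```python
def normalize_word_list(lines):
    """Return the sorted, unique, lowercase words found one per input line."""
    words = {line.strip().lower() for line in lines}
    words.discard("")  # drop blank lines
    return sorted(words)


def write_word_list(words, path="wordsEn.txt"):
    """Write one word per line to `path` (wordsEn.txt is the expected name)."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(words) + "\n")
```

For example, `write_word_list(normalize_word_list(open("raw-list.txt", encoding="utf-8")))` would produce a wordsEn.txt file in the current directory, where `raw-list.txt` is a placeholder for your own file.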

Usage
-----
./extract-proper-nouns source.txt > nouns.txt
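
To give an idea of what happens behind this command, here is a minimal sketch of proper noun extraction with NLTK POS tagging. It is not the script's actual code: the function names are hypothetical, and joining consecutive proper-noun tokens into one name is an assumption made for this example. The tags `NNP` and `NNPS` are the Penn Treebank tags for singular and plural proper nouns.

```python
def group_proper_nouns(tagged_tokens):
    """Join runs of consecutive NNP/NNPS tokens into single names."""
    names, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            # A non-proper-noun token ends the current run.
            names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names


def extract_proper_nouns(text):
    """Tokenize and tag `text` with NLTK, then collect its proper nouns."""
    import nltk  # needs the punkt and tagger resources to be downloaded

    names = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        names.extend(group_proper_nouns(tagged))
    return names
```

Tagging sentence by sentence keeps a name at the end of one sentence from being merged with a name at the start of the next.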

To sort them and eliminate duplicates:
./extract-proper-nouns source.txt | sort | uniq > nouns.txt

To discard known English words:
./eliminate-common-nouns nouns.txt
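
The filtering step can be pictured as a set lookup against the word list. The sketch below is an illustration only: `load_word_list` and `eliminate_common_nouns` are hypothetical functions, and comparing in lowercase is an assumption based on the word list being lowercase, not a description of what the script does.

```python
def load_word_list(path="wordsEn.txt"):
    """Read the word list into a set for fast membership tests."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip() for line in handle if line.strip()}


def eliminate_common_nouns(nouns, known_words):
    """Keep only the nouns whose lowercase form is not a known word."""
    known = set(known_words)
    return [noun for noun in nouns if noun.lower() not in known]
```

With this approach, a capitalized common word such as "House" at the start of a sentence is discarded, while a name absent from the list is kept.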

Acknowledgment
--------------

Thanks to Rama for suggesting NLTK and for some brief guidance.

The original code idea is from Alvations and can be seen at http://stackoverflow.com/a/17672491/1930997.