dereckson/extract-proper-nouns

This script extracts proper nouns from an English text using NLTK part-of-speech tagging.
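
As a rough illustration of the approach (not the script's actual code), the sketch below tags each token and keeps those carrying the Penn Treebank proper-noun tags `NNP` and `NNPS`. The function names `proper_nouns_from_tagged` and `extract_proper_nouns` are hypothetical, and the second one assumes NLTK and its resources are already installed.

```python
def proper_nouns_from_tagged(tagged_tokens):
    """Keep tokens whose Penn Treebank tag marks a proper noun (NNP or NNPS)."""
    return [word for word, tag in tagged_tokens if tag in ("NNP", "NNPS")]


def extract_proper_nouns(text):
    """Tokenize and tag `text` with NLTK, then keep the proper nouns."""
    # Imported lazily: this call needs NLTK plus its tokenizer and tagger data.
    import nltk

    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    return proper_nouns_from_tagged(tagged)
```

Keeping the tag filter separate from the NLTK call makes it possible to check the selection logic on hand-tagged input without loading any model.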

Install dependencies
--------------------
* Install NLTK according to your OS (pkg install nltk on FreeBSD for example)
* Install numpy (pkg install py27-numpy)
* Download the needed NLTK resources with nltk.download():
    * averaged_perceptron_tagger
    * maxent_treebank_pos_tagger
    * punkt
    * treebank
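
The resources listed above can be fetched in one go from Python. This is a minimal sketch: `NLTK_RESOURCES` and `download_resources` are names made up for the example, and the optional `download` parameter exists only so the loop can be exercised without network access.

```python
# Resource identifiers taken from the list above.
NLTK_RESOURCES = [
    "averaged_perceptron_tagger",
    "maxent_treebank_pos_tagger",
    "punkt",
    "treebank",
]


def download_resources(download=None):
    """Call the downloader once per resource; defaults to nltk.download."""
    if download is None:
        # Imported lazily so the module loads even where NLTK is absent.
        import nltk

        download = nltk.download
    for name in NLTK_RESOURCES:
        download(name)
```

Calling `download_resources()` once after installing NLTK should be enough; resources already present are not downloaded again.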

Source text
-----------
You need a copy of the text you want to extract from as plain text.

Source English word list
------------------------
The expected format is a lowercase list with one noun per line.
The file should be named wordsEn.txt; to use another name, change it in the eliminate-common-nouns script.

Such a file was available from [SIL](http://web.archive.org/web/20141122213941/http://www-01.sil.org/linguistics/wordlists/english/) (archived copy).

Usage
-----
To extract the proper nouns from a text:

    ./extract-proper-nouns source.txt > nouns.txt

To sort them and eliminate duplicates:

    ./extract-proper-nouns source.txt | sort | uniq > nouns.txt

To discard known English words:

    ./eliminate-common-nouns nouns.txt
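
The last step, discarding known English words, could look roughly like the sketch below. It is not the code of eliminate-common-nouns: the helper names are hypothetical, and comparing candidates case-insensitively against the lowercase word list is an assumption.

```python
def load_word_list(path="wordsEn.txt"):
    """Read the word list (one lowercase word per line) into a set."""
    with open(path) as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def discard_known_words(candidates, known_words):
    """Keep only the candidates that are absent from the word list."""
    # Assumption: a candidate is compared in lowercase, so "River" matches "river".
    return [word for word in candidates if word.lower() not in known_words]
```

A set is used for the word list so that each lookup stays fast even with a list of several hundred thousand entries.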

Acknowledgment
--------------

Thanks to Rama for suggesting NLTK and for some brief guidance.

The original code idea is from Alvations and can be found at http://stackoverflow.com/a/17672491/1930997.
