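This experiment uses the minimum edit distance (MED, also known as Levenshtein distance) algorithm to find the correct spelling of misspelled words from the Birkbeck corpus, with the WordNet dictionary as the source of candidate words. The average success at k, that is, the fraction of misspellings whose correct spelling appears among the k closest dictionary words, is calculated for k = 1, 5, and 10.
Keywords: Spell correction, Levenshtein distance, Corpus, Dictionary, Natural Language Processing.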
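The following is a minimal sketch of how Levenshtein distance and success at k could be computed. It is not the code in the utils folder: the function names `levenshtein`, `top_k`, and `success_at_k` are hypothetical, and it assumes unit cost for insertions, deletions, and substitutions, with ties between equally distant words broken by dictionary order.

```python
def levenshtein(source, target):
    """Minimum edit distance, assuming unit cost for every operation."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def top_k(misspelling, dictionary, k):
    """Return the k dictionary words closest to the misspelling."""
    # sorted() is stable, so equally distant words keep dictionary order.
    ranked = sorted(dictionary, key=lambda word: levenshtein(misspelling, word))
    return ranked[:k]


def success_at_k(pairs, dictionary, k):
    """Fraction of (misspelling, correct) pairs whose correct word is in the top k."""
    hits = sum(1 for wrong, right in pairs if right in top_k(wrong, dictionary, k))
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the Birkbeck corpus.
    toy_dictionary = ["apple", "ample", "banana"]
    toy_pairs = [("aple", "ample"), ("bananna", "banana")]
    for k in (1, 2):
        print(k, success_at_k(toy_pairs, toy_dictionary, k))
```

Sorting the whole dictionary for every misspelling is simple but slow for a dictionary of WordNet's size; keeping only the k smallest distances (for example with `heapq.nsmallest`) would be one way to reduce the cost.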
Two files, SHEFFIELDDAT.643 and FAWTHROP1DAT.643, from the Birkbeck spelling error corpus by Roger Mitton were used for this experiment. Together they contain 1,193 misspelled words and the correct spelling of each.
The WordNet dictionary contains 147,306 words.
You can find the modules and libraries used in this project in the requirements.txt file. To install them, run the command below.
pip install -r requirements.txt
- Data: contains the Birkbeck corpus files used for this project.
- images: contains the bar graph showing the average success at k.
- utils: contains the essential functions for this project.
- Assignment_#1.ipynb and Assignment_#1.py: the Python notebook and script that use the functions in the utils folder to generate the results.
Glory Odeyemi is currently pursuing a Master's degree in Computer Science, with a specialization in Artificial Intelligence, at the University of Windsor, Windsor, ON, Canada. You can connect with her on LinkedIn.