This project contains source code for research into the automation of literature reviews using Python and NLTK. The CosIng-Toxicity case study in particular uses data about cosmetic ingredients to search for research into their toxicity.
This project was carried out in collaboration with the Kanazawa University Practical Pharmacology Laboratory. The goal was to verify an automated literature review process using natural language processing (NLP).
The data for this project comes from the following resources:
- Cosmetic ingredient database (CosIng) - Ingredients and Fragrance inventory for a comprehensive list of cosmetic ingredients
- PubChem PUG View web service to collect data on each ingredient's therapeutic uses and toxicity
- PubMed E-utilities to search research papers for adverse effects on skin related to each compound
- Natural Language Toolkit (NLTK) to process the acquired paper abstracts for relevance
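As a rough illustration of the PubMed step in the list above, the sketch below builds an E-utilities `esearch` request URL for one ingredient. It only constructs the URL and does not call the service. The function name `build_pubmed_search_url`, the example ingredient, and the query wording are assumptions made for illustration; the search terms actually used in this project are not documented here.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(ingredient, max_results=20):
    """Return an ESearch URL for papers on skin-related adverse effects.

    The query wording is a hypothetical example, not the project's own.
    """
    term = f'"{ingredient}" AND skin AND (toxicity OR "adverse effects")'
    params = {
        "db": "pubmed",         # search the PubMed database
        "term": term,           # the search expression
        "retmode": "json",      # ask for a JSON response
        "retmax": max_results,  # cap the number of returned IDs
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Example ingredient chosen only for illustration.
url = build_pubmed_search_url("triclosan")
print(url)
```

Fetching this URL would return a list of PubMed IDs, whose abstracts could then be downloaded with a separate `efetch` request and passed on to the text-processing step.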
A presentation slideshow of this research, "A Natural Language Processing Approach to Reviewing Research Abstracts" by Robert Songer, is available on Slideshare.
Research literature reviews have largely moved online, and researchers must search through large quantities of digital documents to find research related to their academic pursuits. With recent developments in Natural Language Processing (NLP), computers can perform most of the searching and reduce the amount of time it takes researchers to find the papers they need. In this report, we introduce three basic NLP techniques (tokenization, frequency distributions, and in-sentence collocations) for searching the written texts of research abstracts downloaded from an online database. Real examples written in the Python programming language are provided, along with a discussion of their efficacy in a project at Kanazawa University in which an online research database was searched for research related to the adverse effects of hundreds of pharmaceutical compounds.
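The three techniques named in the abstract can be sketched in a few lines. To stay dependency-free, this version uses only the Python standard library as a stand-in; it is not the project's code. In NLTK the corresponding tools would be `nltk.word_tokenize`, `nltk.sent_tokenize`, and `nltk.FreqDist`. The sample abstract is invented, the helper names are hypothetical, and "in-sentence collocation" is interpreted here as two terms appearing in the same sentence, which is an assumption.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercase word tokens, keeping internal hyphens (e.g. 'patch-test')."""
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def cooccurs_in_sentence(text, term_a, term_b):
    """True if both terms appear together in at least one sentence."""
    for sentence in split_sentences(text):
        tokens = set(tokenize(sentence))
        if term_a in tokens and term_b in tokens:
            return True
    return False


# Invented sample text, used only to exercise the functions.
abstract = (
    "Triclosan caused skin irritation in a patch test. "
    "No systemic toxicity was observed. "
    "Irritation resolved after exposure to triclosan ended."
)

# Frequency distribution: how often each token occurs in the abstract.
freq = Counter(tokenize(abstract))
print(freq.most_common(3))

# In-sentence co-occurrence of a compound name and an effect term.
print(cooccurs_in_sentence(abstract, "triclosan", "irritation"))
print(cooccurs_in_sentence(abstract, "triclosan", "toxicity"))
```

Checking co-occurrence per sentence rather than per abstract is the stricter test: in the sample text, "toxicity" appears in the abstract but never in a sentence that mentions the compound, so it is not counted as a match.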