This project implements an NLP chatbot that answers questions about science.
The purpose of this project is to create an NLP-based chatbot that uses an extractive QA system to answer questions. The chatbot is designed to handle statements in multiple languages, determine the intent of each statement, and provide a response based on that intent.
When a statement is entered into the chatbot, its language is first detected. If the statement is in a language other than English, it is translated into English before being processed. This allows the chatbot to operate more efficiently and accurately, since it is trained to understand and respond in English. The translation is performed using the googletrans library.
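As a minimal sketch of this step, language detection and translation with googletrans 3.1.0a0 look roughly like this (the helper name `to_english` is ours, not the repo's):

```python
from googletrans import Translator

translator = Translator()

def to_english(text):
    """Detect the language of `text` and translate it to English if needed."""
    source_lang = translator.detect(text).lang
    if source_lang != "en":
        text = translator.translate(text, dest="en").text
    return text, source_lang

# e.g. to_english("¿Qué es la fotosíntesis?") -> ("What is photosynthesis?", "es")
```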
After language detection and translation, the statement's intent is determined. The chatbot can recognize and respond to greetings, goodbyes, and questions. For greetings and goodbyes, the chatbot provides a random response drawn from a set of pre-defined rules. Questions are passed on to the extractive QA system.
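A simplified illustration of this dispatch might look as follows; the keyword sets and canned replies here are placeholders, since the actual rules (which also use spaCy phrase similarity, per the file layout below) live in predictor.py:

```python
import random

# Placeholder rule sets for illustration; the real rules live in chat/predictor.py.
GREETINGS = {"hello", "hi", "hey", "good morning"}
GOODBYES = {"bye", "goodbye", "see you", "good night"}
GREETING_REPLIES = ["Hello!", "Hi there!", "Hey, how can I help?"]
GOODBYE_REPLIES = ["Goodbye!", "See you later!", "Bye, take care!"]

def classify_intent(statement):
    """Classify a statement as a greeting, goodbye, or question."""
    lowered = statement.lower().strip(" !.?")
    if lowered in GREETINGS:
        return "greeting"
    if lowered in GOODBYES:
        return "goodbye"
    return "question"

def respond(statement):
    intent = classify_intent(statement)
    if intent == "greeting":
        return random.choice(GREETING_REPLIES)
    if intent == "goodbye":
        return random.choice(GOODBYE_REPLIES)
    return None  # questions are routed to the extractive QA system
```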
The extractive QA system uses a model optimized to automatically identify and extract answers from a provided context. In this case, the model is the pretrained deepset/tinyroberta-squad2 model, fine-tuned on a science dataset called "SciQ". The model takes a question-context pair as input and is trained with the answer span as the label; the context serves as the document from which the answer is automatically extracted.
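For illustration, extractive QA over a question-context pair can be run through the transformers pipeline as below. The fine-tuned science model lives on the Hugging Face Hub under the project's own account (its exact ID is not given here), so the base deepset/tinyroberta-squad2 checkpoint is used as a stand-in:

```python
from transformers import pipeline

# Stand-in: the base SQuAD 2.0 checkpoint, before the SciQ fine-tuning.
qa = pipeline("question-answering", model="deepset/tinyroberta-squad2")

result = qa(
    question="What do plants use to make their own food?",
    context="Photosynthesis is the process by which plants use sunlight, "
            "water and carbon dioxide to create oxygen and energy as sugar.",
)
print(result["answer"], result["score"])
```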
The "Sciq" dataset is combined with the quartz and openbookqa science based datasets (found on Huggingface hub) by merging the provided contexts into a large single context. Due to the large size of the context, the NLP framework Haystack
is used to do the question/answering at scale. The Haystack framework retrieves the document and reads the data based on the model that was trained earlier. Once the system provides an answer, the response is translated back to the input language if required using the googletrans
library.
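A rough sketch of the retrieval-plus-reading setup, assuming a recent farm-haystack 1.x release (module paths differ between versions) and again using the base checkpoint as a stand-in for the fine-tuned model:

```python
from haystack.document_stores import InMemoryDocumentStore
from haystack.nodes import BM25Retriever, FARMReader
from haystack.pipelines import ExtractiveQAPipeline

# Index the merged science corpus as one document per paragraph.
document_store = InMemoryDocumentStore(use_bm25=True)
with open("chat/sci_wiki.txt") as f:
    paragraphs = [p.strip() for p in f.read().split("\n\n") if p.strip()]
document_store.write_documents([{"content": p} for p in paragraphs])

# BM25 narrows the corpus down; the reader extracts the answer span.
retriever = BM25Retriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/tinyroberta-squad2")
pipe = ExtractiveQAPipeline(reader, retriever)

prediction = pipe.run(
    query="What is photosynthesis?",
    params={"Retriever": {"top_k": 5}, "Reader": {"top_k": 1}},
)
print(prediction["answers"][0].answer)
```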
- Due to its large size, the model was trained once and uploaded to the Hugging Face Hub, from where it is loaded at runtime. The training script can be found in the 'chat' folder.
- The corpus used as our language context is also provided in the 'chat' folder, along with the code used to generate it.
- Enter a statement into the chatbot
- Determine the language of the statement and translate it to English if necessary
- Determine the intent of the statement (greeting, goodbye, or question)
- If it is a greeting or goodbye, provide a random pre-defined response
- If it is a question, pass the statement to the extractive QA system
- Retrieve the relevant documents and extract an answer using the model trained on the "SciQ", "OpenBookQA", and "QuaRTz" datasets
- Provide the answer and translate it back into the input language if required
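Putting the steps together, a toy end-to-end loop might look like the following. Unlike the real implementation, it skips Haystack retrieval and feeds the whole corpus to the reader, which is slow but keeps the example short; the greeting/goodbye rules are again placeholders:

```python
import random
from googletrans import Translator
from transformers import pipeline

translator = Translator()
qa = pipeline("question-answering", model="deepset/tinyroberta-squad2")

# The repo's merged science corpus; any plain-text reference document works.
with open("chat/sci_wiki.txt") as f:
    CONTEXT = f.read()

GREETINGS = {"hello", "hi", "hey"}
GOODBYES = {"bye", "goodbye"}

def chat(statement):
    # Steps 1-2: detect the language and normalize to English.
    source_lang = translator.detect(statement).lang
    english = (translator.translate(statement, dest="en").text
               if source_lang != "en" else statement)
    # Steps 3-5: dispatch on intent.
    lowered = english.lower().strip(" !.?")
    if lowered in GREETINGS:
        reply = random.choice(["Hello!", "Hi there!"])
    elif lowered in GOODBYES:
        reply = random.choice(["Goodbye!", "See you!"])
    else:
        # Step 6: extract an answer from the science context.
        reply = qa(question=english, context=CONTEXT)["answer"]
    # Step 7: translate the reply back if the input was not English.
    if source_lang != "en":
        reply = translator.translate(reply, dest=source_lang).text
    return reply

print(chat("¿Qué es la fotosíntesis?"))
```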
- googletrans==3.1.0a0
- nltk
- datasets
- transformers
- spacy
- tensorflow
- farm-haystack
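The dependencies can typically be installed with pip:

```
pip install googletrans==3.1.0a0 nltk datasets transformers spacy tensorflow farm-haystack
```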
- chat - Folder containing all custom files except for documentation.md
- nlp - Subfolder containing the English language model for the spaCy library, used to determine phrase similarity
- sci_wiki.txt - Document used as a large corpus for our QA context; the chatbot's performance can be improved by increasing the size of this reference document
- train_model.py - Script used to train the inference model
- documentation.md - Documentation
- predictor.py - Script for intent inference and answer prediction
- utils.py - Provides general utility functions