hellohaptik/chatbot_ner

Named Entity Recognition for chatbots

Chatbot NER is an open source framework custom built to support entity recognition in text messages. After thorough research on existing NER systems, the team at Haptik felt a strong need for a framework that is tailored to Conversational AI and also supports Indian languages. Chatbot NER currently supports English, Hindi, Gujarati, Marathi, Bengali and Tamil, as well as their code-mixed forms. The framework uses common patterns along with a few NLP techniques to extract the necessary entities from languages with sparse data. The API structure of Chatbot NER is designed with usability for Conversational AI applications in mind. The team at Haptik is continuously working towards porting this framework to all Indian languages and their respective local dialects.

Installation

Detailed documentation on how to set up Chatbot NER on your system using Docker is available here.

Supported Entities

| Entity type | Code reference | Description | Example | Supported languages (ISO 639-1 code) |
|---|---|---|---|---|
| Time | TimeDetector | Detect time from given text | tomorrow morning at 5, कल सुबह ५ बजे, kal subah 5 baje | 'en', 'hi', 'gu', 'bn', 'mr', 'ta' |
| Date | DateAdvancedDetector | Detect date from given text | next monday, agle somvar, अगले सोमवार | 'en', 'hi', 'gu', 'bn', 'mr', 'ta' |
| Number | NumberDetector | Detect number and respective units in given text | 50 rs per person, ५ किलो चावल, मुझे एक लीटर ऑइल चाहिए | 'en', 'hi', 'gu', 'bn', 'mr', 'ta' |
| Phone number | PhoneDetector | Detect phone number in given text | 9833530536, +91 9833530536, ९८३३४३०५३५ | 'en', 'hi', 'gu', 'bn', 'mr', 'ta' |
| Email | EmailDetector | Detect email in text | [email protected] | 'en' |
| Text | TextDetector | Detect custom entities in a text string using full text search in Datastore or based on a contextual model | Order me a pizza, मुंबई में मौसम कैसा है | Search supported for 'en', 'hi', 'gu', 'bn', 'mr', 'ta'; contextual model supported for 'en' only |
| PNR | PNRDetector | Detect PNR (serial) codes in given text | My flight PNR is 4SGX3E | 'en' |
| Regex | RegexDetector | Detect entities using custom regex patterns | My flight PNR is 4SGX3E | NA |
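
To make the regex row concrete, here is a minimal sketch of regex-based entity detection using only Python's standard `re` module. It does not call Chatbot NER itself: the function name `detect_with_regex` and the PNR pattern are assumptions made for illustration, not the repository's actual `RegexDetector` interface.

```python
import re


def detect_with_regex(message, pattern):
    """Return every match of `pattern` in `message` with its character span.

    Hypothetical helper for illustration only; not the Chatbot NER API.
    """
    return [
        {"value": m.group(0), "start": m.start(), "end": m.end()}
        for m in re.finditer(pattern, message)
    ]


# Assumed PNR shape: exactly six uppercase letters or digits.
PNR_PATTERN = r"\b[A-Z0-9]{6}\b"

matches = detect_with_regex("My flight PNR is 4SGX3E", PNR_PATTERN)
print(matches)
```

Returning the character span alongside the value lets a calling application highlight or replace the detected text in the original message.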

There are other custom detectors, such as city, budget and shopping size, which are derived from the primary detectors listed in the table above. They are currently supported in English only and are limited to Indian users. We are in the process of restructuring them to scale across languages and geographies, and their current versions might be deprecated in the future. For applications already in production, we recommend using only the primary detectors listed in the table above.

API structure

Detailed documentation of the APIs for all entity types is available here. The current API structure is built for easy access from conversational AI applications, but it can be used for other applications as well.

Framework Overview

In any conversational AI application, there are several entities to be identified, and the detection logic for one entity might differ from that for another. We have organised this repository as shown below.

[Image: entity hierarchy]

We have classified entities into four main types: numeral, pattern, temporal and textual.

  • numeral: This type contains all the entities that deal with numerals or numbers. For example, number detection, budget detection, size detection, etc.

  • pattern: This type contains all the detection logic where identification can be done using patterns or regular expressions. For example, email, phone_number, pnr, etc.

  • temporal: This type contains the detection logic for time and date.

  • textual: This type identifies entities by looking them up in a dictionary. It mainly covers detection of text (like cuisine, dish, restaurants, etc.), the names of cities, the location of a user, etc.
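
The four types above can be illustrated with a small, self-contained sketch. Everything in it is hypothetical (the function names, the `CUISINES` dictionary, the phone and time patterns, and the `DETECTORS` registry); it shows the idea behind each category, not the code in ner_v1 or ner_v2.

```python
import re
import unicodedata


def normalize_digits(text):
    """Map decimal digits from any script (e.g. Devanagari ५) to ASCII."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch for ch in text
    )


def detect_numeral(text):
    """numeral: plain numbers, whatever digit script they are written in."""
    return re.findall(r"\d+", normalize_digits(text))


def detect_pattern(text):
    """pattern: ten-digit phone numbers with an optional +91 prefix (assumed shape)."""
    return re.findall(
        r"(?<!\d)(?:\+91[\s-]?)?\d{10}(?!\d)", normalize_digits(text)
    )


def detect_temporal(text):
    """temporal: clock times such as '5 pm' or '10:30 am'."""
    return re.findall(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", text.lower())


# Hypothetical dictionary standing in for a datastore of known values.
CUISINES = {"pizza", "biryani", "dosa"}


def detect_textual(text):
    """textual: dictionary lookup on the words of the message."""
    return [w for w in re.findall(r"\w+", text.lower()) if w in CUISINES]


# Registry keyed by entity type, mirroring the four categories.
DETECTORS = {
    "numeral": detect_numeral,
    "pattern": detect_pattern,
    "temporal": detect_temporal,
    "textual": detect_textual,
}


def detect(category, text):
    """Dispatch a message to the detector registered for `category`."""
    return DETECTORS[category](text)


print(detect("numeral", "५ किलो चावल"))
print(detect("pattern", "call me on +91 9833530536"))
print(detect("temporal", "wake me at 10:30 am"))
print(detect("textual", "Order me a pizza"))
```

Normalising digits before matching is one simple way to let a single pattern serve several scripts, which is the kind of language portability the category split is meant to support.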

Numeral, temporal and pattern detectors have been moved to ner_v2 for language portability and more flexible detection logic. In ner_v1, only the text entity currently has language support. We will move it to ner_v2 without any major API changes.

Contribution Guidelines

Currently, you can contribute to ner_v2 in Chatbot NER either by adding training data or by contributing detection patterns in the form of regular expressions. We will work on removing a few architectural limitations, which will ease the process of adding ML models and new entities in the future.

  • Adding Training Data: You can significantly improve the detection capabilities of Chatbot NER simply by adding data to CSV files. For example, date detection in Hindi and Hinglish can be improved by adding data to the CSV files shown in the image below. Refer to the documentation for date, time and numbers respectively if you wish to contribute. [Image: Date Contribution]
  • Adding Detection Pattern: You can add custom patterns for different languages by writing simple functions. An example of adding a custom pattern for detecting the number of people can be found here.
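
Both kinds of contribution can be sketched in a few lines of Python. The CSV layout (`weekday_index`, `name_variants`), the function names and the people-count pattern below are assumptions made for illustration; consult the linked documentation for the actual file formats and function signatures used in ner_v2.

```python
import csv
import io
import re

# Hypothetical CSV layout: one row per weekday, name variants separated by "|".
WEEKDAY_CSV = """weekday_index,name_variants
0,monday|somvar|सोमवार
1,tuesday|mangalvar|मंगलवार
"""


def load_variants(csv_text):
    """Build a lookup from each name variant to its weekday index."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for variant in row["name_variants"].split("|"):
            lookup[variant.strip().lower()] = int(row["weekday_index"])
    return lookup


def detect_weekday(message, lookup):
    """Return the weekday index of the first known variant, else None."""
    for token in message.lower().split():
        if token in lookup:
            return lookup[token]
    return None


def detect_people_count(message):
    """Hypothetical pattern function: '<number> people/person(s)/log/लोग'."""
    match = re.search(r"(\d+)\s*(?:people|persons?|log|लोग)", message.lower())
    return int(match.group(1)) if match else None


lookup = load_variants(WEEKDAY_CSV)
print(detect_weekday("agle somvar", lookup))
print(detect_weekday("अगले सोमवार", lookup))
print(detect_people_count("table for 4 people"))
```

In this sketch, adding a new row or a new variant to the CSV text extends coverage without touching the detection code, which is why data contributions are the easier of the two routes.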

Please refer to the general contribution steps, approval process and coding guidelines mentioned here.