Entity linking consists of several steps. The first of them is Named Entity Recognition (NER), the process of identifying the key pieces of information in a text and classifying each of them into one of a set of predefined categories, such as people, organizations, and locations. An entity is a thing that is consistently talked about or referred to in a particular text.
NER consists of three main steps:
- Tokenization, which involves breaking the text into individual words or phrases.
- Part-of-speech tagging, which assigns a grammatical tag to each word.
- Entity recognition, which identifies and classifies the named entities in the text.
Suppose we have the following sentence: "Apple Inc. was founded by Steve Jobs in Cupertino, California." The process of NER would look like this:
- Tokenization: "Apple" "Inc." "was" "founded" "by" "Steve" "Jobs" "in" "Cupertino" "California"
- Part-of-speech tagging: "Apple" - Noun (specifically, a proper noun), "Inc." - Noun, "was" - Verb, "founded" - Verb, "by" - Preposition, "Steve" - Noun (a person's name), "Jobs" - Noun (a person's name), "in" - Preposition, "Cupertino" - Noun (a location name), "California" - Noun (a location name)
- Entity recognition: "Apple Inc." - This is an organization name. "Steve Jobs" - This is a person's name. "Cupertino" - This is a location name. "California" - This is another location name.
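These three steps can be reproduced with SpaCy, the library used in this project (see below). A minimal sketch, assuming the en_core_web_lg model is installed:

```python
import spacy

# Load the trained English pipeline (tokenizer, tagger, NER, ...).
nlp = spacy.load("en_core_web_lg")
doc = nlp("Apple Inc. was founded by Steve Jobs in Cupertino, California.")

# Tokenization: the text split into individual tokens.
print([token.text for token in doc])

# Part-of-speech tagging: a grammatical tag for each token.
print([(token.text, token.pos_) for token in doc])

# Entity recognition: the named entities with their labels.
print([(ent.text, ent.label_) for ent in doc.ents])
```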
In this project, deep-learning-based NER is used. It relies on machine learning algorithms that analyze text and identify patterns indicating the presence of named entities. These algorithms are trained on large datasets of annotated text, in which human annotators have labeled the named entities. SpaCy's NER uses word embeddings, which capture the semantic and syntactic relationships between words.
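A small illustration of what these word embeddings provide (the example words are chosen purely for illustration; the en_core_web_lg model used below ships with word vectors):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
inventor, engineer, banana = nlp("inventor engineer banana")

# Each token is mapped to a 300-dimensional vector ...
print(inventor.vector.shape)           # (300,)

# ... and semantically related words end up close to each other.
print(inventor.similarity(engineer))   # relatively high
print(inventor.similarity(banana))     # relatively low
```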
SpaCy provides trained models and pipelines. Currently, this project uses the en_core_web_lg model. It can be downloaded using the following command:
python -m spacy download en_core_web_lg
After that, the model is loaded in the Python file via
spacy.load('en_core_web_lg')
Below is a list of the SpaCy entity labels used when classifying the entities, together with their meanings:
- PERSON - people, including fictional characters
- NORP - nationalities, religious or political groups
- FAC - buildings, airports, highways, bridges, etc.
- ORG - companies, agencies, institutions, etc.
- GPE - countries, cities, states
- LOC - non-GPE locations, such as mountain ranges and bodies of water
- PRODUCT - objects, vehicles, foods, etc. (not services)
- EVENT - named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART - titles of books, songs, etc.
- LAW - named documents made into laws
- LANGUAGE - any named language
- DATE - absolute or relative dates or periods
- TIME - times smaller than a day
- PERCENT - percentages, including the "%" sign
- MONEY - monetary values, including the unit
- QUANTITY - measurements, such as weight or distance
- ORDINAL - "first", "second", etc.
- CARDINAL - numerals that do not fall under another type
For each entity the system returns:
- the surface form (text) of the entity
- the number of the starting character
- the number of the ending character
- the label, which is the name of one of the predefined classes
For example, for the text:
Nikola Tesla (Serbian Cyrillic: Никола Тесла) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist best known for his contributions to the design of the modern alternating current (AC) electricity supply system.
The entities would be:
[('Nikola Tesla', 0, 12, 'PERSON'),
('Serbian', 14, 21, 'NORP'),
('Тесла', 39, 44, 'PERSON'),
('Serbian', 52, 59, 'NORP')]
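A sketch of how such a list of tuples can be obtained with SpaCy (the exact spans and labels may vary slightly between model versions):

```python
import spacy

nlp = spacy.load("en_core_web_lg")
text = ("Nikola Tesla (Serbian Cyrillic: Никола Тесла) was a Serbian-American "
        "inventor, electrical engineer, mechanical engineer, and futurist best "
        "known for his contributions to the design of the modern alternating "
        "current (AC) electricity supply system.")
doc = nlp(text)

# Surface form, start character, end character and label for each entity.
print([(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents])
```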
The recognised mentions are therefore Nikola Tesla, Serbian, Тесла, and Serbian; in the labelled text each of these spans is highlighted with its entity label.
The documentation of the DBpedia ontology is available online, and a visualised version of the ontology can also be browsed online.
SPARQL can be used to query the ontology. SPARQLWrapper makes it possible to connect to a specific endpoint URL and query the data.
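A minimal sketch of such a query against the public DBpedia SPARQL endpoint (the endpoint URL and the example query are illustrative assumptions, not the project's actual configuration):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Connect to the public DBpedia endpoint and ask for results as JSON.
sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

# Example query: the English abstract of the resource for Nikola Tesla.
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?abstract WHERE {
        dbr:Nikola_Tesla dbo:abstract ?abstract .
        FILTER (lang(?abstract) = "en")
    }
""")

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["abstract"]["value"])
```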