Commit ce29d90

Add NER tutorial and update documentation configuration
1 parent 76daa5b commit ce29d90

4 files changed, 60 additions and 3 deletions

doc/conf.py

Lines changed: 2 additions & 2 deletions
@@ -68,7 +68,7 @@
 #
 # This is also used if you do content translation via gettext catalogs.
 # Usually you set "language" from the command line for these cases.
-language = None
+language = "en"
 
 # List of patterns, relative to source directory, that match files and
 # directories to ignore when looking for source files.
@@ -171,6 +171,6 @@
 
 # Example configuration for intersphinx: refer to the Python standard library.
 intersphinx_mapping = {
-    'https://docs.python.org/': None,
+    'python': ('https://docs.python.org/', None),
     'pyobo': ('https://pyobo.readthedocs.io/en/latest/', None),
 }
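
Reviewer note: the second hunk moves `intersphinx_mapping` to the named form, where each key is a project name and each value is a `(target URL, inventory)` tuple. The sketch below shows one way to spot entries still in the URL-keyed form. `find_unnamed_entries` is a hypothetical helper written for this note and is not part of Sphinx or Gilda.

```python
# Hypothetical helper (an assumption for illustration, not a Sphinx API):
# returns the keys whose values are not in the named form
#   name -> (target_url, inventory)
def find_unnamed_entries(mapping):
    bad = []
    for key, value in mapping.items():
        is_named = (
            isinstance(value, tuple)
            and len(value) == 2
            and isinstance(value[0], str)
        )
        if not is_named:
            bad.append(key)
    return bad


# The mapping before this commit: the first entry uses the URL as its key.
old_mapping = {
    'https://docs.python.org/': None,
    'pyobo': ('https://pyobo.readthedocs.io/en/latest/', None),
}

# The mapping after this commit: every entry is named.
new_mapping = {
    'python': ('https://docs.python.org/', None),
    'pyobo': ('https://pyobo.readthedocs.io/en/latest/', None),
}

print(find_unnamed_entries(old_mapping))
print(find_unnamed_entries(new_mapping))
```

Run against the two versions of the mapping, the helper flags only the URL-keyed entry that this commit rewrites.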

doc/modules/index.rst

Lines changed: 7 additions & 0 deletions
@@ -36,6 +36,13 @@ Process
     :members:
     :show-inheritance:
 
+Named Entity Recognition
+------------------------
+.. automodule:: gilda.ner
+    :members:
+    :show-inheritance:
+
+
 Pandas Utilities
 ----------------
 .. automodule:: gilda.pandas_utils

doc/requirements.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-sphinx
+sphinx<7.0
 sphinx_autodoc_typehints
 sphinx_rtd_theme
 mock

gilda/ner.py

Lines changed: 50 additions & 0 deletions
@@ -1,9 +1,59 @@
+"""
+Gilda implements a simple dictionary-based named entity
+recognition (NER) algorithm. It can be used as follows:
+
+>>> from gilda.ner import annotate
+>>> text = "MEK phosphorylates ERK"
+>>> results = annotate(text)
+
+The results are a list of 4-tuples containing:
+- the text string matched
+- a :class:`gilda.ScoredMatch` instance containing the _best_ match
+- the position in the text string where the entity starts
+- the position in the text string where the entity ends
+
+In this example, the two concepts are grounded to FamPlex entries.
+
+>>> results[0][0], results[0][1].term.get_curie(), results[0][2], results[0][3]
+('MEK', 'fplx:MEK', 0, 3)
+>>> results[1][0], results[1][1].term.get_curie(), results[1][2], results[1][3]
+('ERK', 'fplx:ERK', 19, 22)
+
+If you directly look in the second part of the 4-tuple, you get a full
+description of the match itself:
+
+>>> results[0][1]
+ScoredMatch(Term(mek,MEK,FPLX,MEK,MEK,curated,famplex,None,None,None),\
+0.9288806431663574,Match(query=mek,ref=MEK,exact=False,space_mismatch=\
+False,dash_mismatches=set(),cap_combos=[('all_lower', 'all_caps')]))
+
+BRAT
+----
+Gilda implements a way to output annotation in a format appropriate for the
+`BRAT Rapid Annotation Tool (BRAT) <https://brat.nlplab.org/index.html>`_
+
+>>> from gilda.ner import get_brat
+>>> from pathlib import Path
+>>> brat_string = get_brat(results)
+>>> Path("results.ann").write_text(brat_string)
+>>> Path("results.txt").write_text(text)
+
+For brat to work, you need to store the text in a file with
+the extension `.txt` and the annotations in a file with the
+same name but extension `.ann`.
+"""
+
 from nltk.corpus import stopwords
 from nltk.tokenize import sent_tokenize
 
 from gilda import ScoredMatch, get_grounder
 from gilda.process import normalize
 
+__all__ = [
+    "annotate",
+    "get_brat",
+]
+
 stop_words = set(stopwords.words('english'))
 
 
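
Reviewer note: the new docstring describes results as 4-tuples with character offsets and mentions writing a `.ann` file for brat. The sketch below illustrates how such tuples could map onto brat's standoff lines. It is not Gilda's `get_brat` implementation: `to_standoff` and the `Entity` type label are assumptions made for this note, and a plain CURIE string stands in for the `ScoredMatch` object so the sketch runs without Gilda installed.

```python
from typing import List, Tuple

# Stand-in annotation: (matched text, CURIE, start, end). In Gilda the
# second element is a ScoredMatch; a CURIE string is used here instead.
Annotation = Tuple[str, str, int, int]


def to_standoff(annotations: List[Annotation], entity_type: str = "Entity") -> str:
    """Render annotations as brat standoff lines (illustrative only)."""
    lines = []
    for idx, (matched, curie, start, end) in enumerate(annotations, start=1):
        # Text-bound annotation: ID <TAB> type start end <TAB> covered text
        lines.append(f"T{idx}\t{entity_type} {start} {end}\t{matched}")
        # Annotator note that attaches the grounding to the annotation above
        lines.append(f"#{idx}\tAnnotatorNotes T{idx}\t{curie}")
    return "\n".join(lines) + "\n"


text = "MEK phosphorylates ERK"
annotations = [("MEK", "fplx:MEK", 0, 3), ("ERK", "fplx:ERK", 19, 22)]

# The end offset is exclusive, so slicing the text recovers each match.
for matched, _, start, end in annotations:
    assert text[start:end] == matched

standoff = to_standoff(annotations)
print(standoff)
```

The offsets used here are the ones shown in the docstring's doctest output, which is why slicing `text` with them returns the matched strings.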
