A simple converter from SpaCy Entities (Spans) to Huggingface BILOU formatted data (tokens and ner_tags)
I've always struggled to convert my SpaCy-formatted spans into data that Huggingface transformers can train on, even though SpaCy's entity format is the most intuitive way to tag entities for NER.
This repo is a simple converter that leverages spacy.gold.biluo_tags_from_offsets and the SpaCy tokenizations repo to provide a 1-line function that converts SpaCy-formatted spans to tokens and ner_tags that can be fed into any Token Classification Transformer.
You can demo the functionality on Streamlit or Spaces.
SpaCy format simply means a text input plus character-level span assignments.
For example:
text = "Hello, my name is Ben"
spans = [{"start": 18, "end": 21, "label": "person"}, ...]
This is the common structure of output data from labeling tools like LabelStudio or LabelBox, because it's easy to produce and human-interpretable.
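As a quick sanity check on this format, the snippet below slices the text with each span's offsets. It is a minimal sketch that assumes `end` is exclusive, as in Python slicing; `span_texts` is a hypothetical helper name, not part of spacy-to-hf.

```python
text = "Hello, my name is Ben"
spans = [{"start": 18, "end": 21, "label": "person"}]


def span_texts(text, spans):
    """Return (label, substring) pairs, treating `end` as exclusive."""
    return [(s["label"], text[s["start"]:s["end"]]) for s in spans]


print(span_texts(text, spans))  # [('person', 'Ben')]
```

If a substring comes back with stray whitespace or a clipped word, the offsets are probably off by one.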
Huggingface format refers to the BIO/BILOU/BIOES tagging format commonly used for fine-tuning transformers. The input text is tokenized, and each token is given a tag that denotes whether it belongs to a labeled entity and, if so, its position within that entity (Beginning, Inside, etc.). Here's an example: https://huggingface.co/datasets/wikiann
For more information about this tagging system, see Wikipedia.
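To make the relationship between the three schemes concrete, here is a small sketch that rewrites BILOU tags as BIO or BIOES. It relies only on the usual prefix conventions (L = Last, U = Unit in BILOU; E = End, S = Single in BIOES); `convert_bilou` is a hypothetical helper, not a function from this repo.

```python
def convert_bilou(tags, scheme="BIO"):
    """Rewrite BILOU tag prefixes into the BIO or BIOES scheme."""
    prefix_map = {
        "BIO": {"L": "I", "U": "B"},    # BIO has no Last/Unit prefixes
        "BIOES": {"L": "E", "U": "S"},  # same structure, different letters
    }[scheme]
    converted = []
    for tag in tags:
        if tag == "O":
            converted.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        converted.append(f"{prefix_map.get(prefix, prefix)}-{label}")
    return converted


bilou = ["O", "B-university", "I-university", "L-university", "O", "U-person"]
print(convert_bilou(bilou, "BIO"))
print(convert_bilou(bilou, "BIOES"))
```

BILOU and BIOES carry the same information, while converting to BIO drops the explicit end-of-entity marker.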
This format is tricky, though, because it is entirely dependent on the tokenizer used. Tokens are not simply space-separated words. Each tokenizer has a specific vocabulary of tokens that breaks words down into sub-words. So moving from character-level spans to token-level tags is a very manual process. That's a core reason I built this tool.
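The alignment problem can be sketched in a few lines. The example below is an illustration only, not how spacy-to-hf is implemented: it uses a simple regex tokenizer as a stand-in (an assumption; a real subword tokenizer such as bert-base-cased would split the text differently) and tags every token whose character range overlaps a span. `char_spans_to_bilou` is a hypothetical name.

```python
import re


def char_spans_to_bilou(text, spans, pattern=r"\w+|[^\w\s]"):
    """Align character-level spans to BILOU tags over regex tokens."""
    matches = list(re.finditer(pattern, text))
    tokens = [m.group() for m in matches]
    tags = ["O"] * len(tokens)
    for span in spans:
        # Indices of tokens whose character range overlaps the span.
        idx = [
            i for i, m in enumerate(matches)
            if m.start() < span["end"] and m.end() > span["start"]
        ]
        if not idx:
            continue
        label = span["label"]
        if len(idx) == 1:
            tags[idx[0]] = f"U-{label}"
        else:
            tags[idx[0]] = f"B-{label}"
            for i in idx[1:-1]:
                tags[i] = f"I-{label}"
            tags[idx[-1]] = f"L-{label}"
    return tokens, tags


text = "I have a BSc (Bachelors of Computer Sciences) from NYU"
spans = [
    {"start": 9, "end": 12, "label": "degree"},
    {"start": 14, "end": 44, "label": "degree"},
    {"start": 51, "end": 54, "label": "university"},
]
tokens, tags = char_spans_to_bilou(text, spans)
print(list(zip(tokens, tags)))
```

Swapping in a different tokenizer changes both the tokens and the tags, which is why the conversion has to be redone for each model.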
pip install spacy-to-hf
python -m spacy download en_core_web_sm
from spacy_to_hf import spacy_to_hf
span_data = [
    {
        "text": "I have a BSc (Bachelors of Computer Sciences) from NYU",
        "spans": [
            {"start": 9, "end": 12, "label": "degree"},
            {"start": 14, "end": 44, "label": "degree"},
            {"start": 51, "end": 54, "label": "university"},
        ],
    }
]
hf_data = spacy_to_hf(span_data, "bert-base-cased")
print(list(zip(hf_data["tokens"][0], hf_data["ner_tags"][0])))
Or, if you want to immediately start fine-tuning or upload this to Huggingface, you can run:
ds = spacy_to_hf(span_data, "bert-base-cased", as_hf_dataset=True)
This will return your data as a HuggingFace Dataset and will automatically string-index your ner_tags into a ClassLabel.
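For readers unfamiliar with string-indexing, the idea is roughly the following plain-Python sketch: collect the distinct tag strings, assign each an integer id, and store the ids alongside the list of names. The sorted ordering here is an assumption made for illustration, and `string_index` is a hypothetical helper; the actual ids produced by the library may differ.

```python
def string_index(tag_rows):
    """Map string tags to integer ids; return (names, indexed rows)."""
    # Assumption: ids follow the sorted order of the distinct tag names.
    names = sorted({tag for row in tag_rows for tag in row})
    str2int = {name: i for i, name in enumerate(names)}
    return names, [[str2int[tag] for tag in row] for row in tag_rows]


rows = [["O", "U-degree", "O"], ["B-degree", "L-degree"]]
names, ids = string_index(rows)
print(names)
print(ids)
```

Keeping `names` next to the integer ids is what lets a model's predictions be mapped back to readable tags.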
Project setup is credited to @anthonycorletti and his awesome project template repo