[Question]: Issue with NER Model: BIO-format Labels Not Recognized #3317

SPVillacorta · 2023-09-13T08:25:16Z

Question

Hi Flair Community, I'm attempting to train a NER model using Flair, but my BIO-formatted labels are not recognised. I've converted my CSV annotations to CoNLL format and checked that they load correctly. This is the code I'm using:

# Imports and other setup
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.data import Sentence, Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"
PDF_DIR = "./pdfs"


# Function to convert CSV to CoNLL
def csv_to_conll(csv_file, conll_file):
    df = pd.read_csv(csv_file)

    with open(conll_file, 'w') as f:
        for index, row in df.iterrows():
            # Check if the row is entirely composed of NaN values
            if pd.isna(row['text']) and pd.isna(row['label']):
                f.write("\n")
                continue

            word = row['text']
            label = row['label']
            
            # This checks if either 'text' or 'label' is NaN, and skips that row with a warning
            if pd.isna(word) or pd.isna(label):
                print(f"Warning: Skipping row {index} due to NaN value.")
                continue

            f.write(f"{word}\t{label}\n")

            
# Convert CSV files to CoNLL format
csv_to_conll(f"{DATA_DIR}/train.csv", f"{DATA_DIR}/train.conll")
csv_to_conll(f"{DATA_DIR}/dev.csv", f"{DATA_DIR}/dev.conll")
csv_to_conll(f"{DATA_DIR}/test.csv", f"{DATA_DIR}/test.conll")

# Function to convert PDF to CoNLL
def pdf_to_conll(pdf_dir: str, data_dir: str):
    pdf_paths = glob.glob(os.path.join(pdf_dir, "*.pdf"))
    texts = []

    for pdf_path in pdf_paths:
        with pdfplumber.open(pdf_path) as pdf:
            text = "\n".join([page.extract_text() for page in pdf.pages])
            texts.append(text)
            
    with open(os.path.join(data_dir, "pdfs.conll"), "w") as f:
        for text in texts:
            sentences = nltk.sent_tokenize(text)
            for sentence in sentences:
                sentence = sentence.replace("\n", " ").replace("\t", " ")
                f.write(f"{sentence}\tO\n")
            f.write("\n")
    return texts


# Function to train the model
def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)
    
    # Assuming CoNLL formatted CSV files are named as train.conll, dev.conll, test.conll
    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file='train.conll',
                                  dev_file='dev.conll',
                                  test_file='test.conll')

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )


# Call your train function
train(DATA_DIR, MODEL_DIR)

When I run this, the F-score, precision, and recall are all zero. Any ideas on what could be going wrong?


nvenkat94 · 2023-10-20T08:56:55Z

I'm having the same issue.

alanakbik · 2023-10-24T14:49:41Z

Sorry for the late reply! @SPVillacorta did you solve the problem? If not, could you share a snippet of the dataset you are loading?

@nvenkat94 could you expand on your problem?

SPVillacorta · 2023-10-26T06:37:10Z

ok the "train.conll" looks like the following:

matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity I-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
The O
on O
between O
250 O
the O
The O
the O
below O
are O
virtually O
oxides O
skin O
Gole O
to O
all O
published O
southern O
deposits B-ORE_DEPOSIT
sorted O

nvenkat94 · 2023-10-26T06:57:14Z

Thanks for your valuable response, @alanakbik.
My issue has been fixed. Earlier, my data had "O" immediately before "I-"; after I revised the input data, the issue went away. @SPVillacorta, your input data has the same problem with the "I-" tag: whenever a token is tagged "I-", the tag before it should be "B-" (or another "I-" of the same type).

Tag details:
B-: Beginning
I-: Inside
O: Outside

Your data should be in the following format:

matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity B-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
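
The rule described above (an "I-" tag must follow a "B-" or "I-" tag of the same type) can be checked before training. The following is a minimal sketch in plain Python, not part of Flair; the helper names `find_iob2_violations` and `repair_iob2` are hypothetical, and it assumes the tags of one sentence are passed in as a list.

```python
from typing import List, Tuple


def find_iob2_violations(tags: List[str]) -> List[Tuple[int, str]]:
    """Return (index, tag) for every I- tag that does not continue a span of the same type."""
    violations = []
    previous = "O"  # treat the start of the sentence as "outside"
    for i, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            # Valid only if the previous tag is B-X or I-X with the same X.
            if previous not in (f"B-{entity}", f"I-{entity}"):
                violations.append((i, tag))
        previous = tag
    return violations


def repair_iob2(tags: List[str]) -> List[str]:
    """Rewrite every orphaned I-X tag as B-X and leave all other tags unchanged."""
    repaired = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-") and previous not in (f"B-{tag[2:]}", f"I-{tag[2:]}"):
            tag = "B-" + tag[2:]
        repaired.append(tag)
        previous = tag
    return repaired


tags = ["O", "I-PROCESS", "O", "B-MINERAL", "I-MINERAL", "I-PLACE"]
print(find_iob2_violations(tags))  # [(1, 'I-PROCESS'), (5, 'I-PLACE')]
print(repair_iob2(tags))  # ['O', 'B-PROCESS', 'O', 'B-MINERAL', 'I-MINERAL', 'B-PLACE']
```

Run it once per sentence (that is, per block of lines separated by a blank line in the CoNLL file), since a span cannot continue across a sentence boundary.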

alanakbik · 2023-10-26T12:34:25Z

Thanks for sharing the info! Yes, in IOB2 the first tag of every entity span should be a B-. @SPVillacorta does this fix your issue?

adambuttrick · 2024-01-22T01:45:55Z

I just ran into this issue attempting to load training data like so, based on an example I found elsewhere:

from flair.data import Corpus
from flair.datasets import ColumnCorpus
import torch

columns = {0: 'text', 1: 'ner'}
tag_type = 'ner'
corpus = ColumnCorpus('/content/drive/MyDrive/training_data/flair/', columns)
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)

and then noticed the deprecation message about make_tag_dictionary being replaced with make_label_dictionary and so switched to:

tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

...at which point the data loaded successfully.

The deprecation message makes it seem as if the deprecated make_tag_dictionary still works, especially if you don't inspect the tag dictionary itself, but it does not appear to. I'm commenting to flag this in case anyone else comes across the issue and is looking to resolve it.
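
As a cross-check on the printed tag dictionary, the labels can also be counted directly from the column file without Flair. This is a rough sketch under stated assumptions: the helper `count_tags` is hypothetical, columns are whitespace-separated, and the tag is the last column. Flair's dictionary may list entity types without the B-/I- prefixes, so compare entity names rather than exact strings.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str]) -> Counter:
    """Count the last-column tag of every token line in CoNLL-style column data."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        # Skip blank sentence separators and lines with no tag column.
        if len(parts) < 2:
            continue
        counts[parts[-1]] += 1
    return counts


sample = [
    "martite B-MINERAL",
    "Bungaroo B-PLACE",
    "The O",
    "",
    "on O",
]
print(count_tags(sample))
```

For a real file, pass an open file handle instead of the list, for example `count_tags(open("train.conll", encoding="utf-8"))`. If entity tags show up here but the dictionary contains none of those entity types, the labels were not picked up when the dictionary was built.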

SPVillacorta added the question Further information is requested label Sep 13, 2023

alanakbik added the Awaiting Response Waiting for new input from the author label Oct 24, 2023

github-actions bot removed the Awaiting Response Waiting for new input from the author label Jan 22, 2024
