
[Question]: Issue with NER Model: BIO-format Labels Not Recognized #3317

Open
SPVillacorta opened this issue Sep 13, 2023 · 6 comments
Labels
question Further information is requested

Comments

@SPVillacorta

SPVillacorta commented Sep 13, 2023

Question

Hi Flair Community, I'm attempting to train a NER model using Flair, but my BIO-formatted labels are not recognised. I've converted my CSV annotations to CoNLL format and checked that they load correctly. This is the code I tried to use:

# Imports and other setup
import flair
import glob
import nltk
import os
import pandas as pd
import pdfplumber
from flair.data import Sentence, Corpus
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

nltk.download("punkt")

MODEL_DIR = "./model"
DATA_DIR = "./data"
PDF_DIR = "./pdfs"


# Function to convert CSV to CoNLL
def csv_to_conll(csv_file, conll_file):
    df = pd.read_csv(csv_file)

    with open(conll_file, 'w') as f:
        for index, row in df.iterrows():
            # Check if the row is entirely composed of NaN values
            if pd.isna(row['text']) and pd.isna(row['label']):
                f.write("\n")
                continue

            word = row['text']
            label = row['label']
            
            # This checks if either 'text' or 'label' is NaN, and skips that row with a warning
            if pd.isna(word) or pd.isna(label):
                print(f"Warning: Skipping row {index} due to NaN value.")
                continue

            f.write(f"{word}\t{label}\n")

            
# Convert CSV files to CoNLL format
csv_to_conll(f"{DATA_DIR}/train.csv", f"{DATA_DIR}/train.conll")
csv_to_conll(f"{DATA_DIR}/dev.csv", f"{DATA_DIR}/dev.conll")
csv_to_conll(f"{DATA_DIR}/test.csv", f"{DATA_DIR}/test.conll")

# Function to convert PDF to CoNLL
def pdf_to_conll(pdf_dir: str, data_dir: str):
    pdf_paths = glob.glob(os.path.join(pdf_dir, "*.pdf"))
    texts = []

    for pdf_path in pdf_paths:
        with pdfplumber.open(pdf_path) as pdf:
            # extract_text() may return None for a page with no text layer, so fall back to ""
            text = "\n".join([page.extract_text() or "" for page in pdf.pages])
            texts.append(text)
            
    with open(os.path.join(data_dir, "pdfs.conll"), "w") as f:
        for text in texts:
            sentences = nltk.sent_tokenize(text)
            for sentence in sentences:
                sentence = sentence.replace("\n", " ").replace("\t", " ")
                f.write(f"{sentence}\tO\n")
            f.write("\n")
    return texts


# Function to train the model
def train(data_dir: str, model_dir: str):
    pdf_to_conll(PDF_DIR, DATA_DIR)
    
    # Assuming CoNLL formatted CSV files are named as train.conll, dev.conll, test.conll
    columns = {0: 'text', 1: 'ner'}
    corpus: Corpus = ColumnCorpus(data_dir, columns,
                                  train_file='train.conll',
                                  dev_file='dev.conll',
                                  test_file='test.conll')

    label_type = 'ner'
    tag_dictionary = corpus.make_label_dictionary(label_type=label_type)

    embeddings: StackedEmbeddings = StackedEmbeddings(
        [
            FlairEmbeddings("mix-forward"),
            FlairEmbeddings("mix-backward"),
        ]
    )

    tagger: SequenceTagger = SequenceTagger(
        hidden_size=256,
        embeddings=embeddings,
        tag_dictionary=tag_dictionary,
        tag_type=label_type,
        use_crf=True,
    )

    trainer: ModelTrainer = ModelTrainer(tagger, corpus)
    trainer.train(
        model_dir,
        learning_rate=0.2,
        mini_batch_size=30,
        max_epochs=100,
    )


# Call your train function
train(DATA_DIR, MODEL_DIR)

When I run this, the F-score, precision, and recall are all zero. Any ideas on what could be going wrong?

SPVillacorta added the "question" label (Further information is requested) on Sep 13, 2023
@nvenkat94

I'm having the same issue.

@alanakbik
Collaborator

Sorry for the late reply! @SPVillacorta did you solve the problem? If not, could you share a snippet of the dataset you are loading?

@nvenkat94 could you expand on your problem?

alanakbik added the "Awaiting Response" label (Waiting for new input from the author) on Oct 24, 2023
@SPVillacorta
Author

OK, the "train.conll" file looks like the following:

matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity I-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
The O
on O
between O
250 O
the O
The O
the O
below O
are O
virtually O
oxides O
skin O
Gole O
to O
all O
published O
southern O
deposits B-ORE_DEPOSIT
sorted O

@nvenkat94

nvenkat94 commented Oct 26, 2023

Thanks for your valuable response, @alanakbik.
My issue has been fixed. Earlier, my data had "O" directly before "I-" tags; after I revised the input data, the problem went away. @SPVillacorta, your input data has the same problem with the "I-" tag. Wherever there is an "I-" tag, the previous tag should be a "B-" (or another "I-") of the same entity type.

Tag details:
B-: Beginning of an entity
I-: Inside (continuation of) an entity
O: Outside any entity

Your data should be in the following format:

matching O
i.e. O
presumably O
from O
Mamba O
These O
since O
prospectivity B-PROCESS
fibrous O
base O
ore O
the O
20 O
based O
Andy O
simply O
martite B-MINERAL
Bungaroo B-PLACE
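
For anyone who wants to check their own file for this, here is a minimal sketch in plain Python. The function name `find_orphan_inside_tags` is hypothetical (it is not part of Flair), and the sketch assumes the rows passed in belong to a single sentence and that tags follow the `B-TYPE` / `I-TYPE` / `O` convention used above.

```python
from typing import Iterable, List, Tuple


def find_orphan_inside_tags(rows: Iterable[Tuple[str, str]]) -> List[Tuple[int, str, str]]:
    """Return (index, token, tag) for every I- tag that does not continue an entity.

    Hypothetical helper, not a Flair function. An I-X tag is treated as valid
    only when the previous tag is B-X or I-X (same entity type X).
    """
    problems = []
    previous_tag = "O"
    for index, (token, tag) in enumerate(rows):
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous_tag not in (f"B-{entity_type}", f"I-{entity_type}"):
                problems.append((index, token, tag))
        previous_tag = tag
    return problems


# A few rows taken from the train.conll snippet posted earlier in this thread.
sample = [
    ("since", "O"),
    ("prospectivity", "I-PROCESS"),
    ("fibrous", "O"),
    ("martite", "B-MINERAL"),
]
print(find_orphan_inside_tags(sample))  # [(1, 'prospectivity', 'I-PROCESS')]
```

An empty list means every "I-" tag follows a matching "B-" or "I-" tag.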

@alanakbik
Collaborator

Thanks for sharing the info! Yes, in IOB2 the first tag of every entity should be a B-. @SPVillacorta does this fix your issue?
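
If the annotations cannot easily be redone by hand, one possible repair is to promote every "I-" tag that does not continue an entity to a "B-" tag. This is only a sketch under that assumption: `to_iob2` is a hypothetical helper, not a Flair API, and it expects the tags of one sentence at a time.

```python
from typing import List


def to_iob2(tags: List[str]) -> List[str]:
    """Rewrite a tag sequence so that every entity starts with a B- tag.

    Hypothetical helper, not part of Flair. An I-X tag that does not follow
    B-X or I-X is rewritten to B-X; all other tags are left unchanged.
    """
    fixed = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            entity_type = tag[2:]
            if previous not in (f"B-{entity_type}", f"I-{entity_type}"):
                tag = f"B-{entity_type}"
        fixed.append(tag)
        previous = tag  # compare the next tag against the repaired value
    return fixed


print(to_iob2(["O", "I-PROCESS", "O", "B-MINERAL", "I-MINERAL"]))
# ['O', 'B-PROCESS', 'O', 'B-MINERAL', 'I-MINERAL']
```

This changes labels automatically, so it is worth spot-checking the output before training on it.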

@adambuttrick

adambuttrick commented Jan 22, 2024

I just ran into this issue attempting to load training data like so, based on an example I found elsewhere:

from flair.data import Corpus
from flair.datasets import ColumnCorpus
import torch

columns = {0: 'text', 1: 'ner'}
tag_type = 'ner'
corpus = ColumnCorpus('/content/drive/MyDrive/training_data/flair/', columns)
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
print(tag_dictionary)

and then noticed the deprecation message about make_tag_dictionary being replaced with make_label_dictionary and so switched to:

tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)

...at which point the data loaded successfully.

The deprecated method and its warning message make it seem as if make_tag_dictionary still works, especially if you don't inspect the tag dictionary itself, but it does not appear to. I'm commenting to flag this in case anyone else comes across the issue and is looking to resolve it.
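
One way to sanity-check the dictionary that gets printed is to count the tags in the column file independently and compare the two. The sketch below is plain Python and does not call Flair; `count_tags` is a hypothetical helper, and it assumes whitespace-separated columns with the tag in column 1, matching the `columns = {0: 'text', 1: 'ner'}` mapping above.

```python
from collections import Counter
from typing import Iterable


def count_tags(lines: Iterable[str], tag_column: int = 1) -> Counter:
    """Count how often each tag occurs in whitespace-separated column data.

    Hypothetical helper, not part of Flair. Blank lines (sentence boundaries)
    and rows with too few columns are skipped.
    """
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) <= tag_column:
            continue
        counts[parts[tag_column]] += 1
    return counts


example_lines = ["martite B-MINERAL", "Bungaroo B-PLACE", "", "The O", "on O"]
print(count_tags(example_lines))
```

An open file handle can be passed directly, since iterating over it yields lines. If tags that appear in these counts are missing from the printed dictionary, the labels were probably not picked up when the dictionary was built.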

github-actions bot removed the "Awaiting Response" label (Waiting for new input from the author) on Jan 22, 2024
4 participants