
BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications


Code for the paper BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications, to appear at the IEEE Spoken Language Technology Workshop (SLT 2022).

Automatic speech recognition (ASR) allows transcribing the communications between air traffic controllers (ATCOs) and aircraft pilots. The transcriptions are later used to extract ATC named entities, e.g., aircraft callsigns. One common challenge is speech activity detection (SAD) and speaker diarization (SD). In the failure condition, two or more segments remain in the same recording, jeopardizing the overall performance (see the figure below). We propose a system that combines SAD and a BERT model to perform speaker change detection and speaker role detection (SRD) by chunking ASR transcripts, i.e., SD with a defined number of speakers, together with SRD. The proposed model is evaluated on real-life public ATC databases. Our BERT SD baseline reaches up to 10% and 20% token-based Jaccard error rate (JER) on public and private ATC databases. We also achieved relative improvements of 32% in JER and 7.7% in SD error rate (DER) compared to VBx, a well-known SD system.

Our system

Pipeline for BERT-based text diarization.
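At a high level, the pipeline tags every token of an ASR transcript with a speaker role (e.g., ATCO or pilot), and a speaker change is hypothesized wherever the predicted tag changes. The snippet below is a minimal, hypothetical sketch of that post-processing idea; the tag names and the helper function are illustrative, not the repository's actual label set or code:

from itertools import groupby

def tokens_to_turns(tokens, tags):
    """Group consecutive tokens that share the same speaker-role tag into turns.

    tokens: words of an ASR transcript
    tags:   one speaker-role tag per token (illustrative labels, e.g. "atco"/"pilot")
    """
    turns = []
    for tag, group in groupby(zip(tokens, tags), key=lambda pair: pair[1]):
        turns.append((tag, " ".join(word for word, _ in group)))
    return turns

# Toy example: an ATCO instruction followed by the pilot's readback
tokens = ("csa one two tree bravo descend flight level eight zero "
          "descending flight level eight zero csa one two tree bravo").split()
tags = ["atco"] * 10 + ["pilot"] * 10
for speaker, text in tokens_to_turns(tokens, tags):
    print(f"{speaker}: {text}")
# atco: csa one two tree bravo descend flight level eight zero
# pilot: descending flight level eight zero csa one two tree bravo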

Pretrained resources on the HuggingFace Hub:

  • Fine-tuned BERT-base-uncased (token classification) on UWB-ATCC data: https://huggingface.co/Jzuluaga/bert-base-token-classification-for-atc-en-uwb-atcc
  • UWB-ATCC corpus prepared in the datasets library format: https://huggingface.co/datasets/Jzuluaga/uwb_atcc
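If you just want to try the released checkpoint, a minimal sketch with the HuggingFace transformers token-classification pipeline is shown below. It assumes the checkpoint loads as a standard token-classification model and that transformers is installed; for the paper's exact setup, use the scripts in src/ instead:

from transformers import pipeline

# Assumption: the Hub checkpoint is compatible with the generic
# token-classification pipeline (not verified against the repo's own scripts).
tagger = pipeline(
    "token-classification",
    model="Jzuluaga/bert-base-token-classification-for-atc-en-uwb-atcc",
    aggregation_strategy="simple",  # merge sub-word pieces into word-level groups
)

text = "csa one two tree bravo contact praha radar one two zero decimal two seven five good bye"
for group in tagger(text):
    print(group["entity_group"], group["word"], round(float(group["score"]), 3))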

Repository written by: Juan Pablo Zuluaga.



Preparing Environment

The first step is to create your environment with the required packages for data preparation, formatting, and running the experiments. You can run the following commands to create the conda environment (assuming CUDA 11.7):

  • Step 1: using Python 3.10, install Python and the requirements
git clone https://github.com/idiap/bert-text-diarization-atc
conda create -n diarization python==3.10
conda activate diarization
python -m pip install -r requirements.txt

Before running any script, make sure you have the en_US locale set and that PYTHONPATH includes the repository root folder.

export LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8
export PYTHONPATH=$PYTHONPATH:$(pwd) # assuming you are in the repository root folder

Usage

There are several steps to replicate/use our proposed models:

Download the Data

For our experiments, we used 3 public databases and 2 private databases (see Table 1 in the paper). We provide scripts to replicate some of the results ONLY for the public databases.

Go to the data folder and follow the step-by-step process (easy) in the README file.

TL;DR for 1 public & free corpus:

conda activate diarization
bash data/databases/uwb_atcc/data_prepare_uwb_atcc_corpus.sh
bash data/databases/uwb_atcc/exp_prepare_uwb_atcc_corpus.sh

The output folders should be in experiments/data/uwb_atcc/{train,test}.

Training one model

Here, we describe how to train one model with the UWB-ATCC corpus, which is free!

Most of the training and evaluation scripts are in the src/ folder. The training procedure is very simple. You can train a baseline model with UWB-ATCC by calling the high-level script:

bash src/train_one_model.sh \
  --dataset "uwb_atcc" \
  --train-data experiments/data/uwb_atcc/train/diarization/utt2text_tags \
  --test-data experiments/data/uwb_atcc/test/diarization/utt2text_tags \
  --output-dir "experiments/results/baseline"

Additionally, you can modify some training hyperparameters by calling train_diarization.py (which is called internally by src/train_one_model.sh) directly and passing values from the CLI, e.g., --train-batch-size 64 (instead of the default 32), or use another encoder, e.g., --input-model "bert-large-uncased":

python3 src/train_diarization.py \
    --report-to none \
    --epochs 4 \
    --seed 1234 \
    --max-train-samples -1 \
    --train-batch-size 32 \
    --eval-batch-size 16 \
    --warmup-steps 500 \
    --logging-steps 1000 \
    --save-steps 10000 \
    --eval-steps 500 \
    --max-steps 3000 \
    --input-model bert-base-uncased \
    --test-data experiments/data/uwb_atcc/test/diarization/utt2text_tags \
    experiments/data/uwb_atcc/train/diarization/utt2text_tags \
    experiments/results/baseline/bert-base-uncased/1234/uwb_atcc

Train baselines

We have prepared some scripts to replicate some baselines from our paper.

  1. Script to run and evaluate the baseline BERT models for UWB-ATCC and LDC-ATCC (see Table 3 in the paper):
bash train_baselines.sh
  2. Script to run and evaluate the BERT models with data augmentation for UWB-ATCC and LDC-ATCC (see Section 3.4 and Table 4 in the paper).

You can either train only one model (examples for the UWB-ATCC and LDC-ATCC corpora):

bash ablations/train_uwb_atcc_baseline_augmentation.sh
# or, for LDC-ATCC,
bash ablations/train_ldc_atcc_baseline_augmentation.sh

or you can train 5 models (per corpus) with different seeds:

bash ablations/train_uwb_atcc_5seeds_augmentation.sh
# or, for LDC-ATCC,
bash ablations/train_ldc_atcc_5seeds_augmentation.sh

Evaluate models (optional)

We have prepared two scripts to evaluate and perform inference with a given model, e.g., one trained and evaluated on the UWB-ATCC corpus:

  • To evaluate the model and print the metrics in the training folder:
bash src/eval_model.sh \
  --DATA "experiments/data" \
  --batch-size 16 \
  --dataset "uwb_atcc" \
  --output-dir "experiments/results/baseline"
  • To get outputs in the utt2text_tags format:
bash src/run_inference.sh \
  --DATA "experiments/data" \
  --batch-size 16 \
  --dataset "uwb_atcc" \
  --output-dir "experiments/results/baseline"

If you want to do something more specific, like training on the UWB-ATCC corpus and evaluating on the ATCO2-test-set corpus, you can use the Python script directly:

# this is the folder where the model is located
EXP_FOLDER=experiments/results/baseline/bert-base-uncased/1234/uwb_atcc/

python3 src/eval_diarization.py \
    --input-model "$EXP_FOLDER/final_checkpoint" \
    --batch-size 32 \
    --input-files "experiments/data/atco2_corpus/test/diarization/utt2text_tags" \
    --test-names "atco2_corpus" \
    --output-folder "$EXP_FOLDER/evaluations"

This will generate outputs in $EXP_FOLDER/evaluations, or in $EXP_FOLDER/inference if you use inference_diarization.py instead of eval_diarization.py.


Evaluate DER outputs of your model

Here, we describe briefly how to evaluate the outputs of your model with standard acoustic-based metrics, e.g., DER and JER.

This is especially useful when evaluating the model on ASR transcripts. Here, you first need to perform forced alignment to map text tokens to acoustic timings.

  1. You need to get the forced alignment between speech/transcription pairs using a forced-alignment toolkit, e.g., the Kaldi aligner, to obtain a CTM file, which looks like this:

uwb_atcc_augmented_00000_C 1 0.09 0.05 wizz 1.00 
uwb_atcc_augmented_00000_C 1 0.14 0.04 air 1.00 
uwb_atcc_augmented_00000_C 1 0.19 0.07 four 1.00 
uwb_atcc_augmented_00000_C 1 0.26 0.05 nine 1.00 
uwb_atcc_augmented_00000_C 1 0.31 0.05 one 1.00 
uwb_atcc_augmented_00000_C 1 0.36 0.09 contact 1.00 
uwb_atcc_augmented_00000_C 1 0.45 0.07 praha 1.00 
uwb_atcc_augmented_00000_C 1 0.52 0.12 radar 1.00 
uwb_atcc_augmented_00000_C 1 0.64 0.06 one 1.00 
uwb_atcc_augmented_00000_C 1 0.70 0.05 two 1.00 
uwb_atcc_augmented_00000_C 1 0.75 0.08 zero 1.00 
uwb_atcc_augmented_00000_C 1 0.83 0.12 decimal 1.00 
uwb_atcc_augmented_00000_C 1 0.95 0.04 two 1.00 
uwb_atcc_augmented_00000_C 1 1.00 0.07 seven 1.00 
uwb_atcc_augmented_00000_C 1 1.08 0.09 five 1.00 
uwb_atcc_augmented_00000_C 1 1.17 0.03 good 1.00 
uwb_atcc_augmented_00000_C 1 1.20 0.09 bye 0.95 
  2. To evaluate the DER for a subset of the uwb_atcc corpus, you can check the required files in experiments/data/uwb_atcc_subset. To compute the DER on this subset, you can run:
bash src/eval_der.sh

We share this folder, which contains only a few examples for computing the acoustic-based DER.
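To illustrate the idea behind the acoustic-based evaluation, the sketch below combines CTM word timings with per-word speaker labels into time-based segments and scores them with pyannote.metrics. This is not the repository's src/eval_der.sh: the file names and the helper are hypothetical, and it assumes pyannote.core and pyannote.metrics are installed:

from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def ctm_to_annotation(ctm_path, word_speakers):
    """Build a pyannote Annotation from a CTM file plus one speaker label per word.

    ctm_path:      CTM with lines "utt_id channel start duration word confidence"
    word_speakers: one speaker label per CTM line (e.g. "atco"/"pilot"), for
                   instance taken from the BERT tagger's per-token output.
    """
    annotation = Annotation()
    with open(ctm_path) as ctm:
        for line, speaker in zip(ctm, word_speakers):
            _, _, start, duration, _, _ = line.split()
            start, duration = float(start), float(duration)
            annotation[Segment(start, start + duration)] = speaker
    return annotation

# Hypothetical file names and labels, for illustration only.
reference = ctm_to_annotation("reference.ctm", ["atco"] * 15 + ["pilot"] * 2)
hypothesis = ctm_to_annotation("hypothesis.ctm", ["atco"] * 14 + ["pilot"] * 3)

metric = DiarizationErrorRate()
print(f"DER: {metric(reference, hypothesis):.3f}")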

Get the metrics

We prepared one script (get_metrics.py) that lists all the performances produced in $EXP_FOLDER/evaluations for a given model. For instance:

  • If your model on the UWB-ATCC corpus was trained in experiments/results/baseline/bert-base-uncased/1234/uwb_atcc, run:
bash src/get_metrics.sh --evaluation-folder experiments/results/baseline/bert-base-uncased/1234/uwb_atcc/evaluations

Related work

Here is a list of papers related to AI/ML for air traffic control communications:


How to cite us

If you use this code for your research, please cite our paper with:

Zuluaga-Gomez, J., Sarfjoo, S. S., Prasad, A., Nigmatulina, I., Motlicek, P., Ondrej, K., Ohneiser, O., & Helmke, H. (2022). BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications. 2022 IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar.

or use the bibtex item:

@article{zuluaga2022bertraffic,
  title={BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications},
  author={Zuluaga-Gomez, Juan and Sarfjoo, Seyyed Saeed and Prasad, Amrutha and Nigmatulina, Iuliia and Motlicek, Petr and Ondrej, Karel and Ohneiser, Oliver and Helmke, Hartmut},
  journal={IEEE Spoken Language Technology Workshop (SLT), Doha, Qatar},
  year={2022}
}

and,

@article{zuluaga2022atco2,
  title={ATCO2 corpus: A Large-Scale Dataset for Research on Automatic Speech Recognition and Natural Language Understanding of Air Traffic Control Communications},
  author={Zuluaga-Gomez, Juan and Vesel{\`y}, Karel and Sz{\"o}ke, Igor and Motlicek, Petr and others},
  journal={arXiv preprint arXiv:2211.04054},
  year={2022}
}
