
Turing-RSS Health Data Lab Biomedical Acoustic Markers

This repository details the code required to replicate the results in the following three papers:

  • Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers
  • A large-scale and PCR-referenced vocal audio dataset for COVID-19
  • Statistical Design and Analysis for Robust Machine Learning: A Case Study from COVID-19

Note: in order to replicate our findings you must first download The UK COVID-19 Vocal Audio Dataset; please see below for details.

Note: the code for the SSAST experiments and the openSMILE feature extraction lives in git submodules of this repository. If you intend to run those analyses, add the recursive submodule flag to the clone command:

git clone --recurse-submodules <repo.git>

If you have already cloned the repository without submodules, run git submodule update --init --recursive from the repository root.

Contents

Data Paper --> notebook to produce summary statistics and plotly figures in the UK COVID-19 Vocal Audio Dataset data descriptor.

SVM Baseline --> code used to generate the openSMILE-SVM baseline results along with weak-robust and nearest neighbour mapping ablation studies.

BNN Baseline --> code used to generate the ResNet-50 BNN baseline results and uncertainty metrics.

Code for plotting --> code used to generate the plots for the three papers.

Utilities --> helper functions + main dataset class for machine learning training.

Unit Tests --> unit tests for checking validity of train/val/test splits and other functionality.

Self Supervised Audio Spectrogram Transformer --> folder ssast_ciab/ is a git submodule pointer to the particular commit in the SSAST repository used to generate the main results of the study.

Docker --> code used to create the docker image for the experimental environment (also contains a requirements.txt file if a Python virtual environment is preferred).

Docker

To make replicating the results easy, we provide a Docker image of the experimental environment. To boot up a Docker container, run:

docker run -it --name <name_for_container> -v <location_of_git_repo>:/workspace/ --gpus=all --ipc=host harrycoppock/ciab:ciab_v4

This will open a new terminal inside the Docker container. Do not worry about having to download the Docker image from the hub; the above command will handle this.

If you are on macOS, please add the flag --platform=linux/amd64.

The UK COVID-19 Vocal Audio Dataset

The open access version of the UK COVID-19 Vocal Audio Dataset has been deposited in a Zenodo repository https://doi.org/10.5281/zenodo.10043977, and is available under an Open Government Licence (v3.0).
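
If you want to script the download, the Zenodo REST API exposes the record's files. A minimal sketch listing them (the record ID is taken from the DOI above; the response fields files, key and links.self follow the current Zenodo API and may change):

import requests

RECORD_ID = "10043977"  # from the DOI 10.5281/zenodo.10043977

record = requests.get(f"https://zenodo.org/api/records/{RECORD_ID}", timeout=30)
record.raise_for_status()

# List each deposited file with its link; the direct download URL may sit
# under a 'content' sub-link depending on the API version.
for entry in record.json()["files"]:
    print(entry["key"], entry["links"]["self"])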

The full UK COVID-19 Vocal Audio Dataset is not publicly available, as it is classed as 'Special Category Personal Data'. Access may be requested from UKHSA ([email protected]) and will be granted subject to approval and a data sharing contract. To learn how to apply for UKHSA data, visit: https://www.gov.uk/government/publications/accessing-ukhsa-protected-data/accessing-ukhsa-protected-data

The open access version of the dataset does not contain the 'sentence' modality, which has been removed, leaving the 'cough', 'three cough' and 'exhalation' modalities. In addition, to meet open access requirements, selected attributes of the metadata have been aggregated so that no group of fewer than 3 individuals can be singled out by a combination of attributes. As a result, neither the 'sentence' modality results nor the creation of the train-test splits can be replicated with the open access data. Note that this applies only to the open access version: our full stack is replicable with the original dataset, which can be accessed by following the instructions above. We also provide the train-test splits in .csv form so that the machine learning experiments can be replicated with the open access data.
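
For orientation, a minimal sketch of joining the released split .csv files onto the open access metadata with pandas; the file names (participant_metadata.csv, train_test_splits.csv) and column names (participant_identifier, split) are assumptions, so check them against the released files:

import pandas as pd

# Hypothetical file and column names -- verify against the released CSVs.
meta = pd.read_csv("participant_metadata.csv")
splits = pd.read_csv("train_test_splits.csv")  # e.g. columns: participant_identifier, split

df = meta.merge(splits, on="participant_identifier", how="inner")
print(df["split"].value_counts())

train = df[df["split"] == "train"]
test = df[df["split"] == "test"]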

Demo!

To easily run the code yourself using your own voice recordings (no need to download the data), we have provided a short demo hosted on Google Colab. Please follow this link to have a go yourself!

SSAST results

Warning: preprocessing and training take a considerable amount of time and require access to a V100 GPU or equivalent.

To replicate the SSAST results first the audio files need to be preprocessed:

cd ssast_ciab/src/finetune/ciab/
python prep_ciab.py

Once this is complete then training can begin:

sh run_ciab.sh

BNN results

For a more detailed description, please consult the BNN README.

Warning: please note that the full run is very compute intensive and was performed on a Tesla K4/V100 GPU with at least 64 GB of system RAM. Options to train on sub-samples of the dataset are provided in the appropriate files. The code is configured via the config file BNNBaseline/lib/config.py.

To replicate BNN results, first cd BNNBaseline/lib and extract features with:

python extract_feat.py
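
For orientation only, a minimal sketch of the kind of front end such a pipeline typically uses, a log-mel spectrogram computed with librosa; the actual features produced by extract_feat.py may differ, so treat this purely as an illustration:

import librosa
import numpy as np

def log_mel(path, sr=16000, n_mels=64):
    """Load an audio file and return a log-scaled mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

feats = log_mel("example_cough.wav")  # hypothetical file
print(feats.shape)  # (n_mels, n_frames)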

Once complete, train the model with

python train.py

To evaluate results and save to the folder specified in BNNBaseline/lib/config.py, run

python evaluate.py
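
The BNN baseline reports uncertainty metrics (see Contents above). As a hedged illustration of the standard quantities, a sketch computing predictive entropy and mutual information from Monte Carlo softmax samples; we have not verified that evaluate.py computes exactly these:

import numpy as np

def uncertainty_metrics(probs, eps=1e-12):
    """probs: (n_mc_samples, n_classes) softmax outputs from stochastic forward passes."""
    mean_p = probs.mean(axis=0)
    # Predictive entropy: total uncertainty of the averaged predictive distribution.
    predictive_entropy = -np.sum(mean_p * np.log(mean_p + eps))
    # Expected per-sample entropy: the aleatoric component.
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    # Mutual information (BALD): the epistemic component.
    return predictive_entropy, predictive_entropy - expected_entropy

mc = np.random.dirichlet([2, 2], size=50)  # 50 stochastic passes, 2 classes
print(uncertainty_metrics(mc))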

SVM-openSMILE baseline

To run openSMILE feature extraction, first build the openSMILE audio feature extraction package from source by following these instructions. Then run:

python SvmBaseline/opensmile_feat_extraction.py

This will extract openSMILE features for the test and train sets in the s3 bucket and save them in features/opensmile/.
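
If you only want to inspect the kind of features involved, the opensmile Python package is a convenient alternative to a source build. A minimal sketch using the ComParE 2016 functionals (we have not verified which feature set the baseline script uses, so treat that choice as an assumption):

import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("example_cough.wav")  # hypothetical file
print(features.shape)  # one row of 6373 ComParE functionals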

To run SVM classification on extracted features:

python SvmBaseline/svm.py
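
For orientation, a minimal scikit-learn sketch of an SVM on pre-extracted features; the actual hyperparameters and pipeline in SvmBaseline/svm.py may differ:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Stand-in data: rows of openSMILE functionals with binary COVID labels.
X_train, y_train = np.random.randn(100, 6373), np.random.randint(0, 2, 100)
X_test = np.random.randn(10, 6373)

clf = make_pipeline(StandardScaler(), SVC(kernel="linear", probability=True))
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test)[:, 1])  # P(COVID positive) per test clip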

Dummy config

To run experiments, please fill in the fields in ./dummy_config.yaml.
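
A minimal sketch of reading such a config in Python with PyYAML; the keys are whatever you filled in, nothing is assumed about them here:

import yaml

with open("./dummy_config.yaml") as f:
    config = yaml.safe_load(f)  # dict of the fields filled in above

print(config)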

Replicate experimental splits [optional]

To replicate the creation of the 3 training sets, 3 validation sets and 5 testing sets, run the following pipeline:

  1. Execute all cells in analysis_splits/Exploratory Analysis and Split Generation.ipynb (generates the train and test splits).
  2. Execute (generates the validation sets for train):
cd utils
python dataset_stats.py --create_meta=yes
cd ..
  3. Execute all cells in notebooks/matching/matching_final.ipynb (generates the matched training and test sets).
  4. Execute (creates the matched validation set):
cd utils
python dataset_stats.py --create_matched_validation=yes
cd ..

Tests

There are no conventional unit tests for this code base; assert statements, however, feature throughout the code to check for expected functionality. In addition, there is a set of tests which should be run once the train-test splits have been created. These check for overlapping splits, duplicate entries and much more.
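
As an illustration of the overlap check, a minimal sketch in the same assert style; the file and column names are assumptions (see the split-generation steps above):

import pandas as pd

# Hypothetical file and column names -- adjust to the generated split CSVs.
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train_ids = set(train["participant_identifier"])
test_ids = set(test["participant_identifier"])

# No participant may appear in both splits, and neither split may contain duplicates.
assert train_ids.isdisjoint(test_ids), "train/test splits overlap"
assert not train["participant_identifier"].duplicated().any(), "duplicates in train"
assert not test["participant_identifier"].duplicated().any(), "duplicates in test"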

Citations

This repository contains the code used to create the results presented in the following three papers; please cite them if you use this work.

@article{coppock2024audio,
 author = {Coppock, Harry and Nicholson, George and Kiskin, Ivan and Koutra, Vasiliki and Baker, Kieran and Budd, Jobie and Payne, Richard and Karoune, Emma and Hurley, David and Titcomb, Alexander and Egglestone, Sabrina and Cañadas, Ana Tendero and Butler, Lorraine and Jersakova, Radka and Mellor, Jonathon and Patel, Selina and Thornley, Tracey and Diggle, Peter and Richardson, Sylvia and Packham, Josef and Schuller, Björn W. and Pigoli, Davide and Gilmour, Steven and Roberts, Stephen and Holmes, Chris},
 title = {Audio-based AI classifiers show no evidence of improved COVID-19 screening over simple symptoms checkers},
 journal = {Nature Machine Intelligence},
 year = {2024},
 doi = {10.1038/s42256-023-00773-8}
}

@article{budd2022,
    author={Jobie Budd and Kieran Baker and Emma Karoune and Harry Coppock and Selina Patel and Ana Tendero Cañadas and Alexander Titcomb and Richard Payne and David Hurley and Sabrina Egglestone and Lorraine Butler and George Nicholson and Ivan Kiskin and Vasiliki Koutra and Radka Jersakova and Peter Diggle and Sylvia Richardson and Bjoern Schuller and Steven Gilmour and Davide Pigoli and Stephen Roberts and Josef Packham and Tracey Thornley and Chris Holmes},
    title={A large-scale and PCR-referenced vocal audio dataset for COVID-19},
    year={2022},
    journal={arXiv},
    doi = {10.48550/ARXIV.2212.07738}
}

@article{Pigoli2022,
    author={Davide Pigoli and Kieran Baker and Jobie Budd and Lorraine Butler and Harry Coppock
        and Sabrina Egglestone and Steven G.\ Gilmour and Chris Holmes and David Hurley and Radka Jersakova and Ivan Kiskin and Vasiliki Koutra and George Nicholson and Joe Packham and Selina Patel and Richard Payne and Stephen J.\ Roberts and Bj\"{o}rn W.\ Schuller and Ana Tendero-Ca\~{n}adas and Tracey Thornley and Alexander Titcomb},
    title={Statistical Design and Analysis for Robust Machine Learning: A Case Study from COVID-19},
    year={2022},
    journal={arXiv},
    doi = {10.48550/ARXIV.2212.08571}
}