ArabCeleb is an audio dataset collected in the wild that focuses specifically on the Arabic language. The dataset contains 1930 utterances from 100 celebrities, taken from videos on YouTube.com. It can be used for several speaker recognition tasks (identification, verification, and gender recognition) as well as for multimodal recognition tasks that integrate the audio and video tracks.
To allow methods trained for speaker identification to be reused for speaker verification, the development and test sets are generated so that no speaker appears in both. The development set is further divided into training, validation, and test sets for speaker identification.
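The splitting scheme described above can be sketched as follows. This is a minimal illustration only, not the procedure used to build the released splits: the function names, the fractions, and the `(speaker_id, utterance_id)` tuple format are all assumptions made for the example.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker is in both sets.

    The fraction and seed are illustrative assumptions.
    """
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])
    dev = [u for u in utterances if u[0] not in test_speakers]
    test = [u for u in utterances if u[0] in test_speakers]
    return dev, test


def split_identification(dev, val_fraction=0.2, test_fraction=0.2, seed=0):
    """Split each development speaker's utterances into train/val/test.

    Every speaker keeps utterances in the training set, as closed-set
    identification requires.
    """
    by_speaker = defaultdict(list)
    for spk, utt in dev:
        by_speaker[spk].append(utt)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for spk in sorted(by_speaker):
        utts = sorted(by_speaker[spk])
        rng.shuffle(utts)
        n_val = int(len(utts) * val_fraction)
        n_test = int(len(utts) * test_fraction)
        val += [(spk, u) for u in utts[:n_val]]
        test += [(spk, u) for u in utts[n_val:n_val + n_test]]
        train += [(spk, u) for u in utts[n_val + n_test:]]
    return train, val, test
```

Splitting by speaker first, and only then by utterance, is what keeps the verification test speakers unseen during identification training.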
- Python 3.8
- pytube 11.0.1
- ffmpeg 4.2.4
To run the code, install the packages listed in requirements.txt as follows:
pip install -r requirements.txt
We provide YouTube URLs, timestamps for utterances, and speaker metadata.
The URL of each YouTube video and the timestamps of its utterances are provided in the file utterance_info.json.
The audio files can be downloaded, using the information provided in the file info.json, by running the script prepare_dataset.py as follows:
python prepare_dataset.py
The script:
- Downloads the video at the given YouTube URL
- Cuts the entire video into video sequences
- Extracts the audio signal and saves it to a WAV file
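The three steps above can be sketched roughly as follows. This is not the code of prepare_dataset.py: the helper names, the mono 16 kHz output format, and the use of start/end times in seconds are assumptions made for illustration. It relies only on pytube's `YouTube(...).streams` interface and the `ffmpeg` command-line tool listed in the requirements.

```python
import subprocess
from pathlib import Path


def download_video(url, out_dir):
    """Download the highest-resolution progressive MP4 stream; return its path."""
    # Imported lazily so the ffmpeg helpers below can be used without pytube.
    from pytube import YouTube

    stream = (
        YouTube(url)
        .streams.filter(progressive=True, file_extension="mp4")
        .get_highest_resolution()
    )
    return Path(stream.download(output_path=str(out_dir)))


def ffmpeg_cut_command(video_path, start, end, clip_path):
    """Build the ffmpeg command that cuts [start, end] (in seconds) from a video."""
    duration = end - start
    return ["ffmpeg", "-y", "-i", str(video_path),
            "-ss", str(start), "-t", str(duration), str(clip_path)]


def ffmpeg_audio_command(clip_path, wav_path, sample_rate=16000):
    """Build the ffmpeg command that drops the video track and writes mono WAV."""
    # Mono output at 16 kHz is an assumption, not a documented dataset property.
    return ["ffmpeg", "-y", "-i", str(clip_path),
            "-vn", "-ac", "1", "-ar", str(sample_rate), str(wav_path)]


def process_utterance(url, start, end, out_dir, name):
    """Download a video, cut one utterance, and save its audio as a WAV file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    video = download_video(url, out_dir)
    clip = out_dir / f"{name}.mp4"
    wav = out_dir / f"{name}.wav"
    subprocess.run(ffmpeg_cut_command(video, start, end, clip), check=True)
    subprocess.run(ffmpeg_audio_command(clip, wav), check=True)
    return wav
```

Building the ffmpeg argument lists in separate functions keeps the commands inspectable without touching the network or the file system.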
Full names, years of birth, and gender labels for all the speakers in the dataset can be found in speaker_info.csv.
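As a quick sanity check on the metadata, the label distribution of a column can be counted with the standard library. This is a small sketch: the column names in speaker_info.csv are not documented here, so the `gender` header used in the usage note below is a hypothetical name that should be checked against the actual file.

```python
import csv
from collections import Counter


def count_by_column(lines, column):
    """Count the values of one column in CSV text that has a header row.

    `lines` is any iterable of strings, e.g. an open text file.
    """
    return Counter(row[column] for row in csv.DictReader(lines))
```

For example, assuming a `gender` header exists, `count_by_column(open("speaker_info.csv", newline="", encoding="utf-8"), "gender")` would return the number of speakers per gender label.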
Dataset split for identification
List of trial pairs for verification
The ArabCeleb dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. A complete version of the license can be found here.
Caution: We note that the distribution of identities in the ArabCeleb dataset may not be representative of the global human population. Please be mindful of unintended societal, gender, racial, and other biases when training or deploying models trained on this data.
Please contact the authors below if you have any queries regarding the dataset.
Please cite the following if you make use of the dataset:
- Simone Bianco, Luigi Celona, Intissar Khalifa, Paolo Napoletano, Alexey Petrovsky, Flavio Piccoli, Raimondo Schettini, and Ivan Shanin. ArabCeleb: Speaker Recognition in Arabic. In AIxIA 2021 - Advances in Artificial Intelligence, Springer, pp. 338-347, 2022.
@inproceedings{bianco2022arabceleb,
author = {Bianco, Simone and Celona, Luigi and Khalifa, Intissar and Napoletano, Paolo and Petrovsky, Alexey and Piccoli, Flavio and Schettini, Raimondo and Shanin, Ivan},
booktitle="AIxIA 2021 -- Advances in Artificial Intelligence",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="338--347",
title = {ArabCeleb: Speaker Recognition in Arabic},
isbn="978-3-031-08421-8"
}