ArabCeleb

ArabCeleb is an audio dataset collected in the wild that focuses specifically on the Arabic language. It contains 1930 utterances from 100 celebrities, taken from videos on YouTube. The dataset can be used for several speaker recognition tasks: identification, verification, and gender recognition, as well as multimodal recognition tasks that combine the audio and video tracks.

To allow the training of methods for speaker identification that can then be reused for speaker verification, we generate the development and test sets making sure that there is no overlap between the speakers of the development and test sets. The development set is further divided into training, validation, and test sets for speaker identification.
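The speaker-disjoint constraint described above can be illustrated with a minimal sketch. This is not the procedure used to build the released splits; the function name `speaker_disjoint_split`, the `(speaker_id, utterance_id)` input format, and the 20% test fraction are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs so that no speaker is in both sets.

    Hypothetical helper: the split is made over speakers, not over utterances,
    which is what guarantees zero speaker overlap between the two sets.
    """
    by_speaker = defaultdict(list)
    for speaker_id, utterance_id in utterances:
        by_speaker[speaker_id].append(utterance_id)

    # Shuffle the speaker list deterministically, then hold out a fraction of speakers.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    test_speakers = set(speakers[:n_test])

    dev = [(s, u) for s in speakers if s not in test_speakers for u in by_speaker[s]]
    test = [(s, u) for s in speakers if s in test_speakers for u in by_speaker[s]]
    return dev, test
```

The development portion returned here could then be divided by utterance into training, validation, and test subsets for identification, since identification needs the same speakers in every subset.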

Dependencies

Python 3.8
pytube 11.0.1
ffmpeg 4.2.4

In order to successfully run the code, install the packages listed in requirements.txt as follows:

pip install -r requirements.txt

Downloads

We provide YouTube URLs, timestamps for utterances, and speaker metadata.

URLs and timestamps

We provide the URL of each YouTube video and the timestamps of its utterances in the file utterance_info.json.

Audio files

The audio files can be downloaded using the information in utterance_info.json by running the script prepare_dataset.py as follows:

python prepare_dataset.py

The script:

Downloads the video at the given YouTube URL
Cuts the entire video into video sequences
Extracts and saves the audio signal to a WAV file
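The steps above can be sketched roughly as follows. This is not the code of prepare_dataset.py: the function names, the 16 kHz mono output format, and the `downloads` working directory are assumptions, and as a shortcut the sketch fetches only the audio stream and cuts a single utterance directly instead of first cutting video sequences.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, start, end, wav_path, sample_rate=16000):
    """Return the ffmpeg arguments that cut [start, end] (seconds) to a mono WAV file."""
    if end <= start:
        raise ValueError("end must be greater than start")
    return [
        "ffmpeg", "-y",               # overwrite the output file if it exists
        "-i", str(video_path),
        "-ss", f"{start:.3f}",        # utterance start, in seconds
        "-t", f"{end - start:.3f}",   # utterance duration, in seconds
        "-vn",                        # drop any video stream
        "-ac", "1",                   # mono (assumed, not stated in the README)
        "-ar", str(sample_rate),      # sample rate (assumed, not stated in the README)
        str(wav_path),
    ]


def download_audio(url, out_dir):
    """Download the first audio-only stream of a YouTube video with pytube."""
    from pytube import YouTube  # lazy import: the command builder works without pytube

    stream = YouTube(url).streams.filter(only_audio=True).first()
    return Path(stream.download(output_path=str(out_dir)))


def extract_utterance(url, start, end, wav_path, work_dir="downloads"):
    """Download one video and save the utterance between start and end as WAV."""
    Path(work_dir).mkdir(parents=True, exist_ok=True)
    Path(wav_path).parent.mkdir(parents=True, exist_ok=True)
    media_path = download_audio(url, work_dir)
    subprocess.run(build_ffmpeg_command(media_path, start, end, wav_path), check=True)
    return Path(wav_path)
```

Keeping the command construction in its own function makes it possible to inspect or log the exact ffmpeg call without downloading anything.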

Metadata

Full names, years of birth, and gender labels for all the speakers in the dataset can be found in speaker_info.csv.
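As a small illustration of working with the metadata, the sketch below counts speakers per value of a chosen column. The column layout of speaker_info.csv is not documented here, so the column name `gender` in the usage comment is hypothetical; check the header row of the file before relying on it.

```python
import csv
from collections import Counter


def count_by_column(csv_file, column):
    """Count rows per distinct value of `column` in an open CSV file with a header row."""
    reader = csv.DictReader(csv_file)
    if column not in (reader.fieldnames or []):
        raise KeyError(f"column {column!r} not found; available: {reader.fieldnames}")
    return Counter(row[column].strip() for row in reader)


# Usage (the column name "gender" is an assumption about the file's header):
# with open("speaker_info.csv", newline="", encoding="utf-8") as f:
#     print(count_by_column(f, "gender"))
```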

Dataset split for identification: iden_split.txt

List of trial pairs for verification: veri_test.txt

License

The ArabCeleb dataset is available to download for commercial/research purposes under a Creative Commons Attribution 4.0 International License. The copyright remains with the original owners of the videos. A complete version of the license can be found in the LICENSE file.

Caution: We note that the distribution of identities in the ArabCeleb dataset may not be representative of the global human population. Please be aware of unintended societal, gender, racial, and other biases when training or deploying models trained on this data.

Please contact the authors below if you have any queries regarding the dataset.

Citation

Please cite the following if you make use of the dataset:

Simone Bianco, Luigi Celona, Intissar Khalifa, Paolo Napoletano, Alexey Petrovsky, Flavio Piccoli, Raimondo Schettini, and Ivan Shanin. ArabCeleb: Speaker Recognition in Arabic. In AIxIA 2021 - Advances in Artificial Intelligence, Springer, pp. 338-347, 2022.

@inproceedings{bianco2022arabceleb,
 author = {Bianco, Simone and Celona, Luigi and Khalifa, Intissar and Napoletano, Paolo and Petrovsky, Alexey and Piccoli, Flavio and Schettini, Raimondo and Shanin, Ivan},
 booktitle="AIxIA 2021 -- Advances in Artificial Intelligence",
 year="2022",
 publisher="Springer International Publishing",
 address="Cham",
 pages="338--347",
 title = {ArabCeleb: Speaker Recognition in Arabic},
 isbn="978-3-031-08421-8"
}
