Yes, (V1, V2, V3) |
Wikipedia |
Yes, however! |
Articles |
HF_Wikipedia, dumps.wikipedia, wili_2018, Leipzig |
CC BY-NC-SA 3.0 |
Yes, (V1, V2, V3) |
SETIMES |
Yes |
News |
Opus_SETIMES |
CC-BY-SA 3.0 |
Yes, (V1, V2, V3) |
Tatoeba |
Yes, mostly |
Crowdsourcing |
Tatoeba |
CC-BY |
Yes, (V1, V2, V3) |
Global Voices |
Yes |
News stories |
Opus_GlobalVoices |
CC BY 3.0 |
Yes, (V1, V2, V3) |
XL-Sum |
Yes |
BBC News |
HF_xlsum, Github_xlsum |
CC BY-NC-SA 4.0 |
Yes, (V1, V2, V3) |
Leipzig - News and Newscrawl |
Yes |
News |
Leipzig |
CC BY-NC-SA 3.0 |
Yes, (V1, V2, metadata check for V3) |
NLLB_seed |
Yes |
Professionally-translated sentences (Wikipedia domain) |
NLLB_Seed |
CC-BY-SA 4.0 |
Yes, (V1, V2, cleaned for V3) |
MT-560v1 |
?; the OpenLID version is much cleaner. |
Multiple domain |
OpenLID compilation |
Apache License 2.0 |
Yes, (V1, V2, V3) |
Autshumato |
Yes |
Government domain |
HF_autshumato |
CC Attribution 2.5 South Africa License |
Yes, (V1, V2, metadata check for V3) |
Open Bibles |
Yes, but some closely related languages (or languages with similar names) might be confused. |
Bible versions |
1000Langs, PBC, CorpusCrawler, PNG, Open-Bibles, Bible.is, ebible, biblenlp-corpus, JHUBC, bible.com |
Mostly CC BY-NC-ND |
Yes (partly), (V1, V2, cleaned further for V3) |
JW |
Yes, except sign language codes |
New World Bible |
Masakhane-Mt, JW, JW300 |
Usage for your own personal and non-commercial purposes is permitted. However, distribution is not allowed. |
Yes (partly), (V1, V2, V3) |
LTI |
? |
Multiple domain |
whatlang |
Custom license / partly open |
Yes (partly), (V1, V2, delete some languages for V3) |
Arabic (DART, SHAMI, TSAC, PADIC, AOC, Arabic Dialects Dataset, MADAR) |
? |
Multiple domain |
IADD, MADAR, Arabic Dialects |
Multiple open licenses |
Yes, (V1, V2, V3) |
Persian (TEP, MIZAN) |
Yes |
Literature, subtitles |
TEP, MIZAN |
TEP: GNU General Public License, MIZAN: CC BY 4.0 |
Yes, (V1, V2, V3) |
TIL Corpus |
? |
Multiple domain |
TIL-MT |
CC BY-NC-SA 4.0 |
Yes, (V1, V2, V3) |
bho-resources |
Yes |
News |
bho-resources |
CC BY-NC-SA 4.0 |
Yes, (V1, V2, V3) |
Guaraní Parallel Set |
Yes |
News |
Guaraní Parallel Set |
No explicit license |
Yes, (V1, V2, V3) |
HKCanCor |
Yes |
Transcribed conversations |
hkcancor |
CC BY 4.0 |
Yes, (V2, V3) |
ai4d challenge (Nyanja) |
Yes |
News |
ai4d-malawi-news |
No explicit license |
Yes, (V2, V3) |
Wanca 2016 |
? |
Web |
wanca2016 |
CC BY |
No, (V2, delete for V3) |
smugri |
No; after training the V2 model, we found that this data is not clean. |
News |
smugri-data |
CC BY 4.0 |
No, (V2, delete for V3) |
finno-ugric |
? |
? |
finno-ugric-train |
CC BY 4.0 |
Yes, (V2, V3) |
smugri-flores |
Yes |
Human Translation |
smugri-flores-testset |
CC BY 4.0 |
Yes, (V2, V3) |
Abkhaz National Corpus |
Yes |
Grammatically annotated text (linguistics, literary studies, history, political and social sciences) |
Abkhaz National Corpus, abkhaz_text |
Public domain (cc0-1.0) |
Yes, (V2, V3) |
Luo News (Radio Ramogi) |
Yes |
News |
Luo |
CC BY 4.0 |
Yes, (V2, more metadata checks for V3) |
Lyrics |
Yes, but there might be some issues with the translation part. |
Song lyrics |
lyricstranslate |
Copyright issues prevent distributing the original lyrics. Lyrics on Lyricstranslate.com are licensed through Musixmatch. Translations on Lyricstranslate.com belong to their authors. |
Yes, (V2, V3) |
GlotSparse |
Yes |
News and Articles |
HF_GlotSparse, Github_GlotSparse |
Public domain (cc0-1.0) |
Yes, (V2, more metadata checks for V3) |
GlotStoryBook |
Yes |
StoryBooks |
HF_GlotStoryBook, Github_GlotStoryBook |
Public domain (cc0-1.0) |
Yes (partly), (V2, cleaned further for V3) |
Universal Dependencies v2.12 |
Yes, mostly |
Multiple domain |
UD |
CC family |
Yes, (V2, V3) |
CommonVoice v11 |
Yes, mostly |
Crowdsourcing |
CommonVoice v11 |
Public domain (cc0-1.0) |
Yes, (V2, V3) |
GOUD.MA |
?; only the headlines are definitely written in Moroccan Darija, and some noise exists. |
News |
Goud-sum |
No explicit license |
Yes, (V2, V3) |
Vuk'uzenzele |
Yes |
Government domain |
vukuzenzele-monolingual |
License for Data - CC BY 4.0 |
Yes, (V2, V3) |
Masakhanews |
Yes |
News |
masakhane/masakhanews |
CC 4.0 Non-Commercial |
Yes, (V2, V3) |
AfriQA |
Yes |
Human Translation |
masakhane/afriqa |
CC 4.0 Non-Commercial |
Todo |
African News Corpus |
Yes |
News |
African News Corpus |
Non-Commercial Government Licence |
Todo |
AfriSenti |
?; location tags converted to language tags, followed by annotation |
Tweets |
AfriSenti |
CC BY 4.0 |
Todo |
Bambara Dataset - Sentiment |
?, followed by annotation |
CommonCrawl |
Bambara Dataset |
No explicit license |
Todo |
TUNIZI Dataset - Sentiment |
?, followed by annotation |
YouTube video comments |
TUNIZI Dataset, HF_TUNIZI |
No explicit license |
Todo |
CTAB |
?, followed by manual verification |
Facebook Public Pages |
Zenodo_CTAB |
CC BY 4.0 |
Todo |
Open Subtitles |
? |
Movie subtitles |
OpenSubtitles |
|
Todo |
GNOME |
Yes |
Human Translation |
opus_GNOME, HF_opus_GNOME, gnome/releases |
|
Todo |
KDE4 |
Yes |
Human Translation |
opus_kde4, HF_kde4, kde4 |
|
Todo |
Ubuntu |
Yes |
Human Translation |
HF_opus_ubuntu |
|
Todo |
Web Inventory of Transcribed & Translated (WIT) TED Talks |
? |
TED talks |
ted_talks_iwslt |
|
Todo |
igbo_monolingual |
Yes |
News, Radio, Books |
igbo_monolingual |
|
Todo |
QADI |
?; location tags can be converted to language tags. We cannot ignore such a resource, even if it contains some noise: Arabic dialects are close to each other, and even news websites might be written in Standard Arabic. |
Tweets |
QADI Tweet IDs |
Apache License 2.0 |
Todo |
Mot |
Yes |
VOA News |
mot |
MIT License |
Todo |
gov-za |
Yes |
Government domain |
gov-za-monolingual |
License for Data - CC BY 4.0 |
Yes, (V3) |
NusaT, NusaP, NusaX |
Yes |
Human Translation |
indonlp, nusa-writes, NusaX |
CC-BY-SA 4.0 or Apache License 2.0 |
Yes, (V3) |
MT Shared Task American NLP |
|
|
americasnlp2024 |
|
Yes, (V3) |
ShaShiYaYi |
|
|
multilingual-data-peru |
|
No, (delete for V3; it is code-switched) |
Hinglish collection |
|
|
english-to-hinglish |
|
Yes, (V3) |
Indic corpora |
|
|
In22Conv, In22Gen |
CC-BY-SA 4.0 |
Yes, (V3) |
Yue parallel text |
|
|
yue-cmn-eng |
|
Todo |
Dakshina |
|
|
dakshina |
|
Yes, (V3) |
Bhasha-Abhijnaanam |
|
|
Bhasha-Abhijnaanam |
|
Yes, (V3) |
Bloom Library |
|
|
bloom-lm |
CC-BY-(NC?-ND?-SA?)-4.0 |