Zabanshenas 🕵


A Transformer-based solution for identifying the most likely language of a written document/text. Zabanshenas is a Persian word that has two meanings:

  • A person who studies linguistics.
  • A way to identify the type of written language.

Introduction

In this repository, I take a different approach to building a language detection model by using Transformers. Transformers now play a major role in Natural Language Processing. In short, a Transformer uses an attention mechanism to boost speed and extract high-level information (abstraction).

There are plenty of ways, solutions, and packages to find the language of a written piece of text or document. All of them have their pros and cons. Some detect faster, and some support as many languages as possible. In this case, however, I intend to use Transformers to understand similar groups of languages and cover 235 languages, thanks to WiLI-2018 and the Transformer architecture.

This model can detect a written language at three different levels: paragraph, sentence, and short subsets of text between three and four tokens.

WiLI-2018 (Cleaned version)

WiLI-2018 is a benchmark dataset for monolingual written natural language identification (each text is predominantly written in one specific language). It contains 229,095 paragraphs covering 235 languages. The language distribution includes 122 Indo-European languages, 22 Austronesian languages, 17 Turkic languages, 14 Uralic languages, 11 Niger-Congo languages, 10 Sino-Tibetan languages, 9 Afro-Asiatic languages, 6 constructed languages, and 24 languages of smaller families. It also contains groups of similar languages and dialects:

  • Arabic, Egyptian Arabic
  • English, Old English, Scots
  • Standard Chinese, Min Nan Chinese, Hakka Chinese, Literary Chinese, Wu Chinese
  • German, Bavarian, Low German, Palatine German, Ripuarisch, Alemannic German, Pennsylvania German
  • Belarusian, Belarusian (Taraschkewiza)
  • Kurdish, Central Kurdish
  • Indonesian, Minangkabau, Banyumasan, Banjar, Sundanese, Javanese
  • Languages spoken in India:
    • Maithili, Bhojpuri
    • Bengali, Bishnupriya
    • Konkani, Marathi
  • Russian, Komi-Permyak
  • Persian, Gilaki, Mazanderani
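
As a quick sanity check on the family breakdown quoted in the dataset description, the short sketch below sums the per-family counts and confirms they add up to the 235 languages mentioned. The dictionary keys are just illustrative names for the families listed in the text.

```python
# Family counts exactly as quoted in the dataset description.
family_counts = {
    "Indo-European": 122,
    "Austronesian": 22,
    "Turkic": 17,
    "Uralic": 14,
    "Niger-Congo": 11,
    "Sino-Tibetan": 10,
    "Afro-Asiatic": 9,
    "Constructed": 6,
    "Smaller families": 24,
}

total = sum(family_counts.values())
indo_european_share = family_counts["Indo-European"] / total

print(total)                         # 235
print(f"{indo_european_share:.1%}")  # 51.9%
```

Roughly half of the covered languages are Indo-European, which is worth keeping in mind when reading the per-language scores.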

On the other hand, this dataset also covers low-resource languages, as shown in Fig 1:

  • Chechen
  • Haitian Creole
  • Newari
  • Pampanga


Fig 1: The distribution of samples per language.

The dataset also includes languages represented by only small text snippets at the word level and character level, as shown in Fig 2 and Fig 3:

  • Word Level:
    • Literary Chinese
    • Japanese
    • Standard Chinese
    • Wu Chinese
    • Chechen
  • Character Level:
    • Chechen
    • Haitian Creole
    • Newari
    • Minangkabau
    • Pampanga


Fig 2: The distribution of words per language.


Fig 3: The distribution of characters per language.

Instead of word-level or character-level tokenization, I used BPE subword tokenization with a maximum sequence length of 512. This gives the Transformer a rich vocabulary while it learns meaningful context-independent representations. The distribution of vocabulary in WiLI-2018 is shown in Fig 4.


Fig 4: The distribution of BPE-tokens per language.
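
To make the idea of BPE subword tokenization concrete, here is a minimal, library-free sketch. It is not the tokenizer used in this repository: the tiny corpus, the `</w>` end-of-word marker, the tie-breaking rule, and the function names (`learn_bpe`, `encode`, `merge_pair`) are all assumptions made for illustration. Only the 512-token maximum sequence length comes from the text above.

```python
from collections import Counter

END = "</w>"  # end-of-word marker (an illustrative convention)


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules by repeatedly merging the most frequent pair."""
    vocab = Counter(tuple(word) + (END,) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(text, merges, max_length=512):
    """Split text on whitespace, apply the merges in order, and truncate to `max_length` tokens."""
    tokens = []
    for word in text.split():
        symbols = tuple(word) + (END,)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens[:max_length]


corpus = "low low lower lowest newer newer".split()  # toy corpus, not WiLI-2018
merges = learn_bpe(corpus, num_merges=5)
print(encode("lower newest", merges))
```

Frequent character sequences end up as single tokens, while rare words fall back to smaller pieces, so no input is ever out of vocabulary.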

Model (Architecture)

RoBERTa is the candidate model for this task, with the following procedure. First, the model is trained on the WiLI-2018 corpus and then fine-tuned for sequence classification (SequenceClassification) to detect independent and similar languages and dialects. The data is divided into three segments: 70% of the whole data (train + test) is kept as paragraphs, 15% is tokenized into sentences, and the remainder is split into subsets of three or five tokens per record to achieve better results.

In total, the dataset consists of a 1M training set, 11K validation set, and 11K testing set.
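
The three-way segmentation described above could be sketched roughly as follows. This is an illustration, not the repository's preprocessing code. The function name `segment_records`, the regex-based sentence splitter, the whitespace token splitting, and the reading of "three or five tokens" as chunk sizes picked from `(3, 5)` are all assumptions.

```python
import random
import re
from collections import Counter


def segment_records(records, seed=0, paragraph_pct=70, sentence_pct=15, chunk_sizes=(3, 5)):
    """Turn (text, label) records into paragraph, sentence, and token-chunk samples."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_par = n * paragraph_pct // 100
    n_sent = n * sentence_pct // 100

    paragraphs = shuffled[:n_par]
    to_sentences = shuffled[n_par:n_par + n_sent]
    to_chunks = shuffled[n_par + n_sent:]  # whatever remains

    samples = [(text, label, "paragraph") for text, label in paragraphs]

    # Naive sentence split after ., ! or ? (assumption: the source names no sentence tokenizer).
    for text, label in to_sentences:
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sentence:
                samples.append((sentence, label, "sentence"))

    # Whitespace tokens grouped into short chunks; the last chunk of a record may be shorter.
    for text, label in to_chunks:
        tokens = text.split()
        i = 0
        while i < len(tokens):
            size = rng.choice(chunk_sizes)
            samples.append((" ".join(tokens[i:i + size]), label, "token"))
            i += size
    return samples


# Toy input: 20 identical records with a hypothetical label.
toy_text = "One two three. Four five six seven. Eight nine ten eleven twelve."
toy_records = [(toy_text, "eng") for _ in range(20)]
toy_samples = segment_records(toy_records)
print(Counter(level for _, _, level in toy_samples))
```

Because sentence and token-chunk records each expand into several samples, the number of training samples grows well beyond the number of original paragraphs.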

Evaluation

Precision, recall, and F1 scores for each language and level are presented in the following tables.
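
For readers who want to see how such per-language scores and the `accuracy`, `macro avg`, and `weighted avg` rows are typically derived, here is a small library-free sketch. The function name `classification_scores` and the toy labels and counts are hypothetical. They are not the evaluation data behind the tables.

```python
from collections import Counter


def classification_scores(y_true, y_pred):
    """Return per-label precision/recall/F1, accuracy, macro average, and weighted average."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted label gains a false positive
            fn[truth] += 1  # true label gains a false negative
    support = Counter(y_true)

    per_label = {}
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        precision = tp[label] / p_den if p_den else 0.0
        recall = tp[label] / r_den if r_den else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}

    n = len(y_true)
    accuracy = sum(tp.values()) / n
    keys = ("precision", "recall", "f1")
    # Macro average: every label counts equally.
    macro = {k: sum(s[k] for s in per_label.values()) / len(labels) for k in keys}
    # Weighted average: every label counts in proportion to its number of true samples.
    weighted = {k: sum(s[k] * support[l] for l, s in per_label.items()) / n for k in keys}
    return per_label, accuracy, macro, weighted


# Hypothetical toy data: 56 samples per label.
y_true = ["lang_a"] * 56 + ["lang_b"] * 56
y_pred = ["lang_a"] * 55 + ["lang_b"] * 1 + ["lang_a"] * 10 + ["lang_b"] * 46
per_label, accuracy, macro, weighted = classification_scores(y_true, y_pred)
print({k: round(v, 6) for k, v in per_label["lang_a"].items()})
```

With these toy counts, `lang_a` has 55 true positives, 10 false positives, and 1 false negative, giving precision 55/65, recall 55/56, and F1 110/121. Rounded to six decimals these are 0.846154, 0.982143, and 0.909091, which shows how the three columns of a table row relate to each other.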

Paragraph level

language precision recall f1-score
Achinese (ace) 1.000000 0.982143 0.990991
Afrikaans (afr) 1.000000 1.000000 1.000000
Alemannic German (als) 1.000000 0.946429 0.972477
Amharic (amh) 1.000000 0.982143 0.990991
Old English (ang) 0.981818 0.964286 0.972973
Arabic (ara) 0.846154 0.982143 0.909091
Aragonese (arg) 1.000000 1.000000 1.000000
Egyptian Arabic (arz) 0.979592 0.857143 0.914286
Assamese (asm) 0.981818 0.964286 0.972973
Asturian (ast) 0.964912 0.982143 0.973451
Avar (ava) 0.941176 0.905660 0.923077
Aymara (aym) 0.964912 0.982143 0.973451
South Azerbaijani (azb) 0.965517 1.000000 0.982456
Azerbaijani (aze) 1.000000 1.000000 1.000000
Bashkir (bak) 1.000000 0.978261 0.989011
Bavarian (bar) 0.843750 0.964286 0.900000
Central Bikol (bcl) 1.000000 0.982143 0.990991
Belarusian (Taraschkewiza) (be-tarask) 1.000000 0.875000 0.933333
Belarusian (bel) 0.870968 0.964286 0.915254
Bengali (ben) 0.982143 0.982143 0.982143
Bhojpuri (bho) 1.000000 0.928571 0.962963
Banjar (bjn) 0.981132 0.945455 0.962963
Tibetan (bod) 1.000000 0.982143 0.990991
Bosnian (bos) 0.552632 0.375000 0.446809
Bishnupriya (bpy) 1.000000 0.982143 0.990991
Breton (bre) 1.000000 0.964286 0.981818
Bulgarian (bul) 1.000000 0.964286 0.981818
Buryat (bxr) 0.946429 0.946429 0.946429
Catalan (cat) 0.982143 0.982143 0.982143
Chavacano (cbk) 0.914894 0.767857 0.834951
Min Dong (cdo) 1.000000 0.982143 0.990991
Cebuano (ceb) 1.000000 1.000000 1.000000
Czech (ces) 1.000000 1.000000 1.000000
Chechen (che) 1.000000 1.000000 1.000000
Cherokee (chr) 1.000000 0.963636 0.981481
Chuvash (chv) 0.938776 0.958333 0.948454
Central Kurdish (ckb) 1.000000 1.000000 1.000000
Cornish (cor) 1.000000 1.000000 1.000000
Corsican (cos) 1.000000 0.982143 0.990991
Crimean Tatar (crh) 1.000000 0.946429 0.972477
Kashubian (csb) 1.000000 0.963636 0.981481
Welsh (cym) 1.000000 1.000000 1.000000
Danish (dan) 1.000000 1.000000 1.000000
German (deu) 0.828125 0.946429 0.883333
Dimli (diq) 0.964912 0.982143 0.973451
Dhivehi (div) 1.000000 1.000000 1.000000
Lower Sorbian (dsb) 1.000000 0.982143 0.990991
Doteli (dty) 0.940000 0.854545 0.895238
Emilian (egl) 1.000000 0.928571 0.962963
Modern Greek (ell) 1.000000 1.000000 1.000000
English (eng) 0.588889 0.946429 0.726027
Esperanto (epo) 1.000000 0.982143 0.990991
Estonian (est) 0.963636 0.946429 0.954955
Basque (eus) 1.000000 0.982143 0.990991
Extremaduran (ext) 0.982143 0.982143 0.982143
Faroese (fao) 1.000000 1.000000 1.000000
Persian (fas) 0.948276 0.982143 0.964912
Finnish (fin) 1.000000 1.000000 1.000000
French (fra) 0.710145 0.875000 0.784000
Arpitan (frp) 1.000000 0.946429 0.972477
Western Frisian (fry) 0.982143 0.982143 0.982143
Friulian (fur) 1.000000 0.982143 0.990991
Gagauz (gag) 0.981132 0.945455 0.962963
Scottish Gaelic (gla) 0.982143 0.982143 0.982143
Irish (gle) 0.949153 1.000000 0.973913
Galician (glg) 1.000000 1.000000 1.000000
Gilaki (glk) 0.981132 0.945455 0.962963
Manx (glv) 1.000000 1.000000 1.000000
Guarani (grn) 1.000000 0.964286 0.981818
Gujarati (guj) 1.000000 0.982143 0.990991
Hakka Chinese (hak) 0.981818 0.964286 0.972973
Haitian Creole (hat) 1.000000 1.000000 1.000000
Hausa (hau) 1.000000 0.945455 0.971963
Serbo-Croatian (hbs) 0.448276 0.464286 0.456140
Hebrew (heb) 1.000000 0.982143 0.990991
Fiji Hindi (hif) 0.890909 0.890909 0.890909
Hindi (hin) 0.981481 0.946429 0.963636
Croatian (hrv) 0.500000 0.636364 0.560000
Upper Sorbian (hsb) 0.955556 1.000000 0.977273
Hungarian (hun) 1.000000 1.000000 1.000000
Armenian (hye) 1.000000 0.981818 0.990826
Igbo (ibo) 0.918033 1.000000 0.957265
Ido (ido) 1.000000 1.000000 1.000000
Interlingue (ile) 1.000000 0.962264 0.980769
Iloko (ilo) 0.947368 0.964286 0.955752
Interlingua (ina) 1.000000 1.000000 1.000000
Indonesian (ind) 0.761905 0.872727 0.813559
Icelandic (isl) 1.000000 1.000000 1.000000
Italian (ita) 0.861538 1.000000 0.925620
Jamaican Patois (jam) 1.000000 0.946429 0.972477
Javanese (jav) 0.964912 0.982143 0.973451
Lojban (jbo) 1.000000 1.000000 1.000000
Japanese (jpn) 1.000000 1.000000 1.000000
Karakalpak (kaa) 0.965517 1.000000 0.982456
Kabyle (kab) 1.000000 0.964286 0.981818
Kannada (kan) 0.982143 0.982143 0.982143
Georgian (kat) 1.000000 0.964286 0.981818
Kazakh (kaz) 0.980769 0.980769 0.980769
Kabardian (kbd) 1.000000 0.982143 0.990991
Central Khmer (khm) 0.960784 0.875000 0.915888
Kinyarwanda (kin) 0.981132 0.928571 0.954128
Kirghiz (kir) 1.000000 1.000000 1.000000
Komi-Permyak (koi) 0.962264 0.910714 0.935780
Konkani (kok) 0.964286 0.981818 0.972973
Komi (kom) 1.000000 0.962264 0.980769
Korean (kor) 1.000000 1.000000 1.000000
Karachay-Balkar (krc) 1.000000 0.982143 0.990991
Ripuarisch (ksh) 1.000000 0.964286 0.981818
Kurdish (kur) 1.000000 0.964286 0.981818
Ladino (lad) 1.000000 1.000000 1.000000
Lao (lao) 0.961538 0.909091 0.934579
Latin (lat) 0.877193 0.943396 0.909091
Latvian (lav) 0.963636 0.946429 0.954955
Lezghian (lez) 1.000000 0.964286 0.981818
Ligurian (lij) 1.000000 0.964286 0.981818
Limburgan (lim) 0.938776 1.000000 0.968421
Lingala (lin) 0.980769 0.927273 0.953271
Lithuanian (lit) 0.982456 1.000000 0.991150
Lombard (lmo) 1.000000 1.000000 1.000000
Northern Luri (lrc) 1.000000 0.928571 0.962963
Latgalian (ltg) 1.000000 0.982143 0.990991
Luxembourgish (ltz) 0.949153 1.000000 0.973913
Luganda (lug) 1.000000 1.000000 1.000000
Literary Chinese (lzh) 1.000000 1.000000 1.000000
Maithili (mai) 0.931034 0.964286 0.947368
Malayalam (mal) 1.000000 0.982143 0.990991
Banyumasan (map-bms) 0.977778 0.785714 0.871287
Marathi (mar) 0.949153 1.000000 0.973913
Moksha (mdf) 0.980000 0.890909 0.933333
Eastern Mari (mhr) 0.981818 0.964286 0.972973
Minangkabau (min) 1.000000 1.000000 1.000000
Macedonian (mkd) 1.000000 0.981818 0.990826
Malagasy (mlg) 0.981132 1.000000 0.990476
Maltese (mlt) 0.982456 1.000000 0.991150
Min Nan Chinese (nan) 1.000000 1.000000 1.000000
Mongolian (mon) 1.000000 0.981818 0.990826
Maori (mri) 1.000000 1.000000 1.000000
Western Mari (mrj) 0.982456 1.000000 0.991150
Malay (msa) 0.862069 0.892857 0.877193
Mirandese (mwl) 1.000000 0.982143 0.990991
Burmese (mya) 1.000000 1.000000 1.000000
Erzya (myv) 0.818182 0.964286 0.885246
Mazanderani (mzn) 0.981481 1.000000 0.990654
Neapolitan (nap) 1.000000 0.981818 0.990826
Navajo (nav) 1.000000 1.000000 1.000000
Classical Nahuatl (nci) 0.981481 0.946429 0.963636
Low German (nds) 0.982143 0.982143 0.982143
West Low German (nds-nl) 1.000000 1.000000 1.000000
Nepali (macrolanguage) (nep) 0.881356 0.928571 0.904348
Newari (new) 1.000000 0.909091 0.952381
Dutch (nld) 0.982143 0.982143 0.982143
Norwegian Nynorsk (nno) 1.000000 1.000000 1.000000
Bokmål (nob) 1.000000 1.000000 1.000000
Narom (nrm) 0.981818 0.964286 0.972973
Northern Sotho (nso) 1.000000 1.000000 1.000000
Occitan (oci) 0.903846 0.839286 0.870370
Livvi-Karelian (olo) 0.982456 1.000000 0.991150
Oriya (ori) 0.964912 0.982143 0.973451
Oromo (orm) 0.982143 0.982143 0.982143
Ossetian (oss) 0.982143 1.000000 0.990991
Pangasinan (pag) 0.980000 0.875000 0.924528
Pampanga (pam) 0.928571 0.896552 0.912281
Panjabi (pan) 1.000000 1.000000 1.000000
Papiamento (pap) 1.000000 0.964286 0.981818
Picard (pcd) 0.849057 0.849057 0.849057
Pennsylvania German (pdc) 0.854839 0.946429 0.898305
Palatine German (pfl) 0.946429 0.946429 0.946429
Western Panjabi (pnb) 0.981132 0.962963 0.971963
Polish (pol) 0.933333 1.000000 0.965517
Portuguese (por) 0.774648 0.982143 0.866142
Pushto (pus) 1.000000 0.910714 0.953271
Quechua (que) 0.962963 0.928571 0.945455
Tarantino dialect (roa-tara) 1.000000 0.964286 0.981818
Romansh (roh) 1.000000 0.928571 0.962963
Romanian (ron) 0.965517 1.000000 0.982456
Rusyn (rue) 0.946429 0.946429 0.946429
Aromanian (rup) 0.962963 0.928571 0.945455
Russian (rus) 0.859375 0.982143 0.916667
Yakut (sah) 1.000000 0.982143 0.990991
Sanskrit (san) 0.982143 0.982143 0.982143
Sicilian (scn) 1.000000 1.000000 1.000000
Scots (sco) 0.982143 0.982143 0.982143
Samogitian (sgs) 1.000000 0.982143 0.990991
Sinhala (sin) 0.964912 0.982143 0.973451
Slovak (slk) 1.000000 0.982143 0.990991
Slovene (slv) 1.000000 0.981818 0.990826
Northern Sami (sme) 0.962264 0.962264 0.962264
Shona (sna) 0.933333 1.000000 0.965517
Sindhi (snd) 1.000000 1.000000 1.000000
Somali (som) 0.948276 1.000000 0.973451
Spanish (spa) 0.739130 0.910714 0.816000
Albanian (sqi) 0.982143 0.982143 0.982143
Sardinian (srd) 1.000000 0.982143 0.990991
Sranan (srn) 1.000000 1.000000 1.000000
Serbian (srp) 1.000000 0.946429 0.972477
Saterfriesisch (stq) 1.000000 0.964286 0.981818
Sundanese (sun) 1.000000 0.977273 0.988506
Swahili (macrolanguage) (swa) 1.000000 1.000000 1.000000
Swedish (swe) 1.000000 1.000000 1.000000
Silesian (szl) 1.000000 0.981481 0.990654
Tamil (tam) 0.982143 1.000000 0.990991
Tatar (tat) 1.000000 1.000000 1.000000
Tulu (tcy) 0.982456 1.000000 0.991150
Telugu (tel) 1.000000 0.920000 0.958333
Tetum (tet) 1.000000 0.964286 0.981818
Tajik (tgk) 1.000000 1.000000 1.000000
Tagalog (tgl) 1.000000 1.000000 1.000000
Thai (tha) 0.932203 0.982143 0.956522
Tongan (ton) 1.000000 0.964286 0.981818
Tswana (tsn) 1.000000 1.000000 1.000000
Turkmen (tuk) 1.000000 0.982143 0.990991
Turkish (tur) 0.901639 0.982143 0.940171
Tuvan (tyv) 1.000000 0.964286 0.981818
Udmurt (udm) 1.000000 0.982143 0.990991
Uighur (uig) 1.000000 0.982143 0.990991
Ukrainian (ukr) 0.963636 0.946429 0.954955
Urdu (urd) 1.000000 0.982143 0.990991
Uzbek (uzb) 1.000000 1.000000 1.000000
Venetian (vec) 1.000000 0.982143 0.990991
Veps (vep) 0.982456 1.000000 0.991150
Vietnamese (vie) 0.964912 0.982143 0.973451
Vlaams (vls) 1.000000 0.982143 0.990991
Volapük (vol) 1.000000 1.000000 1.000000
Võro (vro) 0.964286 0.964286 0.964286
Waray (war) 1.000000 0.982143 0.990991
Walloon (wln) 1.000000 1.000000 1.000000
Wolof (wol) 0.981481 0.963636 0.972477
Wu Chinese (wuu) 0.981481 0.946429 0.963636
Xhosa (xho) 1.000000 0.964286 0.981818
Mingrelian (xmf) 1.000000 0.964286 0.981818
Yiddish (yid) 1.000000 1.000000 1.000000
Yoruba (yor) 0.964912 0.982143 0.973451
Zeeuws (zea) 1.000000 0.982143 0.990991
Cantonese (zh-yue) 0.981481 0.946429 0.963636
Standard Chinese (zho) 0.932203 0.982143 0.956522
accuracy 0.963055 0.963055 0.963055
macro avg 0.966424 0.963216 0.963891
weighted avg 0.966040 0.963055 0.963606

Sentence level

language precision recall f1-score
Achinese (ace) 0.754545 0.873684 0.809756
Afrikaans (afr) 0.708955 0.940594 0.808511
Alemannic German (als) 0.870130 0.752809 0.807229
Amharic (amh) 1.000000 0.820000 0.901099
Old English (ang) 0.966667 0.906250 0.935484
Arabic (ara) 0.907692 0.967213 0.936508
Aragonese (arg) 0.921569 0.959184 0.940000
Egyptian Arabic (arz) 0.964286 0.843750 0.900000
Assamese (asm) 0.964286 0.870968 0.915254
Asturian (ast) 0.880000 0.795181 0.835443
Avar (ava) 0.864198 0.843373 0.853659
Aymara (aym) 1.000000 0.901961 0.948454
South Azerbaijani (azb) 0.979381 0.989583 0.984456
Azerbaijani (aze) 0.989899 0.960784 0.975124
Bashkir (bak) 0.837209 0.857143 0.847059
Bavarian (bar) 0.741935 0.766667 0.754098
Central Bikol (bcl) 0.962963 0.928571 0.945455
Belarusian (Taraschkewiza) (be-tarask) 0.857143 0.733333 0.790419
Belarusian (bel) 0.775510 0.752475 0.763819
Bengali (ben) 0.861111 0.911765 0.885714
Bhojpuri (bho) 0.965517 0.933333 0.949153
Banjar (bjn) 0.891566 0.880952 0.886228
Tibetan (bod) 1.000000 1.000000 1.000000
Bosnian (bos) 0.375000 0.323077 0.347107
Bishnupriya (bpy) 0.986301 1.000000 0.993103
Breton (bre) 0.951613 0.893939 0.921875
Bulgarian (bul) 0.945055 0.877551 0.910053
Buryat (bxr) 0.955556 0.843137 0.895833
Catalan (cat) 0.692308 0.750000 0.720000
Chavacano (cbk) 0.842857 0.641304 0.728395
Min Dong (cdo) 0.972973 1.000000 0.986301
Cebuano (ceb) 0.981308 0.954545 0.967742
Czech (ces) 0.944444 0.915385 0.929687
Chechen (che) 0.875000 0.700000 0.777778
Cherokee (chr) 1.000000 0.970588 0.985075
Chuvash (chv) 0.875000 0.836957 0.855556
Central Kurdish (ckb) 1.000000 0.983051 0.991453
Cornish (cor) 0.979592 0.969697 0.974619
Corsican (cos) 0.986842 0.925926 0.955414
Crimean Tatar (crh) 0.958333 0.907895 0.932432
Kashubian (csb) 0.920354 0.904348 0.912281
Welsh (cym) 0.971014 0.943662 0.957143
Danish (dan) 0.865169 0.777778 0.819149
German (deu) 0.721311 0.822430 0.768559
Dimli (diq) 0.915966 0.923729 0.919831
Dhivehi (div) 1.000000 0.991228 0.995595
Lower Sorbian (dsb) 0.898876 0.879121 0.888889
Doteli (dty) 0.821429 0.638889 0.718750
Emilian (egl) 0.988095 0.922222 0.954023
Modern Greek (ell) 0.988636 0.966667 0.977528
English (eng) 0.522727 0.784091 0.627273
Esperanto (epo) 0.963855 0.930233 0.946746
Estonian (est) 0.922222 0.873684 0.897297
Basque (eus) 1.000000 0.941176 0.969697
Extremaduran (ext) 0.925373 0.885714 0.905109
Faroese (fao) 0.855072 0.887218 0.870849
Persian (fas) 0.879630 0.979381 0.926829
Finnish (fin) 0.952830 0.943925 0.948357
French (fra) 0.676768 0.943662 0.788235
Arpitan (frp) 0.867925 0.807018 0.836364
Western Frisian (fry) 0.956989 0.890000 0.922280
Friulian (fur) 1.000000 0.857143 0.923077
Gagauz (gag) 0.939024 0.802083 0.865169
Scottish Gaelic (gla) 1.000000 0.879121 0.935673
Irish (gle) 0.989247 0.958333 0.973545
Galician (glg) 0.910256 0.922078 0.916129
Gilaki (glk) 0.964706 0.872340 0.916201
Manx (glv) 1.000000 0.965517 0.982456
Guarani (grn) 0.983333 1.000000 0.991597
Gujarati (guj) 1.000000 0.991525 0.995745
Hakka Chinese (hak) 0.955224 0.955224 0.955224
Haitian Creole (hat) 0.833333 0.666667 0.740741
Hausa (hau) 0.936709 0.913580 0.925000
Serbo-Croatian (hbs) 0.452830 0.410256 0.430493
Hebrew (heb) 0.988235 0.976744 0.982456
Fiji Hindi (hif) 0.936709 0.840909 0.886228
Hindi (hin) 0.965517 0.756757 0.848485
Croatian (hrv) 0.443820 0.537415 0.486154
Upper Sorbian (hsb) 0.951613 0.830986 0.887218
Hungarian (hun) 0.854701 0.909091 0.881057
Armenian (hye) 1.000000 0.816327 0.898876
Igbo (ibo) 0.974359 0.926829 0.950000
Ido (ido) 0.975000 0.987342 0.981132
Interlingue (ile) 0.880597 0.921875 0.900763
Iloko (ilo) 0.882353 0.821918 0.851064
Interlingua (ina) 0.952381 0.895522 0.923077
Indonesian (ind) 0.606383 0.695122 0.647727
Icelandic (isl) 0.978261 0.882353 0.927835
Italian (ita) 0.910448 0.910448 0.910448
Jamaican Patois (jam) 0.988764 0.967033 0.977778
Javanese (jav) 0.903614 0.862069 0.882353
Lojban (jbo) 0.943878 0.929648 0.936709
Japanese (jpn) 1.000000 0.764706 0.866667
Karakalpak (kaa) 0.940171 0.901639 0.920502
Kabyle (kab) 0.985294 0.837500 0.905405
Kannada (kan) 0.975806 0.975806 0.975806
Georgian (kat) 0.953704 0.903509 0.927928
Kazakh (kaz) 0.934579 0.877193 0.904977
Kabardian (kbd) 0.987952 0.953488 0.970414
Central Khmer (khm) 0.928571 0.829787 0.876404
Kinyarwanda (kin) 0.953125 0.938462 0.945736
Kirghiz (kir) 0.927632 0.881250 0.903846
Komi-Permyak (koi) 0.750000 0.776786 0.763158
Konkani (kok) 0.893491 0.872832 0.883041
Komi (kom) 0.734177 0.690476 0.711656
Korean (kor) 0.989899 0.989899 0.989899
Karachay-Balkar (krc) 0.928571 0.917647 0.923077
Ripuarisch (ksh) 0.915789 0.896907 0.906250
Kurdish (kur) 0.977528 0.935484 0.956044
Ladino (lad) 0.985075 0.904110 0.942857
Lao (lao) 0.896552 0.812500 0.852459
Latin (lat) 0.741935 0.831325 0.784091
Latvian (lav) 0.710526 0.878049 0.785455
Lezghian (lez) 0.975309 0.877778 0.923977
Ligurian (lij) 0.951807 0.897727 0.923977
Limburgan (lim) 0.909091 0.921053 0.915033
Lingala (lin) 0.942857 0.814815 0.874172
Lithuanian (lit) 0.892857 0.925926 0.909091
Lombard (lmo) 0.766234 0.951613 0.848921
Northern Luri (lrc) 0.972222 0.875000 0.921053
Latgalian (ltg) 0.895349 0.865169 0.880000
Luxembourgish (ltz) 0.882353 0.750000 0.810811
Luganda (lug) 0.946429 0.883333 0.913793
Literary Chinese (lzh) 1.000000 1.000000 1.000000
Maithili (mai) 0.893617 0.823529 0.857143
Malayalam (mal) 1.000000 0.975000 0.987342
Banyumasan (map-bms) 0.924242 0.772152 0.841379
Marathi (mar) 0.874126 0.919118 0.896057
Moksha (mdf) 0.771242 0.830986 0.800000
Eastern Mari (mhr) 0.820000 0.860140 0.839590
Minangkabau (min) 0.973684 0.973684 0.973684
Macedonian (mkd) 0.895652 0.953704 0.923767
Malagasy (mlg) 1.000000 0.966102 0.982759
Maltese (mlt) 0.987952 0.964706 0.976190
Min Nan Chinese (nan) 0.975000 1.000000 0.987342
Mongolian (mon) 0.954545 0.933333 0.943820
Maori (mri) 0.985294 1.000000 0.992593
Western Mari (mrj) 0.966292 0.914894 0.939891
Malay (msa) 0.770270 0.695122 0.730769
Mirandese (mwl) 0.970588 0.891892 0.929577
Burmese (mya) 1.000000 0.964286 0.981818
Erzya (myv) 0.535714 0.681818 0.600000
Mazanderani (mzn) 0.968750 0.898551 0.932331
Neapolitan (nap) 0.892308 0.865672 0.878788
Navajo (nav) 0.984375 0.984375 0.984375
Classical Nahuatl (nci) 0.901408 0.761905 0.825806
Low German (nds) 0.896226 0.913462 0.904762
West Low German (nds-nl) 0.873563 0.835165 0.853933
Nepali (macrolanguage) (nep) 0.704545 0.861111 0.775000
Newari (new) 0.920000 0.741935 0.821429
Dutch (nld) 0.925926 0.872093 0.898204
Norwegian Nynorsk (nno) 0.847059 0.808989 0.827586
Bokmål (nob) 0.861386 0.852941 0.857143
Narom (nrm) 0.966667 0.983051 0.974790
Northern Sotho (nso) 0.897436 0.921053 0.909091
Occitan (oci) 0.958333 0.696970 0.807018
Livvi-Karelian (olo) 0.967742 0.937500 0.952381
Oriya (ori) 0.933333 1.000000 0.965517
Oromo (orm) 0.977528 0.915789 0.945652
Ossetian (oss) 0.958333 0.841463 0.896104
Pangasinan (pag) 0.847328 0.909836 0.877470
Pampanga (pam) 0.969697 0.780488 0.864865
Panjabi (pan) 1.000000 1.000000 1.000000
Papiamento (pap) 0.876190 0.920000 0.897561
Picard (pcd) 0.707317 0.568627 0.630435
Pennsylvania German (pdc) 0.827273 0.827273 0.827273
Palatine German (pfl) 0.882353 0.914634 0.898204
Western Panjabi (pnb) 0.964286 0.931034 0.947368
Polish (pol) 0.859813 0.910891 0.884615
Portuguese (por) 0.535714 0.833333 0.652174
Pushto (pus) 0.989362 0.902913 0.944162
Quechua (que) 0.979167 0.903846 0.940000
Tarantino dialect (roa-tara) 0.964912 0.901639 0.932203
Romansh (roh) 0.914894 0.895833 0.905263
Romanian (ron) 0.880597 0.880597 0.880597
Rusyn (rue) 0.932584 0.805825 0.864583
Aromanian (rup) 0.783333 0.758065 0.770492
Russian (rus) 0.517986 0.765957 0.618026
Yakut (sah) 0.954023 0.922222 0.937853
Sanskrit (san) 0.866667 0.951220 0.906977
Sicilian (scn) 0.984375 0.940299 0.961832
Scots (sco) 0.851351 0.900000 0.875000
Samogitian (sgs) 0.977011 0.876289 0.923913
Sinhala (sin) 0.406154 0.985075 0.575163
Slovak (slk) 0.956989 0.872549 0.912821
Slovene (slv) 0.907216 0.854369 0.880000
Northern Sami (sme) 0.949367 0.892857 0.920245
Shona (sna) 0.936508 0.855072 0.893939
Sindhi (snd) 0.984962 0.992424 0.988679
Somali (som) 0.949153 0.848485 0.896000
Spanish (spa) 0.584158 0.746835 0.655556
Albanian (sqi) 0.988095 0.912088 0.948571
Sardinian (srd) 0.957746 0.931507 0.944444
Sranan (srn) 0.985714 0.945205 0.965035
Serbian (srp) 0.950980 0.889908 0.919431
Saterfriesisch (stq) 0.962500 0.875000 0.916667
Sundanese (sun) 0.778846 0.910112 0.839378
Swahili (macrolanguage) (swa) 0.915493 0.878378 0.896552
Swedish (swe) 0.989247 0.958333 0.973545
Silesian (szl) 0.944444 0.904255 0.923913
Tamil (tam) 0.990000 0.970588 0.980198
Tatar (tat) 0.942029 0.902778 0.921986
Tulu (tcy) 0.980519 0.967949 0.974194
Telugu (tel) 0.965986 0.965986 0.965986
Tetum (tet) 0.898734 0.855422 0.876543
Tajik (tgk) 0.974684 0.939024 0.956522
Tagalog (tgl) 0.965909 0.934066 0.949721
Thai (tha) 0.923077 0.882353 0.902256
Tongan (ton) 0.970149 0.890411 0.928571
Tswana (tsn) 0.888889 0.926316 0.907216
Turkmen (tuk) 0.968000 0.889706 0.927203
Turkish (tur) 0.871287 0.926316 0.897959
Tuvan (tyv) 0.948454 0.859813 0.901961
Udmurt (udm) 0.989362 0.894231 0.939394
Uighur (uig) 1.000000 0.953333 0.976109
Ukrainian (ukr) 0.893617 0.875000 0.884211
Urdu (urd) 1.000000 1.000000 1.000000
Uzbek (uzb) 0.636042 0.886700 0.740741
Venetian (vec) 1.000000 0.941176 0.969697
Veps (vep) 0.858586 0.965909 0.909091
Vietnamese (vie) 1.000000 0.940476 0.969325
Vlaams (vls) 0.885714 0.898551 0.892086
Volapük (vol) 0.975309 0.975309 0.975309
Võro (vro) 0.855670 0.864583 0.860104
Waray (war) 0.972222 0.909091 0.939597
Walloon (wln) 0.742138 0.893939 0.810997
Wolof (wol) 0.882979 0.954023 0.917127
Wu Chinese (wuu) 0.961538 0.833333 0.892857
Xhosa (xho) 0.934066 0.867347 0.899471
Mingrelian (xmf) 0.958333 0.929293 0.943590
Yiddish (yid) 0.984375 0.875000 0.926471
Yoruba (yor) 0.868421 0.857143 0.862745
Zeeuws (zea) 0.879518 0.793478 0.834286
Cantonese (zh-yue) 0.896552 0.812500 0.852459
Standard Chinese (zho) 0.906250 0.935484 0.920635
accuracy 0.881051 0.881051 0.881051
macro avg 0.903245 0.880618 0.888996
weighted avg 0.894174 0.881051 0.884520

Token level

language precision recall f1-score
Achinese (ace) 0.873846 0.827988 0.850299
Afrikaans (afr) 0.638060 0.732334 0.681954
Alemannic German (als) 0.673780 0.547030 0.603825
Amharic (amh) 0.997743 0.954644 0.975717
Old English (ang) 0.840816 0.693603 0.760148
Arabic (ara) 0.768737 0.840749 0.803132
Aragonese (arg) 0.493671 0.505181 0.499360
Egyptian Arabic (arz) 0.823529 0.741935 0.780606
Assamese (asm) 0.948454 0.893204 0.920000
Asturian (ast) 0.490000 0.508299 0.498982
Avar (ava) 0.813636 0.655678 0.726166
Aymara (aym) 0.795833 0.779592 0.787629
South Azerbaijani (azb) 0.832836 0.863777 0.848024
Azerbaijani (aze) 0.867470 0.800000 0.832370
Bashkir (bak) 0.851852 0.750000 0.797688
Bavarian (bar) 0.560897 0.522388 0.540958
Central Bikol (bcl) 0.708229 0.668235 0.687651
Belarusian (Taraschkewiza) (be-tarask) 0.615635 0.526462 0.567568
Belarusian (bel) 0.539952 0.597855 0.567430
Bengali (ben) 0.830275 0.885086 0.856805
Bhojpuri (bho) 0.723118 0.691517 0.706965
Banjar (bjn) 0.619586 0.726269 0.668699
Tibetan (bod) 0.999537 0.991728 0.995617
Bosnian (bos) 0.330849 0.403636 0.363636
Bishnupriya (bpy) 0.941634 0.949020 0.945312
Breton (bre) 0.772222 0.745308 0.758527
Bulgarian (bul) 0.771505 0.706897 0.737789
Buryat (bxr) 0.741935 0.753149 0.747500
Catalan (cat) 0.528716 0.610136 0.566516
Chavacano (cbk) 0.409449 0.312625 0.354545
Min Dong (cdo) 0.951264 0.936057 0.943599
Cebuano (ceb) 0.888298 0.876640 0.882431
Czech (ces) 0.806045 0.758294 0.781441
Chechen (che) 0.857143 0.600000 0.705882
Cherokee (chr) 0.997840 0.952577 0.974684
Chuvash (chv) 0.874346 0.776744 0.822660
Central Kurdish (ckb) 0.984848 0.953545 0.968944
Cornish (cor) 0.747596 0.807792 0.776529
Corsican (cos) 0.673913 0.708571 0.690808
Crimean Tatar (crh) 0.498801 0.700337 0.582633
Kashubian (csb) 0.797059 0.794721 0.795888
Welsh (cym) 0.829609 0.841360 0.835443
Danish (dan) 0.649789 0.622222 0.635707
German (deu) 0.559406 0.763514 0.645714
Dimli (diq) 0.835580 0.763547 0.797941
Dhivehi (div) 1.000000 0.980645 0.990228
Lower Sorbian (dsb) 0.740484 0.694805 0.716918
Doteli (dty) 0.616314 0.527132 0.568245
Emilian (egl) 0.822993 0.769625 0.795414
Modern Greek (ell) 0.972043 0.963753 0.967880
English (eng) 0.260492 0.724346 0.383183
Esperanto (epo) 0.766764 0.716621 0.740845
Estonian (est) 0.698885 0.673835 0.686131
Basque (eus) 0.882716 0.841176 0.861446
Extremaduran (ext) 0.570605 0.511628 0.539510
Faroese (fao) 0.773987 0.784017 0.778970
Persian (fas) 0.709836 0.809346 0.756332
Finnish (fin) 0.866261 0.796089 0.829694
French (fra) 0.496263 0.700422 0.580927
Arpitan (frp) 0.663366 0.584302 0.621329
Western Frisian (fry) 0.750000 0.756148 0.753061
Friulian (fur) 0.713555 0.675545 0.694030
Gagauz (gag) 0.728125 0.677326 0.701807
Scottish Gaelic (gla) 0.831601 0.817996 0.824742
Irish (gle) 0.868852 0.801296 0.833708
Galician (glg) 0.469816 0.454315 0.461935
Gilaki (glk) 0.703883 0.687204 0.695444
Manx (glv) 0.873047 0.886905 0.879921
Guarani (grn) 0.848580 0.793510 0.820122
Gujarati (guj) 0.995643 0.926978 0.960084
Hakka Chinese (hak) 0.898403 0.904971 0.901675
Haitian Creole (hat) 0.719298 0.518987 0.602941
Hausa (hau) 0.815353 0.829114 0.822176
Serbo-Croatian (hbs) 0.343465 0.244589 0.285714
Hebrew (heb) 0.891304 0.933941 0.912125
Fiji Hindi (hif) 0.662577 0.664615 0.663594
Hindi (hin) 0.782301 0.778169 0.780229
Croatian (hrv) 0.360308 0.374000 0.367026
Upper Sorbian (hsb) 0.745763 0.611111 0.671756
Hungarian (hun) 0.876812 0.846154 0.861210
Armenian (hye) 0.988201 0.917808 0.951705
Igbo (ibo) 0.825397 0.696429 0.755448
Ido (ido) 0.760479 0.814103 0.786378
Interlingue (ile) 0.701299 0.580645 0.635294
Iloko (ilo) 0.688356 0.844538 0.758491
Interlingua (ina) 0.577889 0.588235 0.583016
Indonesian (ind) 0.415879 0.514019 0.459770
Icelandic (isl) 0.855263 0.790754 0.821745
Italian (ita) 0.474576 0.561247 0.514286
Jamaican Patois (jam) 0.826087 0.791667 0.808511
Javanese (jav) 0.670130 0.658163 0.664093
Lojban (jbo) 0.896861 0.917431 0.907029
Japanese (jpn) 0.931373 0.848214 0.887850
Karakalpak (kaa) 0.790393 0.827744 0.808637
Kabyle (kab) 0.828571 0.759162 0.792350
Kannada (kan) 0.879357 0.847545 0.863158
Georgian (kat) 0.916399 0.907643 0.912000
Kazakh (kaz) 0.900901 0.819672 0.858369
Kabardian (kbd) 0.923345 0.892256 0.907534
Central Khmer (khm) 0.976667 0.816156 0.889226
Kinyarwanda (kin) 0.824324 0.726190 0.772152
Kirghiz (kir) 0.674766 0.779698 0.723447
Komi-Permyak (koi) 0.652830 0.633700 0.643123
Konkani (kok) 0.778865 0.728938 0.753075
Komi (kom) 0.737374 0.572549 0.644592
Korean (kor) 0.984615 0.967603 0.976035
Karachay-Balkar (krc) 0.869416 0.857627 0.863481
Ripuarisch (ksh) 0.709859 0.649485 0.678331
Kurdish (kur) 0.883777 0.862884 0.873206
Ladino (lad) 0.660920 0.576441 0.615797
Lao (lao) 0.986175 0.918455 0.951111
Latin (lat) 0.581250 0.636986 0.607843
Latvian (lav) 0.824513 0.797844 0.810959
Lezghian (lez) 0.898955 0.793846 0.843137
Ligurian (lij) 0.662903 0.677100 0.669927
Limburgan (lim) 0.615385 0.581818 0.598131
Lingala (lin) 0.836207 0.763780 0.798354
Lithuanian (lit) 0.756329 0.804714 0.779772
Lombard (lmo) 0.556818 0.536986 0.546722
Northern Luri (lrc) 0.838574 0.753296 0.793651
Latgalian (ltg) 0.759531 0.755102 0.757310
Luxembourgish (ltz) 0.645062 0.614706 0.629518
Luganda (lug) 0.787535 0.805797 0.796562
Literary Chinese (lzh) 0.921951 0.949749 0.935644
Maithili (mai) 0.777778 0.761658 0.769634
Malayalam (mal) 0.993377 0.949367 0.970874
Banyumasan (map-bms) 0.531429 0.453659 0.489474
Marathi (mar) 0.748744 0.818681 0.782152
Moksha (mdf) 0.728745 0.800000 0.762712
Eastern Mari (mhr) 0.790323 0.760870 0.775316
Minangkabau (min) 0.953271 0.886957 0.918919
Macedonian (mkd) 0.816399 0.849722 0.832727
Malagasy (mlg) 0.925187 0.918317 0.921739
Maltese (mlt) 0.869421 0.890017 0.879599
Min Nan Chinese (nan) 0.743707 0.820707 0.780312
Mongolian (mon) 0.852194 0.838636 0.845361
Maori (mri) 0.934726 0.937173 0.935948
Western Mari (mrj) 0.818792 0.827119 0.822934
Malay (msa) 0.508065 0.376119 0.432247
Mirandese (mwl) 0.650407 0.685225 0.667362
Burmese (mya) 0.995968 0.972441 0.984064
Erzya (myv) 0.475783 0.503012 0.489019
Mazanderani (mzn) 0.775362 0.701639 0.736661
Neapolitan (nap) 0.628993 0.595349 0.611708
Navajo (nav) 0.955882 0.937500 0.946602
Classical Nahuatl (nci) 0.679758 0.589005 0.631136
Low German (nds) 0.669789 0.690821 0.680143
West Low German (nds-nl) 0.513889 0.504545 0.509174
Nepali (macrolanguage) (nep) 0.640476 0.649758 0.645084
Newari (new) 0.928571 0.745902 0.827273
Dutch (nld) 0.553763 0.553763 0.553763
Norwegian Nynorsk (nno) 0.569277 0.519231 0.543103
Bokmål (nob) 0.519856 0.562500 0.540338
Narom (nrm) 0.691275 0.605882 0.645768
Northern Sotho (nso) 0.950276 0.815166 0.877551
Occitan (oci) 0.483444 0.366834 0.417143
Livvi-Karelian (olo) 0.816850 0.790780 0.803604
Oriya (ori) 0.981481 0.963636 0.972477
Oromo (orm) 0.885714 0.829218 0.856536
Ossetian (oss) 0.822006 0.855219 0.838284
Pangasinan (pag) 0.842105 0.715655 0.773748
Pampanga (pam) 0.770000 0.435028 0.555957
Panjabi (pan) 0.996154 0.984791 0.990440
Papiamento (pap) 0.674672 0.661670 0.668108
Picard (pcd) 0.407895 0.356322 0.380368
Pennsylvania German (pdc) 0.487047 0.509485 0.498013
Palatine German (pfl) 0.614173 0.570732 0.591656
Western Panjabi (pnb) 0.926267 0.887417 0.906426
Polish (pol) 0.797059 0.734417 0.764457
Portuguese (por) 0.500914 0.586724 0.540434
Pushto (pus) 0.941489 0.898477 0.919481
Quechua (que) 0.854167 0.797665 0.824950
Tarantino dialect (roa-tara) 0.669794 0.724138 0.695906
Romansh (roh) 0.745527 0.760649 0.753012
Romanian (ron) 0.805486 0.769048 0.786845
Rusyn (rue) 0.718543 0.645833 0.680251
Aromanian (rup) 0.288482 0.730245 0.413580
Russian (rus) 0.530120 0.690583 0.599805
Yakut (sah) 0.853521 0.865714 0.859574
Sanskrit (san) 0.931343 0.896552 0.913616
Sicilian (scn) 0.734139 0.618321 0.671271
Scots (sco) 0.571429 0.540816 0.555701
Samogitian (sgs) 0.829167 0.748120 0.786561
Sinhala (sin) 0.909474 0.935065 0.922092
Slovak (slk) 0.738235 0.665782 0.700139
Slovene (slv) 0.671123 0.662269 0.666667
Northern Sami (sme) 0.800676 0.825784 0.813036
Shona (sna) 0.761702 0.724696 0.742739
Sindhi (snd) 0.950172 0.946918 0.948542
Somali (som) 0.849462 0.802030 0.825065
Spanish (spa) 0.325234 0.413302 0.364017
Albanian (sqi) 0.875899 0.832479 0.853637
Sardinian (srd) 0.750000 0.711061 0.730012
Sranan (srn) 0.888889 0.771084 0.825806
Serbian (srp) 0.824561 0.814356 0.819427
Saterfriesisch (stq) 0.790087 0.734417 0.761236
Sundanese (sun) 0.764192 0.631769 0.691700
Swahili (macrolanguage) (swa) 0.763496 0.796247 0.779528
Swedish (swe) 0.838284 0.723647 0.776758
Silesian (szl) 0.819788 0.750809 0.783784
Tamil (tam) 0.985765 0.955172 0.970228
Tatar (tat) 0.469780 0.795349 0.590674
Tulu (tcy) 0.893300 0.873786 0.883436
Telugu (tel) 1.000000 0.913690 0.954899
Tetum (tet) 0.765116 0.744344 0.754587
Tajik (tgk) 0.828418 0.813158 0.820717
Tagalog (tgl) 0.751468 0.757396 0.754420
Thai (tha) 0.933884 0.807143 0.865900
Tongan (ton) 0.920245 0.923077 0.921659
Tswana (tsn) 0.873397 0.889070 0.881164
Turkmen (tuk) 0.898438 0.837887 0.867107
Turkish (tur) 0.666667 0.716981 0.690909
Tuvan (tyv) 0.857143 0.805063 0.830287
Udmurt (udm) 0.865517 0.756024 0.807074
Uighur (uig) 0.991597 0.967213 0.979253
Ukrainian (ukr) 0.771341 0.702778 0.735465
Urdu (urd) 0.877647 0.855505 0.866434
Uzbek (uzb) 0.655652 0.797040 0.719466
Venetian (vec) 0.611111 0.527233 0.566082
Veps (vep) 0.672862 0.688213 0.680451
Vietnamese (vie) 0.932406 0.914230 0.923228
Vlaams (vls) 0.594427 0.501305 0.543909
Volapük (vol) 0.765625 0.942308 0.844828
Võro (vro) 0.797203 0.740260 0.767677
Waray (war) 0.930876 0.930876 0.930876
Walloon (wln) 0.636804 0.693931 0.664141
Wolof (wol) 0.864220 0.845601 0.854809
Wu Chinese (wuu) 0.848921 0.830986 0.839858
Xhosa (xho) 0.837398 0.759214 0.796392
Mingrelian (xmf) 0.943396 0.874126 0.907441
Yiddish (yid) 0.955729 0.897311 0.925599
Yoruba (yor) 0.812010 0.719907 0.763190
Zeeuws (zea) 0.617737 0.550409 0.582133
Cantonese (zh-yue) 0.859649 0.649007 0.739623
Standard Chinese (zho) 0.845528 0.781955 0.812500
accuracy 0.749527 0.749527 0.749527
macro avg 0.762866 0.742101 0.749261
weighted avg 0.762006 0.749527 0.752910

As can be seen, the model performs well on groups of similar languages and dialects. For instance, the F1 scores for Persian and the similar languages Gilaki, Northern Luri, Central Kurdish, Kurdish, and Mazanderani are 92%, 91%, 92%, 99%, 94%, and 93%, respectively.
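The three numeric columns in the table above appear to be precision, recall, and F1, with F1 being the harmonic mean of the first two. The sketch below recomputes F1 for two rows copied from the table (Malay and Dutch) as a sanity check. The helper name `f1_score` is only illustrative and is not part of Zabanshenas.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
rows = {
    "Malay (msa)": (0.508065, 0.376119, 0.432247),
    "Dutch (nld)": (0.553763, 0.553763, 0.553763),
}

for name, (precision, recall, reported) in rows.items():
    computed = f1_score(precision, recall)
    print(f"{name}: computed={computed:.6f} reported={reported:.6f}")
```

Because the table values are rounded to six decimals, a recomputed F1 can differ from the reported one in the last digit.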

How to Install

We recommend Python 3.7 or higher and PyTorch 1.6.0 or higher. The code does not work with Python 2.7.

pip install git+https://github.com/m3hrdadfi/zabanshenas.git
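If you want to confirm that your environment meets the recommended versions above, a small check along these lines may help. It is a minimal sketch: `version_tuple`, `meets_minimum`, and `check_environment` are hypothetical helper names, not part of Zabanshenas, and the version parsing only looks at the first three numeric components.

```python
import re
import sys


def version_tuple(version: str) -> tuple:
    """Parse '1.13.1+cu117' into (1, 13, 1); local build tags are ignored."""
    numbers = [int(n) for n in re.findall(r"\d+", version.split("+")[0])[:3]]
    # Pad short versions such as '1.6' to three components.
    return tuple(numbers + [0] * (3 - len(numbers)))


def meets_minimum(version: str, minimum: str) -> bool:
    """True if `version` is at least `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


def check_environment() -> dict:
    """Report whether Python and PyTorch satisfy the recommended minimums."""
    report = {"python": sys.version_info >= (3, 7), "torch": False}
    try:
        import torch  # imported only to read the installed version

        report["torch"] = meets_minimum(torch.__version__, "1.6.0")
    except ImportError:
        pass  # PyTorch is not installed
    return report


print(check_environment())
```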

How to Use

You can use this code snippet to identify the most likely language of a written document. You just have to say: ZABANSHENAS (detector) -> BESHNAS (detect) 😎.

Sounds interesting, doesn't it?

from zabanshenas.zabanshenas import Zabanshenas

zabanshenas = Zabanshenas()
text = "زیر لکه‌های زمان احساسات محو میشن تو یکی دیگه شدی و من هنوز اینجام"
# Beneath the stains of time, the feelings disappear, you are someone else, I'm still right here!

r = zabanshenas.detect(text, return_all_scores=False)
print(r)

Output:

[
  {
    "language": "Persian",
    "code": "fas",
    "score": 0.6105580925941467
  }
]
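The returned score can be used to decide whether a prediction is confident enough to act on. The sketch below works on the result shown above (a list holding one dictionary with `language`, `code`, and `score`), so it runs without loading the model. The helper `accept_language` and the `min_score` threshold of 0.5 are assumptions for illustration, not part of the Zabanshenas API.

```python
from typing import Dict, List, Optional

# Result copied from the output above (shape: a list with one dict).
result = [{"language": "Persian", "code": "fas", "score": 0.6105580925941467}]


def accept_language(result: List[Dict], min_score: float = 0.5) -> Optional[str]:
    """Return the language code if its score reaches min_score, else None."""
    if not result:
        return None
    best = result[0]
    return best["code"] if best["score"] >= min_score else None


print(accept_language(result))                 # default threshold
print(accept_language(result, min_score=0.9))  # stricter threshold
```

A suitable threshold depends on the text length and on how costly a wrong label is in your application, so treat the default here as a placeholder.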

Or you can find out all the candidates' scores using the following snippet.

from zabanshenas.zabanshenas import Zabanshenas

zabanshenas = Zabanshenas()
text = "زیر لکه‌های زمان احساسات محو میشن تو یکی دیگه شدی و من هنوز اینجام"
# Beneath the stains of time, the feelings disappear, you are someone else, I'm still right here!

r = zabanshenas.detect(text, return_all_scores=True)
print(r)

Output:

See all the 235 candidates
[
  [
    {
      "language": "Persian",
      "code": "fas",
      "score": 0.6105580925941467
    },
    {
      "language": "Gilaki",
      "code": "glk",
      "score": 0.29982829093933105
    },
    {
      "language": "Northern Luri",
      "code": "lrc",
      "score": 0.04840774089097977
    },
    {
      "language": "Mazanderani",
      "code": "mzn",
      "score": 0.030142733827233315
    },
    {
      "language": "South Azerbaijani",
      "code": "azb",
      "score": 0.005220199003815651
    },
    {
      "language": "Urdu",
      "code": "urd",
      "score": 0.0019745035097002983
    },
    {
      "language": "Pushto",
      "code": "pus",
      "score": 0.0015690263826400042
    },
    {
      "language": "Western Panjabi",
      "code": "pnb",
      "score": 0.0005721596535295248
    },
    {
      "language": "Central Kurdish",
      "code": "ckb",
      "score": 0.00025537016335874796
    },
    {
      "language": "Sindhi",
      "code": "snd",
      "score": 0.0001820324978325516
    },
    {
      "language": "Egyptian Arabic",
      "code": "arz",
      "score": 0.0001247940381290391
    },
    {
      "language": "Arabic",
      "code": "ara",
      "score": 7.754910620860755e-05
    },
    {
      "language": "Korean",
      "code": "kor",
      "score": 5.718228203477338e-05
    },
    {
      "language": "Fiji Hindi",
      "code": "hif",
      "score": 3.5903740354115143e-05
    },
    {
      "language": "Uighur",
      "code": "uig",
      "score": 3.5565532016335055e-05
    },
    {
      "language": "Maori",
      "code": "mri",
      "score": 2.1078320060041733e-05
    },
    {
      "language": "Literary Chinese",
      "code": "lzh",
      "score": 2.09943773370469e-05
    },
    {
      "language": "Navajo",
      "code": "nav",
      "score": 1.8877935872296803e-05
    },
    {
      "language": "Mongolian",
      "code": "mon",
      "score": 1.783044899639208e-05
    },
    {
      "language": "Basque",
      "code": "eus",
      "score": 1.2980432074982673e-05
    },
    {
      "language": "Moksha",
      "code": "mdf",
      "score": 1.2325609532126691e-05
    },
    {
      "language": "Tongan",
      "code": "ton",
      "score": 1.1610675755946431e-05
    },
    {
      "language": "Min Dong",
      "code": "cdo",
      "score": 1.1508132956805639e-05
    },
    {
      "language": "Sinhala",
      "code": "sin",
      "score": 1.0617596672091167e-05
    },
    {
      "language": "Venetian",
      "code": "vec",
      "score": 1.0375520105299074e-05
    },
    {
      "language": "Western Mari",
      "code": "mrj",
      "score": 1.0316403859178536e-05
    },
    {
      "language": "Malayalam",
      "code": "mal",
      "score": 1.0265099263051525e-05
    },
    {
      "language": "Interlingua",
      "code": "ina",
      "score": 1.0040446795755997e-05
    },
    {
      "language": "Tatar",
      "code": "tat",
      "score": 9.836200661084149e-06
    },
    {
      "language": "Cantonese",
      "code": "zh-yue",
      "score": 9.80662207439309e-06
    },
    {
      "language": "Wu Chinese",
      "code": "wuu",
      "score": 9.661145668360405e-06
    },
    {
      "language": "Igbo",
      "code": "ibo",
      "score": 9.207592484017368e-06
    },
    {
      "language": "Waray",
      "code": "war",
      "score": 8.970115231932141e-06
    },
    {
      "language": "Yiddish",
      "code": "yid",
      "score": 8.926748705562204e-06
    },
    {
      "language": "Udmurt",
      "code": "udm",
      "score": 8.702583727426827e-06
    },
    {
      "language": "Dhivehi",
      "code": "div",
      "score": 8.36203707876848e-06
    },
    {
      "language": "Newari",
      "code": "new",
      "score": 8.140945283230394e-06
    },
    {
      "language": "Karachay-Balkar",
      "code": "krc",
      "score": 8.123539373627864e-06
    },
    {
      "language": "Lojban",
      "code": "jbo",
      "score": 8.114019692584407e-06
    },
    {
      "language": "Sanskrit",
      "code": "san",
      "score": 8.087784408417065e-06
    },
    {
      "language": "Luganda",
      "code": "lug",
      "score": 8.023569534998387e-06
    },
    {
      "language": "Maithili",
      "code": "mai",
      "score": 7.723083399469033e-06
    },
    {
      "language": "Kirghiz",
      "code": "kir",
      "score": 7.715119863860309e-06
    },
    {
      "language": "Standard Chinese",
      "code": "zho",
      "score": 7.5126054071006365e-06
    },
    {
      "language": "Amharic",
      "code": "amh",
      "score": 7.451813871739432e-06
    },
    {
      "language": "Chechen",
      "code": "che",
      "score": 7.444541097356705e-06
    },
    {
      "language": "Gujarati",
      "code": "guj",
      "score": 7.395997727144277e-06
    },
    {
      "language": "Tibetan",
      "code": "bod",
      "score": 7.390805421891855e-06
    },
    {
      "language": "Komi",
      "code": "kom",
      "score": 7.373077551164897e-06
    },
    {
      "language": "Lao",
      "code": "lao",
      "score": 7.351867679972202e-06
    },
    {
      "language": "Wolof",
      "code": "wol",
      "score": 7.305452982109273e-06
    },
    {
      "language": "Silesian",
      "code": "szl",
      "score": 7.301976893359097e-06
    },
    {
      "language": "Northern Sotho",
      "code": "nso",
      "score": 7.2927336987049785e-06
    },
    {
      "language": "Armenian",
      "code": "hye",
      "score": 7.243447726068553e-06
    },
    {
      "language": "Arpitan",
      "code": "frp",
      "score": 7.137540251278551e-06
    },
    {
      "language": "Bishnupriya",
      "code": "bpy",
      "score": 7.062033091642661e-06
    },
    {
      "language": "Azerbaijani",
      "code": "aze",
      "score": 6.906778253323864e-06
    },
    {
      "language": "Tajik",
      "code": "tgk",
      "score": 6.730050699843559e-06
    },
    {
      "language": "Old English ",
      "code": "ang",
      "score": 6.6442084971640725e-06
    },
    {
      "language": "Marathi",
      "code": "mar",
      "score": 6.63194168737391e-06
    },
    {
      "language": "Kurdish",
      "code": "kur",
      "score": 6.615779057028703e-06
    },
    {
      "language": "Lithuanian",
      "code": "lit",
      "score": 6.561998816323467e-06
    },
    {
      "language": "Russian",
      "code": "rus",
      "score": 6.4370215113740414e-06
    },
    {
      "language": "Tulu",
      "code": "tcy",
      "score": 6.370255960064242e-06
    },
    {
      "language": "Extremaduran",
      "code": "ext",
      "score": 6.3398160818906035e-06
    },
    {
      "language": "Aymara",
      "code": "aym",
      "score": 6.288398708420573e-06
    },
    {
      "language": "Lower Sorbian",
      "code": "dsb",
      "score": 6.209619641595054e-06
    },
    {
      "language": "Classical Nahuatl",
      "code": "nci",
      "score": 5.954705557087436e-06
    },
    {
      "language": "Polish",
      "code": "pol",
      "score": 5.952156243438367e-06
    },
    {
      "language": "Cebuano",
      "code": "ceb",
      "score": 5.911888820264721e-06
    },
    {
      "language": "Hakka Chinese",
      "code": "hak",
      "score": 5.756284735980444e-06
    },
    {
      "language": "Georgian",
      "code": "kat",
      "score": 5.656391749653267e-06
    },
    {
      "language": "Mingrelian",
      "code": "xmf",
      "score": 5.57373004994588e-06
    },
    {
      "language": "Telugu",
      "code": "tel",
      "score": 5.5334053286060225e-06
    },
    {
      "language": "Doteli",
      "code": "dty",
      "score": 5.510717073775595e-06
    },
    {
      "language": "Portuguese",
      "code": "por",
      "score": 5.50901131646242e-06
    },
    {
      "language": "Komi-Permyak",
      "code": "koi",
      "score": 5.447328476293478e-06
    },
    {
      "language": "Eastern Mari",
      "code": "mhr",
      "score": 5.414771294454113e-06
    },
    {
      "language": "Lezghian",
      "code": "lez",
      "score": 5.2741329454875086e-06
    },
    {
      "language": "Nepali (macrolanguage)",
      "code": "nep",
      "score": 5.273408532957546e-06
    },
    {
      "language": "Samogitian",
      "code": "sgs",
      "score": 5.207636149862083e-06
    },
    {
      "language": "Bhojpuri",
      "code": "bho",
      "score": 5.19551804245566e-06
    },
    {
      "language": "Occitan",
      "code": "oci",
      "score": 5.172901182959322e-06
    },
    {
      "language": "Western Frisian",
      "code": "fry",
      "score": 5.066170615464216e-06
    },
    {
      "language": "Vlaams",
      "code": "vls",
      "score": 5.014707312511746e-06
    },
    {
      "language": "Japanese",
      "code": "jpn",
      "score": 4.986791282135528e-06
    },
    {
      "language": "V\u00f5ro",
      "code": "vro",
      "score": 4.9785726332629565e-06
    },
    {
      "language": "Rusyn",
      "code": "rue",
      "score": 4.937043286190601e-06
    },
    {
      "language": "Hindi",
      "code": "hin",
      "score": 4.9325194595439825e-06
    },
    {
      "language": "Sicilian",
      "code": "scn",
      "score": 4.8434171731059905e-06
    },
    {
      "language": "Somali",
      "code": "som",
      "score": 4.722482117358595e-06
    },
    {
      "language": "Galician",
      "code": "glg",
      "score": 4.664954758482054e-06
    },
    {
      "language": "Kazakh",
      "code": "kaz",
      "score": 4.485120825847844e-06
    },
    {
      "language": "Kannada",
      "code": "kan",
      "score": 4.438274572748924e-06
    },
    {
      "language": "Oromo",
      "code": "orm",
      "score": 4.422903202794259e-06
    },
    {
      "language": "Albanian",
      "code": "sqi",
      "score": 4.410150268085999e-06
    },
    {
      "language": "Minangkabau",
      "code": "min",
      "score": 4.407007509144023e-06
    },
    {
      "language": "Finnish",
      "code": "fin",
      "score": 4.374884611024754e-06
    },
    {
      "language": "Ossetian",
      "code": "oss",
      "score": 4.322507265897002e-06
    },
    {
      "language": "Volap\u00fck",
      "code": "vol",
      "score": 4.30220188718522e-06
    },
    {
      "language": "Min Nan Chinese",
      "code": "nan",
      "score": 4.2357942220405675e-06
    },
    {
      "language": "Bashkir",
      "code": "bak",
      "score": 4.212616204313235e-06
    },
    {
      "language": "Ligurian",
      "code": "lij",
      "score": 4.1821313061518595e-06
    },
    {
      "language": "Welsh",
      "code": "cym",
      "score": 4.174029982095817e-06
    },
    {
      "language": "Slovene",
      "code": "slv",
      "score": 4.172954504610971e-06
    },
    {
      "language": "Dimli",
      "code": "diq",
      "score": 4.078176516486565e-06
    },
    {
      "language": "Chuvash",
      "code": "chv",
      "score": 4.048466053063748e-06
    },
    {
      "language": "Panjabi",
      "code": "pan",
      "score": 3.940522674383828e-06
    },
    {
      "language": "Cornish",
      "code": "cor",
      "score": 3.940297119697789e-06
    },
    {
      "language": "West Low German",
      "code": "nds-nl",
      "score": 3.926987574232044e-06
    },
    {
      "language": "Cherokee",
      "code": "chr",
      "score": 3.9112833292165305e-06
    },
    {
      "language": "Ido",
      "code": "ido",
      "score": 3.892145286954474e-06
    },
    {
      "language": "Friulian",
      "code": "fur",
      "score": 3.869370175380027e-06
    },
    {
      "language": "Ukrainian",
      "code": "ukr",
      "score": 3.7814761526533403e-06
    },
    {
      "language": "Vietnamese",
      "code": "vie",
      "score": 3.7795757634739857e-06
    },
    {
      "language": "Emilian",
      "code": "egl",
      "score": 3.7286854421836324e-06
    },
    {
      "language": "Hungarian",
      "code": "hun",
      "score": 3.706084498844575e-06
    },
    {
      "language": "Haitian Creole",
      "code": "hat",
      "score": 3.6860656109638512e-06
    },
    {
      "language": "Jamaican Patois",
      "code": "jam",
      "score": 3.6750652725459076e-06
    },
    {
      "language": "Turkmen",
      "code": "tuk",
      "score": 3.6414037367649144e-06
    },
    {
      "language": "Gagauz",
      "code": "gag",
      "score": 3.6310443647380453e-06
    },
    {
      "language": "Yakut",
      "code": "sah",
      "score": 3.611620968513307e-06
    },
    {
      "language": "Breton",
      "code": "bre",
      "score": 3.5204120649723336e-06
    },
    {
      "language": "Afrikaans",
      "code": "afr",
      "score": 3.5164177916158224e-06
    },
    {
      "language": "Assamese",
      "code": "asm",
      "score": 3.5076063795713708e-06
    },
    {
      "language": "Crimean Tatar",
      "code": "crh",
      "score": 3.4974791560671292e-06
    },
    {
      "language": "Tswana",
      "code": "tsn",
      "score": 3.4639840578165604e-06
    },
    {
      "language": "Malagasy",
      "code": "mlg",
      "score": 3.4424308523739455e-06
    },
    {
      "language": "Tamil",
      "code": "tam",
      "score": 3.433554866205668e-06
    },
    {
      "language": "Belarusian (Taraschkewiza)",
      "code": "be-tarask",
      "score": 3.4065565159835387e-06
    },
    {
      "language": "Scottish Gaelic",
      "code": "gla",
      "score": 3.383374632903724e-06
    },
    {
      "language": "Latin",
      "code": "lat",
      "score": 3.299320724181598e-06
    },
    {
      "language": "Chavacano",
      "code": "cbk",
      "score": 3.277132236689795e-06
    },
    {
      "language": "Tarantino dialect",
      "code": "roa-tara",
      "score": 3.2704483601264656e-06
    },
    {
      "language": "Modern Greek",
      "code": "ell",
      "score": 3.2669522624928504e-06
    },
    {
      "language": "Ladino",
      "code": "lad",
      "score": 3.1890219815977616e-06
    },
    {
      "language": "Latgalian",
      "code": "ltg",
      "score": 3.1830948046263075e-06
    },
    {
      "language": "Pampanga",
      "code": "pam",
      "score": 3.1460281206818763e-06
    },
    {
      "language": "Tagalog",
      "code": "tgl",
      "score": 3.100457433902193e-06
    },
    {
      "language": "Hebrew",
      "code": "heb",
      "score": 3.0715009415871464e-06
    },
    {
      "language": "Serbo-Croatian",
      "code": "hbs",
      "score": 3.050950908800587e-06
    },
    {
      "language": "Achinese",
      "code": "ace",
      "score": 3.0138855890982086e-06
    },
    {
      "language": "Italian",
      "code": "ita",
      "score": 3.003329993589432e-06
    },
    {
      "language": "English",
      "code": "eng",
      "score": 2.97778979074792e-06
    },
    {
      "language": "Burmese",
      "code": "mya",
      "score": 2.9546490623033606e-06
    },
    {
      "language": "Spanish",
      "code": "spa",
      "score": 2.9272057417983888e-06
    },
    {
      "language": "Papiamento",
      "code": "pap",
      "score": 2.8780641514458694e-06
    },
    {
      "language": "Sardinian",
      "code": "srd",
      "score": 2.866505383281037e-06
    },
    {
      "language": "Esperanto",
      "code": "epo",
      "score": 2.848199301297427e-06
    },
    {
      "language": "Serbian",
      "code": "srp",
      "score": 2.7479175059852423e-06
    },
    {
      "language": "Zeeuws",
      "code": "zea",
      "score": 2.7430314730736427e-06
    },
    {
      "language": "Czech",
      "code": "ces",
      "score": 2.7409500944486354e-06
    },
    {
      "language": "Bengali",
      "code": "ben",
      "score": 2.6958239232044434e-06
    },
    {
      "language": "Erzya",
      "code": "myv",
      "score": 2.6273187359038275e-06
    },
    {
      "language": "Croatian",
      "code": "hrv",
      "score": 2.6178654479735997e-06
    },
    {
      "language": "Buryat",
      "code": "bxr",
      "score": 2.60430465459649e-06
    },
    {
      "language": "Swahili (macrolanguage)",
      "code": "swa",
      "score": 2.6016373340098653e-06
    },
    {
      "language": "Pangasinan",
      "code": "pag",
      "score": 2.60037768384791e-06
    },
    {
      "language": "Xhosa",
      "code": "xho",
      "score": 2.580123918960453e-06
    },
    {
      "language": "Bosnian",
      "code": "bos",
      "score": 2.5763115445442963e-06
    },
    {
      "language": "Low German",
      "code": "nds",
      "score": 2.5743340756889665e-06
    },
    {
      "language": "Kinyarwanda",
      "code": "kin",
      "score": 2.568235458966228e-06
    },
    {
      "language": "Aromanian",
      "code": "rup",
      "score": 2.520287125662435e-06
    },
    {
      "language": "Aragonese",
      "code": "arg",
      "score": 2.4836215288814856e-06
    },
    {
      "language": "Tetum",
      "code": "tet",
      "score": 2.396502168267034e-06
    },
    {
      "language": "Quechua",
      "code": "que",
      "score": 2.3799134396540467e-06
    },
    {
      "language": "Livvi-Karelian",
      "code": "olo",
      "score": 2.3709426386631094e-06
    },
    {
      "language": "Kashubian",
      "code": "csb",
      "score": 2.358733354412834e-06
    },
    {
      "language": "Avar",
      "code": "ava",
      "score": 2.330698407604359e-06
    },
    {
      "language": "Hausa",
      "code": "hau",
      "score": 2.286114295202424e-06
    },
    {
      "language": "Ripuarisch",
      "code": "ksh",
      "score": 2.254129412904149e-06
    },
    {
      "language": "Bulgarian",
      "code": "bul",
      "score": 2.2492179141408997e-06
    },
    {
      "language": "Oriya",
      "code": "ori",
      "score": 2.1661755909008207e-06
    },
    {
      "language": "Interlingue",
      "code": "ile",
      "score": 2.059975486190524e-06
    },
    {
      "language": "Guarani",
      "code": "grn",
      "score": 2.024690957114217e-06
    },
    {
      "language": "Banjar",
      "code": "bjn",
      "score": 2.0237362150510307e-06
    },
    {
      "language": "Thai",
      "code": "tha",
      "score": 2.01868806470884e-06
    },
    {
      "language": "Dutch",
      "code": "nld",
      "score": 1.9297158360132016e-06
    },
    {
      "language": "Kabyle",
      "code": "kab",
      "score": 1.9132662600895856e-06
    },
    {
      "language": "Palatine German",
      "code": "pfl",
      "score": 1.9122355752188014e-06
    },
    {
      "language": "Javanese",
      "code": "jav",
      "score": 1.8900879013017402e-06
    },
    {
      "language": "Banyumasan",
      "code": "map-bms",
      "score": 1.8552185565567925e-06
    },
    {
      "language": "Faroese",
      "code": "fao",
      "score": 1.8414674514133367e-06
    },
    {
      "language": "Scots",
      "code": "sco",
      "score": 1.818199393710529e-06
    },
    {
      "language": "Central Khmer",
      "code": "khm",
      "score": 1.7993022538576042e-06
    },
    {
      "language": "Slovak",
      "code": "slk",
      "score": 1.7988603531193803e-06
    },
    {
      "language": "Belarusian",
      "code": "bel",
      "score": 1.782583581189101e-06
    },
    {
      "language": "Swedish",
      "code": "swe",
      "score": 1.7702136574371252e-06
    },
    {
      "language": "Saterfriesisch",
      "code": "stq",
      "score": 1.7663436437942437e-06
    },
    {
      "language": "Latvian",
      "code": "lav",
      "score": 1.7178032294395962e-06
    },
    {
      "language": "Konkani",
      "code": "kok",
      "score": 1.690383783170546e-06
    },
    {
      "language": "Tuvan",
      "code": "tyv",
      "score": 1.672853159107035e-06
    },
    {
      "language": "Walloon",
      "code": "wln",
      "score": 1.6722132158975e-06
    },
    {
      "language": "Sranan",
      "code": "srn",
      "score": 1.646132773203135e-06
    },
    {
      "language": "Picard",
      "code": "pcd",
      "score": 1.6385885146519286e-06
    },
    {
      "language": "Limburgan",
      "code": "lim",
      "score": 1.6372666777897393e-06
    },
    {
      "language": "French",
      "code": "fra",
      "score": 1.6239549722740776e-06
    },
    {
      "language": "Icelandic",
      "code": "isl",
      "score": 1.5904075780781568e-06
    },
    {
      "language": "Irish",
      "code": "gle",
      "score": 1.5750525790281245e-06
    },
    {
      "language": "Corsican",
      "code": "cos",
      "score": 1.570832523611898e-06
    },
    {
      "language": "Alemannic German",
      "code": "als",
      "score": 1.5651218063794659e-06
    },
    {
      "language": "German",
      "code": "deu",
      "score": 1.5594737305946182e-06
    },
    {
      "language": "Upper Sorbian",
      "code": "hsb",
      "score": 1.5125158370210556e-06
    },
    {
      "language": "Romanian",
      "code": "ron",
      "score": 1.5119784393391456e-06
    },
    {
      "language": "Manx",
      "code": "glv",
      "score": 1.5035052456369158e-06
    },
    {
      "language": "Lingala",
      "code": "lin",
      "score": 1.493238073635439e-06
    },
    {
      "language": "Malay",
      "code": "msa",
      "score": 1.4067626352698426e-06
    },
    {
      "language": "Maltese",
      "code": "mlt",
      "score": 1.370485165352875e-06
    },
    {
      "language": "Luxembourgish",
      "code": "ltz",
      "score": 1.3397349221122568e-06
    },
    {
      "language": "Estonian",
      "code": "est",
      "score": 1.3280839539220324e-06
    },
    {
      "language": "Kabardian",
      "code": "kbd",
      "score": 1.3062604011793155e-06
    },
    {
      "language": "Macedonian",
      "code": "mkd",
      "score": 1.2802570381609257e-06
    },
    {
      "language": "Pennsylvania German",
      "code": "pdc",
      "score": 1.2550040082714986e-06
    },
    {
      "language": "Sundanese",
      "code": "sun",
      "score": 1.1068191270169336e-06
    },
    {
      "language": "Iloko",
      "code": "ilo",
      "score": 1.0791690101541462e-06
    },
    {
      "language": "Karakalpak",
      "code": "kaa",
      "score": 1.0603262126096524e-06
    },
    {
      "language": "Norwegian Nynorsk",
      "code": "nno",
      "score": 1.0554679192864569e-06
    },
    {
      "language": "Yoruba",
      "code": "yor",
      "score": 1.046297711582156e-06
    },
    {
      "language": "Neapolitan",
      "code": "nap",
      "score": 1.0279602520313347e-06
    },
    {
      "language": "Danish",
      "code": "dan",
      "score": 1.0038916116172913e-06
    },
    {
      "language": "Indonesian",
      "code": "ind",
      "score": 9.83746303973021e-07
    },
    {
      "language": "Mirandese",
      "code": "mwl",
      "score": 8.806521236692788e-07
    },
    {
      "language": "Catalan",
      "code": "cat",
      "score": 8.687447348165733e-07
    },
    {
      "language": "Turkish",
      "code": "tur",
      "score": 8.384120064874878e-07
    },
    {
      "language": "Veps",
      "code": "vep",
      "score": 7.812500371073838e-07
    },
    {
      "language": "Bokm\u00e5l",
      "code": "nob",
      "score": 7.427178161378833e-07
    },
    {
      "language": "Shona",
      "code": "sna",
      "score": 6.660703775196453e-07
    },
    {
      "language": "Bavarian",
      "code": "bar",
      "score": 6.222485353646334e-07
    },
    {
      "language": "Uzbek",
      "code": "uzb",
      "score": 6.021850822435226e-07
    },
    {
      "language": "Central Bikol",
      "code": "bcl",
      "score": 5.77034370508045e-07
    },
    {
      "language": "Asturian",
      "code": "ast",
      "score": 5.743918336520437e-07
    },
    {
      "language": "Lombard",
      "code": "lmo",
      "score": 4.6301857992148143e-07
    },
    {
      "language": "Romansh",
      "code": "roh",
      "score": 4.5534079617937095e-07
    },
    {
      "language": "Narom",
      "code": "nrm",
      "score": 3.6611126574825903e-07
    },
    {
      "language": "Northern Sami",
      "code": "sme",
      "score": 1.0723972820869676e-07
    }
  ]
]
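With `return_all_scores=True`, the output above is a list that wraps one list of candidates. A common next step is to keep only the top few candidates and to look at the gap between the best two, since a small gap (as between Persian and Gilaki here) suggests closely related languages. The sketch below uses the first five candidates copied from the output; `top_k` and `margin` are hypothetical helper names, not part of Zabanshenas.

```python
from typing import Dict, List

# First five candidates copied from the output above; the full list has one
# entry per supported language, nested inside an outer list.
all_scores = [[
    {"language": "Persian", "code": "fas", "score": 0.6105580925941467},
    {"language": "Gilaki", "code": "glk", "score": 0.29982829093933105},
    {"language": "Northern Luri", "code": "lrc", "score": 0.04840774089097977},
    {"language": "Mazanderani", "code": "mzn", "score": 0.030142733827233315},
    {"language": "South Azerbaijani", "code": "azb", "score": 0.005220199003815651},
]]


def top_k(candidates: List[Dict], k: int = 3) -> List[Dict]:
    """Return the k highest-scoring candidates, best first."""
    return sorted(candidates, key=lambda c: c["score"], reverse=True)[:k]


def margin(candidates: List[Dict]) -> float:
    """Score gap between the best and second-best candidate (needs >= 2)."""
    best, runner_up = top_k(candidates, k=2)
    return best["score"] - runner_up["score"]


candidates = all_scores[0]  # unwrap the outer list
for candidate in top_k(candidates):
    print(f'{candidate["code"]}: {candidate["score"]:.4f}')
print(f"margin: {margin(candidates):.4f}")
```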

Citation

Please cite this repository in publications as follows:

@misc{ZabanShenas,
  author       = {Mehrdad Farahani},
  title        = {Zabanshenas is a solution for identifying the most likely language of a piece of written text},
  month        = feb,
  year         = 2021,
  publisher    = {Zenodo},
  version      = {v2.0.1},
  doi          = {10.5281/zenodo.5029022},
  url          = {https://doi.org/10.5281/zenodo.5029022}
}

License

Apache License 2.0