confusable_homoglyphs
"A homoglyph is one of two or more graphemes, characters, or glyphs with shapes that appear identical or very similar." (Wikipedia: Homoglyph)
Unicode homoglyphs can be a nuisance on the web. Your most popular client, AlaskaJazz, might be upset to be impersonated by a trickster who deliberately chose the username ΑlaskaJazz.
AlaskaJazz is single script: only Latin characters.
ΑlaskaJazz is mixed-script: the first character is a Greek letter.
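The single-script versus mixed-script distinction can be sketched with the standard library alone. This is only an approximation: it guesses a character's script from the first word of its Unicode name, whereas the library relies on data files from the Unicode Consortium. The function names `script_aliases` and `is_mixed_script` below are illustrative and are not claimed to match the library's API.

```python
import unicodedata


def script_aliases(text):
    """Return the approximate set of scripts used by the letters in text.

    Assumption: the first word of a letter's Unicode name (for example
    "LATIN" or "GREEK") is a good enough stand-in for its script.
    """
    aliases = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits and punctuation are shared between scripts
        name = unicodedata.name(ch, "")
        if name:
            aliases.add(name.split()[0])
    return aliases


def is_mixed_script(text):
    """True when the letters of text come from more than one script."""
    return len(script_aliases(text)) > 1


genuine = "AlaskaJazz"
impostor = "\u0391laskaJazz"  # starts with GREEK CAPITAL LETTER ALPHA

print(script_aliases(genuine))    # {'LATIN'}
print(is_mixed_script(genuine))   # False
print(is_mixed_script(impostor))  # True
```

The two usernames render almost identically, but only the second one mixes scripts.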
You might also want to avoid people being tricked into entering their
password on www.microsоft.com or www.faϲebook.com instead of
www.microsoft.com or www.facebook.com. Here is a
utility to play
with these confusable homoglyphs.
Not all mixed-script strings have to be ruled out, though. You could exclude only those mixed-script strings that contain characters which might be confused with a character from the Unicode blocks of your choosing.
Allo and ρττ are fine: single script.
AlloΓ is fine when our preferred script alias is 'latin': mixed script, but Γ is not confusable.
Alloρ is dangerous: mixed script and ρ could be confused with p.
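The rule illustrated by these examples (a string is dangerous only if it is mixed-script and contains a character confusable with the preferred script) might be sketched as follows. The table `SAMPLE_CONFUSABLES` is a tiny hand-picked sample for illustration, not the real data set, and the script of a character is again approximated from its Unicode name. The helper names are hypothetical.

```python
import unicodedata

# Hand-picked sample for illustration only; the real table is built from
# the Unicode Consortium's data and is far larger.
SAMPLE_CONFUSABLES = {
    "\u03c1": "p",  # GREEK SMALL LETTER RHO looks like Latin p
    "\u043e": "o",  # CYRILLIC SMALL LETTER O looks like Latin o
    "\u03f2": "c",  # GREEK LUNATE SIGMA SYMBOL looks like Latin c
}


def alias(ch):
    """Approximate script alias: first word of the Unicode name, lowercased."""
    return unicodedata.name(ch, "UNKNOWN").split()[0].lower()


def is_dangerous(text, preferred_alias="latin"):
    """True if text is mixed-script AND has a character confusable with
    a character of the preferred script."""
    letters = [ch for ch in text if ch.isalpha()]
    mixed = len({alias(ch) for ch in letters}) > 1
    confusable = any(
        alias(ch) != preferred_alias
        and ch in SAMPLE_CONFUSABLES
        and alias(SAMPLE_CONFUSABLES[ch]) == preferred_alias
        for ch in letters
    )
    return mixed and confusable


print(is_dangerous("Allo"))              # False: single script
print(is_dangerous("\u03c1\u03c4\u03c4"))  # False: single script (Greek)
print(is_dangerous("Allo\u0393"))        # False: mixed, but gamma is not confusable
print(is_dangerous("Allo\u03c1"))        # True: mixed, and rho resembles p
```

Requiring both conditions keeps legitimate single-script names in any language acceptable while still catching look-alike substitutions.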
This library is compatible with Python 2 and Python 3.
Yep.
The Unicode block aliases and names for each character are extracted from this file provided by the Unicode Consortium.
The matrix of which characters can be confused with which other characters is built using this file provided by the Unicode Consortium.
This data is stored in two JSON files: categories.json and
confusables.json. If you delete them, they will both be recreated by
downloading and parsing the two abovementioned files and stored as JSON
files again.
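The "load the JSON file, or rebuild it when it is missing" behaviour can be sketched like this. It is a simplified illustration rather than the library's own code: `load_or_rebuild` and `parse_confusable_lines` are hypothetical names, the resulting JSON layout is deliberately simpler than whatever the library stores, and the download step is replaced by an in-memory sample line in the `source ; target ; type # comment` layout used by the Unicode confusables data.

```python
import json
import os
import tempfile


def parse_confusable_lines(lines):
    """Parse 'source ; target ; type # comment' lines into {source: [targets]}."""
    table = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        # Each field holds one or more space-separated hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table.setdefault(source, []).append(target)
    return table


def load_or_rebuild(path, rebuild):
    """Load cached JSON from path; if the file is missing, rebuild and store it."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    data = rebuild()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(data, fh)
    return data


# Stand-in for the downloaded file: one comment line and one data line.
SAMPLE_LINES = [
    "# sample excerpt",
    "03C1 ;\t0070 ;\tMA\t# GREEK SMALL LETTER RHO -> LATIN SMALL LETTER P",
]

cache_path = os.path.join(tempfile.mkdtemp(), "confusables.json")
table = load_or_rebuild(cache_path, lambda: parse_confusable_lines(SAMPLE_LINES))
print(table == {"\u03c1": ["p"]})  # True
```

Deleting the cached file simply causes the next call to run the rebuild step again.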