Spanish 150 word list and a suggestion #5
Comments
Hello! Thanks for the suggestion! During the first stages of this project, support for non-ASCII characters was really limited, and I wrote anything that caused an exception into an error file. I re-added those entries during the second phase, and I'm sure it'll greatly impact the results for regional passwords. It was also one of the main reasons why I released only very limited lists (150 lines) for the regional ones. It will be fixed in the next release of regional passwords. Meanwhile, please enjoy this small list of passwords (ranked from most common to least common) containing "Ñ".
About Colombia: I refrained from adding country codes for countries that have multiple languages, so I wouldn't taint the language lists. I skipped Colombia because its language was listed as "Spanish, Castilian" and I thought those were two different languages. I'll re-add Colombia to the Spanish list. Your feedback and suggestions are very much appreciated. I hope the next release will be a big change. Cheers! |
Great, and just as a trivia issue
I am fluent in Spanish, as you may presume, so if you need help let me know by emailing me at [email protected] Javier |
Thanks for the trivia! About In another issue, I've given some examples of services I want to make available to everyone:
I've kind of hit a brick wall with the language one, because parts of it are fundamentally difficult.
As a result, I've been considering implementing a queue system. Each user puts their request in a queue. Requests are processed one by one, and the results are emailed back or made available to the user after login. A second queue will keep track of the query results: each new result is inserted at the front, and the oldest result at the back is popped. If a new query is requested and its result is already in the queue, that result is moved to the front without inserting a duplicate, so frequent requests won't be recomputed. I'll have to implement this first to see whether it has enough impact to be viable. But at the moment, it looks like I have neither the time nor the funds to build this yet :( If that's okay with you, I'll contact you once collection 2-5 is processed and I'm ready to update the regional lists with larger versions. I'll keep this issue open until then. |
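The two queues described above could be sketched roughly as follows. This is only an illustration under assumptions, not the project's actual code: `ResultCache`, `process_requests`, and `run_query` are hypothetical names, and the second queue is modeled as a least-recently-used cache built on Python's `OrderedDict`.

```python
from collections import OrderedDict, deque


class ResultCache:
    """Fixed-size store of query results; the least recently used entry is dropped first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, query):
        # Returns None on a miss (this sketch assumes real results are never None).
        if query not in self._items:
            return None
        self._items.move_to_end(query)  # a repeated query moves to the "front"
        return self._items[query]

    def put(self, query, result):
        self._items[query] = result
        self._items.move_to_end(query)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # pop the entry at the "back"


def process_requests(requests, run_query, cache):
    """Process (user, query) pairs one by one, reusing cached results when possible."""
    pending = deque(requests)
    results = []
    while pending:
        user, query = pending.popleft()
        result = cache.get(query)
        if result is None:
            result = run_query(query)  # hypothetical expensive lookup
            cache.put(query, result)
        results.append((user, result))
    return results
```

Whether this saves enough work depends on how often the same queries repeat, which is the open question raised in the comment above.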
3 ideas
BUT many Italian sites are not .it but .com, so how can you relate language and passwords? You may assume in general terms that you go to sites in the language you speak (unless you're multilingual like me and go to sites in both English and Spanish). If so, you cannot just consider the domain of the site, because we already saw that there are sites in Italian that are not .it but .com. This means there will be more sites with lang="it" in their source code than sites with both lang="it" and a .it domain.
The way I would suggest is to look inside the source code of the site, where the first line should declare what language the site is in. I am sure that if you go to an Italian site, the source code will probably say its language is Italian (lang="it"), lang="fr" for French sites, and so on. So the relation would be the "lang" parameter found in the source code of the website where you found the password.
Some lang parameters have two parts, because you have English but as spoken in the US, the UK, or Australia (en-US, en-GB, en-AU, etc.), or Spanish as spoken in Spain, Colombia, Argentina, or Mexico (es-ES, es-CO, es-AR, es-MX). So the second part of the lang value would give you the most information on the country the password is related to. Of course, we assume that the site was designed properly. You may also find a country parameter, but I don't think it is a must when designing a site.
|
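As a rough illustration of the two-part lang values mentioned above, the language and country parts can be separated with a few lines of Python. `split_lang_tag` is a hypothetical helper, and this is a simplified sketch rather than a full language-tag parser: it only looks for a two-letter region part.

```python
def split_lang_tag(tag):
    """Split a value such as 'es-CO' into ('es', 'CO'); region is None when absent."""
    # Some sites write 'es_CO' instead of 'es-CO', so normalize the separator first.
    parts = tag.strip().replace("_", "-").split("-")
    language = parts[0].lower()
    region = None
    for part in parts[1:]:
        # Treat the first two-letter part as the country code.
        if len(part) == 2 and part.isalpha():
            region = part.upper()
            break
    return language, region


print(split_lang_tag("es-CO"))
print(split_lang_tag("en"))
```

A value with only one part, such as lang="es", gives the language but no country, which is the limitation discussed in this thread.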
So I guess I was a bit ambiguous there. For now, I've only used the top-level domain of the users' email addresses to filter the languages. So for example, I'm betting my money on no French users using
It's a good idea, but there are a lot of multilingual websites with options to change the language, and that's going to taint the dataset.
It's a good idea that I've been considering, with small differences. Some alphabetical characters can hint at which language a user speaks. For example, a password containing "Ñ" can hint at Spanish. But I don't think I'll take this approach. It would create a bias in which kinds of passwords make it into the dataset and which don't. If I start filtering on the passwords themselves and not on the leak source, email, and other metadata, it might create a problem.
Yup, I've taken note of it, and I'm going to merge Castilian and Spanish together in the next big release. I was a bit hasty reading your comment and replying, so please do not hesitate to tell me if I missed or misunderstood something. Cheers! |
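The filtering by email top-level domain described in this reply might look something like the sketch below. The `TLD_LANGUAGE` table and the function names are assumptions made for illustration; the real project's country list and its handling of multi-language countries are not shown in this thread.

```python
# Hypothetical mapping from country-code TLD to language list (illustrative only).
TLD_LANGUAGE = {
    "fr": "French",
    "it": "Italian",
    "es": "Spanish",
    "co": "Spanish",  # Colombia, as requested in this issue
}


def email_tld(address):
    """Return the lowercased last label of the email's domain, or None if malformed."""
    if "@" not in address:
        return None
    domain = address.rsplit("@", 1)[1]
    if "." not in domain:
        return None
    return domain.rsplit(".", 1)[1].lower()


def language_for_email(address):
    """Return the language list an address would be filed under, or None if unknown."""
    return TLD_LANGUAGE.get(email_tld(address))


print(language_for_email("someone@example.co"))
print(language_for_email("someone@example.com"))
```

Addresses on generic domains such as .com fall through with no language, which is the gap the suggestions in this thread try to address.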
Not at all, it's nice of you to listen to opinions. Regarding what you mentioned:
That's why I mentioned you could filter on the TLD of the leaked website and, by scraping its source code, find the language and probably the country of the website the email account comes from.
Example: if you had a leak from the website http://hogarmania.com, how would you know what language it was in? Looking in the source code, in the first line you will find <html xmlns="https://www.w3.org/1999/xhtml" lang="es">, so in this case at least you know it's in Spanish.
Or maybe the language is in a meta tag, although not every site has a meta tag with the language. Look at this one: http://webawards.com.au. Because the website ends in a two-letter domain, au, you'll know right away that it belongs to Australia, BUT look at their language meta tag, it is .......... See, it's English from Australia, "en-AU". You are interested in the country, so it's here in the meta tag and not in the first line where it just said or sometimes both or just one place.
So if the email was a Gmail account and the website was just webawards.com, unless you looked inside the code you would not guess what country the email account was from; this way you can |
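A minimal sketch of this lookup, using only Python's standard `html.parser`, is shown below. It assumes the page source has already been downloaded, and it assumes the meta tag uses `http-equiv="content-language"` or `name="language"`; real sites vary, so the tag names checked here are a guess. `LangFinder` and `find_page_language` are hypothetical names.

```python
from html.parser import HTMLParser


class LangFinder(HTMLParser):
    """Collects the lang attribute of <html> and a language <meta> tag, if present."""

    def __init__(self):
        super().__init__()
        self.html_lang = None
        self.meta_lang = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and self.html_lang is None:
            self.html_lang = attrs.get("lang")
        elif tag == "meta" and self.meta_lang is None:
            # Assumed meta forms; other sites may declare the language differently.
            key = (attrs.get("http-equiv") or attrs.get("name") or "").lower()
            if key in ("content-language", "language"):
                self.meta_lang = attrs.get("content")


def find_page_language(html_text):
    """Return the declared language, preferring a value that includes a country part."""
    finder = LangFinder()
    finder.feed(html_text)
    candidates = [value for value in (finder.html_lang, finder.meta_lang) if value]
    if not candidates:
        return None
    # A value like 'en-AU' carries more information than plain 'en'.
    return max(candidates, key=lambda value: "-" in value)


page = '<html xmlns="https://www.w3.org/1999/xhtml" lang="es"><head></head></html>'
print(find_page_language(page))
```

As noted earlier in the thread, multilingual sites can switch language per visitor, so a value found this way is a hint rather than proof of the user's country.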
The Spanish words seem OK. You may run into some issues if the letter "Ñ" is used; it's an n with a little ~ on top of it.
If you can, please consider the country of Colombia; its domain is ".co".
Javier