source of language corpus #3

Open
DonaldTsang opened this issue Nov 24, 2019 · 2 comments

Comments

@DonaldTsang

Where is the source text dataset for the n-grams of those 73 languages? I would like to see whether it differs from the UDHR corpus used in wooorm/franc#78, and whether it gives more accurate results.

@danielantelo
Contributor

It is in data/resources, which contains thousands of tweets scraped using the script provided in the bin folder.

You could feed the datasets from franc to our scripts and see what they output. In our final implementation we feed them anonymised WhatsApp messages, as we wanted to detect SMS-style text, but tweets worked well and are what we provide in the library.
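
For anyone who wants to try that kind of comparison outside the library, here is a minimal sketch of the general character-trigram profiling technique (rank-based, in the style of Cavnar and Trenkle). It is not this library's script and not franc's code; the function names, the `top_n` cutoff, and the toy corpora are all assumptions made for illustration.

```python
from collections import Counter


def trigram_profile(texts, top_n=300):
    """Rank the most frequent character trigrams found in a list of texts."""
    counts = Counter()
    for text in texts:
        padded = f" {text.lower().strip()} "
        counts.update(padded[i:i + 3] for i in range(len(padded) - 2))
    # Map each trigram to its rank (0 = most frequent).
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_n))}


def out_of_place_distance(sample_profile, language_profile, max_penalty=300):
    """Sum of rank differences; trigrams unknown to the language cost max_penalty."""
    return sum(
        abs(rank - language_profile[gram]) if gram in language_profile else max_penalty
        for gram, rank in sample_profile.items()
    )


def detect(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    sample = trigram_profile([text])
    return min(profiles, key=lambda lang: out_of_place_distance(sample, profiles[lang]))


# Hypothetical toy corpora; a real comparison would load tweets or UDHR text instead.
corpora = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "where is the source of the data"],
    "es": ["el rápido zorro marrón salta sobre el perro perezoso",
           "dónde está la fuente de los datos"],
}
profiles = {lang: trigram_profile(texts) for lang, texts in corpora.items()}

print(detect("the dog is over there", profiles))
print(detect("el perro está sobre la fuente", profiles))
```

Building `profiles` once from tweets and once from UDHR text, then running both against the same held-out messages, would be one way to see which corpus suits short, informal text better.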

@DonaldTsang
Author

franc cites http://unicode.org/udhr/ as the basis for their system.

@DonaldTsang changed the title from "source of language datasets" to "source of language corpus" on Nov 25, 2019