source of language corpus #3

DonaldTsang · 2019-11-24T09:33:49Z

Where is the source text dataset for the n-grams of those 73 languages? I would like to see whether it differs from the UDHR corpus used in wooorm/franc#78, and whether it is more accurate.

danielantelo · 2019-11-25T15:01:37Z

It is in data/resources, which contains thousands of tweets scraped using the script provided in the bin folder.

You could feed the datasets from franc to our scripts and see what they output. In our final implementation we feed the scripts anonymised WhatsApp messages, because we wanted to detect SMS-style text, but tweets worked well and are what we provide in the library.
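For anyone wanting to try that comparison outside this repository, here is a minimal sketch of one way to build character trigram profiles from two corpora and measure how much they overlap. It is not this library's script: `trigram_profile`, `profile_overlap`, the `top=300` cutoff, and the sample sentences are all hypothetical names and values chosen for illustration.

```python
from collections import Counter


def trigram_profile(lines, top=300):
    """Return the `top` most frequent character trigrams, most frequent first."""
    counts = Counter()
    for line in lines:
        # Lowercase, collapse whitespace, and pad so word edges form trigrams.
        text = " " + " ".join(line.lower().split()) + " "
        counts.update(text[i:i + 3] for i in range(len(text) - 2))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]


def profile_overlap(a, b):
    """Jaccard overlap between two trigram profiles (0.0 to 1.0)."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    # Stand-in samples; replace with lines read from the real corpora.
    tweets = ["omg that was so good lol", "cant wait for the weekend"]
    udhr = ["All human beings are born free and equal in dignity and rights."]
    print(profile_overlap(trigram_profile(tweets), trigram_profile(udhr)))
```

A low overlap for the same language would suggest the two corpora really do produce different profiles, which is the case where training on chat-style text should matter most.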

DonaldTsang · 2019-11-25T15:04:36Z

franc cites http://unicode.org/udhr/ as the base for its system.

DonaldTsang changed the title ~~source of language datasets~~ source of language corpus Nov 25, 2019
