saganoren/ukr-twi-corpus

ukr-twi-corpus

A corpus of Ukrainian Twitter texts, plus instructions for downloading and filtering them.

The repository contains four files:

  • corpus.tar.xz - ready-to-use corpus of 1,854,993 Ukrainian Twitter texts in .csv format.
  • Corpus-Downloading.ipynb - Jupyter Notebook with instructions for downloading the texts.
  • Corpus-Filtering.ipynb - Jupyter Notebook with instructions for filtering the texts.
  • twitter_scraper.py - Python script for downloading (a modified version of Kenneth Reitz's scraper - https://github.com/kennethreitz/twitter-scraper).
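
To look inside corpus.tar.xz without unpacking it to disk, something like the following sketch may help. It uses only the Python standard library. The function names `iter_corpus_rows` and `preview_corpus` are hypothetical, and the code assumes the archive holds one or more UTF-8 encoded .csv files; the column layout is not documented here, so rows are returned as plain lists of strings.

```python
import csv
import io
import itertools
import tarfile


def iter_corpus_rows(archive_path="corpus.tar.xz"):
    """Yield (member_name, row) for each row of each .csv file in the archive."""
    # "r:xz" opens an xz-compressed tar archive for reading.
    with tarfile.open(archive_path, mode="r:xz") as archive:
        for member in archive.getmembers():
            # Skip directories and anything that is not a .csv file.
            if not (member.isfile() and member.name.endswith(".csv")):
                continue
            raw = archive.extractfile(member)
            # Assumption: the CSV is UTF-8 encoded. newline="" lets the
            # csv module handle line breaks inside quoted fields itself.
            text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
            for row in csv.reader(text):
                yield member.name, row


def preview_corpus(archive_path="corpus.tar.xz", limit=5):
    """Return the first `limit` (member_name, row) pairs as a list."""
    return list(itertools.islice(iter_corpus_rows(archive_path), limit))
```

Calling `preview_corpus("corpus.tar.xz")` from the repository directory should return the first few rows, which is a quick way to check the actual column layout before loading the full corpus.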

Reference:

Bobrovnyk K. (2019). Automated Building and Analysis of Ukrainian Twitter Corpus for Toxic Text Detection. In Proc. of the 3rd International Conference COLINS 2019, Kharkiv, Ukraine, 2019.