versusvoid/RuLanAdCor

RuLanAdCor

Russian Language Advertisement Corpus

Note: Yandex has closed its catalog to automatic requests, so this part won't work now. I'll try to find another source of samples.

The corpus is automatically extracted from, uh, the Internets, from two sources for now: sites in the Yandex Catalog and the Apple App Store. All samples are publicly available, so no problems should arise when using the corpus for research purposes, but investigate the license question a bit more if you are a rotten capitalist.

The final corpus can be obtained by running generate-corpus.sh. Alternatively, you can run the individual scripts:

  • crawl.py - collects URLs from the Yandex Catalog.
  • process-appstore.py - extracts samples from the Apple App Store.
  • process-yaca.py - extracts samples from a list of URLs. If a file is supplied as an argument, it takes URLs from that file; otherwise it crawls URLs from the Yandex Catalog (YaCa).
  • join.py - joins the text files generated by the process-* scripts into a single XML corpus.
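
To give a rough idea of what the join step does, here is a minimal sketch in Python. It is not the actual join.py: the one-sample-per-line input layout, the `corpus` and `sample` element names, the `source` attribute, and the `build_corpus` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_corpus(text_files):
    """Build an XML tree with one <sample> element per non-empty input line.

    Assumption: each text file holds one advertisement sample per line.
    The element and attribute names here are hypothetical.
    """
    root = ET.Element("corpus")
    for path in text_files:
        path = Path(path)
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if not text:
                    continue  # skip blank lines
                # Record which file the sample came from (file name without extension).
                sample = ET.SubElement(root, "sample", source=path.stem)
                sample.text = text
    return ET.ElementTree(root)
```

With hypothetical file names, `build_corpus(["appstore.txt", "yaca.txt"]).write("corpus.xml", encoding="utf-8", xml_declaration=True)` would write the joined corpus to a single file.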

You'll need PhantomJS and Selenium to process YaCa.