Russian Language Advertisement Corpus

Note: Yandex has closed its catalog to automated requests, so this part no longer works. I'll try to find another source of samples.


The corpus is automatically extracted from, uh, the Internets, from two sources for now: sites in the Yandex Catalog (YaCa) and the Apple AppStore. All samples are publicly available, so no problems should arise when using the corpus for research purposes, but investigate the license question a bit more if you are a rotten capitalist.

The final corpus can be obtained by running a single command. Or you can run the separate scripts:

  • - collects URLs from the Yandex Catalog.
  • - extracts samples from the Apple AppStore.
  • - extracts samples from a list of URLs. If a file is supplied as an argument, takes URLs from it; otherwise crawls URLs from YaCa.
  • - joins the text files generated by the process-* scripts into one XML corpus.
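
To make the last two steps a bit more concrete, here is a minimal sketch in Python. It is not the code from this repository: the function names `read_urls` and `join_samples`, the one-URL-per-line file format, the `*.txt` naming and the `<corpus>`/`<sample source="...">` layout are all assumptions made for illustration. The real scripts may read and write different formats.

```python
import sys
import xml.etree.ElementTree as ET
from pathlib import Path


def read_urls(argv, crawl):
    """Return URLs from the file named in argv[1], or fall back to crawl().

    Hypothetical helper: assumes one URL per line in the file.
    """
    if len(argv) > 1:
        with open(argv[1], encoding="utf-8") as handle:
            # Skip blank lines and strip surrounding whitespace.
            return [line.strip() for line in handle if line.strip()]
    return crawl()


def join_samples(text_dir, out_path):
    """Join every *.txt file in text_dir into one XML file.

    Hypothetical layout: <corpus><sample source="NAME">TEXT</sample></corpus>.
    Returns the number of samples written.
    """
    root = ET.Element("corpus")
    for path in sorted(Path(text_dir).glob("*.txt")):
        sample = ET.SubElement(root, "sample", source=path.stem)
        # ElementTree escapes <, > and & when the tree is serialized.
        sample.text = path.read_text(encoding="utf-8")
    ET.ElementTree(root).write(str(out_path), encoding="utf-8", xml_declaration=True)
    return len(root)


if __name__ == "__main__":
    # Usage (assumed): python join.py TEXT_DIR OUTPUT.xml
    if len(sys.argv) == 3:
        print(join_samples(sys.argv[1], sys.argv[2]), "samples written")
```

Building the tree with `xml.etree.ElementTree` instead of concatenating strings keeps the output well-formed even when an advertisement contains characters such as `&` or `<`.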

You'll need PhantomJS and Selenium to process YaCa.