RuLanAdCor

Russian Language Advertisement Corpus

Note: Yandex has closed its catalog to automated requests, so the Yandex Catalog part of the pipeline won't work now. I'll try to find another source of samples.

The corpus is extracted automatically from the Internet, currently from two sources: sites listed in Yandex Catalog and the Apple AppStore. All samples are publicly available, so no problems should arise when using the corpus for research purposes, but investigate the licensing question further if you are a rotten capitalist.

The final corpus can be obtained by running generate-corpus.sh. Alternatively, you can run the scripts separately:

crawl.py - collects URLs from Yandex Catalog.
process-appstore.py - extracts samples from the Apple AppStore.
process-yaca.py - extracts samples from a list of URLs. If a file is supplied as an argument, it takes URLs from that file; otherwise it crawls URLs from Yandex Catalog (YaCa).
join.py - joins the text files generated by the process-* scripts into one XML corpus.
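
The joining step can be pictured with a minimal sketch. This is not the actual join.py: it assumes each process-* script writes one sample per non-empty line of a UTF-8 text file, and the function name `join_samples`, the element names `corpus` and `sample`, and the `source` attribute are all hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def join_samples(text_files, output_path):
    """Merge plain-text sample files into a single XML document.

    Assumed input format (hypothetical): one sample per non-empty line.
    Returns the number of samples written.
    """
    root = ET.Element("corpus")  # hypothetical root element name
    for path in text_files:
        source = Path(path)
        for line in source.read_text(encoding="utf-8").splitlines():
            text = line.strip()
            if not text:
                continue  # skip blank lines between samples
            # Record which file the sample came from (hypothetical attribute).
            sample = ET.SubElement(root, "sample", source=source.name)
            sample.text = text
    ET.ElementTree(root).write(output_path, encoding="utf-8", xml_declaration=True)
    return len(root)
```

Under these assumptions, calling `join_samples(["appstore.txt", "yaca.txt"], "corpus.xml")` (file names are placeholders) would write every sample from both files into one XML file, tagged with the file it came from.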

You'll need PhantomJS and Selenium to process YaCa.

versusvoid/RuLanAdCor
