isixhosa-crawler

A simple focused web crawler for discovering documents written in isiXhosa. It was produced as part of an undergraduate independent research project, supervised by Professor Hussein Suleman, during my B.Sc. in Computer Science & Xhosa Communication at the University of Cape Town.

Disclosure

This research was partially funded by the National Research Foundation of South Africa (Grant number: 129253) and the University of Cape Town. The authors acknowledge that opinions, findings, conclusions and recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.

Publication

Results obtained with the crawler were published at the SAICSIT 2023 conference.

The final paper is available from SpringerLink, and a pre-print version is available for free from UCT CS's publications archive.

The dataset itself is available here.

Citation

Please cite as follows:

@InProceedings{10.1007/978-3-031-39652-6_2,
author="Marquard, Cael
and Suleman, Hussein",
editor="Gerber, Aurona
and Coetzee, Marijke",
title="Focused Crawling for Automated IsiXhosa Corpus Building",
booktitle="South African Institute of Computer Scientists and Information Technologists",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="19--31",
abstract="IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corpora. This makes it difficult to perform many kinds of research with the language. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth has been found to fit Heaps' Law, and its word frequency has been found to be heavy-tailed. In addition, as expected, the corpus statistics did not match expectations from non-agglutinative languages.",
isbn="978-3-031-39652-6"
}

Folders and files

crawler
inspect_out
.gitignore
README.md
examples.csv
gen_seeds.py
inspect_out.py
remove_blacklisted.py
reprocess_seeds.py
requirements.txt
run.sh
scrapy.cfg
seeds.txt
words.csv
