A simple focused web crawler for discovering documents written in isiXhosa. It was produced as part of an undergraduate independent research project, supervised by Professor Hussein Suleman, during my B.Sc. in Computer Science & Xhosa Communication at the University of Cape Town.
This research was partially funded by the National Research Foundation of South Africa (Grant number: 129253) and the University of Cape Town. The authors acknowledge that opinions, findings and conclusions or recommendations expressed in this publication are those of the authors, and that the NRF accepts no liability whatsoever in this regard.
Results associated with the crawler were published at the SAICSIT 2023 conference.
The final paper is available from SpringerLink, and a pre-print version is available for free from UCT CS's publications archive.
The dataset itself is available here.
Please cite as follows:
@InProceedings{10.1007/978-3-031-39652-6_2,
author="Marquard, Cael
and Suleman, Hussein",
editor="Gerber, Aurona
and Coetzee, Marijke",
title="Focused Crawling for Automated IsiXhosa Corpus Building",
booktitle="South African Institute of Computer Scientists and Information Technologists",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="19--31",
abstract="IsiXhosa is a low-resource language, which means that it does not have many large, high-quality corpora. This makes it difficult to perform many kinds of research with the language. This paper examines the use of focused Web crawling for automatic corpus generation. The resulting corpus is characterised using statistical methods: its vocabulary growth has been found to fit Heaps' Law, and its word frequency has been found to be heavy-tailed. In addition, as expected, the corpus statistics did not match expectations from non-agglutinative languages.",
isbn="978-3-031-39652-6"
}