This project was created for the 2020 Nashville Analytics Summit hosted by the Nashville Technology Council.
Abstract: In natural language processing, topic models are used to extract meaningful and human-interpretable topics from a corpus. However, tuning topic models for large corpora can be time-consuming and computationally expensive. By monitoring topic coherence as a function of corpus size, we can determine how to efficiently build a high-quality topic model. In this talk, we will demonstrate this technique using the English Wikipedia corpus.
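As a rough illustration of the approach (a sketch, not the exact code used in the talk), the snippet below trains a Gensim LDA model on progressively larger slices of a tokenized corpus and records c_v coherence for each slice. The names `texts`, `sizes`, and `num_topics` are placeholders assumed for the example.

```python
# Minimal sketch, assuming `texts` is a list of tokenized documents
# (a list of lists of strings) already extracted from the Wikipedia dump.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def coherence_by_corpus_size(texts, sizes, num_topics=50):
    """Train LDA on growing prefixes of `texts`; return {size: c_v coherence}."""
    scores = {}
    for n in sizes:
        subset = texts[:n]
        dictionary = Dictionary(subset)
        bow = [dictionary.doc2bow(doc) for doc in subset]
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
        cm = CoherenceModel(model=lda, texts=subset,
                            dictionary=dictionary, coherence="c_v")
        scores[n] = cm.get_coherence()
    return scores

# Example usage:
# scores = coherence_by_corpus_size(texts, sizes=[10_000, 50_000, 100_000])
```

Plotting the resulting scores against corpus size shows where coherence levels off, which is the point the abstract refers to when tuning on a subset becomes good enough.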
I used the following papers, packages, and tutorials in this work:
- Simple API for XML (SAX)
- MWParserFromHell (see the dump-parsing sketch after this list)
- Blei, Ng, and Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research (2003).
- Blei. “Probabilistic Topic Models.” Communications of the ACM (2012).
- Hoffman, Blei, and Bach. “Online Learning for Latent Dirichlet Allocation.” Conference on Neural Information Processing Systems (2010).
- Röder, Both, and Hinneburg. “Exploring the space of topic coherence measures.” Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (2015).
- Gensim Latent Dirichlet Allocation algorithm documentation
- Gensim Topic coherence pipeline documentation
- https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
- https://github.com/DOsinga/deep_learning_cookbook/blob/master/04.1%20Collect%20movie%20data%20from%20Wikipedia.ipynb
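For context on the parsing resources listed above (SAX and MWParserFromHell), a streaming pass over the compressed Wikipedia dump might look roughly like the sketch below. The handler structure, function names, and callback signature are assumptions for illustration, not the exact code behind the talk.

```python
# Rough sketch: stream article text out of a bz2-compressed MediaWiki XML dump
# with xml.sax (so the dump never has to fit in memory) and mwparserfromhell.
import bz2
import xml.sax
import mwparserfromhell

class WikiTextHandler(xml.sax.ContentHandler):
    """Calls `callback(title, plain_text)` for each <page> in the dump."""
    def __init__(self, callback):
        super().__init__()
        self._callback = callback
        self._tag = None
        self._buffer = []
        self._title = None

    def startElement(self, name, attrs):
        self._tag = name
        self._buffer = []

    def characters(self, content):
        if self._tag in ("title", "text"):
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._title = "".join(self._buffer)
        elif name == "text":
            # Strip wiki markup down to plain text before handing it off.
            wikicode = mwparserfromhell.parse("".join(self._buffer))
            self._callback(self._title, wikicode.strip_code())

def stream_pages(dump_path, callback):
    """Feed the compressed dump to the SAX parser line by line."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(WikiTextHandler(callback))
    with bz2.open(dump_path, "rb") as stream:
        for line in stream:
            parser.feed(line)
```

The plain-text articles produced this way would then be tokenized to form the `texts` input used in the coherence sketch above.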