This project was created for the 2020 Nashville Analytics Summit hosted by the Nashville Technology Council.
Abstract: In natural language processing, topic models are used to extract meaningful and human-interpretable topics from a corpus. However, tuning topic models for large corpora can be time-consuming and computationally expensive. By monitoring topic coherence as a function of corpus size, we can determine how to efficiently build a high-quality topic model. In this talk, we will demonstrate this technique using the English Wikipedia corpus.
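As a rough illustration of the approach (a sketch, not the exact code used in the talk), the snippet below trains a Gensim LDA model on progressively larger slices of a tokenized corpus and records c_v coherence for each slice. The names `texts`, `sizes`, and `num_topics` are placeholders assumed for the example.

```python
# Minimal sketch, assuming `texts` is a list of tokenized documents
# (a list of lists of strings) already extracted from the Wikipedia dump.
from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

def coherence_by_corpus_size(texts, sizes, num_topics=50):
    """Train LDA on growing prefixes of `texts`; return {size: c_v coherence}."""
    scores = {}
    for n in sizes:
        subset = texts[:n]
        dictionary = Dictionary(subset)
        bow = [dictionary.doc2bow(doc) for doc in subset]
        lda = LdaModel(corpus=bow, id2word=dictionary, num_topics=num_topics)
        cm = CoherenceModel(model=lda, texts=subset,
                            dictionary=dictionary, coherence="c_v")
        scores[n] = cm.get_coherence()
    return scores

# Example usage:
# scores = coherence_by_corpus_size(texts, sizes=[10_000, 50_000, 100_000])
```

Plotting the resulting scores against corpus size shows where coherence levels off, which is the point the abstract refers to when tuning on a subset becomes good enough.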
I used the following papers, packages, and tutorials in this work:
- Simple API for XML (SAX)
- MWParserFromHell (see the dump-parsing sketch after this list)
- Blei, Ng, and Jordan. “Latent Dirichlet Allocation.” Journal of Machine Learning Research (2003).
- Blei. “Probabilistic Topic Models.” Communications of the ACM (2012).
- Hoffman, Blei, and Bach. “Online Learning for Latent Dirichlet Allocation.” Conference on Neural Information Processing Systems (2010).
- Röder, Both, and Hinneburg. “Exploring the space of topic coherence measures.” Proceedings of the Eighth ACM International Conference on Web Search and Data Mining (2015).
- Gensim Latent Dirichlet Allocation algorithm documentation
- Gensim Topic coherence pipeline documentation
- https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
- https://github.com/DOsinga/deep_learning_cookbook/blob/master/04.1%20Collect%20movie%20data%20from%20Wikipedia.ipynb
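For context on the parsing resources listed above (SAX and MWParserFromHell), a streaming pass over the compressed Wikipedia dump might look roughly like the sketch below. The handler structure, function names, and callback signature are assumptions for illustration, not the exact code behind the talk.

```python
# Rough sketch: stream article text out of a bz2-compressed MediaWiki XML dump
# with xml.sax (so the dump never has to fit in memory) and mwparserfromhell.
import bz2
import xml.sax
import mwparserfromhell

class WikiTextHandler(xml.sax.ContentHandler):
    """Calls `callback(title, plain_text)` for each <page> in the dump."""
    def __init__(self, callback):
        super().__init__()
        self._callback = callback
        self._tag = None
        self._buffer = []
        self._title = None

    def startElement(self, name, attrs):
        self._tag = name
        self._buffer = []

    def characters(self, content):
        if self._tag in ("title", "text"):
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self._title = "".join(self._buffer)
        elif name == "text":
            # Strip wiki markup down to plain text before handing it off.
            wikicode = mwparserfromhell.parse("".join(self._buffer))
            self._callback(self._title, wikicode.strip_code())

def stream_pages(dump_path, callback):
    """Feed the compressed dump to the SAX parser line by line."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(WikiTextHandler(callback))
    with bz2.open(dump_path, "rb") as stream:
        for line in stream:
            parser.feed(line)
```

The plain-text articles produced this way would then be tokenized to form the `texts` input used in the coherence sketch above.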