OTMISC: Our Topic Modeling Is Super Cool

An advanced topic modeling pipeline for evaluating different topic modeling algorithms: their performance on short and long texts, on preprocessed and raw datasets, and with different embedding models. The results are summarized with suggestions on how to choose an algorithm based on the task.


Introduction

This project was developed by Computer Science and Mathematics master's students at TUM (Technical University of Munich) for the course "Master's Practical Course - Machine Learning for Natural Language Processing Applications" in SS22 (Summer Semester 2022). Since this project is still in its infancy, use it with care.

  • Project Advisors:
    • PhD Candidate (M.Sc.) Miriam Anschütz
    • PhD Candidate (M.Sc.) Ahmed Mosharafa
  • Project Scope:
    • Evaluate different topic modeling algorithms on short- and long-text datasets.
    • Draw observations on how well the clusters produced by each algorithm apply to different types of datasets.
    • Produce both a metric-based and a human-based evaluation of the algorithms.

Contributors

| Contributor | GitHub Account | Email Address | LinkedIn Account | Other Links |
| ----------- | -------------- | ------------- | ---------------- | ----------- |
| Berk Sudan | github:berksudan | [email protected] | 🔗 | medium.com/@berksudan |
| Ferdinand Kapl | github:fkapl | [email protected] | - | - |
| Yuyin Lang | github:YuyinLang | [email protected] | 🔗 | - |

Repository structure

  • docs includes the documents for this work, such as the task description, the final paper, presentations, and literature research.
  • data includes all the datasets used in this work.
  • notebooks includes all the demo notebooks (one per algorithm) and one bulk-run notebook.
  • src includes the Python files that make up the pipeline of this work.

Project Report and Presentations

The final report and the presentation slides can be found in the docs directory.

Datasets

  • Explored the provided datasets to unveil their inherent characteristics.
  • Obtained an overview of the statistical characteristics of the datasets.

Available Datasets

| Resource Name | Is Suitable? | Type | Contains Tweet Text? | Topic Count | Total Instances | Topic Distribution |
| ------------- | ------------ | ---- | -------------------- | ----------- | --------------- | ------------------ |
| 20 News (By Date) | Yes | Long Text Dataset | No | 20 | 853627 | 42K - 45K - 52K - 33K - 30K - 53K - 33K - 35K - 33K - 37K - 45K - 51K - 33K - 45K - 45K - 51K - 46K - 65K - 50K - 33K |
| Yahoo Dataset (60K) | Yes | Long Text Dataset | No | 10 | 60000 | 6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K - 6K |
| AG News Titles and Texts | Yes | Long Text Dataset | No | 4 | 127600 | 32K - 32K - 32K - 32K |
| CRISIS NLP - Resource #01 | Yes | Short Text Dataset | Yes | 4 | 20514 | 3K - 9K - 4K - 5K |
| CRISIS NLP - Resource #12 | Yes | Short Text Dataset | Yes | 4 | 8007 | 2K - 2K - 2K - 2K |
| CRISIS NLP - Resource #07 | Yes | Short Text Dataset | Yes | 2 | 10941 | 5K - 6K |
| CRISIS NLP - Resource #17 | Yes | Short Text Dataset | Yes | 10 | 76484 | 6K - 5K - 3K - 21K - 8K - 7K - 4K - 12K - 0.5K - 9K |
| AG News Titles | Yes | Short Text Dataset | No | 4 | 127600 | 32K - 32K - 32K - 32K |
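
The topic count, instance count, and topic distribution columns above can be reproduced for any labeled dataset. A minimal sketch, assuming the dataset is a CSV with a label column; the file path and column name below are assumptions for illustration, not the repository's actual layout:

```python
from collections import Counter

import pandas as pd

# Hypothetical file path and label column -- adjust to the actual files in data/.
df = pd.read_csv("data/ag_news_titles.csv")
labels = df["label"]

print("Topic count:     ", labels.nunique())
print("Total instances: ", len(labels))
print("Topic distribution:",
      " - ".join(f"{n / 1000:.0f}K" for _, n in sorted(Counter(labels).items())))
```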

Deployment and Run

Build

  • On Linux, it is enough to run the following command to set up a virtual environment and install the dependencies:
$ ./build_for_linux.sh
  • On Windows and other operating systems, install Python 3.8 and install the dependencies with pip install -r requirements.txt. Be careful about the package versions and make sure you have the correct versions in your current setup!

Run

  • To run the Jupyter Notebook, just execute the following command:
$ ./run_jupyter.sh

Note: On Windows and other operating systems, Jupyter can be started via Anaconda or similar tools.

  • Then you can run the notebooks in ./notebooks. There is one demo notebook per algorithm, plus a main runner notebook that executes the whole pipeline parametrically from a config (a purely illustrative sketch of such a config follows).
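
The actual config schema is defined by the runner notebook in this repository; the snippet below is only a sketch of what a parametric run configuration might look like, and every key name and value in it is an assumption, not the tool's real interface:

```python
# Hypothetical run configuration -- key names are illustrative assumptions,
# not the actual schema expected by the main runner notebook.
config = {
    "algorithm": "lda",                     # which topic modeling algorithm to run
    "dataset": "ag_news_titles",            # which dataset from data/ to use
    "preprocess": True,                     # whether to preprocess the texts first
    "embedding_model": "all-MiniLM-L6-v2",  # embedding model, if the algorithm uses one
    "num_topics": 4,                        # expected number of topics
}
```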

Evaluation Metrics

The following evaluation metrics are used for a metric-based assessment of the produced topics (a minimal sketch of how to compute them with off-the-shelf libraries follows the list):

  • Diversity Unique: percentage of unique topic words; in [0, 1], with 1 meaning all topic words are distinct.
  • Diversity Inverted Rank-Biased Overlap (IRBO): rank-weighted percentage of unique topic words, where words at higher ranks are penalized less; in [0, 1], with 1 for completely distinct topics.
  • Coherence Normalized Pointwise Mutual Information (NPMI): measures how well the topic words fit together as a topic; in [-1, 1], with 1 for perfect association.
  • Coherence V: coherence of the topic words, evaluated with large sliding windows over the text together with an indirect cosine similarity based on NPMI; in [0, 1], with 1 for perfect association.
  • Rand Index: similarity between the clustering given by the topic model and the one given by the true labels; in [0, 1], with 1 for a perfect match.
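
A minimal sketch of how these metrics can be computed, using gensim for the coherence scores and scikit-learn for the Rand index; the toy topics, documents, and labels are illustrative only, not the pipeline's actual code:

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel
from sklearn.metrics import rand_score

# Toy example: two topics, each given as a ranked list of topic words.
topics = [["game", "team", "season", "player"],
          ["stock", "market", "price", "trade"]]
tokenized_docs = [["game", "team", "season", "player", "game"],
                  ["stock", "market", "price", "trade", "market"]]

# Diversity Unique: share of distinct words across all topic-word lists.
all_words = [word for topic in topics for word in topic]
diversity_unique = len(set(all_words)) / len(all_words)  # 1.0 -> all distinct

# Coherence NPMI and Coherence V via gensim's CoherenceModel.
dictionary = Dictionary(tokenized_docs)
npmi = CoherenceModel(topics=topics, texts=tokenized_docs,
                      dictionary=dictionary, coherence="c_npmi").get_coherence()
c_v = CoherenceModel(topics=topics, texts=tokenized_docs,
                     dictionary=dictionary, coherence="c_v").get_coherence()

# Rand Index between the model's topic assignments and the true labels.
ri = rand_score([0, 0, 1, 1], [0, 0, 1, 1])  # 1.0 -> perfect match

print(diversity_unique, npmi, c_v, ri)
```

(IRBO is omitted here; it needs a rank-biased-overlap implementation that is not part of gensim or scikit-learn.)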
