GitHub - x-tabdeveloping/turftopic: Robust and fast topic models with sentence-transformers.

Topic modeling is your turf too.
Contextual topic models with representations from transformers.

Features

- Novel transformer-based topic models:
  - Semantic Signal Separation - S³ 🧭
  - KeyNMF 🔑
  - GMM
- Implementations of existing transformer-based topic models:
  - Clustering Topic Models: BERTopic and Top2Vec
  - Autoencoding Topic Models: CombinedTM and ZeroShotTM
- Streamlined scikit-learn compatible API 🛠️
- Easy topic interpretation 🔍
- Dynamic Topic Modeling 📈 (GMM, ClusteringTopicModel and KeyNMF)
- Visualization with topicwizard 🖌️

This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.

New in version 0.3.0: Dynamic KeyNMF

KeyNMF can now be used for dynamic topic modeling.

from datetime import datetime
from turftopic import KeyNMF

corpus: list[str] = [...]
timestamps: list[datetime] = [...]

model = KeyNMF(10)
doc_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)

model.print_topics_over_time()

# This needs Plotly: pip install plotly
model.plot_topics_over_time()
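
To make the `bins=10` argument more concrete, here is a minimal sketch of one way timestamps could be grouped into equal-width time slices. It is an illustration only and is not taken from Turftopic's implementation, which may bin differently; the helper name `assign_time_bins` is hypothetical.

```python
from datetime import datetime


def assign_time_bins(timestamps: list[datetime], bins: int) -> list[int]:
    """Assign each timestamp to one of `bins` equal-width time slices.

    Hypothetical helper for illustration; Turftopic's own binning may differ.
    """
    start, end = min(timestamps), max(timestamps)
    total_seconds = (end - start).total_seconds()
    if total_seconds == 0:
        # All documents share one timestamp, so they all land in bin 0.
        return [0] * len(timestamps)
    labels = []
    for ts in timestamps:
        fraction = (ts - start).total_seconds() / total_seconds
        # The latest timestamp has fraction 1.0; clamp it into the last bin.
        labels.append(min(int(fraction * bins), bins - 1))
    return labels


# Twelve monthly timestamps split into four time slices.
example_timestamps = [datetime(2020, month, 1) for month in range(1, 13)]
print(assign_time_bins(example_timestamps, bins=4))
```

Each document then belongs to exactly one time slice, which is the grouping a dynamic topic model needs in order to report topics over time.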

Basics (Documentation)

Installation

Turftopic can be installed from PyPI.

pip install turftopic

If you intend to use CTMs, make sure to install the package with Pyro as an optional dependency.

pip install turftopic[pyro-ppl]
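
Because Pyro is an optional dependency, it can be handy to check whether it is available before using the autoencoding models. The sketch below uses only the standard library; the helper name `has_module` is hypothetical, and it assumes the `pyro-ppl` distribution is importable as `pyro`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# Assumption: the pyro-ppl distribution provides the top-level module "pyro".
if not has_module("pyro"):
    print("Pyro not found; install it with: pip install turftopic[pyro-ppl]")
```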

Fitting a Model

Turftopic's models follow the scikit-learn API conventions, and as such they are quite easy to use if you are familiar with scikit-learn workflows.

Here's an example of how to use KeyNMF, one of our models, on the 20Newsgroups dataset from scikit-learn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(
    subset="all",
    remove=("headers", "footers", "quotes"),
)
corpus = newsgroups.data

Once the corpus is loaded, fit a model on it. Turftopic also comes with interpretation tools that make it easy to display and understand your results.

from turftopic import KeyNMF

model = KeyNMF(20).fit(corpus)
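
For readers less familiar with the scikit-learn conventions mentioned above, the following toy class sketches what they amount to: `fit()` returns the estimator itself (which is why `KeyNMF(20).fit(corpus)` can be assigned directly), `transform()` returns a document-topic matrix, and `fit_transform()` chains the two. `ToyTopicModel` is hypothetical, is not part of Turftopic, and is not a real topic model; it only mirrors the interface.

```python
from collections import Counter


class ToyTopicModel:
    """Toy stand-in (hypothetical, not part of Turftopic) that follows the
    scikit-learn estimator conventions."""

    def __init__(self, n_components: int):
        self.n_components = n_components

    def fit(self, raw_documents: list[str]) -> "ToyTopicModel":
        counts = Counter(
            word for doc in raw_documents for word in doc.lower().split()
        )
        # In this toy, each of the most frequent words acts as one "topic".
        self.topic_words_ = [
            word for word, _ in counts.most_common(self.n_components)
        ]
        return self  # returning self allows ToyTopicModel(...).fit(corpus)

    def transform(self, raw_documents: list[str]) -> list[list[int]]:
        # One row per document, one column per topic.
        return [
            [doc.lower().split().count(word) for word in self.topic_words_]
            for doc in raw_documents
        ]

    def fit_transform(self, raw_documents: list[str]) -> list[list[int]]:
        return self.fit(raw_documents).transform(raw_documents)


toy_corpus = ["cats chase mice", "dogs chase cats", "cats sleep"]
toy_model = ToyTopicModel(2).fit(toy_corpus)
toy_doc_topic = toy_model.transform(toy_corpus)
print(toy_model.topic_words_, toy_doc_topic)
```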

Interpreting Models

Turftopic comes with a number of pretty printing utilities for interpreting the models.

To see the most important words for each topic, use the print_topics() method.

model.print_topics()

Topic ID	Top 10 Words
0	armenians, armenian, armenia, turks, turkish, genocide, azerbaijan, soviet, turkey, azerbaijani
1	sale, price, shipping, offer, sell, prices, interested, 00, games, selling
2	christians, christian, bible, christianity, church, god, scripture, faith, jesus, sin
3	encryption, chip, clipper, nsa, security, secure, privacy, encrypted, crypto, cryptography
	....
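
As a rough sketch of how such a table can be derived, assume a model exposes a topic-term importance matrix with one row per topic and one column per vocabulary word. The top words are then the columns with the highest values in each row. The function `top_words` and the toy numbers below are hypothetical and only illustrate the ranking step, not Turftopic's internals.

```python
def top_words(topic_term_matrix, vocabulary, top_k=10):
    """Return the top_k highest-weighted words for every topic row."""
    result = []
    for row in topic_term_matrix:
        # Sort word indices by descending importance within this topic.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        result.append([vocabulary[i] for i in ranked[:top_k]])
    return result


# Toy data: five vocabulary words and two topics (made-up importances).
vocabulary = ["price", "bible", "sale", "church", "offer"]
topic_term_matrix = [
    [0.9, 0.0, 0.8, 0.1, 0.5],  # a "sales"-like topic
    [0.0, 0.7, 0.1, 0.6, 0.0],  # a "religion"-like topic
]
for topic_id, words in enumerate(top_words(topic_term_matrix, vocabulary, top_k=3)):
    print(topic_id, ", ".join(words))
```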

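# Print highest ranking documents for topic 0.
# document_topic_matrix is not defined earlier, so compute it with transform().
document_topic_matrix = model.transform(corpus)
model.print_representative_documents(0, corpus, document_topic_matrix)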

Document	Score
Poor 'Poly'. I see you're preparing the groundwork for yet another retreat from your...	0.40
Then you must be living in an alternate universe. Where were they? An Appeal to Mankind During the...	0.40
It is 'Serdar', 'kocaoglan'. Just love it. Well, it could be your head wasn't screwed on just right...	0.39
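
Conceptually, representative documents for a topic are the rows of the document-topic matrix with the highest value in that topic's column. The sketch below shows this ranking in plain Python under that assumption; the function `representative_documents` and the toy data are hypothetical and do not reproduce Turftopic's implementation.

```python
def representative_documents(topic_id, documents, document_topic_matrix, top_k=5):
    """Return (document, score) pairs with the highest score for one topic."""
    scored = [
        (doc, row[topic_id]) for doc, row in zip(documents, document_topic_matrix)
    ]
    # Highest-scoring documents first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: three documents and two topics (made-up scores).
toy_documents = ["doc about sales", "doc about faith", "doc about both"]
toy_matrix = [
    [0.40, 0.01],
    [0.02, 0.35],
    [0.20, 0.15],
]
for doc, score in representative_documents(0, toy_documents, toy_matrix, top_k=2):
    print(f"{doc}\t{score:.2f}")
```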

model.print_topic_distribution(
    "I think guns should definitely be banned from all public institutions, such as schools."
)

Topic name	Score
7_gun_guns_firearms_weapons	0.05
17_mail_address_email_send	0.00
3_encryption_chip_clipper_nsa	0.00
19_baseball_pitching_pitcher_hitter	0.00
11_graphics_software_program_3d	0.00

Visualization

Turftopic does not come with built-in visualization utilities, but topicwizard, an interactive topic model visualization library, is compatible with all models from Turftopic.

pip install topic-wizard

By far the easiest way to visualize your models for interpretation is to launch the topicwizard web app.

import topicwizard

topicwizard.visualize(corpus, model=model)

Screenshot of the topicwizard Web Application

Alternatively, you can use the Figures API in topicwizard to produce individual HTML figures.

References

Kardos, M., Kostkan, J., Vermillet, A., Nielbo, K., Enevoldsen, K., & Rocca, R. (2024, June 13). $S^3$ - Semantic Signal separation. arXiv.org. https://arxiv.org/abs/2406.09556
Grootendorst, M. (2022, March 11). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv.org. https://arxiv.org/abs/2203.05794
Angelov, D. (2020, August 19). Top2VEC: Distributed representations of topics. arXiv.org. https://arxiv.org/abs/2008.09470
Bianchi, F., Terragni, S., & Hovy, D. (2020, April 8). Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. arXiv.org. https://arxiv.org/abs/2004.03974
Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). Cross-lingual Contextualized Topic Models with Zero-shot Learning. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 1676–1683). Association for Computational Linguistics.
