Topic Modeling from PDF/txt files

Python scripts for extracting text from PDF and TXT files, preprocessing it with NLP methods, and running topic modeling on the preprocessed text. Topic modeling supports preliminary data analysis by identifying the key themes and topics present in a text corpus. The implementation is adapted from a tutorial by Maarten Grootendorst.

Key Functionalities:

Text Extraction:

Extracts text from files in a specified directory.
Handles both PDF and TXT file formats.
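
The extraction step could look roughly like the sketch below. It is not the notebook's actual code: the function names (read_pdf, read_txt, extract_texts) are hypothetical, and it assumes PyPDF2 2.x or later, where PdfReader and page.extract_text() are available.

```python
from pathlib import Path


def read_pdf(path):
    """Return the text of every page in a PDF, joined by newlines."""
    # Imported lazily so TXT-only use does not require PyPDF2 (assumes PyPDF2 >= 2.0).
    from PyPDF2 import PdfReader

    reader = PdfReader(str(path))
    # extract_text() may return nothing for image-only pages, so fall back to "".
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def read_txt(path, encoding="utf-8"):
    """Return the contents of a plain-text file, replacing undecodable bytes."""
    return Path(path).read_text(encoding=encoding, errors="replace")


def extract_texts(directory):
    """Map each PDF/TXT file name in `directory` to its extracted text."""
    readers = {".pdf": read_pdf, ".txt": read_txt}
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        reader = readers.get(path.suffix.lower())
        # Files with any other extension are skipped silently.
        if path.is_file() and reader is not None:
            texts[path.name] = reader(path)
    return texts
```

Calling extract_texts("data") on a hypothetical "data" folder returns a dictionary keyed by file name, which keeps each document traceable through the later steps.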

Text Preprocessing:

Stopword removal
Lemmatization
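
A minimal sketch of these two preprocessing steps follows. The helper names (load_nltk_resources, preprocess), the regex tokenizer, and the choice of English stopwords are assumptions made for illustration; the notebook may tokenize and configure NLTK differently.

```python
import re


def load_nltk_resources():
    """Return (stop_words, lemmatize) backed by NLTK, downloading data if missing."""
    # Imported lazily so preprocess() can be used with any stopword set and lemmatizer.
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    for resource in ("stopwords", "wordnet", "omw-1.4"):
        nltk.download(resource, quiet=True)
    # Assumption: the corpus is English.
    return set(stopwords.words("english")), WordNetLemmatizer().lemmatize


def preprocess(text, stop_words, lemmatize):
    """Lowercase, tokenize, drop stopwords, and lemmatize the remaining tokens."""
    # Simple alphabetic tokenizer; digits and punctuation are discarded.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(lemmatize(tok) for tok in tokens if tok not in stop_words)
```

Typical use would be to call load_nltk_resources() once and then apply preprocess(text, stop_words, lemmatize) to every extracted document.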

Topic Modeling:

BERTopic model is used for identifying topics within the preprocessed text.
UMAP is used for dimensionality reduction of text embeddings.

For each topic, the code returns the 10 most representative words and the number of documents assigned to that topic.
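
The modeling step and its output can be sketched as below. This is an illustration under stated assumptions, not the notebook's code: fit_topics and summarize_topics are hypothetical helpers, and the UMAP settings shown simply mirror BERTopic's documented defaults with a fixed random_state added for reproducibility.

```python
from collections import Counter


def fit_topics(docs, random_state=42):
    """Fit BERTopic with an explicit UMAP model; returns (model, topic ids)."""
    # Imported lazily because both packages are heavy dependencies.
    from bertopic import BERTopic
    from umap import UMAP

    # Assumed settings: BERTopic's default UMAP values plus a fixed seed.
    umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0,
                      metric="cosine", random_state=random_state)
    topic_model = BERTopic(umap_model=umap_model, top_n_words=10)
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics


def summarize_topics(topic_model, topics, n_words=10):
    """Return {topic_id: {"count": n_docs, "words": top words}} for each topic."""
    counts = Counter(topics)
    summary = {}
    for topic_id, count in sorted(counts.items()):
        # get_topic() yields (word, score) pairs, or False for an unknown topic.
        words = topic_model.get_topic(topic_id) or []
        summary[topic_id] = {"count": count,
                             "words": [word for word, _ in words[:n_words]]}
    return summary
```

In BERTopic, topic -1 collects documents treated as outliers, so it is usually read separately from the numbered topics. BERTopic's own get_topic_info() method provides a similar per-topic overview as a DataFrame.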

Tools:

NLTK: NLP library used for preprocessing tasks such as stopword removal and lemmatization.
PyPDF2: Library for extracting text from PDF files.
BERTopic: Topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics while keeping important words in the topic descriptions.
UMAP: Algorithm used for dimensionality reduction of the text embeddings.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Topic Modeling from PDF/txt files

Key Functionalities:

Text Extraction:

Text Preprocessing:

Topic Modeling:

Tools:

About

Releases

Packages

Languages

License

klabrou/topic-modeling

Folders and files

Latest commit

History

Repository files navigation

Topic Modeling from PDF/txt files

Key Functionalities:

Text Extraction:

Text Preprocessing:

Topic Modeling:

Tools:

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages