Python scripts for extracting text from PDF and TXT files, preprocessing that text with NLP methods, and running topic modeling on the result. Topic modeling can serve as a preliminary data analysis step to identify the key themes and topics present in a corpus of text. The implementation is adapted from a tutorial by Maarten Grootendorst.
- Extracts text from files in a specified directory.
- Handles both PDF and TXT file formats.
- Removes stopwords from the extracted text.
- Lemmatizes words.
- A BERTopic model identifies topics within the preprocessed text.
- UMAP is used for dimensionality reduction of text embeddings.
For each topic, the code returns the 10 most representative words and the number of documents associated with that topic.
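As a rough illustration of the topic-modeling step, the sketch below builds a BERTopic model with an explicit UMAP reducer and collects the top words and document count per topic. The function names `build_topic_model` and `summarize_topics` are hypothetical, and the UMAP parameter values are illustrative assumptions, not necessarily what the scripts in this repository use.

```python
def build_topic_model(n_neighbors=15, n_components=5, random_state=42):
    """Create a BERTopic model that uses UMAP for dimensionality reduction.

    The parameter values are illustrative assumptions, not tuned settings.
    """
    # Imported lazily so summarize_topics() can be used without these packages.
    from bertopic import BERTopic
    from umap import UMAP

    umap_model = UMAP(
        n_neighbors=n_neighbors,
        n_components=n_components,
        min_dist=0.0,
        metric="cosine",
        random_state=random_state,  # fixed seed for reproducible topics
    )
    # top_n_words=10 keeps the 10 most representative words per topic.
    return BERTopic(umap_model=umap_model, top_n_words=10)


def summarize_topics(topic_model):
    """Return (topic_id, document_count, top_words) for each topic.

    Topic -1 is BERTopic's outlier bucket and is skipped.
    """
    summary = []
    info = topic_model.get_topic_info()  # table with "Topic" and "Count" columns
    for topic_id, count in zip(info["Topic"], info["Count"]):
        if topic_id == -1:
            continue
        # get_topic() returns (word, score) pairs; keep only the words.
        words = [word for word, _score in topic_model.get_topic(topic_id)]
        summary.append((int(topic_id), int(count), words))
    return summary
```

A typical call sequence, assuming `docs` is a list of preprocessed strings, would be `model = build_topic_model()`, then `model.fit_transform(docs)`, then `summarize_topics(model)`. BERTopic generally needs a reasonably large corpus to form stable clusters.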
- NLTK: NLP library used for preprocessing tasks such as stopword removal and lemmatization.
- PyPDF2: Library for extracting text from PDF files.
- BERTopic: Topic modeling technique that leverages transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
- UMAP: Algorithm used for dimensionality reduction of the text embeddings.
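To show how PyPDF2 and NLTK from the list above might fit into the extraction and preprocessing steps, here is a minimal sketch. The function names (`extract_texts`, `preprocess`, `nltk_resources`) are hypothetical, the regex tokenizer is an assumption, and `PdfReader` assumes PyPDF2 2.0 or later; the actual scripts may differ.

```python
import re
from pathlib import Path


def extract_texts(directory):
    """Return {filename: text} for every .pdf and .txt file in `directory`."""
    texts = {}
    for path in sorted(Path(directory).iterdir()):
        suffix = path.suffix.lower()
        if suffix == ".txt":
            texts[path.name] = path.read_text(encoding="utf-8", errors="ignore")
        elif suffix == ".pdf":
            # Imported lazily so TXT-only directories work without PyPDF2.
            from PyPDF2 import PdfReader

            reader = PdfReader(str(path))
            # extract_text() may yield nothing for image-only pages.
            texts[path.name] = "\n".join(
                page.extract_text() or "" for page in reader.pages
            )
    return texts


def nltk_resources(language="english"):
    """Load an NLTK stopword set and a WordNet lemmatizer function."""
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer

    # Download the required corpora on first use.
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)
    return set(stopwords.words(language)), WordNetLemmatizer().lemmatize


def preprocess(text, stop_words, lemmatize):
    """Lowercase, tokenize on letters, drop stopwords, lemmatize the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())  # simple tokenizer (assumption)
    return " ".join(lemmatize(t) for t in tokens if t not in stop_words)
```

With NLTK installed, the pieces could be combined as `stop_words, lemmatize = nltk_resources()` followed by `docs = [preprocess(t, stop_words, lemmatize) for t in extract_texts("data").values()]`, where `"data"` is a placeholder directory name.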