Unsupervised clustering of articles with real-time similarity search using Flask and Redis.

aryanxxvii/LDA-Document-Topic-Modelling

LDA-Based Unsupervised Document Topic Clustering

Project Overview

This project uses Latent Dirichlet Allocation (LDA) to cluster more than 90,000 CNN news articles into topics. A Flask backend API provides real-time similarity search: given an input article, it returns the top-K most similar articles. Redis is used as a cache to speed up searches. The project is Dockerized for easy deployment.

Tech Stack

  • Python, Flask - Backend web framework
  • Gensim, spaCy - Topic modeling with LDA
  • Redis - Caching for faster similarity search
  • Docker - Containerization for deployment

Key Features

  • LDA Topic Clustering: Clusters documents into topics using Gensim's LDA.
  • Similarity Search: Real-time search for top-K most similar articles based on LDA topics.
  • Caching: Redis caching for fast search results.
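The search and caching features above can be illustrated with a minimal sketch. It is not the repository's code. It assumes each article is represented by its LDA topic-probability vector, that similarity is measured with cosine similarity (the README does not name the measure), and that results are cached under a key derived from a hash of the input text. The names `cosine`, `top_k`, `cached_top_k`, and `infer_topics` are hypothetical, and a plain dict stands in for Redis so the example runs without a server.

```python
import hashlib
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, corpus_vecs, k=5):
    """Return the k (article_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def cached_top_k(text, infer_topics, corpus_vecs, cache, k=5):
    """Cache-aside lookup: reuse the stored result for identical input text.

    infer_topics is a hypothetical callable mapping text to a topic vector
    (for example, a wrapper around a trained LDA model).
    """
    key = "sim:%d:%s" % (k, hashlib.sha256(text.encode("utf-8")).hexdigest())
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    serialized = json.dumps(top_k(infer_topics(text), corpus_vecs, k))
    # A dict stands in for Redis here; a redis-py client would use
    # cache.set(key, serialized) instead of item assignment.
    cache[key] = serialized
    return json.loads(serialized)


# Toy topic vectors (made-up values, three topics per article).
corpus = {
    "a": [0.8, 0.1, 0.1],
    "b": [0.1, 0.8, 0.1],
    "c": [0.7, 0.2, 0.1],
}
query = [0.9, 0.05, 0.05]
ranked_ids = [doc_id for doc_id, _ in top_k(query, corpus, k=2)]
```

With these toy vectors, `ranked_ids` lists article "a" first and "c" second, because both put most of their weight on the same topic as the query. Hashing the text keeps cache keys short and fixed-length regardless of article size.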

How to Run

  1. Clone the repository.
  2. Navigate to the project directory.
  3. Run the project using Docker:
    docker-compose up --build

API Endpoint

  • POST /api/similarity_search:
    • Input: Article text.
    • Output: Top-K most similar articles.
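A request to this endpoint might be built as in the sketch below. The README does not document the request schema, so the JSON body and its field names `text` and `k` are assumptions, as is the helper name `build_similarity_request`. The base URL is the local address given for running without Docker.

```python
import json
import urllib.request


def build_similarity_request(article_text, k=5, base_url="http://127.0.0.1:5000"):
    """Build (but do not send) a POST request for the similarity endpoint.

    The "text" and "k" field names are assumed, not taken from the project.
    """
    payload = json.dumps({"text": article_text, "k": k}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/api/similarity_search",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_similarity_request("Example article text.", k=3)

# To send it against a running server:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

If the server expects form data or different field names, only the payload construction needs to change.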

Installation

To run without Docker:

pip install -r requirements.txt
flask run

Access the API at http://127.0.0.1:5000.
