This project uses Latent Dirichlet Allocation (LDA) to group more than 90,000 CNN news articles into topics. A Flask backend exposes an API for real-time similarity search: given an input article, it returns the top-K most similar articles. Redis caches search results to speed up repeated queries, and the project is Dockerized for easy deployment.
- Python, Flask - Backend web framework
- Gensim, spaCy - Topic modeling with LDA
- Redis - Caching for faster similarity search
- Docker - Containerization for deployment
- LDA Topic Clustering: Clusters documents into topics using Gensim's LDA.
- Similarity Search: Real-time search for top-K most similar articles based on LDA topics.
- Caching: Redis caching for fast search results.
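The features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes each article is already represented as a dense LDA topic vector, it uses cosine similarity (the README does not say which measure the project uses), and the names `top_k_similar`, `cached_search`, `infer_topics`, and the `sim:` key prefix are hypothetical. The `cache` argument is any object with `get(key)` and `set(key, value, ex=ttl)`, which matches the shape of a `redis.Redis` client.

```python
import hashlib
import json
import math


def cosine(a, b):
    """Cosine similarity between two equal-length topic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_similar(query_vec, corpus_vecs, k=5):
    """Return the k (article_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in corpus_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def cached_search(cache, text, infer_topics, corpus_vecs, k=5, ttl=3600):
    """Check the cache under a hash of the input text before recomputing."""
    # Hypothetical key scheme: k plus a SHA-256 digest of the article text.
    key = "sim:%d:%s" % (k, hashlib.sha256(text.encode("utf-8")).hexdigest())
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)
    # infer_topics is assumed to map raw text to an LDA topic vector.
    results = [list(pair) for pair in top_k_similar(infer_topics(text), corpus_vecs, k)]
    cache.set(key, json.dumps(results), ex=ttl)
    return results
```

Keying the cache on a digest of the text keeps keys short regardless of article length, and the expiry (`ttl`, an assumed one hour here) bounds how long stale results can survive a model retrain.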
- Clone the repository.
- Navigate to the project directory.
- Run the project using Docker:
docker-compose up --build
- POST /api/similarity_search:
  - Input: Article text.
  - Output: Top-K most similar articles.
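A client call to this endpoint might look like the sketch below, using only the Python standard library. The README does not specify the request schema, so the JSON field names `text` and `top_k` are assumptions, as is the JSON response body; check the Flask route for the actual names. The host and port are the defaults for a local `flask run`.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:5000/api/similarity_search"


def build_request(article_text, top_k=5, url=API_URL):
    """Build a JSON POST request for the similarity search endpoint."""
    # "text" and "top_k" are assumed field names, not confirmed by the README.
    payload = json.dumps({"text": article_text, "top_k": top_k}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def search(article_text, top_k=5, url=API_URL):
    """Send the request and decode the (assumed JSON) response body."""
    with urllib.request.urlopen(build_request(article_text, top_k, url)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `search("Full text of an article...", top_k=3)` would send the request and return the decoded response.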
To run without Docker:
pip install -r requirements.txt
flask run
Access the API at http://127.0.0.1:5000