This project implements a multi-stage DeepSearch agent that automates gathering, scraping, and summarizing information from the web and Reddit for a given topic.
DeepSearch performs the following steps:
- Search: Takes a user-defined topic and uses a SearXNG-compatible search API (`search.py`) to find relevant web pages.
- Scrape: Fetches the content from the URLs identified in the search phase using a dedicated scraping API (`scraper.py`).
- Summarize: Uses Google's Gemini large language models to generate a comprehensive report based on the scraped web content and Reddit mentions. The report includes an executive summary and a detailed breakdown covering core identity, public profile, news, affiliations, sentiment, and Reddit presence.
- Reddit Integration: Fetches recent Reddit posts related to the topic to provide insight into public discussion on the platform (a minimal sketch follows below).
The main interface is a Streamlit application (`main.py`) that orchestrates these steps and displays the results.
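As a rough illustration of the Reddit step, fetching recent posts with `praw` might look like the sketch below; the helper name and search parameters are assumptions for illustration, not code from `main.py`:

```python
import os
import praw

def fetch_recent_reddit_posts(topic: str, limit: int = 10):
    """Hypothetical helper: return recent Reddit submissions mentioning `topic`."""
    reddit = praw.Reddit(
        client_id=os.getenv("REDDIT_CLIENT_ID"),
        client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
        user_agent=os.getenv("REDDIT_USER_AGENT"),
    )
    # Search all of Reddit for the topic, newest first (parameters are illustrative).
    posts = []
    for submission in reddit.subreddit("all").search(topic, sort="new", limit=limit):
        posts.append({
            "title": submission.title,
            "url": f"https://www.reddit.com{submission.permalink}",
            "created_utc": submission.created_utc,
        })
    return posts
```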
The project consists of three main Python scripts:
- `main.py`:
  - A Streamlit web application that serves as the user interface and orchestrator for the research pipeline.
  - Manages user input, calls the search and scrape APIs, invokes the LLM for summarization, and fetches Reddit data.
  - Displays search results, scraped content, the final summary, and Reddit posts.
- `search.py`:
  - An `aiohttp` server that provides a search API.
  - Uses `pyppeteer` to interact with a SearXNG instance (or a compatible search engine) to perform searches.
  - Designed for asynchronous and parallel page fetching to speed up search result gathering.
  - Endpoint: `POST /search`
- `scraper.py`:
  - An `aiohttp` server that provides a content scraping API.
  - Uses `aiohttp` for asynchronous fetching of multiple URLs and `BeautifulSoup` for parsing HTML content (see the sketch after this list).
  - Extracts titles and textual content from web pages.
  - Endpoint: `POST /scrape`
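To make the division of labour concrete, a stripped-down sketch of what the scrape endpoint could look like with `aiohttp` and `BeautifulSoup` is shown below; the handler name, timeout handling, and error reporting are simplified assumptions rather than the actual `scraper.py` implementation:

```python
import asyncio
import aiohttp
from aiohttp import web
from bs4 import BeautifulSoup

async def fetch_one(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch a single URL and extract its title and visible text."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            html = await resp.text()
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        text = soup.get_text(separator=" ", strip=True)
        return {"url": url, "title": title, "text_content": text, "error": None}
    except Exception as exc:
        return {"url": url, "title": None, "text_content": None, "error": str(exc)}

async def handle_scrape(request: web.Request) -> web.Response:
    """POST /scrape — fetch all requested URLs concurrently."""
    payload = await request.json()
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_one(session, u) for u in payload["urls"]))
    return web.json_response({"scraped_data": list(results)})

app = web.Application()
app.router.add_post("/scrape", handle_scrape)
# web.run_app(app, port=8082)
```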
- Automated Research Pipeline: Streamlines the process of searching, scraping, and summarizing information.
- LLM-Powered Summarization: Leverages Google Gemini models for in-depth and structured report generation.
- Dynamic Model Selection: Chooses between primary and fallback Gemini models based on token limits (see the sketch after this list).
- Content Truncation: Implements a strategy to truncate content if it exceeds model token limits.
- Asynchronous Operations: Both search and scrape servers are built with `aiohttp` for efficient, non-blocking I/O.
- Reddit Integration: Fetches recent discussions from Reddit to supplement web findings.
- User-Friendly Interface: Streamlit app for easy interaction and visualization of results.
- Configurable: Uses environment variables for API keys, server URLs, and other settings.
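The model-selection and truncation behaviour listed above is roughly the following idea; the model names, token limit, and helper below are assumptions for illustration, not the exact logic from `main.py`:

```python
import os
from typing import Tuple

import google.generativeai as genai

genai.configure(api_key=os.getenv("AISTUDIO_API_KEY"))

# Illustrative names and limits; the real values live in main.py.
PRIMARY_MODEL = "gemini-1.5-pro"
FALLBACK_MODEL = "gemini-1.5-flash"
PRIMARY_TOKEN_LIMIT = 1_000_000

def pick_model_and_truncate(prompt: str) -> Tuple[str, str]:
    """Hypothetical helper: choose a model by token count, truncating if needed."""
    tokens = genai.GenerativeModel(PRIMARY_MODEL).count_tokens(prompt).total_tokens
    if tokens <= PRIMARY_TOKEN_LIMIT:
        return PRIMARY_MODEL, prompt
    # Over the primary limit: switch to the fallback model and trim the prompt
    # proportionally so it fits (a crude character-based truncation).
    truncated = prompt[: len(prompt) * PRIMARY_TOKEN_LIMIT // tokens]
    return FALLBACK_MODEL, truncated
```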
- Python 3.8+
- `pip` (Python package installer)
- A running SearXNG instance (or similar search provider) accessible via URL.
- Access to Google AI Studio API (for Gemini models).
- Reddit API credentials (Client ID, Client Secret, User Agent).
- Clone the repository (if applicable) or ensure all files (`main.py`, `scraper.py`, `search.py`, `.env`) are in the same directory.
- Create and activate a virtual environment:

  ```bash
  python -m venv scrape_env

  # On Windows
  scrape_env\Scripts\activate

  # On macOS/Linux
  source scrape_env/bin/activate
  ```
- Install dependencies: Create a `requirements.txt` file with the following content:

  ```text
  pandas
  requests
  streamlit
  python-dotenv
  google-generativeai
  praw
  aiohttp
  beautifulsoup4
  fake-useragent
  pyppeteer
  pyppeteer-stealth
  ```

  Then run:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: Create a `.env` file in the root directory of the project with the following content, replacing placeholder values with your actual credentials and URLs:

  ```env
  SEARCH_API_URL="http://localhost:8081/search"
  SCRAPER_API_URL="http://localhost:8082/scrape"
  AISTUDIO_API_KEY="YOUR_GOOGLE_AISTUDIO_API_KEY"
  REDDIT_CLIENT_ID="YOUR_REDDIT_CLIENT_ID"
  REDDIT_CLIENT_SECRET="YOUR_REDDIT_CLIENT_SECRET"
  REDDIT_USER_AGENT="YOUR_REDDIT_USER_AGENT_STRING (e.g., DeepSearchAgent/0.1 by YourUsername)"

  # Optional: If your SearXNG instance is not at the default used in search.py
  # SEARX_INSTANCE="YOUR_SEARXNG_INSTANCE_URL/search"

  # Optional: If your Chrome executable is not found automatically by pyppeteer
  # CHROME_EXECUTABLE="PATH_TO_YOUR_CHROME_EXECUTABLE"
  ```
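The scripts read these values through `python-dotenv`; a minimal sketch of the typical loading pattern (the actual code in `main.py` may differ in detail):

```python
import os
from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

SEARCH_API_URL = os.getenv("SEARCH_API_URL", "http://localhost:8081/search")
SCRAPER_API_URL = os.getenv("SCRAPER_API_URL", "http://localhost:8082/scrape")
AISTUDIO_API_KEY = os.getenv("AISTUDIO_API_KEY")
```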
You need to run the three components in separate terminal windows. Ensure your virtual environment is activated in each terminal.
- Start the Search Server (`search.py`):

  ```bash
  python search.py
  ```

  This will typically start the server on `http://0.0.0.0:8081`.

- Start the Scraper Server (`scraper.py`):

  ```bash
  python scraper.py
  ```

  This will typically start the server on `http://0.0.0.0:8082`.

- Start the Main Streamlit Application (`main.py`):

  ```bash
  streamlit run main.py
  ```

  This will open the DeepSearch application in your web browser, usually at `http://localhost:8501`.
Once all three components are running, you can use the Streamlit interface to enter a research topic and start the process.
The two backend servers expose simple JSON APIs.

Search server (`search.py`):

- Endpoint: `POST /search`
- Request Body (JSON):

  ```json
  {
    "query": "your search topic",
    "max_pages": 10,
    "query_page_concurrency": 32
  }
  ```

  `max_pages` is optional and defaults to 100; `query_page_concurrency` is optional and defaults to 32.

- Response (JSON): A list of search result objects, each containing `url`, `title`, `description`, and `original_page_no`.
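For example, the search API can be exercised directly with `requests` (the query text and timeout values here are illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:8081/search",
    json={"query": "open source vector databases", "max_pages": 5},
    timeout=120,
)
resp.raise_for_status()
for result in resp.json():
    print(result["original_page_no"], result["title"], result["url"])
```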
Scraper server (`scraper.py`):

- Endpoint: `POST /scrape`
- Request Body (JSON):

  ```json
  {
    "urls": ["url1", "url2", ...],
    "timeout": 15,
    "concurrent_requests": 10
  }
  ```

  `timeout` is optional and defaults to 15 seconds; `concurrent_requests` is optional and defaults to (CPU cores * 5).

- Response (JSON):

  ```json
  {
    "scraped_data": [
      {"url": "url1", "title": "Page Title", "text_content": "Scraped text...", "error": null},
      ...
    ]
  }
  ```
Key configurations are managed through the `.env` file:

- `SEARCH_API_URL`: URL for the search server.
- `SCRAPER_API_URL`: URL for the scraper server.
- `AISTUDIO_API_KEY`: Your API key for Google AI Studio (Gemini).
- `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`: Credentials for accessing the Reddit API.
- `SEARX_INSTANCE` (in `search.py`, can be set via env): The base URL of your SearXNG instance.
- `CHROME_EXECUTABLE` (in `search.py`, can be set via env): Path to your Chrome/Chromium executable if not found automatically.