This project implements a multi-stage DeepSearch agent that automates gathering, scraping, and summarizing information from the web and Reddit for a given topic.
DeepSearch performs the following steps:
- Search: Takes a user-defined topic and uses a SearXNG-compatible search API (`search.py`) to find relevant web pages.
- Scrape: Fetches the content from the URLs identified in the search phase using a dedicated scraping API (`scraper.py`).
- Summarize: Uses Google's Gemini large language models to generate a comprehensive report based on the scraped web content and Reddit mentions. The report includes an executive summary and a detailed breakdown covering core identity, public profile, news, affiliations, sentiment, and Reddit presence.
- Reddit Integration: Fetches recent Reddit posts related to the topic to provide insight into public discussion on the platform (a minimal sketch follows below).
The main interface is a Streamlit application (`main.py`) that orchestrates these steps and displays the results.
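As a rough illustration of the Reddit step, fetching recent posts with `praw` might look like the sketch below; the helper name and search parameters are assumptions for illustration, not code from `main.py`:

```python
import os
import praw

def fetch_recent_reddit_posts(topic: str, limit: int = 10):
    """Hypothetical helper: return recent Reddit submissions mentioning `topic`."""
    reddit = praw.Reddit(
        client_id=os.getenv("REDDIT_CLIENT_ID"),
        client_secret=os.getenv("REDDIT_CLIENT_SECRET"),
        user_agent=os.getenv("REDDIT_USER_AGENT"),
    )
    # Search all of Reddit for the topic, newest first (parameters are illustrative).
    posts = []
    for submission in reddit.subreddit("all").search(topic, sort="new", limit=limit):
        posts.append({
            "title": submission.title,
            "url": f"https://www.reddit.com{submission.permalink}",
            "created_utc": submission.created_utc,
        })
    return posts
```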
The project consists of three main Python scripts:
- `main.py`:
  - A Streamlit web application that serves as the user interface and orchestrator for the research pipeline.
  - Manages user input, calls the search and scrape APIs, invokes the LLM for summarization, and fetches Reddit data.
  - Displays search results, scraped content, the final summary, and Reddit posts.
- `search.py`:
  - An `aiohttp` server that provides a search API.
  - Uses `pyppeteer` to interact with a SearXNG instance (or a compatible search engine) to perform searches.
  - Designed for asynchronous and parallel page fetching to speed up search result gathering.
  - Endpoint: `POST /search`
- `scraper.py`:
  - An `aiohttp` server that provides a content scraping API.
  - Uses `aiohttp` for asynchronous fetching of multiple URLs and `BeautifulSoup` for parsing HTML content (see the sketch after this list).
  - Extracts titles and textual content from web pages.
  - Endpoint: `POST /scrape`
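To make the division of labour concrete, a stripped-down sketch of what the scrape endpoint could look like with `aiohttp` and `BeautifulSoup` is shown below; the handler name, timeout handling, and error reporting are simplified assumptions rather than the actual `scraper.py` implementation:

```python
import asyncio
import aiohttp
from aiohttp import web
from bs4 import BeautifulSoup

async def fetch_one(session: aiohttp.ClientSession, url: str) -> dict:
    """Fetch a single URL and extract its title and visible text."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
            html = await resp.text()
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else ""
        text = soup.get_text(separator=" ", strip=True)
        return {"url": url, "title": title, "text_content": text, "error": None}
    except Exception as exc:
        return {"url": url, "title": None, "text_content": None, "error": str(exc)}

async def handle_scrape(request: web.Request) -> web.Response:
    """POST /scrape — fetch all requested URLs concurrently."""
    payload = await request.json()
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch_one(session, u) for u in payload["urls"]))
    return web.json_response({"scraped_data": list(results)})

app = web.Application()
app.router.add_post("/scrape", handle_scrape)
# web.run_app(app, port=8082)
```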
- Automated Research Pipeline: Streamlines the process of searching, scraping, and summarizing information.
- LLM-Powered Summarization: Leverages Google Gemini models for in-depth and structured report generation.
- Dynamic Model Selection: Chooses between primary and fallback Gemini models based on token limits (see the sketch after this list).
- Content Truncation: Implements a strategy to truncate content if it exceeds model token limits.
- Asynchronous Operations: Both search and scrape servers are built with `aiohttp` for efficient, non-blocking I/O.
- Reddit Integration: Fetches recent discussions from Reddit to supplement web findings.
- User-Friendly Interface: Streamlit app for easy interaction and visualization of results.
- Configurable: Uses environment variables for API keys, server URLs, and other settings.
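The model-selection and truncation behaviour listed above is roughly the following idea; the model names, token limit, and helper below are assumptions for illustration, not the exact logic from `main.py`:

```python
import os
from typing import Tuple

import google.generativeai as genai

genai.configure(api_key=os.getenv("AISTUDIO_API_KEY"))

# Illustrative names and limits; the real values live in main.py.
PRIMARY_MODEL = "gemini-1.5-pro"
FALLBACK_MODEL = "gemini-1.5-flash"
PRIMARY_TOKEN_LIMIT = 1_000_000

def pick_model_and_truncate(prompt: str) -> Tuple[str, str]:
    """Hypothetical helper: choose a model by token count, truncating if needed."""
    tokens = genai.GenerativeModel(PRIMARY_MODEL).count_tokens(prompt).total_tokens
    if tokens <= PRIMARY_TOKEN_LIMIT:
        return PRIMARY_MODEL, prompt
    # Over the primary limit: switch to the fallback model and trim the prompt
    # proportionally so it fits (a crude character-based truncation).
    truncated = prompt[: len(prompt) * PRIMARY_TOKEN_LIMIT // tokens]
    return FALLBACK_MODEL, truncated
```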
- Python 3.8+
- `pip` (Python package installer)
- A running SearXNG instance (or similar search provider) accessible via URL.
- Access to Google AI Studio API (for Gemini models).
- Reddit API credentials (Client ID, Client Secret, User Agent).
- Clone the repository (if applicable) or ensure all files (`main.py`, `scraper.py`, `search.py`, `.env`) are in the same directory.
- Create and activate a virtual environment:

  ```bash
  python -m venv scrape_env

  # On Windows
  scrape_env\Scripts\activate

  # On macOS/Linux
  source scrape_env/bin/activate
  ```
- Install dependencies: Create a `requirements.txt` file with the following content:

  ```text
  pandas
  requests
  streamlit
  python-dotenv
  google-generativeai
  praw
  aiohttp
  beautifulsoup4
  fake-useragent
  pyppeteer
  pyppeteer-stealth
  ```

  Then run:

  ```bash
  pip install -r requirements.txt
  ```
- Set up environment variables: Create a `.env` file in the root directory of the project with the following content, replacing placeholder values with your actual credentials and URLs:

  ```env
  SEARCH_API_URL="http://localhost:8081/search"
  SCRAPER_API_URL="http://localhost:8082/scrape"
  AISTUDIO_API_KEY="YOUR_GOOGLE_AISTUDIO_API_KEY"
  REDDIT_CLIENT_ID="YOUR_REDDIT_CLIENT_ID"
  REDDIT_CLIENT_SECRET="YOUR_REDDIT_CLIENT_SECRET"
  REDDIT_USER_AGENT="YOUR_REDDIT_USER_AGENT_STRING (e.g., DeepSearchAgent/0.1 by YourUsername)"

  # Optional: If your SearXNG instance is not at the default used in search.py
  # SEARX_INSTANCE="YOUR_SEARXNG_INSTANCE_URL/search"

  # Optional: If your Chrome executable is not found automatically by pyppeteer
  # CHROME_EXECUTABLE="PATH_TO_YOUR_CHROME_EXECUTABLE"
  ```
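The scripts read these values through `python-dotenv`; a minimal sketch of the typical loading pattern (the actual code in `main.py` may differ in detail):

```python
import os
from dotenv import load_dotenv

# Load variables from the .env file into the process environment.
load_dotenv()

SEARCH_API_URL = os.getenv("SEARCH_API_URL", "http://localhost:8081/search")
SCRAPER_API_URL = os.getenv("SCRAPER_API_URL", "http://localhost:8082/scrape")
AISTUDIO_API_KEY = os.getenv("AISTUDIO_API_KEY")
```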
You need to run the three components in separate terminal windows. Ensure your virtual environment is activated in each terminal.
- Start the Search Server (`search.py`):

  ```bash
  python search.py
  ```

  This will typically start the server on `http://0.0.0.0:8081`.

- Start the Scraper Server (`scraper.py`):

  ```bash
  python scraper.py
  ```

  This will typically start the server on `http://0.0.0.0:8082`.

- Start the Main Streamlit Application (`main.py`):

  ```bash
  streamlit run main.py
  ```

  This will open the DeepSearch application in your web browser, usually at `http://localhost:8501`.
Once all three components are running, you can use the Streamlit interface to enter a research topic and start the process.
The two backend servers expose simple JSON APIs.

Search server (`search.py`):

- Endpoint: `POST /search`
- Request Body (JSON):

  ```json
  {
    "query": "your search topic",
    "max_pages": 10,
    "query_page_concurrency": 32
  }
  ```

  `max_pages` is optional and defaults to 100; `query_page_concurrency` is optional and defaults to 32.

- Response (JSON): A list of search result objects, each containing `url`, `title`, `description`, and `original_page_no`.
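For example, the search API can be exercised directly with `requests` (the query text and timeout values here are illustrative):

```python
import requests

resp = requests.post(
    "http://localhost:8081/search",
    json={"query": "open source vector databases", "max_pages": 5},
    timeout=120,
)
resp.raise_for_status()
for result in resp.json():
    print(result["original_page_no"], result["title"], result["url"])
```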
Scraper server (`scraper.py`):

- Endpoint: `POST /scrape`
- Request Body (JSON):

  ```json
  {
    "urls": ["url1", "url2", ...],
    "timeout": 15,
    "concurrent_requests": 10
  }
  ```

  `timeout` is optional and defaults to 15 seconds; `concurrent_requests` is optional and defaults to (CPU cores * 5).

- Response (JSON):

  ```json
  {
    "scraped_data": [
      {"url": "url1", "title": "Page Title", "text_content": "Scraped text...", "error": null},
      ...
    ]
  }
  ```
Key configurations are managed through the `.env` file:

- `SEARCH_API_URL`: URL for the search server.
- `SCRAPER_API_URL`: URL for the scraper server.
- `AISTUDIO_API_KEY`: Your API key for Google AI Studio (Gemini).
- `REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`: Credentials for accessing the Reddit API.
- `SEARX_INSTANCE` (in `search.py`, can be set via env): The base URL of your SearXNG instance.
- `CHROME_EXECUTABLE` (in `search.py`, can be set via env): Path to your Chrome/Chromium executable if not found automatically.