DeepSearch Project

This project implements a multi-stage DeepSearch agent that automates gathering, scraping, and summarizing information from the web and Reddit for a given topic.

Project Overview

The DeepSearch agent performs the following steps:

  1. Search: Takes a user-defined topic and uses a SearXNG-compatible search API (search.py) to find relevant web pages.
  2. Scrape: Fetches the content from the URLs identified in the search phase using a dedicated scraping API (scraper.py).
  3. Summarize: Utilizes Google's Gemini large language models to generate a comprehensive report based on the scraped web content and Reddit mentions. The report includes an executive summary and a detailed breakdown covering core identity, public profile, news, affiliations, sentiment, and Reddit presence.
  4. Reddit Integration: Fetches recent Reddit posts related to the topic to provide insights into public discussion on the platform.

The main interface is a Streamlit application (main.py) that orchestrates these steps and displays the results.
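As a rough illustration, the pipeline can be thought of as the following sequence of calls. This is a minimal sketch that assumes the local server URLs from the .env example later in this README; the Gemini model name and prompt text are illustrative, not necessarily the exact ones used in main.py:

    # Minimal pipeline sketch (server URLs taken from .env; model name and prompt are illustrative).
    import os
    import requests
    import google.generativeai as genai
    from dotenv import load_dotenv

    load_dotenv()
    topic = "Example Corp"

    # 1. Search: ask the search server for relevant result URLs.
    results = requests.post(os.environ["SEARCH_API_URL"],
                            json={"query": topic, "max_pages": 10}).json()
    urls = [r["url"] for r in results]

    # 2. Scrape: fetch page text for those URLs.
    scraped = requests.post(os.environ["SCRAPER_API_URL"],
                            json={"urls": urls}).json()["scraped_data"]

    # 3. Summarize: hand the scraped text to a Gemini model.
    genai.configure(api_key=os.environ["AISTUDIO_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")  # illustrative model name
    corpus = "\n\n".join(p["text_content"] for p in scraped if not p.get("error"))
    report = model.generate_content(f"Write a research report on {topic}:\n\n{corpus}")
    print(report.text)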

Components

The project consists of three main Python scripts:

  1. main.py:

    • A Streamlit web application that serves as the user interface and orchestrator for the research pipeline.
    • Manages user input, calls the search and scrape APIs, invokes the LLM for summarization, and fetches Reddit data.
    • Displays search results, scraped content, the final summary, and Reddit posts.
  2. search.py:

    • An aiohttp server that provides a search API.
    • Uses pyppeteer to interact with a SearXNG instance (or a compatible search engine) to perform searches.
    • Designed for asynchronous and parallel page fetching to speed up search result gathering.
    • Endpoint: POST /search
  3. scraper.py:

    • An aiohttp server that provides a content scraping API.
    • Uses aiohttp for asynchronous fetching of multiple URLs and BeautifulSoup for parsing HTML content.
    • Extracts titles and textual content from web pages.
    • Endpoint: POST /scrape

Features

  • Automated Research Pipeline: Streamlines the process of searching, scraping, and summarizing information.
  • LLM-Powered Summarization: Leverages Google Gemini models for in-depth and structured report generation.
  • Dynamic Model Selection: Chooses between primary and fallback Gemini models based on token limits.
  • Content Truncation: Truncates content that exceeds model token limits (a rough sketch of both behaviours follows this list).
  • Asynchronous Operations: Both search and scrape servers are built with aiohttp for efficient, non-blocking I/O.
  • Reddit Integration: Fetches recent discussions from Reddit to supplement web findings.
  • User-Friendly Interface: Streamlit app for easy interaction and visualization of results.
  • Configurable: Uses environment variables for API keys, server URLs, and other settings.
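The model-selection and truncation behaviour might look roughly like the sketch below; the model names, token limits, and characters-per-token heuristic are assumptions for illustration, not a transcription of main.py:

    # Rough sketch of token-limit handling (model names and limits are assumed values).
    import os
    import google.generativeai as genai

    genai.configure(api_key=os.environ["AISTUDIO_API_KEY"])

    PRIMARY = ("gemini-1.5-pro", 128_000)       # (model name, assumed token limit)
    FALLBACK = ("gemini-1.5-flash", 1_000_000)

    def pick_model_and_truncate(prompt: str):
        # Prefer the primary model; fall back if the prompt exceeds its limit.
        for name, limit in (PRIMARY, FALLBACK):
            model = genai.GenerativeModel(name)
            if model.count_tokens(prompt).total_tokens <= limit:
                return model, prompt
        # Neither model fits: crudely truncate by characters as a last resort.
        model = genai.GenerativeModel(FALLBACK[0])
        return model, prompt[: FALLBACK[1] * 3]  # rough ~3 characters-per-token heuristic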

Setup

Prerequisites

  • Python 3.8+
  • pip (Python package installer)
  • A running SearXNG instance (or similar search provider) accessible via URL.
  • Access to Google AI Studio API (for Gemini models).
  • Reddit API credentials (Client ID, Client Secret, User Agent).

Installation

  1. Clone the repository (if applicable) or ensure all files (main.py, scraper.py, search.py, .env) are in the same directory.

  2. Create and activate a virtual environment:

    python -m venv scrape_env
    # On Windows
    scrape_env\Scripts\activate
    # On macOS/Linux
    source scrape_env/bin/activate
  3. Install dependencies: Create a requirements.txt file with the following content:

    # filepath: requirements.txt
    pandas
    requests
    streamlit
    python-dotenv
    google-generativeai
    praw
    aiohttp
    beautifulsoup4
    fake-useragent
    pyppeteer
    pyppeteer-stealth
    

    Then run:

    pip install -r requirements.txt
  4. Set up environment variables: Create a .env file in the root directory of the project with the following content, replacing placeholder values with your actual credentials and URLs:

    # filepath: .env
    SEARCH_API_URL="http://localhost:8081/search"
    SCRAPER_API_URL="http://localhost:8082/scrape"
    AISTUDIO_API_KEY="YOUR_GOOGLE_AISTUDIO_API_KEY"
    REDDIT_CLIENT_ID="YOUR_REDDIT_CLIENT_ID"
    REDDIT_CLIENT_SECRET="YOUR_REDDIT_CLIENT_SECRET"
    REDDIT_USER_AGENT="YOUR_REDDIT_USER_AGENT_STRING (e.g., DeepSearchAgent/0.1 by YourUsername)"
    
    # Optional: If your SearXNG instance is not at the default used in search.py
    # SEARX_INSTANCE="YOUR_SEARXNG_INSTANCE_URL/search"
    
    # Optional: If your Chrome executable is not found automatically by pyppeteer
    # CHROME_EXECUTABLE="PATH_TO_YOUR_CHROME_EXECUTABLE"

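The application loads these settings via python-dotenv (it is listed in requirements.txt). A minimal sketch of how they might be consumed, assuming the variable names above (the subreddit search query is illustrative):

    # Sketch of reading the .env settings and building the Reddit client.
    import os
    import praw
    import google.generativeai as genai
    from dotenv import load_dotenv

    load_dotenv()

    SEARCH_API_URL = os.getenv("SEARCH_API_URL", "http://localhost:8081/search")
    SCRAPER_API_URL = os.getenv("SCRAPER_API_URL", "http://localhost:8082/scrape")

    genai.configure(api_key=os.environ["AISTUDIO_API_KEY"])

    reddit = praw.Reddit(
        client_id=os.environ["REDDIT_CLIENT_ID"],
        client_secret=os.environ["REDDIT_CLIENT_SECRET"],
        user_agent=os.environ["REDDIT_USER_AGENT"],
    )

    # Fetch recent Reddit posts mentioning the topic (illustrative query).
    for post in reddit.subreddit("all").search("Example Corp", sort="new", limit=10):
        print(post.title, post.permalink)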
Running the Application

You need to run the three components in separate terminal windows. Ensure your virtual environment is activated in each terminal.

  1. Start the Search Server (search.py):

    python search.py

    This will typically start the server on http://0.0.0.0:8081.

  2. Start the Scraper Server (scraper.py):

    python scraper.py

    This will typically start the server on http://0.0.0.0:8082.

  3. Start the Main Streamlit Application (main.py):

    streamlit run main.py

    This will open the DeepSearch application in your web browser, usually at http://localhost:8501.

Once all three components are running, you can use the Streamlit interface to enter a research topic and start the process.

API Endpoints

Search API (search.py)

  • Endpoint: POST /search
  • Request Body (JSON):
    {
        "query": "your search topic",
        "max_pages": 10, // Optional, defaults to 100
        "query_page_concurrency": 32 // Optional, defaults to 32
    }
  • Response (JSON): A list of search result objects, each containing url, title, description, and original_page_no.
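For example, the search endpoint can be exercised from Python with the requests library (the query value is a placeholder; field names follow the request/response description above):

    # Call the search API (assumes search.py is running locally on port 8081).
    import requests

    resp = requests.post(
        "http://localhost:8081/search",
        json={"query": "Example Corp", "max_pages": 10},
        timeout=120,
    )
    resp.raise_for_status()
    for result in resp.json():
        print(result["original_page_no"], result["title"], result["url"])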

Scraper API (scraper.py)

  • Endpoint: POST /scrape
  • Request Body (JSON):
    {
        "urls": ["url1", "url2", ...],
        "timeout": 15, // Optional, defaults to 15 seconds
        "concurrent_requests": 10 // Optional, defaults to (CPU cores * 5)
    }
  • Response (JSON):
    {
        "scraped_data": [
            {"url": "url1", "title": "Page Title", "text_content": "Scraped text...", "error": null},
            ...
        ]
    }
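A matching call to the scraper endpoint, feeding it URLs such as those returned by the search API (the URLs below are placeholders):

    # Call the scraper API (assumes scraper.py is running locally on port 8082).
    import requests

    resp = requests.post(
        "http://localhost:8082/scrape",
        json={"urls": ["https://example.com/a", "https://example.com/b"], "timeout": 15},
        timeout=120,
    )
    resp.raise_for_status()
    for page in resp.json()["scraped_data"]:
        if page["error"]:
            print("failed:", page["url"], page["error"])
        else:
            print(page["title"], "-", len(page["text_content"]), "characters scraped")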

Configuration

Key configurations are managed through the .env file:

  • SEARCH_API_URL: URL for the search server.
  • SCRAPER_API_URL: URL for the scraper server.
  • AISTUDIO_API_KEY: Your API key for Google AI Studio (Gemini).
  • REDDIT_CLIENT_ID, REDDIT_CLIENT_SECRET, REDDIT_USER_AGENT: Credentials for accessing the Reddit API.
  • SEARX_INSTANCE (in search.py, can be set via env): The base URL of your SearXNG instance.
  • CHROME_EXECUTABLE (in search.py, can be set via env): Path to your Chrome/Chromium executable if not found automatically.

About

Search for a person or company across the internet for free, gather hundreds of results, generate a report, and track social media (such as Reddit) for mentions.
