Welcome to the Mega Advanced AI Browser Agent, a sophisticated autonomous agent designed to navigate and interact with the web to achieve complex objectives using the power of large language models.
This isn't just a simple automation script. It's a powerful framework that provides rich, real-time visual feedback directly within the browser, simulating a human-like interaction flow. The agent analyzes web pages, decides on the best course of action, and executes it, all while showing you its thought process through an elegant in-browser UI.
This project includes more than 100 advanced features, making it a robust platform for web automation.
- Natural Language Control: Give the agent complex objectives in plain English.
- AI Decision Making: Utilizes a large language model to analyze the screen and decide the next best action.
- Streaming AI Responses: The agent's thoughts and decisions are streamed in real-time for better observability.
- Confidence Scoring: The AI provides a confidence score for each decision it makes.
- Human-like Cursor Movement: A custom-rendered cursor moves smoothly between elements with realistic, multi-step animations.
- Clean Chat Interface: A modern, minimal speech bubble UI displays AI thoughts, analysis, and status updates.
- AI Avatar & Typing Indicator: An AI avatar and typing indicator create a more intuitive user experience.
- Live Element Annotation: Screenshots are automatically annotated with numbered labels on all interactive elements.
- Dynamic Progress Indicators: Professional progress rings and status bars show the agent's current state.
- Comprehensive Action Library: Supports over 20 actions, including NAVIGATE, CLICK, TYPE, SCROLL, HOVER, EXECUTE_JS, and more.
- Robust Element Detection: Identifies all interactive elements on a page, including those within iframes.
- Multi-Strategy Interaction: Uses a cascade of strategies (e.g., WebDriver click, JavaScript click) to ensure successful interactions.
- Auto-Detection: Automatically finds the most relevant input field for a given task, like a search bar.
- Advanced Error Handling & Recovery: The agent is designed to be resilient, with multiple retry mechanisms and failure detection.
- SQLite Database Logging: Every action and session detail is logged to a local SQLite database for analysis.
- Professional HTML Reports: Automatically generate a comprehensive HTML report at the end of each session with stats, timelines, and screenshots.
- Action-Level Screenshots: A screenshot is saved for every single action, whether it succeeds or fails, providing a complete visual audit trail.
- Email Notifications: Can be configured to automatically email session reports.
- Detailed Session Analytics: Tracks success rates, actions per minute, total duration, and other key performance indicators.
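As an illustration of the logging and analytics features listed above, here is a minimal sketch of writing actions to SQLite and deriving a success rate and actions-per-minute figure from them. The table layout, column names, and the functions `open_log`, `log_action`, and `session_stats` are assumptions made for this example and may not match the agent's actual schema.

```python
import sqlite3
import time


def open_log(path=":memory:"):
    """Open (or create) an action log. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS actions (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               session_id TEXT NOT NULL,
               name TEXT NOT NULL,
               success INTEGER NOT NULL,
               timestamp REAL NOT NULL
           )"""
    )
    return conn


def log_action(conn, session_id, name, success, timestamp=None):
    """Record one action; the timestamp defaults to the current time."""
    ts = time.time() if timestamp is None else timestamp
    conn.execute(
        "INSERT INTO actions (session_id, name, success, timestamp) "
        "VALUES (?, ?, ?, ?)",
        (session_id, name, int(success), ts),
    )
    conn.commit()


def session_stats(conn, session_id):
    """Compute total actions, success rate, duration, and actions per minute."""
    total, succeeded, first, last = conn.execute(
        "SELECT COUNT(*), COALESCE(SUM(success), 0), "
        "MIN(timestamp), MAX(timestamp) "
        "FROM actions WHERE session_id = ?",
        (session_id,),
    ).fetchone()
    if total == 0:
        return {"total": 0, "success_rate": 0.0,
                "duration_s": 0.0, "actions_per_minute": 0.0}
    duration = last - first
    # A single action (or identical timestamps) has no measurable rate.
    per_minute = total / (duration / 60) if duration > 0 else 0.0
    return {"total": total, "success_rate": succeeded / total,
            "duration_s": duration, "actions_per_minute": per_minute}
```

Passing a file path such as one under data/ instead of ":memory:" would persist the log between runs.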
The agent operates in a continuous loop, observing the screen, thinking, and acting.
- User Objective: You provide a high-level goal, like "Go to Google, search for the latest AI news, and summarize the top result."
- Observe: The agent captures the current state of the web page.
- Analyze & Annotate: It identifies all interactive elements (buttons, links, inputs) and generates a screenshot, drawing numbered labels over each element.
- Think: The annotated screenshot, the objective, and the history of past actions are sent to the AI model. The AI analyzes the visual information and returns a structured JSON response containing its thought process and the next action to take (e.g., {"action": {"name": "TYPE", "parameters": {"id": 5, "text": "latest AI news"}}}).
- Act: The agent parses the AI's decision and executes the specified action using Selenium WebDriver. The custom UI (cursor, bubbles) provides visual feedback on this action.
- Log & Repeat: The result of the action is logged to the database. The loop repeats until the objective is marked as complete by the AI.
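The loop above can be sketched roughly as follows. This is a simplified illustration, not the agent's actual code: `parse_decision`, `run_loop`, and `DecisionError` are hypothetical names, the browser and model are replaced by plain callables, and the "DONE" action name stands in for however the AI really marks the objective as complete.

```python
import json


class DecisionError(ValueError):
    """Raised when the model's reply is not a usable decision."""


def parse_decision(raw):
    """Extract (action name, parameters, thought) from the model's JSON reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise DecisionError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise DecisionError("reply must be a JSON object")
    action = data.get("action")
    if not isinstance(action, dict) or not isinstance(action.get("name"), str):
        raise DecisionError("reply is missing action.name")
    params = action.get("parameters") or {}
    if not isinstance(params, dict):
        raise DecisionError("action.parameters must be an object")
    return action["name"].upper(), params, data.get("thought", "")


def run_loop(observe, think, handlers, max_steps=20):
    """Observe, think, act, and log until the model signals completion.

    observe() returns the current page state, think(state, history) returns
    the model's raw JSON reply, and handlers maps action names to callables
    that take the parameters dict and return a result dict.
    """
    history = []
    for _ in range(max_steps):
        state = observe()
        name, params, thought = parse_decision(think(state, history))
        if name == "DONE":  # hypothetical completion marker
            break
        handler = handlers.get(name)
        if handler is None:
            result = {"success": False, "error": f"unknown action {name}"}
        else:
            result = handler(params)
        history.append({"action": name, "parameters": params,
                        "thought": thought, "result": result})
    return history
```

In a real run, `observe` would capture and annotate a screenshot and each handler would drive Selenium; the `max_steps` cap is an added safeguard against a loop that never completes.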
Follow these steps to get the agent up and running.
- Python 3.8+
- Google Chrome browser installed
git clone https://github.com/Niansuh/Agent.git
cd Agent
# For Windows
python -m venv venv
venv\Scripts\activate
# For macOS/Linux
python3 -m venv venv
source venv/bin/activate
Create a requirements.txt file with the following contents:
requests
python-dotenv
selenium
webdriver-manager
pillow
numpy
opencv-python
pyyaml
pandas
openpyxl
psutil
websocket-client
schedule
Then, install them using pip:
pip install -r requirements.txt
Create a file named .env in the root of the project and add your credentials. The AI settings are required; the email settings are needed only if you want session reports emailed.
# AI Model Configuration (compatible with OpenAI's API format)
API_KEY="sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
API_ENDPOINT_URL="https://your-ai-provider-api.com/v1/chat/completions"
MODEL_NAME="your-chosen-model-name"
# Email Configuration (Optional - for sending reports)
EMAIL_FROM="[email protected]"
EMAIL_USERNAME="[email protected]"
EMAIL_PASSWORD="your-gmail-app-password"
Note: For Gmail, you will need to generate an "App Password" to use here.
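For reference, here is one way these settings could be read and validated at startup. It is a sketch under assumptions: the `Settings` class and `load_settings` function are hypothetical, and it reads from `os.environ`, which python-dotenv's `load_dotenv()` would populate from the .env file if called first.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Variable names taken from the .env example above.
REQUIRED = ("API_KEY", "API_ENDPOINT_URL", "MODEL_NAME")


@dataclass(frozen=True)
class Settings:
    api_key: str
    api_endpoint_url: str
    model_name: str
    email_from: Optional[str] = None
    email_username: Optional[str] = None
    email_password: Optional[str] = None

    @property
    def email_enabled(self) -> bool:
        """Email reports need all three email values to be set."""
        return all((self.email_from, self.email_username, self.email_password))


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to os.environ).

    Raises RuntimeError naming every required variable that is missing or empty.
    """
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        api_key=env["API_KEY"],
        api_endpoint_url=env["API_ENDPOINT_URL"],
        model_name=env["MODEL_NAME"],
        email_from=env.get("EMAIL_FROM") or None,
        email_username=env.get("EMAIL_USERNAME") or None,
        email_password=env.get("EMAIL_PASSWORD") or None,
    )
```

Failing early with a list of missing variables tends to be easier to debug than an authentication error from the API several steps into a session.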
To start the agent, run the main script from your terminal:
python your_script_name.py
The script will initialize, open a Chrome browser window, and you will be prompted to enter an objective in the console.
go to wikipedia.org and search for "Quantum Computing"
open youtube.com, find a channel called "MKBHD", and click on the latest video
navigate to github.com and find trending python repositories
You can also enter special commands at the prompt:
- exit: Shuts down the agent and generates a final report.
- report: Generates and saves an HTML report for the current session.
- stats: Displays the latest session statistics in the console.
- history: Shows a log of the most recent actions taken.
- screenshot: Manually takes and saves an annotated screenshot.
- chat: Runs a short demo of the in-browser chat UI features.
- help: Displays a list of available commands and tips.
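To show how the prompt might tell these commands apart from objectives, here is a small sketch. The function `classify_input` is hypothetical, and treating commands as case-insensitive exact matches is an assumption about the script's behavior.

```python
# Command names taken from the list above.
COMMANDS = {"exit", "report", "stats", "history", "screenshot", "chat", "help"}


def classify_input(line):
    """Return ("command", name), ("objective", text), or ("empty", "")."""
    text = line.strip()
    if not text:
        return ("empty", "")
    if text.lower() in COMMANDS:
        return ("command", text.lower())
    # Anything else is passed to the agent as a natural-language objective.
    return ("objective", text)
```

Matching on the whole line means an objective such as "help me find a recipe" is still treated as an objective rather than the help command.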
.
├── data/ # SQLite database files
├── downloads/ # Files downloaded by the agent
├── logs/ # Detailed .log files for debugging
├── reports/ # Generated HTML session reports
├── screenshots/ # All screenshots taken by the agent
├── .env # Environment variables (API keys, etc.)
├── your_script_name.py # The main Python script
└── README.md # This file
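If the output folders do not exist yet, they could be created with a few lines like the following. This is only a sketch: `ensure_layout` is a hypothetical helper, and the script may already create these directories on its own.

```python
from pathlib import Path

# Output folders from the project structure above.
OUTPUT_DIRS = ("data", "downloads", "logs", "reports", "screenshots")


def ensure_layout(root="."):
    """Create any missing output folders under root and return their paths."""
    base = Path(root)
    paths = []
    for name in OUTPUT_DIRS:
        path = base / name
        # exist_ok makes repeated calls safe.
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths
```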