This project is an intelligent web data extractor that fetches, parses, generates, and stores item data from web pages. It combines elements of web crawling, parsing, data generation, and database management to provide a comprehensive solution for extracting and enriching web data.
- Fetching: Retrieves web pages using proxies to avoid rate limits and blocks.
- Parsing: Extracts relevant item details from the HTML content.
- Generating: Uses GPT-4 to generate detailed item data based on the parsed details.
- Storing: Saves the extracted and generated data into a SQLite database.
- Concurrency: Handles multiple requests concurrently for efficiency.
- Scheduling: Automates periodic data extraction using a scheduler.
- Configuration Management: Manages settings and inputs through configuration files.
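
The fetch, parse, store, and concurrency stages listed above might fit together roughly as follows. This is a minimal sketch rather than the project's actual code: `fetch_page`, `parse_title`, `process`, and `run_pipeline` are hypothetical names, the fetcher is a stub that makes no network request, the `items` table schema is assumed, and the GPT-4 generation step is left out.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url: str) -> str:
    """Stand-in for the real fetcher: returns fake HTML, no network access."""
    return f"<html><h1>Item at {url}</h1></html>"


def parse_title(html: str) -> str:
    """Toy parser: returns the text between the first <h1> and </h1>."""
    start = html.index("<h1>") + len("<h1>")
    return html[start:html.index("</h1>", start)]


def process(url: str) -> tuple[str, str]:
    """Fetch and parse one URL (the generation step is omitted here)."""
    return url, parse_title(fetch_page(url))


def run_pipeline(urls: list[str], db_path: str = ":memory:",
                 max_workers: int = 5) -> sqlite3.Connection:
    """Process URLs concurrently, then write all rows from a single thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(process, urls))  # map() preserves input order

    conn = sqlite3.connect(db_path)
    # Assumed schema: one row per item URL.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO items (url, title) VALUES (?, ?)", rows
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = run_pipeline(["https://item_url", "https://another_item_url"])
    for row in conn.execute("SELECT url, title FROM items ORDER BY url"):
        print(row)
```

Writing to SQLite only after the worker pool has finished keeps the connection on one thread, which sidesteps the default same-thread restriction of `sqlite3` connections.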

main.py
- Main entry point of the script.

fetcher.py
- Handles fetching the item page.

parser.py
- Parses the item details from the HTML content.

generator.py
- Generates the item data using GPT-4.

database.py
- Manages database operations.

utils.py
- Contains utility functions, like loading proxies and configurations.

scheduler.py
- Manages task scheduling.

config.ini
- Configuration file.

proxies.json
- Contains the list of proxies.

items.json
- Contains the list of item URLs.

.env
- Contains environment variables.

To install and set up the project:

- Clone the repository:

      git clone https://github.com/your-username/data-ninja.git
      cd data-ninja

- Create and activate a virtual environment:

      python3 -m venv .venv
      source .venv/bin/activate  # On Windows: .venv\Scripts\activate

- Install the dependencies:

      pip install -r requirements.txt

- Create a .env file and add your OpenAI API key:

      OPENAI_API_KEY=your_openai_api_key_here

The settings are defined in config.ini:

```ini
[DEFAULT]
ProxiesFile = proxies.json
ItemsFile = items.json
MaxWorkers = 5
Retries = 5
BackoffFactor = 2
ScheduleInterval = 1
[DATABASE]
DatabaseFile = items.db
[API]
OpenAIKey = your_openai_api_key_here
```
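
As an illustration of how these settings could be read, here is a minimal sketch using Python's standard `configparser`. The helper names (`load_settings`, `backoff_delays`, `resolve_api_key`) are hypothetical. Two behaviours are assumptions, not something the project confirms: that the delay before retry `n` is `BackoffFactor ** n` seconds, and that the `OPENAI_API_KEY` environment variable takes precedence over `OpenAIKey` in the file.

```python
import configparser
import os

SAMPLE_CONFIG = """\
[DEFAULT]
ProxiesFile = proxies.json
ItemsFile = items.json
MaxWorkers = 5
Retries = 5
BackoffFactor = 2
ScheduleInterval = 1
[DATABASE]
DatabaseFile = items.db
[API]
OpenAIKey = your_openai_api_key_here
"""


def load_settings(text: str) -> dict:
    """Parse the INI text into a dict with typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults = parser["DEFAULT"]
    return {
        "proxies_file": defaults.get("ProxiesFile"),
        "items_file": defaults.get("ItemsFile"),
        "max_workers": defaults.getint("MaxWorkers"),
        "retries": defaults.getint("Retries"),
        "backoff_factor": defaults.getfloat("BackoffFactor"),
        "schedule_interval": defaults.getint("ScheduleInterval"),
        "database_file": parser.get("DATABASE", "DatabaseFile"),
        "openai_key": parser.get("API", "OpenAIKey"),
    }


def backoff_delays(retries: int, backoff_factor: float) -> list[float]:
    """Assumed schedule: wait backoff_factor ** attempt seconds per retry."""
    return [backoff_factor ** attempt for attempt in range(retries)]


def resolve_api_key(settings: dict, env) -> str:
    """Assumed precedence: environment variable first, then config.ini."""
    return env.get("OPENAI_API_KEY") or settings["openai_key"]


if __name__ == "__main__":
    settings = load_settings(SAMPLE_CONFIG)
    print(settings)
    print(backoff_delays(settings["retries"], settings["backoff_factor"]))
    print(resolve_api_key(settings, os.environ))
```

Under that assumed schedule, `Retries = 5` and `BackoffFactor = 2` give waits of 1, 2, 4, 8, and 16 seconds. The unit of `ScheduleInterval` is not stated in the configuration, so the sketch only parses it as an integer.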

proxies.json contains the list of proxies:

    {
      "proxies": [
        "http://your_proxy",
        "http://your_proxy"
      ]
    }

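
One way such a proxy list could be consumed is simple round-robin rotation. The sketch below is an assumption about usage, not the project's confirmed approach: `load_proxies`, `proxy_cycle`, and `as_requests_proxies` are hypothetical helpers, and the scheme-to-URL mapping is the shape accepted by the `proxies` argument of the `requests` library, which the project may or may not use.

```python
import itertools
import json


def load_proxies(text: str) -> list[str]:
    """Read the "proxies" array from the JSON text; fail loudly if it is empty."""
    proxies = json.loads(text).get("proxies", [])
    if not proxies:
        raise ValueError("proxies.json must list at least one proxy")
    return proxies


def proxy_cycle(proxies: list[str]):
    """Endless round-robin iterator over the configured proxies."""
    return itertools.cycle(proxies)


def as_requests_proxies(proxy_url: str) -> dict[str, str]:
    """Route both HTTP and HTTPS traffic through the same proxy."""
    return {"http": proxy_url, "https": proxy_url}


if __name__ == "__main__":
    sample = '{"proxies": ["http://proxy_one", "http://proxy_two"]}'
    rotation = proxy_cycle(load_proxies(sample))
    for _ in range(3):
        print(as_requests_proxies(next(rotation)))
```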
items.json contains the list of item URLs:
```json
{
  "item_urls": [
    "https://item_url",
    "https://another_item_url"
  ]
}
```
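
Before the URLs are handed to the fetcher, it may help to validate and de-duplicate them. The following is a hedged sketch, assuming the file layout shown above; `load_item_urls` is a hypothetical helper and the validation rules (HTTP or HTTPS scheme, non-empty host) are illustrative choices.

```python
import json
from urllib.parse import urlparse


def load_item_urls(text: str) -> list[str]:
    """Return the unique, well-formed URLs from the "item_urls" array, in order."""
    seen = set()
    urls = []
    for url in json.loads(text).get("item_urls", []):
        parsed = urlparse(url)
        # Reject anything that is not an absolute HTTP(S) URL.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"Not a valid HTTP(S) URL: {url!r}")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


if __name__ == "__main__":
    sample = '{"item_urls": ["https://item_url", "https://another_item_url"]}'
    print(load_item_urls(sample))
```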