Data Ninja

Overview

Data Ninja is a web data extractor that fetches item pages, parses item details from the HTML, uses GPT-4 to generate enriched item data from those details, and stores the results in a SQLite database. It combines web crawling, parsing, data generation, and database management in a single pipeline.

Features

Fetching: Retrieves web pages using proxies to avoid rate limits and blocks.
Parsing: Extracts relevant item details from the HTML content.
Generating: Uses GPT-4 to generate detailed item data based on the parsed details.
Storing: Saves the extracted and generated data into a SQLite database.
Concurrency: Handles multiple requests concurrently for efficiency.
Scheduling: Automates periodic data extraction using a scheduler.
Configuration Management: Manages settings and inputs through configuration files.
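
As an illustration of the Storing step, here is a minimal sketch using Python's built-in sqlite3 module. The table layout and the names `open_db` and `save_item` are assumptions made for this example; the actual schema in database.py is not shown in this README.

```python
import json
import sqlite3

# Hypothetical schema: the real table layout in database.py may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS items (
    url TEXT PRIMARY KEY,
    parsed TEXT NOT NULL,
    generated TEXT
)
"""


def open_db(path):
    """Open (or create) the database and make sure the items table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def save_item(conn, url, parsed, generated=None):
    """Insert one item, replacing any earlier row for the same URL."""
    conn.execute(
        "INSERT OR REPLACE INTO items (url, parsed, generated) VALUES (?, ?, ?)",
        (url, json.dumps(parsed), generated),
    )
    conn.commit()


# Demo against an in-memory database so nothing is written to disk.
conn = open_db(":memory:")
save_item(conn, "https://item_url", {"title": "Example"}, "First description")
save_item(conn, "https://item_url", {"title": "Example v2"}, "Updated description")
rows = conn.execute("SELECT url, parsed, generated FROM items").fetchall()
```

Keying the table on the URL means a repeated scheduled run overwrites the earlier row instead of adding a duplicate.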

Project Structure

main.py - Main entry point of the script.
fetcher.py - Handles fetching the item page.
parser.py - Parses the item details from the HTML content.
generator.py - Generates the item data using GPT-4.
database.py - Manages database operations.
utils.py - Contains utility functions, like loading proxies and configurations.
scheduler.py - Manages task scheduling.
config.ini - Configuration file.
proxies.json - Contains the list of proxies.
items.json - Contains the list of item URLs.
.env - Contains environment variables.

Installation

Clone the repository:

```
git clone https://github.com/your-username/data-ninja.git
cd data-ninja
```

Create and activate a virtual environment:

```
python3 -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

Install the dependencies:
```
pip install -r requirements.txt
```
Create a .env file and add your OpenAI API key:
```
OPENAI_API_KEY=your_openai_api_key_here
```
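
This README does not say how the .env file is loaded at runtime. As a rough sketch using only the standard library, the key could be parsed and validated as below. The helper names `parse_env` and `get_openai_key` are hypothetical and not part of the project.

```python
import os


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        values[name.strip()] = value.strip().strip('"').strip("'")
    return values


def get_openai_key(env=None):
    """Return the API key, or raise if it is missing or still the placeholder."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError("OPENAI_API_KEY is not set; add it to your .env file")
    return key


# Demo with inline text instead of a real .env file.
sample = parse_env("# local secrets\nOPENAI_API_KEY=sk-test\n")
```

Failing early on a missing or placeholder key gives a clearer error than a rejected API call later in the run.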

Configuration

config.ini

```ini
[DEFAULT]
ProxiesFile = proxies.json
ItemsFile = items.json
MaxWorkers = 5
Retries = 5
BackoffFactor = 2
ScheduleInterval = 1

[DATABASE]
DatabaseFile = items.db

[API]
OpenAIKey = your_openai_api_key_here
```
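
A possible way to read these settings is Python's built-in configparser, sketched below. The function names are hypothetical. The reading of BackoffFactor as the base of an exponential delay is an assumption: this README does not define how Retries and BackoffFactor are applied, nor the unit of ScheduleInterval.

```python
import configparser

# Inline copy of the settings above, so the sketch runs without a file on disk.
CONFIG_TEXT = """
[DEFAULT]
ProxiesFile = proxies.json
ItemsFile = items.json
MaxWorkers = 5
Retries = 5
BackoffFactor = 2
ScheduleInterval = 1

[DATABASE]
DatabaseFile = items.db
"""


def load_settings(text):
    """Parse config text into a plain dict with typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    defaults = parser["DEFAULT"]
    return {
        "proxies_file": defaults.get("ProxiesFile"),
        "items_file": defaults.get("ItemsFile"),
        "max_workers": defaults.getint("MaxWorkers"),
        "retries": defaults.getint("Retries"),
        "backoff_factor": defaults.getfloat("BackoffFactor"),
        "schedule_interval": defaults.getint("ScheduleInterval"),
        "database_file": parser["DATABASE"]["DatabaseFile"],
    }


def backoff_delays(retries, factor):
    """Assumed policy: wait factor ** attempt seconds before each retry."""
    return [factor ** attempt for attempt in range(retries)]


settings = load_settings(CONFIG_TEXT)
delays = backoff_delays(settings["retries"], settings["backoff_factor"])
```

To read the real file, `parser.read("config.ini")` can replace `read_string`.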

proxies.json

Contains the list of proxies.

```json
{
    "proxies": [
        "http://your_proxy",
        "http://your_proxy"
    ]
}
```
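
One way to consume this file is a round-robin rotation, sketched below. The function names are hypothetical, and the rotation policy is an assumption, since this README does not describe how fetcher.py chooses a proxy. The `as_requests_proxies` helper only applies if the requests library is used, which is also an assumption.

```python
import itertools
import json


def load_proxies(text):
    """Return the proxy list from proxies.json content, rejecting an empty list."""
    proxies = json.loads(text).get("proxies", [])
    if not proxies:
        raise ValueError("proxies.json contains no proxies")
    return proxies


def proxy_rotation(proxies):
    """Yield proxies endlessly in round-robin order."""
    return itertools.cycle(proxies)


def as_requests_proxies(proxy_url):
    """Build the scheme-to-proxy mapping that requests accepts as `proxies`."""
    return {"http": proxy_url, "https": proxy_url}


# Demo with inline JSON instead of reading the file from disk.
proxies = load_proxies('{"proxies": ["http://proxy_a", "http://proxy_b"]}')
rotation = proxy_rotation(proxies)
first_three = [next(rotation) for _ in range(3)]
```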

items.json

Contains the list of item URLs.

```json
{
    "item_urls": [
        "https://item_url",
        "https://another_item_url"
    ]
}
```
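
To show how these URLs might be processed concurrently, here is a minimal sketch built on `concurrent.futures.ThreadPoolExecutor`. The names `load_item_urls` and `process_all` are hypothetical, and `process_item` stands in for the real fetch, parse, generate, and store steps, which are not shown in this README.

```python
import json
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_item_urls(text):
    """Return the list stored under "item_urls" in items.json content."""
    return json.loads(text)["item_urls"]


def process_all(urls, process_item, max_workers=5):
    """Run process_item on every URL in a thread pool.

    Returns a dict mapping each URL to its result, or to the exception it
    raised, so one failing page does not abort the whole batch.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_item, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                results[url] = exc
    return results


# Demo with a stand-in worker that only measures the URL length.
urls = load_item_urls('{"item_urls": ["https://item_url", "https://another_item_url"]}')
results = process_all(urls, process_item=len, max_workers=2)
```

The default of 5 workers mirrors the MaxWorkers value in config.ini.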
