This repository contains a data pipeline that processes cryptocurrency data using Google BigQuery and uploads the processed data to an AWS S3 bucket. The pipeline consists of multiple scripts to handle different parts of the ETL (Extract, Transform, Load) process.
- crypto_bigquery_data_processing.py
  Fetches data from Google BigQuery and writes the results to Parquet files (see the first sketch below).
- crypto_localstack_s3_data_loader.py
  Uploads the Parquet files generated by the BigQuery processing script to an AWS S3 bucket (see the second sketch below).
- crypto_datapipeline_executor.py
  Orchestrates the execution of the crypto_bigquery_data_processing.py and crypto_localstack_s3_data_loader.py scripts.
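To make the first step concrete, here is a minimal sketch of a BigQuery-to-Parquet routine. It is illustrative only, not the repository's implementation: the example query (against a public crypto dataset), the function name, and the output path are placeholders.

```python
# Sketch of the BigQuery-to-Parquet step; the query, dataset, and output path
# are placeholders, not the actual contents of crypto_bigquery_data_processing.py.
from google.cloud import bigquery


def fetch_to_parquet(query: str, output_path: str) -> None:
    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
    client = bigquery.Client()
    # Run the query and load the result set into a pandas DataFrame.
    df = client.query(query).to_dataframe()
    # Write the DataFrame to a Parquet file (requires pyarrow or fastparquet).
    df.to_parquet(output_path, index=False)


if __name__ == "__main__":
    example_query = """
        SELECT block_timestamp, hash, value
        FROM `bigquery-public-data.crypto_ethereum.transactions`
        LIMIT 1000
    """
    fetch_to_parquet(example_query, "crypto_transactions.parquet")
```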
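The upload step likely boils down to a boto3 call against LocalStack's S3 endpoint. Again a hedged sketch: the endpoint URL (LocalStack's default, http://localhost:4566), the bucket name, and the object key are assumptions for illustration.

```python
# Sketch of the S3 upload step against LocalStack; endpoint, bucket, and key
# are illustrative assumptions, not values taken from the repository.
import os

import boto3
from botocore.exceptions import ClientError


def upload_parquet(file_path: str, bucket: str, key: str) -> None:
    s3 = boto3.client(
        "s3",
        # LocalStack's default edge endpoint; a real AWS deployment would omit this.
        endpoint_url=os.environ.get("AWS_ENDPOINT_URL", "http://localhost:4566"),
    )
    try:
        s3.upload_file(file_path, bucket, key)
    except ClientError as err:
        raise RuntimeError(f"Failed to upload {file_path} to s3://{bucket}/{key}") from err


if __name__ == "__main__":
    upload_parquet("crypto_transactions.parquet", "crypto-data", "crypto_transactions.parquet")
```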
- Environment Setup:
  - Make sure you have Python 3.12 installed on your system.
  - Set up a virtual environment for the project to manage dependencies.
  - Ensure the necessary environment variables are set for your AWS access keys.
  - Install Poetry, a dependency management tool:
    pip install poetry
- Clone the Repository:
  - Use the following commands to clone the repository and change into the directory:
    git clone <repository_url>
    cd <repository_directory>
- Install Dependencies:
  - Install project dependencies with Poetry:
    poetry install
  - The key dependencies include:
    - google-cloud-bigquery
    - pandas
    - boto3
    - botocore
- Run the Data Pipeline:
  - To execute the main Python script, first activate the Poetry shell:
    poetry shell
  - Set the environment variable for Google credentials:
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
  - Run the script (a minimal sketch of the orchestration follows this list):
    python crypto_datapipeline_executor.py
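For reference, the orchestration can be as simple as running the two scripts in order and stopping if either fails. This is a sketch of that pattern, not necessarily how crypto_datapipeline_executor.py is written; it may import and call the scripts' functions directly instead of using subprocess.

```python
# Sketch of a sequential orchestrator; the real executor may work differently.
import subprocess
import sys

SCRIPTS = [
    "crypto_bigquery_data_processing.py",
    "crypto_localstack_s3_data_loader.py",
]


def main() -> None:
    for script in SCRIPTS:
        # check=True aborts the pipeline if a step exits with a non-zero status.
        subprocess.run([sys.executable, script], check=True)


if __name__ == "__main__":
    main()
```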
- Set up your environment: Ensure your cloud environment is configured correctly with the necessary credentials and permissions.
- Deploy the Scripts: Depending on your cloud provider, you may need to deploy the scripts to a virtual machine, a container, or a managed service. Ensure the necessary environment variables are set for Google and AWS credentials (a preflight check is sketched after this list).
- Execute the Data Pipeline: Connect to your cloud environment and run the scripts as you would locally:
    python crypto_bigquery_data_processing.py
    python crypto_localstack_s3_data_loader.py
  Or, use the executor script to run both sequentially:
    python crypto_datapipeline_executor.py
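Before launching the pipeline in a cloud environment, a small preflight check can catch missing credentials early. This is a hypothetical helper, not part of the repository; the variable names below are the standard ones read by the Google and AWS client libraries, but the scripts may rely on other configuration mechanisms.

```python
# Hypothetical preflight check for credential environment variables.
import os
import sys

REQUIRED_ENV_VARS = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
```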
All scripts use Python's built-in logging module to log information and errors. Logs include timestamps, log levels, and messages.
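A typical configuration that produces such output looks like the following; this is a sketch of the standard logging pattern, and the scripts' exact format string may differ.

```python
# Sketch of a logging setup with timestamps, log levels, and messages.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Fetched %d rows from BigQuery", 1000)  # example informational log
logger.error("Upload to S3 failed")  # example error log
```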