This repository contains a data pipeline that processes cryptocurrency data using Google BigQuery and uploads the processed data to an AWS S3 bucket. The pipeline consists of multiple scripts to handle different parts of the ETL (Extract, Transform, Load) process.
- crypto_bigquery_data_processing.py
  Fetches data from Google BigQuery and writes the results to Parquet files (see the first sketch below).
- crypto_localstack_s3_data_loader.py
  Uploads the Parquet files generated by the BigQuery processing script to an AWS S3 bucket (see the second sketch below).
- crypto_datapipeline_executor.py
  Orchestrates the execution of the crypto_bigquery_data_processing.py and crypto_localstack_s3_data_loader.py scripts.
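To make the first step concrete, here is a minimal sketch of a BigQuery-to-Parquet routine. It is illustrative only, not the repository's implementation: the example query (against a public crypto dataset), the function name, and the output path are placeholders.

```python
# Sketch of the BigQuery-to-Parquet step; the query, dataset, and output path
# are placeholders, not the actual contents of crypto_bigquery_data_processing.py.
from google.cloud import bigquery


def fetch_to_parquet(query: str, output_path: str) -> None:
    # The client reads credentials from GOOGLE_APPLICATION_CREDENTIALS.
    client = bigquery.Client()
    # Run the query and load the result set into a pandas DataFrame.
    df = client.query(query).to_dataframe()
    # Write the DataFrame to a Parquet file (requires pyarrow or fastparquet).
    df.to_parquet(output_path, index=False)


if __name__ == "__main__":
    example_query = """
        SELECT block_timestamp, hash, value
        FROM `bigquery-public-data.crypto_ethereum.transactions`
        LIMIT 1000
    """
    fetch_to_parquet(example_query, "crypto_transactions.parquet")
```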
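The upload step likely boils down to a boto3 call against LocalStack's S3 endpoint. Again a hedged sketch: the endpoint URL (LocalStack's default, http://localhost:4566), the bucket name, and the object key are assumptions for illustration.

```python
# Sketch of the S3 upload step against LocalStack; endpoint, bucket, and key
# are illustrative assumptions, not values taken from the repository.
import os

import boto3
from botocore.exceptions import ClientError


def upload_parquet(file_path: str, bucket: str, key: str) -> None:
    s3 = boto3.client(
        "s3",
        # LocalStack's default edge endpoint; a real AWS deployment would omit this.
        endpoint_url=os.environ.get("AWS_ENDPOINT_URL", "http://localhost:4566"),
    )
    try:
        s3.upload_file(file_path, bucket, key)
    except ClientError as err:
        raise RuntimeError(f"Failed to upload {file_path} to s3://{bucket}/{key}") from err


if __name__ == "__main__":
    upload_parquet("crypto_transactions.parquet", "crypto-data", "crypto_transactions.parquet")
```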
- Environment Setup:
  - Make sure you have Python 3.12 installed on your system.
  - Set up a virtual environment for the project to manage dependencies.
  - Ensure the necessary environment variables are set for your AWS access keys.
  - Install Poetry, a dependency management tool:
    pip install poetry
- Clone the Repository:
  - Use the following commands to clone the repository and change into the directory:
    git clone <repository_url>
    cd <repository_directory>
- Install Dependencies:
  - Install project dependencies with Poetry:
    poetry install
  - The key dependencies include:
    - google-cloud-bigquery
    - pandas
    - boto3
    - botocore
- Run the Data Pipeline:
  - To execute the main Python script, first activate the Poetry shell:
    poetry shell
  - Set the environment variable for Google credentials:
    export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service-account-key.json
  - Run the script (a minimal sketch of the orchestration follows this list):
    python crypto_datapipeline_executor.py
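For reference, the orchestration can be as simple as running the two scripts in order and stopping if either fails. This is a sketch of that pattern, not necessarily how crypto_datapipeline_executor.py is written; it may import and call the scripts' functions directly instead of using subprocess.

```python
# Sketch of a sequential orchestrator; the real executor may work differently.
import subprocess
import sys

SCRIPTS = [
    "crypto_bigquery_data_processing.py",
    "crypto_localstack_s3_data_loader.py",
]


def main() -> None:
    for script in SCRIPTS:
        # check=True aborts the pipeline if a step exits with a non-zero status.
        subprocess.run([sys.executable, script], check=True)


if __name__ == "__main__":
    main()
```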
- Set up your environment: Ensure your cloud environment is configured correctly with the necessary credentials and permissions.
- Deploy the Scripts: Depending on your cloud provider, you may need to deploy the scripts to a virtual machine, a container, or a managed service. Ensure the necessary environment variables are set for Google and AWS credentials (a preflight check is sketched after this list).
- Execute the Data Pipeline: Connect to your cloud environment and run the scripts as you would locally:
    python crypto_bigquery_data_processing.py
    python crypto_localstack_s3_data_loader.py
  Or, use the executor script to run both sequentially:
    python crypto_datapipeline_executor.py
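Before launching the pipeline in a cloud environment, a small preflight check can catch missing credentials early. This is a hypothetical helper, not part of the repository; the variable names below are the standard ones read by the Google and AWS client libraries, but the scripts may rely on other configuration mechanisms.

```python
# Hypothetical preflight check for credential environment variables.
import os
import sys

REQUIRED_ENV_VARS = [
    "GOOGLE_APPLICATION_CREDENTIALS",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
]

missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
if missing:
    sys.exit(f"Missing required environment variables: {', '.join(missing)}")
```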
All scripts use Python's built-in logging module to log information and errors. Logs include timestamps, log levels, and messages.
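A typical configuration that produces such output looks like the following; this is a sketch of the standard logging pattern, and the scripts' exact format string may differ.

```python
# Sketch of a logging setup with timestamps, log levels, and messages.
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
)

logger = logging.getLogger(__name__)
logger.info("Fetched %d rows from BigQuery", 1000)  # example informational log
logger.error("Upload to S3 failed")  # example error log
```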