This repository contains a Proof of Concept (PoC) for setting up and using DataHub, an open-source metadata platform for data discovery, management, and governance. The PoC demonstrates how to install and configure DataHub using Docker, load sample data, and optionally integrate with local Kafka and Airflow instances for data ingestion and processing.
Before you start, ensure that the Docker daemon is running. You can check that Docker is installed with:
docker --version
and that the daemon is reachable with:
docker info
If Docker is not installed, follow the instructions on the Docker website to install it for your operating system.
Once Docker is installed and running, you can proceed with the steps below to set up and use DataHub.
Prerequisites:
- Docker
- Python 3
To install the required packages, run:
python3 -m pip install --upgrade -r requirements.txt
To deploy DataHub locally with Docker, run:
datahub docker quickstart [--version TEXT (e.g. "v0.9.2")]
To load the sample metadata into DataHub, run:
datahub docker ingest-sample-data
If you want to run the local Airflow instance, go to the local_airflow directory and run:
docker-compose -f docker-compose.yml up -d
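To confirm the Airflow instance is healthy, you could drop a trivial DAG into the folder the compose file mounts for DAGs. The folder path, DAG id, and everything in the sketch below are illustrative assumptions, not files from this repository:

```python
# hypothetical dags/datahub_poc_hello.py -- a minimal DAG to confirm the local Airflow instance runs
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def say_hello():
    print("Hello from the DataHub PoC Airflow instance")


with DAG(
    dag_id="datahub_poc_hello",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # trigger manually from the Airflow UI
    catchup=False,
) as dag:
    PythonOperator(task_id="say_hello", python_callable=say_hello)
```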
If you want to test Kafka ingestion into DataHub, first go to the local_kafka directory and start the local Kafka broker:
docker-compose -f docker-compose.yml up -d
Next, go to the local_datahub/recipes directory and run the Kafka ingestion recipe:
datahub ingest -c kafka_test_recipe.dhub.yaml
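The recipe file defines the Kafka source and the DataHub sink. If you prefer to drive the same ingestion from Python rather than the CLI, DataHub's ingestion framework exposes a Pipeline API; the sketch below is an assumed rough equivalent of the recipe, and the real connection values live in kafka_test_recipe.dhub.yaml:

```python
# hypothetical programmatic equivalent of kafka_test_recipe.dhub.yaml
# (the bootstrap address and DataHub server URL below are assumptions)
from datahub.ingestion.run.pipeline import Pipeline

pipeline = Pipeline.create(
    {
        "source": {
            "type": "kafka",
            "config": {"connection": {"bootstrap": "localhost:49816"}},
        },
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }
)
pipeline.run()
pipeline.raise_from_status()  # fail loudly if ingestion reported errors
```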
Finally, go to the scripts directory and run:
python3 eth_tx.py
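eth_tx.py in this repository is the source of truth for how the test data is produced. As a rough idea of what such a producer looks like, here is a minimal sketch that publishes JSON messages to the transaction topic; the broker address, message fields, and use of kafka-python are assumptions:

```python
# minimal producer sketch, NOT the actual eth_tx.py from this repository
# assumes: pip install kafka-python, broker reachable on localhost:49816, topic "transaction"
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:49816",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for i in range(10):
    message = {"tx_id": i, "value_eth": round(0.1 * i, 3), "ts": time.time()}
    producer.send("transaction", message)  # topic name matches the consumer command below

producer.flush()
producer.close()
```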
To list the Kafka topics on the broker:
docker exec kafka_test_broker \
  kafka-topics --bootstrap-server kafka_test_broker:49816 \
  --list
To read the messages on the transaction topic from the beginning, start a console consumer:
docker exec --interactive --tty kafka_test_broker \
  kafka-console-consumer --bootstrap-server kafka_test_broker:49816 \
  --topic transaction \
  --from-beginning
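If you would rather inspect the topic from Python than from the console consumer, a minimal equivalent (same assumptions about the broker address and the kafka-python package) would be:

```python
# minimal Python alternative to the console consumer above
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "transaction",
    bootstrap_servers="localhost:49816",  # assumed host-mapped address of kafka_test_broker
    auto_offset_reset="earliest",         # read from the beginning, like --from-beginning
    consumer_timeout_ms=10000,            # stop after 10s without new messages
)

for record in consumer:
    print(record.value.decode("utf-8"))
```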