data-eng-taxi-ibis-dagster

NYC Taxi Data Pipeline with Dagster, DuckDB, Spark, and Ibis (a data engineering example)

🚀 Overview

This project demonstrates a modern, backend-agnostic data pipeline using Dagster for orchestration, DuckDB for fast local analytics, Apache Spark for scalable distributed processing, and Ibis for portable data transformations. It ingests raw NYC taxi trip data in Parquet format, persists it in DuckDB, exports a clean Parquet dataset, and runs identical analytics on both DuckDB and Spark backends.


🧭 Philosophy: Why, What, and How

Why

  • Reproducibility: Ensuring all compute engines operate on the same, well-defined data.
  • Portability: Using Ibis for backend-agnostic transformation logic, so you can switch engines with minimal code changes.
  • Observability: Leveraging Dagster for orchestration, lineage, and monitoring.
  • Performance: Combining the speed of DuckDB for local analytics with the scalability of Spark for big data.

What

  • Ingest: Loads all raw Parquet files into a persistent DuckDB database.
  • Export: Writes a clean, unified Parquet file from DuckDB.
  • Analyse: Runs the same Ibis aggregation logic on both DuckDB and Spark.
  • Log: Uses Loguru to report on the size of the exported Parquet file(s).

How

  • Dagster assets define each pipeline step and manage dependencies.
  • Ibis expresses SQL-like logic in Python, portable across engines (see the sketch after this list).
  • DuckDB serves as a fast, local OLAP engine.
  • Spark enables distributed analytics on the exported Parquet.
  • Loguru provides rich, structured logging for pipeline observability.
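
To make that portability concrete, here is a minimal sketch of the pattern: one Ibis expression written once, then executed on both backends. The database path, table name (trips), and Parquet path are illustrative assumptions, and the snippet assumes a recent Ibis version in which the PySpark backend exposes read_parquet.

import ibis
from pyspark.sql import SparkSession

# One transformation, written once against a generic Ibis table expression.
def avg_fare_by_passengers(trips):
    return (
        trips.filter(trips.fare_amount > 50)          # fares over $50
        .group_by("passenger_count")
        .agg(avg_fare=trips.fare_amount.mean())
    )

# DuckDB backend: run against the persisted table (path and table name are assumptions).
duck = ibis.duckdb.connect("taxi.duckdb")
duckdb_result = avg_fare_by_passengers(duck.table("trips")).execute()

# Spark backend: run the same expression over the exported Parquet file.
spark = ibis.pyspark.connect(SparkSession.builder.getOrCreate())
spark_result = avg_fare_by_passengers(spark.read_parquet("trips.parquet")).execute()

Because avg_fare_by_passengers only manipulates an Ibis table expression, swapping engines is a one-line change at the connection: exactly the portability claim above.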

🏗️ Pipeline Steps

  1. Ingest Parquet to DuckDB: Loads all NYC taxi Parquet files into a persistent DuckDB table.
  2. Export DuckDB Table to Parquet: Exports the unified DuckDB table to a single Parquet file, logging the output size. (Steps 1 and 2 are sketched just after this list.)
  3. Analyse with DuckDB: Runs an Ibis query on the DuckDB table to answer this question: "For trips with a fare over $50, what is the average fare by passenger count?"
  4. Analyse with Spark: Runs the same Ibis query on the exported Parquet file using Spark.
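
Steps 1 and 2 reduce to two DuckDB statements. A minimal sketch, assuming the seed directory from the setup section below and illustrative table and file names:

import duckdb

# Persistent database file; the name is an assumption.
con = duckdb.connect("taxi.duckdb")

# Step 1: ingest every raw Parquet file into one persistent table.
con.execute("""
    CREATE OR REPLACE TABLE trips AS
    SELECT * FROM read_parquet('../data-eng-taxi/seeds/*.parquet')
""")

# Step 2: export the unified table as a single clean Parquet file.
con.execute("COPY trips TO 'trips.parquet' (FORMAT PARQUET)")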

📦 Setup & Usage

1. Install Dependencies

uv add dagster dagster-webserver duckdb "ibis-framework[duckdb,pyspark]" loguru pyarrow pandas

2. Prepare Data Directory

Place your NYC taxi Parquet files in ../data-eng-taxi/seeds/.

3. Run the Pipeline

You can run assets individually in a Python session or orchestrate everything via Dagster:

dagster dev

Then visit http://localhost:3000 to materialise assets and view logs.


📝 Example Pipeline Code

See taxi_pipeline_native.py
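
For orientation, here is a stripped-down sketch of how the assets might be wired together. The asset names, the trips table, and the file paths are illustrative assumptions; the real implementation lives in taxi_pipeline_native.py.

import duckdb
import ibis
from dagster import asset

DB_PATH = "taxi.duckdb"        # hypothetical paths; see taxi_pipeline_native.py for the real ones
EXPORT_PATH = "trips.parquet"

@asset
def taxi_duckdb_table() -> None:
    """Ingest raw Parquet files into a persistent DuckDB table."""
    duckdb.connect(DB_PATH).execute(
        "CREATE OR REPLACE TABLE trips AS "
        "SELECT * FROM read_parquet('../data-eng-taxi/seeds/*.parquet')"
    )

@asset(deps=[taxi_duckdb_table])
def taxi_parquet_export() -> None:
    """Export the unified table to a single Parquet file."""
    duckdb.connect(DB_PATH).execute(f"COPY trips TO '{EXPORT_PATH}' (FORMAT PARQUET)")

@asset(deps=[taxi_parquet_export])
def duckdb_analysis():
    """Run the shared Ibis aggregation on the DuckDB backend."""
    trips = ibis.duckdb.connect(DB_PATH).table("trips")
    return (
        trips.filter(trips.fare_amount > 50)
        .group_by("passenger_count")
        .agg(avg_fare=trips.fare_amount.mean())
        .execute()
    )

Materialising duckdb_analysis in the Dagster UI pulls its upstream assets first, which is the dependency behaviour the pipeline relies on.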


📊 Observability & Logging

  • Loguru provides rich, timestamped logs for each pipeline stage.
  • The size of the exported Parquet file is logged for traceability and optimisation (see the snippet after this list).
  • Dagster UI offers run history, asset lineage, and step timing.
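
The file-size log itself is only a couple of lines with Loguru; a sketch, with the export path as an assumption:

from pathlib import Path
from loguru import logger

export_path = Path("trips.parquet")  # hypothetical export location
size_mb = export_path.stat().st_size / 1_048_576
logger.info("Exported {} ({:.1f} MB)", export_path, size_mb)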

🧠 Extending This Pipeline

  • Partition data by month or region for scalable analytics (a partitioning sketch follows this list).
  • Add data quality checks or profiling as new assets.
  • Integrate with cloud storage (S3, GCS) for distributed workflows.
  • Parameterise thresholds, file paths, or aggregation logic for greater flexibility.
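
As a sketch of the first extension, Dagster's built-in partitioning could split the ingest by month. The start date, asset name, and file-naming convention here are assumptions:

from dagster import AssetExecutionContext, MonthlyPartitionsDefinition, asset

monthly = MonthlyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=monthly)
def monthly_taxi_trips(context: AssetExecutionContext) -> None:
    """Ingest only the Parquet file for the month being materialised."""
    month = context.partition_key  # e.g. "2024-01-01"
    context.log.info(f"Processing partition {month}")
    # A hypothetical naming scheme for the monthly seed files:
    # f"../data-eng-taxi/seeds/yellow_tripdata_{month[:7]}.parquet"

Each month then materialises (and backfills) independently in the Dagster UI.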

💡 Why This Pattern?

  • Unified Logic: Write your data transformation once with Ibis, run it anywhere.
  • Reproducibility: Every step, from ingestion to export to analytics, is tracked and repeatable.
  • Scalability: Start local with DuckDB, scale out with Spark, with no code rewrite needed.
  • Transparency: Logging and orchestration provide full visibility into your data flow.


Happy data engineering! 🚕✨
