data-eng-taxi-ibis-dagster

NYC Taxi Data Pipeline with Dagster, DuckDB, Spark, and Ibis (a data engineering example)

🚀 Overview

This project demonstrates a modern, backend-agnostic data pipeline using Dagster for orchestration, DuckDB for fast local analytics, Apache Spark for scalable distributed processing, and Ibis for portable data transformations. It ingests raw NYC taxi trip data in Parquet format, persists it in DuckDB, exports a clean Parquet dataset, and runs identical analytics on both DuckDB and Spark backends.


🧭 Philosophy: Why, What, and How

Why

  • Reproducibility: Ensuring all compute engines operate on the same, well-defined data.
  • Portability: Using Ibis for backend-agnostic transformation logic, so you can switch engines with minimal code changes.
  • Observability: Leveraging Dagster for orchestration, lineage, and monitoring.
  • Performance: Combining the speed of DuckDB for local analytics with the scalability of Spark for big data.

What

  • Ingest: Loads all raw Parquet files into a persistent DuckDB database.
  • Export: Writes a clean, unified Parquet file from DuckDB.
  • Analyse: Runs the same Ibis aggregation logic on both DuckDB and Spark.
  • Log: Uses Loguru to report on the size of the exported Parquet file(s).

How

  • Dagster assets define each pipeline step and manage dependencies.
  • Ibis expresses SQL-like logic in Python, portable across engines (see the sketch after this list).
  • DuckDB serves as a fast, local OLAP engine.
  • Spark enables distributed analytics on the exported Parquet.
  • Loguru provides rich, structured logging for pipeline observability.
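
To make that portability concrete, here is a minimal sketch of the pattern: one Ibis expression written once, then executed on both backends. The database path, table name (trips), and Parquet path are illustrative assumptions, and the snippet assumes a recent Ibis version in which the PySpark backend exposes read_parquet.

import ibis
from pyspark.sql import SparkSession

# One transformation, written once against a generic Ibis table expression.
def avg_fare_by_passengers(trips):
    return (
        trips.filter(trips.fare_amount > 50)          # fares over $50
        .group_by("passenger_count")
        .agg(avg_fare=trips.fare_amount.mean())
    )

# DuckDB backend: run against the persisted table (path and table name are assumptions).
duck = ibis.duckdb.connect("taxi.duckdb")
duckdb_result = avg_fare_by_passengers(duck.table("trips")).execute()

# Spark backend: run the same expression over the exported Parquet file.
spark = ibis.pyspark.connect(SparkSession.builder.getOrCreate())
spark_result = avg_fare_by_passengers(spark.read_parquet("trips.parquet")).execute()

Because avg_fare_by_passengers only manipulates an Ibis table expression, swapping engines is a one-line change at the connection: exactly the portability claim above.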

🏗️ Pipeline Steps

  1. Ingest Parquet to DuckDB: Loads all NYC taxi Parquet files into a persistent DuckDB table.
  2. Export DuckDB Table to Parquet: Exports the unified DuckDB table to a single Parquet file, logging the output size. (Steps 1 and 2 are sketched just after this list.)
  3. Analyse with DuckDB: Runs an Ibis query on the DuckDB table to answer this question: "For trips with a fare over $50, what is the average fare by passenger count?"
  4. Analyse with Spark: Runs the same Ibis query on the exported Parquet file using Spark.
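
Steps 1 and 2 reduce to two DuckDB statements. A minimal sketch, assuming the seed directory from the setup section below and illustrative table and file names:

import duckdb

# Persistent database file; the name is an assumption.
con = duckdb.connect("taxi.duckdb")

# Step 1: ingest every raw Parquet file into one persistent table.
con.execute("""
    CREATE OR REPLACE TABLE trips AS
    SELECT * FROM read_parquet('../data-eng-taxi/seeds/*.parquet')
""")

# Step 2: export the unified table as a single clean Parquet file.
con.execute("COPY trips TO 'trips.parquet' (FORMAT PARQUET)")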

📦 Setup & Usage

1. Install Dependencies

uv add dagster dagster-webserver duckdb "ibis-framework[duckdb,pyspark]" loguru pyarrow pandas

2. Prepare Data Directory

Place your NYC taxi Parquet files in ../data-eng-taxi/seeds/.

3. Run the Pipeline

You can run assets individually in a Python session or orchestrate everything via Dagster:

dagster dev

Then visit http://localhost:3000 to materialise assets and view logs.


📝 Example Pipeline Code

See taxi_pipeline_native.py
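
For orientation, here is a stripped-down sketch of how the assets might be wired together. The asset names, the trips table, and the file paths are illustrative assumptions; the real implementation lives in taxi_pipeline_native.py.

import duckdb
import ibis
from dagster import asset

DB_PATH = "taxi.duckdb"        # hypothetical paths; see taxi_pipeline_native.py for the real ones
EXPORT_PATH = "trips.parquet"

@asset
def taxi_duckdb_table() -> None:
    """Ingest raw Parquet files into a persistent DuckDB table."""
    duckdb.connect(DB_PATH).execute(
        "CREATE OR REPLACE TABLE trips AS "
        "SELECT * FROM read_parquet('../data-eng-taxi/seeds/*.parquet')"
    )

@asset(deps=[taxi_duckdb_table])
def taxi_parquet_export() -> None:
    """Export the unified table to a single Parquet file."""
    duckdb.connect(DB_PATH).execute(f"COPY trips TO '{EXPORT_PATH}' (FORMAT PARQUET)")

@asset(deps=[taxi_parquet_export])
def duckdb_analysis():
    """Run the shared Ibis aggregation on the DuckDB backend."""
    trips = ibis.duckdb.connect(DB_PATH).table("trips")
    return (
        trips.filter(trips.fare_amount > 50)
        .group_by("passenger_count")
        .agg(avg_fare=trips.fare_amount.mean())
        .execute()
    )

Materialising duckdb_analysis in the Dagster UI pulls its upstream assets first, which is the dependency behaviour the pipeline relies on.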


📊 Observability & Logging

  • Loguru provides rich, timestamped logs for each pipeline stage.
  • The size of the exported Parquet file is logged for traceability and optimisation (see the snippet after this list).
  • Dagster UI offers run history, asset lineage, and step timing.
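
The file-size log itself is only a couple of lines with Loguru; a sketch, with the export path as an assumption:

from pathlib import Path
from loguru import logger

export_path = Path("trips.parquet")  # hypothetical export location
size_mb = export_path.stat().st_size / 1_048_576
logger.info("Exported {} ({:.1f} MB)", export_path, size_mb)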

🧠 Extending This Pipeline

  • Partition data by month or region for scalable analytics (a partitioning sketch follows this list).
  • Add data quality checks or profiling as new assets.
  • Integrate with cloud storage (S3, GCS) for distributed workflows.
  • Parameterise thresholds, file paths, or aggregation logic for greater flexibility.
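
As a sketch of the first extension, Dagster's built-in partitioning could split the ingest by month. The start date, asset name, and file-naming convention here are assumptions:

from dagster import AssetExecutionContext, MonthlyPartitionsDefinition, asset

monthly = MonthlyPartitionsDefinition(start_date="2024-01-01")

@asset(partitions_def=monthly)
def monthly_taxi_trips(context: AssetExecutionContext) -> None:
    """Ingest only the Parquet file for the month being materialised."""
    month = context.partition_key  # e.g. "2024-01-01"
    context.log.info(f"Processing partition {month}")
    # A hypothetical naming scheme for the monthly seed files:
    # f"../data-eng-taxi/seeds/yellow_tripdata_{month[:7]}.parquet"

Each month then materialises (and backfills) independently in the Dagster UI.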

💡 Why This Pattern?

  • Unified Logic: Write your data transformation once with Ibis, run it anywhere.
  • Reproducibility: Every step, from ingestion to export to analytics, is tracked and repeatable.
  • Scalability: Start local with DuckDB, scale out with Spark, with no code rewrite needed.
  • Transparency: Logging and orchestration provide full visibility into your data flow.


Happy data engineering! 🚕✨
