Awesome Open Source Data Engineering

A curated list of open source tools used in analytics platforms and the data engineering ecosystem. For more information about the compiled landscape for 2024, please read the published blog post on Substack or Medium.

STORAGE SYSTEMS

Relational DBMS

PostgreSQL - Advanced object-relational database management system
MySQL - One of the most popular open source databases
MariaDB - A popular MySQL server fork
Supabase - An open source Firebase alternative
SQLite - The most popular embedded database engine

Distributed SQL DBMS

Citus - A popular extension that turns PostgreSQL into a distributed database
CockroachDB - A cloud-native distributed SQL database
YugabyteDB - A cloud-native distributed SQL database
TiDB - A cloud-native, distributed, MySQL-Compatible database
OceanBase - A scalable distributed relational database
ShardingSphere - A distributed SQL transaction & query engine
Neon - A serverless open-source alternative to AWS Aurora Postgres
CrateDB - A distributed and scalable PostgreSQL-compatible SQL database

Cache Store

Redis - A popular key-value based cache store
Memcached - A high performance multithreaded key-value cache store
Dragonfly - A modern cache store compatible with Redis and Memcached APIs

In-memory SQL Database

Apache Ignite - A distributed, ACID-compliant in-memory DBMS
ReadySet - A MySQL and Postgres wire-compatible caching layer
VoltDB - A distributed, horizontally-scalable, ACID-compliant database

Document Store

MongoDB - A cross-platform, document-oriented NoSQL database
RavenDB - An ACID NoSQL document database
RethinkDB - A distributed document-oriented database for real-time applications
CouchDB - A scalable document-oriented NoSQL database
Couchbase - A modern cloud-native NoSQL distributed database
FerretDB - A truly Open Source MongoDB alternative!
LowDB - A simple and fast JSON database

NoSQL Multi-model

OrientDB - A multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
ArangoDB - A multi-model database with flexible data models for documents, graphs, and key-values
SurrealDB - A scalable, distributed, collaborative, document-graph database
EdgeDB - A graph-relational database with declarative schema

Graph Database

Neo4j - A leading high performance graph database
JanusGraph - A highly scalable distributed graph database
HugeGraph - A fast and highly scalable graph database
NebulaGraph - A distributed, horizontally scalable, fast open-source graph database
Cayley - Inspired by the graph database behind Google's Knowledge Graph
Dgraph - A horizontally scalable and distributed GraphQL database with a graph backend

Distributed Key-value Store

Riak - A decentralized key-value datastore from Basho Technologies
FoundationDB - A distributed, transactional key-value store from Apple
etcd - A distributed reliable key-value store written in Go
TiKV - A distributed transactional key-value database, originally created to complement TiDB
Immudb - A database with built-in cryptographic proof and verification
Valkey - A distributed key-value datastore forked from Redis

Wide-column Key-value Store

Apache Cassandra - A highly-scalable LSM-Tree based partitioned row store
Apache HBase - A distributed wide column-oriented store modeled after Google's Bigtable
Scylla - An LSM-Tree based wide-column store, API-compatible with Apache Cassandra and Amazon DynamoDB
Apache Accumulo - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop

Embedded Key-value Store

LevelDB - A fast key-value storage library written at Google
RocksDB - An embeddable, persistent key-value store developed by Meta (Facebook)
MyRocks - A RocksDB storage engine for MySQL
BadgerDB - An embeddable, fast key-value database written in pure Go

Search Engine

Apache Solr - A fast distributed search database built on Apache Lucene
Elasticsearch - A distributed, RESTful search engine optimized for speed
Sphinx - A full-text search engine with fast indexing
Meilisearch - A fast search API with great integration support
OpenSearch - A community-driven, open source fork of Elasticsearch and Kibana
Quickwit - A fast cloud-native search engine for observability data
ParadeDB - A search engine built on Postgres

Streaming Database

RisingWave - A scalable Postgres for stream processing, analytics, and management
Materialize - A real-time data warehouse purpose-built for operational workloads
EventStoreDB - An event-native database designed for event sourcing and event-driven architectures
ksqlDB - A database for building stream processing applications on top of Apache Kafka
Timeplus Proton - A streaming SQL engine, fast and lightweight, powered by ClickHouse

Time-Series Database

InfluxDB - A scalable datastore for metrics, events, and real-time analytics
TimescaleDB - A fast ingest time-series SQL database packaged as a PostgreSQL extension
Apache IoTDB - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystems
Netflix Atlas - An in-memory dimensional time series database developed and open sourced by Netflix
QuestDB - A time-series database for fast ingest and SQL queries
TDengine - A high-performance, cloud native time-series database optimized for Internet of Things (IoT)
KairosDB - A scalable time series database written in Java
GreptimeDB - A cloud-native, unified time series database for metrics, logs and events

Columnar OLAP Database

Apache Kudu - A column-oriented data store for the Apache Hadoop ecosystem
Greenplum - A column-oriented massively parallel PostgreSQL for analytics
MonetDB - A high-performance columnar database originally developed by the CWI database research group
Databend - An elastic, workload-aware cloud-native data warehouse built in Rust
ByConity - A cloud-native data warehouse forked from ClickHouse
Hydra - A fast column-oriented Postgres extension

Real-time OLAP Engine

ClickHouse - A real-time column-oriented database originally developed at Yandex
Apache Pinot - A real-time distributed OLAP datastore open sourced by LinkedIn
Apache Druid - A high performance real-time OLAP engine developed and open sourced by Metamarkets
Apache Kylin - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
Apache Doris - A high-performance and real-time analytical database based on MPP architecture
StarRocks - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)

In-process OLAP Engine

DuckDB - An in-process SQL OLAP Database Management System
GlareDB - A SQL database for running analytics across distributed data
Apache DataFusion - An extensible query engine with SQL and Dataframe APIs
chDB - An in-process OLAP SQL Engine powered by ClickHouse

OLAP Extensions

pg_duckdb - A Postgres extension that embeds DuckDB's analytics engine
pg_analytics - A DuckDB-powered analytics extension for Postgres

DATA LAKE PLATFORM

Distributed File System

Apache Hadoop HDFS - A highly scalable distributed block-based file system
GlusterFS - A scalable distributed storage that can scale to several petabytes
JuiceFS - A distributed POSIX file system built on top of Redis and S3
Lustre - A distributed parallel file system purpose-built to provide a global POSIX-compliant namespace

Distributed Object Store

Apache Ozone - A scalable, redundant, and distributed object store for Apache Hadoop
Ceph - A distributed object, block, and file storage platform
MinIO - A high performance object store that is API-compatible with Amazon S3
Garage - An S3-compatible distributed object storage designed for self-hosting at a small-to-medium scale

Serialisation Framework

Apache Parquet - An efficient columnar binary storage format that supports nested data
Apache Avro - An efficient and fast row-based binary serialisation framework
Apache ORC - A self-describing type-aware columnar file format designed for Hadoop
Lance - A modern columnar data format for ML and LLMs implemented in Rust
Vortex - A highly extensible and fast columnar file format

Open Table Format

Apache Hudi - An open table format designed to support incremental data ingestion on cloud and Hadoop
Apache Iceberg - A high-performance table format for large analytic tables developed at Netflix
Delta Lake - A storage framework for building Lakehouse architecture developed by Databricks
Apache Paimon - An Apache incubating project to support streaming high-speed data ingestion
Apache XTable - A unified framework supporting interoperability across multiple open-source table formats
OpenHouse - A declarative catalog with data services for open Data Lakehouse formats

Native Open Table Format Library

Delta-rs - A native Rust library for Delta Lake, with bindings into Python
PyIceberg - A native Python library for interacting with Iceberg table format
Hudi-rs - A native Rust library for Apache Hudi, with bindings into Python

DATA INTEGRATION

Data Integration Platform

Airbyte - A data integration platform for ETL / ELT data pipelines with a wide range of connectors
Apache NiFi - A reliable, scalable low-code data integration platform with good enterprise support
Apache Camel - An embeddable integration framework supporting many enterprise integration patterns
Apache Gobblin - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
Apache InLong - An integration framework for supporting massive data, originally built at Tencent
Meltano - A declarative code-first data integration engine
Apache SeaTunnel - A high-performance, distributed data integration tool supporting various ingestion patterns
Estuary Flow - A real-time ETL and data pipeline platform for quick data integration
dlt - A lightweight data integration library for Python-first data platforms

CDC Tool

Debezium - A change data capture framework supporting a variety of databases
Kafka Connect - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
Redpanda Connect - A data streaming and integration framework on top of Redpanda
Flink CDC Connectors - CDC Connectors for Apache Flink engine supporting different databases
Brooklin - A distributed platform for streaming data between various heterogeneous source and destination systems
RudderStack - A headless Customer Data Platform to build data pipelines, open alternative to Segment
Artie Transfer - A real-time CDC replication solution between OLTP and OLAP databases
Dozer - A real-time CDC based data integration tool between various sources and sinks
PeerDB - A CDC tool to replicate data from Postgres to data warehouses, queues and other storage

Data Migration

Dbmate - A lightweight, framework-agnostic database migration tool
Ingestr - A CLI tool to copy data between any databases with a single command
Sling - A CLI tool to transfer data from a source to target storage/database

Log & Event Collection

CloudQuery - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
Snowplow - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
EventMesh - A serverless event middleware for collecting and loading event data into various targets
Apache Flume - A scalable distributed log aggregation service
Steampipe - A zero-ETL solution for getting data directly from APIs and services
Jitsu - A fully-scriptable data ingestion engine for collecting event data

Event Hub

Apache Kafka - A highly scalable distributed event store and streaming platform
NSQ - A realtime distributed messaging platform designed to operate at scale
Apache Pulsar - A scalable distributed pub-sub messaging system
Apache RocketMQ - A cloud native messaging and streaming platform
Redpanda - A high performance Kafka API compatible streaming data platform
Memphis - A scalable data streaming platform for building event-driven applications

Reverse ETL

Multiwoven - A Reverse ETL open source alternative to Hightouch and RudderStack

DATA PROCESSING AND COMPUTATION

Unified Processing

Apache Beam - A unified programming model supporting execution on popular distributed processing backends
Apache Spark - A unified analytics engine for large-scale data processing
Dinky - A unified streaming & batch computation platform based on Apache Flink

Batch Processing

Hadoop MapReduce - A highly scalable distributed batch processing framework from Apache Hadoop project
Apache Tez - A distributed data processing pipeline built for Apache Hive and Hadoop

Stream Processing

Apache Flink - A scalable high throughput stream processing framework
Apache Samza - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
Apache Storm - A distributed realtime computation system based on Actor Model framework
Benthos - A high performance declarative stream processing engine
Akka - A highly concurrent, distributed, message-driven processing system based on Actor Model
Bytewax - A Python stream processing framework with a Rust distributed processing engine
Timeplus Proton - A streaming SQL engine, fast and lightweight, powered by ClickHouse
FastStream - A Python framework for interacting with event streams such as Apache Kafka
Bento - A stream processing engine from WarpStream Labs

Python Processing Framework

Polars - A multithreaded Dataframe with vectorized query engine, written in Rust
PySpark - An interface for Apache Spark in Python
Vaex - A high performance Python library for big tabular datasets
Apache Arrow - An efficient in-memory data format
Ibis - A portable Python dataframe library supporting many engine backends
SQLFrame - A Spark DataFrame API compatible library for data transformation

Python Workflow Scaling

Dask - A flexible parallel computing library with task scheduling
Ray - A unified framework with distributed runtime for scaling Python applications
Modin - A library for scaling Pandas workflows to multi-threaded execution
Pandaral·lel - A library to parallelize Pandas operations on all available CPUs

SQL Toolkit

SQLAlchemy - A Python SQL toolkit and Object Relational Mapper
SQLGlot - A Python SQL parser and transpiler

WORKFLOW MANAGEMENT & DATAOPS

Workflow Orchestration

Apache Airflow - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
Prefect - A Python based workflow orchestration tool
Argo - A container-native workflow engine for orchestrating parallel jobs on Kubernetes
Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
Cadence - A distributed, scalable, highly available orchestration engine with client libraries for different languages
Dagster - A cloud-native data pipeline orchestrator written in Python
Apache DolphinScheduler - A low-code high performance workflow orchestration platform
Luigi - A Python library for building complex pipelines of batch jobs
Flyte - A scalable and flexible workflow orchestration platform for both data and ML workloads
Kestra - A declarative language-agnostic workflow orchestration and scheduling platform
Mage.ai - A platform for integrating, scheduling and managing data pipelines
Temporal - A resilient workflow management system, originated as a fork of Uber's Cadence
Windmill - A fast workflow engine, and open-source alternative to Airplane and Retool
Maestro - A general-purpose workflow orchestrator developed by Netflix

Job Scheduling

Celery - A distributed Task Queue system for Python
Dkron - A distributed, fault tolerant job scheduling system
APScheduler - An advanced task scheduler and task queue system for Python

Data Quality

Data-diff - A tool for comparing tables within or across databases
Great Expectations - A data validation and profiling tool written in Python
Deequ - A library based on Apache Spark for measuring data quality in large datasets
Pandera - A light-weight, flexible, and expressive statistical data testing library
Soda - A CLI tool and Python library for data quality testing

Data Versioning

lakeFS - A version control system for data stored in data lakes
Project Nessie - A transactional Catalog for Data Lakes with Git-like semantics
DVC - A data version control tool for data and ML experiments

Data Modeling

dbt - A data modeling and transformation tool for data pipelines
SQLMesh - A data transformation and modeling framework that is backwards compatible with dbt

Pipeline Observability

Elementary - A dbt-native data observability solution to monitor data pipelines

DATA INFRASTRUCTURE

Resource Scheduling

Apache YARN - The default Resource Scheduler for Apache Hadoop clusters
Apache Mesos - A resource scheduling and cluster resource abstraction framework developed by Ph.D. students at UC Berkeley
Kubernetes - A production-grade container scheduling and management tool
Docker - The popular OS-level virtualization and containerization software

Cluster Administration

Apache Ambari - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters
Apache Helix - A generic cluster management framework developed at LinkedIn

Security

Apache Knox - A gateway and SSO service for managing access to Hadoop clusters
Apache Ranger - A security and governance platform for Hadoop and other popular services
Kerberos - A popular enterprise network authentication protocol

Metrics Store

InfluxDB - A scalable datastore for metrics and events
Mimir - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
OpenTSDB - A distributed, scalable Time Series Database written on top of Apache HBase
M3 - A distributed TSDB and metrics storage and aggregator

Observability Framework

Prometheus - A popular metric collection and management tool
ELK - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
Graphite - An established infrastructure monitoring and observability system
OpenTelemetry - A collection of APIs, SDKs, and tools for managing and monitoring metrics
VictoriaMetrics - A scalable monitoring solution with a time series database
Zabbix - A real-time infrastructure and application monitoring service

Monitoring Dashboard

Grafana - A popular open and composable observability and data visualization platform
Kibana - The visualisation and search dashboard for Elasticsearch
RConsole - A UI for monitoring and managing Apache Kafka and Redpanda workloads

Log & Metrics Pipeline

Fluentd - A metric collection, buffering and router service
Fluent Bit - A fast log processor and forwarder, and part of the Fluentd ecosystem
Logstash - A server-side log and metric transport and processor, as part of the ELK stack
Telegraf - A plugin-driven server agent for collecting & reporting metrics developed by InfluxData
Vector - A high-performance, end-to-end (agent & aggregator) observability data pipeline
StatsD - A network daemon for collection, aggregation and routing of metrics

METADATA MANAGEMENT

Metadata Platform

Amundsen - A data discovery and metadata engine developed by Lyft engineers
Apache Atlas - A data observability platform for the Apache Hadoop ecosystem
DataHub - A metadata platform for the modern data stack developed at LinkedIn
Marquez - A metadata service for the collection, aggregation, and visualization of metadata
CKAN - A data management system for cataloging, managing and accessing data
OpenMetadata - A unified platform for discovery and governance, using a central metadata repository
ODD Platform - A data discovery and observability platform

Open Standards

OpenLineage - An open standard for lineage metadata collection
OpenMetadata - A unified metadata platform providing open standards for managing metadata
Egeria - Open metadata and governance standards to facilitate metadata exchange

Schema & Catalog Service

Hive Metastore - A popular schema management and metastore service as part of the Apache Hive project
Confluent Schema Registry - A schema registry for Kafka, developed by Confluent
Apache Polaris - An interoperable, open source catalog for Apache Iceberg
Unity Catalog - A universal catalog for Data Lakehouse formats and other data/AI assets
Lakekeeper - A Rust native Apache Iceberg REST Catalog

ANALYTICS & VISUALISATION

BI & Dashboard

Apache Superset - A popular open source data visualization and data exploration platform
Metabase - A simple data visualisation and exploration dashboard
Redash - A tool to explore, query, visualize, and share data with many data source connectors
Lightdash - A self-service BI tool that turns a dbt project into a full-stack BI platform

BI as Code (Web App)

Streamlit - A Python tool to package and share data as web apps
Evidence - A tool to build interactive data visualizations in pure SQL and markdown
Dash - A Python framework for building ML & data science web apps
Vizro - A toolkit for creating modular data visualization applications
Mercury - A tool to convert Jupyter Notebooks to web apps
Quary - A code-based BI solution

Query & Collaboration

Hue - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera
Apache Zeppelin - A web-based notebook for interactive data analytics and collaboration for Hadoop
Querybook - A simple query and notebook UI developed by Pinterest
Jupyter - A popular interactive web-based notebook application
Datasette - A tool for exploring and publishing data

MPP Query Engine

Apache Hive - A data warehousing and MPP engine on top of Hadoop
Apache Impala - An MPP engine mainly for Hadoop clusters, developed by Cloudera
Presto - A distributed SQL query engine for big data
Trino - The former PrestoSQL distributed SQL query engine
Apache Drill - A distributed MPP query engine against NoSQL and Hadoop data storage systems

Semantic & Middleware Layer

Alluxio - A data orchestration and virtual distributed storage system
Cube - A semantic layer for building data applications supporting popular database engines
Apache Linkis - A computation middleware to facilitate connection and orchestration between applications and data engines
Apache Gluten - A middle layer for offloading JVM-based SQL engines execution to native engines

Data Sharing

delta-sharing - An open protocol for secure real-time exchange of large datasets

ML/AI PLATFORM

Vector Storage

Milvus - A cloud-native vector database, storage for AI applications
Qdrant - A high-performance, scalable vector database for AI
Chroma - An AI-native embedding database for building LLM apps
Marqo - An end-to-end vector search engine for both text and images
LanceDB - A serverless vector database for AI applications written in Rust
Weaviate - A scalable, cloud-native vector database supporting storage of both objects and vectors
Deep Lake - An AI database with a storage format optimized for deep-learning applications
Vespa - A storage to organize vectors, tensors, text and structured data
Vald - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine
pgvector - A Postgres extension for vector similarity search

MLOps

MLflow - A platform to streamline machine learning development and lifecycle management
Metaflow - A tool to build and manage ML/AI and data science projects, developed at Netflix
SkyPilot - A framework for running LLMs, AI, and batch jobs on any cloud
Jina - A tool to build multimodal AI applications with cloud-native stack
NNI - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft
BentoML - A framework for building reliable and scalable AI applications
Determined AI - An ML platform that simplifies distributed training, tuning and experiment tracking
Ray - A unified framework for scaling AI and Python applications
Kubeflow - A cloud-native platform for ML operations - pipelines, training and deployment
Haystack - An AI orchestration framework to build customizable, production-ready LLM applications
Kedro - A toolbox and framework for building production-ready data science and ML workflows
Pachyderm - A scalable ML and Data Science data processing workflow management platform
Superduper - A Python-based framework for building AI-data workflows and applications


pracdata/awesome-open-source-data-engineering
