A curated list of open source tools used in analytics platforms and the data engineering ecosystem. For more information about the above compiled landscape for 2024, please read the published blog post on Substack or Medium.
- Storage Systems
- Data Lake Platform
- Data Integration
- Data Processing & Computation
- Workflow Management & DataOps
- Data Infrastructure
- Metadata Management
- Analytics & Visualisation
- ML/AI Platform
- PostgreSQL - Advanced object-relational database management system
- MySQL - One of the most popular open source databases
- MariaDB - A popular MySQL server fork
- Supabase - An open source Firebase alternative
- SQLite - The most popular embedded database engine
- Citus - A popular extension that turns PostgreSQL into a distributed database
- CockroachDB - A cloud-native distributed SQL database
- YugabyteDB - A cloud-native distributed SQL database
- TiDB - A cloud-native, distributed, MySQL-Compatible database
- OceanBase - A scalable distributed relational database
- ShardingSphere - A Distributed SQL transaction & query engine
- Neon - A serverless open-source alternative to AWS Aurora Postgres
- CrateDB - A distributed and scalable PostgreSQL-compatible SQL database
- Redis - A popular key-value based cache store
- Memcached - A high performance multithreaded key-value cache store
- Dragonfly - A modern cache store compatible with Redis and Memcached APIs
- Apache Ignite - A distributed, ACID-compliant in-memory DBMS
- ReadySet - A MySQL and Postgres wire-compatible caching layer
- VoltDB - A distributed, horizontally-scalable, ACID-compliant database
- MongoDB - A cross-platform, document-oriented NoSQL database
- RavenDB - An ACID NoSQL document database
- RethinkDB - A distributed document-oriented database for real-time applications
- CouchDB - A scalable document-oriented NoSQL database
- Couchbase - A modern cloud-native NoSQL distributed database
- FerretDB - A truly Open Source MongoDB alternative!
- LowDB - A simple and fast JSON database
- OrientDB - A multi-model DBMS supporting Graph, Document, Reactive, Full-Text and Geospatial models
- ArangoDB - A multi-model database with flexible data models for documents, graphs, and key-values
- SurrealDB - A scalable, distributed, collaborative, document-graph database
- EdgeDB - A graph-relational database with declarative schema
- Neo4j - A high performance leading graph database
- JanusGraph - A highly scalable distributed graph database
- HugeGraph - A fast-speed and highly-scalable graph database
- NebulaGraph - A distributed, horizontally scalable, fast open-source graph database
- Cayley - Inspired by the graph database behind Google's Knowledge Graph
- Dgraph - A horizontally scalable and distributed GraphQL database with a graph backend
- Riak - A decentralized key-value datastore from Basho Technologies
- FoundationDB - A distributed, transactional key-value store from Apple
- etcd - A distributed reliable key-value store written in Go
- TiKV - A distributed transactional key-value database, originally created to complement TiDB
- Immudb - A database with built-in cryptographic proof and verification
- Valkey - A distributed key-value datastore forked from Redis
- Apache Cassandra - A highly-scalable LSM-Tree based partitioned row store
- Apache HBase - A distributed wide column-oriented store modeled after Google's Bigtable
- Scylla - An LSM-Tree based wide-column store, API-compatible with Apache Cassandra and Amazon DynamoDB
- Apache Accumulo - A distributed key-value store with scalable data storage and retrieval, on top of Hadoop
- LevelDB - A fast key-value storage library written at Google
- RocksDB - An embeddable, persistent key-value store developed by Meta (Facebook)
- MyRocks - A RocksDB storage engine for MySQL
- BadgerDB - An embeddable, fast key-value database written in pure Go
- Apache Solr - A fast distributed search database built on Apache Lucene
- Elasticsearch - A distributed, RESTful search engine optimized for speed
- Sphinx - A full-text search engine with fast indexing
- Meilisearch - A fast search API with great integration support
- OpenSearch - A community-driven, open source fork of Elasticsearch and Kibana
- Quickwit - A fast cloud-native search engine for observability data
- ParadeDB - A search engine built on Postgres
- RisingWave - A scalable Postgres for stream processing, analytics, and management
- Materialize - A real-time data warehouse purpose-built for operational workloads
- EventStoreDB - An event-native database designed for event sourcing and event-driven architectures
- KsqlDB - A database for building stream processing applications on top of Apache Kafka
- Timeplus Proton - A streaming SQL engine, fast and lightweight, powered by ClickHouse
- InfluxDB - A scalable datastore for metrics, events, and real-time analytics
- TimescaleDB - A fast ingest time-series SQL database packaged as a PostgreSQL extension
- Apache IoTDB - An Internet of Things database with seamless integration with the Hadoop and Spark ecosystems
- Netflix Atlas - An in-memory dimensional time series database developed and open sourced by Netflix
- QuestDB - A time-series database for fast ingest and SQL queries
- TDengine - A high-performance, cloud native time-series database optimized for Internet of Things (IoT)
- KairosDB - A scalable time series database written in Java
- GreptimeDB - A cloud-native, unified time series database for metrics, logs and events
- Apache Kudu - A column-oriented data store for the Apache Hadoop ecosystem
- Greenplum - A column-oriented massively parallel PostgreSQL for analytics
- MonetDB - A high-performance columnar database originally developed by the CWI database research group
- Databend - An elastic, workload-aware cloud-native data warehouse built in Rust
- ByConity - A cloud-native data warehouse forked from ClickHouse
- hydra - A fast column-oriented Postgres extension
- ClickHouse - A real-time column-oriented database originally developed at Yandex
- Apache Pinot - A real-time distributed OLAP datastore open sourced by LinkedIn
- Apache Druid - A high performance real-time OLAP engine developed and open sourced by Metamarkets
- Apache Kylin - A distributed OLAP engine designed to provide multi-dimensional analysis on Hadoop
- Apache Doris - A high-performance and real-time analytical database based on MPP architecture
- StarRocks - A sub-second OLAP database supporting multi-dimensional analytics (Linux Foundation project)
- DuckDB - An in-process SQL OLAP Database Management System
- GlareDB - A SQL database for running analytics across distributed data
- Apache DataFusion - An extensible query engine with SQL and Dataframe APIs
- chdb - An in-process OLAP SQL Engine powered by ClickHouse
- pg_duckdb - A Postgres extension that embeds DuckDB's analytics engine
- pg_analytics - A DuckDB-powered analytics extension for Postgres
- Apache Hadoop HDFS - A highly scalable distributed block-based file system
- GlusterFS - A scalable distributed storage that can scale to several petabytes
- JuiceFS - A distributed POSIX file system built on top of Redis and S3
- Lustre - A distributed parallel file system purpose-built to provide a global POSIX-compliant namespace
- Apache Ozone - A scalable, redundant, and distributed object store for Apache Hadoop
- Ceph - A distributed object, block, and file storage platform
- MinIO - A high performance object store that is API-compatible with Amazon S3
- Garage - An S3-compatible distributed object storage designed for self-hosting at a small-to-medium scale
- Apache Parquet - An efficient columnar binary storage format that supports nested data
- Apache Avro - An efficient and fast row-based binary serialisation framework
- Apache ORC - A self-describing type-aware columnar file format designed for Hadoop
- Lance - A modern columnar data format for ML and LLMs implemented in Rust
- Vortex - A highly extensible and fast columnar file format
- Apache Hudi - An open table format designed to support incremental data ingestion on cloud and Hadoop
- Apache Iceberg - A high-performance table format for large analytic tables developed at Netflix
- Delta Lake - A storage framework for building Lakehouse architecture developed by Databricks
- Apache Paimon - An Apache incubating project to support high-speed streaming data ingestion
- Apache XTable - A unified framework supporting interoperability across multiple open-source table formats
- OpenHouse - A declarative catalog with data services for open Data Lakehouse formats
- Delta-rs - A native Rust library for Delta Lake, with bindings into Python
- PyIceberg - A native Python library for interacting with Iceberg table format
- Hudi-rs - A native Rust library for Apache Hudi, with bindings into Python
- Airbyte - A data integration platform for ETL / ELT data pipelines with a wide range of connectors
- Apache NiFi - A reliable, scalable low-code data integration platform with good enterprise support
- Apache Camel - An embeddable integration framework supporting many enterprise integration patterns
- Apache Gobblin - A distributed data integration framework built by LinkedIn supporting both streaming and batch data
- Apache InLong - An integration framework for supporting massive data, originally built at Tencent
- Meltano - A declarative code-first data integration engine
- Apache SeaTunnel - A high-performance, distributed data integration tool supporting various ingestion patterns
- Estuary Flow - A real-time ETL and data pipeline platform for quick data integration
- dlt - A lightweight data integration library for Python-first data platforms
- Debezium - A change data capture framework supporting a variety of databases
- Kafka Connect - A streaming data integration framework and runtime on top of Apache Kafka supporting CDC
- Redpanda Connect - A data streaming and integration framework on top of Redpanda
- Flink CDC Connectors - CDC Connectors for Apache Flink engine supporting different databases
- Brooklin - A distributed platform for streaming data between various heterogeneous source and destination systems
- RudderStack - A headless Customer Data Platform to build data pipelines, open alternative to Segment
- Artie Transfer - A real-time CDC replication solution between OLTP and OLAP databases
- Dozer - A real-time CDC based data integration tool between various sources and sinks
- PeerDB - A CDC tool to replicate data from Postgres to data warehouses, queues and other storage
- DBmate - A lightweight, framework-agnostic database migration tool
- Ingestr - A CLI tool to copy data between any databases with a single command
- Sling - A CLI tool to transfer data from a source to target storage/database
- CloudQuery - An ETL tool for syncing data from cloud APIs to a variety of supported destinations
- Snowplow - A cloud-native engine for collecting behavioral data and loading it into various cloud storage systems
- EventMesh - A serverless event middleware for collecting and loading event data into various targets
- Apache Flume - A scalable distributed log aggregation service
- Steampipe - A zero-ETL solution for getting data directly from APIs and services
- Jitsu - A fully-scriptable data ingestion engine for collecting event data
- Apache Kafka - A highly scalable distributed event store and streaming platform
- NSQ - A realtime distributed messaging platform designed to operate at scale
- Apache Pulsar - A scalable distributed pub-sub messaging system
- Apache RocketMQ - A cloud-native messaging and streaming platform
- Redpanda - A high performance Kafka API compatible streaming data platform
- Memphis - A scalable data streaming platform for building event-driven applications
- Multiwoven - A Reverse ETL open source alternative to Hightouch and RudderStack
- Apache Beam - A unified programming model supporting execution on popular distributed processing backends
- Apache Spark - A unified analytics engine for large-scale data processing
- Dinky - A unified streaming & batch computation platform based on Apache Flink
- Hadoop MapReduce - A highly scalable distributed batch processing framework from Apache Hadoop project
- Apache Tez - A distributed data processing pipeline built for Apache Hive and Hadoop
- Apache Flink - A scalable high throughput stream processing framework
- Apache Samza - A distributed stream processing framework which uses Kafka and Hadoop, originally developed by LinkedIn
- Apache Storm - A distributed realtime computation system based on Actor Model framework
- Benthos - A high performance declarative stream processing engine
- Akka - A highly concurrent, distributed, message-driven processing system based on Actor Model
- Bytewax - A Python stream processing framework with a Rust distributed processing engine
- Timeplus Proton - A streaming SQL engine, fast and lightweight, powered by ClickHouse
- FastStream - A Python framework for interacting with event streams such as Apache Kafka
- Bento - A stream processing engine from WarpStream Labs
- Polars - A multithreaded Dataframe with vectorized query engine, written in Rust
- PySpark - An interface for Apache Spark in Python
- Vaex - A high performance Python library for big tabular datasets
- Apache Arrow - An efficient in-memory data format
- Ibis - A portable Python dataframe library supporting many engine backends
- SQLFrame - A Spark DataFrame API compatible library for data transformation
- Dask - A flexible parallel computing library with task scheduling
- Ray - A unified framework with distributed runtime for scaling Python applications
- Modin - A library for scaling Pandas workflows to multi-threaded execution
- Pandaral·lel - A library to parallelize Pandas operations on all available CPUs
- SQLAlchemy - A Python SQL toolkit and Object Relational Mapper
- SQLGlot - A Python SQL parser and transpiler
- Apache Airflow - A platform for creating and scheduling workflows as directed acyclic graphs (DAGs) of tasks
- Prefect - A Python based workflow orchestration tool
- Argo - A container-native workflow engine for orchestrating parallel jobs on Kubernetes
- Azkaban - A batch workflow job scheduler created at LinkedIn to run Hadoop jobs
- Cadence - A distributed, scalable, and highly available orchestration engine with client libraries in different languages
- Dagster - A cloud-native data pipeline orchestrator written in Python
- Apache DolphinScheduler - A low-code high performance workflow orchestration platform
- Luigi - A Python library for building complex pipelines of batch jobs
- Flyte - A scalable and flexible workflow orchestration platform for both data and ML workloads
- Kestra - A declarative language-agnostic workflow orchestration and scheduling platform
- Mage.ai - A platform for integrating, scheduling and managing data pipelines
- Temporal - A resilient workflow management system, originated as a fork of Uber's Cadence
- Windmill - A fast workflow engine, and open-source alternative to Airplane and Retool
- Maestro - A general-purpose workflow orchestrator developed by Netflix
- Celery - A distributed Task Queue system for Python
- Dkron - A distributed, fault-tolerant job scheduling system
- APScheduler - An advanced task scheduler and task queue system for Python
- Data-diff - A tool for comparing tables within or across databases
- Great Expectations - A data validation and profiling tool written in Python
- Deequ - A library based on Apache Spark for measuring data quality in large datasets
- Pandera - A light-weight, flexible, and expressive statistical data testing library
- Soda - A CLI tool and Python library for data quality testing
- LakeFS - A data version control for data stored in data lakes
- Project Nessie - A transactional Catalog for Data Lakes with Git-like semantics
- DVC - A data version control tool for data and ML experiments
- dbt - A data modeling and transformation tool for data pipelines
- SQLMesh - A data transformation and modeling framework that is backwards compatible with dbt
- Elementary - A dbt-native data observability solution to monitor data pipelines
- Apache YARN - The default resource scheduler for Apache Hadoop clusters
- Apache Mesos - A resource scheduling and cluster resource abstraction framework developed by Ph.D. students at UC Berkeley
- Kubernetes - A production-grade container scheduling and management tool
- Docker - The popular OS-level virtualization and containerization software
- Apache Ambari - A tool for provisioning, managing, and monitoring of Apache Hadoop clusters
- Apache Helix - A generic cluster management framework developed at LinkedIn
- Apache Knox - A gateway and SSO service for managing access to Hadoop clusters
- Apache Ranger - A security and governance platform for Hadoop and other popular services
- Kerberos - A popular enterprise network authentication protocol
- InfluxDB - A scalable datastore for metrics and events
- Mimir - A scalable long-term metrics storage for Prometheus, developed by Grafana Labs
- OpenTSDB - A distributed, scalable time series database written on top of Apache HBase
- M3 - A distributed TSDB and metrics storage and aggregator
- Prometheus - A popular metric collection and management tool
- ELK - A popular observability stack comprising Elasticsearch, Kibana, Beats, and Logstash
- Graphite - An established infrastructure monitoring and observability system
- OpenTelemetry - A collection of APIs, SDKs, and tools for managing and monitoring metrics
- VictoriaMetrics - A scalable monitoring solution with a time series database
- Zabbix - A real-time infrastructure and application monitoring service
- Grafana - A popular open and composable observability and data visualization platform
- Kibana - The visualisation and search dashboard for Elasticsearch
- RConsole - A UI for monitoring and managing Apache Kafka and Redpanda workloads
- Fluentd - A metric collection, buffering and router service
- Fluent Bit - A fast log processor and forwarder, and part of the Fluentd ecosystem
- Logstash - A server-side log and metric transport and processor, as part of the ELK stack
- Telegraf - A plugin-driven server agent for collecting & reporting metrics developed by InfluxData
- Vector - A high-performance, end-to-end (agent & aggregator) observability data pipeline
- StatsD - A network daemon for collection, aggregation and routing of metrics
- Amundsen - A data discovery and metadata engine developed by Lyft engineers
- Apache Atlas - A data observability platform for Apache Hadoop ecosystem
- DataHub - A metadata platform for the modern data stack developed at LinkedIn
- Marquez - A metadata service for the collection, aggregation, and visualization of metadata
- ckan - A data management system for cataloging, managing and accessing data
- OpenMetadata - A unified platform for discovery and governance, using a central metadata repository
- ODD Platform - A data discovery and observability platform
- OpenLineage - An open standard for lineage metadata collection
- OpenMetadata - A unified metadata platform providing open standards for managing metadata
- Egeria - Open metadata and governance standards to facilitate metadata exchange
- Hive Metastore - A popular schema management and metastore service as part of the Apache Hive project
- Confluent Schema Registry - A schema registry for Kafka, developed by Confluent
- Apache Polaris - An interoperable, open source catalog for Apache Iceberg
- Unity Catalog - A universal catalog for Data Lakehouse formats and other data/AI assets
- Lakekeeper - A Rust native Apache Iceberg REST Catalog
- Apache Superset - A popular open source data visualization and data exploration platform
- Metabase - A simple data visualisation and exploration dashboard
- Redash - A tool to explore, query, visualize, and share data with many data source connectors
- Lightdash - A self-service BI to turn dbt project into a full-stack BI platform
- Streamlit - A Python tool to package and share data as web apps
- Evidence - A tool to build interactive data visualizations in pure SQL and markdown
- dash - A Python framework for building ML & data science web apps
- Vizro - A toolkit for creating modular data visualization applications
- Mercury - A tool to convert Jupyter Notebooks to web apps
- Quary - A code-based BI solution
- Hue - A query and data exploration tool with Hadoop ecosystem support, developed by Cloudera
- Apache Zeppelin - A web-based notebook for interactive data analytics and collaboration for Hadoop
- Querybook - A simple query and notebook UI developed by Pinterest
- Jupyter - A popular interactive web-based notebook application
- Datasette - A tool for exploring and publishing data
- Apache Hive - A data warehousing and MPP engine on top of Hadoop
- Apache Impala - An MPP engine mainly for Hadoop clusters, developed by Cloudera
- Presto - A distributed SQL query engine for big data
- Trino - The former PrestoSQL distributed SQL query engine
- Apache Drill - A distributed MPP query engine against NoSQL and Hadoop data storage systems
- Alluxio - A data orchestration and virtual distributed storage system
- Cube - A semantic layer for building data applications supporting popular database engines
- Apache Linkis - A computation middleware to facilitate connection and orchestration between applications and data engines
- Apache Gluten - A middle layer for offloading JVM-based SQL engines execution to native engines
- delta-sharing - An open protocol for secure real-time exchange of large datasets
- milvus - A cloud-native vector database, storage for AI applications
- qdrant - A high-performance, scalable Vector database for AI
- chroma - An AI-native embedding database for building LLM apps
- marqo - An end-to-end vector search engine for both text and images
- LanceDB - A serverless vector database for AI applications written in Rust
- weaviate - A scalable, cloud-native vector database supporting storage of both objects and vectors
- deeplake - An AI database with a storage format optimized for deep-learning applications
- Vespa - A storage to organize vectors, tensors, text and structured data
- vald - A scalable distributed approximate nearest neighbor (ANN) dense vector search engine
- pgvector - A vector similarity search as a Postgres extension
- mlflow - A platform to streamline machine learning development and lifecycle management
- Metaflow - A tool to build and manage ML/AI, and data science projects, developed at Netflix
- SkyPilot - A framework for running LLMs, AI, and batch jobs on any cloud
- Jina - A tool to build multimodal AI applications with cloud-native stack
- NNI - An AutoML toolkit for automating the machine learning lifecycle, from Microsoft
- BentoML - A framework for building reliable and scalable AI applications
- Determined AI - An ML platform that simplifies distributed training, tuning and experiment tracking
- Ray - A unified framework for scaling AI and Python applications
- kubeflow - A cloud-native platform for ML operations - pipelines, training and deployment
- Haystack - An AI orchestration framework to build customizable, production-ready LLM applications
- Kedro - A toolbox and framework for building production-ready data science and ML workflows
- Pachyderm - A scalable ML and Data Science data processing workflow management platform
- Superduper - A Python-based framework for building AI-data workflows and applications