singer-intro.md

About

An introduction to the Singer ecosystem of data pipeline components for composable open source ETL.

Singer, Meltano, and PipelineWise provide components and integration engines that adhere to the Singer specification. Airbyte follows its own specification, which was heavily inspired by Singer.

On the database integration side, the connectors of Singer and Meltano are based on SQLAlchemy.

Overview

CrateDB

CrateDB is a distributed and scalable SQL database for storing and analyzing massive amounts of data in near real-time, even with complex queries. It is PostgreSQL-compatible, and based on Apache Lucene.

CrateDB offers a Python SQLAlchemy dialect, which lets it plug into the comprehensive Python ecosystems for data science and data wrangling.

Singer

The open-source standard for writing scripts that move data.

Singer is an open source specification and software framework for ETL/ELT data exchange between a range of different systems. For talking to SQL databases, it employs a metadata subsystem based on SQLAlchemy.

Singer components exchange Singer-formatted JSONL messages, following the Singer Spec: taps write them, and targets read them.

The Singer specification was started in 2016 by Stitch Data. It specifies a data transfer format that allows any number of data sources, called taps, to send data to any data destination, called a target. Airbyte was incorporated in 2020 and created its own specification, heavily inspired by Singer. There are differences, but the core of each specification is sending newline-delimited JSON data from the STDOUT of a tap to the STDIN of a target.
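The tap-to-target message flow described above can be sketched in a few lines of Python. This is a minimal illustration rather than a real connector: it uses the SCHEMA, RECORD, and STATE message types from the Singer specification, but the function names (`emit_messages`, `consume_messages`, `run_demo`), the `users` stream, and the `last_id` bookmark are hypothetical, and an in-memory buffer stands in for the STDOUT-to-STDIN pipe.

```python
import io
import json


def emit_messages(stream_name, rows, out):
    """Tap side: write one JSON message per line to `out` (normally STDOUT)."""
    schema_message = {
        "type": "SCHEMA",
        "stream": stream_name,
        "schema": {
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            }
        },
        "key_properties": ["id"],
    }
    out.write(json.dumps(schema_message) + "\n")

    for row in rows:
        record_message = {"type": "RECORD", "stream": stream_name, "record": row}
        out.write(json.dumps(record_message) + "\n")

    # STATE carries a bookmark for resuming later; the layout of "value"
    # is chosen by the tap. "last_id" is an assumption made for this sketch.
    last_id = rows[-1]["id"] if rows else None
    out.write(json.dumps({"type": "STATE", "value": {"last_id": last_id}}) + "\n")


def consume_messages(lines):
    """Target side: parse messages line by line (normally from STDIN)."""
    records = []
    state = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        message = json.loads(line)
        if message["type"] == "RECORD":
            records.append(message["record"])
        elif message["type"] == "STATE":
            state = message["value"]
        # SCHEMA messages are ignored here; a real target would use them
        # to create or validate the destination table.
    return records, state


def run_demo():
    """Pipe the hypothetical tap into the hypothetical target via a buffer."""
    buffer = io.StringIO()
    rows = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    emit_messages("users", rows, buffer)
    return consume_messages(buffer.getvalue().splitlines())


if __name__ == "__main__":
    records, state = run_demo()
    print(records)
    print(state)
```

In a real pipeline, the tap and the target are separate executables connected by a shell pipe, which is what allows any tap to be combined with any target.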

Meltano

Unlock all the data that powers your data platform.

Say goodbye to writing, maintaining, and scaling your own API integrations with Meltano's declarative code-first data integration engine, bringing a number of APIs and DBs to the table.

Meltano builds upon Singer technologies, uses configuration files in YAML syntax instead of JSON, adds an improved SDK and other components, and runs the central add-on registry, Meltano Hub.

PipelineWise

PipelineWise is another data pipeline framework that uses the Singer.io specification to ingest and replicate data from various sources to various destinations. The list of PipelineWise Taps includes a further set of high-quality data source and data sink components.

Data Mill

Data Mill helps organizations utilize modern data infrastructure and data science to power analytics, products, and services.

SQLAlchemy

SQLAlchemy is the leading Python SQL toolkit and Object Relational Mapper that gives application developers the full power and flexibility of SQL.

It provides a full suite of well known enterprise-level persistence patterns, designed for efficient and high-performing database access, adapted into a simple and Pythonic domain language.

Evaluations

Singer vs. Meltano

Meltano as a framework fills many gaps and makes Singer convenient to actually use. It is impossible to outline all details and every difference, so we will focus on the "naming things" aspects for now.

The two ecosystems use different names for the same elements. That may be confusing at first, but it is easy to learn. For the notion of data source vs. data sink, common to all pipeline systems in one way or another, Singer uses the terms tap vs. target, while Meltano uses extractor vs. loader. Essentially, they are the same things under different names.

| Ecosystem | Data source | Data sink |
|-----------|-------------|-----------|
| Singer    | Tap         | Target    |
| Meltano   | Extractor   | Loader    |

In Singer jargon, you tap data from a source, and send it to a target. In Meltano jargon, you extract data from a source, and then load it into the target system.

Singer and Airbyte criticism