Add Apache Iceberg Streaming Writes and Batch Reads (MVP) #928

Draft · wants to merge 1 commit into base: develop

Conversation

@zliang-min (Collaborator) commented Mar 18, 2025

This PR introduces MVP-level support for streaming writes and batch reads with Apache Iceberg tables, fully implemented in C++ (no JNI). While existing C++ projects like ClickHouse and DuckDB focus on read-only Iceberg integration, this implementation adds native write capabilities, enabling end-to-end data pipelines directly from SQL.

Key Highlights

  • Streaming writes: Continuously write data to Iceberg tables via materialized views or direct INSERT statements.
  • Zero Java dependencies: Native C++ integration leveraging Apache Arrow for file I/O and AWS SDK for S3/Glue.
  • SQL-first workflows: Manage Iceberg catalogs, tables, and writes using familiar SQL syntax.

What’s Working (MVP) ✅

Docs: https://docs.timeplus.com/iceberg

Catalog & Setup

  • Support for the Iceberg REST Catalog (verified with AWS Glue and Amazon S3 Tables).
  • Create new Iceberg tables via SQL.
  • Support AWS SigV4 authentication for the catalog (Glue) and storage (S3).

Write Operations

  • Append data via INSERT INTO or streaming materialized views.
  • AWS S3 storage with environment/IAM credentials.

Read Operations

  • Batch read entire Iceberg tables (v1/v2 formats); see the read example at the end of the usage section below.

Usage Example

-- Connect to an Iceberg database managed by AWS Glue, using AK/SK/IAM credentials from the host
CREATE DATABASE demo
SETTINGS  type='iceberg', warehouse='(aws-12-id)',  
  catalog_type='rest', catalog_uri='https://glue.us-west-2.amazonaws.com/iceberg',
  storage_endpoint='https://bucket.s3.us-west-2.amazonaws.com',  
  rest_catalog_sigv4_enabled=true,
  rest_catalog_signing_region='us-west-2',
  rest_catalog_signing_name='glue';

-- Switch to the Iceberg database namespace
USE demo;

-- List existing Iceberg tables
SHOW STREAMS;

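-- Append data to an existing Iceberg table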
INSERT INTO demo.existing_table VALUES (..);

-- Or create a new Iceberg table and use MV to write data
CREATE STREAM transformed(
  timestamp datetime64,
  org_id string,
  float_value float,
  array_length int,
  max_num int,
  min_num int
);
  
-- Stream data to Iceberg  
CREATE MATERIALIZED VIEW mydb.mv_write_iceberg INTO demo.transformed AS
SELECT now() AS timestamp, org_id, float_value,
       length(`array_of_records.a_num`) AS array_length,
       array_max(`array_of_records.a_num`) AS max_num,
       array_min(`array_of_records.a_num`) AS min_num
FROM mydb.msk_stream
SETTINGS s3_min_upload_file_size=1024;
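
The example above covers only writes. For completeness, here is a minimal batch-read sketch against the same table; it assumes the demo database and transformed stream created earlier, and the queries are illustrative rather than verified output.

-- Batch read the Iceberg table written by the materialized view above
SELECT org_id, max_num, min_num
FROM demo.transformed
LIMIT 10;

-- Count all rows in the table
SELECT count() FROM demo.transformed;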

What’s Next (Help Wanted!) 🔧

Write Improvements

  • DELETE and UPSERT operations.
  • Partitioning support (bucket, truncate).
  • INSERT OVERWRITE operations.
  • Merge-on-read for updates/deletes.

Read Improvements

  • Streaming incremental reads (snapshot tracking).
  • Time travel queries.

Catalog & Security

  • Support Amazon S3 Tables (done in preview 3)
  • Support Apache Gravitino catalog (done in preview 3)
  • Support Apache Polaris catalog
  • Database/Hive catalog

Maintenance

  • Snapshot management, version/branch/tag management
  • Schema evolution enhancements

Try it now:

  1. We are still working on the test cases and fixing CI issues. Until a new Timeplus Proton release ships with this PR merged, you can install Timeplus Enterprise 2.8 on Linux or macOS. Please follow the guide at https://docs.timeplus.com/enterprise-v2.8#2_8_0

You can use the web console at http://localhost:8000/ to run SQL.

Use the SQL examples above to connect to the Iceberg databases and read/write data.

You can also use this Docker image on Linux/macOS/Windows:
docker.timeplus.com/timeplus/timeplusd:2.8.14. For example, start a container with the AWS AK/SK taken from environment variables:

docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:2.8.14

  2. Demo video: how to use Timeplus to read data from Amazon MSK (Managed Streaming for Apache Kafka), apply stream processing, write to S3 in the Iceberg table format, and query it with Athena: https://www.youtube.com/watch?v=2m6ehwmzOnc

Contribute:

See the "What's Next (Help Wanted!)" section above for areas where contributions are welcome.

Tech notes:

  • Built on Apache Arrow C++ for Parquet/ORC file handling.
  • Minimal runtime dependencies (no Hadoop/JVM).
  • AWS SDK integration for Glue/S3 auth.

Note: Starting from preview 3, the syntax for catalog configuration changed from ENGINE to SETTINGS.

@jovezhong jovezhong self-requested a review March 18, 2025 15:28
@jovezhong jovezhong changed the title Initial Iceberg support Add Apache Iceberg Streaming Writes and Batch Reads (MVP) Mar 18, 2025
@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview2/timeplusd

MD5 (timeplusd) = 10e9d405ab5ac2cb0f28db60c804324f

@jovezhong jovezhong requested a review from chenziliang March 21, 2025 16:38
@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview3/timeplusd
chmod +x timeplusd
export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
./timeplusd server

You can also run this via Docker on Linux/macOS/Windows (the image is built for linux/amd64 but can run on an Arm host):

export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:iceberg-preview3

Enhancements in this build:

  • Support new syntax to connect to Iceberg via CREATE DATABASE name SETTINGS type='iceberg', catalog_type='rest', catalog_uri='..'
  • Support Amazon S3 Tables read/write: set catalog_uri to 'https://s3tables.REGION.amazonaws.com/iceberg', along with the matching values for warehouse/storage_endpoint/rest_catalog_signing_name, e.g.
CREATE DATABASE jove_s3table
SETTINGS  type='iceberg', 
          catalog_type='rest', catalog_uri='https://s3tables.us-west-2.amazonaws.com/iceberg',
          warehouse='arn:aws:s3tables:us-west-2:012345678901:bucket/jove-s3', 
          storage_endpoint='https://jove-s3.s3.us-west-2.amazonaws.com',
          rest_catalog_sigv4_enabled=true, rest_catalog_signing_region='us-west-2', rest_catalog_signing_name='s3tables';
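
Once the database is created, the rest of the workflow matches the Glue example in the PR description; a short illustrative sketch, reusing commands shown above:

USE jove_s3table;
SHOW STREAMS;  -- list the Iceberg tables in the S3 Tables bucket, then INSERT INTO / SELECT as usual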

@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview4/timeplusd
chmod +x timeplusd
export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
./timeplusd server

You can also run this via Docker on Linux/macOS/Windows (the image is built for linux/amd64 but can run on an Arm host). The next preview build will support Arm chips natively and will probably include the web UI.

export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:iceberg-preview4

Enhancements in this build:

  • Fixed a calculation issue in the count() optimization
  • Better handling of various credential-setup cases
  • A set of other minor bug fixes and enhancements

@jovezhong (Contributor) commented Mar 25, 2025

Another daily update (probably the last one for this week). We published a preview edition of Timeplus Enterprise 2.8 with this Iceberg integration. You can try it on Linux or macOS, on either x86_64 or Arm chips. Both bare-metal packages and a Docker image are available. Please follow the guide at https://docs.timeplus.com/enterprise-v2.8#2_8_0

You can use the web console at http://localhost:8000 to run SQL. Docs are also published: https://docs.timeplus.com/iceberg
