Add Apache Iceberg Streaming Writes and Batch Reads (MVP) #928

Draft · wants to merge 1 commit into base: develop

Conversation

@zliang-min (Collaborator) commented Mar 18, 2025

This PR introduces MVP-level support for streaming writes and batch reads with Apache Iceberg tables, fully implemented in C++ (no JNI). While existing C++ projects like ClickHouse and DuckDB focus on read-only Iceberg integration, this implementation adds native write capabilities, enabling end-to-end data pipelines directly from SQL.

Key Highlights

  • Streaming writes: Continuously write data to Iceberg tables via materialized views or direct INSERT statements.
  • Zero Java dependencies: Native C++ integration leveraging Apache Arrow for file I/O and AWS SDK for S3/Glue.
  • SQL-first workflows: Manage Iceberg catalogs, tables, and writes using familiar SQL syntax.

What’s Working (MVP) ✅

Docs: https://docs.timeplus.com/iceberg

Catalog & Setup

  • Support for the Iceberg REST Catalog (verified with AWS Glue and Amazon S3 Tables).
  • Create new Iceberg tables via SQL.
  • Support AWS SigV4 authentication for the catalog (Glue) and storage (S3).

Write Operations

  • Append data via INSERT INTO or streaming materialized views.
  • AWS S3 storage with environment/IAM credentials.

Read Operations

  • Batch read entire Iceberg tables (v1/v2 formats); see the read example at the end of the usage section below.

Usage Example

-- Connect to an Iceberg database managed by AWS Glue, using AK/SK/IAM credentials from the host
CREATE DATABASE demo
SETTINGS  type='iceberg', warehouse='(aws-12-id)',  
  catalog_type='rest', catalog_uri='https://glue.us-west-2.amazonaws.com/iceberg',
  storage_endpoint='https://bucket.s3.us-west-2.amazonaws.com',  
  rest_catalog_sigv4_enabled=true,
  rest_catalog_signing_region='us-west-2',
  rest_catalog_signing_name='glue';

-- Switch to the Iceberg database namespace
USE demo;

-- List existing Iceberg tables
SHOW STREAMS;

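-- Append data to an existing Iceberg table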
INSERT INTO demo.existing_table VALUES (..);

-- Or create a new Iceberg table and use MV to write data
CREATE STREAM transformed(
  timestamp datetime64,
  org_id string,
  float_value float,
  array_length int,
  max_num int,
  min_num int
);
  
-- Stream data to Iceberg  
CREATE MATERIALIZED VIEW mydb.mv_write_iceberg INTO demo.transformed AS
SELECT now() AS timestamp, org_id, float_value,
       length(`array_of_records.a_num`) AS array_length,
       array_max(`array_of_records.a_num`) AS max_num,
       array_min(`array_of_records.a_num`) AS min_num
FROM mydb.msk_stream
SETTINGS s3_min_upload_file_size=1024;
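
The example above covers only writes. For completeness, here is a minimal batch-read sketch against the same table; it assumes the demo database and transformed stream created earlier, and the queries are illustrative rather than verified output.

-- Batch read the Iceberg table written by the materialized view above
SELECT org_id, max_num, min_num
FROM demo.transformed
LIMIT 10;

-- Count all rows in the table
SELECT count() FROM demo.transformed;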

What’s Next (Help Wanted!) 🔧

Write Improvements

  • DELETE and UPSERT operations.
  • Partitioning support (bucket, truncate).
  • INSERT OVERWRITE operations.
  • Merge-on-read for updates/deletes.

Read Improvements

  • Streaming incremental reads (snapshot tracking).
  • Time travel queries.

Catalog & Security

  • Support Amazon S3 Tables (done in preview 3)
  • Support Apache Gravitino catalog (done in preview 3)
  • Support Apache Polaris catalog
  • Database/Hive catalog

Maintenance

  • Snapshot management, version/branch/tag management
  • Schema evolution enhancements

Try it now:

  1. We are still working on the test cases and fixing CI issues. Until a new Timeplus Proton release ships with this PR merged, you can install Timeplus Enterprise 2.8 on Linux or macOS. Please follow the guide at https://docs.timeplus.com/enterprise-v2.8#2_8_0

You can use the web console at http://localhost:8000/ to run SQL.

Use the SQL examples above to connect to the Iceberg databases and read/write data.

You can also use this Docker image on Linux/macOS/Windows:
docker.timeplus.com/timeplus/timeplusd:2.8.14. For example, start a container with the AWS AK/SK taken from environment variables:

docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:2.8.14

  2. Demo video: how to use Timeplus to read data from Amazon MSK (Managed Streaming for Apache Kafka), apply stream processing, write to S3 in the Iceberg table format, and query it with Athena: https://www.youtube.com/watch?v=2m6ehwmzOnc

Contribute:

See the "What's Next (Help Wanted!)" section above for areas where contributions are welcome.

Tech notes:

  • Built on Apache Arrow C++ for Parquet/ORC file handling.
  • Minimal runtime dependencies (no Hadoop/JVM).
  • AWS SDK integration for Glue/S3 auth.

Note: Starting from preview 3, the syntax for catalog configuration changed from ENGINE to SETTINGS.

@jovezhong jovezhong self-requested a review March 18, 2025 15:28
@jovezhong jovezhong changed the title Initial Iceberg support Add Apache Iceberg Streaming Writes and Batch Reads (MVP) Mar 18, 2025
@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview2/timeplusd

MD5 (timeplusd) = 10e9d405ab5ac2cb0f28db60c804324f

@jovezhong jovezhong requested a review from chenziliang March 21, 2025 16:38
@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview3/timeplusd
chmod +x timeplusd
export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
./timeplusd server

You can also run this via Docker on Linux/macOS/Windows (the image is built for linux/amd64 but can run on an Arm host):

export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:iceberg-preview3

Enhancements in this build:

  • Support new syntax to connect to Iceberg via CREATE DATABASE name SETTINGS type='iceberg', catalog_type='rest', catalog_uri='..'
  • Support Amazon S3 Tables read/write: set catalog_uri to 'https://s3tables.REGION.amazonaws.com/iceberg', along with the matching values for warehouse/storage_endpoint/rest_catalog_signing_name, e.g.
CREATE DATABASE jove_s3table
SETTINGS  type='iceberg', 
          catalog_type='rest', catalog_uri='https://s3tables.us-west-2.amazonaws.com/iceberg',
          warehouse='arn:aws:s3tables:us-west-2:012345678901:bucket/jove-s3', 
          storage_endpoint='https://jove-s3.s3.us-west-2.amazonaws.com',
          rest_catalog_sigv4_enabled=true, rest_catalog_signing_region='us-west-2', rest_catalog_signing_name='s3tables';
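
Once the database is created, the rest of the workflow matches the Glue example in the PR description; a short illustrative sketch, reusing commands shown above:

USE jove_s3table;
SHOW STREAMS;  -- list the Iceberg tables in the S3 Tables bucket, then INSERT INTO / SELECT as usual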

@jovezhong (Contributor) commented:

A new preview build is ready for Linux x64

wget https://install.timeplus.com/iceberg_preview4/timeplusd
chmod +x timeplusd
export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
./timeplusd server

You can also run this via Docker on Linux/macOS/Windows (the image is built for linux/amd64 but can run on an Arm host). The next preview build will support Arm chips natively and will probably include the web UI.

export AWS_ACCESS_KEY_ID=..
export AWS_SECRET_ACCESS_KEY=..
docker run --name timeplus_iceberg -e AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY -d -p 7587:7587 -p 8463:8463 docker.timeplus.com/timeplus/timeplusd:iceberg-preview4

Enhancements in this build:

  • Fixed a calculation issue in the count() optimization
  • Better handling of various credential-setup cases
  • A set of other minor bug fixes and enhancements

@jovezhong (Contributor) commented Mar 25, 2025

Another daily update (probably the last one for this week). We published a preview edition of Timeplus Enterprise 2.8 with this Iceberg integration. You can try it on Linux or macOS, on either x86_64 or Arm chips. Both bare-metal packages and a Docker image are available. Please follow the guide at https://docs.timeplus.com/enterprise-v2.8#2_8_0

You can use the web console at http://localhost:8000 to run SQL. Docs are also published: https://docs.timeplus.com/iceberg
