Add support for encrypted/protected data type in iceberg table #1582

Open
yigal-rozenberg opened this issue Jan 27, 2025 · 3 comments

Comments

@yigal-rozenberg

Feature Request / Improvement

I am working on extending the data types supported by Apache Iceberg with a new complex type, 'ProtectedType'.
Internally, this new data type is a StructType consisting of a header and a payload.
The header should include, at minimum:

  1. Encryption Provider ID
  2. Encryption Key ID
  3. Data Type

The payload holds the encrypted data as a BinaryType.
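
To make the layout above concrete, here is a minimal pure-Python sketch of the header-plus-payload structure. It does not use PyIceberg; the class names `ProtectedHeader` and `ProtectedValue`, the field names, and the `as_row` / `from_row` helpers are all assumptions made for illustration, and the attached files may define things differently.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ProtectedHeader:
    """Hypothetical header; field names are illustrative only."""
    provider_id: str  # 1. Encryption Provider ID
    key_id: str       # 2. Encryption Key ID
    data_type: str    # 3. Data type of the clear text, e.g. "long"


@dataclass(frozen=True)
class ProtectedValue:
    """One protected data item: a header plus the cipher text."""
    header: ProtectedHeader
    payload: bytes    # encrypted data, stored as binary

    def as_row(self) -> dict:
        # Nested dict that mirrors the struct nesting (header, payload).
        return asdict(self)

    @classmethod
    def from_row(cls, row: dict) -> "ProtectedValue":
        # Rebuild the value from the nested dict produced by as_row().
        return cls(ProtectedHeader(**row["header"]), row["payload"])


value = ProtectedValue(ProtectedHeader("provider-a", "key-7", "long"), b"\x01\x02\x03")
row = value.as_row()
```

The nested dict is the shape a struct column with a `header` sub-struct and a binary `payload` field would take per row; because the header travels with every value, the cipher text stays self-describing when it is shared across engines.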

The goal is to let end users interact with the new type transparently, including operations between two encrypted data items and between encrypted data and clear text.
Furthermore, Puffin files could be extended to store aggregate data based on the clear-text values, bloom filters, and optionally an inverted index for regex search without a full table scan.
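
As a rough illustration of the bloom filter idea, the sketch below builds a filter over clear-text values at write time so that a later lookup can skip data that cannot contain a value. It is a generic in-memory bloom filter, not the Puffin blob format; the class name `BloomFilter` and the parameters `num_bits` and `num_hashes` are assumptions for the example.

```python
import hashlib


class BloomFilter:
    """Generic bloom filter sketch; sizes are arbitrary example values."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, value: bytes):
        # Derive one bit position per hash by salting SHA-256 with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + value).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: bytes) -> None:
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )


# Populate the filter from clear-text values before they are encrypted.
bloom = BloomFilter()
for clear_text in (b"alice", b"bob", b"carol"):
    bloom.add(clear_text)
```

A reader that receives a predicate such as `name = 'alice'` could consult the filter first and open the data file only when `might_contain` returns true.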

I am looking for guidance on how such a data type can be introduced and which dependencies I would need to address in the various readers and writers.

protected_type_merge.txt

@yigal-rozenberg
Author

Sample program to read and write using ProtectedType data.

app.txt

@kevinjqliu
Contributor

Hey there! Thanks for creating this issue. Typically, for something like this we would want to create an Improvement Proposal and get feedback from the community.
In this case, the proposal seems to be adding a new data type to the table specification. (Note: the table specification should be language agnostic.)

Here are some somewhat related threads that I've found:
https://lists.apache.org/thread/jm5xoy3fro4omlqlo476cf0118dcznkr
apache/iceberg#10909

Hope this helps!

@yigal-rozenberg
Author

Before posting this as a proper improvement request, I would like to come up with a POC that demonstrates the desired functionality. The thread you provided talks about the need for proper data-centric security, and I have some years of experience in this topic.
IMHO the best way to secure data and centrally control access is to use data item encryption. In some cases this is also referred to as column-level encryption; however, that term can be confused with file encryption of columnar data.
When data items are encrypted, the cipher text can be sent, shared, and accessed across multiple systems and engines.
The challenge is that cipher text by itself does not include metadata such as the key ID used to encrypt it or the original data type of the clear text.
As a first phase, I am trying to understand how to create a new data type in the Iceberg Python interface, one that behaves one way when it stores and reads data from table storage and another way when data is inserted, updated, or selected.

Do I need to implement different __str__ / __repr__ methods in the new data type, or do I need to do it elsewhere?
Where should I implement the operators that support operations between two encrypted types, and between encrypted and clear-text values?
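
One way to picture these two questions is a value-level wrapper class that overrides Python's special methods. The sketch below is plain Python and makes no claim about where PyIceberg would expect such hooks. `ProtectedInt`, `KEYS`, and the single-byte XOR "cipher" are hypothetical stand-ins; XOR is only a placeholder for a real encryption provider and offers no security.

```python
from typing import Dict

# Hypothetical key store: key ID -> single-byte XOR key (placeholder cipher only).
KEYS: Dict[str, int] = {"key-7": 0x5A}


def _xor(data: bytes, key: int) -> bytes:
    return bytes(b ^ key for b in data)


class ProtectedInt:
    """Illustrative encrypted integer that interoperates with clear-text ints."""

    def __init__(self, key_id: str, payload: bytes) -> None:
        self.key_id = key_id
        self.payload = payload

    @classmethod
    def encrypt(cls, value: int, key_id: str) -> "ProtectedInt":
        raw = value.to_bytes(8, "big", signed=True)
        return cls(key_id, _xor(raw, KEYS[key_id]))

    def decrypt(self) -> int:
        raw = _xor(self.payload, KEYS[self.key_id])
        return int.from_bytes(raw, "big", signed=True)

    def __add__(self, other):
        # Accept another protected value or a clear-text int.
        if isinstance(other, ProtectedInt):
            other = other.decrypt()
        if not isinstance(other, int):
            return NotImplemented
        return ProtectedInt.encrypt(self.decrypt() + other, self.key_id)

    __radd__ = __add__  # supports clear_text + protected

    def __eq__(self, other):
        if isinstance(other, ProtectedInt):
            other = other.decrypt()
        if not isinstance(other, int):
            return NotImplemented
        return self.decrypt() == other

    def __repr__(self) -> str:
        # Never print the clear text; show only the metadata.
        return f"ProtectedInt(key_id={self.key_id!r}, payload=<{len(self.payload)} bytes>)"


a = ProtectedInt.encrypt(40, "key-7")
b = ProtectedInt.encrypt(2, "key-7")
total = a + b      # encrypted + encrypted
mixed = 2 + a      # clear text + encrypted
```

Here the result of an operation is re-encrypted under the left operand's key, and `__repr__` exposes only the key ID and payload length. Both are design choices made for this sketch rather than requirements stated in this thread.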
