Open
Description
Apache Iceberg version
0.9.1 (latest release)
Please describe the bug 🐞
Summary
There appears to be a memory leak when running to_arrow_batch_reader(): it takes ~30 GB of memory to read an Iceberg table that consists of a single 40 MB Parquet file.
Example code:
import boto3
from pyiceberg.table import StaticTable, Table


def iceberg_table_from_metadata_path(metadata_path: str) -> StaticTable:
    # Resolve AWS credentials from the default boto3 session.
    session = boto3.Session(region_name="<aws_region>")
    credentials = session.get_credentials()
    credentials = credentials.get_frozen_credentials()
    table = StaticTable.from_metadata(
        metadata_path,
        {
            "client.secret-access-key": credentials.secret_key,
            "client.access-key-id": credentials.access_key,
            "client.session-token": credentials.token,
            "client.region": "<aws_region>",
        },
    )
    return table


def main():
    metadata_path = "<path-to>.metadata.json"
    iceberg_table = iceberg_table_from_metadata_path(metadata_path)
    scan_kwargs = {"row_filter": "PARTITION='train'"}

    # Iterating the batch reader is where the memory usage grows.
    batch_reader = iceberg_table.scan(**scan_kwargs).to_arrow_batch_reader()
    for batch in batch_reader:
        print("Inside batch reader")

    # List the data files that the scan touches.
    plan_files = iceberg_table.scan(**scan_kwargs).plan_files()
    for file in plan_files:
        print(file.file.file_path)
    print("Hello from pyiceberg-test!")


if __name__ == "__main__":
    main()
Running using memray
uv run memray run main.py
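As a lighter-weight cross-check alongside memray, the process's peak resident set size could be logged after each batch. This is only a sketch: it assumes a Unix-like system (the standard-library `resource` module is not available on Windows), and the helper names `peak_rss_mib` and `with_peak_rss` are hypothetical, not part of PyIceberg or memray.

```python
import resource
import sys
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def with_peak_rss(items: Iterable[T]) -> Iterator[Tuple[int, T, float]]:
    """Yield (index, item, peak RSS in MiB) for each item of any iterable."""
    for index, item in enumerate(items):
        yield index, item, peak_rss_mib()
```

In the example above, the loop over `batch_reader` could be wrapped as `for i, batch, rss in with_peak_rss(batch_reader): print(i, batch.num_rows, rss)`. Because the value is a high-water mark, it never decreases; a sharp jump on the first batch would suggest the whole scan is being materialized up front rather than streamed.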
Charts


Dependencies
"boto3>=1.40.21",
"memray>=1.18.0",
"pyarrow>=21.0.0",
"pyiceberg==0.10.0rc1",
"s3fs>=0.4.2"
Willingness to contribute
- I can contribute a fix for this bug independently
- I would be willing to contribute a fix for this bug with guidance from the Iceberg community
- I cannot contribute a fix for this bug at this time