
Commit 622adb7

Fokko and HonahX authored
docs: Document Parquet write options (#364)
* docs: Document Parquet write options
* Move to tables section
* Default to 2MB
* Revert some unrelated changes
* Update configuration.md

---------

Co-authored-by: Honah J. <[email protected]>
1 parent 40ab60a commit 622adb7

File tree

2 files changed: +17, -4 lines changed


mkdocs/docs/configuration.md

Lines changed: 14 additions & 1 deletion
@@ -46,7 +46,20 @@ The environment variable picked up by Iceberg starts with `PYICEBERG_` and then
 
 For example, `PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID`, sets `s3.access-key-id` on the `default` catalog.
 
-## FileIO
+# Tables
+
+Iceberg tables support table properties to configure table behavior.
+
+## Write options
+
+| Key                               | Options                           | Default | Description                                                                                  |
+| --------------------------------- | --------------------------------- | ------- | -------------------------------------------------------------------------------------------- |
+| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression codec.                                                          |
+| `write.parquet.compression-level` | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg                   |
+| `write.parquet.page-size-bytes`   | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk  |
+| `write.parquet.dict-size-bytes`   | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group                                             |
+
+# FileIO
 
 Iceberg works with the concept of a FileIO which is a pluggable module for reading, writing, and deleting files. By default, PyIceberg will try to initialize the FileIO that's suitable for the scheme (`s3://`, `gs://`, etc.) and will use the first one that's installed.
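The options in the new `Write options` table are plain string table properties that PyIceberg reads at write time. As a minimal sketch of how they could be supplied when creating a table (the catalog name `default`, the namespace `docs`, and the table name `example` are hypothetical, and assume an already configured catalog with an existing namespace):

```python
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType

# Assumes a catalog named "default" is configured and the "docs" namespace exists.
catalog = load_catalog("default")

schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="city", field_type=StringType(), required=False),
)

# Write options are ordinary table properties, passed as strings.
table = catalog.create_table(
    identifier="docs.example",
    schema=schema,
    properties={
        "write.parquet.compression-codec": "zstd",
        "write.parquet.compression-level": "6",                  # optional codec level
        "write.parquet.page-size-bytes": str(1024 * 1024),       # 1 MB target page size
        "write.parquet.dict-size-bytes": str(2 * 1024 * 1024),   # 2 MB dictionary page limit
    },
)
```

All values are strings because Iceberg table properties are stored as string key/value pairs; PyIceberg parses the byte sizes and integers when it builds the Parquet writer arguments, which is what the `pyiceberg/io/pyarrow.py` change below handles.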

pyiceberg/io/pyarrow.py

Lines changed: 3 additions & 3 deletions
@@ -1757,14 +1757,14 @@ def write_file(table: Table, tasks: Iterator[WriteTask]) -> Iterator[DataFile]:
 
 
 def _get_parquet_writer_kwargs(table_properties: Properties) -> Dict[str, Any]:
-    def _get_int(key: str) -> Optional[int]:
+    def _get_int(key: str, default: Optional[int] = None) -> Optional[int]:
         if value := table_properties.get(key):
             try:
                 return int(value)
             except ValueError as e:
                 raise ValueError(f"Could not parse table property {key} to an integer: {value}") from e
         else:
-            return None
+            return default
 
     for key_pattern in [
         "write.parquet.row-group-size-bytes",
@@ -1784,5 +1784,5 @@ def _get_int(key: str) -> Optional[int]:
         "compression": compression_codec,
         "compression_level": compression_level,
         "data_page_size": _get_int("write.parquet.page-size-bytes"),
-        "dictionary_pagesize_limit": _get_int("write.parquet.dict-size-bytes"),
+        "dictionary_pagesize_limit": _get_int("write.parquet.dict-size-bytes", default=2 * 1024 * 1024),
     }
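To illustrate the behaviour these two hunks change: `_get_int` now accepts a fallback, so `dictionary_pagesize_limit` defaults to 2 MB when `write.parquet.dict-size-bytes` is unset. Below is a self-contained sketch of that parsing logic; in the source, `_get_int` is a closure over `table_properties` inside `_get_parquet_writer_kwargs`, so the explicit `table_properties` parameter here is an adaptation to keep the example runnable on its own:

```python
from typing import Dict, Optional


def _get_int(table_properties: Dict[str, str], key: str, default: Optional[int] = None) -> Optional[int]:
    """Parse an integer table property, falling back to `default` when the key is absent."""
    if value := table_properties.get(key):
        try:
            return int(value)
        except ValueError as e:
            raise ValueError(f"Could not parse table property {key} to an integer: {value}") from e
    else:
        return default


# The page size is parsed from the property; the dictionary page size falls back to 2 MB.
props = {"write.parquet.page-size-bytes": "1048576"}
assert _get_int(props, "write.parquet.page-size-bytes") == 1024 * 1024
assert _get_int(props, "write.parquet.dict-size-bytes", default=2 * 1024 * 1024) == 2 * 1024 * 1024
```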
