(See below for screenshots)
Closes #1510. This is my first time writing docs here! Happy to receive
style feedback - I already suspect I've written too much.
cc @kevinjqliu @Fokko
---------
Co-authored-by: Sreesh Maheshwar <[email protected]>
| Key | Options | Default | Description |
| --- | --- | --- | --- |
| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression codec. |
| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg |
| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group |
| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk |
| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
| `write.object-storage.enabled` | Boolean | True | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
The `SimpleLocationProvider` is enabled for a table by explicitly setting its `write.object-storage.enabled` table
property to `False`.
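How the table properties select a provider can be sketched in plain Python. This is a minimal illustration, not PyIceberg's API: the helper name `chosen_location_provider` is hypothetical, and the precedence (a custom `write.py-location-provider.impl` first, then the `write.object-storage.enabled` flag with its documented default of `True`) is an assumption based on the property table.

```python
def chosen_location_provider(properties: dict[str, str]) -> str:
    """Return the name of the provider a table would use (hypothetical helper)."""
    # Assumption: a custom implementation, if configured, takes precedence.
    custom_impl = properties.get("write.py-location-provider.impl")
    if custom_impl:
        return custom_impl
    # Table properties are strings; the documented default is True.
    enabled = properties.get("write.object-storage.enabled", "True")
    if enabled.lower() == "true":
        return "ObjectStoreLocationProvider"
    return "SimpleLocationProvider"


# Explicitly opting out of the object-store layout selects the simple provider.
simple = chosen_location_provider({"write.object-storage.enabled": "False"})
```

With no properties set, the same helper returns `ObjectStoreLocationProvider`, which reflects the default noted in the table.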
### Object Store Location Provider

PyIceberg offers the `ObjectStoreLocationProvider`, and an optional [partition-exclusion](configuration.md#partition-exclusion)
optimization, designed for tables stored in object storage. For additional context and motivation concerning these configurations,
see their [documentation for Iceberg's Java implementation](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout).

When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
resulting in slowdowns. The `ObjectStoreLocationProvider` counteracts this by injecting deterministic hashes, in the form of binary directories,
into file paths, to distribute files across a larger number of object store prefixes.
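The idea of injecting a deterministic hash as binary directories can be sketched as follows. This is an illustration under stated assumptions, not PyIceberg's implementation: SHA-256 from the standard library stands in for whatever hash the provider actually uses, the 20-bit width and the 4/4/4/8 directory split are illustrative choices, and the function names and bucket path are hypothetical.

```python
import hashlib


def entropy_dirs(file_name: str, bits: int = 20, dir_len: int = 4, depth: int = 3) -> str:
    """Map a file name to nested directories of 0/1 characters (illustrative)."""
    # Stand-in hash: the same file name always yields the same digest.
    digest = hashlib.sha256(file_name.encode("utf-8")).digest()
    # Keep the top `bits` bits of the first 32 bits, rendered as a binary string.
    top_bits = int.from_bytes(digest[:4], "big") >> (32 - bits)
    binary = format(top_bits, f"0{bits}b")
    # First `depth` directories hold `dir_len` characters each; the rest forms the last one.
    parts = [binary[i * dir_len:(i + 1) * dir_len] for i in range(depth)]
    parts.append(binary[depth * dir_len:])
    return "/".join(parts)


def object_store_path(table_location: str, file_name: str) -> str:
    """Place the hash directories between the data directory and the file name."""
    return f"{table_location.rstrip('/')}/data/{entropy_dirs(file_name)}/{file_name}"


# Hypothetical table location and file name, for illustration only.
path = object_store_path("s3://bucket/ns/table", "0000-0-example.parquet")
```

Because the directories are derived from the file name alone, files that would otherwise share one prefix are spread across many prefixes, while any given file always resolves to the same path.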

Paths still contain partitions just before the file name, in Hive-style, and a `data` directory beneath the table's location,
in a similar manner to the [`SimpleLocationProvider`](configuration.md#simple-location-provider). For example, a table
partitioned over a string column `category` might have a data file with location: (note the additional binary directories)