
Commit a93e300

smaheshwar-pltr and Sreesh Maheshwar authored
Docs: Location Provider Documentation (#1537)
(See below for screenshots.) Closes #1510.

This is my first time writing docs here! Happy to receive style feedback - I already suspect I've written too much. cc @kevinjqliu @Fokko

---------

Co-authored-by: Sreesh Maheshwar <[email protected]>
1 parent 41d4b93 commit a93e300

File tree: 2 files changed (+105, -10 lines)

mkdocs/docs/configuration.md

Lines changed: 99 additions & 9 deletions
@@ -54,15 +54,18 @@ Iceberg tables support table properties to configure table behavior.

### Write options

-| Key                                    | Options                           | Default | Description |
-| -------------------------------------- | --------------------------------- | ------- | ----------- |
-| `write.parquet.compression-codec`      | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression codec. |
-| `write.parquet.compression-level`      | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg |
-| `write.parquet.row-group-limit`        | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group |
-| `write.parquet.page-size-bytes`        | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk |
-| `write.parquet.page-row-limit`         | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk |
-| `write.parquet.dict-size-bytes`        | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group |
-| `write.metadata.previous-versions-max` | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit. |
+| Key                                       | Options                           | Default | Description |
+|-------------------------------------------|-----------------------------------|---------|-------------|
+| `write.parquet.compression-codec`         | `{uncompressed,zstd,gzip,snappy}` | zstd    | Sets the Parquet compression codec. |
+| `write.parquet.compression-level`         | Integer                           | null    | Parquet compression level for the codec. If not set, it is up to PyIceberg |
+| `write.parquet.row-group-limit`           | Number of rows                    | 1048576 | The upper bound of the number of entries within a single row group |
+| `write.parquet.page-size-bytes`           | Size in bytes                     | 1MB     | Set a target threshold for the approximate encoded size of data pages within a column chunk |
+| `write.parquet.page-row-limit`            | Number of rows                    | 20000   | Set a target threshold for the maximum number of rows within a column chunk |
+| `write.parquet.dict-size-bytes`           | Size in bytes                     | 2MB     | Set the dictionary page size limit per row group |
+| `write.metadata.previous-versions-max`    | Integer                           | 100     | The max number of previous version metadata files to keep before deleting after commit. |
+| `write.object-storage.enabled`            | Boolean                           | True    | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths`  | Boolean                           | True    | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
+| `write.py-location-provider.impl`         | String of form `module.ClassName` | null    | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
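
The write options above are ordinary Iceberg table properties. As a minimal sketch of how one might be applied, assuming an existing `table` object already loaded from a configured catalog:

```py
# Assumption: `table` was loaded earlier, e.g. table = catalog.load_table("ns.table").
with table.transaction() as transaction:
    # Property values are strings; this changes the Parquet codec used for future writes.
    transaction.set_properties({"write.parquet.compression-codec": "snappy"})
```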

### Table behavior options

@@ -195,6 +198,93 @@ PyIceberg uses [S3FileSystem](https://arrow.apache.org/docs/python/generated/pya
<!-- markdown-link-check-enable-->

## Location Providers

Apache Iceberg uses the concept of a `LocationProvider` to manage file paths for a table's data. In PyIceberg, the
`LocationProvider` module is designed to be pluggable, allowing customization for specific use cases. The
`LocationProvider` for a table can be specified through table properties.

PyIceberg defaults to the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider), which generates
file paths that are optimized for object storage.

### Simple Location Provider

The `SimpleLocationProvider` places a table's file names underneath a `data` directory in the table's base storage
location (this is `table.metadata.location` - see the [Iceberg table specification](https://iceberg.apache.org/spec/#table-metadata)).
For example, a non-partitioned table might have a data file with location:

```txt
s3://bucket/ns/table/data/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

When the table is partitioned, files under a given partition are grouped into a subdirectory, with that partition key
and value as the directory name - this is known as the *Hive-style* partition path format. For example, a table
partitioned over a string column `category` might have a data file with location:

```txt
s3://bucket/ns/table/data/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

The `SimpleLocationProvider` is enabled for a table by explicitly setting its `write.object-storage.enabled` table
property to `False`.

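As a minimal sketch, this property can be set when the table is created; the catalog name, identifier, and schema below are illustrative assumptions:

```py
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

catalog = load_catalog("default")  # assumes a catalog named "default" is configured
table = catalog.create_table(
    "ns.simple_table",  # illustrative namespace and table name
    schema=Schema(NestedField(field_id=1, name="category", field_type=StringType(), required=False)),
    properties={"write.object-storage.enabled": "False"},  # opts this table into the SimpleLocationProvider
)
```
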
### Object Store Location Provider

PyIceberg offers the `ObjectStoreLocationProvider` and an optional [partition-exclusion](configuration.md#partition-exclusion)
optimization, both designed for tables stored in object storage. For additional context and motivation concerning these configurations,
see their [documentation for Iceberg's Java implementation](https://iceberg.apache.org/docs/latest/aws/#object-store-file-layout).

When several files are stored under the same prefix, cloud object stores such as S3 often [throttle requests on prefixes](https://repost.aws/knowledge-center/http-5xx-errors-s3),
resulting in slowdowns. The `ObjectStoreLocationProvider` counteracts this by injecting deterministic hashes, in the form of binary directories,
into file paths, to distribute files across a larger number of object store prefixes.

Paths still contain partitions just before the file name, in Hive-style, and a `data` directory beneath the table's location,
in a similar manner to the [`SimpleLocationProvider`](configuration.md#simple-location-provider). For example, a table
partitioned over a string column `category` might have a data file with the following location (note the additional binary directories):

```txt
s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

The `write.object-storage.enabled` table property determines whether the `ObjectStoreLocationProvider` is enabled for a
table. It is enabled by default.

#### Partition Exclusion

When the `ObjectStoreLocationProvider` is used, the table property `write.object-storage.partitioned-paths`, which
defaults to `True`, can be set to `False` as an additional optimization for object stores. This omits partition keys and
values from data file paths *entirely* to further reduce key size. With it disabled, the same data file above would
instead be written to the following location (note the absence of `category=orders`):

```txt
s3://bucket/ns/table/data/1101/0100/1011/00111010-00000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
```

### Loading a Custom Location Provider

Similar to FileIO, a custom `LocationProvider` may be provided for a table by concretely subclassing the abstract base
class [`LocationProvider`](../reference/pyiceberg/table/locations/#pyiceberg.table.locations.LocationProvider).

The table property `write.py-location-provider.impl` should be set to the fully-qualified name of the custom
`LocationProvider` (e.g. `mymodule.MyLocationProvider`). Recall that a `LocationProvider` is configured per-table,
permitting different location provision strategies for different tables. Note also that Iceberg's Java implementation uses a
different table property, `write.location-provider.impl`, for custom Java implementations.

An example custom `LocationProvider` implementation is shown below.

```py
import uuid
from typing import Optional

# The imports below assume the standard PyIceberg module locations for these types.
from pyiceberg.partitioning import PartitionKey
from pyiceberg.table.locations import LocationProvider
from pyiceberg.typedef import Properties


class UUIDLocationProvider(LocationProvider):
    def __init__(self, table_location: str, table_properties: Properties):
        super().__init__(table_location, table_properties)

    def new_data_location(self, data_file_name: str, partition_key: Optional[PartitionKey] = None) -> str:
        # Can use any custom method to generate a file path given the partitioning information and file name.
        prefix = f"{self.table_location}/{uuid.uuid4()}"
        return f"{prefix}/{partition_key.to_path()}/{data_file_name}" if partition_key else f"{prefix}/{data_file_name}"
```
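
To have a table actually use such a provider, its `write.py-location-provider.impl` property can then be set to the class's fully-qualified name. A minimal sketch, assuming the class above is importable as `mymodule.UUIDLocationProvider` and that `table` is an existing table object:

```py
with table.transaction() as transaction:
    transaction.set_properties({"write.py-location-provider.impl": "mymodule.UUIDLocationProvider"})
```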

## Catalogs

PyIceberg currently has native catalog type support for REST, SQL, Hive, Glue and DynamoDB.

pyiceberg/table/locations.py

Lines changed: 6 additions & 1 deletion
@@ -30,7 +30,12 @@

class LocationProvider(ABC):
-    """A base class for location providers, that provide data file locations for write tasks."""
+    """A base class for location providers that provide data file locations for a table's write tasks.
+
+    Args:
+        table_location (str): The table's base storage location.
+        table_properties (Properties): The table's properties.
+    """

    table_location: str
