Commit c09456b: "Thanks Smaheshwar"
1 parent f0150df

File tree: 4 files changed (+38, −24 lines)


mkdocs/docs/configuration.md (18 additions, 18 deletions)

````diff
@@ -54,19 +54,19 @@ Iceberg tables support table properties to configure table behavior.
 
 ### Write options
 
-| Key | Options | Default | Description |
-|-----|---------|---------|-------------|
-| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. |
-| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg |
-| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group |
-| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk |
-| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
-| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
-| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
-| `write.object-storage.enabled` | Boolean | True | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
-| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
-| `write.data.path` | String pointing to location | | Sets the location where to write the data. If not set, it will use the table location postfixed with `data/`. |
+| Key | Options | Default | Description |
+|-----|---------|---------|-------------|
+| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}` | zstd | Sets the Parquet compression coddec. |
+| `write.parquet.compression-level` | Integer | null | Parquet compression level for the codec. If not set, it is up to PyIceberg |
+| `write.parquet.row-group-limit` | Number of rows | 1048576 | The upper bound of the number of entries within a single row group |
+| `write.parquet.page-size-bytes` | Size in bytes | 1MB | Set a target threshold for the approximate encoded size of data pages within a column chunk |
+| `write.parquet.page-row-limit` | Number of rows | 20000 | Set a target threshold for the maximum number of rows within a column chunk |
+| `write.parquet.dict-size-bytes` | Size in bytes | 2MB | Set the dictionary page size limit per row group |
+| `write.metadata.previous-versions-max` | Integer | 100 | The max number of previous version metadata files to keep before deleting after commit. |
+| `write.object-storage.enabled` | Boolean | True | Enables the [`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider) that adds a hash component to file paths. Note: the default value of `True` differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean | True | Controls whether [partition values are included in file paths](configuration.md#partition-exclusion) when object storage is enabled |
+| `write.py-location-provider.impl` | String of form `module.ClassName` | null | Optional, [custom `LocationProvider`](configuration.md#loading-a-custom-location-provider) implementation |
+| `write.data.path` | String pointing to location | `{metadata.location}/data` | Sets the location under which data is written. |
 
 ### Table behavior options
 
@@ -211,8 +211,8 @@ file paths that are optimized for object storage.
 
 ### Simple Location Provider
 
-The `SimpleLocationProvider` places a table's file names underneath a `data` directory in the table's base storage
-location (this is `table.metadata.location` - see the [Iceberg table specification](https://iceberg.apache.org/spec/#table-metadata)).
+The `SimpleLocationProvider` provides paths prefixed by `{location}/data/`, where `location` comes from the [table metadata](https://iceberg.apache.org/spec/#table-metadata-fields). This can be overridden by setting [`write.data.path` table configuration](#write-options).
+
 For example, a non-partitioned table might have a data file with location:
 
 ```txt
@@ -240,9 +240,9 @@ When several files are stored under the same prefix, cloud object stores such as
 resulting in slowdowns. The `ObjectStoreLocationProvider` counteracts this by injecting deterministic hashes, in the form of binary directories,
 into file paths, to distribute files across a larger number of object store prefixes.
 
-Paths still contain partitions just before the file name, in Hive-style, and a `data` directory beneath the table's location,
-in a similar manner to the [`SimpleLocationProvider`](configuration.md#simple-location-provider). For example, a table
-partitioned over a string column `category` might have a data file with location: (note the additional binary directories)
+Paths still are also prefixed by `{location}/data/`, where `location` comes from the [table metadata](https://iceberg.apache.org/spec/#table-metadata-fields), in a similar manner to the [`SimpleLocationProvider`](configuration.md#simple-location-provider). This can be overridden by setting [`write.data.path` table configuration](#write-options).
+
+For example, a table partitioned over a string column `category` might have a data file with location: (note the additional binary directories)
 
 ```txt
 s3://bucket/ns/table/data/0101/0110/1001/10110010/category=orders/0000-0-5affc076-96a4-48f2-9cd2-d5efbc9f0c94-00001.parquet
````
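The binary directories in the example path above follow a 4/4/4/8-bit grouping of a hash of the file name. A minimal standalone sketch of that path shape is below; the `object_store_location` helper is hypothetical, and CRC32 merely stands in for whatever hash function the real `ObjectStoreLocationProvider` uses, so only the structure (not the exact bits) matches PyIceberg's output.

```python
import zlib


def object_store_location(data_path: str, file_name: str) -> str:
    """Illustrative sketch: inject binary hash directories between the data
    path and the file name, grouped 4/4/4/8 bits as in the docs example.
    NOT PyIceberg's actual hash function."""
    h = zlib.crc32(file_name.encode()) & 0xFFFFF  # keep the low 20 bits
    bits = format(h, "020b")
    dirs = "/".join([bits[0:4], bits[4:8], bits[8:12], bits[12:20]])
    return f"{data_path}/{dirs}/{file_name}"
```

Spreading files across many such prefixes avoids the per-prefix request-rate throttling described above for stores like S3.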

pyiceberg/table/__init__.py (2 additions, 0 deletions)

```diff
@@ -196,6 +196,8 @@ class TableProperties:
     WRITE_OBJECT_STORE_PARTITIONED_PATHS = "write.object-storage.partitioned-paths"
     WRITE_OBJECT_STORE_PARTITIONED_PATHS_DEFAULT = True
 
+    WRITE_DATA_PATH = "write.data.path"
+
     DELETE_MODE = "write.delete.mode"
     DELETE_MODE_COPY_ON_WRITE = "copy-on-write"
     DELETE_MODE_MERGE_ON_READ = "merge-on-read"
```

pyiceberg/table/locations.py (1 addition, 3 deletions)

```diff
@@ -28,8 +28,6 @@
 
 logger = logging.getLogger(__name__)
 
-WRITE_DATA_PATH = "write.data.path"
-
 
 class LocationProvider(ABC):
     """A base class for location providers, that provide data file locations for a table's write tasks.
@@ -48,7 +46,7 @@ def __init__(self, table_location: str, table_properties: Properties):
         self.table_location = table_location
         self.table_properties = table_properties
 
-        if path := table_properties.get(WRITE_DATA_PATH):
+        if path := table_properties.get(TableProperties.WRITE_DATA_PATH):
             self.data_path = path.rstrip("/")
         else:
             self.data_path = f"{self.table_location.rstrip('/')}/data"
```
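The constructor above resolves the data path with a simple fallback: an explicit `write.data.path` property wins, otherwise the table location plus `/data` is used. A minimal standalone sketch of that logic, with a plain `dict` standing in for PyIceberg's `Properties` type and a hypothetical `resolve_data_path` name:

```python
def resolve_data_path(table_location: str, table_properties: dict) -> str:
    # Prefer an explicit write.data.path; trailing slashes are stripped
    # so later f-string joins don't produce double slashes.
    if path := table_properties.get("write.data.path"):
        return path.rstrip("/")
    # Default: the table's base location postfixed with /data
    return f"{table_location.rstrip('/')}/data"
```

Note that the override is not required to live under the table location at all; it can point at an entirely different bucket or prefix.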

tests/table/test_locations.py (17 additions, 3 deletions)

```diff
@@ -20,7 +20,8 @@
 
 from pyiceberg.partitioning import PartitionField, PartitionFieldValue, PartitionKey, PartitionSpec
 from pyiceberg.schema import Schema
-from pyiceberg.table.locations import WRITE_DATA_PATH, LocationProvider, load_location_provider
+from pyiceberg.table import TableProperties
+from pyiceberg.table.locations import LocationProvider, load_location_provider
 from pyiceberg.transforms import IdentityTransform
 from pyiceberg.typedef import EMPTY_DICT
 from pyiceberg.types import NestedField, StringType
@@ -135,11 +136,24 @@ def test_hash_injection(data_file_name: str, expected_hash: str) -> None:
     assert provider.new_data_location(data_file_name) == f"table_location/data/{expected_hash}/{data_file_name}"
 
 
-def test_write_data_path() -> None:
+def test_object_location_provider_write_data_path() -> None:
     provider = load_location_provider(
-        table_location="s3://table-location/table", table_properties={WRITE_DATA_PATH: "s3://table-location/custom/data/path"}
+        table_location="s3://table-location/table",
+        table_properties={TableProperties.WRITE_DATA_PATH: "s3://table-location/custom/data/path"},
     )
 
     assert (
         provider.new_data_location("file.parquet") == "s3://table-location/custom/data/path/0010/1111/0101/11011101/file.parquet"
     )
+
+
+def test_simple_location_provider_write_data_path() -> None:
+    provider = load_location_provider(
+        table_location="table_location",
+        table_properties={
+            TableProperties.WRITE_DATA_PATH: "s3://table-location/custom/data/path",
+            "write.object-storage.enabled": "false",
+        },
+    )
+
+    assert provider.new_data_location("file.parquet") == "s3://table-location/custom/data/path/file.parquet"
```
