Convert Hugo versioned docs to mkdocs format (#9591)
bitsondatadev authored Feb 1, 2024
1 parent 6bbf70a commit ed28898
Showing 45 changed files with 432 additions and 638 deletions.
22 changes: 7 additions & 15 deletions docs/java-api.md → docs/docs/api.md
@@ -1,13 +1,5 @@
---
title: "Java API"
url: api
aliases:
- "java/api"
menu:
main:
parent: "API"
identifier: java_api
weight: 200
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
@@ -36,11 +28,11 @@ Table metadata and operations are accessed through the `Table` interface. This i

### Table metadata

The [`Table` interface](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/Table.html) provides access to the table metadata:
The [`Table` interface](../../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/Table.html) provides access to the table metadata:

* `schema` returns the current table [schema](../schemas)
* `schema` returns the current table [schema](schemas.md)
* `spec` returns the current table partition spec
* `properties` returns a map of key-value [properties](../configuration)
* `properties` returns a map of key-value [properties](configuration.md)
* `currentSnapshot` returns the current table snapshot
* `snapshots` returns all valid snapshots for the table
* `snapshot(id)` returns a specific snapshot by ID
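
As an illustration, a minimal sketch of reading this metadata through the Java API could look like the following; the catalog type, warehouse path, and table name are assumptions for the example:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Snapshot;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hadoop.HadoopCatalog;

// Assumed setup: a HadoopCatalog over an example warehouse path.
HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "hdfs://nn:8020/warehouse/path");
Table table = catalog.loadTable(TableIdentifier.of("logging", "logs"));

table.schema();           // current table schema
table.spec();             // current partition spec
table.properties();       // map of key-value properties
table.currentSnapshot();  // current snapshot (null for an empty table)
for (Snapshot snapshot : table.snapshots()) {
  System.out.println(snapshot.snapshotId() + " committed at " + snapshot.timestampMillis());
}
```
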
@@ -108,7 +100,7 @@ where `Record` is Iceberg record for iceberg-data module `org.apache.iceberg.dat

### Update operations

`Table` also exposes operations that update the table. These operations use a builder pattern, [`PendingUpdate`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/PendingUpdate.html), that commits when `PendingUpdate#commit` is called.
`Table` also exposes operations that update the table. These operations use a builder pattern, [`PendingUpdate`](../../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/PendingUpdate.html), that commits when `PendingUpdate#commit` is called.

For example, updating the table schema is done by calling `updateSchema`, adding updates to the builder, and finally calling `commit` to commit the pending changes to the table:
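
A hedged sketch of such an update (the column name and rename are illustrative, and `table` is a previously loaded `Table`):

```java
import org.apache.iceberg.types.Types;

// Adds an optional column and renames an existing one; commit() applies both atomically.
table.updateSchema()
    .addColumn("count", Types.LongType.get())
    .renameColumn("data", "payload")
    .commit();
```
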

@@ -150,7 +142,7 @@ t.commitTransaction();

## Types

Iceberg data types are located in the [`org.apache.iceberg.types` package](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/types/package-summary.html).
Iceberg data types are located in the [`org.apache.iceberg.types` package](../../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/types/package-summary.html).

### Primitives

@@ -166,7 +158,7 @@ Types.DecimalType.of(9, 2) // decimal(9, 2)

Structs, maps, and lists are created using factory methods in type classes.

Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](../evolution#correctness) and nullability.
Like struct fields, map keys or values and list elements are tracked as nested fields. Nested fields track [field IDs](evolution.md#correctness) and nullability.

Struct fields are created using `NestedField.optional` or `NestedField.required`. Map value and list element nullability is set in the map and list factory methods.
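
To make the factory methods concrete, here is a small sketch (the field IDs and names are arbitrary):

```java
import org.apache.iceberg.types.Types;
import org.apache.iceberg.types.Types.ListType;
import org.apache.iceberg.types.Types.MapType;
import org.apache.iceberg.types.Types.NestedField;
import org.apache.iceberg.types.Types.StructType;

// A struct with a required id field and an optional name field.
StructType struct = StructType.of(
    NestedField.required(1, "id", Types.LongType.get()),
    NestedField.optional(2, "name", Types.StringType.get()));

// Map keys are always required; value nullability is set by the factory method.
MapType map = MapType.ofOptional(3, 4, Types.StringType.get(), Types.DoubleType.get());

// List element nullability is likewise set by the factory method.
ListType list = ListType.ofRequired(5, Types.IntegerType.get());
```
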

@@ -193,7 +185,7 @@ ListType list = ListType.ofRequired(1, IntegerType.get());

## Expressions

Iceberg's expressions are used to configure table scans. To create expressions, use the factory methods in [`Expressions`](../../../javadoc/{{% icebergVersion %}}/index.html?org/apache/iceberg/expressions/Expressions.html).
Iceberg's expressions are used to configure table scans. To create expressions, use the factory methods in [`Expressions`](../../javadoc/{{ icebergVersion }}/index.html?org/apache/iceberg/expressions/Expressions.html).
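
As a brief sketch of how these factory methods compose (the column names are illustrative, and `table` is a previously loaded `Table`):

```java
import org.apache.iceberg.TableScan;
import org.apache.iceberg.expressions.Expression;
import org.apache.iceberg.expressions.Expressions;

// Rows where level = 'error' and event_time >= the given epoch-millis timestamp.
Expression filter = Expressions.and(
    Expressions.equal("level", "error"),
    Expressions.greaterThanOrEqual("event_time", 1700000000000L));

// Expressions are typically passed to a table scan.
TableScan scan = table.newScan().filter(filter);
```
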

Supported predicate expressions are:

Binary file added docs/docs/assets/images/audit-branch.png
(Previews of the added binary image files are not available.)
39 changes: 18 additions & 21 deletions docs/aws.md → docs/docs/aws.md
@@ -1,11 +1,5 @@
---
title: "AWS"
url: aws
menu:
main:
parent: Integrations
identifier: aws_integration
weight: 0
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
@@ -53,7 +47,7 @@ For example, to use AWS features with Spark 3.4 (with scala 2.12) and AWS client

```sh
# start Spark SQL client shell
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{% icebergVersion %}},org.apache.iceberg:iceberg-aws-bundle:{{% icebergVersion %}} \
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{ icebergVersion }},org.apache.iceberg:iceberg-aws-bundle:{{ icebergVersion }} \
--conf spark.sql.defaultCatalog=my_catalog \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
@@ -69,10 +63,12 @@ To use AWS module with Flink, you can download the necessary dependencies and sp

```sh
# download Iceberg dependency
ICEBERG_VERSION={{% icebergVersion %}}
ICEBERG_VERSION={{ icebergVersion }}
MAVEN_URL=https://repo1.maven.org/maven2
ICEBERG_MAVEN_URL=$MAVEN_URL/org/apache/iceberg

wget $ICEBERG_MAVEN_URL/iceberg-flink-runtime/$ICEBERG_VERSION/iceberg-flink-runtime-$ICEBERG_VERSION.jar

wget $ICEBERG_MAVEN_URL/iceberg-aws-bundle/$ICEBERG_VERSION/iceberg-aws-bundle-$ICEBERG_VERSION.jar

# start Flink SQL client shell
@@ -142,7 +138,7 @@ an Iceberg table is stored as a [Glue Table](https://docs.aws.amazon.com/glue/la
and every Iceberg table version is stored as a [Glue TableVersion](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-catalog-tables.html#aws-glue-api-catalog-tables-TableVersion).
You can start using the Glue catalog by specifying the `catalog-impl` as `org.apache.iceberg.aws.glue.GlueCatalog`,
as shown in the [enabling AWS integration](#enabling-aws-integration) section above.
More details about loading the catalog can be found in individual engine pages, such as [Spark](../spark-configuration/#loading-a-custom-catalog) and [Flink](../flink/#creating-catalogs-and-using-catalogs).
More details about loading the catalog can be found in individual engine pages, such as [Spark](spark-configuration.md#loading-a-custom-catalog) and [Flink](flink.md#creating-catalogs-and-using-catalogs).
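
Outside of an engine, the same catalog can be constructed directly from Java. The sketch below is illustrative only; the catalog name, bucket, and key prefix are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.aws.glue.GlueCatalog;
import org.apache.iceberg.catalog.TableIdentifier;

// Placeholder properties; credentials come from the default AWS credential chain.
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "s3://my-bucket/my/key/prefix");
properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");

GlueCatalog catalog = new GlueCatalog();
catalog.initialize("my_catalog", properties);

Table table = catalog.loadTable(TableIdentifier.of("my_db", "my_table"));
```
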

#### Glue Catalog ID

@@ -181,17 +177,17 @@ If there is no commit conflict, the operation will be retried.
Optimistic locking guarantees atomic transactions of Iceberg tables in Glue.
It also prevents others from accidentally overwriting your changes.

{{< hint info >}}
Please use AWS SDK version >= 2.17.131 to leverage Glue's Optimistic Locking.
If the AWS SDK version is below 2.17.131, only in-memory lock is used. To ensure atomic transaction, you need to set up a [DynamoDb Lock Manager](#dynamodb-lock-manager).
{{< /hint >}}
!!! info
Please use AWS SDK version >= 2.17.131 to leverage Glue's Optimistic Locking.
If the AWS SDK version is below 2.17.131, only in-memory lock is used. To ensure atomic transaction, you need to set up a [DynamoDb Lock Manager](#dynamodb-lock-manager).


#### Warehouse Location

Similar to all other catalog implementations, `warehouse` is a required catalog property to determine the root path of the data warehouse in storage.
By default, Glue only allows a warehouse location in S3 because of the use of `S3FileIO`.
To store data in a different local or cloud store, Glue catalog can switch to use `HadoopFileIO` or any custom FileIO by setting the `io-impl` catalog property.
Details about this feature can be found in the [custom FileIO](../custom-catalog/#custom-file-io-implementation) section.
Details about this feature can be found in the [custom FileIO](custom-catalog.md#custom-file-io-implementation) section.

#### Table Location

@@ -267,7 +263,7 @@ This design has the following benefits:

Iceberg also supports the JDBC catalog which uses a table in a relational database to manage Iceberg tables.
You can configure the JDBC catalog to use a relational database service such as [AWS RDS](https://aws.amazon.com/rds).
Read [the JDBC integration page](../jdbc/#jdbc-catalog) for guides and examples about using the JDBC catalog.
Read [the JDBC integration page](jdbc.md#jdbc-catalog) for guides and examples about using the JDBC catalog.
Read [this AWS documentation](https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/UsingWithRDS.IAMDBAuth.Connecting.Java.html) for more details about configuring the JDBC catalog with IAM authentication.
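
As a non-authoritative sketch, configuring the JDBC catalog against an RDS instance from Java could look like this; the endpoint, credentials, and warehouse path are placeholders, and the matching JDBC driver must be on the classpath:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.jdbc.JdbcCatalog;

// Placeholder connection details for an RDS PostgreSQL instance.
Map<String, String> properties = new HashMap<>();
properties.put("uri", "jdbc:postgresql://my-rds-endpoint:5432/iceberg_catalog");
properties.put("jdbc.user", "iceberg");
properties.put("jdbc.password", "password");
properties.put("warehouse", "s3://my-bucket/my/key/prefix");
properties.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");

JdbcCatalog catalog = new JdbcCatalog();
catalog.initialize("my_jdbc_catalog", properties);
```
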

### Which catalog to choose?
@@ -293,7 +289,7 @@ This feature requires the following lock related catalog properties:
2. Set `lock.table` as the DynamoDB table name you would like to use. If the lock table with the given name does not exist in DynamoDB, a new table is created with billing mode set as [pay-per-request](https://aws.amazon.com/blogs/aws/amazon-dynamodb-on-demand-no-capacity-planning-and-pay-per-request-pricing).

Other lock related catalog properties can also be used to adjust locking behaviors such as heartbeat interval.
For more details, please refer to [Lock catalog properties](../configuration/#lock-catalog-properties).
For more details, please refer to [Lock catalog properties](configuration.md#lock-catalog-properties).
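
A hedged sketch of wiring these lock properties into a catalog from Java; the lock table name and timeout are placeholders:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.aws.glue.GlueCatalog;

// Placeholder values; tune the intervals and timeouts for your workload.
Map<String, String> properties = new HashMap<>();
properties.put("warehouse", "s3://my-bucket/my/key/prefix");
properties.put("lock-impl", "org.apache.iceberg.aws.dynamodb.DynamoDbLockManager");
properties.put("lock.table", "myGlueLockTable");      // created on demand with pay-per-request billing
properties.put("lock.acquire-timeout-ms", "180000");  // stop trying to acquire a lock after 3 minutes

GlueCatalog catalog = new GlueCatalog();
catalog.initialize("my_catalog", properties);
```
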


## S3 FileIO
@@ -347,7 +343,7 @@ Iceberg by default uses the Hive storage layout but can be switched to use the `
With `ObjectStoreLocationProvider`, a deterministic hash is generated for each stored file, with the hash appended
directly after the `write.data.path`. This ensures files written to S3 are equally distributed across multiple [prefixes](https://aws.amazon.com/premiumsupport/knowledge-center/s3-object-key-naming-pattern/) in the S3 bucket, resulting in minimized throttling and maximized throughput for S3-related IO operations. When using `ObjectStoreLocationProvider`, having a shared and short `write.data.path` across your Iceberg tables will improve performance.

For more information on how S3 scales API QPS, check out the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier]( https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales/partitions & at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created.
For more information on how S3 scales API QPS, check out the 2018 re:Invent session on [Best Practices for Amazon S3 and Amazon S3 Glacier](https://youtu.be/rHeTn9pHNKo?t=3219). At [53:39](https://youtu.be/rHeTn9pHNKo?t=3219) it covers how S3 scales/partitions & at [54:50](https://youtu.be/rHeTn9pHNKo?t=3290) it discusses the 30-60 minute wait time before new partitions are created.

To use the `ObjectStoreLocationProvider`, add `'write.object-storage.enabled'=true` to the table's properties.
Below is an example Spark SQL command to create a table using the `ObjectStoreLocationProvider`:
@@ -378,7 +374,7 @@ However, for the older versions up to 0.12.0, the logic is as follows:
- before 0.12.0, `write.object-storage.path` must be set.
- at 0.12.0, `write.object-storage.path` then `write.folder-storage.path` then `<tableLocation>/data`.

For more details, please refer to the [LocationProvider Configuration](../custom-catalog/#custom-location-provider-implementation) section.
For more details, please refer to the [LocationProvider Configuration](custom-catalog.md#custom-location-provider-implementation) section.

### S3 Strong Consistency

@@ -539,7 +535,7 @@ The Glue, S3 and DynamoDB clients are then initialized with the assume-role cred
Here is an example to start Spark shell with this client factory:

```shell
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{% icebergVersion %}},org.apache.iceberg:iceberg-aws-bundle:{{% icebergVersion %}} \
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:{{ icebergVersion }},org.apache.iceberg:iceberg-aws-bundle:{{ icebergVersion }} \
--conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog.warehouse=s3://my-bucket/my/key/prefix \
--conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
@@ -618,13 +614,14 @@ For versions before 6.5.0, you can use a [bootstrap action](https://docs.aws.ama
```sh
#!/bin/bash

ICEBERG_VERSION={{% icebergVersion %}}
ICEBERG_VERSION={{ icebergVersion }}
MAVEN_URL=https://repo1.maven.org/maven2
ICEBERG_MAVEN_URL=$MAVEN_URL/org/apache/iceberg
# NOTE: this is just an example shared class path between Spark and Flink,
# please choose a proper class path for production.
LIB_PATH=/usr/share/aws/aws-java-sdk/


ICEBERG_PACKAGES=(
"iceberg-spark-runtime-3.3_2.12"
"iceberg-flink-runtime"
@@ -655,7 +652,7 @@ More details could be found [here](https://docs.aws.amazon.com/glue/latest/dg/aw
### AWS EKS

[AWS Elastic Kubernetes Service (EKS)](https://aws.amazon.com/eks/) can be used to start any Spark, Flink, Hive, Presto or Trino clusters to work with Iceberg.
Search the [Iceberg blogs](../../../blogs) page for tutorials around running Iceberg with Docker and Kubernetes.
Search the [Iceberg blogs](../../blogs.md) page for tutorials around running Iceberg with Docker and Kubernetes.

### Amazon Kinesis

28 changes: 10 additions & 18 deletions docs/branching-and-tagging.md → docs/docs/branching.md
@@ -1,13 +1,5 @@
---
title: "Branching and Tagging"
url: branching
aliases:
- "tables/branching"
menu:
main:
parent: Tables
identifier: tables_branching
weight: 0
---

<!--
@@ -33,14 +25,14 @@ menu:

Iceberg table metadata maintains a snapshot log, which represents the changes applied to a table.
Snapshots are fundamental in Iceberg as they are the basis for reader isolation and time travel queries.
For controlling metadata size and storage costs, Iceberg provides snapshot lifecycle management procedures such as [`expire_snapshots`](../spark-procedures/#expire-snapshots) for removing unused snapshots and no longer necessary data files based on table snapshot retention properties.
For controlling metadata size and storage costs, Iceberg provides snapshot lifecycle management procedures such as [`expire_snapshots`](spark-procedures.md#expire-snapshots) for removing unused snapshots and no longer necessary data files based on table snapshot retention properties.

**For more sophisticated snapshot lifecycle management, Iceberg supports branches and tags which are named references to snapshots with their own independent lifecycles. This lifecycle is controlled by branch and tag level retention policies.**
Branches are independent lineages of snapshots and point to the head of the lineage.
Branches and tags have a maximum reference age property which controls when the reference to the snapshot itself should be expired.
Branches have retention properties which define the minimum number of snapshots to retain on a branch as well as the maximum age of individual snapshots to retain on the branch.
These properties are used when the expireSnapshots procedure is run.
For details on the algorithm for expireSnapshots, refer to the [spec](../../../spec#snapshot-retention-policy).
For details on the algorithm for expireSnapshots, refer to the [spec](../../spec.md#snapshot-retention-policy).
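
For a concrete, hedged sense of these retention knobs, the Java API exposes them through `ManageSnapshots`; the reference names, snapshot ID, and retention values below are illustrative:

```java
import java.util.concurrent.TimeUnit;

// "table" is a previously loaded org.apache.iceberg.Table.
long snapshotId = table.currentSnapshot().snapshotId();

table.manageSnapshots()
    .createTag("EOY-2023", snapshotId)                        // a named reference to a single snapshot
    .setMaxRefAgeMs("EOY-2023", TimeUnit.DAYS.toMillis(365))  // expire the tag itself after a year
    .createBranch("audit-branch", snapshotId)                 // an independent lineage of snapshots
    .setMinSnapshotsToKeep("audit-branch", 2)                 // branch-level snapshot retention
    .setMaxSnapshotAgeMs("audit-branch", TimeUnit.DAYS.toMillis(7))
    .commit();
```
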

## Use Cases

@@ -52,7 +44,7 @@ See below for some examples of how branching and tagging can facilitate these us

Tags can be used for retaining important historical snapshots for auditing purposes.

![Historical Tags](../img/historical-snapshot-tag.png)
![Historical Tags](assets/images/historical-snapshot-tag.png)

The above diagram demonstrates retaining important historical snapshots with the following retention policy, defined
via Spark SQL.
@@ -84,7 +76,7 @@ ALTER TABLE prod.db.table CREATE BRANCH `test-branch` RETAIN 7 DAYS WITH SNAPSHO

### Audit Branch

![Audit Branch](../img/audit-branch.png)
![Audit Branch](assets/images/audit-branch.png)

The above diagram shows an example of using an audit branch for validating a write workflow.

@@ -115,9 +107,9 @@ CALL catalog_name.system.fast_forward('prod.db.table', 'main', 'audit-branch');

Creating, querying and writing to branches and tags are supported in the Iceberg Java library, and in Spark and Flink engine integrations.

- [Iceberg Java Library](../java-api-quickstart/#branching-and-tagging)
- [Spark DDLs](../spark-ddl/#branching-and-tagging-ddl)
- [Spark Reads](../spark-queries/#time-travel)
- [Spark Branch Writes](../spark-writes/#writing-to-branches)
- [Flink Reads](../flink-queries/#reading-branches-and-tags-with-SQL)
- [Flink Branch Writes](../flink-writes/#branch-writes)
- [Iceberg Java Library](java-api-quickstart.md#branching-and-tagging)
- [Spark DDLs](spark-ddl.md#branching-and-tagging-ddl)
- [Spark Reads](spark-queries.md#time-travel)
- [Spark Branch Writes](spark-writes.md#writing-to-branches)
- [Flink Reads](flink-queries.md#reading-branches-and-tags-with-SQL)
- [Flink Branch Writes](flink-writes.md#branch-writes)
14 changes: 3 additions & 11 deletions docs/configuration.md → docs/docs/configuration.md
@@ -1,13 +1,5 @@
---
title: "Configuration"
url: configuration
aliases:
- "tables/configuration"
menu:
main:
parent: Tables
identifier: tables_configuration
weight: 0
---
<!--
- Licensed to the Apache Software Foundation (ASF) under one or more
@@ -144,8 +136,8 @@ Iceberg catalogs support using catalog properties to configure catalog behaviors
`HadoopCatalog` and `HiveCatalog` can access the properties in their constructors.
Any other custom catalog can access the properties by implementing `Catalog.initialize(catalogName, catalogProperties)`.
The properties can be manually constructed or passed in from a compute engine like Spark or Flink.
Spark uses its session properties as catalog properties, see more details in the [Spark configuration](../spark-configuration#catalog-configuration) section.
Flink passes in catalog properties through `CREATE CATALOG` statement, see more details in the [Flink](../flink/#creating-catalogs-and-using-catalogs) section.
Spark uses its session properties as catalog properties; see more details in the [Spark configuration](spark-configuration.md#catalog-configuration) section.
Flink passes in catalog properties through the `CREATE CATALOG` statement; see more details in the [Flink](flink.md#adding-catalogs) section.
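
As a non-authoritative sketch, a custom catalog supplied through `catalog-impl` could be loaded like this; the implementation class name here is hypothetical:

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.CatalogUtil;
import org.apache.iceberg.catalog.Catalog;

// "com.example.MyCustomCatalog" is a hypothetical implementation with a no-arg constructor;
// CatalogUtil instantiates it and then calls initialize(catalogName, catalogProperties).
Map<String, String> catalogProperties = new HashMap<>();
catalogProperties.put("warehouse", "s3://my-bucket/warehouse");

Catalog catalog = CatalogUtil.loadCatalog(
    "com.example.MyCustomCatalog",  // catalog-impl
    "my_catalog",                   // catalog name
    catalogProperties,
    null);                          // Hadoop Configuration, if any
```
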

### Lock catalog properties

Expand All @@ -154,7 +146,7 @@ Here are the catalog properties related to locking. They are used by some catalo
| Property | Default | Description |
| --------------------------------- | ------------------ | ------------------------------------------------------ |
| lock-impl | null | a custom implementation of the lock manager, the actual interface depends on the catalog used |
| lock.table | null | an auxiliary table for locking, such as in [AWS DynamoDB lock manager](../aws/#dynamodb-for-commit-locking) |
| lock.table | null | an auxiliary table for locking, such as in [AWS DynamoDB lock manager](aws.md#dynamodb-lock-manager) |
| lock.acquire-interval-ms | 5000 (5 s) | the interval to wait between each attempt to acquire a lock |
| lock.acquire-timeout-ms | 180000 (3 min) | the maximum time to try acquiring a lock |
| lock.heartbeat-interval-ms | 3000 (3 s) | the interval to wait between each heartbeat after acquiring a lock |