
How to read/write Iceberg table from/to ADLS gen 2 container with PyIceberg #1588

Closed
@HungYangChang

Description

Question

I would like to read/write an Iceberg table from/to ADLS Gen2 with PyIceberg.

Background:

  • I know how to use Spark to read/write the table, and I have successfully done it locally.
  • The code I use to write an Iceberg table through the Nessie catalog:
import os

from pyspark.sql import SparkSession

ACCOUNT_NAME = "example-events"  # Storage account name
CONTAINER_NAME = "example-events-iceberg-debug-local"  # Container name
WAREHOUSE_PATH = f"abfs://{CONTAINER_NAME}@{ACCOUNT_NAME}.dfs.core.windows.net/"

NAME_SPACE = "demo"  # Target namespace (placeholder)
ref = "main"  # Nessie branch ("main" is the default)
hadoop_home = os.environ.get("HADOOP_HOME", "")  # Local Hadoop installation

# Service principal credentials, read from the environment
AZURE_TENANT_ID = os.environ["AZURE_TENANT_ID"]
AZURE_CLIENT_ID = os.environ["AZURE_CLIENT_ID"]
AZURE_CLIENT_SECRET = os.environ["AZURE_CLIENT_SECRET"]

# Note: the fs.azure.* keys must target the same storage account as WAREHOUSE_PATH
spark = SparkSession.builder \
    .config("spark.jars", "jars/bundle-2.17.178.jar,"
                          "jars/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,"
                          "jars/nessie-spark-extensions-3.5_2.12-0.101.3.jar,"
                          "jars/url-connection-client-2.17.178.jar,"
                          "jars/postgresql-42.5.0.jar,"
                          "jars/azure-storage-blob-12.20.0.jar,"
                          "jars/azure-identity-1.8.1.jar,"
                          "jars/hadoop-common-3.3.5.jar,"
                          "jars/hadoop-client-3.3.5.jar,"
                          "jars/hadoop-azure-3.3.5.jar,"
                          "jars/hadoop-azure-datalake-3.3.5.jar") \
    .config("spark.hadoop.io.native.lib.available", "false") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                                    "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1") \
    .config("spark.sql.catalog.nessie.ref", ref) \
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.nessie.warehouse", WAREHOUSE_PATH) \
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.driver.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config("spark.executor.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config(f"spark.hadoop.fs.azure.account.auth.type.{ACCOUNT_NAME}.dfs.core.windows.net", "OAuth") \
    .config(f"spark.hadoop.fs.azure.account.oauth.provider.type.{ACCOUNT_NAME}.dfs.core.windows.net",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.id.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_ID) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_SECRET) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{ACCOUNT_NAME}.dfs.core.windows.net",
            f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/oauth2/token") \
    .getOrCreate()

# Create namespace
spark.sql(f"CREATE NAMESPACE IF NOT EXISTS nessie.{NAME_SPACE}")
print(f"Namespace '{NAME_SPACE}' created successfully!")

# Create a table
spark.sql(f"CREATE TABLE nessie.{NAME_SPACE}.names (name STRING) USING iceberg")
print("Table 'names' created successfully!")

# Insert data
spark.sql(f"INSERT INTO nessie.{NAME_SPACE}.names VALUES ('Alex Merced'), ('Dipankar Mazumdar'), ('Jason Hughes')")
print("Data inserted successfully!")

I can confirm the write succeeds against the local Nessie server (http://localhost:19120) and the data is stored in the ADLS Gen2 container.

Nessie server: [screenshot]

ADLS Gen2: [screenshot]

However, I still cannot figure out how to read/write the Iceberg table from/to ADLS Gen2 with PyIceberg.

I have reviewed https://py.iceberg.apache.org/#getting-started-with-pyiceberg but have had no luck so far. A sketch of what I expect to work is below.
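
This is a minimal sketch, not a confirmed working setup: it assumes a Nessie version that exposes the Iceberg REST API under /iceberg (Nessie >= 0.74), PyIceberg installed with ADLS support (pip install "pyiceberg[adlfs]"), and PyIceberg's documented adls.* FileIO properties; the catalog name and branch are placeholders, and the credential variables are the same ones defined for the Spark job above.

from pyiceberg.catalog import load_catalog
import pyarrow as pa

# Connect to Nessie through its Iceberg REST endpoint; "main" is the
# default Nessie branch. The adls.* keys configure PyIceberg's ADLS FileIO
# with the same service-principal credentials used in the Spark job above.
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg/main",
        "adls.account-name": ACCOUNT_NAME,
        "adls.tenant-id": AZURE_TENANT_ID,
        "adls.client-id": AZURE_CLIENT_ID,
        "adls.client-secret": AZURE_CLIENT_SECRET,
    },
)

# Read: scan the table written by Spark into an Arrow table
table = catalog.load_table(f"{NAME_SPACE}.names")
print(table.scan().to_arrow())

# Write: append one more row
table.append(pa.table({"name": ["New Name"]}))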
