
How to read/write Iceberg table from/to ADLS gen 2 container with PyIceberg #1588

Open
HungYangChang opened this issue Jan 28, 2025 · 1 comment


HungYangChang commented Jan 28, 2025

Question

I would like to read/write an Iceberg table from/to ADLS Gen2 with PyIceberg.

Background:

  • I know how to use Spark to read/write the table, and I have successfully done it locally
  • The code I use to upload the Iceberg table through the Nessie server:
ACCOUNT_NAME = "example-events"  # Storage account name
CONTAINER_NAME = "example-events-iceberg-debug-local"  # Container name
WAREHOUSE_PATH = f"abfs://{CONTAINER_NAME}@{ACCOUNT_NAME}.dfs.core.windows.net/"

# ref, hadoop_home, and the AZURE_* credentials are defined elsewhere
spark = SparkSession.builder \
    .config("spark.jars", "jars/bundle-2.17.178.jar,"
                          "jars/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,"
                          "jars/nessie-spark-extensions-3.5_2.12-0.101.3.jar,"
                          "jars/url-connection-client-2.17.178.jar,"
                          "jars/postgresql-42.5.0.jar,"
                          "jars/azure-storage-blob-12.20.0.jar,"
                          "jars/azure-identity-1.8.1.jar,"
                          "jars/hadoop-common-3.3.5.jar,"
                          "jars/hadoop-client-3.3.5.jar,"
                          "jars/hadoop-azure-3.3.5.jar,"
                          "jars/hadoop-azure-datalake-3.3.5.jar") \
    .config("spark.hadoop.io.native.lib.available", "false") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
                                    "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1") \
    .config("spark.sql.catalog.nessie.ref", ref) \
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.nessie.warehouse", WAREHOUSE_PATH) \
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.driver.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config("spark.executor.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config(f"spark.hadoop.fs.azure.account.auth.type.{ACCOUNT_NAME}.dfs.core.windows.net", "OAuth") \
    .config(f"spark.hadoop.fs.azure.account.oauth.provider.type.{ACCOUNT_NAME}.dfs.core.windows.net",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.id.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_ID) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_SECRET) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{ACCOUNT_NAME}.dfs.core.windows.net",
            f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/oauth2/token") \
    .getOrCreate()

# Create namespace
spark.sql(f"CREATE NAMESPACE IF NOT EXISTS nessie.{NAME_SPACE}")
print(f"Namespace '{NAME_SPACE}' created successfully!")

# Create a table
spark.sql(f"CREATE TABLE nessie.{NAME_SPACE}.names (name STRING) USING iceberg")
print("Table 'names' created successfully!")

# Insert data
spark.sql(f"INSERT INTO nessie.{NAME_SPACE}.names VALUES ('Alex Merced'), ('Dipankar Mazumdar'), ('Jason Hughes')")
print("Data inserted successfully!")

I can confirm the write succeeded: the commit shows up on the local Nessie server (http://localhost:19120) and the data is stored in the ADLS Gen2 container.

Nessie server:
[screenshot]

ADLS Gen2:
[screenshot]

However, I still cannot figure out how to read/write the Iceberg table from/to ADLS Gen2 with PyIceberg.

I have reviewed https://py.iceberg.apache.org/#getting-started-with-pyiceberg but have had no luck so far.

@kevinjqliu
Contributor

> I still cannot figure out how to read/write the Iceberg table from/to ADLS Gen2 with PyIceberg

In PyIceberg, you need to configure both the catalog and the FileIO.
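As a starting point, here is a minimal sketch of what that two-part configuration could look like. It assumes a Nessie version that exposes an Iceberg REST endpoint (the `/iceberg` path below) and a service principal with access to the storage account; the namespace name and the placeholder credential values are hypothetical, and the `adls.*` keys are the FileIO properties from PyIceberg's configuration page:

```python
# Sketch (not verified against a live server): configure both pieces --
# the catalog (Nessie's Iceberg REST endpoint) and the FileIO
# (adls.* properties so PyIceberg can reach the abfs:// warehouse).
CATALOG_PROPS = {
    "type": "rest",
    # Assumed: Nessie serves an Iceberg REST API alongside its native API
    "uri": "http://localhost:19120/iceberg",
    # FileIO: service-principal credentials for the ADLS Gen2 account
    "adls.account-name": "example-events",
    "adls.tenant-id": "<AZURE_TENANT_ID>",
    "adls.client-id": "<AZURE_CLIENT_ID>",
    "adls.client-secret": "<AZURE_CLIENT_SECRET>",
}

def read_names_table():
    from pyiceberg.catalog import load_catalog

    catalog = load_catalog("nessie", **CATALOG_PROPS)
    # "my_namespace" is a placeholder for the namespace created via Spark
    table = catalog.load_table("my_namespace.names")
    return table.scan().to_arrow()
```

The same properties can instead live in a `~/.pyiceberg.yaml` file under a named catalog, so `load_catalog("nessie")` needs no inline arguments.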
