Question
I would like to read/write an Iceberg table from/to ADLS Gen2 with PyIceberg.
Background:
- I know how to use Spark to read/write the table, and I have successfully done it locally
- Below is the code I use to upload an Iceberg table through the Nessie server:
ACCOUNT_NAME = "example-events"  # Storage account name
CONTAINER_NAME = "example-events-iceberg-debug-local"  # Container name
WAREHOUSE_PATH = f"abfs://{CONTAINER_NAME}@{ACCOUNT_NAME}.dfs.core.windows.net/"

spark = SparkSession.builder \
    .config("spark.jars", "jars/bundle-2.17.178.jar,"
            "jars/iceberg-spark-runtime-3.5_2.12-1.5.2.jar,"
            "jars/nessie-spark-extensions-3.5_2.12-0.101.3.jar,"
            "jars/url-connection-client-2.17.178.jar,"
            "jars/postgresql-42.5.0.jar,"
            "jars/azure-storage-blob-12.20.0.jar,"
            "jars/azure-identity-1.8.1.jar,"
            "jars/hadoop-common-3.3.5.jar,jars/hadoop-client-3.3.5.jar,jars/hadoop-azure-3.3.5.jar,"
            "jars/hadoop-azure-datalake-3.3.5.jar") \
    .config("spark.hadoop.io.native.lib.available", "false") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
            "org.projectnessie.spark.extensions.NessieSparkSessionExtensions") \
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1") \
    .config("spark.sql.catalog.nessie.ref", ref) \
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog") \
    .config("spark.sql.catalog.nessie.warehouse", WAREHOUSE_PATH) \
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.driver.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config("spark.executor.extraJavaOptions", f"-Dhadoop.home.dir={hadoop_home}") \
    .config(f"spark.hadoop.fs.azure.account.auth.type.{ACCOUNT_NAME}.dfs.core.windows.net", "OAuth") \
    .config(f"spark.hadoop.fs.azure.account.oauth.provider.type.{ACCOUNT_NAME}.dfs.core.windows.net",
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.id.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_ID) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.secret.{ACCOUNT_NAME}.dfs.core.windows.net", AZURE_CLIENT_SECRET) \
    .config(f"spark.hadoop.fs.azure.account.oauth2.client.endpoint.{ACCOUNT_NAME}.dfs.core.windows.net",
            f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/oauth2/token") \
    .getOrCreate()
# Create namespace
spark.sql(f"CREATE NAMESPACE IF NOT EXISTS nessie.{NAME_SPACE}")
print(f"Namespace '{NAME_SPACE}' created successfully!")
# Create a table
spark.sql(f"CREATE TABLE nessie.{NAME_SPACE}.names (name STRING) USING iceberg")
print("Table 'names' created successfully!")
# Insert data
spark.sql(f"INSERT INTO nessie.{NAME_SPACE}.names VALUES ('Alex Merced'), ('Dipankar Mazumdar'), ('Jason Hughes')")
print("Data inserted successfully!")
I can confirm the table is committed to the local Nessie server at http://localhost:19120 and the data files are stored in the ADLS Gen2 container.
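A simple read-back in the same Spark session is enough to verify this, e.g.:

# Verify the committed rows by reading them back through the Nessie catalog
spark.sql(f"SELECT * FROM nessie.{NAME_SPACE}.names").show()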
However, I still cannot figure out how to read/write the Iceberg table from/to ADLS Gen2 with PyIceberg.
I have reviewed https://py.iceberg.apache.org/#getting-started-with-pyiceberg but have had no luck so far.
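For context, here is a minimal sketch of what I understand the PyIceberg side should look like. It assumes a recent Nessie release that serves the Iceberg REST API under /iceberg, a branch named main in the URI, reuse of the same service-principal credentials, and PyIceberg's ADLS FileIO support (installed via pyiceberg[adlfs]); none of these are confirmed for my setup:

from pyiceberg.catalog import load_catalog

# Sketch: connect PyIceberg to Nessie's Iceberg REST endpoint (assumed to be
# served at /iceberg/<branch>) and point its ADLS FileIO at the same storage
# account used by the Spark job above.
catalog = load_catalog(
    "nessie",
    **{
        "type": "rest",
        "uri": "http://localhost:19120/iceberg/main",  # assumed branch: main
        # ADLS credentials for PyIceberg's FileIO
        "adls.account-name": ACCOUNT_NAME,
        "adls.tenant-id": AZURE_TENANT_ID,
        "adls.client-id": AZURE_CLIENT_ID,
        "adls.client-secret": AZURE_CLIENT_SECRET,
    },
)

# Read the table that Spark wrote
table = catalog.load_table(f"{NAME_SPACE}.names")
print(table.scan().to_pandas())

# Append rows as a PyArrow table
import pyarrow as pa
table.append(pa.table({"name": ["New Name"]}))

As far as I can tell, PyIceberg has no dedicated Nessie catalog implementation, so Nessie's Iceberg REST endpoint would be the usual bridge; I am not sure whether that is the intended approach here.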