Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[feature] UpdateSchema.add_column supports both parent and child in the same transaction #1493

Open
3 tasks
kevinjqliu opened this issue Jan 7, 2025 · 8 comments
Assignees

Comments

@kevinjqliu
Copy link
Contributor

kevinjqliu commented Jan 7, 2025

Apache Iceberg version

None

Please describe the bug 🐞

Current we cannot add the parent field with its child nested field in the same transaction.
For example,

with table.update_schema() as update:
    # In a struct
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

We should update the API docs as well

To reproduce:

from pyiceberg.catalog.sql import SqlCatalog
from pyiceberg.schema import Schema
from pyiceberg.types import DoubleType, IntegerType, NestedField, StringType, StructType

warehouse_path = "/tmp/warehouse"
catalog = SqlCatalog(
    "default",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db", warehouse=f"file://{warehouse_path}",
)

schema = Schema(
    NestedField(1, "city", StringType(), required=False),
    NestedField(2, "lat", DoubleType(), required=False),
    NestedField(3, "long", DoubleType(), required=False),
)
catalog.create_namespace_if_not_exists("default")
try:
    catalog.drop_table("default.locations")
except:
    pass

table = catalog.create_table("default.locations", schema)

with table.update_schema() as update:
    update.add_column("retries", IntegerType(), "Number of retries to place the bid")
    # In a struct
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

Error:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/table/update/schema.py", line 192, in add_column
    parent_field = self._schema.find_field(parent_full_path, self._case_sensitive)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/kevinliu/repos/iceberg-python/pyiceberg/schema.py", line 215, in find_field
    raise ValueError(f"Could not find field with name {name_or_id}, case_sensitive={case_sensitive}")
ValueError: Could not find field with name details, case_sensitive=True

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
@kevinjqliu
Copy link
Contributor Author

Example test

def test_add_and_evolve_schema(simple_table: Table) -> None:
    parent = "parent"
    child = ("parent", "child")
    schema = simple_table.schema()
    # "parent" not in schema yet
    assert any(parent != field.name for field in schema.fields)
    with simple_table.update_schema() as schema_update:
        schema_update.add_column(parent, StructType())
        schema_update.add_column(child, StringType())

@jiakai-li
Copy link
Contributor

I'm happy to pick this up if it's not assigned yet.

@kevinjqliu
Copy link
Contributor Author

sure @jiakai-li assigned to you! let me know if you have any questions

@jiakai-li
Copy link
Contributor

Hey @kevinjqliu , after some investigation, I realized this feature might require a bigger change than I originally expected (which could be a good thing though, so we can refactor the code a littble bit as well). Specifically, I think the logic in _ApplyChanges needs to be updated.

At the mean time, I would also like to put a note that below code acheives the same purpose (and I feel it's more natural as well):

...
table = catalog.create_table("default.test", schema)

with table.update_schema() as update:
    update.add_column("parent", StructType(
        NestedField(-1, "child", StructType(), required=False)
    ))

Please let me know if you have any thoughts, thanks!

@kevinjqliu
Copy link
Contributor Author

Thanks! Thats a good workaround.

I think a more generic use case is to be able to modify pending updates.
details here is a pending update, it has not yet been applied to the table

with table.update_schema() as update:
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

From an API perspective, i think this is a good use case to support. WDYT?

@jiakai-li
Copy link
Contributor

Sure @kevinjqliu , I think this is a very interesting API to support. I checked the java side as well to get some idea about how they tackle the issue. Seems it's not supported there either (based on what I see here). Do you think we need to raise an issue there as well?

I think a more generic use case is to be able to modify pending updates.
details here is a pending update, it has not yet been applied to the table

Can I please also have a bit more clarification on what do you mean by "modify pending updates" in this example? Thanks!

@kevinjqliu
Copy link
Contributor Author

Interesting that its not available in the java implementation.

Do you think we need to raise an issue there as well?

We can. I guess it hasn't been an issue over there. Most likely people are using the work around you described above.

Can I please also have a bit more clarification on what do you mean by "modify pending updates" in this example? Thanks!

yes, so

with table.update_schema() as update:
    update.add_column("details", StructType())
    update.add_column(("details", "confirmed_by"), StringType(), "Name of the exchange")

UpdateSchema appends the column to its own internal fields.

if parent_id in self._adds:
self._adds[parent_id].append(field)
else:
self._adds[parent_id] = [field]
return self

There's some logic to lookup the parent, but its based on the current schema. This ignores the columns already added to UpdateSchema

So the next add_column statement cannot find the parent details field, because its in UpdateSchema's _adds field and not in the table schema

@jiakai-li
Copy link
Contributor

jiakai-li commented Jan 17, 2025

Thank you @kevinjqliu , I think I have a better understanding now :-) I believe I have some idea, and will push an update once it gets closer. I'm currently out for a trip so it will take some more time to work on it, but I think I'm getting there. Thanks again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants