Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spec: Update partition stats for V3 #12098

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

aokolnychyi
Copy link
Contributor

@aokolnychyi aokolnychyi commented Jan 24, 2025

This PR adapts our partition stats spec to accommodate DVs:

  • position_delete_record_count includes a sum of position deletes across position delete files and DVs
  • position_delete_file_count still counts only position delete files (ignoring deletion vectors)
  • dv_count counts the number of deletion vectors

@github-actions github-actions bot added the Specification Issues that may introduce spec changes. label Jan 24, 2025
| _required_ | _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | _required_ | **`6 position_delete_record_count`** | `long` | Count of position deletes across position delete files and deletion vectors |
| _optional_ | _optional_ | _required_ | **`7 position_delete_file_count`** | `int` | Count of position delete files ignoring deletion vectors |
| | | _required_ | **`13 dv_count`** | `int` | Count of deletion vectors |
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using dv in the key to match what we did for snapshot summaries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should append this line to the end of this table, which reduces the code change and matches the field id in ascending order.

Copy link
Member

@ajantha-bhat ajantha-bhat Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, keeping in the end maybe useful for choosing next schema field id easily. Else user might have to go through each field to see whether next field id is reused.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@advancedxy @ajantha-bhat Agreed, I think the original intention was to group it with similar fields but I feel like it's better to optimize this table for future evolution

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I OK either way. Adding it to the end won't reduce the amount of changes, though. We still need to add the V3 column.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we do group by logic in other places. For instance, nan_value_counts that was added much later was placed next to other counts. I feel like it makes sense to group by logic as adding to the end won't reduce the number of changes. We need to add the v3 column.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair point, I'm good either way

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aokolnychyi do we need a new field for dv_record_count?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stevenzwu, I don't think so but can be convinced otherwise. From the spec perspective, the content of DVs is position deletes. That's what we store in manifests. Also, we combine the number of position deletes across the old-style position delete files and DVs in snapshot summaries. Therefore, the decision to reuse the existing field for the position delete record count is consistent.

Are there use cases that would benefit if we had dv_record_count as an independent field?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking about future deprecation. When position delete file is deprecated in the future, position_delete_record_count and position_delete_file_count fields can be deprecated and removed.

@aokolnychyi
Copy link
Contributor Author

aokolnychyi commented Jan 24, 2025

cc @ajantha-bhat @rdblue @nastra @Fokko @RussellSpitzer @huaxingao @amogh-jahagirdar

Let me know if there is any early feedback prior to starting the vote.

| _required_ | _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
| _required_ | _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | _required_ | **`6 position_delete_record_count`** | `long` | Count of position deletes across position delete files and deletion vectors |
| _optional_ | _optional_ | _required_ | **`7 position_delete_file_count`** | `int` | Count of position delete files ignoring deletion vectors |
Copy link
Member

@ajantha-bhat ajantha-bhat Jan 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why the delete stats are required field in v3? IIRC, Tables may not have delete operations. So, it is was made optional in V2. So, it should be optional for v3 also?

Copy link
Contributor

@amogh-jahagirdar amogh-jahagirdar Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, there's value in having these be required because having null in these fields isn't really that informative since it just means that the state is unknown and needs to be derived using other metadata. Populating these fields accurately isn't that much of a burden for writers (so if there are no deletes, it'll just be 0) and eliminates any special casing around having to deal with null values here. I'd imagine the potential extra storage from these fields is also pretty negligible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, having this as optional is ambiguous. In fact, we should clarify that NULL means 0 in v1 and v2. What do you think about this, @ajantha-bhat? If you agree, can you start a discussion on the dev list as you know this part best?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for starting the discussions.

| | | _required_ | **`13 dv_count`** | `int` | Count of deletion vectors |
| _optional_ | _optional_ | _required_ | **`8 equality_delete_record_count`** | `long` | Count of records in equality delete files |
| _optional_ | _optional_ | _required_ | **`9 equality_delete_file_count`** | `int` | Count of equality delete files |
| _optional_ | _optional_ | _optional_ | **`10 total_record_count`** | `long` | Accurate count of records in a partition after applying deletes if any |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be required aswell for v3? Since we decided to keep it same as other counters during implementation.
https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/PartitionStats.java#L36

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can make this required and I am afraid we have to fix the implementation. If there are equality deletes, determining this would be very expensive. If we don't have equality deletes and have only DVs, then we can populate this value without an expensive computation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whether we compute or not is optional and I agree.

But we decided that all the counters must be of same behavior (zero initialized if not computed or not present). So, making it null in the implementation is not a good idea? It can be required and initialized to zero always if we don't compute it?

If not, I am even thinking to remove this field as it is hard to compute. WDYT?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added more context to this PR around total_record_count. I think it differs from other counters and DVs actually enable cheap and precise computation of it without reading data!

| _required_ | _required_ | _required_ | **`3 data_record_count`** | `long` | Count of records in data files |
| _required_ | _required_ | _required_ | **`4 data_file_count`** | `int` | Count of data files |
| _required_ | _required_ | _required_ | **`5 total_data_file_size_in_bytes`** | `long` | Total size of data files in bytes |
| _optional_ | _optional_ | _required_ | **`6 position_delete_record_count`** | `long` | Count of position deletes across position delete files and deletion vectors |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale behind combining the DV and position delete record count in the position_delete_record_count ? Better to keep a new record counter for DVs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've elaborated here. Let me know what you think, @ajantha-bhat!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Specification Issues that may introduce spec changes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants