Core: Mark 503: added_snapshot_id as required #8673

Fokko wants to merge 11 commits into apache:main

Conversation
This is required in the spec, but optional in the code. Might want to check if it reads fields that are written as optional.
| "Lowest sequence number in the manifest"); | ||
| Types.NestedField SNAPSHOT_ID = | ||
| optional( | ||
| required( |
---

How do we maintain compatibility when reading old manifest files? The manifest reader will use the new schema to read old files, and the read may fail if the data was never written. Are we sure we were always writing the data, even for this optional field?
---

> Are we sure we were always writing the data even for this optional field?

I need to check the code for this. I'm doing something else in parallel today; I will check it tomorrow.
---

Good question. I haven't seen any cases where there is no data. The thing I want to double-check is that we can read a required field when the schema in the file is optional.
---

```
spark-sql ()> select * from examples.nyc_taxi_yellow limit 22;
23/09/29 15:28:54 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2 2003-03-27 06:07:28 193 2003-03-27 22:30:46 193 1 0.0 1 1 3.31 2.5 0.01 0.0 0.5 0.3 0.0 0.0 N
2 2009-01-02 01:17:42 193 2009-01-02 18:19:52 193 1 0.0 1 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 N
2 2009-01-01 08:08:07 193 2009-01-01 08:08:12 193 1 0.0 1 2 3.3 2.5 0.0 0.0 0.5 0.3 0.0 0.0 N
2 2009-01-01 09:34:29 132 2009-01-01 09:52:13 203 1 4.99 1 2 17.3 16.0 0.0 0.0 0.5 0.3 0.0 0.5 N
2 2009-01-01 09:46:10 132 2009-01-01 10:16:52 39 2 9.36 1 2 29.3 28.5 0.0 0.0 0.5 0.3 0.0 0.0 N
2 2009-01-01 11:26:49 7 2009-01-01 23:37:23 7 1 0.0 1 2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 N
2 2009-01-02 03:33:44 193 2009-01-02 03:34:54 193 1 0.0 1 2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 N
2 2020-11-30 22:23:38 138 2020-12-01 21:44:12 116 1 8.64 1 1 37.92 26.5 4.0 6.12 0.5 0.3 0.0 0.5 N
NULL 2020-12-01 09:01:00 42 2020-12-01 09:08:00 41 NULL 1.4 NULL NULL 10.31 9.51 0.0 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:02:49 230 2020-12-01 09:34:29 61 NULL 9.69 NULL NULL 37.9 34.35 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:04:00 215 2020-12-01 09:07:00 215 NULL 0.51 NULL NULL 59.84 56.29 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:06:00 192 2020-12-01 09:27:00 70 NULL 3.83 NULL NULL 25.5 24.7 0.0 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:07:00 136 2020-12-01 09:21:00 243 NULL 3.03 NULL NULL 22.0 18.45 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:07:28 137 2020-12-01 09:23:35 69 NULL 8.09 NULL NULL 35.59 32.04 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:08:00 242 2020-12-01 09:25:00 18 NULL 4.55 NULL NULL 23.85 20.3 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:08:00 255 2020-12-01 09:24:00 146 NULL 4.74 NULL NULL 19.43 15.05 3.58 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:08:37 61 2020-12-01 09:36:29 137 NULL 9.79 NULL NULL 30.95 27.65 0.0 0.0 0.5 0.3 2.5 0.0 NULL
NULL 2020-12-01 09:09:00 168 2020-12-01 09:20:00 212 NULL 3.52 NULL NULL 25.5 21.95 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:12:10 140 2020-12-01 09:31:20 213 NULL 8.99 NULL NULL 34.82 19.03 2.75 12.24 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:14:00 239 2020-12-01 09:43:00 49 NULL 9.84 NULL NULL 36.36 28.85 4.21 0.0 0.5 0.3 2.5 0.0 NULL
NULL 2020-12-01 09:16:00 226 2020-12-01 09:41:00 219 NULL 14.79 NULL NULL 48.17 44.62 2.75 0.0 0.5 0.3 0.0 0.0 NULL
NULL 2020-12-01 09:24:00 140 2020-12-01 09:44:00 116 NULL 5.51 NULL NULL 30.08 21.9 4.88 0.0 0.5 0.3 2.5 0.0 NULL
Time taken: 16.631 seconds, Fetched 22 row(s)
```

It just works. I ran this after checking out this branch:

```
bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.0-SNAPSHOT,software.amazon.awssdk:bundle:2.20.18 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.defaultCatalog=sandbox \
--conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.sandbox.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
...
```

---
My point is that the code should follow the spec, and not the other way around. However, if we had been writing null values, then we couldn't make the field required anymore; my experiment below shows that this is not the case. We just updated the spec a week ago, but I think that change is incorrect: #8600
In the current docs it is still required for both v1 and v2. Running the released 1.3.1:

```
bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.18 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
...
spark-sql (spielerij)> create table null_or_not_null AS SELECT 1 as id UNION ALL select 2 UNION ALL select 3;
spark-sql (spielerij)> describe table extended null_or_not_null;
id int
# Metadata Columns
_spec_id int
_partition struct<>
_file string
_pos bigint
_deleted boolean
# Detailed Table Information
Name sandbox.spielerij.null_or_not_null
Type MANAGED
Location s3://tabular-wh-us-east-1/6efbcaf4-21ae-499d-b340-3bc1a7003f52/45bc0c6c-b104-4b82-8b18-12d0f7acc287
Provider iceberg
Table Properties [current-snapshot-id=1823638994492416744,format=iceberg/parquet,format-version=1,manifest-rewrite.submitted=2023-09-29T16:53:31.667746002Z,optimizer.enabled=true,write.delete.parquet.compression-codec=zstd,write.metadata.compression-codec=gzip,write.object-storage.enabled=true,write.parquet.compression-codec=zstd,write.summary.partition-limit=100]
```

Cleared the caches:

```
rm -rf ~/.m2/repository/org/apache/iceberg
rm -rf /Users/fokkodriesprong/.ivy2/cache
```

And confirmed:

```
::::::::::::::::::::::::::::::::::::::::::::::
:: UNRESOLVED DEPENDENCIES ::
::::::::::::::::::::::::::::::::::::::::::::::
:: org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.4.0-SNAPSHOT: not found
::::::::::::::::::::::::::::::::::::::::::::::
```

Ran ./gradlew publishToMavenLocal locally:

```
bin/spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.0-SNAPSHOT,software.amazon.awssdk:bundle:2.20.18 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
...
```

```
spark-sql ()> SELECT * FROM spielerij.null_or_not_null;
23/09/29 18:59:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
1
2
3
Time taken: 8.449 seconds, Fetched 3 row(s)
```

---

> We just updated it a week ago, but I think that's incorrect: #8600

I thought it was optional from the beginning, and hence asked about compatibility. I didn't know it was recently modified; compatibility shouldn't matter now.

> My point is that the code should follow the spec, and not the other way around

Agreed.
---

I should have mentioned that. I'll do some more checks and sync up with Eduard next week.
---

I talked with Ryan to get historical context. This change should be reverted, and I've created a separate PR, #8726, so we can already revert the spec change itself.

To be clear: reading an optional field using a required read schema is allowed. I've added a test as well (though this is probably already covered somewhere).
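A rough sketch of what such a test can look like, modeled on the write-then-project pattern in Iceberg's existing Avro tests; the field ID and name follow the spec, while the file handling and the value 42L are made up for illustration:

```java
import java.io.File;
import java.io.IOException;
import org.apache.avro.generic.GenericData.Record;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.avro.Avro;
import org.apache.iceberg.avro.AvroIterable;
import org.apache.iceberg.avro.AvroSchemaUtil;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.types.Types;

public class TestRequiredReadOfOptionalField {
  public static void main(String[] args) throws IOException {
    // The file is written with the field declared as optional (the old schema)...
    Schema writeSchema =
        new Schema(Types.NestedField.optional(503, "added_snapshot_id", Types.LongType.get()));
    // ...and read back with a required projection (the new schema).
    Schema readSchema =
        new Schema(Types.NestedField.required(503, "added_snapshot_id", Types.LongType.get()));

    File testFile = File.createTempFile("optional-write", ".avro");
    testFile.delete(); // the local OutputFile refuses to overwrite an existing file

    try (FileAppender<Record> writer =
        Avro.write(Files.localOutput(testFile)).schema(writeSchema).named("test").build()) {
      Record record = new Record(AvroSchemaUtil.convert(writeSchema, "test"));
      record.put("added_snapshot_id", 42L);
      writer.add(record);
    }

    // The required projection succeeds as long as every row carries a value.
    try (AvroIterable<Record> reader =
        Avro.read(Files.localInput(testFile)).project(readSchema).build()) {
      reader.forEach(record -> System.out.println(record.get("added_snapshot_id")));
    }
  }
}
```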
---

I'm not sure about this change. Check out the table property compatibility.snapshot-id-inheritance.enabled in the compatibility flags (https://iceberg.apache.org/docs/1.3.1/configuration/#compatibility-flags). Throughout the code (e.g. AppendFiles, RewriteManifests) I see manifest.snapshotId() == null checks combined with a check on whether snapshot ID inheritance is enabled. If snapshot ID inheritance is enabled, as far as I can tell, we don't actually write the field in the manifest file itself, so it would be optional. Could we try running some tests with this property set to true? By default it is false.
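A hedged sketch of the pattern being referred to; the method name and message below are paraphrased for illustration, not the literal code in AppendFiles or RewriteManifests:

```java
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.relocated.com.google.common.base.Preconditions;

class ManifestChecks {
  // With snapshot-id inheritance enabled, a null added_snapshot_id is legitimate:
  // readers inherit the value from the wrapping manifest-list entry instead.
  static void validateAddedManifest(ManifestFile manifest, boolean inheritanceEnabled) {
    if (manifest.snapshotId() == null) {
      Preconditions.checkArgument(
          inheritanceEnabled,
          "Cannot add manifest without a snapshot ID when inheritance is disabled: %s",
          manifest.path());
    }
  }
}
```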
```diff
   this.sequenceNumber = 0;
   this.minSequenceNumber = 0;
-  this.snapshotId = null;
+  this.snapshotId = 0L;
```
---

This seems a bit dangerous to me. I get that we're saying it's required, so we don't need the null value, but I looked around to see where manifestFile.snapshotId() == null checks are performed, and there are quite a few, for example in RewriteDataFiles and AppendFiles. This may break some of the existing logic there by exercising the other branch in the code.
---
Hmm, I thought that this was used only by tests, but there is one path from actual code that also uses this. Let me check.
---
This constructor is used when there is no manifest-list (v1 tables). That's quite rare nowadays, but we should still support it.
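Roughly, the V1 fallback in question has this shape (a paraphrase for illustration only; the constructor is package-private in core, and the names here are invented):

```java
// A V1 snapshot that predates manifest lists stores raw manifest paths in its
// metadata; each path is wrapped via this constructor, with no snapshot ID known yet.
static List<ManifestFile> loadV1Manifests(FileIO io, List<String> manifestLocations) {
  return manifestLocations.stream()
      .map(location -> new GenericManifestFile(io.newInputFile(location), 0 /* spec ID */))
      .collect(Collectors.toList());
}
```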
---

Hey @amogh-jahagirdar, I looked at it once more, and it looks like the snapshot ID is available, so there is no need to set it to 0L. Thanks for pointing this out!
Force-pushed from 9cf0dec to 61e1d89.
```diff
   ManifestFile.SEQUENCE_NUMBER.asRequired(),
   ManifestFile.MIN_SEQUENCE_NUMBER.asRequired(),
-  ManifestFile.SNAPSHOT_ID.asRequired(),
+  ManifestFile.SNAPSHOT_ID,
```
---
I believe that this is actually required because if the manifest isn't written with a snapshot ID, the manifest list's snapshot ID is inherited. And that must be present.
---

Yes, this is the case, but ManifestFile.SNAPSHOT_ID itself is already required now, so there is no need to mark it as required again. With the change in this PR it is required for both V1 and V2; we no longer need to mark it required when constructing the individual schemas for V1 and V2.
---

Keep in mind that the variable name is very confusing: it actually refers to the ADDED_SNAPSHOT_ID field.
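For reference, after this PR the shared constant looks roughly like this (the description string is paraphrased):

```java
// The Java constant is named SNAPSHOT_ID, but the spec field it defines (ID 503)
// is added_snapshot_id.
Types.NestedField SNAPSHOT_ID =
    required(503, "added_snapshot_id", Types.LongType.get(),
        "Snapshot ID that added the manifest");
```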
```diff
+    this(file, specId, null);
+  }
+
+  GenericManifestFile(InputFile file, int specId, Long snapshotId) {
```
---

I think I understand the rationale for adding this, but I'm pretty sure that we didn't do it this way on purpose. By adding this option, you can easily pass in the snapshot ID for a manifest file, so that you're deciding what snapshot it has when it is created.

This is something that we wanted to avoid so that people didn't accidentally corrupt metadata by setting the snapshot ID directly. Instead, callers can only construct manifests with snapshot IDs that will be inherited from manifest-list metadata. At least, that's the idea. It's probably possible to work around it, but I think it is more correct for people to not have the option when creating the manifest metadata.

Also, I think this is something we use so that you can write a manifest of files (like how Flink stores checkpoint state) and later add it to the table once the snapshot ID is known.
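A minimal sketch of that deferred-snapshot-ID flow, assuming the public ManifestFiles and AppendFiles APIs; spec, outputFile, dataFile, and table are assumed to already exist:

```java
// Write the manifest without a snapshot ID: passing null defers added_snapshot_id so
// that it is inherited from the manifest-list entry once the manifest is committed.
ManifestWriter<DataFile> writer = ManifestFiles.write(2, spec, outputFile, null);
try {
  writer.add(dataFile);
} finally {
  writer.close();
}
ManifestFile manifest = writer.toManifestFile();

// Later, e.g. when restoring Flink checkpoint state, attach it to an actual snapshot:
table.newFastAppend().appendManifest(manifest).commit();
```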
---

This change was required because otherwise we'd end up writing a null for the added_snapshot_id. This method is only used in V1 cases, mostly in the test folder (as the PR shows), but also when there is no manifest-list and the metadata points directly at the manifests.

I've reverted this change and used copyOf instead. Let me know what you think.
---
copyOf loses the reference to the file, and therefore the length will be null, resulting in an NPE. Let me dig into this.
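To illustrate the failure mode described here (hedged: variable names are invented, and this assumes copyOf copies only the getter values):

```java
// A manifest backed directly by a file derives its length lazily from the InputFile.
GenericManifestFile fromFile = new GenericManifestFile(inputFile, specId);

// The copy no longer holds the InputFile, so a length that was never materialized
// comes back as a null Long.
ManifestFile copy = GenericManifestFile.copyOf(fromFile).withSnapshotId(snapshotId).build();
long length = copy.length(); // unboxing the null Long throws a NullPointerException
```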

---

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that's incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

---

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.


This is required in the spec, but optional in the code.
Might want to check if it reads fields that are written as optional.