
Core: Mark 503: added_snapshot_id as required #8673

Closed
Fokko wants to merge 11 commits into apache:main from Fokko:fd-mark-as-required

Conversation

@Fokko (Contributor) commented Sep 28, 2023

This is required in the spec, but optional in the code.

We might want to check that fields written as optional can still be read with the required read schema.

"Lowest sequence number in the manifest");
Types.NestedField SNAPSHOT_ID =
optional(
required(
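For context, the resulting declaration in the ManifestFile interface would look roughly like the following. Only the switch from optional to required and the field id 503 are taken from this PR; the description string and exact formatting are assumptions.

  // Sketch only: added_snapshot_id (field 503) declared as a required long
  Types.NestedField SNAPSHOT_ID =
      required(503, "added_snapshot_id", Types.LongType.get(), "Snapshot ID that added the manifest");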
Member

How do we maintain compatibility when reading old manifest files?

The manifest reader will use the new schema to try to read old files, and it may fail if the data was not written?

Are we sure we were always writing the data even for this optional field?

Member

> Are we sure we were always writing the data even for this optional field?

I need to check the code for this. Doing something else in parallel today. Will check it tomorrow.

Contributor Author

Good question. I haven't seen any cases where there is no data. The thing I want to double-check is whether we can read a field as required when the write schema marks it as optional.

@Fokko (Contributor Author) Sep 29, 2023

spark-sql ()> select * from examples.nyc_taxi_yellow limit 22;
23/09/29 15:28:54 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
2	2003-03-27 06:07:28	193	2003-03-27 22:30:46	193	1	0.0	1	1	3.31	2.5	0.01	0.0	0.5	0.3	0.0	0.0	N
2	2009-01-02 01:17:42	193	2009-01-02 18:19:52	193	1	0.0	1	1	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	N
2	2009-01-01 08:08:07	193	2009-01-01 08:08:12	193	1	0.0	1	2	3.3	2.5	0.0	0.0	0.5	0.3	0.0	0.0	N
2	2009-01-01 09:34:29	132	2009-01-01 09:52:13	203	1	4.99	1	2	17.3	16.0	0.0	0.0	0.5	0.3	0.0	0.5	N
2	2009-01-01 09:46:10	132	2009-01-01 10:16:52	39	2	9.36	1	2	29.3	28.5	0.0	0.0	0.5	0.3	0.0	0.0	N
2	2009-01-01 11:26:49	7	2009-01-01 23:37:23	7	1	0.0	1	2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	N
2	2009-01-02 03:33:44	193	2009-01-02 03:34:54	193	1	0.0	1	2	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	N
2	2020-11-30 22:23:38	138	2020-12-01 21:44:12	116	1	8.64	1	1	37.92	26.5	4.0	6.12	0.5	0.3	0.0	0.5	N
NULL	2020-12-01 09:01:00	42	2020-12-01 09:08:00	41	NULL	1.4	NULL	NULL	10.31	9.51	0.0	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:02:49	230	2020-12-01 09:34:29	61	NULL	9.69	NULL	NULL	37.9	34.35	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:04:00	215	2020-12-01 09:07:00	215	NULL	0.51	NULL	NULL	59.84	56.29	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:06:00	192	2020-12-01 09:27:00	70	NULL	3.83	NULL	NULL	25.5	24.7	0.0	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:07:00	136	2020-12-01 09:21:00	243	NULL	3.03	NULL	NULL	22.0	18.45	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:07:28	137	2020-12-01 09:23:35	69	NULL	8.09	NULL	NULL	35.59	32.04	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:08:00	242	2020-12-01 09:25:00	18	NULL	4.55	NULL	NULL	23.85	20.3	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:08:00	255	2020-12-01 09:24:00	146	NULL	4.74	NULL	NULL	19.43	15.05	3.58	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:08:37	61	2020-12-01 09:36:29	137	NULL	9.79	NULL	NULL	30.95	27.65	0.0	0.0	0.5	0.3	2.5	0.0	NULL
NULL	2020-12-01 09:09:00	168	2020-12-01 09:20:00	212	NULL	3.52	NULL	NULL	25.5	21.95	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:12:10	140	2020-12-01 09:31:20	213	NULL	8.99	NULL	NULL	34.82	19.03	2.75	12.24	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:14:00	239	2020-12-01 09:43:00	49	NULL	9.84	NULL	NULL	36.36	28.85	4.21	0.0	0.5	0.3	2.5	0.0	NULL
NULL	2020-12-01 09:16:00	226	2020-12-01 09:41:00	219	NULL	14.79	NULL	NULL	48.17	44.62	2.75	0.0	0.5	0.3	0.0	0.0	NULL
NULL	2020-12-01 09:24:00	140	2020-12-01 09:44:00	116	NULL	5.51	NULL	NULL	30.08	21.9	4.88	0.0	0.5	0.3	2.5	0.0	NULL
Time taken: 16.631 seconds, Fetched 22 row(s)

It just works

Ran this after checking out this branch:

bin/spark-sql  \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.0-SNAPSHOT,software.amazon.awssdk:bundle:2.20.18 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.defaultCatalog=sandbox \
  --conf spark.sql.catalog.sandbox=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.sandbox.catalog-impl=org.apache.iceberg.rest.RESTCatalog \
...

Contributor Author

cc @nastra what do you think?

@Fokko (Contributor Author) Sep 29, 2023

My point is that the code should follow the spec, not the other way around. If we had been writing nulls, then we couldn't make the field required anymore, but my experiment below suggests that isn't the case. We just updated the spec a week ago, but I think that change was incorrect: #8600

In the current docs it is still marked as required for both versions:

[screenshot: the docs spec table showing added_snapshot_id as required]

bin/spark-sql  \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.3.1,software.amazon.awssdk:bundle:2.20.18 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions

...

spark-sql (spielerij)> create table null_or_not_null AS SELECT 1 as id UNION ALL select 2 UNION ALL select 3;
spark-sql (spielerij)> describe table extended null_or_not_null;
id                  	int                 	                    
                    	                    	                    
# Metadata Columns  	                    	                    
_spec_id            	int                 	                    
_partition          	struct<>            	                    
_file               	string              	                    
_pos                	bigint              	                    
_deleted            	boolean             	                    
                    	                    	                    
# Detailed Table Information	                    	                    
Name                	sandbox.spielerij.null_or_not_null	                    
Type                	MANAGED             	                    
Location            	s3://tabular-wh-us-east-1/6efbcaf4-21ae-499d-b340-3bc1a7003f52/45bc0c6c-b104-4b82-8b18-12d0f7acc287	                    
Provider            	iceberg             	                    
Table Properties    	[current-snapshot-id=1823638994492416744,format=iceberg/parquet,format-version=1,manifest-rewrite.submitted=2023-09-29T16:53:31.667746002Z,optimizer.enabled=true,write.delete.parquet.compression-codec=zstd,write.metadata.compression-codec=gzip,write.object-storage.enabled=true,write.parquet.compression-codec=zstd,write.summary.partition-limit=100]

Cleared the caches:

rm -rf ~/.m2/repository/org/apache/iceberg
rm -rf /Users/fokkodriesprong/.ivy2/cache

And confirmed:

		::::::::::::::::::::::::::::::::::::::::::::::

		::          UNRESOLVED DEPENDENCIES         ::

		::::::::::::::::::::::::::::::::::::::::::::::

		:: org.apache.iceberg#iceberg-spark-runtime-3.5_2.12;1.4.0-SNAPSHOT: not found

		::::::::::::::::::::::::::::::::::::::::::::::

Ran ./gradlew publishToMavenLocal locally:

bin/spark-sql  \                                                                                       
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.4.0-SNAPSHOT,software.amazon.awssdk:bundle:2.20.18 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \


spark-sql ()> SELECT * FROM spielerij.null_or_not_null;
23/09/29 18:59:01 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
1
2
3
Time taken: 8.449 seconds, Fetched 3 row(s)

@ajantha-bhat (Member) Sep 29, 2023

> We just updated it a week ago, but I think that's incorrect: #8600

I thought it was optional from the beginning and hence asked about compatibility. Didn't know it was recently modified. Compatibility shouldn't matter now.

> My point is that the code should follow the spec, and not the other way around

Agree.

Member

Can we also revert the spec change from #8600?

@Fokko (Contributor Author) Sep 29, 2023

I should have mentioned that. I'll do some more checks and sync up with Eduard next week.

Contributor Author

I talked with Ryan to get historical context, and this should be reverted. I created a separate PR, #8726, so we can already revert the spec change itself.

To be clear: reading an optional field using a required read schema is allowed. I've added a test as well (though this is probably already covered somewhere).
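Not the test added in this PR, but an illustrative, self-contained Avro sketch of the behavior being relied on: data written with an optional (nullable) writer schema can be read back with a required reader schema, as long as a value was actually written. The record and field names here are for illustration only.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

public class OptionalWriteRequiredRead {
  public static void main(String[] args) throws Exception {
    // Writer schema: the field is optional (union with null), as older writers declared it
    Schema writeSchema = SchemaBuilder.record("manifest_file").fields()
        .optionalLong("added_snapshot_id")
        .endRecord();
    // Reader schema: the same field, but required, as the spec mandates
    Schema readSchema = SchemaBuilder.record("manifest_file").fields()
        .requiredLong("added_snapshot_id")
        .endRecord();

    // Write a record that actually carries a value for the optional field
    GenericRecord record = new GenericData.Record(writeSchema);
    record.put("added_snapshot_id", 1234567890L);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(writeSchema).write(record, encoder);
    encoder.flush();

    // Resolve the optional writer schema against the required reader schema and read it back
    BinaryDecoder decoder =
        DecoderFactory.get().binaryDecoder(new ByteArrayInputStream(out.toByteArray()), null);
    GenericRecord result =
        new GenericDatumReader<GenericRecord>(writeSchema, readSchema).read(null, decoder);
    System.out.println(result); // {"added_snapshot_id": 1234567890}
  }
}

Reading only fails if the writer actually stored a null, which is exactly what the experiments above suggest never happens for added_snapshot_id.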

@Fokko added the Specification label (Issues that may introduce spec changes.) on Oct 2, 2023
@amogh-jahagirdar (Contributor) left a comment

I'm not sure about this change. Check out the table property compatibility.snapshot-id-inheritance.enabled (https://iceberg.apache.org/docs/1.3.1/configuration/#compatibility-flags). Throughout the code (e.g. AppendFiles, RewriteManifests) I see manifest.snapshotId() == null checks combined with checks for whether snapshot ID inheritance is enabled. If snapshot ID inheritance is enabled, as far as I can tell we don't actually write the field in the manifest file itself, so it would be optional. Could we try running some tests with this property set to true? By default it's false.
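For reference, flipping that flag on an existing table through the Java API would look roughly like this (a minimal sketch, not part of this PR; the Table handle and class name are illustrative):

import org.apache.iceberg.Table;

class EnableSnapshotIdInheritance {
  // Hedged sketch: enable the compatibility flag mentioned above so tests exercise
  // the path where the added snapshot ID is not written and has to be inherited.
  static void enable(Table table) {
    table.updateProperties()
        .set("compatibility.snapshot-id-inheritance.enabled", "true")
        .commit();
  }
}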

    this.sequenceNumber = 0;
    this.minSequenceNumber = 0;
-   this.snapshotId = null;
+   this.snapshotId = 0L;
Contributor

This seems a bit dangerous to me. I get that we're saying it's required, so we don't need the null value, but I looked around to see where manifestFile.snapshotId() == null checks are performed, and there are quite a few, for example in RewriteDataFiles and AppendFiles. This may break some of the existing logic there by exercising the other branch in the code?
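Illustrative only, not the actual AppendFiles/RewriteDataFiles code: the kind of branch being referred to, where a null snapshot ID currently means "inherit the ID of the committing snapshot". Making the value always non-null could change which branch runs.

import org.apache.iceberg.ManifestFile;

class SnapshotIdResolution {
  // Hypothetical sketch of the existing pattern around manifestFile.snapshotId() == null
  static long effectiveSnapshotId(ManifestFile manifest, long commitSnapshotId) {
    Long added = manifest.snapshotId();
    if (added == null) {
      return commitSnapshotId; // not written yet; inherit from the commit / manifest list
    }
    return added; // keep the snapshot that originally added this manifest
  }
}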

Contributor Author

Hmm, I thought that this was used only by tests, but there is one path from actual code that also uses this. Let me check.

@Fokko (Contributor Author) Oct 5, 2023

This constructor is used when there is no manifest-list (v1 tables). That's quite rare nowadays, but we should still support it.

Contributor Author

Hey @amogh-jahagirdar, I looked at it once more, and it looks like the snapshot ID is available, so there is no need to set it to 0L. Thanks for pointing this out.

@Fokko force-pushed the fd-mark-as-required branch from 9cf0dec to 61e1d89 on October 5, 2023 19:32
    ManifestFile.SEQUENCE_NUMBER.asRequired(),
    ManifestFile.MIN_SEQUENCE_NUMBER.asRequired(),
-   ManifestFile.SNAPSHOT_ID.asRequired(),
+   ManifestFile.SNAPSHOT_ID,
Contributor

I believe that this is actually required because if the manifest isn't written with a snapshot ID, the manifest list's snapshot ID is inherited. And that must be present.

Contributor Author

Yes, this is the case, but ManifestFile.SNAPSHOT_ID is already required:

https://github.com/apache/iceberg/pull/8673/files#diff-063af8055f56892097a498f7de319845da2e0d328f8241a5fc1b88a0822ac08fL51-L52

There is no need to mark it as required again. With the change in this PR it is required for both V1 and V2, so we no longer need to mark it as required when constructing the individual V1 and V2 schemas.

Contributor Author

Keep in mind that the variable name is very confusing; it actually refers to the ADDED_SNAPSHOT_ID field.

+     this(file, specId, null);
    }

+   GenericManifestFile(InputFile file, int specId, Long snapshotId) {
Contributor

I think I understand the rationale for adding this, but I'm pretty sure that we didn't do it this way on purpose. By adding this option, you can easily pass in the snapshot ID for a manifest file, so you're deciding what snapshot it belongs to at the time it is created.

This is something that we wanted to avoid so that people didn't accidentally corrupt metadata by setting the snapshot ID directly. Instead, callers can only construct manifests with snapshot IDs that will be inherited from manifest list metadata. At least, that's the idea. It's probably possible to work around it, but I think it is more correct for people to not have the option when creating the manifest metadata.

Also, I think this is something we use so that you can write a manifest of files (like how Flink stores checkpoint state) and later add it to the table once the snapshot ID is known.
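Roughly, that workflow looks like the following against the public API; a hedged sketch, with the table handle, output location, and method name being illustrative:

import org.apache.iceberg.AppendFiles;
import org.apache.iceberg.DataFile;
import org.apache.iceberg.ManifestFile;
import org.apache.iceberg.ManifestFiles;
import org.apache.iceberg.ManifestWriter;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.OutputFile;

class DeferredManifestCommit {
  // Sketch: write a manifest before the snapshot ID is known, then append it later;
  // the added_snapshot_id is assigned/inherited via the manifest list at commit time.
  static void writeThenCommit(Table table, OutputFile manifestLocation, Iterable<DataFile> files)
      throws java.io.IOException {
    ManifestWriter<DataFile> writer = ManifestFiles.write(table.spec(), manifestLocation);
    try {
      for (DataFile file : files) {
        writer.add(file); // no snapshot ID is set here
      }
    } finally {
      writer.close();
    }
    ManifestFile manifest = writer.toManifestFile();

    // Later, once the commit happens, the snapshot ID becomes known:
    AppendFiles append = table.newAppend();
    append.appendManifest(manifest);
    append.commit();
  }
}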

Contributor Author

This change was required; otherwise we'd end up writing a null for the added_snapshot_id.

This method is only used for V1, mostly in the test folder (as the PR shows), but also when there is no manifest-list and the metadata points directly to the manifest:

[screenshot: usages of this code path]

I've reverted this change and used copyOf instead. Let me know what you think.

Contributor Author

copyOf loses the reference to the file, and therefore the length will be null, resulting in an NPE. Let me dig into this.

@Fokko added this to the Iceberg 2.0.0 milestone on Oct 17, 2023
@github-actions

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@iceberg.apache.org list. Thank you for your contributions.

@github-actions bot added the stale label on Sep 20, 2024
@github-actions

This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions bot closed this on Sep 27, 2024

Labels

API, core, Specification (Issues that may introduce spec changes.), stale
