Support Distinct Counts in Manifest #1613

Open
jpugliesi wants to merge 1 commit into base: pyiceberg-0.8.x

jpugliesi

No description provided.

@Fokko
Contributor

Fokko commented Feb 5, 2025

@jpugliesi Thanks for raising this. Unfortunately, the distinct_counts field has been deprecated: apache/iceberg#767

@jpugliesi
Author

@Fokko ah thanks for pointing this out... I was just referencing the spec, which does not indicate distinct_counts is deprecated. It looks like others have tried to update the spec/docs, but for some reason those changes weren't merged. Any idea why the spec won't be updated to reflect this?

@Fokko
Contributor

Fokko commented Feb 5, 2025

@jpugliesi I agree that it should be indicated in the spec, sorry about that. I was pretty convinced that it was marked as deprecated at some point. Let's follow up on apache/iceberg#12183

Quick question, apache/iceberg#12183 mentioned:

PyIceberg fails to read data_file Avro objects containing this field, and #1613.

Do you know if this is the case? It should just silently ignore the field and skip over it when reading.
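To illustrate what "silently ignore the field and skip over it" means, here is a minimal sketch of a reader that consumes every field the writer stored but keeps only the fields it knows about. It is not PyIceberg's actual reader: the byte encoding is a simplified stand-in for Avro, the field list is illustrative, and names such as `WRITER_FIELDS`, `READER_FIELD_IDS`, `encode_record` and `decode_record` are hypothetical.

```python
import io
import struct

# Hypothetical writer schema: (field_id, name, type) in file order.
# distinct_counts is simplified to a single long here; in the spec it is a map.
WRITER_FIELDS = [
    (100, "file_path", "string"),
    (103, "record_count", "long"),
    (111, "distinct_counts", "long"),
    (104, "file_size_in_bytes", "long"),
]

# Field IDs the (hypothetical) reader schema knows about; 111 is absent.
READER_FIELD_IDS = {100, 103, 104}


def write_value(out, kind, value):
    """Toy encoding (not Avro): 8-byte longs, length-prefixed UTF-8 strings."""
    if kind == "long":
        out.write(struct.pack(">q", value))
    else:
        data = value.encode("utf-8")
        out.write(struct.pack(">I", len(data)))
        out.write(data)


def read_value(src, kind):
    """Inverse of write_value."""
    if kind == "long":
        return struct.unpack(">q", src.read(8))[0]
    (length,) = struct.unpack(">I", src.read(4))
    return src.read(length).decode("utf-8")


def encode_record(record):
    """Serialize a record in writer-schema order."""
    out = io.BytesIO()
    for _, name, kind in WRITER_FIELDS:
        write_value(out, kind, record[name])
    return out.getvalue()


def decode_record(payload):
    """Walk the writer schema, keeping only fields the reader knows."""
    src = io.BytesIO(payload)
    result = {}
    for field_id, name, kind in WRITER_FIELDS:
        value = read_value(src, kind)  # always consume the bytes
        if field_id in READER_FIELD_IDS:
            result[name] = value  # unknown fields are dropped silently
    return result


payload = encode_record(
    {
        "file_path": "s3://bucket/data.parquet",
        "record_count": 10,
        "distinct_counts": 7,
        "file_size_in_bytes": 2048,
    }
)
decoded = decode_record(payload)
print(decoded)
```

The key point is that the unknown field's bytes are still consumed, so the fields after it decode at the right offsets; the value is simply never surfaced to the caller.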

cc @jacobmarble

@jacobmarble

It should just silently ignore the field and skip over it when reading.

FWIW, just yesterday I used PyIceberg to read some manifests generated by Starburst in February 2024 (one year ago), and it threw an exception with no apparent workaround. I've attached those manifests here.

starburst sample 2024-feb.zip

@Fokko
Contributor

Fokko commented Feb 6, 2025

@jacobmarble I'm able to read both of them:

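# Imports added for completeness; module paths assume PyIceberg 0.8.x.
from pyiceberg.avro.file import AvroFile
from pyiceberg.io import load_file_io
from pyiceberg.manifest import (
    MANIFEST_ENTRY_SCHEMAS,
    DataFile,
    DataFileContent,
    FileFormat,
    ManifestEntry,
    ManifestEntryStatus,
)
from pyiceberg.table.snapshots import Operation, Snapshot, Summary


def test_starburst():
    io = load_file_io()

    # Snapshot built by hand so that it points at the attached manifest list.
    snapshot = Snapshot(
        snapshot_id=25,
        parent_snapshot_id=19,
        timestamp_ms=1602638573590,
        manifest_list="/Users/fokko.driesprong/Downloads/starburst.sample.2024-feb/manifest-list-starburst.avro",
        summary=Summary(Operation.APPEND),
        schema_id=3,
    )
    entries = list(snapshot.manifests(io))
    entries  # no-op expression, handy for inspecting in a debugger

    # Read the attached manifest file directly with the v2 manifest entry schema.
    input_file = io.new_input("/Users/fokko.driesprong/Downloads/starburst.sample.2024-feb/manifest-starburst.avro")
    with AvroFile[ManifestEntry](
            input_file,
            MANIFEST_ENTRY_SCHEMAS[2],
            read_types={-1: ManifestEntry, 2: DataFile},
            read_enums={0: ManifestEntryStatus, 101: FileFormat, 134: DataFileContent},
    ) as reader:
        data_files = list(reader)

    data_files  # no-op expression, handy for inspecting in a debugger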

Can you share the stacktrace you're seeing?
