Return an empty dict if nan values is not provided by the catalog #1575

summermousa-vendia
Contributor

@summermousa-vendia summermousa-vendia commented Jan 24, 2025

Fixes: #1574
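
As a rough illustration of the behavior named in the title (not the actual diff), the sketch below treats a missing NaN-count map as an empty dict and a missing key as `None`, so a lookup cannot raise `KeyError`. The helper name `nan_count_for` and the plain-dict inputs are hypothetical stand-ins for the library's real data structures.

```python
from typing import Dict, Optional


def nan_count_for(field_id: int, nan_value_counts: Optional[Dict[int, int]]) -> Optional[int]:
    """Look up a NaN count without assuming the map, or the key, exists.

    `nan_count_for` is a hypothetical helper used only for illustration.
    """
    # A catalog may not provide NaN counts at all: fall back to an empty dict.
    counts = nan_value_counts or {}
    # Some field IDs (for example string columns) may have no entry: return None.
    return counts.get(field_id)


# Hypothetical inputs: no map at all, then a map that lacks field ID 2.
print(nan_count_for(1, None))          # None
print(nan_count_for(2, {1: 0, 3: 0}))  # None
print(nan_count_for(3, {1: 0, 3: 0}))  # 0
```

The demo below shows the same shape: `nan_value_counts` carries keys 1 and 3 only, and the readable metric for the string column reports a null NaN count.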

Demo (redacted):

>>> table.inspect.entries()
pyarrow.Table
status: int8 not null
snapshot_id: int64 not null
sequence_number: int64 not null
file_sequence_number: int64 not null
data_file: struct<content: int8 not null, file_path: string not null, file_format: string not null, partition: struct<> not null, record_count: int64 not null, file_size_in_bytes: int64 not null, column_sizes: map<int32, int64>, value_counts: map<int32, int64>, null_value_counts: map<int32, int64>, nan_value_counts: map<int32, int64>, lower_bounds: map<int32, binary>, upper_bounds: map<int32, binary>, key_metadata: binary, split_offsets: list<item: int64>, equality_ids: list<item: int32>, sort_order_id: int32> not null
  child 0, content: int8 not null
  child 1, file_path: string not null
  child 2, file_format: string not null
  child 3, partition: struct<> not null
  child 4, record_count: int64 not null
  child 5, file_size_in_bytes: int64 not null
  child 6, column_sizes: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 7, value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 8, null_value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 9, nan_value_counts: map<int32, int64>
      child 0, entries: struct<key: int32 not null, value: int64> not null
          child 0, key: int32 not null
          child 1, value: int64
  child 10, lower_bounds: map<int32, binary>
      child 0, entries: struct<key: int32 not null, value: binary> not null
          child 0, key: int32 not null
          child 1, value: binary
  child 11, upper_bounds: map<int32, binary>
      child 0, entries: struct<key: int32 not null, value: binary> not null
          child 0, key: int32 not null
          child 1, value: binary
  child 12, key_metadata: binary
  child 13, split_offsets: list<item: int64>
      child 0, item: int64
  child 14, equality_ids: list<item: int32>
      child 0, item: int32
  child 15, sort_order_id: int32
readable_metrics: struct<age: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null, name: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string> not null, weight: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null>
  child 0, age: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: double
      child 5, upper_bound: double
  child 1, name: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: large_string
      child 5, upper_bound: large_string
  child 2, weight: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double> not null
      child 0, column_size: int64
      child 1, value_count: int64
      child 2, null_value_count: int64
      child 3, nan_value_count: int64
      child 4, lower_bound: double
      child 5, upper_bound: double
----
status: [[1,1,1,1,1,1,1,1,1]]
snapshot_id: [[1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061,1838977523369912061]]
sequence_number: [[2,2,2,2,2,2,2,2,2]]
file_sequence_number: [[2,2,2,2,2,2,2,2,2]]
data_file: [
  -- is_valid: all not null
  -- child 0 type: int8
[0,0,0,0,0,0,0,0,0]
  -- child 1 type: string
["s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***","s3://***"]
  -- child 2 type: string
["PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET","PARQUET"]
  -- child 3 type: struct<>
    -- is_valid: all not null
  -- child 4 type: int64
[1,1,1,1,1,1,1,1,1]
  -- child 5 type: int64
[991,992,985,963,984,971,957,978,992]
  -- child 6 type: map<int32, int64>
[keys:[1,2,3]values:[46,55,45],keys:[1,2,3]values:[46,55,46],keys:[1,2,3]values:[46,54,46],keys:[1,2,3]values:[46,50,46],keys:[1,2,3]values:[45,54,46],keys:[1,2,3]values:[46,52,46],keys:[1,2,3]values:[46,50,46],keys:[1,2,3]values:[46,53,46],keys:[1,2,3]values:[46,55,46]]
  -- child 7 type: map<int32, int64>
[keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1],keys:[1,2,3]values:[1,1,1]]
  -- child 8 type: map<int32, int64>
[keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0],keys:[1,2,3]values:[0,0,0]]
  -- child 9 type: map<int32, int64>
[keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0],keys:[1,3]values:[0,0]]
  -- child 10 type: map<int32, binary>
[keys:[1,2,3]values:[0000000000003640,436861726C6965204461766973,0000000000406540],keys:[1,2,3]values:[0000000000804640,42696C6C792042757463686572,0000000000006940],keys:[1,2,3]values:[0000000000804040,48616E6E616820477265656E,0000000000406040],keys:[1,2,3]values:[0000000000804140,426F622042726F776E,0000000000006940],keys:[1,2,3]values:[0000000000004440,47656F72676520426C61636B,0000000000406A40],keys:[1,2,3]values:[0000000000003940,4A616E6520536D697468,0000000000806140],keys:[1,2,3]values:[0000000000003E40,4A6F686E20446F65,0000000000806640],keys:[1,2,3]values:[0000000000003B40,456D696C79205768697465,0000000000006440],keys:[1,2,3]values:[0000000000003C40,416C696365204A6F686E736F6E,0000000000C06240]]
  -- child 11 type: map<int32, binary>
[keys:[1,2,3]values:[0000000000003640,436861726C6965204461766973,0000000000406540],keys:[1,2,3]values:[0000000000804640,42696C6C792042757463686572,0000000000006940],keys:[1,2,3]values:[0000000000804040,48616E6E616820477265656E,0000000000406040],keys:[1,2,3]values:[0000000000804140,426F622042726F776E,0000000000006940],keys:[1,2,3]values:[0000000000004440,47656F72676520426C61636B,0000000000406A40],keys:[1,2,3]values:[0000000000003940,4A616E6520536D697468,0000000000806140],keys:[1,2,3]values:[0000000000003E40,4A6F686E20446F65,0000000000806640],keys:[1,2,3]values:[0000000000003B40,456D696C79205768697465,0000000000006440],keys:[1,2,3]values:[0000000000003C40,416C696365204A6F686E736F6E,0000000000C06240]]
  -- child 12 type: binary
[null,null,null,null,null,null,null,null,null]
  -- child 13 type: list<item: int64>
[[4],[4],...,[4],[4]]
  -- child 14 type: list<item: int32>
[null,null,...,null,null]
  -- child 15 type: int32
[0,0,0,0,0,0,0,0,0]]
readable_metrics: [
  -- is_valid: all not null
  -- child 0 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double>
    -- is_valid: all not null
    -- child 0 type: int64
[46,46,46,46,45,46,46,46,46]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 4 type: double
[22,45,33,35,40,25,30,27,28]
    -- child 5 type: double
[22,45,33,35,40,25,30,27,28]
  -- child 1 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: large_string, upper_bound: large_string>
    -- is_valid: all not null
    -- child 0 type: int64
[55,55,54,50,54,52,50,53,55]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[null,null,null,null,null,null,null,null,null]
    -- child 4 type: large_string
["Charlie Davis","Billy Butcher","Hannah Green","Bob Brown","George Black","Jane Smith","John Doe","Emily White","Alice Johnson"]
    -- child 5 type: large_string
["Charlie Davis","Billy Butcher","Hannah Green","Bob Brown","George Black","Jane Smith","John Doe","Emily White","Alice Johnson"]
  -- child 2 type: struct<column_size: int64, value_count: int64, null_value_count: int64, nan_value_count: int64, lower_bound: double, upper_bound: double>
    -- is_valid: all not null
    -- child 0 type: int64
[45,46,46,46,46,46,46,46,46]
    -- child 1 type: int64
[1,1,1,1,1,1,1,1,1]
    -- child 2 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 3 type: int64
[0,0,0,0,0,0,0,0,0]
    -- child 4 type: double
[170,200,130,200,210,140,180,160,150]
    -- child 5 type: double
[170,200,130,200,210,140,180,160,150]]
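
To help read the output above: the raw `lower_bounds` and `upper_bounds` values look like 8-byte little-endian doubles for field IDs 1 and 3 and UTF-8 bytes for field ID 2, which would line up with the decoded values in `readable_metrics`. That encoding is an assumption inferred from the demo rather than something stated in this PR, and the helper names below are hypothetical.

```python
import struct


def decode_double(hex_value: str) -> float:
    """Interpret a hex string as an 8-byte little-endian IEEE 754 double (assumed encoding)."""
    return struct.unpack("<d", bytes.fromhex(hex_value))[0]


def decode_string(hex_value: str) -> str:
    """Interpret a hex string as UTF-8 text (assumed encoding)."""
    return bytes.fromhex(hex_value).decode("utf-8")


# First data file in the demo: bounds for field IDs 1, 2, and 3.
print(decode_double("0000000000003640"))            # 22.0
print(decode_string("436861726C6965204461766973"))  # Charlie Davis
print(decode_double("0000000000406540"))            # 170.0
```

Each file in the demo holds a single record, which is why the lower and upper bounds are identical.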

Contributor

@kevinjqliu kevinjqliu left a comment

Makes sense, thanks for the contribution!

@kevinjqliu kevinjqliu merged commit 7be5cf2 into apache:main Jan 26, 2025
7 checks passed
@summermousa-vendia summermousa-vendia deleted the ISSUE-1574_support_optional_nan_values branch January 27, 2025 13:44
@summermousa-vendia
Contributor Author

Thank you for the quick turnaround on the review. Do you know when this might be released?

@kevinjqliu
Contributor

Hi @summermousa-vendia, this will be part of the next release (0.9.0). I don't have a timeline yet, but it should be soon. There's a community sync tomorrow, and I'll bring this up.

Development

Successfully merging this pull request may close these issues.

KeyError raised when calling inspect.entries()
2 participants