
Add ResidualVisitor to compute residuals #1388

Open

wants to merge 27 commits into base: main
Conversation

tusharchou

closes issue: Count rows as a metadata-only operation #1223

@jayceslesar (Contributor)

Question: Does it make sense to expose this as the __len__ dunder method, since this is Python? It would just return self.count().
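A minimal sketch of that idea, with a toy stand-in class (the names and constructor here are hypothetical, not pyiceberg's actual DataScan API):

```python
class DataScan:
    """Toy stand-in for a scan object; only illustrates the __len__ idea."""

    def __init__(self, row_counts):
        # row_counts: hypothetical per-file row counts this scan would touch.
        self._row_counts = row_counts

    def count(self) -> int:
        # Metadata-only count: sum per-file counts without reading data.
        return sum(self._row_counts)

    def __len__(self) -> int:
        # The dunder just delegates, so len(scan) works as proposed.
        return self.count()


scan = DataScan([100, 250, 7])
assert len(scan) == scan.count() == 357
```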

* added residual evaluator in plan files

* tested counts with positional deletes

* merged main

* implemented batch reader in count

* breaking integration test

* fixed integration test

* git pull main

* revert

* revert

* revert test_partitioning_key.py

* revert test_parser.py

* added residual evaluator in visitor

* deleted residual_evaluator.py

* removed test count from test_sql.py

* ignored lint type

* fixed lint

* working on plan_files

* type ignored

* make lint
@tusharchou (Author)

Hi @Fokko @kevinjqliu @gli-chris-hao ,

I have implemented these suggestions with my best understanding.

  • residual evaluator
  • positional deletes
  • batch processing of files larger than 512 MB

It would be helpful to get a fresh review.

@Fokko Fokko changed the title Count rows as a metadata only operation Add ResidualVisitor to compute residuals Jan 13, 2025
@Fokko (Contributor) left a comment

This is great @tusharchou, thanks for working on this. I left some comments, but this is a great start 🚀

Review threads on: pyiceberg/expressions/visitors.py and pyiceberg/table/__init__.py (several marked resolved or outdated).
* explicit delete files len is zero

* residual eval only if manifest is true

* default residual is always true

* used projection schema

* refactored residual in plan files
* fixed lint issue with isnan
Comment on lines +1746 to +1753
1. If d > day(a) and d < day(b), the residual is always true
2. If d == day(a) and d != day(b), the residual is utc_timestamp >= a
3. if d == day(b) and d != day(a), the residual is utc_timestamp <= b
4. If d == day(a) == day(b), the residual is utc_timestamp >= a and utc_timestamp <= b

Partition data is passed using StructLike. Residuals are returned by residualFor(StructLike).

This class is thread-safe.

Suggested change
1. If d > day(a) and d < day(b), the residual is always true
2. If d == day(a) and d != day(b), the residual is utc_timestamp >= a
3. if d == day(b) and d != day(a), the residual is utc_timestamp <= b
4. If d == day(a) == day(b), the residual is utc_timestamp >= a and utc_timestamp <= b
Partition data is passed using StructLike. Residuals are returned by residualFor(StructLike).
This class is thread-safe.
1. If d > day(a) and d < day(b), the residual is always true
2. If d == day(a) and d != day(b), the residual is utc_timestamp > a
3. if d == day(b) and d != day(a), the residual is utc_timestamp < b
4. If d == day(a) == day(b), the residual is utc_timestamp > a and utc_timestamp < b
Partition data is passed using StructLike. Residuals are returned by residualFor(StructLike).


A residual expression is made by partially evaluating an expression using partition values.
For example, if a table is partitioned by day(utc_timestamp) and is read with a filter expression
utc_timestamp >= a and utc_timestamp <= b, then there are 4 possible residuals expressions

Suggested change
utc_timestamp >= a and utc_timestamp <= b, then there are 4 possible residuals expressions
utc_timestamp > a and utc_timestamp < b, then there are 4 possible residuals expressions
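To make the four cases concrete, here is a small, self-contained sketch of partial evaluation for a day(utc_timestamp) partition. The function and its string "residuals" are purely illustrative, not pyiceberg's ResidualEvaluator API:

```python
from datetime import date, datetime


def residual_for_day(d: date, a: datetime, b: datetime) -> str:
    """Partially evaluate `utc_timestamp >= a and utc_timestamp <= b`
    for a file whose day(utc_timestamp) partition value is d."""
    if a.date() < d < b.date():
        # Case 1: the whole day lies inside the range; nothing left to check.
        return "ALWAYS_TRUE"
    if d == a.date() and d != b.date():
        # Case 2: only the lower bound can still exclude rows.
        return "utc_timestamp >= a"
    if d == b.date() and d != a.date():
        # Case 3: only the upper bound can still exclude rows.
        return "utc_timestamp <= b"
    if d == a.date() == b.date():
        # Case 4: both bounds remain on this single day.
        return "utc_timestamp >= a and utc_timestamp <= b"
    # Day entirely outside the range: no row can match.
    return "ALWAYS_FALSE"


a = datetime(2025, 1, 10, 6, 0)
b = datetime(2025, 1, 12, 18, 0)
assert residual_for_day(date(2025, 1, 11), a, b) == "ALWAYS_TRUE"
assert residual_for_day(date(2025, 1, 10), a, b) == "utc_timestamp >= a"
```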

# if the result is not a predicate, then it must be a constant like alwaysTrue or alwaysFalse
strict_result = bound

if strict_result is not None and isinstance(strict_result, AlwaysTrue):
@Fokko (Contributor) commented on Feb 6, 2025:

In Python we can simplify this a bit:

Suggested change
if strict_result is not None and isinstance(strict_result, AlwaysTrue):
if isinstance(strict_result, AlwaysTrue):
>>> from pyiceberg.expressions import AlwaysTrue
>>> isinstance(None, AlwaysTrue)
False

strict_result = None

if strict_projection is not None:
bound = strict_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)))

Suggested change
bound = strict_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)))
bound = strict_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)), case_sensitive=self.case_sensitive)

inclusive_projection = part.transform.project(part.name, predicate)
inclusive_result = None
if inclusive_projection is not None:
bound_inclusive = inclusive_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)))

Suggested change
bound_inclusive = inclusive_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)))
bound_inclusive = inclusive_projection.bind(struct_to_schema(self.spec.partition_type(self.schema)), case_sensitive=self.case_sensitive)

return predicate

def visit_unbound_predicate(self, predicate: UnboundPredicate[L]) -> BooleanExpression:
bound = predicate.bind(self.schema, case_sensitive=True)

Suggested change
bound = predicate.bind(self.schema, case_sensitive=True)
bound = predicate.bind(self.schema, case_sensitive=self.case_sensitive)

# if the result is not a predicate, then it must be a constant like alwaysTrue or
# alwaysFalse
inclusive_result = bound_inclusive
if inclusive_result is not None and isinstance(inclusive_result, AlwaysFalse):

Suggested change
if inclusive_result is not None and isinstance(inclusive_result, AlwaysFalse):
if isinstance(inclusive_result, AlwaysFalse):

if isinstance(bound, BoundPredicate):
bound_residual = self.visit_bound_predicate(predicate=bound)
# if isinstance(bound_residual, BooleanExpression):
if bound_residual not in (AlwaysFalse(), AlwaysTrue()):

Suggested change
if bound_residual not in (AlwaysFalse(), AlwaysTrue()):
if not isinstance(bound_residual, (AlwaysFalse, AlwaysTrue)):
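The difference between the membership test and the isinstance check can be shown with stand-in singleton-style classes (hypothetical minimal mirrors of pyiceberg's AlwaysTrue/AlwaysFalse, which define value equality):

```python
class AlwaysTrue:
    def __eq__(self, other):
        return isinstance(other, AlwaysTrue)

    def __hash__(self):
        return hash(AlwaysTrue)


class AlwaysFalse:
    def __eq__(self, other):
        return isinstance(other, AlwaysFalse)

    def __hash__(self):
        return hash(AlwaysFalse)


expr = AlwaysTrue()

# The membership test builds fresh instances and relies on __eq__:
assert expr in (AlwaysFalse(), AlwaysTrue())

# isinstance states the intent directly, allocates nothing, and is
# also False for None without needing an `is not None` guard:
assert isinstance(expr, (AlwaysFalse, AlwaysTrue))
assert not isinstance(None, (AlwaysFalse, AlwaysTrue))
```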

Comment on lines +1958 to +1961
if len(spec.fields) != 0:
return ResidualEvaluator(spec=spec, expr=expr, schema=schema, case_sensitive=case_sensitive)
else:
return UnpartitionedResidualEvaluator(schema=schema, expr=expr)

Just a style thing :)

Suggested change
if len(spec.fields) != 0:
return ResidualEvaluator(spec=spec, expr=expr, schema=schema, case_sensitive=case_sensitive)
else:
return UnpartitionedResidualEvaluator(schema=schema, expr=expr)
return UnpartitionedResidualEvaluator(schema=schema, expr=expr) if spec.is_unpartitioned() else ResidualEvaluator(spec=spec, expr=expr, schema=schema, case_sensitive=case_sensitive)

Comment on lines +1683 to +1693
if task.file.file_size_in_bytes > 512 * 1024 * 1024:
target_schema = schema_to_pyarrow(self.projection())
batches = arrow_scan.to_record_batches([task])
from pyarrow import RecordBatchReader

reader = RecordBatchReader.from_batches(target_schema, batches)
for batch in reader:
res += batch.num_rows
else:
tbl = arrow_scan.to_table([task])
res += len(tbl)

Let's keep it simple for now, I don't think we cover the other case in a test

Suggested change
if task.file.file_size_in_bytes > 512 * 1024 * 1024:
target_schema = schema_to_pyarrow(self.projection())
batches = arrow_scan.to_record_batches([task])
from pyarrow import RecordBatchReader
reader = RecordBatchReader.from_batches(target_schema, batches)
for batch in reader:
res += batch.num_rows
else:
tbl = arrow_scan.to_table([task])
res += len(tbl)
tbl = arrow_scan.to_table([task])
res += len(tbl)
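The batched path in the diff above avoids materializing one large table: it streams record batches and sums their row counts. A stdlib-only sketch of the same idea, where Batch is a stand-in for pyarrow's RecordBatch:

```python
from typing import Iterable, Iterator, NamedTuple


class Batch(NamedTuple):
    # Stand-in for a pyarrow RecordBatch; only the row count matters here.
    num_rows: int


def count_rows(batches: Iterable[Batch]) -> int:
    """Stream batches and sum row counts without holding them all in memory."""
    return sum(batch.num_rows for batch in batches)


def read_batches() -> Iterator[Batch]:
    # Imagine each yield is one batch decoded from a large (>512 MB) data file.
    yield Batch(1024)
    yield Batch(512)
    yield Batch(100)


assert count_rows(read_batches()) == 1636
```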
