-
Notifications
You must be signed in to change notification settings - Fork 1
Skip row groups based on statistics #4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
marcin-krystianc
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good; I've just left some comments about considering the use of parameterized tests.
| public void TestFilterPartitionColumn() | ||
| { | ||
| using var tmpDir = new DisposableDirectory(); | ||
| var filePath = tmpDir.AbsPath("test.parquet"); | ||
|
|
||
| var batch0 = GenerateBatch(0, 10); | ||
| var batch1 = GenerateBatch(10, 20); | ||
| WriteParquetFile(filePath, new[] { batch0, batch1 }, includeStats: true); | ||
|
|
||
| var filter = Col.Named("part").IsEqualTo(5); | ||
| var rowGroupSelector = new RowGroupSelector(filter); | ||
|
|
||
| using var reader = new FileReader(filePath); | ||
| var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader); | ||
| Assert.That(rowGroups, Is.Null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intent of this tests is not obvious to me.
Does it test filtering on non-existing column?
It would be great to use parametrised test to show the difference in behaviour when statistics are enabled and disabled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yeah this tests filtering on a column that doesn't exist in the data file, with the expectation that it comes from the partitioning schema instead. I'll add a comment to clarify this. In this test having statistics enabled or disabled should have no effect, but it probably makes sense to verify the behaviour is the same either way.
| [Test] | ||
| public void TestNoStatistics() | ||
| { | ||
| using var tmpDir = new DisposableDirectory(); | ||
| var filePath = tmpDir.AbsPath("test.parquet"); | ||
|
|
||
| var batch0 = GenerateBatch(0, 10); | ||
| var batch1 = GenerateBatch(10, 20); | ||
| WriteParquetFile(filePath, new[] { batch0, batch1 }, includeStats: false); | ||
|
|
||
| var filter = Col.Named("id").IsEqualTo(5); | ||
| var rowGroupSelector = new RowGroupSelector(filter); | ||
|
|
||
| using var reader = new FileReader(filePath); | ||
| var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader); | ||
| Assert.That(rowGroups, Is.EqualTo(new[] { 0, 1 })); | ||
| } | ||
|
|
||
| [Test] | ||
| public void TestFilterIntColumnValue() | ||
| { | ||
| using var tmpDir = new DisposableDirectory(); | ||
| var filePath = tmpDir.AbsPath("test.parquet"); | ||
|
|
||
| var batch0 = GenerateBatch(0, 10); | ||
| var batch1 = GenerateBatch(10, 20); | ||
| var batch2 = GenerateBatch(20, 30); | ||
| WriteParquetFile(filePath, new[] { batch0, batch1, batch2 }, includeStats: true); | ||
|
|
||
| var filter = Col.Named("id").IsEqualTo(15); | ||
| var rowGroupSelector = new RowGroupSelector(filter); | ||
|
|
||
| using var reader = new FileReader(filePath); | ||
| var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader); | ||
| Assert.That(rowGroups, Is.EqualTo(new[] { 1 })); | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these two tests should be merged into single parametrised test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I also changed the below test using an integer range filter to be parametrised too.
This PR adds support for skipping reading a row group when the statistics can be used to infer that no selected rows are in that row group.
This required upgrading to the latest beta version of ParquetSharp which exposed the schema mapping to allow interpreting statistics when using the Arrow API (G-Research/ParquetSharp#430).
The statistics store Parquet physical values rather than logical values, so they need to be converted first. I've added a new
LogicalStatisticsclass to represent statistics that have been converted to the relevant logical type.