Skip to content

Conversation

@adamreeve
Copy link
Contributor

This PR adds support for skipping reading a row group when the statistics can be used to infer that no selected rows are in that row group.

This required upgrading to the latest beta version of ParquetSharp which exposed the schema mapping to allow interpreting statistics when using the Arrow API (G-Research/ParquetSharp#430).

The statistics store Parquet physical values rather than logical values, so they need to be converted first. I've added a new LogicalStatistics class to represent statistics that have been converted to the relevant logical type.

Copy link

@marcin-krystianc marcin-krystianc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good; I've just left some comments about considering the use of parameterized tests.

Comment on lines 14 to 28
public void TestFilterPartitionColumn()
{
using var tmpDir = new DisposableDirectory();
var filePath = tmpDir.AbsPath("test.parquet");

var batch0 = GenerateBatch(0, 10);
var batch1 = GenerateBatch(10, 20);
WriteParquetFile(filePath, new[] { batch0, batch1 }, includeStats: true);

var filter = Col.Named("part").IsEqualTo(5);
var rowGroupSelector = new RowGroupSelector(filter);

using var reader = new FileReader(filePath);
var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader);
Assert.That(rowGroups, Is.Null);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intent of this tests is not obvious to me.
Does it test filtering on non-existing column?
It would be great to use parametrised test to show the difference in behaviour when statistics are enabled and disabled.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah this tests filtering on a column that doesn't exist in the data file, with the expectation that it comes from the partitioning schema instead. I'll add a comment to clarify this. In this test having statistics enabled or disabled should have no effect, but it probably makes sense to verify the behaviour is the same either way.

Comment on lines 31 to 66
[Test]
public void TestNoStatistics()
{
using var tmpDir = new DisposableDirectory();
var filePath = tmpDir.AbsPath("test.parquet");

var batch0 = GenerateBatch(0, 10);
var batch1 = GenerateBatch(10, 20);
WriteParquetFile(filePath, new[] { batch0, batch1 }, includeStats: false);

var filter = Col.Named("id").IsEqualTo(5);
var rowGroupSelector = new RowGroupSelector(filter);

using var reader = new FileReader(filePath);
var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader);
Assert.That(rowGroups, Is.EqualTo(new[] { 0, 1 }));
}

[Test]
public void TestFilterIntColumnValue()
{
using var tmpDir = new DisposableDirectory();
var filePath = tmpDir.AbsPath("test.parquet");

var batch0 = GenerateBatch(0, 10);
var batch1 = GenerateBatch(10, 20);
var batch2 = GenerateBatch(20, 30);
WriteParquetFile(filePath, new[] { batch0, batch1, batch2 }, includeStats: true);

var filter = Col.Named("id").IsEqualTo(15);
var rowGroupSelector = new RowGroupSelector(filter);

using var reader = new FileReader(filePath);
var rowGroups = rowGroupSelector.GetRequiredRowGroups(reader);
Assert.That(rowGroups, Is.EqualTo(new[] { 1 }));
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these two tests should be merged into single parametrised test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. I also changed the below test using an integer range filter to be parametrised too.

@adamreeve adamreeve merged commit 7125245 into G-Research:main Apr 4, 2024
@adamreeve adamreeve deleted the predicate_pushdown branch April 4, 2024 20:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants