Row group reading support? #587

cornhundred · 2024-08-24T14:34:57Z

cornhundred
Aug 24, 2024

Hi, thanks for the library it has been very useful for us. Is it possible to read specific row groups using search criteria (predicate pushdown I think it is called)? We're testing the use of parquet files for tiling spatial biology data. But it would be cool to be able to have one big file instead of lots of small files that could serve the same purpose.

Answered by kylebarron

Aug 26, 2024

You can specify row group indexes in the ReaderOptions: https://kylebarron.dev/parquet-wasm/types/esm_parquet_wasm.ReaderOptions.html

View full answer

kylebarron · 2024-08-26T12:40:11Z

kylebarron
Aug 26, 2024
Maintainer

You can specify row group indexes in the ReaderOptions: https://kylebarron.dev/parquet-wasm/types/esm_parquet_wasm.ReaderOptions.html

0 replies

cornhundred · 2024-08-26T14:28:05Z

cornhundred
Aug 26, 2024
Author

Thanks, is it possible to query the metadata for the row groups first to determine which row groups need to be loaded? I'm thinking that could be done once and then referenced whenever new row groups are needed.

1 reply

kylebarron Aug 26, 2024
Maintainer

You can use ParquetFile to access the ParquetMetaData and then iterate through the metadata of all row groups. That should probably give you enough of what you need

cornhundred · 2025-03-08T01:26:04Z

cornhundred
Mar 8, 2025
Author

Hi @kylebarron, I'm trying to use ReaderOptions to read a a single row group from a small parquet file that has 5 row groups.

I'm making my parquet file in Python like this

# Create a sample DataFrame
df = pd.DataFrame({
    "id": range(10000),
    "category": ["A"] * 5000 + ["B"] * 5000,
    "value": range(10000, 20000)
})

# Convert to PyArrow Table
table = pa.Table.from_pandas(df)

# Write to a Parquet file with row groups (each batch of 2000 rows will be a row group)
file_path = "data/test_row_groups.parquet"
row_group_size = 2000

pq.write_table(table, file_path, row_group_size=row_group_size)

print(f"Parquet file '{file_path}' created with {len(df) // row_group_size} row groups.")

and I'm using ReaderOptions here (the parquet file is being hosted on a local server)

const response = await fetch(url, fetch_options);
const arrayBuffer = await response.arrayBuffer();

const pq = await getPq();
const arr = new Uint8Array(arrayBuffer);

var ReaderOptions = {
    rowGroups: [0]
}

const arrowIPC = pq.readParquet(arr, ReaderOptions);
const arrowTable = arrow.tableFromIPC(arrowIPC)

console.log('after using ReaderOptions')
console.log(arrowTable)

and my arrowTable console log looks like this

g {schema: g, batches: Array(5), _offsets: Uint32Array(6)}
batches
: 
(5) [g, g, g, g, g]
schema
: 
g {fields: Array(3), metadata: Map(1), dictionaries: Map(0), metadataVersion: 4}
_offsets
: 
Uint32Array(6) [0, 2000, 4000, 6000, 8000, 10000, buffer: ArrayBuffer(24), byteLength: 24, byteOffset: 0, length: 6, Symbol(Symbol.toStringTag): 'Uint32Array']
data

As far as I can tell this is reading all row groups. Does it look like I am using the ReaderOptions incorrectly?

More broadly, if I intend to have one very large parquet file that I would like parquet-wasm to read specific row groups from, without downloading/reading the entire file, do I need to modify how my fetch command is working (HTTP range request)? Will I by default fetch the entire parquet file and do I need to fetch a subset of the file?

Any help would be greatly appreciated.

1 reply

kylebarron Mar 11, 2025
Maintainer

As far as I can tell this is reading all row groups. Does it look like I am using the ReaderOptions incorrectly?

The ReaderOptions is passed along here:

parquet-wasm/src/reader.rs

Lines 18 to 20 in e85322b

    
           if let Some(row_groups) = options.row_groups { 
        
               builder = builder.with_row_groups(row_groups); 
        
           }

So it should only be reading the first one.

More broadly, if I intend to have one very large parquet file that I would like parquet-wasm to read specific row groups from, without downloading/reading the entire file, do I need to modify how my fetch command is working (HTTP range request)? Will I by default fetch the entire parquet file and do I need to fetch a subset of the file?

You need to use the ParquetFile API: https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html to read via HTTP range requests.

cornhundred · 2025-03-13T20:55:48Z

cornhundred
Mar 13, 2025
Author

Thanks @kylebarron for the help! I am trying to use the latest 0.6.1 version of parquet-wasm, but I am getting an error with ParquetFile.read: parquetfile_read is not a function.

I am importing parquet-wasm from the bundler version (I am using esbuild)

import * as pq from 'parquet-wasm/bundler'

I instantiate an instance of the class ParquetFile like this and trying to use the read method (link to code in branch)

    const { ParquetFile } = pq;
    console.log('ParquetFile', ParquetFile)

    const pq_file = new ParquetFile();

    console.log('pq_file', pq_file)
    console.log('pq_file.read', pq_file.read)

    const cell_url = base_url + `/test_row_groups.parquet`;
    pq_file.read(cell_url)

but I am getting the message that parquetfile_read is not a function:

It looks like the pq_file instance is being made correctly and appears to have a read method, but the read method is giving an error. Do you think this might be caused by the WASM code not being initialized correctly? I am using esbuild and I saw one of the documents say that Webpack was preferred.

Previously, I was unable to get the WASM code to load correctly (using version 0.4 of parquet-wasm) so I copied the code and license to a vendor section, saved the WASM code as a base64 string in a wasmModuleBase64.js file (export const wasmBase64 = AGFzbQEAAA...`), and finally modified the code to import this and convert to binary in order to initialize WASM. I'm wondering if I need to try something similar again?

It does however look like my esbuild is able to import wasm files (from the vendor folder)

import wasm from '../vendor/parquet-wasm/bundler/parquet_wasm_bg.wasm';

console.log('Loaded WASM:', wasm);

yields

Loaded WASM: g=>WebAssembly.instantiate(MW,g).then(A=>A.instance.exports)

3 replies

cornhundred Mar 14, 2025
Author

@kylebarron - I can open this in a new discussion or issue if you would like.

kylebarron Mar 14, 2025
Maintainer

Sorry, I don't really have time to support this. I've only personally had success using the esm export, not the bundler export.

cornhundred Mar 14, 2025
Author

No problem, I'll post here if I can figure it out.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Row group reading support? #587

{{title}}

Replies: 4 comments 5 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Row group reading support? #587

cornhundred Aug 24, 2024

Replies: 4 comments · 5 replies

kylebarron Aug 26, 2024 Maintainer

cornhundred Aug 26, 2024 Author

kylebarron Aug 26, 2024 Maintainer

cornhundred Mar 8, 2025 Author

kylebarron Mar 11, 2025 Maintainer

cornhundred Mar 13, 2025 Author

cornhundred Mar 14, 2025 Author

kylebarron Mar 14, 2025 Maintainer

cornhundred Mar 14, 2025 Author

cornhundred
Aug 24, 2024

Replies: 4 comments 5 replies

kylebarron
Aug 26, 2024
Maintainer

cornhundred
Aug 26, 2024
Author

kylebarron Aug 26, 2024
Maintainer

cornhundred
Mar 8, 2025
Author

kylebarron Mar 11, 2025
Maintainer

cornhundred
Mar 13, 2025
Author

cornhundred Mar 14, 2025
Author

kylebarron Mar 14, 2025
Maintainer

cornhundred Mar 14, 2025
Author