Row group reading support? #587
-
Hi, thanks for the library, it has been very useful for us. Is it possible to read specific row groups using search criteria (predicate pushdown, I think it's called)? We're testing the use of Parquet files for tiling spatial biology data, and it would be cool to have one big file instead of lots of small files serving the same purpose.
-
You can specify row group indexes in the `ReaderOptions`: https://kylebarron.dev/parquet-wasm/types/esm_parquet_wasm.ReaderOptions.html
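For example, a minimal sketch (assuming parquet-wasm 0.6+, where `readParquet` returns a wasm `Table`, and the ESM build, which needs an explicit init call; `url` is a placeholder):

```js
import { tableFromIPC } from "apache-arrow";
import initWasm, { readParquet } from "parquet-wasm";

// Initialize the WASM module before calling any parquet-wasm function.
await initWasm();

const resp = await fetch(url);
const buf = new Uint8Array(await resp.arrayBuffer());

// Decode only the listed row groups; the rest of the buffer is skipped.
const wasmTable = readParquet(buf, { rowGroups: [0, 3] });
const table = tableFromIPC(wasmTable.intoIPCStream());
```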
-
Thanks! Is it possible to query the metadata for the row groups first, to determine which row groups need to be loaded? I'm thinking that could be done once and then referenced whenever new row groups are needed.
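I'm imagining something along these lines (a hypothetical sketch; I haven't checked the exact shape of the metadata object, so treat the accessors as assumptions):

```js
import initWasm, { ParquetFile } from "parquet-wasm";

await initWasm();

// Hypothetical sketch: fetch the footer once, keep the metadata around,
// and use it later to decide which row groups to load. `url` is a placeholder.
const file = await ParquetFile.fromUrl(url);

// The footer metadata describes every row group (row counts, column
// statistics, byte ranges); the exact accessors would need checking in the docs.
const meta = file.metadata();
console.log(meta);

// Later: load only the row groups that the metadata says are relevant.
const wasmTable = await file.read({ rowGroups: [2, 3] });
```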
-
Hi @kylebarron, I'm trying to use the `rowGroups` option in `ReaderOptions`. I'm making my Parquet file in Python like this:

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Create a sample DataFrame
df = pd.DataFrame({
    "id": range(10000),
    "category": ["A"] * 5000 + ["B"] * 5000,
    "value": range(10000, 20000)
})

# Convert to a PyArrow Table
table = pa.Table.from_pandas(df)

# Write to a Parquet file with row groups (each batch of 2000 rows becomes a row group)
file_path = "data/test_row_groups.parquet"
row_group_size = 2000
pq.write_table(table, file_path, row_group_size=row_group_size)

print(f"Parquet file '{file_path}' created with {len(df) // row_group_size} row groups.")
```

and I'm reading it with parquet-wasm like this:

```js
const response = await fetch(url, fetch_options);
const arrayBuffer = await response.arrayBuffer();
const pq = await getPq();
const arr = new Uint8Array(arrayBuffer);

// Ask for only the first row group
const readerOptions = {
  rowGroups: [0]
};

const arrowIPC = pq.readParquet(arr, readerOptions);
const arrowTable = arrow.tableFromIPC(arrowIPC);
console.log('after using ReaderOptions');
console.log(arrowTable);
```

and my `arrowTable` console log looks like this:

```
g {schema: g, batches: Array(5), _offsets: Uint32Array(6)}
  batches: (5) [g, g, g, g, g]
  schema: g {fields: Array(3), metadata: Map(1), dictionaries: Map(0), metadataVersion: 4}
  _offsets: Uint32Array(6) [0, 2000, 4000, 6000, 8000, 10000]
```

As far as I can tell this is reading all five row groups (five batches covering all 10000 rows). Does it look like I am using the `ReaderOptions` correctly?

More broadly, if I intend to have one very large Parquet file that I would like parquet-wasm to read specific row groups from, without downloading/reading the entire file, do I need to change how my fetch call works (HTTP range requests)? Will I fetch the entire Parquet file by default, and do I need to fetch only a subset of the file? Any help would be greatly appreciated.
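From the docs, `ParquetFile.fromUrl` looks like it might be the intended approach for this; here is a sketch of what I mean (assuming the 0.6 ESM build; I haven't verified the exact signatures, and `url` is a placeholder):

```js
import { tableFromIPC } from "apache-arrow";
import initWasm, { ParquetFile } from "parquet-wasm";

await initWasm();

// fromUrl should only fetch the footer metadata, not the whole file.
const file = await ParquetFile.fromUrl(url);

// read() with rowGroups should then issue range requests for just those
// row groups' byte ranges instead of downloading the entire file.
const wasmTable = await file.read({ rowGroups: [0] });
const arrowTable = tableFromIPC(wasmTable.intoIPCStream());
```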
-
Thanks @kylebarron for the help! I am trying to use the latest 0.6.1 version of parquet-wasm, but I am getting an error with `ParquetFile.read`. I am importing parquet-wasm from the bundler version (I am using esbuild).

I instantiate an instance of the `ParquetFile` class and try to use the `read` method (link to code in branch), but the `read` call fails. It looks like the `pq_file` instance is being made correctly and appears to have a `read` method, but calling `read` gives an error. Do you think this might be caused by the WASM code not being initialized correctly? I am using esbuild, and I saw one of the documents say that Webpack was preferred. Previously, I was unable to get the WASM code to load correctly (using version 0.4 of parquet-wasm), so I copied the code and license to a vendor section and saved the WASM code as a base64 string in a wasmModuleBase64.js file.

It does, however, look like my esbuild is able to import wasm files (from the vendor folder).
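In case it's useful, this is the initialization pattern I'm now trying based on my reading of the docs (a sketch, not verified; my understanding is that the bundler build assumes Webpack instantiates the .wasm module, while the ESM build exposes a default-exported init function you call yourself):

```js
import initWasm, { ParquetFile } from "parquet-wasm";

// With esbuild, explicitly initialize the WASM module before using any
// parquet-wasm class; calling methods first fails with opaque errors.
await initWasm();

const pqFile = await ParquetFile.fromUrl(url); // `url` is a placeholder
const wasmTable = await pqFile.read();
```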