Q: filter values from parquet/CSV based on existing table #2629

darked89 · 2021-11-17T15:22:36Z

darked89
Nov 17, 2021

I want to get all unique values from a set of ~3000 files, available as either CSV or parquet.
Let's assume two-column parquet files, since they are faster to read.
The procedure:

1. Read the first file, parquet_01, into my_table.
2. From parquet_02, read only the values not already present in my_table.
3. Merge the values from step 2 into my_table.
4. Repeat with the rest of the parquet files.

The values in the parquet files are sorted.
For brevity, I left out the sorting and indexing of my_table after each step.

The actual question:
What is the fastest way to keep only the values not already present in my_table, whether by filtering on the fly while reading the parquet file or in a later step?

Answered by Mytherin

Mytherin · 2021-11-17T16:36:03Z

Mytherin
Nov 17, 2021
Maintainer

Perhaps use DISTINCT in combination with globbing, e.g.:

CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet'

darked89 · 2021-11-17T17:42:10Z

darked89
Nov 17, 2021
Author

Thank you very much.

As far as I can tell, it works. Since a number of my parquet files share the same md5sum, I thought of using the Python API and doing something like:

conn.execute("CREATE TABLE my_table AS SELECT DISTINCT * FROM list_of_parquet_fn_with_uniq_md5sums")

so that I never ingest more than one parquet file with exactly the same 16M rows. But since ingesting ~150 such files took under 4 minutes with single-threaded DuckDB, I am not sure it is worth the effort.

Mytherin · 2021-11-17T18:14:15Z

Mytherin
Nov 17, 2021
Maintainer

That should work, but depending on how big the Parquet files are, computing the MD5 checksums might also take some time. Whether it is worth it depends on how many files you can eliminate this way.

You can pass multiple parquet files to the parquet_scan function as a list, using square brackets:

select * from parquet_scan(['t1.parquet', 't2.parquet', ...]);
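The two ideas in this exchange (dropping byte-identical files by checksum, then passing the remaining files as a list) could be combined roughly as follows. This is a minimal sketch using only the Python standard library; the helper names `md5_of_file`, `unique_files_by_md5` and `build_distinct_query` are hypothetical, and the file names are assumed to be trusted and free of single quotes.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files_by_md5(paths):
    """Keep one file (the first in sorted order) for each distinct checksum."""
    first_seen = {}
    for path in sorted(str(p) for p in paths):
        first_seen.setdefault(md5_of_file(path), path)
    return sorted(first_seen.values())


def build_distinct_query(files, table_name="my_table"):
    """Build the CREATE TABLE statement as a string.

    Assumption: file names are trusted and contain no single quotes.
    """
    file_list = ", ".join(f"'{name}'" for name in files)
    return (
        f"CREATE TABLE {table_name} AS "
        f"SELECT DISTINCT * FROM parquet_scan([{file_list}])"
    )
```

With a DuckDB connection such as the `conn` used earlier in the thread, the resulting string could then be passed to `conn.execute(...)`, for example after calling `unique_files_by_md5(Path("my_directory").glob("*.parquet"))`.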

darked89 · 2021-11-18T10:46:30Z

darked89
Nov 18, 2021
Author

A proof-of-concept run with brain-dead sequential CSV processing:

- converting 3095 CSVs to zstd-compressed parquet files with Polars takes 14 hrs (~15 s per file)
- computing md5sums: ~12 mins
- 1361 unique md5sums, so the reduction is significant

This is trivial to run in parallel on an HPC cluster.

Which brings me to another question:

Is there a parquet format or compression option that would speed up DuckDB's SELECT DISTINCT over parquet files?
I am thinking mostly about partitioned datasets, with the data partitioned on the chr column.
This should be doable by asking Polars to use Arrow to write the parquet files: https://arrow.apache.org/docs/python/parquet.html

This will certainly complicate things, but it would allow running DuckDB in parallel on separate compute nodes, each processing the positions for one chromosome.

This may well be premature optimization, but since there are very likely much larger datasets to handle, I may as well start investigating this option now.

Mytherin · 2021-11-18T11:34:00Z

Mytherin
Nov 18, 2021
Maintainer

Perhaps you could partition the Parquet files somehow. DuckDB will not (yet!) natively take advantage of such a partitioning, but you can manually do the partition pruning, e.g.:

SELECT DISTINCT * FROM parquet_scan('partition1/*.parquet')
UNION ALL
SELECT DISTINCT * FROM parquet_scan('partition2/*.parquet')
UNION ALL
SELECT DISTINCT * FROM parquet_scan('partition3/*.parquet')
...
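With many partitions, a query of this shape could be generated instead of written by hand. The sketch below is only an illustration: the function name `build_partitioned_distinct_query` is hypothetical, and it assumes that no row occurs in more than one partition (true when the partitioning column is part of each row), since UNION ALL would otherwise keep duplicates that span partitions.

```python
def build_partitioned_distinct_query(partition_dirs):
    """Join one SELECT DISTINCT per partition directory with UNION ALL.

    Assumption: directory names are trusted and contain no single quotes.
    """
    if not partition_dirs:
        raise ValueError("at least one partition directory is required")
    selects = [
        f"SELECT DISTINCT * FROM parquet_scan('{directory}/*.parquet')"
        for directory in partition_dirs
    ]
    return "\nUNION ALL\n".join(selects)
```

Each SELECT in the generated string is independent of the others, so the same list of directories could also be split across separate processes or nodes, one statement each.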

1 reply

darked89 Nov 19, 2021
Author

Writing partitioned datasets using Polars and Arrow worked as expected. I reduced the number of parquet files passed to SELECT DISTINCT using md5 checksums. The largest number of files was 400+ for chr_1 positions and the smallest was 41 for chr_21. The DuckDB part for this data set:

command_A = f"CREATE TABLE chromosome{chrom_name} AS SELECT DISTINCT * FROM parquet_scan({chr_parquet_dict[chrom_name]}) ORDER BY pos ASC"

command_B = f"COPY (SELECT * FROM chromosome{chrom_name}) TO '/data/chromosome_{chrom_name}.uniq.ordered.parquet' (FORMAT 'parquet')"

took:

207.60user 6.40system 4:05.08elapsed 87%CPU (0avgtext+0avgdata 244844maxresident)k

Compared to line-by-line processing of TSV files, it is an amazing speed improvement.

DK
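For readers wondering how a mapping like chr_parquet_dict might be built, here is one possible sketch. It assumes the partitioned dataset uses hive-style directories such as `<root>/chr=1/<file>.parquet` and that the dict maps each chromosome name to a list of file paths; neither detail is confirmed in the post. The function names `collect_partition_files` and `build_commands`, and the `out_dir` parameter, are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def collect_partition_files(dataset_root, column="chr"):
    """Map each partition value to the sorted list of parquet files under it.

    Assumes hive-style directories such as <root>/chr=1/<file>.parquet.
    """
    root = Path(dataset_root)
    prefix = f"{column}="
    files_by_value = defaultdict(list)
    for path in sorted(root.rglob("*.parquet")):
        # Look only at the path components below the dataset root.
        for part in path.relative_to(root).parts:
            if part.startswith(prefix):
                files_by_value[part[len(prefix):]].append(path.as_posix())
                break
    return dict(files_by_value)


def build_commands(chrom_name, files, out_dir="/data"):
    """Return the CREATE TABLE and COPY statements for one chromosome."""
    # Formatting a list of plain str paths gives ['a', 'b'], which has the
    # same shape as the list passed to parquet_scan earlier in the thread
    # (assuming the names contain no quotes or backslashes).
    create = (
        f"CREATE TABLE chromosome{chrom_name} AS "
        f"SELECT DISTINCT * FROM parquet_scan({files}) ORDER BY pos ASC"
    )
    copy = (
        f"COPY (SELECT * FROM chromosome{chrom_name}) "
        f"TO '{out_dir}/chromosome_{chrom_name}.uniq.ordered.parquet' "
        f"(FORMAT 'parquet')"
    )
    return create, copy
```

Looping over the items of the dict returned by `collect_partition_files` and executing both statements for each chromosome would reproduce the two commands above, one chromosome at a time.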
