-
I want to get all unique values from a set of either ~3000 CSV or parquet files.
The values in parquet files are sorted. Proper question: |
Beta Was this translation helpful? Give feedback.
Replies: 5 comments 1 reply
-
Perhaps use CREATE TABLE my_table AS SELECT DISTINCT * FROM 'my_directory/*.parquet' |
Beta Was this translation helpful? Give feedback.
-
Thank you very much. As far as I can tell it does work. Since I got a bunch of parquet files with the same md5sum, I got the idea of using Python API and doing something like:
so I will not try to ingest more than one parquet file with exactly the same 16M rows. But then since ingestion of ~150 such files took < 4mins using single thread duckdb I am not sure it is worthy the effort. |
Beta Was this translation helpful? Give feedback.
-
That should work, but depending on how big the Parquet files are computing the MD5's might also take some time. I suppose whether or not that is worth it depends on how many files you can eliminate in this manner. You can pass multiple parquet files to the parquet_scan function using the bracket operator: select * from parquet_scan(['t1.parquet', 't2.parquet', ...]);
|
Beta Was this translation helpful? Give feedback.
-
Doing proof of concept/brain dead sequential CSV processing:
This is trivial to run in parallel on a HPC cluster. Which brings me to another question: Is there some parquet format/compression option to optimize the DuckDB SELECT DISTINCT from parquet files? This will certainly complicate things, but then it will permit to run duckdb in parallel on separate computing nodes processing location for one chromosome each. This can be way to premature optimization, but since very likely there are way larger datasets to handle I may start investigating such option now. |
Beta Was this translation helpful? Give feedback.
-
Perhaps you could partition the Parquet files somehow. DuckDB will not (yet!) natively take advantage of such a partitioning, but you can manually do the partition pruning, e.g.: SELECT DISTINCT * FROM parquet_scan('partition1/*.parquet')
UNION ALL
SELECT DISTINCT * FROM parquet_scan('partition2/*.parquet')
UNION ALL
SELECT DISTINCT * FROM parquet_scan('partition3/*.parquet')
... |
Beta Was this translation helpful? Give feedback.
Perhaps use
DISTINCT
in combination with globbing, e.g.: