Hi @RobinL,

Sorry that I'm all over your discussion threads lately, but I'm curious if you've found a way to automatically trigger parallelism in the … Reason I ask: when I look into CPU usage during …, I think that this could possibly be the source of my slowdown, and so I'm curious to see if you've discovered any more on this topic.

Thank you,
---
For want of a better place to put it, I'm going to use this thread to document the work I've been doing to understand how to get duckdb to parallelise workflows efficiently.
Summary
- In `splink==3.9.10`, `predict()` will only use `num_input_nodes/122_880` cores, irrespective of salting or the number of blocking rules
- Adding `order by 1` in the right place makes it parallelise further. The parallelism is equal to `(num_input_nodes/122_880) * num_blocking_rules * salting_per_blocking_rule`
- This means the benefits of the `order by 1` trick are variable, but often large, especially on big machines, e.g. 5x faster
- Estimate u does not parallelise at all in the current Splink release (`3.9.10`)
- Adding salting equal to `num_cpu_cores` makes it parallelise across all available cores, leading to a 10x speedup (or more)
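For reference, here is a minimal sketch of how salting is switched on in Splink 3. I believe this is done via a `salting_partitions` key on each blocking rule, but treat the exact key name, import paths and comparison setup below as assumptions to verify against the docs for your version:

```python
# Hedged sketch: enabling per-blocking-rule salting in splink==3.9.10.
# `salting_partitions` and the import paths reflect my understanding of
# the Splink 3 API - check the docs for your version.
import pandas as pd
from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        # One salt partition per CPU core splits this rule's join into
        # independent tasks that DuckDB can run on separate threads
        {"blocking_rule": "l.first_name = r.first_name", "salting_partitions": 12},
    ],
    "comparisons": [cl.exact_match("dob")],  # illustrative comparison
}

df = pd.read_parquet("input.parquet")  # hypothetical input file
linker = DuckDBLinker(df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e6)
df_predictions = linker.predict()
```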
Experiments
Parallelising blocking (with salting or not)
- No `order by 1`, small input data (100k rows)
- No `order by 1`, large input data (3m rows)
- `order by 1`, small input data (100k rows)
- `order by 1`, large input data (3m rows)

These results suggest that you probably want to do salting and `order by 1` to achieve parallelisation of all workloads.
But unfortunately, on real Splink workloads, salting too heavily seems to make things considerably slower for some workloads on high-CPU-count machines. So you can't just arbitrarily specify a high salt.
Parallelising full cartesian join

`order by 1`: need a reprex of this behaviour.

Runnable examples
100k input rows, no order by
This example takes the same time to run irrespective of `num_partitions` and thread count.
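The original runnable snippet isn't reproduced in this extract, so below is a hedged reconstruction of the experiment; the column names, salt scheme and timing harness are my own:

```python
# Reconstruction (not the original gist): a salted, blocked self-join on
# 100k rows with NO order by. 100k rows is below one DuckDB row group
# (122,880 rows), so timings should stay flat however many threads or
# salt partitions you use.
import time
import duckdb
import numpy as np
import pandas as pd

n = 100_000  # raise to 3_000_000 to reproduce the 3m-row experiments below
rng = np.random.default_rng(42)

for num_partitions in [1, 12]:  # cardinality of the salt column
    df = pd.DataFrame({
        "unique_id": np.arange(n),
        "first_name": rng.choice([f"name_{i}" for i in range(500)], n),
        "salt": rng.integers(0, num_partitions, n),
    })
    con = duckdb.connect()
    con.register("df", df)
    for threads in [1, 6, 12]:
        con.execute(f"SET threads TO {threads}")
        t0 = time.time()
        con.execute("""
            SELECT count(*)
            FROM df l
            JOIN df r
              ON l.first_name = r.first_name
             AND l.salt = r.salt
        """).fetchone()
        print(num_partitions, threads, round(time.time() - t0, 3))
```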
100k input rows, `order by 1`

Using the same example as above, but adding `order by 1` makes it parallelise, though only if salting is applied. The level of parallelisation is related to the salting: runtimes decrease as the salt increases towards the CPU count, but not beyond. Runtime is dramatically faster with salting and `order by`.
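A hedged sketch of the change, written to drop into the inner loop of the harness above (it reuses `con`, `df` and `time` from that sketch):

```python
# Materialise the blocked join with an ORDER BY 1, the 'trigger'.
# Per the results above this parallelises, but only when the salt column
# genuinely partitions the join, and only up to roughly the CPU count.
con.execute("DROP TABLE IF EXISTS blocked")
t0 = time.time()
con.execute("""
    CREATE TABLE blocked AS
    SELECT l.unique_id AS unique_id_l, r.unique_id AS unique_id_r
    FROM df l
    JOIN df r
      ON l.first_name = r.first_name
     AND l.salt = r.salt   -- with num_partitions=1 this salt is a no-op
    ORDER BY 1             -- remove this line to fall back to single-core
""")
print(round(time.time() - t0, 3))
```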
3m input rows, no order by
runnable code
On my 6-core/12-thread machine, I get full parallelisation irrespective of salting.
3m input rows, `order by 1`

On my 6-core/12-thread machine, I get full parallelisation irrespective of salting. Adding the `order by` makes runtime only slightly (5%) slower.

Conclusions
No salting
Without salting, DuckDB parallelises the workload based on row groups, each of which contains 122,880 rows. That means that if the input dataframe to a Splink routine (e.g. `estimate_u`) has < 123k rows, it will use a single CPU core.
The number of CPU cores used is equal to `input_nodes/122_880`. In practice, this means for `estimate_u`, you never achieve parallelism.
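A back-of-envelope version of that arithmetic; the helper function is mine, and 122,880 is DuckDB's default row group size:

```python
# Rough estimate of cores DuckDB can use without salting: one thread per
# row group of the input (122,880 rows), minimum one.
import math

ROW_GROUP_SIZE = 122_880

def cores_used_without_salting(num_input_rows: int) -> int:
    return max(1, math.ceil(num_input_rows / ROW_GROUP_SIZE))

print(cores_used_without_salting(100_000))    # 1 -> single-core
print(cores_used_without_salting(1_000_000))  # 9 (8 full row groups + a partial)
```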
Salting

Salting only affects parallelisation if a 'trigger' operation is used. `group by` and `order by` seem to be triggers, so this may be relevant. However: the `order by 1` statement does have an effect on performance.

So in Splink:
- For `estimate_u`, we don't need an `order by 1`: the `group by` triggers parallelisation
- For `predict()`, we want to salt, but only in the case that parallelism is lower than the CPU core count, e.g. a 1m input dataset with 2 blocking rules will only use about 16 cores, so we would want a salt of 2
Predictions

Suppose you have 1m input rows = 8 row groups. You have 6 …
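The example above is truncated, but it appears to be applying the parallelism formula from the summary. A sketch of that calculation; the helper and the 32-core reading of "a salt of 2" are my own interpretation:

```python
# parallelism ~= (num_input_nodes / 122_880) * num_blocking_rules * salt
ROW_GROUP_SIZE = 122_880

def predicted_parallelism(num_input_rows, num_blocking_rules, salt_per_rule=1):
    row_groups = num_input_rows / ROW_GROUP_SIZE
    return row_groups * num_blocking_rules * salt_per_rule

# 1m input rows ~ 8 row groups; with 2 blocking rules and no salting:
print(predicted_parallelism(1_000_000, 2))                    # ~16 parallel tasks
# A salt of 2 doubles this, e.g. enough to saturate 32 cores:
print(predicted_parallelism(1_000_000, 2, salt_per_rule=2))   # ~33 parallel tasks
```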
Links and docs
DuckDB docs

DuckDB documentation: …
Question on DuckDB discussion forums
See here
The above is quite confusing because:

- `order by 1` actually causes parallelism in my tests
- `estimate_u` parallelises with very few input rows without an `order by`
Todo:

- Investigate `order by 1` slowing things down