Documenting findings re: parallelisation and duckdb #1830

RobinL · 2024-01-09T17:04:11Z

RobinL
Jan 9, 2024
Maintainer

For want of a better place to put it, I'm going to use this thread to document the work I've been doing to understand how to get duckdb to parallelise workflows efficiently.

Summary

In splink==3.9.10 predict() will only use num_input_nodes/122_880 cores, irrespective of salting or the number of blocking rules
- So on a 50k input dataframe, you never get parallelisation
- On a 1m input dataframe, you get parallelisation up to 8 cores but no more
Adding order by 1 in the right place makes it parallelise further. The parallelism is equal to (num_input_nodes/122_880) * num_blocking_rules * salting_per_blocking_rule
This means the benefits to the order by 1 trick are variable, but often large, especially on big machines e.g. 5x faster
- If you have. a large number of input nodes and a small machine, there’s little benefit because it’s already parallelising
- If you have a large machine, there’s usually a benefit, unless you have a very large number of input nodes.
Estimate u does not parallelise at all in the current splink release (3.9.10)
Adding salting=num_cpu_cores makes it parallelise across all available cores leading to 10x speedup (or more)

Experiments

Parallelising blocking (with salting or not)

With order by 1, small input data (100k rows)
- You get parallelisation of any salted query. These results suggest it has very little affect on speed (5% slower), but in Splink it seems to make it a lot slower
- Salting needed
With order by 1, large input data (3m rows)
- No salting needed
Without order by 1, small input data (100k rows)
- No parallelism irrespective of salting
Without order by 1, small input data (3m rows)
- Parallelisation, no salting needed

These results suggest that you probably want to do salting and order by 1 to achieve parallelisaiton of all workloads

But unfortunately, on real Splink workloads, salting too heavily this seems to make things considerably slower for some workloads on high cpu count machines. So you can't just arbitrarily specify a high salt

Parallelising full cartesian join

Salting triggers parallelisation in Splink without order by 1. Need a reprex of this behaviour

Runnable examples

100k input rows, no order by

This example takes the same time to run irrespective of num_partitions and thread count

import time

import duckdb

con = duckdb.connect(database=":memory:")


n = 100_000
sql = f"""
CREATE TABLE test_table AS
SELECT
  random() AS num,
  substr(characters, floor(random()*62)::integer + 1, 4) AS random_string_4,
  substr(characters, floor(random()*62)::integer + 1, 8) AS random_string_8,
  substr(characters, floor(random()*62)::integer + 1, 16) AS random_string_16,
  substr(BIN(CAST((RANDOM() * 1000000) AS INTEGER))::string, -6, 6) AS join_key,

FROM
  range(0, {n:.0f}, 1),
  (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' AS characters) AS c
"""
con.execute(sql).commit()

con.execute("SELECT * from test_table limit 20").df()
con.execute("SELECT count(*) from test_table limit 2").df()


times = []

# for i in [0, 1, 2, 4, 8, 16, 32]:
for i in [0]:
    if i != 0:
        con.execute(
            f"""
        SET threads TO {i};
        SET enable_progress_bar=true;
        SET external_threads TO {i};
        SET worker_threads to {i};
        """
        ).commit()

    start_time = time.time()

    query_template = """
    SELECT
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_1,
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_2,
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_3,
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_4,
    FROM test_table as A
    INNER JOIN test_table as B
    on 1=1
    where floor(A.num*{num_partitions}) = {partition}
    and A.join_key = B.join_key
    and jw_1 = 1 and jw_2 = 1 and jw_3 = 1 and jw_4 = 1
    """

    num_partitions = 10

    queries = [
        query_template.format(partition=i, num_partitions=num_partitions)
        for i in range(num_partitions)
    ]
    query = " UNION ALL ".join(queries)
    # query = query + "order by 1"
    print(query)
    con.execute(f"create table A as {query}").commit()
    display(con.execute(f"select count(*) from A").df())
    con.execute(f"drop table A").commit()

    end_time = time.time()

    times.append(end_time - start_time)
    print(times)

print(f"Average times: {times}")

100k input rows, `order by 1`

Using the same example as above, but adding order by 1 makes it parallelise but only if salting is applied

The level of parallelisation is related to the salting. Runtimes decrease as salt increases towards CPU count but not beyond

Runtime dramatically faster with salting and order by

3m input rows, no order by

runnable code

import time

import duckdb

con = duckdb.connect(database=":memory:")


n = 10_000_000
sql = f"""
CREATE TABLE test_table AS
SELECT
  random() AS num,
  substr(characters, floor(random()*62)::integer + 1, 4) AS random_string_4,
  substr(characters, floor(random()*62)::integer + 1, 8) AS random_string_8,
  substr(characters, floor(random()*62)::integer + 1, 16) AS random_string_16,
  substr(BIN(CAST((RANDOM() * 1000000) AS INTEGER))::string, -18, 18) AS join_key,

FROM
  range(0, {n:.0f}, 1),
  (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789' AS characters) AS c
"""
con.execute(sql).commit()

con.execute("SELECT * from test_table limit 20").df()
con.execute("SELECT count(*) from test_table limit 2").df()


times = []

# for i in [0, 1, 2, 4, 8, 16, 32]:
for i in [0]:
    if i != 0:
        con.execute(
            f"""
        SET threads TO {i};
        SET enable_progress_bar=true;
        SET external_threads TO {i};
        SET worker_threads to {i};
        """
        ).commit()

    start_time = time.time()

    query_template = """
    SELECT
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_1,
      jaro_winkler_similarity(A.random_string_16, B.random_string_16) as jw_2,
      jaro_winkler_similarity(A.random_string_16, B.random_string_4) as jw_3,

    FROM test_table as A
    INNER JOIN test_table as B
    on A.join_key = B.join_key
    where floor(A.num*{num_partitions}) = {partition}
    and jw_1 = 1 and jw_2 = 1 and jw_3 > 0.9

    """

    num_partitions = 1

    queries = [
        query_template.format(partition=i, num_partitions=num_partitions)
        for i in range(num_partitions)
    ]
    query = " UNION ALL ".join(queries)
    # query = query + "order by 1"
    print(query)
    con.execute(f"create table A as {query}").commit()
    display(con.execute(f"select count(*) from A").df())
    con.execute(f"drop table A").commit()

    end_time = time.time()

    times.append(end_time - start_time)
    print(times)

print(f"Average times: {times}")

On my 6core/12thread machine, get full parallelisation irrespective of salting.

3m input rows, `order by 1`

On my 6core/12thread machine, get full parallelisation irrespective of salting.

order by make runtime only slightly (5%) slower

#Conclusions

No salting

Without salting, DuckDB parallelizes the workload based on row groups - which is 122,880 rows.

That means that if the input dataframe into a Splink routine (e.g. estimate u) has < 123k rows, it will use a single CPU core.

The number of CPU cores used is equal to input nodes/122_880.

In practice, this means for estimate_u, you never achieve parallelism.

Salting

Salting only affects parallelisation if a 'trigger' operation is used. group by and order by seem to be triggers, this may be relevant.

However:

the order by 1 statement does have an effect on performance.
salting negatively affects performance quite badly. So if you already have enough parallelism due to your row counts, then you don't want to salt any more if it's set very high.

So in Splink:

For estimate u, we want to salt but do not use order by 1. The group by triggers parallelisation
For predict(), we want to salt, but only in the case that parrallelism is lower than cpu_core count. e.g. a 1m input dataset with 2 blocking rules will only use about 16 cores. So we would want a salt of 2

Predictions

Suppose you have 1m input rows = 8 row groups

You have 6

Links and docs

DuckDB docs

Duckdb documentation:

Parallelism (Multi-Core Processing)

The Effect of Row Groups on Parallelism

DuckDB parallelizes the workload based on row groups, i.e., groups of rows that are stored together at the storage level.
A row group in DuckDB's database format consists of max. 122,880 rows.
Parallelism starts at the level of row groups, therefore, for a query to run on k threads, it needs to scan at least k * 122,880 rows.

Too Many Threads

Note that in certain cases DuckDB may launch too many threads (e.g., due to HyperThreading), which can lead to slowdowns. In these cases, it’s worth manually limiting the number of threads using SET threads = X.

Larger-Than-Memory Workloads (Out-of-Core Processing)

A key strength of DuckDB is support for larger-than-memory workloads, i.e., it is able to process data sets that are larger than the available system memory (also known as out-of-core processing).
It can also run queries where the intermediate results cannot fit into memory.
This section explains the prerequisites, scope, and known limitations of larger-than-memory processing in DuckDB.

Question on DuckDB discussion forums

See here

The regular query doesn't parallelize because the table is very small, and the data set size is increased by an exploding join. DuckDB parallelizes based on the base table sizes so if a table has a size of 1000 tuples it will only ever use a single thread regardless of how many tuples are generated in subsequent exploding joins.

The union variant not parallelizing is indeed related to the output order issue - we currently only execute UNION pipelines in parallel if order is not required in the final stage, adding an ORDER BY resolves the issue, e.g.:
con.execute(f"create table A as {query} order by 1").commit()

The unit of parallelism is a single row group (120K rows), so with 1M rows you should see up to 8 cores being utilized.

Finally, other than the UNION ALL Do you know any 'tricks' that would trigger parallelisation of the exploding join query?

You could materialize the result of the join in a table (or perhaps even a materialized CTE).

The above is quite confusing because:

Adding order by 1 actually causes parallelism in my tests
estimate_u parallelises with very few input rows without an order by

Todo:

Reprex of salting/order by 1 slowing things down
Reprex of full parallelisation of estimate_u without order by 1
Why does heavy salting in Splink have such a bad effect when in my reprexes here it makes little difference

RobinL · 2024-01-10T08:31:30Z

RobinL
Jan 10, 2024
Maintainer Author

Tentative conclusions

No salting

Without salting, DuckDB parallelizes the workload based on row groups - which is 122,880 rows.

That means that if the input dataframe into a Splink routine (e.g. estimate u) has < 123k rows, it will use a single CPU core.

The number of CPU cores used is equal to input nodes/122_880.

In practice, this means for estimate_u, you never achieve parallelism.

Salting

Salting only affects parallelisation if a 'trigger' operation is used. group by and order by seem to be triggers, this may be relevant.

However:

the order by 1 statement does have an affect on performance.
salting negatively affects performance quite badly. So if you already have enough parallelism due to your row counts, then you don't want to salt any more if it's set very high.

Estimate u

Predictions

Suppose you have 1m input rows = 8 row groups

You have 6

0 replies

jlb52 · 2024-04-03T15:48:56Z

jlb52
Apr 3, 2024

Hi @RobinL,

Sorry that I'm all over your discussion threads lately, but I'm curious if you've found a way to automatically trigger parallelism in the predict method for DuckDB. (Happy to move this to a separate discussion or issue if you'd prefer to keep this one clean, just let me know.)

Reason I ask:

After attempting to run the embedding cosine similarity (per the embedding discussion thread last week) in Spark and the job always stalling in the predict stage (with no error message but also no progress after +5hrs), I resorted to DuckDB.
I extended the codebase to allow for calling list_cosine_similarity (happy to contribute, this but guessing it may already be in dev for Splink4). estimate_u (for 1e8) and estimate_m ran in no time (~5min for u) with 300K input rows and 1.2M comparisons after blocking. However, again, the predict stage didn't complete after +10hrs so I killed it.
My last resort was to convert the embeddings to binary embedding strings and compute the hamming distance (see here if interested). When I did a speed test in duckdb for 250,000 cosine similarity vs. hamming distance row comparisons, the former ran in 30s while the latter in .5s so I figured it could give me a speed-up. It did speed up in the estimate_u and estimate_m phase but predict still isn't completing after many hours.

When I look into CPU usage, during estimate_u, I utilize all 16 cores to 100% capacity, but in the predict phase, I'm only using 3 as can be seen in the screenshot below.

I think that this could possibly be the source of my slowdown, and so I'm curious to see if you've discovered any more on this topic.

Thank you,
Jackson

6 replies

jlb52 Apr 3, 2024

Thanks as usual for such a quick response.

The count of comparisons is only 1.2M as I use strict blocking rules, much lower than the 1e8 I'm able to run with estimate_u in just a few minutes... I guess even if it only uses 3 of the 16 cores for predict it should finish relatively quickly.

When I run my predict as such pairs = linker.predict(threshold_match_probability=0.5), I only am consuming 12GB out of my 128GB of RAM, so I don't think it's spilling to disk at all.

Maybe something with my custom comparison is causing it to stall in the predict without generating an error message so I can try to debug a bit.

RobinL Apr 3, 2024
Maintainer Author

Not sure. If you're able to post the full script it might help get to the bottom of it. It's certainly strange - with only 1.2 mil comparisons, it should be very fast

jlb52 Apr 3, 2024

Happy to share a slightly abbreviated version (but which has the same issues).

bit_string is a string len=1024 with only '0' and '1' characters.

conn = duckdb.connect(database=":memory:", read_only=False)

query = """
SELECT unique_id, first_name, last_name, bit_string
FROM table
"""
df = conn.execute(query).df()

# Added new subclass of DistanceFunctionAtThresholdsBase
EMBEDDING_COMPARISON = cl.hamming_at_thresholds(
    col_name="bit_string",
    distance_threshold_or_thresholds=[120, 130, 140, 150, 160, 170, 180, 190, 200, 210],
    include_exact_match_level=False,
)

blocking_rules = [
    block_on(["first_name", 'last_name'], 16),
]

settings = {
    "link_type": "dedupe_only",
    "comparisons": [EMBEDDING_COMPARISON],
    "retain_matching_columns": True,
    "retain_intermediate_calculation_columns": False,
    "max_iterations": 100,
    "probability_two_random_records_match": 1e-5,
}

linker = DuckDBLinker(df, settings)

# Runs in 8s
linker.estimate_u_using_random_sampling(1e8)

# Runs in 1.5min
linker.estimate_parameters_using_expectation_maximisation(
                block_on(["last_name"]))

# Never progresses past here
pairs = linker.predict(
            threshold_match_probability=.5
        )

clusters = linker.cluster_pairwise_predictions_at_threshold(
    pairs, threshold_match_probability=.5
)

If you don't see anything fishy, I'm tempted to think there's something wrong in my cl.hamming_at_thresholds function of how it works with predict so I can look into that.

RobinL Apr 3, 2024
Maintainer Author

If you don't have any blocking rules for prediction in your settings, then it will take the Cartesian product of the input data frame e.g. if that's 1mil records, it'll be creating 1 trillion comparisons.

jlb52 Apr 3, 2024

Right... of course this was it. Runs in 1min now and uses all cores. Thanks much.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Documenting findings re: parallelisation and duckdb #1830

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Parallelism (Multi-Core Processing)

The Effect of Row Groups on Parallelism

Too Many Threads

Larger-Than-Memory Workloads (Out-of-Core Processing)

Replies: 2 comments 6 replies

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

Documenting findings re: parallelisation and duckdb #1830

RobinL Jan 9, 2024 Maintainer

Summary

Experiments

Parallelising blocking (with salting or not)

Parallelising full cartesian join

Runnable examples

100k input rows, no order by

100k input rows, order by 1

3m input rows, no order by

3m input rows, order by 1

No salting

Salting

Predictions

Links and docs

DuckDB docs

Parallelism (Multi-Core Processing)

The Effect of Row Groups on Parallelism

Too Many Threads

Larger-Than-Memory Workloads (Out-of-Core Processing)

Question on DuckDB discussion forums

Todo:

Replies: 2 comments · 6 replies

RobinL Jan 10, 2024 Maintainer Author

Tentative conclusions

No salting

Salting

Estimate u

Predictions

jlb52 Apr 3, 2024

jlb52 Apr 3, 2024

RobinL Apr 3, 2024 Maintainer Author

jlb52 Apr 3, 2024

RobinL Apr 3, 2024 Maintainer Author

jlb52 Apr 3, 2024

RobinL
Jan 9, 2024
Maintainer

100k input rows, `order by 1`

3m input rows, `order by 1`

Replies: 2 comments 6 replies

RobinL
Jan 10, 2024
Maintainer Author

jlb52
Apr 3, 2024

RobinL Apr 3, 2024
Maintainer Author

RobinL Apr 3, 2024
Maintainer Author