-
it's almost certainly the OR in your blocking rules. You would need to split the OR condition into two rules. As a very simple example, if you had a single rule combining two conditions with OR, you would instead want two blocking rules, one per condition (see the sketch below). Re: the cumulative count, it might be that somehow the SQL engine can optimise counts a bit further and avoid the cartesian join in that instance.
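For instance, with made-up column names (assuming the Splink 3 settings dict, where blocking rules are SQL strings):

```python
# A single blocking rule containing an OR: the engine can't execute this
# as a simple equi-join, so it tends towards scanning the cartesian product
blocking_rules_to_generate_predictions = [
    "l.postcode = r.postcode OR l.surname = r.surname",
]

# Split into two rules: each is a pure equi-join (hash-joinable, fast),
# and Splink deduplicates pairs that are captured by both rules
blocking_rules_to_generate_predictions = [
    "l.postcode = r.postcode",
    "l.surname = r.surname",
]
```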
-
Hello, I've been using splink for a while, experimenting with its performance on a record linkage task I have to perform: linking a dataset of 100k rows to a dataset of 6M rows.
I have access to a machine with 64 cores/128 threads and 1 TB of RAM, on which I am using the DuckDB backend. I thought performance wouldn't be an issue, but I kept encountering OOM errors, so I gradually reduced my blocking to the point where my rules (I currently use the same rules for blocking and predicting) generate only 10M comparisons, and yet the runs still crash (depending on the case, when estimating U, when estimating M, or when predicting).
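For context, my setup is essentially the following (simplified, with made-up variable names; the import path is for Splink 3 and may differ by version):

```python
from splink.duckdb.linker import DuckDBLinker

# df_small: ~100k rows, df_large: ~6M rows
# settings uses link_type "link_only" and the same rules
# for blocking and for predicting
linker = DuckDBLinker([df_small, df_large], settings)
```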
I'm aware that my blocking rules are not strictly equi-join conditions, as they involve OR, but I thought OR was still a native SQL operation (as opposed to, say, a Levenshtein call) and hence would be fast enough. I'm still not sure this is the problem, because `linker.cumulative_num_comparisons_from_blocking_rules_chart()` runs in 1m15s with the 24 rules.

With regard to comparisons, most are custom, but the only functions I use are SUBSTRING, ABS and jaro_winkler_similarity. I do use term frequency adjustments in most of them, and intentionally set `estimate_without_term_frequencies=False` in my M training, but still, I didn't expect 10M comparisons to be enough to saturate 1 TB of RAM.
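For reference, a simplified version of one of my custom comparisons looks roughly like this (illustrative column name and thresholds, not my real config):

```python
# Simplified example of a custom comparison (Splink 3 dict format);
# my real ones follow the same pattern, with SUBSTRING/ABS conditions
name_comparison = {
    "output_column_name": "name",
    "comparison_levels": [
        {
            "sql_condition": "name_l IS NULL OR name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "name_l = name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "name",  # term frequency adjustment
        },
        {
            # jaro_winkler_similarity is a native DuckDB function
            "sql_condition": "jaro_winkler_similarity(name_l, name_r) > 0.9",
            "label_for_charts": "Jaro-Winkler > 0.9",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
```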
Another, perhaps unrelated, issue: in the same notebook I actually run splink twice, removing from my dataframes the pairs that were predicted in the first round. In the rare cases where all three steps of the first round (U estimation, M estimation and prediction) run without issue and I get to the second round, I get another error from DuckDB, along the lines of 'resource temporarily unavailable'. I'm surprised, because by the time I create the second linker I have already run `invalidate_cache` and `delete_tables_created_by_splink_from_db` and then deleted the linker (roughly the teardown sketched below), so I assumed that DuckDB would be free for use by a new linker.

Beyond reducing the size of my blocking rules, I have replaced a long string value on which I did exact matching with an identifier, and tried to optimize memory use by deleting unused objects, but I still encounter issues.
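The cleanup between the two rounds is roughly the following (method names as in Splink 3; the ordering is my own guess at a safe teardown):

```python
import gc

# Teardown after round one, before creating the second linker
linker.invalidate_cache()  # discard Splink's cached intermediate results
linker.delete_tables_created_by_splink_from_db()  # drop __splink__ tables from DuckDB
del linker  # release the Python-side linker (and its DuckDB connection)
gc.collect()  # encourage Python/DuckDB to actually free the memory

# ...a fresh DuckDBLinker is then created for round two
```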
I attach a notebook showing my code and the output up to the crash. Thanks in advance for any help.
splink.ipynb.txt