-
it's almost certainly the OR in your blocking rules. You would need to split the OR condition into two rules. As a very simple example, if you had a single rule combining two conditions with OR, you would instead want two blocking rules, one per condition (see the sketch below). Re: the cumulative count, it might be that somehow the SQL engine can optimise counts a bit further and avoid the cartesian join in that instance.
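For instance, with made-up column names (assuming the Splink 3 settings dict, where blocking rules are SQL strings):

```python
# A single blocking rule containing an OR: the engine can't execute this
# as a simple equi-join, so it tends towards scanning the cartesian product
blocking_rules_to_generate_predictions = [
    "l.postcode = r.postcode OR l.surname = r.surname",
]

# Split into two rules: each is a pure equi-join (hash-joinable, fast),
# and Splink deduplicates pairs that are captured by both rules
blocking_rules_to_generate_predictions = [
    "l.postcode = r.postcode",
    "l.surname = r.surname",
]
```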
-
Hello, I've been using splink for a while, experimenting with its performance on a record linkage task I have to perform: linking a dataset of 100k rows to a dataset of 6M rows.
I have access to a machine with 64 cores/128 threads and 1 TB of RAM, on which I am using the DuckDB backend. I thought performance wouldn't be an issue, but I kept encountering OOM errors, so I gradually reduced my blocking to the point where my rules (I currently use the same rules for blocking and predicting) generate only 10M comparisons, and yet the runs still crash (depending on the case, when estimating U, when estimating M, or when predicting).
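For context, my setup is essentially the following (simplified, with made-up variable names; the import path is for Splink 3 and may differ by version):

```python
from splink.duckdb.linker import DuckDBLinker

# df_small: ~100k rows, df_large: ~6M rows
# settings uses link_type "link_only" and the same rules
# for blocking and for predicting
linker = DuckDBLinker([df_small, df_large], settings)
```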
I'm aware that my blocking rules are not strictly equi-join conditions, as they involve OR, but I thought OR was still a native SQL operation (as opposed to, say, a Levenshtein call) and hence would be fast enough. I'm still not sure this is the problem, because `linker.cumulative_num_comparisons_from_blocking_rules_chart()` runs in 1m15s with the 24 rules.

With regard to comparisons, most are custom, but the only functions I use are SUBSTRING, ABS and jaro_winkler_similarity. I do use term frequency adjustments in most of them, and intentionally set `estimate_without_term_frequencies=False` in my M training, but still, I didn't expect 10M comparisons to be enough to saturate 1 TB of RAM.
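For reference, a simplified version of one of my custom comparisons looks roughly like this (illustrative column name and thresholds, not my real config):

```python
# Simplified example of a custom comparison (Splink 3 dict format);
# my real ones follow the same pattern, with SUBSTRING/ABS conditions
name_comparison = {
    "output_column_name": "name",
    "comparison_levels": [
        {
            "sql_condition": "name_l IS NULL OR name_r IS NULL",
            "label_for_charts": "Null",
            "is_null_level": True,
        },
        {
            "sql_condition": "name_l = name_r",
            "label_for_charts": "Exact match",
            "tf_adjustment_column": "name",  # term frequency adjustment
        },
        {
            # jaro_winkler_similarity is a native DuckDB function
            "sql_condition": "jaro_winkler_similarity(name_l, name_r) > 0.9",
            "label_for_charts": "Jaro-Winkler > 0.9",
        },
        {"sql_condition": "ELSE", "label_for_charts": "All other comparisons"},
    ],
}
```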
Another, perhaps unrelated, issue: in the same notebook I actually run splink twice, removing from my dataframes the pairs that were predicted in the first round. In the rare cases where all three steps of the first round (U estimation, M estimation and prediction) run without issue and I get to the second round, I get another error from DuckDB, along the lines of 'resource temporarily unavailable'. I'm surprised, because by the time I create the second linker I have already run `invalidate_cache` and `delete_tables_created_by_splink_from_db` and then deleted the linker (roughly the teardown sketched below), so I assumed that DuckDB would be free for use by a new linker.

Beyond reducing the size of my blocking rules, I have replaced a long string value on which I did exact matching with an identifier, and tried to optimize memory use by deleting unused objects, but I still encounter issues.
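The cleanup between the two rounds is roughly the following (method names as in Splink 3; the ordering is my own guess at a safe teardown):

```python
import gc

# Teardown after round one, before creating the second linker
linker.invalidate_cache()  # discard Splink's cached intermediate results
linker.delete_tables_created_by_splink_from_db()  # drop __splink__ tables from DuckDB
del linker  # release the Python-side linker (and its DuckDB connection)
gc.collect()  # encourage Python/DuckDB to actually free the memory

# ...a fresh DuckDBLinker is then created for round two
```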
I attach a notebook showing my code and the output up to the crash. Thanks in advance for any help.
splink.ipynb.txt