Replies: 3 comments 2 replies
-
I'm also interested in whether there's a recommended approach to this, as combining term frequencies with a double-metaphone column creates a significant slowdown (which is to be expected, really). Dropping a VARCHAR[] column gives me a 6x speedup when creating my __splink__df_concat and even better gains when creating __splink__df_concat_with_tf. It also seems to let me use 100% of the CPU. I may try expanding the VARCHAR[] column into two separate columns; have you tried this? Edit: to clarify, dmetaphone works for me, it's just very slow due to handling arrays.
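For anyone wanting to try the two-column idea, here is a minimal sketch of splitting a (primary, alternate) double-metaphone pair into two scalar values, so the column no longer needs to be a VARCHAR[]. The helper name `split_dmetaphone`, the column names `dm_primary` and `dm_alt`, and the sample codes are all made up for illustration; the tuples stand in for whatever `phonetics.dmetaphone()` returns on your data.

```python
from typing import Optional, Tuple


def split_dmetaphone(
    codes: Optional[Tuple[str, str]]
) -> Tuple[Optional[str], Optional[str]]:
    """Split a (primary, alternate) pair into two scalars.

    Empty strings become None so they are treated as missing values
    rather than as a code that matches every other empty code.
    """
    if codes is None:
        return (None, None)
    primary, alternate = codes
    return (primary or None, alternate or None)


# Illustrative rows; the codes are hard-coded stand-ins, not library output.
rows = [
    {"first_name": "Smith", "dm_first_name": ("SM0", "XMT")},
    {"first_name": "Anna", "dm_first_name": ("AN", "")},
    {"first_name": None, "dm_first_name": None},
]

# Replace the tuple-valued field with two plain string fields.
for row in rows:
    row["dm_primary"], row["dm_alt"] = split_dmetaphone(row.pop("dm_first_name"))
```

The resulting list of dicts can be passed to `pd.DataFrame(rows)`, giving two ordinary string columns. I haven't benchmarked this against the array column, so treat the speed benefit as something to verify.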
-
In reply to the main question, I think you probably want to avoid the [...]. If you want to use arrays/dmetaphones, I'd probably do an array intersect level. If you're in DuckDB, for speed, I'd probably write a comparison using custom SQL with the new list_intersect function in DuckDB.
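A rough sketch of what such an array-intersect comparison might look like as a custom comparison dictionary. The helper name `dmetaphone_intersect_comparison` and the column name `dm_first_name` are hypothetical, and the dictionary layout (`output_column_name`, `comparison_levels`, `sql_condition`, `label_for_charts`, `is_null_level`) reflects my understanding of Splink's custom comparison settings, so please check it against the Splink version you are running. It also assumes a DuckDB version that ships `list_intersect`.

```python
def dmetaphone_intersect_comparison(col_name: str) -> dict:
    """Build a comparison dict whose match level is 'the two arrays overlap'."""
    # Splink suffixes the two sides of a pairwise comparison with _l and _r.
    left = f'"{col_name}_l"'
    right = f'"{col_name}_r"'
    return {
        "output_column_name": col_name,
        "comparison_description": "Double metaphone array intersect",
        "comparison_levels": [
            {
                "sql_condition": f"{left} IS NULL OR {right} IS NULL",
                "label_for_charts": "Null",
                "is_null_level": True,
            },
            {
                # DuckDB: list_intersect returns the shared elements,
                # len() counts them.
                "sql_condition": f"len(list_intersect({left}, {right})) >= 1",
                "label_for_charts": "At least one shared metaphone code",
            },
            {
                "sql_condition": "ELSE",
                "label_for_charts": "All other comparisons",
            },
        ],
    }


comparison = dmetaphone_intersect_comparison("dm_first_name")
```

The returned dict would go in the `comparisons` list of the settings in place of a `name_comparison(...)` entry. Whether it is actually faster than the exact-match route on your data is something to measure.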
-
All the comments are helpful. I was getting an error with DuckDB and switched to phonetics.metaphone() with success. I have now switched to Spark. I have a large dataset to compare, so I will just use metaphone, purely based on the comments about performance.
-
In the documentation on Phonetic transformations you show several ways to use `.name_comparison(...phonetic_col_name=..)`; in addition, within the splink code it performs an exact match on `phonetic_col_name`. But `phonetics.dmetaphone()` produces a tuple containing the main and alternative metaphone. Currently dmetaphone produces an error when running splink because it is attempting to cast VARCHAR[] to VARCHAR. I read in another discussion that matching on arrays is problematic. Do you have a recommendation for the compatibility of `phonetic_col_name` and `phonetics.dmetaphone()`? Below is how I am adding dmetaphone to the pd.DataFrame(). I could cast to a string, since it is an exact match anyway, but was looking for a recommendation.
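The cast-to-string idea could look roughly like the sketch below. This is not the original poster's code; the helper name `dmetaphone_to_str`, its parameters, and the sample codes are invented for illustration, and the input tuples stand in for the output of `phonetics.dmetaphone()`.

```python
from typing import Optional, Tuple


def dmetaphone_to_str(
    codes: Optional[Tuple[str, str]],
    primary_only: bool = True,
    sep: str = "|",
) -> Optional[str]:
    """Collapse a (primary, alternate) pair into one string for exact matching."""
    if codes is None:
        return None
    primary, alternate = codes
    if primary_only or not alternate:
        # Keep only the primary code; an empty code is treated as missing.
        return primary or None
    # Join both codes so the column stays a plain VARCHAR.
    return f"{primary}{sep}{alternate}"


# Hard-coded example pair, not actual library output.
primary_code = dmetaphone_to_str(("SM0", "XMT"))
joined_code = dmetaphone_to_str(("SM0", "XMT"), primary_only=False)
```

Note the trade-off: an exact match on the joined string only succeeds when both codes agree, which is stricter than "the two arrays share at least one code". Keeping only the primary code behaves more like plain metaphone.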