-
Performance depends on the comparison columns, as you stated, plus the data itself and the blocking rules. The total number of comparisons generated is generally the better figure for estimating runtime. Combining those two details with the state of your data will tell you what results to expect. In my model, which has similar comparison levels plus a few more, I can predict on 19 million records, cluster those predictions, and save them to S3 in 30 minutes. The slow step for me is `.as_pandas_dataframe()`. But again, my blocking is different and my data is different.
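To illustrate the shape of that pipeline, here is a minimal sketch assuming Splink 3's Spark backend; `spark`, `df`, `settings`, the thresholds, and the S3 path are all placeholders, not details from the comment above:

```python
from splink.spark.linker import SparkLinker

linker = SparkLinker(df, settings)

# Score the candidate pairs generated by the blocking rules.
predictions = linker.predict(threshold_match_probability=0.9)

# Group the scored pairs into clusters of matching records.
clusters = linker.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)

# as_pandas_dataframe() collects every row onto the driver, which is the
# slow step mentioned above. Writing the underlying Spark table straight
# to S3 keeps the work distributed (this assumes the result table is
# registered in the Spark session under clusters.physical_name).
spark.table(clusters.physical_name).write.mode("overwrite").parquet(
    "s3://my-bucket/splink-clusters/"
)
```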
-
I'm a newbie with Splink. I want to benchmark Splink's Spark backend on 100M golden records.
What are the min/max times to find a matching record? I use first_name, last_name, email, phone, and address to deduplicate the records.
Does anyone have those metrics?
Thanks in advance.
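There is no universal min/max figure: as the reply above notes, runtime is dominated by how many pairwise comparisons the blocking rules generate, so a practical first step is to count them before running a full prediction. A minimal sketch, assuming Splink 3's Spark API; `df`, the blocking rules, and the comparison choices are illustrative placeholders for the fields listed above:

```python
import splink.spark.comparison_library as cl
from splink.spark.linker import SparkLinker

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.email = r.email",
        "l.phone = r.phone",
        "l.first_name = r.first_name and l.last_name = r.last_name",
    ],
    # Placeholder comparisons; a real model would use fuzzy levels too.
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("last_name"),
        cl.exact_match("email"),
        cl.exact_match("phone"),
        cl.exact_match("address"),
    ],
}

linker = SparkLinker(df, settings)

# Count the pairs each rule would generate on the 100M records before
# committing to a full predict() run; this number is what drives runtime.
for rule in settings["blocking_rules_to_generate_predictions"]:
    n = linker.count_num_comparisons_from_blocking_rule(rule)
    print(f"{rule}: {n:,} comparisons")
```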