-
Performance depends on the comparison columns, as you stated, plus the data itself and the blocking rules. The total number of comparisons generated is generally the better figure for estimating runtime. Combining those two details with the state of your data will tell you what results to expect. In my model, which has similar comparison levels plus a few more, I can predict on 19 million records, cluster those predictions, and save them to S3 in 30 minutes. The slow step for me is `.as_pandas_dataframe()`. But again, my blocking is different and my data is different.
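To illustrate the shape of that pipeline, here is a minimal sketch assuming Splink 3's Spark backend; `spark`, `df`, `settings`, the thresholds, and the S3 path are all placeholders, not details from the comment above:

```python
from splink.spark.linker import SparkLinker

linker = SparkLinker(df, settings)

# Score the candidate pairs generated by the blocking rules.
predictions = linker.predict(threshold_match_probability=0.9)

# Group the scored pairs into clusters of matching records.
clusters = linker.cluster_pairwise_predictions_at_threshold(
    predictions, threshold_match_probability=0.95
)

# as_pandas_dataframe() collects every row onto the driver, which is the
# slow step mentioned above. Writing the underlying Spark table straight
# to S3 keeps the work distributed (this assumes the result table is
# registered in the Spark session under clusters.physical_name).
spark.table(clusters.physical_name).write.mode("overwrite").parquet(
    "s3://my-bucket/splink-clusters/"
)
```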
-
I'm a newbie with Splink. I want to benchmark Splink's Spark backend on 100M golden records.
What are the min/max times to find a matching record? I use first_name, last_name, email, phone, and address to deduplicate the records.
Does anyone have those metrics?
Thanks in advance.
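There is no universal min/max figure: as the reply above notes, runtime is dominated by how many pairwise comparisons the blocking rules generate, so a practical first step is to count them before running a full prediction. A minimal sketch, assuming Splink 3's Spark API; `df`, the blocking rules, and the comparison choices are illustrative placeholders for the fields listed above:

```python
import splink.spark.comparison_library as cl
from splink.spark.linker import SparkLinker

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        "l.email = r.email",
        "l.phone = r.phone",
        "l.first_name = r.first_name and l.last_name = r.last_name",
    ],
    # Placeholder comparisons; a real model would use fuzzy levels too.
    "comparisons": [
        cl.exact_match("first_name"),
        cl.exact_match("last_name"),
        cl.exact_match("email"),
        cl.exact_match("phone"),
        cl.exact_match("address"),
    ],
}

linker = SparkLinker(df, settings)

# Count the pairs each rule would generate on the 100M records before
# committing to a full predict() run; this number is what drives runtime.
for rule in settings["blocking_rules_to_generate_predictions"]:
    n = linker.count_num_comparisons_from_blocking_rule(rule)
    print(f"{rule}: {n:,} comparisons")
```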