One to Many mapping query on "link_type" = "dedup" through unique_id_l and unique_id_r grouping order. #2087

ABJ66 · 2024-03-21T15:06:03Z

Hi all,

I have two source datasets. Using Splink, I would like to de-duplicate each dataset individually and then run a linkage task between the two. My hypothesis is that de-duplicating first will trim the candidates in both datasets that are sent to the final linkage step.

De-duplicating a single data source gives me around 60 pairs of unique_ids. I want to identify which unique_ids, from the 'left' or the 'right' side, need to be dropped before the linkage task. Most pairs show a 1:1 mapping, while a few are 1:many. The two scenarios below show the 1:many mappings only:

Scenario 1 output: Group on 'unique_id_l' and capture all 'unique_id_r' with their counts:

| unique_id_l | unique_id_r_items        | unique_id_r_cnt |
|------------:|:-------------------------|----------------:|
|  1000010628 | [1000044057, 1000208184] |               2 |
|  1000188607 | [1000223138, 1000242441] |               2 |

Scenario 2 output: Group on 'unique_id_r' and capture all 'unique_id_l' with their counts:

| unique_id_r | unique_id_l_items        | unique_id_l_cnt |
|------------:|:-------------------------|----------------:|
|  1000208184 | [1000010628, 1000044057] |               2 |
|  1000213642 | [1000010159, 1000213641] |               2 |
|  1000242441 | [1000188607, 1000223138] |               2 |

As you can see, the 2nd scenario contains an additional row (unique_id_r = 1000213642) that is not captured in the 1st scenario. Is this the expected behavior? If so, does one need to check both scenarios to see which has the higher row count for the 1:many mappings (e.g. scenario 2 has 3 rows vs. 2 rows in scenario 1)?
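The difference between the two groupings can be reproduced without Splink. The sketch below is only an illustration: the `pairs` list is a hypothetical reconstruction from the two tables above (the 1:1 pairs that were not shown are omitted), `one_to_many` is a made-up helper name, and it assumes each pair appears once with the smaller id in unique_id_l, which is consistent with the output shown.

```python
from collections import defaultdict

# Hypothetical (unique_id_l, unique_id_r) pairs rebuilt from the two tables above.
pairs = [
    (1000010628, 1000044057),
    (1000010628, 1000208184),
    (1000044057, 1000208184),
    (1000188607, 1000223138),
    (1000188607, 1000242441),
    (1000223138, 1000242441),
    (1000010159, 1000213642),
    (1000213641, 1000213642),
]


def one_to_many(pairs, key_index):
    """Group pairs on one side and keep only keys with more than one partner."""
    groups = defaultdict(list)
    for pair in pairs:
        groups[pair[key_index]].append(pair[1 - key_index])
    return {key: sorted(items) for key, items in groups.items() if len(items) > 1}


by_left = one_to_many(pairs, 0)   # scenario 1: group on unique_id_l
by_right = one_to_many(pairs, 1)  # scenario 2: group on unique_id_r

print(len(by_left), len(by_right))  # 2 3
print(by_right[1000213642])         # [1000010159, 1000213641]

# Stacking both orientations before grouping gives every record's full partner list.
symmetric = pairs + [(right, left) for left, right in pairs]
all_partners = one_to_many(symmetric, 0)
```

Under that assumption neither grouping is complete on its own. 1000213642 is the largest id in its group, so it only ever appears on the right and never shows up as a 1:many key when grouping on unique_id_l.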

I should add that, although Splink can link and de-duplicate in a single step, we are required to do this in sequence rather than in one go.

Thanks in advance for your input and suggestions.

Regards.

ABJ66 · 2024-04-11T06:24:19Z

Author

Hi all,

Just to add another observation.

Even though the 2nd scenario has an additional row showing
1000213642 <-> 1000010159 and 1000213642 <-> 1000213641 as duplicates, the actual de-duped output contains these combinations as separate rows in the opposite order:
1000010159 <-> 1000213642 and 1000213641 <-> 1000213642.

Because the order is reversed relative to the first set of pairs, these combinations cannot be marked as duplicates under the same "duplicate_id" (which is generated by custom code). It seems that Splink puts the lower id in unique_id_l and the higher one in unique_id_r.

Could someone kindly comment on the above questions and observations?
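One way to make a "duplicate_id" independent of pair orientation is to treat each pair as an undirected edge and label connected components. The sketch below is a minimal union-find in plain Python. It is not the custom code mentioned above (which is not shown in this thread); `assign_duplicate_ids` is a hypothetical helper name, and using the smallest member id as the label is just a convention chosen here.

```python
def assign_duplicate_ids(pairs):
    """Map every id in `pairs` to the smallest id in its connected component."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Path halving keeps the trees shallow.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in pairs:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Keep the smaller id as the root so the label is deterministic.
            parent[max(root_l, root_r)] = min(root_l, root_r)

    return {node: find(node) for node in list(parent)}


# The two rows discussed above, in the orientation they appear in the output.
pairs = [(1000010159, 1000213642), (1000213641, 1000213642)]
duplicate_id = assign_duplicate_ids(pairs)

# Swapping left and right in every pair gives the same labels.
flipped = assign_duplicate_ids([(right, left) for left, right in pairs])
print(duplicate_id == flipped)  # True
```

All three records end up with the label 1000010159, regardless of which column each id appeared in.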

Thanks in advance.

RobinL Apr 19, 2024
Maintainer

Sorry for the delay, was on leave.

If I understand the question correctly, I think the simplest way would be to use clustering to perform the deduplication. The critical part is at the end of this script:

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
    ],
    "retain_intermediate_calculation_columns": True,
    "additional_columns_to_retain": ["cluster"],
}


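linker = DuckDBLinker(df, settings)

# Estimate the prior probability that two random records match
linker.estimate_probability_two_random_records_match(
    [
        block_on(["first_name", "surname"]),
    ],
    recall=0.7,
)

# Estimate the u probabilities from randomly sampled record pairs
linker.estimate_u_using_random_sampling(target_rows=1e6)

# Train the remaining model parameters with expectation maximisation,
# once per blocking rule
blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

# Score every candidate pair produced by the blocking rules
df_predict = linker.predict()

# Cluster the records, joining pairs with a match probability of at least 0.9
df_clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, 0.9)

# One row per cluster: its member ids and its size
linker.query_sql(
    f"""
select cluster_id, list(unique_id) as unique_id_items, count(*) as cluster_size
from {df_clusters.physical_name}
group by cluster_id
    """
)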

Results in:

|   cluster_id | unique_id_items   |   cluster_size |
|-------------:|:------------------|---------------:|
|            0 | [0, 3]            |              2 |
|            1 | [1]               |              1 |
|            2 | [2]               |              1 |
|            4 | [4, 5]            |              2 |
|            6 | [6]               |              1 |

Note that df_clusters looks like:

|   cluster_id |   unique_id | first_name   | surname   | dob        | city   | email                   |   cluster |   __splink_salt |      tf_city |
|-------------:|------------:|:-------------|:----------|:-----------|:-------|:------------------------|----------:|----------------:|-------------:|
|            0 |           0 | Robert       | Alan      | 1971-06-24 |        | [email protected]     |         0 |        0.826796 | nan          |
|            1 |           1 | Robert       | Allen     | 1971-05-24 |        | [email protected]     |         0 |        0.125242 | nan          |
|            2 |           2 | Rob          | Allen     | 1971-06-24 | London | [email protected]     |         0 |        0.1533   |   0.212792   |
|            0 |           3 | Robert       | Alen      | 1971-06-24 | Lonon  |                         |         0 |        0.800333 |   0.00738007 |
|            4 |           4 | Grace        |           | 1997-04-26 | Hull   | [email protected] |         1 |        0.90222  |   0.00123001 |

and that the cluster_id is guaranteed to be the unique_id of one of the records in the cluster.
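For the original goal of trimming each dataset before linkage, one option is to keep a single representative per cluster and drop the rest. This is a minimal sketch, assuming the (cluster_id, unique_id) columns of df_clusters have been pulled into plain Python tuples. The `rows` list is hypothetical sample data copied from the first results table above, and the sketch relies on the cluster_id matching the unique_id of one cluster member.

```python
# Hypothetical (cluster_id, unique_id) rows, copied from the first results table above.
rows = [(0, 0), (1, 1), (2, 2), (0, 3), (4, 4), (4, 5), (6, 6)]

# Keep the record whose unique_id equals its cluster_id; drop every other member.
keep = sorted(uid for cluster_id, uid in rows if uid == cluster_id)
drop = sorted(uid for cluster_id, uid in rows if uid != cluster_id)

print(keep)  # [0, 1, 2, 4, 6]
print(drop)  # [3, 5]
```

Which record best represents a cluster (for example, the most complete one) is a separate decision. Using the cluster_id is simply the most convenient default.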

ABJ66 · 2024-04-22T05:10:50Z

Author

Hi Robin,

Thanks for the response. I tested this with slightly different code that also uses the cluster_pairwise_predictions_at_threshold() method, and I can confirm that it captures duplicates in both the direct and the indirect sense. An example of the indirect case: for the pairs A --> B and B --> C, the cluster id yields the two rows A --> B and A --> C.

I am able to use this feature from Splink and further perform transformations as required. So thanks for your help here.

A follow-up question: A --> B and B --> C are the valid pairs from the pairwise predictions, while A --> B and A --> C are the ones produced by clustering those predictions, so metrics such as match_weight and match_probability are not available for A --> C. Could you guide me on how to get these metrics in such cases? I think the trained model could help here, but a few details on the code or logic would be useful.
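A possible first step is to find which within-cluster pairs have no row in the pairwise predictions. The sketch below does that in plain Python for the A, B, C example; `scored_pairs`, `clusters` and `unscored_pairs` are hypothetical names standing in for the prediction table, the cluster table and a helper. Each pair it returns could then be scored individually with the trained model. In Splink 3 the linker's `compare_two_records` method looks like the relevant entry point, but treat that as an assumption to check against the documentation for the installed version.

```python
from itertools import combinations

# Hypothetical inputs: pairs that have a row in the predictions, and cluster membership.
scored_pairs = {("A", "B"), ("B", "C")}
clusters = {"A": ["A", "B", "C"]}


def unscored_pairs(clusters, scored_pairs):
    """Return within-cluster pairs that have no row in the pairwise predictions."""
    scored = {frozenset(pair) for pair in scored_pairs}  # orientation-independent lookup
    missing = []
    for members in clusters.values():
        for left, right in combinations(sorted(members), 2):
            if frozenset((left, right)) not in scored:
                missing.append((left, right))
    return missing


print(unscored_pairs(clusters, scored_pairs))  # [('A', 'C')]
```

A pair can be missing because no blocking rule generated it, so a direct score for A --> C may turn out low even though the two records were joined through B.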

Thanks.
