Integrating a Non-Match Indicator Column into the Matching Algorithm #2168

Ahosseinzadeh723 · 2024-05-07T15:32:37Z

My question is primarily methodological. I am working on a project where I match individuals based on first name, last name, date of birth, and other criteria. I encounter cases, such as newborn babies and twins, where the records are so similar that they could be mistakenly identified as the same individual. I have additional variables (external to the algorithm) that help me identify such cases, so I can note in the data (for example, in a new column) that two records represent different individuals (e.g., twins) even though they appear identical. How can I integrate this information so that the final clustering results automatically distinguish between such cases? This would avoid having to adjust the clustering manually, such as assigning different cluster IDs to twins or newborns.

Let me simplify it with an example: I have Person 1 and Person 2, whose details appear identical, but I've identified them as twins in a new column ('twin_identifier') and assigned different numbers to indicate they are not the same person. How can I adjust the algorithm to ensure that this probabilistic model recognizes them as distinct individuals?

aalexandersson · 2024-05-07T19:36:53Z

Duplicate of "twin issue" in discussion/2023?

1 reply

Ahosseinzadeh723 May 7, 2024
Author

@aalexandersson Thank you so much, it was a really helpful discussion!

RobinL · 2024-05-12T09:09:00Z

Maintainer

Good question. I think the best approach here is to override the trained match weights with a very strong positive match weight when twin_identifier matches, and a very strong negative match weight when it does not.

It's a bit fiddly to do this in Splink 3 (we're working on a better API in the forthcoming Splink 4), but here's a working example of how you can do it:

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
df["twin_identifier"] = df["cluster"]

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
        exact_match("twin_identifier")
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)

linker.estimate_probability_two_random_records_match(
    [
        block_on(["first_name", "surname"]),
    ],
    recall=0.7,
)

linker.estimate_u_using_random_sampling(target_rows=1e6)

blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

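# Override the trained parameters on the twin_identifier comparison.
# It is the sixth entry in "comparisons" above, hence index 5.
# Note: _settings_obj is a private attribute, so this may change between versions.
trained_twin_comparison = linker._settings_obj.comparisons[5]

# Level 1 is the exact match level: agreement becomes very strong evidence for a match
trained_twin_comparison.comparison_levels[1].m_probability = 1.0
trained_twin_comparison.comparison_levels[1].u_probability = 1e-9

# Level 2 is the "anything else" level: disagreement becomes very strong evidence against
trained_twin_comparison.comparison_levels[2].m_probability = 1e-9
trained_twin_comparison.comparison_levels[2].u_probability = 1.0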

# Check the match weights look right
linker.match_weights_chart()

df_predict = linker.predict()
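
To see why these m and u values dominate the prediction, here is a minimal sketch of the arithmetic, using only the standard library. It relies on the usual Fellegi-Sunter relationship, where a match weight is log2(m/u) and the weights are added to the prior before converting back to a probability. The helper names `match_weight` and `posterior_probability` and the weights for the other columns are made up for illustration; they are not part of Splink's API.

```python
import math


def match_weight(m: float, u: float) -> float:
    """Match weight as log2 of the Bayes factor m / u."""
    return math.log2(m / u)


def posterior_probability(prior: float, weights: list[float]) -> float:
    """Combine a prior match probability with per-comparison match weights."""
    prior_weight = math.log2(prior / (1 - prior))
    bayes_factor = 2 ** (prior_weight + sum(weights))
    return bayes_factor / (1 + bayes_factor)


# Weights implied by the overridden values on the twin_identifier comparison
agree_weight = match_weight(1.0, 1e-9)     # roughly +29.9
disagree_weight = match_weight(1e-9, 1.0)  # roughly -29.9

# Hypothetical weights for the other columns when they all agree (illustrative only)
other_weights = [5.0, 6.0, 8.0]

# Same prior as "probability_two_random_records_match" in the settings above
p_twins = posterior_probability(0.01, other_weights + [disagree_weight])
p_same = posterior_probability(0.01, other_weights + [agree_weight])

print(f"identifier differs: {p_twins:.6f}")
print(f"identifier agrees:  {p_same:.6f}")
```

Under these assumptions, a differing twin_identifier pushes the probability close to zero even when every other column agrees, which is the behaviour the override is meant to produce.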

A second option is to query the resulting dataframe after calling linker.predict():


df_predict = linker.predict()

sql = f"""
select *, 
case when twin_identifier_l = twin_identifier_r then 1.0
when twin_identifier_l != twin_identifier_r then 0.0
end
   as adjusted_probability
from {df_predict.physical_name}

"""
linker.query_sql(sql, output_type="splink_df")
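
One thing to watch with the query above: the CASE expression has no ELSE branch, so a pair where twin_identifier is null on either side gets a null adjusted_probability. If that matters for your data, one possible variant falls back to the model's match_probability. The sketch below tries this on a tiny made-up table using Python's built-in sqlite3 rather than Splink, so the table name `predictions` and the rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table predictions ("
    "unique_id_l int, unique_id_r int, "
    "twin_identifier_l int, twin_identifier_r int, "
    "match_probability real)"
)
conn.executemany(
    "insert into predictions values (?, ?, ?, ?, ?)",
    [
        (1, 2, 10, 10, 0.62),    # same identifier
        (3, 4, 20, 21, 0.97),    # flagged as different people (e.g. twins)
        (5, 6, None, 30, 0.80),  # identifier missing on one side
    ],
)

sql = """
select unique_id_l, unique_id_r,
case when twin_identifier_l = twin_identifier_r then 1.0
     when twin_identifier_l != twin_identifier_r then 0.0
     else match_probability  -- keep the model's score when an identifier is null
end as adjusted_probability
from predictions
order by unique_id_l
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

The same ELSE branch could be added to the query passed to linker.query_sql, assuming you want pairs without an identifier to keep their original score.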
