Integrating a Non-Match Indicator Column into the Matching Algorithm #2168
Ahosseinzadeh723
started this conversation in
Ideas
Replies: 2 comments 1 reply
Duplicate of "twin issue" in discussion/2023?
Good question. I think the best approach here is to override the trained match weights on the twin_identifier comparison with very strong ones: strongly positive when the identifiers match and strongly negative when they differ. It's a bit fiddly to do this in Splink 3 (we're working on a better API for the forthcoming Splink 4), but here's a working example of how you can do it:

from splink.datasets import splink_datasets
from splink.duckdb.blocking_rule_library import block_on
from splink.duckdb.comparison_library import (
    exact_match,
    levenshtein_at_thresholds,
)
from splink.duckdb.linker import DuckDBLinker

df = splink_datasets.fake_1000
# For demonstration only: reuse the known cluster label as the twin identifier
df["twin_identifier"] = df["cluster"]

settings = {
    "probability_two_random_records_match": 0.01,
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        block_on(["first_name"]),
        block_on(["surname"]),
    ],
    "comparisons": [
        levenshtein_at_thresholds("first_name", 2),
        exact_match("surname"),
        exact_match("dob"),
        exact_match("city", term_frequency_adjustments=True),
        exact_match("email"),
        exact_match("twin_identifier"),
    ],
    "retain_intermediate_calculation_columns": True,
}

linker = DuckDBLinker(df, settings)

linker.estimate_probability_two_random_records_match(
    [
        block_on(["first_name", "surname"]),
    ],
    recall=0.7,
)

linker.estimate_u_using_random_sampling(target_rows=1e6)

blocking_rule = "l.first_name = r.first_name and l.surname = r.surname"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

blocking_rule = "l.dob = r.dob"
linker.estimate_parameters_using_expectation_maximisation(blocking_rule)

# Override the parameters on the twin_identifier column
# (index 5 = sixth comparison; level 1 = exact match, level 2 = all other)
trained_twin_comparison = linker._settings_obj.comparisons[5]
trained_twin_comparison.comparison_levels[1].m_probability = 1.0
trained_twin_comparison.comparison_levels[1].u_probability = 1e-9
trained_twin_comparison.comparison_levels[2].m_probability = 1e-9
trained_twin_comparison.comparison_levels[2].u_probability = 1.0

# Check the match weights look right
linker.match_weights_chart()
df_predict = linker.predict()

A second option is, after calling
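To see why the overridden values in the example above act as a near-hard constraint, it can help to convert them to match weights by hand. Splink reports a match weight as log2 of the Bayes factor m/u; the sketch below applies that formula to the two overridden levels. The `match_weight` helper is hypothetical (written for illustration, not part of Splink's API), and the m/u values are simply the ones set in the override.

```python
import math


def match_weight(m_probability: float, u_probability: float) -> float:
    """Hypothetical helper: match weight as log2 of the Bayes factor m/u."""
    return math.log2(m_probability / u_probability)


# Values taken from the override on the twin_identifier comparison
exact_match_weight = match_weight(m_probability=1.0, u_probability=1e-9)
else_weight = match_weight(m_probability=1e-9, u_probability=1.0)

print(f"twin_identifier exact match level: {exact_match_weight:+.1f}")
print(f"twin_identifier else level:        {else_weight:+.1f}")
```

Both weights have a magnitude of roughly 30, so unless the other comparisons contribute a combined weight of similar size in the opposite direction, a differing twin_identifier should push the pair's total match weight strongly negative, and an agreeing one strongly positive.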
My question is primarily methodological. I am working on a project where I match individuals based on first name, last name, date of birth, and other criteria. I encounter situations involving, for example, newborn babies and twins, where near-identical values across these variables could cause two people to be mistakenly identified as the same individual. I have additional variables (external to the algorithm) that help me identify such cases, allowing me to note in the data (for example, in a new column) that two records, while appearing identical, represent different individuals (e.g., twins). How can I integrate this information so that the final clustering results automatically distinguish between such cases? This would remove the need to adjust the clustering manually, such as by assigning different cluster IDs to twins or newborns.
Let me simplify it with an example: I have Person 1 and Person 2, whose details appear identical, but I've identified them as twins in a new column ('twin_identifier') and assigned different numbers to indicate they are not the same person. How can I adjust the algorithm to ensure that this probabilistic model recognizes them as distinct individuals?