Hi all, just to add another observation. Even though the 2nd scenario has an additional row: Now, since the order is reversed relative to the first set of pairs, these combinations cannot be marked as duplicates with the same "duplicate_id" (generated through custom code). It seems that Splink prefers to place the lowest id in unique_id_l rather than unique_id_r. Could someone kindly share their comments on the above questions and observation? Thanks in advance.
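For anyone generating a "duplicate_id" with custom code, here is a minimal sketch of one way to make the id independent of which side of a pair a record lands on. It treats the pairs as edges of a graph and labels every record with the smallest id in its connected component. The function name `assign_duplicate_ids` and the ids are hypothetical, and this is not Splink's own implementation.

```python
def assign_duplicate_ids(pairs):
    """Label every id with the smallest id in its connected component."""
    parent = {}

    def find(x):
        # Follow parent links up to the root, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for left, right in pairs:
        root_l, root_r = find(left), find(right)
        if root_l != root_r:
            # Keep the smaller id as the root so the label is deterministic.
            low, high = sorted((root_l, root_r))
            parent[high] = low

    return {node: find(node) for node in parent}


# Hypothetical pairs: the same links listed in two different orders.
forward = [(1, 5), (5, 7)]
reversed_order = [(7, 5), (5, 1)]

ids_forward = assign_duplicate_ids(forward)
ids_reversed = assign_duplicate_ids(reversed_order)
```

Both calls give ids 1, 5 and 7 the same label, so the order within a pair no longer matters.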
-
Hi Robin, thanks for the response. I tested slightly different code that includes the cluster_pairwise_predictions_at_threshold() method, and I can confirm that it captures both direct and indirect duplicates. (An example of an indirect duplicate: given the pairs A --> B and B --> C, the cluster id yields the two rows A --> B and A --> C.) I can use this Splink feature and then apply further transformations as required, so thanks for your help here. A follow-up question: since A --> B and B --> C are the valid pairs from the pairwise predictions, while A --> B and A --> C come from clustering those predictions, metrics such as match_weight and match_probability are not available for A --> C. Could you guide me on how to get these metrics in such cases? I think the trained model could help here, but a few details on the code or logic would be useful. Thanks.
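To make the gap concrete, here is a small sketch that lists the within-cluster pairs that have no pairwise score, such as A --> C in the example above. The data, the probabilities, and the name `unscored_pairs` are all hypothetical; in practice the two inputs would come from the pairwise predictions and the clustering output.

```python
from itertools import combinations

# Hypothetical scored pairs: (id_l, id_r) -> match_probability.
scored = {("A", "B"): 0.97, ("B", "C"): 0.93}

# Hypothetical clustering result: record id -> cluster id.
clusters = {"A": "A", "B": "A", "C": "A"}


def unscored_pairs(clusters, scored):
    """Return the within-cluster pairs that are absent from the scored pairs."""
    members = {}
    for record_id, cluster_id in clusters.items():
        members.setdefault(cluster_id, []).append(record_id)

    # Compare pairs as unordered sets so (A, B) and (B, A) count as the same pair.
    scored_keys = {frozenset(pair) for pair in scored}

    missing = []
    for ids in members.values():
        for a, b in combinations(sorted(ids), 2):
            if frozenset((a, b)) not in scored_keys:
                missing.append((a, b))
    return missing


missing = unscored_pairs(clusters, scored)
```

The pairs returned here are the ones that would need to be scored separately with the trained model. Splink's documentation describes a `compare_two_records` method on the linker for scoring a single pair, though the exact call depends on the installed version, so please check it against your setup.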
-
Hi all,
I have 2 source datasets on which I would like to run de-duplication (on each dataset individually) followed by a linkage task (between the two datasets) using Splink. My hypothesis is that de-duplicating first can trim the candidates in both datasets that are sent to the final linkage step.
While performing de-dup on a single data source I get around 60 pairs of unique_ids. I want to identify which unique_ids from 'left' or 'right' need to be dropped from the linkage task. Most of them show a 1:1 mapping, while a few are 1:many. I am highlighting 2 scenarios below, for the 1:many mappings only:
Scenario 1 output: Group on 'unique_id_l' and capture all 'unique_id_r' with their counts:
Scenario 2 output: Group on 'unique_id_r' and capture all 'unique_id_l' with their counts:
As one can notice, in the 2nd scenario I get an additional pair (in bold) that is not captured in the 1st scenario. I would like to understand whether this is the expected behavior and, if so, whether one needs to check both scenarios to identify which one has the higher row count for a 1:many mapping (e.g. scenario 2 has 3 rows vs. scenario 1, which has 2).
I would like to add that although link and dedup is available in Splink, there is a requirement to do this in sequence rather than in one go.
Thanks in advance for your inputs/suggestions.
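To illustrate the two groupings, here is a minimal sketch on made-up pairs (not the actual pairs from my output). It assumes each pair is listed once, with the smaller id in the unique_id_l position, and the helper name `group_pairs` is hypothetical.

```python
from collections import defaultdict

# Hypothetical pairwise output: each tuple is (unique_id_l, unique_id_r).
pairs = [(1, 5), (1, 7), (3, 7), (5, 7)]


def group_pairs(pairs, key_index):
    """Group pairs on one side and collect the ids seen on the other side."""
    grouped = defaultdict(list)
    for pair in pairs:
        grouped[pair[key_index]].append(pair[1 - key_index])
    return dict(grouped)


by_left = group_pairs(pairs, 0)   # scenario 1: group on unique_id_l
by_right = group_pairs(pairs, 1)  # scenario 2: group on unique_id_r

# Keep only the 1:many groups in each view.
many_left = {k: v for k, v in by_left.items() if len(v) > 1}
many_right = {k: v for k, v in by_right.items() if len(v) > 1}
```

On this toy data the left view shows id 1 with 2 rows while the right view shows id 7 with 3 rows, even though both views are built from the same four pairs. Neither view alone lists every id linked to a given record, because an id can appear on the left in some pairs and on the right in others.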
Regards.