Hello, community. We are new to record linkage and have some basic questions. We are working on a Master Patient Index (MPI) implementation. The key function of the index is to match incoming patient records against existing records to prevent duplicates in the system.

In most cases we have a legacy database whose records have already been deduplicated by some legacy algorithm. The size of the database may vary from 10k to 100M records. Some databases preserve variations of fields by aggregating multiple values into arrays of names, telecoms or addresses - but some lose this information and keep just the last values. Sometimes we also have access to the historical incoming data (not deduped and merged).

Question 1: What should we use as the dataset to train the model? My intuition (I have a PhD in chemistry) tells me that we can't use the deduped database for that - is it right?

Question 2: Sometimes the original dataset can be extremely large - for example, for a database with 100M patients, it could be 10B incoming messages (records). Can we use some type of sampling of the original records?

Question 3: What other important requirements for the training dataset have we missed?

With respect, Nikolai
👋

Question 1 - what dataset should be used to train the model

Your intuition is correct: the m probabilities of the model measure characteristics of the data amongst truly matching records. If there are no matches within the training dataset, they cannot be estimated correctly.

You therefore probably want to estimate the m probabilities using a `link_type=link_only` job, matching one or more of your 'incoming record' datasets to the master patient index. For a generalisable model that works against many different 'incoming record' sources, you might want to use a dataset that includes records from a variety of incoming sources, to 'average out' the parameter estimates.

Conversely, you don't need to be so careful about the u probabilities, which measure the characteristics of the data amongst truly non-matching records; these can be estimated without there being any duplicates.

Finally, a saving grace here is that model accuracy is much less sensitive to poor estimates of the m probabilities than the u probabilities.

Question 2 - huge datasets

The estimation methodology for the u probabilities is largely invariant (in terms of performance) to the size of the input dataset. The reason for this is that the first thing the algorithm does is sample the input dataset down to a target number of record comparisons.

For estimating m probabilities, things are a little trickier (unless you have a sample of labelled data, in which case you can simply use those labels to estimate them directly). If we don't have labelled data, we typically need to either:

- estimate the m probabilities with the EM algorithm, using a sample that contains matching pairs, or
- rely on expert judgement to set them directly.
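For concreteness, here is a minimal sketch of the `link_only` setup and u estimation described above. It assumes Splink 4's API; the file paths, column names and choice of comparisons are illustrative rather than anything from Nikolai's setup:

```python
import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Hypothetical input tables: one batch of (non-deduplicated) incoming
# messages and the existing master patient index.
df_incoming = pd.read_parquet("incoming_records.parquet")    # assumed path
df_mpi = pd.read_parquet("master_patient_index.parquet")     # assumed path

# Illustrative settings for a link-only job: match incoming records against
# the MPI without looking for duplicates within either dataset.
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.PostcodeComparison("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("surname", "postcode"),
    ],
)

linker = Linker([df_incoming, df_mpi], settings, db_api=DuckDBAPI())

# u probabilities: estimated by random sampling, so the cost is governed by
# max_pairs rather than by the size of the input data.
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
```

For a generalisable model, `df_incoming` could be a union of samples drawn from several different incoming sources.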
If we want to estimate the m probabilities using EM, we need to find a sampling methodology such that the sample contains matching pairs. Not only matching pairs - identifying which pairs match is what the EM algorithm solves for us - but there needs to be a sufficient number of matching pairs in the sample.

The issue is that, for very large datasets, the likelihood that two records sampled at random are a match is small. So if we take a random sample, we're unlikely to get many matching pairs. The next problem is that most strategies that 'select for' matching records also bias the sample (and thus bias the estimates of the m probabilities). Having said all that, if you're in a […]

Another thing you could try is taking a deliberately biased sample on a high cardinality field - for example, people born on a specific day, or on a set of days. Then you can train the model, and your m values should be valid for every column except date of birth. It might even be possible to do this automatically by giving […]

The topic of training m values on very large datasets definitely gets quite complex, and there are various different strategies depending on your data. I wouldn't discount using expert judgement either, since the accuracy of record linkage models often isn't too sensitive to getting the m values a bit wrong.
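To make the biased-sample idea concrete, here is a rough sketch that continues from the illustrative code above; the specific dates and the pandas filtering are assumptions about how such a sample might be built, not a recipe from the thread:

```python
# Deliberately biased sample on a high-cardinality field (date of birth):
# keep only people born on a handful of dates, so matching pairs are far
# more common than they would be in a random sample of the same size.
sample_dobs = ["1970-01-01", "1985-06-15", "1990-12-31"]  # assumed dob format
df_incoming_sample = df_incoming[df_incoming["dob"].isin(sample_dobs)]
df_mpi_sample = df_mpi[df_mpi["dob"].isin(sample_dobs)]

sample_linker = Linker(
    [df_incoming_sample, df_mpi_sample], settings, db_api=DuckDBAPI()
)

# EM pass blocked on dob: every pair compared agrees on dob, so this pass
# cannot estimate m for dob itself, but it can for the other columns -
# i.e. the trained m values are valid for every column except date of birth.
# (u probabilities would still come from random sampling on the full data,
# as in the earlier sketch.)
sample_linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob")
)
```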
The main part of the docs that explains the approach to EM used by Splink is here. Specifically note the following parts:

> The larger the input dataset, the tighter these blocking passes need to be.

Finally, whatever you're doing, there's a not-well-documented option like: […]
which makes EM faster at the expense of accuracy. See also here.
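Purely as an illustration (this is an assumption, not necessarily the exact option referred to): recent Splink versions accept an `estimate_without_term_frequencies` flag on the EM training call, which fits that description of trading a little accuracy for speed:

```python
# Assumed flag: runs the EM iterations without term frequency adjustments,
# which is faster but can make the parameter estimates slightly less accurate.
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"),
    estimate_without_term_frequencies=True,
)
```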