Hello, community. We are new to record linkage and have some basic questions. We are working on a Master Patient Index (MPI) implementation. The key function of the index is to match incoming patient records against existing records to prevent duplicates in the system.

In most cases we have a legacy database whose records have already been deduplicated by some legacy algorithm. The size of the database may vary from 10k to 100M records. Some databases preserve variations of fields by aggregating multiple values into arrays of names, telecoms or addresses - but some lose this information and keep just the last values. Sometimes we also have access to the historical incoming data (not deduped and merged).

Question 1: What should we use as the dataset to train the model? My intuition (I have a PhD in chemistry) tells me that we can't use the deduped database for that - is it right?

Question 2: Sometimes the original dataset can be extremely large - for example, for a database with 100M patients, it could be 10B incoming messages (records). Can we use some type of sampling of the original records?

Question 3: What other important requirements for the training dataset have we missed?

With respect, Nikolai
👋

Question 1 - what dataset should be used to train the model

Your intuition is correct: the m probabilities of the model measure characteristics of the data amongst truly matching records. If there are no matches within the training dataset, they cannot be estimated correctly.

You therefore probably want to estimate the m probabilities using a `link_type=link_only` job, matching one or more of your 'incoming record' datasets to the master patient index. For a generalisable model that works against many different 'incoming record' sources, you might want to use a dataset that includes records from a variety of incoming sources, to 'average out' the parameter estimates.

Conversely, you don't need to be so careful about the u probabilities, which measure the characteristics of the data amongst truly non-matching records; these can be estimated without there being any duplicates.

Finally, a saving grace here is that model accuracy is much less sensitive to poor estimates of the m probabilities than the u probabilities.

Question 2 - huge datasets

The estimation methodology for the u probabilities is largely invariant (in terms of performance) to the size of the input dataset. The reason for this is that the first thing the algorithm does is sample the input dataset down to a target number of record comparisons.

For estimating m probabilities, things are a little trickier (unless you have a sample of labelled data, in which case you can simply use those labels to estimate them directly). If we don't have labelled data, we typically need to either:

- estimate the m probabilities with the EM algorithm, using a sample that contains matching pairs, or
- rely on expert judgement to set them directly.
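For concreteness, here is a minimal sketch of the `link_only` setup and u estimation described above. It assumes Splink 4's API; the file paths, column names and choice of comparisons are illustrative rather than anything from Nikolai's setup:

```python
import pandas as pd

import splink.comparison_library as cl
from splink import DuckDBAPI, Linker, SettingsCreator, block_on

# Hypothetical input tables: one batch of (non-deduplicated) incoming
# messages and the existing master patient index.
df_incoming = pd.read_parquet("incoming_records.parquet")    # assumed path
df_mpi = pd.read_parquet("master_patient_index.parquet")     # assumed path

# Illustrative settings for a link-only job: match incoming records against
# the MPI without looking for duplicates within either dataset.
settings = SettingsCreator(
    link_type="link_only",
    comparisons=[
        cl.NameComparison("first_name"),
        cl.NameComparison("surname"),
        cl.DateOfBirthComparison("dob", input_is_string=True),
        cl.PostcodeComparison("postcode"),
    ],
    blocking_rules_to_generate_predictions=[
        block_on("dob"),
        block_on("surname", "postcode"),
    ],
)

linker = Linker([df_incoming, df_mpi], settings, db_api=DuckDBAPI())

# u probabilities: estimated by random sampling, so the cost is governed by
# max_pairs rather than by the size of the input data.
linker.training.estimate_u_using_random_sampling(max_pairs=1e7)
```

For a generalisable model, `df_incoming` could be a union of samples drawn from several different incoming sources.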
If we want to estimate the m probabilities using EM, we need to find a sampling methodology such that the sample contains matching pairs. Not only matching pairs - identifying which pairs match is what the EM algorithm solves for us - but there needs to be a sufficient number of matching pairs in the sample.

The issue is that, for very large datasets, the likelihood that two records sampled at random are a match is small. So if we take a random sample, we're unlikely to get many matching pairs. The next problem is that most strategies that 'select for' matching records also bias the sample (and thus bias the estimates of the m probabilities). Having said all that, if you're in a […]

Another thing you could try is taking a deliberately biased sample on a high cardinality field - for example, people born on a specific day, or on a set of days. Then you can train the model, and your m values should be valid for every column except date of birth. It might even be possible to do this automatically by giving […]

The topic of training m values on very large datasets definitely gets quite complex, and there are various different strategies depending on your data. I wouldn't discount using expert judgement either, since the accuracy of record linkage models often isn't too sensitive to getting the m values a bit wrong.
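To make the biased-sample idea concrete, here is a rough sketch that continues from the illustrative code above; the specific dates and the pandas filtering are assumptions about how such a sample might be built, not a recipe from the thread:

```python
# Deliberately biased sample on a high-cardinality field (date of birth):
# keep only people born on a handful of dates, so matching pairs are far
# more common than they would be in a random sample of the same size.
sample_dobs = ["1970-01-01", "1985-06-15", "1990-12-31"]  # assumed dob format
df_incoming_sample = df_incoming[df_incoming["dob"].isin(sample_dobs)]
df_mpi_sample = df_mpi[df_mpi["dob"].isin(sample_dobs)]

sample_linker = Linker(
    [df_incoming_sample, df_mpi_sample], settings, db_api=DuckDBAPI()
)

# EM pass blocked on dob: every pair compared agrees on dob, so this pass
# cannot estimate m for dob itself, but it can for the other columns -
# i.e. the trained m values are valid for every column except date of birth.
# (u probabilities would still come from random sampling on the full data,
# as in the earlier sketch.)
sample_linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob")
)
```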
The main part of the docs that explains the approach to EM used by Splink is here. Specifically note the following parts:

> The larger the input dataset, the tighter these blocking passes need to be.

Finally, whatever you're doing, there's a not-well-documented option like: […]
which makes EM faster at the expense of accuracy. See also here.
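Purely as an illustration (this is an assumption, not necessarily the exact option referred to): recent Splink versions accept an `estimate_without_term_frequencies` flag on the EM training call, which fits that description of trading a little accuracy for speed:

```python
# Assumed flag: runs the EM iterations without term frequency adjustments,
# which is faster but can make the parameter estimates slightly less accurate.
linker.training.estimate_parameters_using_expectation_maximisation(
    block_on("dob"),
    estimate_without_term_frequencies=True,
)
```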