[FEAT] Cluster evaluation - summary statistics #2190

OlivierBinette · 2024-05-20T13:17:02Z

I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding, and getting connected components.

It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:

Homonymy Rate: The proportion of clusters that share a name with another cluster.
Name Variation Rate: The proportion of clusters with name variation within them.

For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
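To make the two definitions above concrete, here is a minimal sketch in plain Python. It follows only the one-line definitions in this issue, so it may differ from the exact formulas in the paper or from what er-evaluation computes. The function name `summary_rates`, the `(cluster_id, name)` input format, and the use of exact string equality to decide whether two names match are all assumptions made for illustration.

```python
from collections import defaultdict


def summary_rates(records):
    """Return (homonymy_rate, name_variation_rate).

    `records` is an iterable of (cluster_id, name) pairs, one per record.
    Hypothetical helper: names are compared by exact equality.
    """
    names_by_cluster = defaultdict(set)
    clusters_by_name = defaultdict(set)
    for cluster_id, name in records:
        names_by_cluster[cluster_id].add(name)
        clusters_by_name[name].add(cluster_id)

    n_clusters = len(names_by_cluster)
    if n_clusters == 0:
        raise ValueError("no clusters to summarise")

    # A cluster is homonymous if at least one of its names
    # also appears in a different cluster.
    homonymous = set()
    for cluster_ids in clusters_by_name.values():
        if len(cluster_ids) > 1:
            homonymous |= cluster_ids

    # A cluster has name variation if it holds more than one distinct name.
    varied = sum(1 for names in names_by_cluster.values() if len(names) > 1)

    return len(homonymous) / n_clusters, varied / n_clusters


# Toy data: clusters 1 and 2 share "ann"; only cluster 1 has two spellings.
toy_records = [(1, "ann"), (1, "anne"), (2, "ann"), (3, "bob")]
homonymy_rate, name_variation_rate = summary_rates(toy_records)
```

On the toy data, two of the three clusters share a name with another cluster and one of the three contains more than one distinct name. The same grouping logic maps naturally onto two `GROUP BY` aggregations, which is likely how a faster, non-Pandas version would be written.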

The er-evaluation package implements both metrics, but it relies on Pandas and is quite slow. The formulas are given in my paper: https://arxiv.org/pdf/2404.05622

OlivierBinette added the enhancement New feature or request label May 20, 2024
