[FEAT] Cluster evaluation - summary statistics #2190

Open
OlivierBinette opened this issue May 20, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

@OlivierBinette
Contributor

Is your proposal related to a problem?

I want to better understand the clustering I get after estimating pairwise match probabilities, thresholding, and taking connected components.

Describe the solution you'd like

It's useful to consider a quasi-identifier such as a name, and to compute the following two metrics:

  • Homonymy Rate: The proportion of clusters that share a name with another cluster.
  • Name Variation Rate: The proportion of clusters with name variation within them.

For instance, if I know that names are quite clean in my data, then I want the name variation rate to be very low.
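
To make the two definitions concrete, here is a minimal sketch in plain Python that follows the plain-language definitions above. It assumes the clustering is available as `(cluster_id, name)` pairs, one per record. The function name `cluster_name_summary` is hypothetical and is not part of er-evaluation or of this project, and the exact formulas in the paper may differ in details such as normalization or missing-value handling. Skipping missing names is an assumption made here.

```python
from collections import defaultdict


def cluster_name_summary(records):
    """Compute homonymy and name variation rates (illustrative sketch).

    `records` is an iterable of (cluster_id, name) pairs, one per record.
    Records with a missing name (None) are skipped; this is an assumption.
    """
    names_by_cluster = defaultdict(set)
    clusters_by_name = defaultdict(set)
    for cluster_id, name in records:
        if name is None:
            continue
        names_by_cluster[cluster_id].add(name)
        clusters_by_name[name].add(cluster_id)

    n_clusters = len(names_by_cluster)
    if n_clusters == 0:
        raise ValueError("no clusters with a non-missing name")

    # A cluster is homonymous if at least one of its names also
    # appears in a different cluster.
    homonymous = set()
    for cluster_ids in clusters_by_name.values():
        if len(cluster_ids) > 1:
            homonymous |= cluster_ids

    # A cluster has name variation if it contains more than one distinct name.
    varied = sum(1 for names in names_by_cluster.values() if len(names) > 1)

    return {
        "homonymy_rate": len(homonymous) / n_clusters,
        "name_variation_rate": varied / n_clusters,
    }


example = [
    (1, "ann"), (1, "anne"),  # cluster 1: two distinct names
    (2, "ann"),               # cluster 2: shares "ann" with cluster 1
    (3, "bob"),
    (4, "carl"), (4, "carl"),
]
summary = cluster_name_summary(example)
```

In this toy example, clusters 1 and 2 share the name "ann", so two of the four clusters are homonymous, and only cluster 1 contains more than one distinct name. The same grouping logic could in principle be expressed as two SQL aggregations, which is presumably where the speedup over a Pandas implementation would come from.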

Describe alternatives you've considered

The er-evaluation package implements both metrics, but it relies on Pandas and is quite slow. The formulas are given in my paper: https://arxiv.org/pdf/2404.05622

@OlivierBinette OlivierBinette added the enhancement New feature or request label May 20, 2024