
Proposal: Indices for Clustering. #11369

Open
ntnshrav opened this issue Jun 27, 2018 · 5 comments

Comments

@ntnshrav

I have implemented cluster validation indices, both internal and external, as part of my package.
I have 40 such indices, tested and packaged into a package called CRAVED. I would like to contribute these indices to scikit-learn's clustering metrics. Please let me know how I can get started with this.

@jnothman
Member

jnothman commented Jun 27, 2018

I don't think it is in our interest to implement and maintain numerous, rarely used metrics when we're not able to also advise on their benefits or uses. See the related comments on inclusion criteria in our FAQ. I can also immediately see that you are missing several clustering metrics (many of which I know from the related coreference resolution evaluation literature).

I'd be interested in potentially supporting:

@jnothman
Member

Every feature we include has a maintenance cost. Our maintainers are mostly
volunteers. For a new feature to be included, we need evidence that it is
often useful and, ideally, well-established in the literature or in
practice. That doesn't stop you implementing it for yourself and publishing
it in a separate repository.

@ntnshrav
Author

Score Function: works well for hyper-spheroidal data. It has been shown to perform well on multidimensional data sets and can accommodate single-cluster and sub-cluster cases.

Davies–Bouldin index: validates how well the clustering has been done using quantities and features inherent to the dataset.
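Incidentally, scikit-learn has since gained a built-in Davies–Bouldin implementation; a minimal usage sketch (assuming sklearn >= 0.20, where `davies_bouldin_score` was added):

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

# Two tight, well-separated clusters of two points each.
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])

# Lower is better; well-separated tight clusters give a small value.
print(davies_bouldin_score(X, labels))
```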

Dunn index: ratio of the minimal inter-cluster distance to the maximal intra-cluster distance.
Drawbacks: computationally expensive and sensitive to noisy data.
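A minimal NumPy/SciPy sketch of the Dunn index as just described; the helper name `dunn_index` is mine, not part of any package:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist

def dunn_index(X, labels):
    """Minimal inter-cluster distance / maximal intra-cluster diameter.

    Higher is better: compact, well-separated clusters score high.
    """
    clusters = [X[labels == k] for k in np.unique(labels)]
    # Maximal intra-cluster diameter (largest pairwise distance within a cluster).
    max_diam = max(pdist(c).max() for c in clusters if len(c) > 1)
    # Minimal distance between points belonging to different clusters.
    min_sep = min(cdist(a, b).min()
                  for i, a in enumerate(clusters)
                  for b in clusters[i + 1:])
    return min_sep / max_diam

X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])
labels = np.array([0, 0, 1, 1])
print(dunn_index(X, labels))  # → 10.0 (separation 10, diameter 1)
```

The all-pairs distances make the cost quadratic in the number of samples, which illustrates the "computationally expensive" drawback noted above.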

Hartigan index: generally used to find the number of clusters in a dataset (applies only to the k-means algorithm).
Advantage: can be used in combination with the silhouette or gap index to find the number of classes in a dataset more accurately than either alone. REF: [Hartigan Index]

As for the external indices:

Entropy: the degree to which each cluster contains objects of a single class.
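A minimal sketch of this entropy measure; the helper name `cluster_entropy`, the cluster-size weighting, and the base-2 logarithm are my assumptions:

```python
import numpy as np

def cluster_entropy(labels_true, labels_pred):
    """Weighted average, over clusters, of the entropy of the true-class
    distribution inside each cluster. 0 means every cluster is pure."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = len(labels_true)
    total = 0.0
    for k in np.unique(labels_pred):
        members = labels_true[labels_pred == k]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()          # class proportions in this cluster
        total += (len(members) / n) * -(p * np.log2(p)).sum()
    return total

print(cluster_entropy([0, 0, 1, 1], [0, 0, 1, 1]))  # 0.0 — pure clusters
print(cluster_entropy([0, 1, 0, 1], [0, 0, 1, 1]))  # 1.0 — maximally mixed
```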

Kulczynski index: the arithmetic mean of the precision and recall coefficients.

These indices each reveal different aspects of the clustering. An advantage of the external indices is that they can also be used as classification metrics.
I have already implemented these indices and would like to know what I should do next.

@BradKML

BradKML commented Nov 18, 2021

@BradKML

BradKML commented Jan 19, 2022

@cmarmo Thanks for the RFC. The best internal-indices reference implementation is in https://github.com/Simon-Bertrand/Clusters-Features/blob/main/ClustersFeatures/src/_score_index.py

Also, for external indices, I have noted another table for external evaluation (suitable for both clustering and community detection): GiulioRossetti/cdlib#147 (comment), calling back to #1362.
