Initialization is using euclidean distance #18

Open
xinyangz opened this issue Jan 20, 2019 · 3 comments

@xinyangz

centers = _init_centroids(
    X, n_clusters, init, random_state=random_state, x_squared_norms=x_squared_norms
)
if verbose:
    print("Initialization complete")

I might be getting this wrong, but the code here seems to be using the initialization function from sklearn. This could cause issues, since the k-means++ initialization in sklearn is based on Euclidean distance. It should be replaced with cosine distance.
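
For reference, a cosine-based variant of the seeding might look like the sketch below. This is only an illustration, not code from this repository or from sklearn: the function name cosine_kmeans_plusplus is made up, and it weights candidates by cosine distance to the nearest chosen center instead of squared Euclidean distance, while omitting the local-trials refinement that sklearn's real _k_init performs.

import numpy as np
from sklearn.metrics.pairwise import cosine_distances
from sklearn.preprocessing import normalize


def cosine_kmeans_plusplus(X, n_clusters, random_state=None):
    """k-means++-style seeding weighted by cosine distance (illustrative sketch)."""
    rng = np.random.RandomState(random_state)
    X = normalize(X)  # spherical k-means assumes unit-norm rows
    n_samples = X.shape[0]
    centers = np.empty((n_clusters, X.shape[1]), dtype=X.dtype)

    # First center: chosen uniformly at random.
    centers[0] = X[rng.randint(n_samples)]
    # Cosine distance from every point to its nearest chosen center so far.
    closest = cosine_distances(X, centers[0:1]).ravel()

    for c in range(1, n_clusters):
        # Sample the next center with probability proportional to the
        # cosine distance to the nearest existing center.
        probs = closest / closest.sum()
        centers[c] = X[rng.choice(n_samples, p=probs)]
        # Update the nearest-center distances with the newly added center.
        closest = np.minimum(closest, cosine_distances(X, centers[c:c + 1]).ravel())

    return centers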

@jasonlaska
Owner

Thank you, I'll look into it.

@rtrad89

rtrad89 commented Jul 16, 2019

May I ask what the status of this issue is? I looked through the commit log, but to no avail.
I am currently using init="random" as a safer option, and would like to switch to k-means++ once this issue is resolved.

Many thanks!
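
For anyone landing here with the same question, the workaround described above might look roughly like the following; the constructor arguments are assumed to mirror sklearn's KMeans and may differ slightly in this library.

import numpy as np
from sklearn.preprocessing import normalize
from spherecluster import SphericalKMeans

# Toy data: random unit-norm rows, since spherical k-means expects L2-normalized input.
X = normalize(np.random.rand(100, 20))

# Random initialization does not depend on a distance metric, so it sidesteps
# the Euclidean-based k-means++ seeding discussed in this issue.
skm = SphericalKMeans(n_clusters=5, init="random", n_init=20, random_state=0)
skm.fit(X)
print(skm.labels_[:10])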

@jasonlaska
Owner

jasonlaska commented Dec 27, 2019

Hi @rtrad89 or @t103z, I'm not sure how much this will actually matter, since this is only used for initialization (and sklearn has chosen not to provide a distance-configurable version of k-means++).

The code for the initialization is here: https://github.com/scikit-learn/scikit-learn/blob/0.22.X/sklearn/cluster/_k_means.py#L41. Do you want to take a stab at a PR that ports it to cosine distance?
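
One observation that may be relevant, assuming the input is already L2-normalized as spherical k-means expects: for unit-norm vectors the squared Euclidean distance is exactly twice the cosine distance, so distance-squared-weighted sampling on normalized data ranks and weights candidates the same way a cosine-weighted version would. A quick numeric check of that identity (illustrative only):

import numpy as np
from sklearn.metrics.pairwise import cosine_distances, euclidean_distances
from sklearn.preprocessing import normalize

# For unit-norm rows x, y:  ||x - y||^2 = 2 - 2*(x . y) = 2 * cosine_distance(x, y)
X = normalize(np.random.rand(50, 8))
sq_euclidean = euclidean_distances(X) ** 2
print(np.allclose(sq_euclidean, 2 * cosine_distances(X), atol=1e-6))  # True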
