1 parent
9b7de8e
commit 01a7d60
Showing 20 changed files with 56 additions and 97 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,30 +1,12 @@ | ||
# Generalized K-Means Clustering | ||
# Introduction | ||
|
||
This project generalizes the Spark MLlib Batch K-Means (v1.1.0) clusterer and the Spark MLlib Streaming K-Means (v1.2.0) clusterer. | ||
The goal of K-Means clustering is to partition a set of points into clusters in a way that satisfies certain optimality constraints. The result is called a **K-Means model** \[`trait KMeansModel`\]. It is fundamentally a set of points (the cluster centers) and a function that defines the distance from an arbitrary point to a cluster center. | ||
|
||
Most practical variants of K-means clustering are implemented or can be implemented with this package. | ||
The K-Means algorithm computes a K-Means model using an iterative algorithm known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's\_algorithm). Each iteration of Lloyd's algorithm assigns each point to a cluster, then updates the cluster centers to reflect the new assignments. | ||
|
||
If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant! | ||
The update of clusters is a form of averaging. Newly added points are averaged into the cluster while (optionally) reassigned points are removed from their prior clusters. | ||
|
||
This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks! | ||
A K-Means Model can be constructed from any set of cluster centers and a distance function. However, the more interesting models satisfy an optimality constraint. If we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost". | ||
|
||
A K-Means Model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points. | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
#### | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
#### | ||
|
||
```scala | ||
``` |
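As a rough illustration of the Lloyd iteration and the "distortion" described in the introduction text above, the following Spark-free sketch runs one assignment-and-update step on in-memory points. It is not the package's implementation: the object `LloydSketch`, the alias `Point`, and the functions `sqDist`, `closest`, `mean`, `lloydStep`, and `distortion` are hypothetical names, and squared Euclidean distance is assumed in place of a general distance function.

```scala
// Hypothetical, Spark-free sketch; none of these names come from the package.
object LloydSketch {
  type Point = Vector[Double]

  // Assumed distance function: squared Euclidean distance.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to p.
  def closest(centers: IndexedSeq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Component-wise mean of a non-empty collection of points.
  def mean(points: Seq[Point]): Point =
    points.transpose.map(col => col.sum / points.size).toVector

  // One Lloyd iteration: assign every point to its closest center,
  // then move each center to the mean of the points assigned to it.
  // A center that receives no points is left unchanged.
  def lloydStep(centers: IndexedSeq[Point], points: Seq[Point]): IndexedSeq[Point] = {
    val assigned: Map[Int, Seq[Point]] = points.groupBy(p => closest(centers, p))
    centers.indices.map(i => assigned.get(i).map(mean).getOrElse(centers(i)))
  }

  // Distortion (cost): sum over all points of the distance to the closest center.
  def distortion(centers: IndexedSeq[Point], points: Seq[Point]): Double =
    points.map(p => centers.map(c => sqDist(c, p)).min).sum
}
```

Calling `lloydStep` repeatedly until the centers stop changing yields a model that is locally optimal in the sense used above: every center is the mean of the points assigned to it. With this squared Euclidean distance, `distortion` does not increase from one step to the next.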
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,35 +1,24 @@ | ||
# Table of contents | ||
|
||
* [Generalized K-Means Clustering](README.md) | ||
* [Introduction](introduction.md) | ||
* [Introduction](README.md) | ||
* [Relation to Spark K-Means Clusterer](introduction/relation-to-spark-k-means-clusterer.md) | ||
* [Algorithms Implemented](introduction/algorithms-implemented.md) | ||
* [Requirements](requirements.md) | ||
* [Quick Start](quick-start.md) | ||
* [Concepts](concepts/README.md) | ||
* [Bregman Divergence](concepts/bregman-divergence.md) | ||
* [WeightedVector](concepts/weightedvector.md) | ||
* [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md) | ||
* [KMeansModel](concepts/kmeansmodel.md) | ||
* [MultiKMeansClusterer](concepts/multikmeansclusterer.md) | ||
* [WeightedVector](concepts/weightedvector.md) | ||
* [KMeansSelector](concepts/kmeansselector.md) | ||
* [Usage](usage/README.md) | ||
* [Distance Functions](usage/distance-functions.md) | ||
* [Selecting a Distance Function](usage/selecting-a-distance-function.md) | ||
* [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md) | ||
* [Using an Embedding](usage/using-an-embedding.md) | ||
* [Embedding Data](usage/embedding-data.md) | ||
* [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md) | ||
* [Iterative Clustering](usage/iterative-clustering.md) | ||
* [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md) | ||
* [Customizing](usage/customizing/README.md) | ||
* [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md) | ||
* [Creating a Custom Embedding](usage/customizing/creating-a-custom-embedding.md) | ||
|
||
## Algorithms | ||
|
||
* [Algorithms Implemented](algorithms/algorithms-implemented/README.md) | ||
* [Clustering using general distance functions (Bregman divergences)](algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md) | ||
* [Clustering large numbers of points using mini-batches](algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md) | ||
* [Clustering high dimensional Euclidean data](algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md) | ||
* [Clustering high dimensional time series data](algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md) | ||
* [Clustering using symmetrized Bregman divergences](algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md) | ||
* [Clustering via bisection](algorithms/algorithms-implemented/clustering-via-bisection.md) | ||
* [Clustering with near-optimality](algorithms/algorithms-implemented/clustering-with-near-optimality.md) | ||
* [Clustering streaming data](algorithms/algorithms-implemented/clustering-streaming-data.md) |
This file was deleted.
4 changes: 0 additions & 4 deletions
...algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md
This file was deleted.
4 changes: 0 additions & 4 deletions
...gorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md
This file was deleted.
4 changes: 0 additions & 4 deletions
...algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md
This file was deleted.
3 changes: 0 additions & 3 deletions
docs/algorithms/algorithms-implemented/clustering-streaming-data.md
This file was deleted.
4 changes: 0 additions & 4 deletions
...-implemented/clustering-using-general-distance-functions-bregman-divergences.md
This file was deleted.
4 changes: 0 additions & 4 deletions
...thms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md
This file was deleted.
4 changes: 0 additions & 4 deletions
docs/algorithms/algorithms-implemented/clustering-via-bisection.md
This file was deleted.
4 changes: 0 additions & 4 deletions
docs/algorithms/algorithms-implemented/clustering-with-near-optimality.md
This file was deleted.
This file was deleted.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Algorithms Implemented | ||
|
||
Most practical variants of K-means clustering are implemented or can be implemented with this package. | ||
|
||
* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf) | ||
* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351) | ||
* [clustering high dimensional Euclidean data](http://www.ida.liu.se/\~arnjo/papers/pakdd-ws-11.pdf) | ||
* [clustering high dimensional time series data](http://www.cs.gmu.edu/\~jessica/publications/ikmeans\_sdm\_workshop03.pdf) | ||
* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf) | ||
* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01\_05.pdf) | ||
* [clustering with near-optimality](http://theory.stanford.edu/\~sergei/papers/vldb12-kmpar.pdf) | ||
* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf) | ||
|
||
If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Relation to Spark K-Means Clusterer | ||
|
||
This project generalizes the Spark MLlib Batch K-Means (v1.1.0) clusterer and the Spark MLlib Streaming K-Means (v1.2.0) clusterer. | ||
|
||
This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks! | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
#### | ||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
#### | ||
|
||
```scala | ||
``` |
File renamed without changes.
File renamed without changes.