diff --git a/docs/README.md b/docs/README.md index fc6fa8a..77df777 100644 --- a/docs/README.md +++ b/docs/README.md @@ -1,30 +1,12 @@ -# Generalized K-Means Clustering +# Introduction -This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer. +The goal of K-Means clustering is to produce a set of clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** \[`trait KMeansModel`]. It is fundamentally a set of points and a function that defines the distance from an arbitrary point to a cluster center. -Most practical variants of K-means clustering are implemented or can be implemented with this package. +The K-Means algorithm computes a K-Means model using an iterative algorithm known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's\_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to acknowledge the assignment of the points to the cluster. -If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant! +The update of clusters is a form of averaging. Newly added points are averaged into the cluster while (optionally) reassigned points are removed from their prior clusters. -This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks! +A K-Means Model can be constructed from any set of cluster centers and distance function. However, the more interesting models satisfy an optimality constraint. If we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost". 
+A K-Means Model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points. - - - - - - - -#### - - - - - - - -#### - -```scala -``` diff --git a/docs/SUMMARY.md b/docs/SUMMARY.md index a3e9a82..f2cc2e3 100644 --- a/docs/SUMMARY.md +++ b/docs/SUMMARY.md @@ -1,35 +1,24 @@ # Table of contents -* [Generalized K-Means Clustering](README.md) -* [Introduction](introduction.md) +* [Introduction](README.md) + * [Relation to Spark K-Means Clusterer](introduction/relation-to-spark-k-means-clusterer.md) + * [Algorithms Implemented](introduction/algorithms-implemented.md) * [Requirements](requirements.md) * [Quick Start](quick-start.md) * [Concepts](concepts/README.md) * [Bregman Divergence](concepts/bregman-divergence.md) + * [WeightedVector](concepts/weightedvector.md) * [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md) * [KMeansModel](concepts/kmeansmodel.md) * [MultiKMeansClusterer](concepts/multikmeansclusterer.md) - * [WeightedVector](concepts/weightedvector.md) * [KMeansSelector](concepts/kmeansselector.md) * [Usage](usage/README.md) - * [Distance Functions](usage/distance-functions.md) + * [Selecting a Distance Function](usage/selecting-a-distance-function.md) * [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md) - * [Using an Embedding](usage/using-an-embedding.md) + * [Embedding Data](usage/embedding-data.md) * [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md) * [Iterative Clustering](usage/iterative-clustering.md) * [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md) * [Customizing](usage/customizing/README.md) * [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md) * [Creating a 
Custom Embedding](usage/customizing/creating-a-custom-embedding.md) - -## Algorithms - -* [Algorithms Implemented](algorithms/algorithms-implemented/README.md) - * [Clustering using general distance functions (Bregman divergences)](algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md) - * [Clustering large numbers of points using mini-batches](algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md) - * [Clustering high dimensional Euclidean data](algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md) - * [Clustering high dimensional time series data](algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md) - * [Clustering using symmetrized Bregman divergences](algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md) - * [Clustering via bisection](algorithms/algorithms-implemented/clustering-via-bisection.md) - * [Clustering with near-optimality](algorithms/algorithms-implemented/clustering-with-near-optimality.md) - * [Clustering streaming data](algorithms/algorithms-implemented/clustering-streaming-data.md) diff --git a/docs/algorithms/algorithms-implemented/README.md b/docs/algorithms/algorithms-implemented/README.md deleted file mode 100644 index cc0a01c..0000000 --- a/docs/algorithms/algorithms-implemented/README.md +++ /dev/null @@ -1,5 +0,0 @@ -# Algorithms Implemented - -Most practical variants of K-means clustering are implemented or can be implemented with this package. - -If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant! 
diff --git a/docs/algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md b/docs/algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md deleted file mode 100644 index 82d48e2..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering high dimensional Euclidean data - -* [clustering high dimensional Euclidean data](http://www.ida.liu.se/\~arnjo/papers/pakdd-ws-11.pdf) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md b/docs/algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md deleted file mode 100644 index c0483f4..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering high dimensional time series data - -* [clustering high dimensional time series data](http://www.cs.gmu.edu/\~jessica/publications/ikmeans\_sdm\_workshop03.pdf) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md b/docs/algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md deleted file mode 100644 index 32ed67f..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering large numbers of points using mini-batches - -* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-streaming-data.md b/docs/algorithms/algorithms-implemented/clustering-streaming-data.md deleted file mode 100644 index 856a5e5..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-streaming-data.md +++ /dev/null @@ -1,3 +0,0 @@ -# Clustering streaming data - -[clustering streaming 
data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf) diff --git a/docs/algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md b/docs/algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md deleted file mode 100644 index 6df284c..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering using general distance functions (Bregman divergences) - -* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md b/docs/algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md deleted file mode 100644 index cb3438f..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering using symmetrized Bregman divergences - -* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-via-bisection.md b/docs/algorithms/algorithms-implemented/clustering-via-bisection.md deleted file mode 100644 index 9da4b88..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-via-bisection.md +++ /dev/null @@ -1,4 +0,0 @@ -# Clustering via bisection - -* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01\_05.pdf) -* diff --git a/docs/algorithms/algorithms-implemented/clustering-with-near-optimality.md b/docs/algorithms/algorithms-implemented/clustering-with-near-optimality.md deleted file mode 100644 index 7ca4841..0000000 --- a/docs/algorithms/algorithms-implemented/clustering-with-near-optimality.md +++ /dev/null @@ -1,4 +0,0 @@ 
-# Clustering with near-optimality - -* [clustering with near-optimality](http://theory.stanford.edu/\~sergei/papers/vldb12-kmpar.pdf) -* diff --git a/docs/concepts/kmeansmodel.md b/docs/concepts/kmeansmodel.md index 9086966..2852f91 100644 --- a/docs/concepts/kmeansmodel.md +++ b/docs/concepts/kmeansmodel.md @@ -1,6 +1,6 @@ # KMeansModel -We define our realization of a k-means model, `KMeansModel`, which we enrich with operations to find closest clusters to a point and to compute distances: +A K-means model is a set of cluster centers. We abstract the K-means model with the `KMeansModel` trait with methods to map an arbitrary point (viz. `Vector`, `WeightedVector`, or `BregmanPoint`) to the nearest cluster center and to compute the cost/distance to that center. ```scala package com.massivedatascience.clusterer diff --git a/docs/concepts/kmeansselector.md b/docs/concepts/kmeansselector.md index f82b0d1..4e9ad2e 100644 --- a/docs/concepts/kmeansselector.md +++ b/docs/concepts/kmeansselector.md @@ -1,10 +1,6 @@ # KMeansSelector -Any K-Means model may be used as seed value to Lloyd's algorithm. In fact, our clusterers accept multiple seed sets. The `K-Means.train` helper methods allows one to name an initialization method. - -Two algorithms are implemented that produce viable seed sets. They may be constructed by using the `apply` method of the companion object`KMeansSelector`. - -Initializers are implemented with the `KMeansSelector` trait. +The initial selection of cluster centers is called the initialization step. We abstract implementations of the initialization step with the `KMeansSelector` trait. 
```scala package com.massivedatascience.clusterer diff --git a/docs/concepts/multikmeansclusterer.md b/docs/concepts/multikmeansclusterer.md index d9ba94b..144ba28 100644 --- a/docs/concepts/multikmeansclusterer.md +++ b/docs/concepts/multikmeansclusterer.md @@ -1,6 +1,6 @@ # MultiKMeansClusterer -One may construct K-Means models using one of the provided clusterers that implement Lloyd's algorithm. +Lloyd's algorithm is simple to describe, but in practice different implementations are possible that can yield dramatically different running times depending on the data being clustered. We abstract the clusterer using the `MultiKMeansClusterer` trait. ```scala trait MultiKMeansClusterer extends Serializable with Logging { diff --git a/docs/introduction.md b/docs/introduction.md deleted file mode 100644 index 77df777..0000000 --- a/docs/introduction.md +++ /dev/null @@ -1,12 +0,0 @@ -# Introduction - -The goal of K-Means clustering is to produce a set of clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** \[`trait KMeansModel]`. It is fundamentally a set of points and a function that defines the distance from an arbitrary point to a cluster center. - -The K-Means algorithm computes a K-Means model using an iterative algorithm known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's\_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to acknowledge the assignment of the points to the cluster. - -The update of clusters is a form of averaging. Newly added points are averaged into the cluster while (optionally) reassigned points are removed from their prior clusters. - -A K-Means Model can be constructed from any set of cluster centers and distance function. However, the more interesting models satisfy an optimality constraint. 
If we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost". - -A K-Means Model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points. - diff --git a/docs/introduction/algorithms-implemented.md b/docs/introduction/algorithms-implemented.md new file mode 100644 index 0000000..4ecf685 --- /dev/null +++ b/docs/introduction/algorithms-implemented.md @@ -0,0 +1,14 @@ +# Algorithms Implemented + +Most practical variants of K-means clustering are implemented or can be implemented with this package. + +* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public\_papers/bregmanclustering\_jmlr.pdf) +* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351) +* [clustering high dimensional Euclidean data](http://www.ida.liu.se/\~arnjo/papers/pakdd-ws-11.pdf) +* [clustering high dimensional time series data](http://www.cs.gmu.edu/\~jessica/publications/ikmeans\_sdm\_workshop03.pdf) +* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf) +* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01\_05.pdf) +* [clustering with near-optimality](http://theory.stanford.edu/\~sergei/papers/vldb12-kmpar.pdf) +* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf) + +If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!\ diff --git a/docs/introduction/relation-to-spark-k-means-clusterer.md b/docs/introduction/relation-to-spark-k-means-clusterer.md new file mode 100644 index 0000000..3cf8e9f 
--- /dev/null +++ b/docs/introduction/relation-to-spark-k-means-clusterer.md @@ -0,0 +1,26 @@ +# Relation to Spark K-Means Clusterer + +This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer. + +This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks! + + + + + + + + + +#### + + + + + + + +#### + +```scala +``` diff --git a/docs/usage/alternative-kmeansmodel-construction.md b/docs/usage/alternative-kmeansmodel-construction.md index af5ec68..23c8115 100644 --- a/docs/usage/alternative-kmeansmodel-construction.md +++ b/docs/usage/alternative-kmeansmodel-construction.md @@ -1,5 +1,5 @@ --- -description: How to creaate K-Means Models using the KMeansModel Helper Object +description: How to create K-Means Models using the KMeansModel companion Object --- # Alternative KMeansModel Construction diff --git a/docs/usage/using-an-embedding.md b/docs/usage/embedding-data.md similarity index 100% rename from docs/usage/using-an-embedding.md rename to docs/usage/embedding-data.md diff --git a/docs/usage/distance-functions.md b/docs/usage/selecting-a-distance-function.md similarity index 100% rename from docs/usage/distance-functions.md rename to docs/usage/selecting-a-distance-function.md