GITBOOK-6: No subject
derrickburns authored and gitbook-bot committed Jan 18, 2024
1 parent 9b7de8e commit 01a7d60
Showing 20 changed files with 56 additions and 97 deletions.
30 changes: 6 additions & 24 deletions docs/README.md
@@ -1,30 +1,12 @@
-# Generalized K-Means Clustering
+# Introduction

-This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer.
-
-Most practical variants of K-means clustering are implemented or can be implemented with this package.
-
-If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!
-
-This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
+The goal of K-Means clustering is to produce a set of clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** [`trait KMeansModel`]. It is fundamentally a set of points and a function that defines the distance from an arbitrary point to a cluster center.
+
+The K-Means algorithm computes a K-Means model using an iterative algorithm known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to reflect the new assignment of points to clusters.
+
+The update of the clusters is a form of averaging: newly added points are averaged into a cluster while (optionally) reassigned points are removed from their prior clusters.
+
+A K-Means model can be constructed from any set of cluster centers and distance function. However, the more interesting models satisfy an optimality constraint. If we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost".
+
+A K-Means model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points.
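The assign/update iteration described in the new introduction text can be sketched in standalone Scala. This is a toy sketch using squared Euclidean distance; `LloydSketch` and its method names are illustrative only and are not part of this package's API.

```scala
// Illustrative sketch of Lloyd's algorithm (not this package's implementation).
object LloydSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest center to point p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Component-wise mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point = {
    val d = ps.head.length
    val sum = ps.foldLeft(Array.fill(d)(0.0)) { (acc, p) =>
      acc.indices.foreach(i => acc(i) += p(i)); acc
    }
    sum.map(_ / ps.size)
  }

  def lloyd(points: Seq[Point], centers: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(centers) { (cs, _) =>
      // Assignment step: group points by nearest center.
      val assigned = points.groupBy(p => closest(cs, p))
      // Update step: move each center to the mean of its assigned points.
      cs.indices.map(i => assigned.get(i).map(mean).getOrElse(cs(i)))
    }
}
```

A fixed iteration count stands in for the usual convergence test (stop when the distortion no longer decreases).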
23 changes: 6 additions & 17 deletions docs/SUMMARY.md
@@ -1,35 +1,24 @@
# Table of contents

-* [Generalized K-Means Clustering](README.md)
-* [Introduction](introduction.md)
+* [Introduction](README.md)
+  * [Relation to Spark K-Means Clusterer](introduction/relation-to-spark-k-means-clusterer.md)
+  * [Algorithms Implemented](introduction/algorithms-implemented.md)
* [Requirements](requirements.md)
* [Quick Start](quick-start.md)
* [Concepts](concepts/README.md)
  * [Bregman Divergence](concepts/bregman-divergence.md)
-  * [WeightedVector](concepts/weightedvector.md)
  * [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md)
  * [KMeansModel](concepts/kmeansmodel.md)
  * [MultiKMeansClusterer](concepts/multikmeansclusterer.md)
+  * [WeightedVector](concepts/weightedvector.md)
  * [KMeansSelector](concepts/kmeansselector.md)
* [Usage](usage/README.md)
-  * [Distance Functions](usage/distance-functions.md)
+  * [Selecting a Distance Function](usage/selecting-a-distance-function.md)
  * [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md)
-  * [Using an Embedding](usage/using-an-embedding.md)
+  * [Embedding Data](usage/embedding-data.md)
  * [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md)
  * [Iterative Clustering](usage/iterative-clustering.md)
  * [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md)
  * [Customizing](usage/customizing/README.md)
    * [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md)
    * [Creating a Custom Embedding](usage/customizing/creating-a-custom-embedding.md)
-
-## Algorithms
-
-* [Algorithms Implemented](algorithms/algorithms-implemented/README.md)
-  * [Clustering using general distance functions (Bregman divergences)](algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md)
-  * [Clustering large numbers of points using mini-batches](algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md)
-  * [Clustering high dimensional Euclidean data](algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md)
-  * [Clustering high dimensional time series data](algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md)
-  * [Clustering using symmetrized Bregman divergences](algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md)
-  * [Clustering via bisection](algorithms/algorithms-implemented/clustering-via-bisection.md)
-  * [Clustering with near-optimality](algorithms/algorithms-implemented/clustering-with-near-optimality.md)
-  * [Clustering streaming data](algorithms/algorithms-implemented/clustering-streaming-data.md)
5 changes: 0 additions & 5 deletions docs/algorithms/algorithms-implemented/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/concepts/kmeansmodel.md
@@ -1,6 +1,6 @@
# KMeansModel

-We define our realization of a k-means model, `KMeansModel`, which we enrich with operations to find the closest clusters to a point and to compute distances:
+A K-Means model is a set of cluster centers. We abstract it with the `KMeansModel` trait, which provides methods to map an arbitrary point (a `Vector`, `WeightedVector`, or `BregmanPoint`) to the nearest cluster center and to compute the cost/distance to that center.

```scala
package com.massivedatascience.clusterer
// …
```
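The idea behind the trait (the full definition is truncated above) can be illustrated with a simplified stand-in: a set of centers plus a distance function, with a nearest-center lookup and a total-cost computation. `SimpleKMeansModel` is a hypothetical sketch, not the package's actual trait, whose methods operate on `Vector`, `WeightedVector`, and `BregmanPoint`.

```scala
// A simplified stand-in for the KMeansModel idea (illustrative only).
final case class SimpleKMeansModel(
    centers: IndexedSeq[Array[Double]],
    distance: (Array[Double], Array[Double]) => Double) {

  /** Index of the closest cluster center to `p`. */
  def predict(p: Array[Double]): Int =
    centers.indices.minBy(i => distance(centers(i), p))

  /** Sum of distances from each point to its closest center (the "distortion"). */
  def computeCost(points: Seq[Array[Double]]): Double =
    points.map(p => distance(centers(predict(p)), p)).sum
}
```

Any distance function may be plugged in; with squared Euclidean distance this reduces to the classic K-Means cost.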
6 changes: 1 addition & 5 deletions docs/concepts/kmeansselector.md
@@ -1,10 +1,6 @@
# KMeansSelector

-Any K-Means model may be used as a seed value to Lloyd's algorithm. In fact, our clusterers accept multiple seed sets. The `K-Means.train` helper method allows one to name an initialization method.
-
-Two algorithms are implemented that produce viable seed sets. They may be constructed using the `apply` method of the companion object `KMeansSelector`.
-
-Initializers are implemented with the `KMeansSelector` trait.
+The initial selection of cluster centers is called the initialization step. We abstract implementations of the initialization step with the `KMeansSelector` trait.

```scala
package com.massivedatascience.clusterer
// …
```
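The simplest viable initialization is to pick k distinct data points as seed centers. The sketch below shows that idea; `RandomSeedSelector` is illustrative only and is not one of this package's selectors, which include smarter seeding schemes.

```scala
import scala.util.Random

// Illustrative seeding sketch: k distinct points chosen uniformly at random.
object RandomSeedSelector {
  def init[T](points: IndexedSeq[T], k: Int, rng: Random): IndexedSeq[T] = {
    require(k <= points.length, "cannot select more seeds than points")
    // Shuffle the indices and keep the first k, so seeds are distinct points.
    rng.shuffle(points.indices.toIndexedSeq).take(k).map(points)
  }
}
```

Better selectors trade extra passes over the data for seed sets that make Lloyd's algorithm converge to lower-cost models.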
2 changes: 1 addition & 1 deletion docs/concepts/multikmeansclusterer.md
@@ -1,6 +1,6 @@
# MultiKMeansClusterer

-One may construct K-Means models using one of the provided clusterers that implement Lloyd's algorithm.
+Lloyd's algorithm is simple to describe, but in practice different implementations are possible, with dramatically different running times depending on the data being clustered. We abstract the clusterer with the `MultiKMeansClusterer` trait.

```scala
trait MultiKMeansClusterer extends Serializable with Logging {
  // …
}
```
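The "Multi" in the trait name reflects that a clusterer may be given several candidate seedings and keep only the best result. A sketch of that selection, with hypothetical `train` and `cost` functions standing in for a real clusterer:

```scala
// Illustrative sketch: train from each seeding, keep the lowest-cost model.
object BestOfSketch {
  def bestOf[Model](seedings: Seq[Seq[Array[Double]]],
                    train: Seq[Array[Double]] => Model,
                    cost: Model => Double): Model =
    seedings.map(train).minBy(cost)
}
```

Because Lloyd's algorithm only finds a local optimum, running it from multiple seed sets and keeping the cheapest model is a standard way to improve quality.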
12 changes: 0 additions & 12 deletions docs/introduction.md

This file was deleted.

14 changes: 14 additions & 0 deletions docs/introduction/algorithms-implemented.md
@@ -0,0 +1,14 @@
+# Algorithms Implemented
+
+Most practical variants of K-means clustering are implemented or can be implemented with this package.
+
+* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public_papers/bregmanclustering_jmlr.pdf)
+* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351)
+* [clustering high dimensional Euclidean data](http://www.ida.liu.se/~arnjo/papers/pakdd-ws-11.pdf)
+* [clustering high dimensional time series data](http://www.cs.gmu.edu/~jessica/publications/ikmeans_sdm_workshop03.pdf)
+* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf)
+* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01_05.pdf)
+* [clustering with near-optimality](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf)
+* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf)
+
+If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!
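The Bregman-divergence idea behind the first item can be made concrete. Each convex function F induces a divergence D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩; F(x) = ‖x‖² yields squared Euclidean distance, and F(x) = Σ xᵢ log xᵢ yields the generalized Kullback-Leibler divergence. A standalone sketch (not this package's `BregmanDivergence` machinery):

```scala
// Illustrative Bregman-divergence sketch: D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>.
object BregmanSketch {
  type Vec = Array[Double]

  def bregman(f: Vec => Double, grad: Vec => Vec)(x: Vec, y: Vec): Double = {
    val g = grad(y)
    f(x) - f(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
  }

  // F(x) = ||x||^2  =>  D_F(x, y) = ||x - y||^2 (squared Euclidean distance).
  val squaredEuclidean: (Vec, Vec) => Double =
    bregman(v => v.map(t => t * t).sum, v => v.map(2 * _))

  // F(x) = sum x_i log x_i  =>  generalized Kullback-Leibler divergence.
  val kullbackLeibler: (Vec, Vec) => Double =
    bregman(v => v.map(t => t * math.log(t)).sum, v => v.map(t => math.log(t) + 1))
}
```

Because the cluster mean minimizes the total Bregman divergence to the points of a cluster, Lloyd's update step remains correct for every divergence in this family.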
26 changes: 26 additions & 0 deletions docs/introduction/relation-to-spark-k-means-clusterer.md
@@ -0,0 +1,26 @@
+# Relation to Spark K-Means Clusterer
+
+This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer.
+
+This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
2 changes: 1 addition & 1 deletion docs/usage/alternative-kmeansmodel-construction.md
@@ -1,5 +1,5 @@
---
-description: How to creaate K-Means Models using the KMeansModel Helper Object
+description: How to create K-Means Models using the KMeansModel companion object
---

# Alternative KMeansModel Construction
File renamed without changes.
File renamed without changes.
