GITBOOK-6: No subject
derrickburns authored and gitbook-bot committed Jan 18, 2024
1 parent 9b7de8e commit 01a7d60
Showing 20 changed files with 56 additions and 97 deletions.
30 changes: 6 additions & 24 deletions docs/README.md
@@ -1,30 +1,12 @@
-# Generalized K-Means Clustering
+# Introduction

-This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer.
-
-Most practical variants of K-means clustering are implemented or can be implemented with this package.
-
-If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!
-
-This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
+The goal of K-Means clustering is to produce a set of clusters of a set of points that satisfies certain optimality constraints. That model is called a **K-Means model** [`trait KMeansModel`]. It is fundamentally a set of points and a function that defines the distance from an arbitrary point to a cluster center.
+
+The K-Means algorithm computes a K-Means model using an iterative algorithm known as [Lloyd's algorithm](http://en.wikipedia.org/wiki/Lloyd's_algorithm). Each iteration of Lloyd's algorithm assigns a set of points to clusters, then updates the cluster centers to reflect the new assignment of points to clusters.
+
+The update of the clusters is a form of averaging: newly added points are averaged into a cluster while (optionally) reassigned points are removed from their prior clusters.
+
+A K-Means model can be constructed from any set of cluster centers and distance function. However, the more interesting models satisfy an optimality constraint. If we sum the distances from the points in a given set to their closest cluster centers, we get a number called the "distortion" or "cost".
+
+A K-Means model is locally optimal with respect to a set of points if each cluster center is determined by the mean of the points assigned to that cluster. Computing such a `KMeansModel` given a set of points is called "training" the model on those points.
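The assign/update iteration described in the new introduction text can be sketched in standalone Scala. This is a toy sketch using squared Euclidean distance; `LloydSketch` and its method names are illustrative only and are not part of this package's API.

```scala
// Illustrative sketch of Lloyd's algorithm (not this package's implementation).
object LloydSketch {
  type Point = Array[Double]

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the nearest center to point p.
  def closest(centers: Seq[Point], p: Point): Int =
    centers.indices.minBy(i => sqDist(centers(i), p))

  // Component-wise mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point = {
    val d = ps.head.length
    val sum = ps.foldLeft(Array.fill(d)(0.0)) { (acc, p) =>
      acc.indices.foreach(i => acc(i) += p(i)); acc
    }
    sum.map(_ / ps.size)
  }

  def lloyd(points: Seq[Point], centers: Seq[Point], iterations: Int): Seq[Point] =
    (1 to iterations).foldLeft(centers) { (cs, _) =>
      // Assignment step: group points by nearest center.
      val assigned = points.groupBy(p => closest(cs, p))
      // Update step: move each center to the mean of its assigned points.
      cs.indices.map(i => assigned.get(i).map(mean).getOrElse(cs(i)))
    }
}
```

A fixed iteration count stands in for the usual convergence test (stop when the distortion no longer decreases).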
23 changes: 6 additions & 17 deletions docs/SUMMARY.md
@@ -1,35 +1,24 @@
# Table of contents

-* [Generalized K-Means Clustering](README.md)
-* [Introduction](introduction.md)
+* [Introduction](README.md)
+  * [Relation to Spark K-Means Clusterer](introduction/relation-to-spark-k-means-clusterer.md)
+  * [Algorithms Implemented](introduction/algorithms-implemented.md)
* [Requirements](requirements.md)
* [Quick Start](quick-start.md)
* [Concepts](concepts/README.md)
  * [Bregman Divergence](concepts/bregman-divergence.md)
-  * [WeightedVector](concepts/weightedvector.md)
  * [BregmanPoint, BregmanCenter, BregmanPointOps](concepts/bregmanpoint-bregmancenter-bregmanpointops.md)
  * [KMeansModel](concepts/kmeansmodel.md)
  * [MultiKMeansClusterer](concepts/multikmeansclusterer.md)
+  * [WeightedVector](concepts/weightedvector.md)
  * [KMeansSelector](concepts/kmeansselector.md)
* [Usage](usage/README.md)
-  * [Distance Functions](usage/distance-functions.md)
+  * [Selecting a Distance Function](usage/selecting-a-distance-function.md)
  * [Constructing K-Means Models using Clusterers](usage/constructing-k-means-models-using-clusterers.md)
-  * [Using an Embedding](usage/using-an-embedding.md)
+  * [Embedding Data](usage/embedding-data.md)
  * [Seeding the Set of Cluster Centers](usage/seeding-the-set-of-cluster-centers.md)
  * [Iterative Clustering](usage/iterative-clustering.md)
  * [Alternative KMeansModel Construction](usage/alternative-kmeansmodel-construction.md)
  * [Customizing](usage/customizing/README.md)
    * [Creating a Custom Distance Function](usage/customizing/creating-a-custom-distance-function.md)
    * [Creating a Custom Embedding](usage/customizing/creating-a-custom-embedding.md)
-
-## Algorithms
-
-* [Algorithms Implemented](algorithms/algorithms-implemented/README.md)
-  * [Clustering using general distance functions (Bregman divergences)](algorithms/algorithms-implemented/clustering-using-general-distance-functions-bregman-divergences.md)
-  * [Clustering large numbers of points using mini-batches](algorithms/algorithms-implemented/clustering-large-numbers-of-points-using-mini-batches.md)
-  * [Clustering high dimensional Euclidean data](algorithms/algorithms-implemented/clustering-high-dimensional-euclidean-data.md)
-  * [Clustering high dimensional time series data](algorithms/algorithms-implemented/clustering-high-dimensional-time-series-data.md)
-  * [Clustering using symmetrized Bregman divergences](algorithms/algorithms-implemented/clustering-using-symmetrized-bregman-divergences.md)
-  * [Clustering via bisection](algorithms/algorithms-implemented/clustering-via-bisection.md)
-  * [Clustering with near-optimality](algorithms/algorithms-implemented/clustering-with-near-optimality.md)
-  * [Clustering streaming data](algorithms/algorithms-implemented/clustering-streaming-data.md)
5 changes: 0 additions & 5 deletions docs/algorithms/algorithms-implemented/README.md

This file was deleted.

2 changes: 1 addition & 1 deletion docs/concepts/kmeansmodel.md
@@ -1,6 +1,6 @@
# KMeansModel

-We define our realization of a k-means model, `KMeansModel`, which we enrich with operations to find the closest clusters to a point and to compute distances:
+A K-Means model is a set of cluster centers. We abstract it with the `KMeansModel` trait, which provides methods to map an arbitrary point (a `Vector`, `WeightedVector`, or `BregmanPoint`) to the nearest cluster center and to compute the cost/distance to that center.

```scala
package com.massivedatascience.clusterer
// …
```
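The idea behind the trait (the full definition is truncated above) can be illustrated with a simplified stand-in: a set of centers plus a distance function, with a nearest-center lookup and a total-cost computation. `SimpleKMeansModel` is a hypothetical sketch, not the package's actual trait, whose methods operate on `Vector`, `WeightedVector`, and `BregmanPoint`.

```scala
// A simplified stand-in for the KMeansModel idea (illustrative only).
final case class SimpleKMeansModel(
    centers: IndexedSeq[Array[Double]],
    distance: (Array[Double], Array[Double]) => Double) {

  /** Index of the closest cluster center to `p`. */
  def predict(p: Array[Double]): Int =
    centers.indices.minBy(i => distance(centers(i), p))

  /** Sum of distances from each point to its closest center (the "distortion"). */
  def computeCost(points: Seq[Array[Double]]): Double =
    points.map(p => distance(centers(predict(p)), p)).sum
}
```

Any distance function may be plugged in; with squared Euclidean distance this reduces to the classic K-Means cost.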
6 changes: 1 addition & 5 deletions docs/concepts/kmeansselector.md
@@ -1,10 +1,6 @@
# KMeansSelector

-Any K-Means model may be used as a seed value to Lloyd's algorithm. In fact, our clusterers accept multiple seed sets. The `K-Means.train` helper method allows one to name an initialization method.
-
-Two algorithms are implemented that produce viable seed sets. They may be constructed using the `apply` method of the companion object `KMeansSelector`.
-
-Initializers are implemented with the `KMeansSelector` trait.
+The initial selection of cluster centers is called the initialization step. We abstract implementations of the initialization step with the `KMeansSelector` trait.

```scala
package com.massivedatascience.clusterer
// …
```
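The simplest viable initialization is to pick k distinct data points as seed centers. The sketch below shows that idea; `RandomSeedSelector` is illustrative only and is not one of this package's selectors, which include smarter seeding schemes.

```scala
import scala.util.Random

// Illustrative seeding sketch: k distinct points chosen uniformly at random.
object RandomSeedSelector {
  def init[T](points: IndexedSeq[T], k: Int, rng: Random): IndexedSeq[T] = {
    require(k <= points.length, "cannot select more seeds than points")
    // Shuffle the indices and keep the first k, so seeds are distinct points.
    rng.shuffle(points.indices.toIndexedSeq).take(k).map(points)
  }
}
```

Better selectors trade extra passes over the data for seed sets that make Lloyd's algorithm converge to lower-cost models.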
2 changes: 1 addition & 1 deletion docs/concepts/multikmeansclusterer.md
@@ -1,6 +1,6 @@
# MultiKMeansClusterer

-One may construct K-Means models using one of the provided clusterers that implement Lloyd's algorithm.
+Lloyd's algorithm is simple to describe, but in practice different implementations are possible, with dramatically different running times depending on the data being clustered. We abstract the clusterer with the `MultiKMeansClusterer` trait.

```scala
trait MultiKMeansClusterer extends Serializable with Logging {
  // …
}
```
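The "Multi" in the trait name reflects that a clusterer may be given several candidate seedings and keep only the best result. A sketch of that selection, with hypothetical `train` and `cost` functions standing in for a real clusterer:

```scala
// Illustrative sketch: train from each seeding, keep the lowest-cost model.
object BestOfSketch {
  def bestOf[Model](seedings: Seq[Seq[Array[Double]]],
                    train: Seq[Array[Double]] => Model,
                    cost: Model => Double): Model =
    seedings.map(train).minBy(cost)
}
```

Because Lloyd's algorithm only finds a local optimum, running it from multiple seed sets and keeping the cheapest model is a standard way to improve quality.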
12 changes: 0 additions & 12 deletions docs/introduction.md

This file was deleted.

14 changes: 14 additions & 0 deletions docs/introduction/algorithms-implemented.md
@@ -0,0 +1,14 @@
+# Algorithms Implemented
+
+Most practical variants of K-means clustering are implemented or can be implemented with this package.
+
+* [clustering using general distance functions (Bregman divergences)](http://www.cs.utexas.edu/users/inderjit/public_papers/bregmanclustering_jmlr.pdf)
+* [clustering large numbers of points using mini-batches](https://arxiv.org/abs/1108.1351)
+* [clustering high dimensional Euclidean data](http://www.ida.liu.se/~arnjo/papers/pakdd-ws-11.pdf)
+* [clustering high dimensional time series data](http://www.cs.gmu.edu/~jessica/publications/ikmeans_sdm_workshop03.pdf)
+* [clustering using symmetrized Bregman divergences](https://people.clas.ufl.edu/yun/files/article-8-1.pdf)
+* [clustering via bisection](http://www.siam.org/meetings/sdm01/pdf/sdm01_05.pdf)
+* [clustering with near-optimality](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf)
+* [clustering streaming data](http://papers.nips.cc/paper/3812-streaming-k-means-approximation.pdf)
+
+If you find a novel variant of k-means clustering that is provably superior in some manner, implement it using the package and send a pull request along with the paper analyzing the variant!
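The Bregman-divergence idea behind the first item can be made concrete. Each convex function F induces a divergence D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩; F(x) = ‖x‖² yields squared Euclidean distance, and F(x) = Σ xᵢ log xᵢ yields the generalized Kullback-Leibler divergence. A standalone sketch (not this package's `BregmanDivergence` machinery):

```scala
// Illustrative Bregman-divergence sketch: D_F(x, y) = F(x) - F(y) - <grad F(y), x - y>.
object BregmanSketch {
  type Vec = Array[Double]

  def bregman(f: Vec => Double, grad: Vec => Vec)(x: Vec, y: Vec): Double = {
    val g = grad(y)
    f(x) - f(y) - x.indices.map(i => g(i) * (x(i) - y(i))).sum
  }

  // F(x) = ||x||^2  =>  D_F(x, y) = ||x - y||^2 (squared Euclidean distance).
  val squaredEuclidean: (Vec, Vec) => Double =
    bregman(v => v.map(t => t * t).sum, v => v.map(2 * _))

  // F(x) = sum x_i log x_i  =>  generalized Kullback-Leibler divergence.
  val kullbackLeibler: (Vec, Vec) => Double =
    bregman(v => v.map(t => t * math.log(t)).sum, v => v.map(t => math.log(t) + 1))
}
```

Because the cluster mean minimizes the total Bregman divergence to the points of a cluster, Lloyd's update step remains correct for every divergence in this family.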
26 changes: 26 additions & 0 deletions docs/introduction/relation-to-spark-k-means-clusterer.md
@@ -0,0 +1,26 @@
+# Relation to Spark K-Means Clusterer
+
+This project generalizes the Spark MLLIB Batch K-Means (v1.1.0) clusterer and the Spark MLLIB Streaming K-Means (v1.2.0) clusterer.
+
+This code has been tested on data sets of tens of millions of points in a 700+ dimensional space using a variety of distance functions. Thanks to the excellent core Spark implementation, it rocks!
2 changes: 1 addition & 1 deletion docs/usage/alternative-kmeansmodel-construction.md
@@ -1,5 +1,5 @@
---
-description: How to creaate K-Means Models using the KMeansModel Helper Object
+description: How to create K-Means Models using the KMeansModel companion object
---

# Alternative KMeansModel Construction
File renamed without changes.
File renamed without changes.
