Introduction
============

The AMVIDC algorithm is presented in detail in the following
publication:

"Spectrometric differentiation of yeast strains using Minimum Volume
Increase and Minimum Direction Change clustering criteria", N. Fachada,
M.T. Figueiredo, V.V. Lopes, R.C. Martins and A.C. Rosa. Pattern
Recognition Letters, 2014 (IN PRESS)

Data format
-----------

Typically, data is presented as a set of samples (or points), each with
a constant number of dimensions. As such, for the rest of this guide,
data matrices are considered to be in the following format:

- *m* x *n*, with *m* samples (points) and *n* dimensions (variables)

Often the number of dimensions is too high, making clustering
inefficient. When this occurs, data dimensionality can be reduced using
a number of techniques. In this work, PCA and SVD (which are closely
related) are used via the native Matlab `princomp` and `svd`/`svds`
functions.
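
A minimal sketch of this preprocessing step is shown below, assuming a
data matrix `X` in the *m* x *n* format described above (keeping 3
dimensions is an arbitrary choice for illustration):

    % Reduce an m x n data matrix X to its first 3 principal components
    [coeff, score] = princomp(X);  % component scores, ordered by variance
    Xr = score(:, 1:3);            % keep the 3 strongest components

    % Roughly equivalent reduction via truncated SVD of the centered data
    Xc = bsxfun(@minus, X, mean(X, 1));  % center each variable
    [U, S, V] = svds(Xc, 3);             % 3 largest singular triplets
    Xr_svd = Xc * V;                     % project onto the right singular vectors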

Generating data
---------------

This code was inspired by the differentiation of spectrometric data.
However, to further validate the clustering algorithms, synthetic data
sets can be generated with the `generateData` function. This function
generates data in the *m* x *n* format, with *m* samples (points) and
*n* dimensions (variables), according to a set of parameters which are
explained in the source code.

Running the algorithm
=====================

The algorithm is based on agglomerative hierarchical clustering (AHC),
using the Minimum Volume Increase (MVI) and Minimum Direction Change
(MDC) clustering criteria. It can be tested using the
`clusterdata_amvidc` function:

    idx = clusterdata_amvidc(X, k, idx_init);

where **X**, **k** and **idx\_init** are the data matrix, the maximum
number of clusters and the initial clustering, respectively. An initial
clustering is required so that all possible new clusters have volume, a
requirement of MVI. The `clusterdata_amvidc` function has many optional
parameters with reasonable defaults, as specified in the following
table:

| Parameter   | Default         | Options/Description |
| ----------- | --------------- | ------------------- |
| *volume*    | `convhull`      | Volume type: `ellipsoid` or `convhull` |
| *tol*       | 0.01            | Tolerance for minimum volume ellipse calculation (`ellipsoid` volume only) |
| *dirweight* | 0               | Direction weight in the last iteration (0 means MDC linkage is ignored) |
| *dirpower*  | *dirweight* > 0 | Convergence power of *dirweight* (higher values make convergence steeper and occur closer to the end) |
| *dirtype*   | `svd`           | Direction type: `pca` or `svd` |
| *nvi*       | true            | Allow negative volume increase? |
| *loglevel*  | 3               | Log level, from 0 (show all messages) to 4 (only show critical errors); the default of 3 shows warnings only |

For example, to perform clustering using ellipsoid volume while taking
direction change into account, with cluster direction determined using
PCA, one would do:

    idx = clusterdata_amvidc(X, k, idx_init, 'volume', 'ellipsoid', 'dirweight', 0.5, 'dirpower', 4, 'dirtype', 'pca');

As specified, the `clusterdata_amvidc` function requires initial
clusters which, if joined, produce new clusters with volume. Two
clustering functions in this package are appropriate for this purpose,
although others can be used (see the sketch after this list):

- initClust.m - Performs a very simple initial clustering based on AHC
  with single linkage (nearest neighbor) and a user-defined distance.
  Each sample is assigned to the same cluster as its nearest point.
  Allows the user to define a minimum size for each cluster, the
  distance type (as supported by Matlab `pdist`) and the number of
  clusters which are allowed to have less than the minimum size.
- pddp.m - Performs PDDP (principal direction divisive partitioning)
  on the input data. This implementation always selects the largest
  cluster for division, and the algorithm proceeds while the division
  of a cluster yields sub-clusters which can have a volume.
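
As a sketch of how an initial clustering feeds into
`clusterdata_amvidc` (the single-argument `pddp` call below is an
assumption, not a documented signature; consult the function's help
text for the actual parameter list):

    % Hypothetical pipeline: obtain an initial clustering via PDDP, then
    % run AMVIDC on top of it. The pddp argument list is assumed here.
    idx_init = pddp(X);                        % initial clusters
    idx = clusterdata_amvidc(X, k, idx_init);  % final AMVIDC clustering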

Analysis of results
===================

F-score
-------

In this work, the [F-score](http://en.wikipedia.org/wiki/F1_score)
measure was used to evaluate clustering results. The `fscore` function
(in fscore.m) was developed for this purpose. To run this function, do:

    eval = fscore(idx, numclasses, numclassmembers);

where:

- **idx** - *m* x 1 vector containing the cluster indices of each
  point (as returned by the clustering functions)
- **numclasses** - Correct number of clusters
- **numclassmembers** - Vector with the correct size of each cluster
  (or a scalar if all clusters are of the same size)

The `fscore` function returns:

- **eval** - Value between 0 (worst case) and 1 (perfect clustering)
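
For example, assuming `idx` resulted from clustering a data set known
to contain three classes of 50 points each (the class count and sizes
here are illustrative):

    % Evaluate a clustering of 150 points drawn from 3 known classes
    % of 50 points each (idx as returned by clusterdata_amvidc)
    eval = fscore(idx, 3, 50);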

Plotting clusters
-----------------

Sometimes visualizing how an algorithm grouped clusters can provide
important insight into its effectiveness. It may also be important to
visually compare an algorithm's clustering result with the correct
result. These are the goals of the `plotClusters` function, which can
show two clustering results in the same image (e.g. the correct one and
one returned by an algorithm). You can run `plotClusters` in the
following way:

    h_out = plotClusters(X, dims, idx_marker, idx_encircle, encircle_method, h_in);

where:

- **X** - Data matrix, *m* x *n*, with *m* samples (points) and *n*
  dimensions (variables)
- **dims** - Number of dimensions (2 or 3)
- **idx_marker** - Clustering result ^1^ to be shown directly on
  points using markers
- **idx_encircle** - Clustering result ^1^ to be shown using
  encirclement/grouping of points
- **encircle_method** - How to encircle the **idx_encircle** result:
  `convhull` (default), `ellipsoid` or `none`
- **h_in** - (Optional) Existing figure handle in which to create the
  plot

^1^ *m* x 1 vector containing the cluster indices of each point

The `plotClusters` function returns:

- **h_out** - Figure handle of the plot
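
For example, to compare a known (correct) clustering with an AMVIDC
result in two dimensions, encircling the algorithm's clusters with
convex hulls (the variable names `idx_correct` and `idx` are
illustrative):

    % Correct clustering shown with markers, AMVIDC result shown with
    % convex hull encirclement, in a 2D plot
    h = plotClusters(X, 2, idx_correct, idx, 'convhull');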