Skip to content

Commit c8bcc8a

Browse files
committed
First commit for GitHub
0 parents  commit c8bcc8a

18 files changed

+1560
-0
lines changed

+mve/MinVolEllipse.m

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
function [A , c] = MinVolEllipse(P, tolerance)
2+
% [A , c] = MinVolEllipse(P, tolerance)
3+
% Finds the minimum volume enclsing ellipsoid (MVEE) of a set of data
4+
% points stored in matrix P. The following optimization problem is solved:
5+
%
6+
% minimize log(det(A))
7+
% subject to (P_i - c)' * A * (P_i - c) <= 1
8+
%
9+
% in variables A and c, where P_i is the i-th column of the matrix P.
10+
% The solver is based on Khachiyan Algorithm, and the final solution
11+
% is different from the optimal value by the pre-spesified amount of 'tolerance'.
12+
%
13+
% inputs:
14+
%---------
15+
% P : (d x N) dimnesional matrix containing N points in R^d.
16+
% tolerance : error in the solution with respect to the optimal value.
17+
%
18+
% outputs:
19+
%---------
20+
% A : (d x d) matrix of the ellipse equation in the 'center form':
21+
% (x-c)' * A * (x-c) = 1
22+
% c : 'd' dimensional vector as the center of the ellipse.
23+
%
24+
% example:
25+
% --------
26+
% P = rand(5,100);
27+
% [A, c] = MinVolEllipse(P, .01)
28+
%
29+
% To reduce the computation time, work with the boundary points only:
30+
%
31+
% K = convhulln(P');
32+
% K = unique(K(:));
33+
% Q = P(:,K);
34+
% [A, c] = MinVolEllipse(Q, .01)
35+
%
36+
%
37+
% Nima Moshtagh ([email protected])
38+
% University of Pennsylvania
39+
%
40+
% December 2005
41+
% UPDATE: Jan 2009
42+
43+
44+
45+
%%%%%%%%%%%%%%%%%%%%% Solving the Dual problem%%%%%%%%%%%%%%%%%%%%%%%%%%%5
46+
% ---------------------------------
47+
% data points
48+
% -----------------------------------
49+
[d N] = size(P);
50+
51+
Q = zeros(d+1,N);
52+
Q(1:d,:) = P(1:d,1:N);
53+
Q(d+1,:) = ones(1,N);
54+
55+
56+
% initializations
57+
% -----------------------------------
58+
count = 1;
59+
err = 1;
60+
u = (1/N) * ones(N,1); % 1st iteration
61+
62+
63+
% Khachiyan Algorithm
64+
% -----------------------------------
65+
while err > tolerance,
66+
X = Q * diag(u) * Q'; % X = \sum_i ( u_i * q_i * q_i') is a (d+1)x(d+1) matrix
67+
M = diag(Q' * inv(X) * Q); % M the diagonal vector of an NxN matrix
68+
[maximum j] = max(M);
69+
step_size = (maximum - d -1)/((d+1)*(maximum-1));
70+
new_u = (1 - step_size)*u ;
71+
new_u(j) = new_u(j) + step_size;
72+
count = count + 1;
73+
err = norm(new_u - u);
74+
u = new_u;
75+
end
76+
77+
78+
79+
%%%%%%%%%%%%%%%%%%% Computing the Ellipse parameters%%%%%%%%%%%%%%%%%%%%%%
80+
% Finds the ellipse equation in the 'center form':
81+
% (x-c)' * A * (x-c) = 1
82+
% It computes a dxd matrix 'A' and a d dimensional vector 'c' as the center
83+
% of the ellipse.
84+
85+
U = diag(u);
86+
87+
% the A matrix for the ellipse
88+
% --------------------------------------------
89+
A = (1/d) * inv(P * U * P' - (P * u)*(P*u)' );
90+
91+
92+
% center of the ellipse
93+
% --------------------------------------------
94+
c = P * u;

+mve/license.txt

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,24 @@
1+
Copyright (c) 2009, Nima Moshtagh
2+
All rights reserved.
3+
4+
Redistribution and use in source and binary forms, with or without
5+
modification, are permitted provided that the following conditions are
6+
met:
7+
8+
* Redistributions of source code must retain the above copyright
9+
notice, this list of conditions and the following disclaimer.
10+
* Redistributions in binary form must reproduce the above copyright
11+
notice, this list of conditions and the following disclaimer in
12+
the documentation and/or other materials provided with the distribution
13+
14+
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
15+
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
16+
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
17+
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
18+
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
19+
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
20+
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
21+
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
22+
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
23+
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
24+
POSSIBILITY OF SUCH DAMAGE.

.gitignore

Lines changed: 14 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,14 @@
1+
*~
2+
.~*#
3+
.nfs*
4+
*.mat
5+
*.png
6+
*.fig
7+
*.aux
8+
*.log
9+
*.blg
10+
*.out
11+
*.pdf
12+
*.gz
13+
*.ods
14+
synt/

README.md

Lines changed: 142 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,142 @@
1+
Introduction
2+
============
3+
4+
The AMVIDC algorithm is presented in detail in the following
5+
publication:
6+
7+
"Spectrometric differentiation of yeast strains using Minimum Volume
8+
Increase and Minimum Direction Change clustering criteria", N. Fachada,
9+
M.T. Figueiredo, V.V. Lopes, R.C. Martins and A.C. Rosa. Pattern
10+
Recognition Letters, 2014 (IN PRESS)
11+
12+
Data format
13+
-----------
14+
15+
Typically, data is presented as a set of samples (or points), each with
16+
a constant number of dimensions. As such, for the rest of this guide,
17+
data matrices are considered to be in the following format:
18+
19+
- *m* x *n*, with *m* samples (points) and *n* dimensions (variables)
20+
21+
Many times the number of dimensions is too high, making clustering
22+
inefficient. When this occurs, one can reduce data dimensionality using
23+
a number of techniques. In this work, PCA and SVD (which are very
24+
similar) are used via the Matlab native `princomp` and `svd`/`svds`
25+
functions.
26+
27+
Generating data
28+
---------------
29+
30+
This code was inspired on the differentiation of spectrometric data.
31+
However, to further validate the clustering algorithms, synthetic
32+
data sets can be generated with the generateData function. This function
33+
generates data in the *m* x *n* format, with *m* samples (points) and
34+
*n* dimensions (variables) according to a set of parameters, which are
35+
explained in the source code.
36+
37+
Running the algorithm
38+
=====================
39+
40+
This algorithm is based on AHC, using Minimum Volume Increase (MVI) and
41+
Minimum Direction Change (MDC) clustering criteria. This algorithm can
42+
be tested using the clusterdata_amvidc.m function:
43+
44+
idx = clusterdata_amvidc(X, k, idx_init);
45+
46+
where **X**, **k** and **idx\_init** are the typical data matrix,
47+
maximum number of clusters and initial clustering, respectively. Initial
48+
clustering is required so that all possible new clusters have volume, a
49+
requirement for MVI. `clusterdata_amvidc` function has many optional
50+
parameters, with reasonable defaults, as specified in the following
51+
table:
52+
53+
Parameter Default Options/Description
54+
------------- ------------------------ -------------------------------------------------------------------------------------------------------
55+
*volume* ‘convhull’ Volume type: ‘ellipsoid’ or ‘convhull’
56+
*tol* 0.01 Tolerance for minimum volume ellipse calculation (‘ellipsoid’ volume only)
57+
*dirweight* 0 Direction weight in last iteration (0 means MDC linkage is ignored)
58+
*dirpower* *dirweight* \> 0 Convergence power to dirweight (higher values make convergence steeper and occurring more to the end)
59+
*dirtype* ‘svd’ Direction type: ‘pca’, ‘svd’
60+
*nvi* true Allow negative volume increase?
61+
*loglevel* 3 (show warnings only) Log level: 0 (show all messages) to 4 (only show critical errors), default is 3 (show warnings)
62+
63+
For example, to perform clustering using ellipsoid volume taking into
64+
account direction change, where cluster direction is determined using
65+
PCA, one would do:
66+
67+
idx = clusterdata_mvidc(X, k, idx_init, 'volume', 'ellipsoid', 'dirweight',0.5, 'dirpower', 4, 'dirtype', 'pca');
68+
69+
As specified, the `clusterdata_amvidc` function requires initial clusters
70+
which, if joined, produce new clusters with volume. There are two
71+
clustering functions appropriate for this (but others can be used):
72+
73+
- initClust.m - Performs very simple initial clustering based
74+
on AHC with single linkage (nearest neighbor) and user defined
75+
distance. Each sample is associated with the same cluster of its
76+
nearest point. Allows to define a minimum size for each cluster,
77+
distance type (as supported by Matlab `pdist`) and the number of
78+
clusters which are allowed to have less than the minimum size.
79+
- pddp.m - Perform PDDP (principal direction divisive
80+
clustering) on input data. This implementation always selects the
81+
largest cluster for division, with the algorithm proceeding while
82+
the division of a cluster yields sub-clusters which can have a
83+
volume.
84+
85+
Analysis of results
86+
===================
87+
88+
F-score
89+
-------
90+
91+
In this work, the [F-score](http://en.wikipedia.org/wiki/F1_score)
92+
measure was used to evaluate clustering results. The source:fscore.m
93+
function was developed for this purpose. To run this function, do:
94+
95+
eval = fscore(idx, numclasses, numclassmembers);
96+
97+
where:
98+
99+
- **idx** - *m* x *1* vector containing the cluster indices of each
100+
point (as returned by the clustering functions)
101+
- **numclasses** - Correct number of clusters
102+
- **numclassmembers** - Vector with the correct size of each cluster
103+
(or a scalar if all clusters are of the same size)
104+
105+
The `fscore` function returns:
106+
107+
- **eval** - Value between 0 (worst case) and 1 (perfect clustering)
108+
109+
Plotting clusters
110+
-----------------
111+
112+
Sometimes visualizing how an algorithm grouped clusters can provide
113+
important insight on its effectiveness. Also, it may be important to
114+
visually compare an algorithm’s clustering result with the correct
115+
result. These are the goals of the plotClusters.m function, which
116+
can show two clustering results in the same image (e.g. the correct one
117+
and one returned by an algorithm). You can run `plotClusters` in the
118+
following way:
119+
120+
h_out = plotClusters(X, dims, idx_marker, idx_encircle, encircle_method, h_in);
121+
122+
where:
123+
124+
- **X** - Data matrix, *m* x *n*, with m samples (points) and n
125+
dimensions (variables)
126+
- **dims** - Number of dimensions (2 or 3)
127+
- **idx_marker** - Clustering result ^1^ to be shown directly in
128+
points using markers
129+
- **idx_encircle** - Clustering result ^1^ to be shown using
130+
encirclement/grouping of points
131+
- **encircle_method** - How to encircle the **idx*encircle**
132+
result: ‘convhull’ (default), ‘ellipsoid’ or ‘none’
133+
- **h_in** - (Optional) Existing figure handle where to create
134+
plot
135+
136+
^1^ *m* x *1* vector containing the cluster indices of each point
137+
138+
The `plotClusters` function returns:
139+
140+
- **h_out** - Figure handle of plot
141+
142+

angleDiff.m

Lines changed: 30 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,30 @@
1+
%
2+
% angleDiff function - determine smallest angle between two directions
3+
%
4+
% Parameters:
5+
% v1 - n x 1, vector representing first direction
6+
% v2 - n x 1, vector representing second direction
7+
% Output:
8+
% delta - angle in radians between first and second directions
9+
%
10+
function delta = angleDiff(v1, v2)
11+
12+
% Determine number of dimensions
13+
numDims = size(v1, 1);
14+
15+
% Make sure vectors are in 1st or 2nd quadrants (from a 2D perspective,
16+
% although this should work for m-dimensions)
17+
if v1(numDims, 1) < 0
18+
v1 = -1 * v1;
19+
end;
20+
if v2(numDims, 1) < 0
21+
v2 = -1 * v2;
22+
end;
23+
24+
% Obtain angle between vectors (thus, between the directions they
25+
% represent)
26+
cosDelta = dot(v1, v2) / (norm(v1) * norm(v2));
27+
delta = acos(cosDelta);
28+
%delta = 1 - cosDelta;
29+
30+
end

clusterVol.m

Lines changed: 42 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,42 @@
1+
%
2+
% clusterVol function - Determine the volume of the cluster formed by the
3+
% given set of points. The volume is determined using the convex hull
4+
% formed by the cluster, of the minimum volume ellipsoid (mve) formed by
5+
% the cluster.
6+
%
7+
% Parameters:
8+
% points - m x n, with m samples and n dimensions
9+
% type - 'ellipsoid' or 'convhull'
10+
% zeroVolValue - value to assign to volume if given points are not enough
11+
% to calculate a volume
12+
% tol - tolerance for ellipsoid volume
13+
% Output:
14+
% volCluster - Volume of the cluster formed by the given set of points.
15+
%
16+
function volCluster = clusterVol(points, type, zeroVolValue, tol)
17+
18+
% How many points are in cluster
19+
sizeCluster = size(points, 1);
20+
% How many dimensions are at stake
21+
numDims = size(points, 2);
22+
% Add path to external functions
23+
%addpath('external/');
24+
25+
% Check if there are enough points to calculate a volume
26+
if sizeCluster < numDims + 1
27+
volCluster = zeroVolValue;
28+
return;
29+
end;
30+
31+
% Determine what type of volume to use
32+
if strcmp(type, 'ellipsoid')
33+
% Ellipsoid, use MinVolEllipse from Nima Moshtagh
34+
[A, ~] = mve.MinVolEllipse(points', tol);
35+
volCluster = det(inv(A));
36+
elseif strcmp(type, 'convhull')
37+
% Convex hull, use matlab native function
38+
[~, volCluster] = convhulln(points,{'QJ','Pp'});
39+
else
40+
error('Unknown type of volume.');
41+
end;
42+

0 commit comments

Comments
 (0)