This module provides a function for computing the earth mover's distance (EMD) in Python2. It wraps Yossi Rubner's C implementation of the EMD into a Python module using a custom SWIG interface wrapper.
The EMD is a distance measure between two probability distributions. It can be used, for example, to retrieve similar images from a database. However, this implementation is not limited to images or histograms; it can represent distributions over any user-defined features, using any user-defined ground distance. Please see the original homepage for more explanations and references.
Building the module requires the following:
- Python2 C headers (
python-dev
) - C compiler and linker, as well as Make (
build-essential
) - SWIG Simple Wrapper and Interface Generator (
swig
) - EMD source code and interface definition (
emd.h
,emd.c
, andemd.i
in this repository)
The name in parentheses after the first three entries is the name of the Debian/Ubuntu Linux package that contains the required files.
If all requirements are met, the module can be built using Make.
Simply execute make
in the directory that contains the EMD source
code and interface definition. The output should look similar to the
following:
>>> Building object file 'emd.o'.
cc -o emd.o -c emd.c -fPIC -I/usr/include/python2.7 -I/usr/include/python2.7
>>> Generating C interface
swig -python emd.i
>>> Building object file 'emd_wrap.o'.
cc -o emd_wrap.o -c emd_wrap.c -fPIC -I/usr/include/python2.7 -I/usr/include/python2.7
>>> Linking wrapper library '_emd.so'.
ld -shared -o _emd.so emd.o emd_wrap.o
rm emd_wrap.o emd.o emd_wrap.c
You can ignore warnings about redefinitions of INFINITY
and extra
tokens at the end of include statements. If everything went well, the
now directory contains the C library _emd.so
and its Python wrapper
emd.py
.
The included example1.py
is a port of example1.c
. It demonstrates
the basic use of the EMD module. Simply run the script with the
Python2 interpreter:
$ python2 example1.py
160.542770386
The function emd
in the emd
module computes the EMD. (I bet you
were surprised by this fact.) It accepts two pairs of feature and
weight lists, and a callable for computing the ground distance.
from emd import emd
print emd( (features1, weights1), (features2, weights2), distance)
In above code, features1
and features2
are lists of arbitrary
Python objects. The implementation invokes the callable distance
on
pairs of feature objects to compute the distance matrix. Each feature
object is assigned a weight to characterize the distribution.
The implementation treats objects as opaquely as possible. The only limitations are:
features1
andweights1
are lists of the same length. The entries ofweights1
must be floating point numbers;features1
can contain arbitrary objects, but thedistance
function must be able to process them. Similarly forfeatures2
andweights2
.- The
distance
argument is a callable object (for example a function) that accepts two arguments (from thefeatures
lists) and returns a floating point number.
See example1.py
for a filled-in example.
- The C implementation uses
float
as its fundamental data type for distances and weights. The (typicallydouble
) floating point objects of Python will be down-cast and lose precision. - See
emd.h
for the hard-coded definitions of maximum signature size, maximum iteration count, and computational constants. - Currently, the wrapper ignores the flow output argument that is available in the C implementation.
- The wrapper is not suited for concurrent execution. It uses a
global variable for the distance callback function, so calling
emd
from concurrent threads will result in undefined behavior.
Only the wrapper code, the Makefile, the Python version of
example1.c
, and this document were written by me and consequently
are (c) copyrighted by Peter Dinges. These files are available under
the open-source MIT License.
Yossi Rubner is the author of the C code that performs the actual computation. I have included his source code in this repository because I am unsure about the future availability of his homepage. (I could not reach him via email.)