This project is optimized for maximum readability and not performance. This is reinventing the wheel for educational purposes.
There are comments & docstrings throughout the code.
No external libraries are used, except the Fibonacci Heap for finding MST.
The following is an explanation of the algorithm. You can read it, jump straight into the code, or read these:
Table of Contents
Generally, given some cost function `C`, we want to find `x` with minimal `C(x)`.
E.g. for the k-Colouring Problem, I have defined the cost function to be the number of bad edges, i.e. edges connecting same-colour vertices.
(Alternatively, we could maximize a utility function; without loss of generality, we will stick to the first formulation.)
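To make this concrete, here is a minimal sketch of such a cost function, assuming the graph is given as a list of edges and a candidate solution `x` as a list of colours indexed by vertex (these names are illustrative, not necessarily the repo's actual API):

```python
def cost(x, edges):
    """Number of bad edges: edges whose two endpoints share a colour.

    x     -- candidate solution; x[v] is the colour of vertex v
    edges -- iterable of (u, v) vertex pairs
    """
    return sum(1 for u, v in edges if x[u] == x[v])
```

For example, `cost([0, 1, 1], [(0, 1), (1, 2)])` is 1, because only the edge `(1, 2)` connects two same-colour vertices.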
A `model` (density estimator) represents where we think the minimal `C(x)` is.
At first, we don't know anything about `C`, so the `model` should represent a uniform distribution over all admissible solutions.
We use the `model` to generate new candidate solutions (`x`s) by calling `model.sample()`.
We evaluate `C(x)` to see if our guess was right or wrong.
We then improve our guessing iteratively, learning about the distribution of `C`.
At each iteration:

- We use this `model` to generate more samples and grow our `population` (using `β_μ`).
- Candidate solutions with high (bad) `C` values are discarded and the fittest ones survive (using `S`).
- With this new `population`, our `model` could be updated to better estimate the distribution (using `α_MIMIC`).
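Putting these three steps together, the outer loop might look roughly like the sketch below. The `Model` interface (`fit`, `sample`) and the exact way `β_μ`, `S`, and `α_MIMIC` are applied here are assumptions for illustration; the repo's actual signatures and semantics may differ.

```python
def optimize(cost, model, population, n_iterations, beta_mu, S):
    """Illustrative EDA outer loop (not the repo's exact code)."""
    for _ in range(n_iterations):
        # grow the population: here β_μ is treated as a growth factor
        n_new = int(beta_mu * len(population))
        population = population + [model.sample() for _ in range(n_new)]

        # selection: keep the S candidates with the lowest (best) cost
        population.sort(key=cost)
        population = population[:S]

        # refit the model on the survivors (α_MIMIC would control how
        # strongly the new population updates the old model)
        model = model.fit(population)

    return min(population, key=cost)
```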
This idea of guessing intelligently improves performance compared to Simulated Annealing and other alternatives which blindly walk on `C`.
Each EDA has a unique way of modeling with its own pros and cons.
Learning the structure of the function space helps us to better estimate, sample, and model `C`.
For example, in a 2-colouring problem, imagine a graph with only two neighbouring vertices. The valid solutions are 0-1 and 1-0, meaning that each vertex is equally likely to be coloured 0 or 1. A modeling strategy that does not learn the relationship between the vertices would sample 0-0 and 0-1 with equal probability.
We want a good modeling strategy that learns about the structure of `x`'s variables (`x = [x_0, x_1]`), e.g. knowing `x_0 = 0`, the probability of sampling `x_1 = 0` should decrease drastically.
Helpful read: the No Free Lunch theorem in search and optimization, an explanation of why an optimization algorithm can only improve on others if it builds in certain assumptions about structure.
But learning too much structure would cost us too much time, i.e. we will reach the goal in fewer iterations but each iteration would cost more time. It's all about balancing the two.
Learning the structure perfectly would give us the full joint probability distribution:
p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1, x_0)*p(x_3 | x_2, x_1, x_0) ...
Which means: the probability of `x_1` knowing that `x_0` is whatever it is, the probability of `x_2` knowing that `x_1` and `x_0` are whatever they are, the probability of `x_3` knowing that `x_2`, `x_1`, and `x_0` are whatever they are, and so on and so forth.
But this increases our computational complexity exponentially as `len(x)` grows.
MIMIC solves this problem by limiting the learning, sacrificing perfect structure for performance.
MIMIC computes only pairwise conditional probabilities (using the current `population`), i.e. `p(x_2 | x_1)` instead of `p(x_2 | x_1, x_0)`, and `p(x_3 | x_2)` instead of `p(x_3 | x_2, x_1, x_0)`.
p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1)*p(x_3 | x_2) ...
This decreases the accuracy of our estimation. The amount of decrease is rigorously measured by the Kullback–Leibler divergence.
The divergence between the perfect joint probability distribution and MIMIC's estimation is inevitable, but we will try to minimize this divergence by finding the best permutation.
The permutation is the limited structure that our algorithm learns.
Some variables have more impact on a given variable than others do.
In the k-Colouring Problem, neighbouring vertices are intuitively more impactful on each other. If one is blue, the other is definitely not blue. Mutually independent variables have zero impact on each other.
The idea is to compute the `p(x_i | x_j)`s in an order in which `x_j` has the most impact on `x_i`.
So, instead of computing
p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_1)*p(x_3 | x_2) ...
we can compute (for example)
p(x) = p(x_0)*p(x_2 | x_0)*p(x_1 | x_2)*p(x_3 | x_2) ...
Given that `x_0` and `x_2` are more dependent on each other, this would increase the accuracy of our estimation.
The amount of impact two variables have on each other is rigorously measured by mutual information. It is a measure of the mutual dependence between the two variables. In other words, it quantifies the amount of information obtained about one random variable by observing the other random variable.
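For two discrete variables this is I(x_i; x_j) = Σ p(x_i, x_j) · log( p(x_i, x_j) / (p(x_i)·p(x_j)) ), estimated from the current population. A minimal sketch (the helper name is mine, not the repo's):

```python
import math
from collections import Counter

def mutual_information(population, i, j):
    """Empirical mutual information between variables x_i and x_j.

    population -- list of candidate solutions, each a sequence of colours
    """
    n = len(population)
    count_i = Counter(x[i] for x in population)            # marginal counts of x_i
    count_j = Counter(x[j] for x in population)            # marginal counts of x_j
    count_ij = Counter((x[i], x[j]) for x in population)   # joint counts of (x_i, x_j)

    mi = 0.0
    for (a, b), c_ab in count_ij.items():
        p_ab = c_ab / n
        # p_ab / (p_a * p_b) == c_ab * n / (count_i[a] * count_j[b])
        mi += p_ab * math.log(c_ab * n / (count_i[a] * count_j[b]))
    return mi
```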
What's better than a path/chain? A Tree!
What if `x_0` is so impactful that we would be better off with just
p(x) = p(x_0)*p(x_1 | x_0)*p(x_2 | x_0)*p(x_3 | x_0) ...
This would be a star in a Dependency Tree.
Dependency Trees are Bayesian Networks or Dependency Graphs where every node has exactly one parent (except the `root`, which has `None`).
This improves our estimation further without any loss in performance.
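One simple way to represent such a tree is a parent map (illustrative, not necessarily the repo's data structure); the star above, rooted at `x_0`, would be:

```python
# parent[v] is the parent of vertex v; the root has parent None
parent = {0: None, 1: 0, 2: 0, 3: 0}
# encodes: p(x) = p(x_0) * p(x_1|x_0) * p(x_2|x_0) * p(x_3|x_0)
```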
Instead of computing the best permutation, I have chosen to compute the best tree. It may be hard to believe, but it was easier for me to implement it.
To build a `model` from a sample `population`, we first create a Mutual Information Graph `gmi`: a complete graph in which each edge is weighted by the mutual information of its two endpoints.
We then compute the best tree by finding the Maximum Spanning Tree of `gmi`.
Maximizing M.I. ≡ minimizing entropy ≡ minimizing KL divergence
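A minimal sketch of both steps, using Prim's algorithm with the standard-library binary heap on negated weights (the repo itself uses a Fibonacci Heap; `mutual_information` is the helper sketched earlier, and the other names are illustrative):

```python
import heapq

def best_tree(population, n_vars):
    """Build the complete mutual-information graph and return a maximum
    spanning tree as a parent map {vertex: parent}, rooted at vertex 0."""
    # gmi[i][j] = mutual information between x_i and x_j
    gmi = [[0.0] * n_vars for _ in range(n_vars)]
    for i in range(n_vars):
        for j in range(i + 1, n_vars):
            gmi[i][j] = gmi[j][i] = mutual_information(population, i, j)

    # Prim's algorithm; weights are negated so the min-heap pops the max-MI edge
    parent = {0: None}
    heap = [(-gmi[0][v], 0, v) for v in range(1, n_vars)]
    heapq.heapify(heap)
    while len(parent) < n_vars:
        _, u, v = heapq.heappop(heap)
        if v in parent:
            continue                      # v was already attached via a better edge
        parent[v] = u
        for w in range(n_vars):
            if w not in parent:
                heapq.heappush(heap, (-gmi[v][w], v, w))
    return parent
```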
In the process of model-building, we must deduce the empirical probabilities for each variable `x_i`.
This can be easily computed as follows: `p(x_i = c)` is the percentage of `x`s in our `population` with `x_i = c`.
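A minimal sketch of this marginal, plus the conditional version needed later for sampling (names are illustrative):

```python
def empirical_probability(population, i, c):
    """p(x_i = c): fraction of candidates whose i-th variable has colour c."""
    return sum(1 for x in population if x[i] == c) / len(population)

def empirical_conditional(population, i, c, j, d):
    """p(x_i = c | x_j = d), estimated from the candidates where x_j = d.

    Assumes at least one candidate in the population has x_j = d.
    """
    matching = [x for x in population if x[j] == d]
    return sum(1 for x in matching if x[i] == c) / len(matching)
```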
Generating samples from our `model` (Dependency Tree) is as you would think.
We choose a colour for the `root` based on its empirical probability `p(root)`.
For every other node `v`, we choose a colour based on its empirical conditional probability `p(v | parent(v))`.
Note that `parent(v)`'s colour should be chosen before `v`'s.
This constraint is satisfied by BFS, DFS, and many other tree traversal algorithms. I chose DFS for its simplicity of implementation.
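A minimal sketch of drawing one candidate via an iterative DFS over the parent map built above, using the empirical probability helpers sketched earlier (all names are illustrative):

```python
import random

def sample(parent, colours, population):
    """Draw one candidate solution from the dependency tree.

    parent  -- {vertex: parent} map; the root has parent None
    colours -- list of admissible colours, e.g. list(range(k))
    """
    # invert the parent map into children lists and find the root
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    x = {}
    # colour the root from its marginal distribution p(root)
    weights = [empirical_probability(population, root, c) for c in colours]
    x[root] = random.choices(colours, weights=weights)[0]

    # DFS guarantees every vertex is coloured only after its parent
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children[u]:
            weights = [empirical_conditional(population, v, c, u, x[u]) for c in colours]
            x[v] = random.choices(colours, weights=weights)[0]
            stack.append(v)

    return [x[v] for v in sorted(x)]
```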