Commit: Surrogate Optim. Neural Net Hyperparameter Search
Showing 6 changed files with 654 additions and 3 deletions.
@@ -3,5 +3,3 @@
**/*.mp4
.vscode/
output/
10. Surrogate Optimization.ipynb
text_generation.ipynb
@@ -0,0 +1,153 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Surrogate Optimization - An Introduction\n",
"Surrogate optimization involves the use of an iteratively refined multivariate approximation of the expensive objective function. It is especially useful in cases when evaluating the objective function a large number of times ($\\sim 6000$ times, as is common in genetic algorithms) can be intractable. The purpose of surrogate optimization is to reduce the number of simulations required to obtain approximate solutions for a numerically defined function $f(x)$.\n",
"\n",
"This situation is very common when the objective function is actually a simulation of a nonlinear model, which may not have any derivatives and may be multi-modal. The three identifying characteristics of functions ideal for surrogate optimization are:\n",
"\n",
"* **Costly:** Computing $f(x)$ even once requires a significant amount of time.\n",
"* **Derivative-free:** For some complex functions, derivatives are unavailable because they cannot be computed accurately in a reasonable amount of time. Inaccurate derivative estimates may actually do more harm than good in gradient-based optimization.\n",
"* **Black box, multi-modal:** For black-box functions, it may not be possible to study them closely enough to determine whether they are multi-modal.\n",
"\n",
"Surrogate optimization is so called because it uses **surrogate models**. A surrogate model $R(x)$ (or a *response surface model* or *function approximation model*) approximates a continuous function $f(x)$. Examples include radial basis functions, kriging, polynomials, splines and regression models. This response surface is used as a \"surrogate\" (replacement) for the expensive function $f(x)$ in parts of the optimization. This helps reduce the number of points at which $f(x)$ is evaluated, thereby significantly reducing computational cost.\n",
"\n",
"## General Algorithm Flow\n", | ||
"A surrogate optimization algorithm usually proceeds in the following steps:\n", | ||
"\n", | ||
"1. Initialization of the surrogate surface using a space filling method. (Expensive)\n", | ||
"2. Generation of the surrogate response surface including the previously evaluated set of points $A$.\n", | ||
"3. Search the response surface for next evaluation point $x^e$. This requires: a) An optimization search procedure on the surface, and b) A criterion for the optimization search.\n", | ||
"4. Evaluate the expensive function $f(x)$ for $x=x^e$ and store the evaluation in the set of evaluated points $A$. (Expensive)\n", | ||
"5. If number of evaluations $<$ max evaluations, goto step 3.\n", | ||
"\n", | ||
"For step 1, the initial points to evluate can be selected using methods like [Latin Hypercube Sampling / Orthogonal Sampling](https://en.wikipedia.org/wiki/Latin_hypercube_sampling).\n", | ||
"\n", | ||
"## Stochastic Response Surface (SRS) Method\n", | ||
"**Problem Definition:** Given $D$ a compact set in $\\mathbb{R}^d$ and let $f: D\\to\\mathbb{R}$ be a deterministic continuous function. The **global optimization problem (GOP)** is to find $x\\in D$ such that $f(x)<f(x') \\quad\\forall x'\\in D$. For simplicity we assume that teh domain $D$ is a compact hypercube in $\\mathbb{R}^d$. Function $f$ is treated as a black box that results from an expensive simulation and the deriviatives of $f$ are not avaialable. The algorithms described below were developed and published in the following two papers:\n", | ||
"\n", | ||
"* [R.G. Regis, C.A. Shoemaker (2007):A Stochastic Radial Basis Function Method for the Global Optimization of Expensive Functions](https://pubsonline.informs.org/doi/abs/10.1287/ijoc.1060.0182)\n", | ||
"* [R.G. Regis, C.A. Shoemaker (2009): Parallel Stochastic Global Optimization Using Radial Basis Functions](https://pubsonline.informs.org/doi/10.1287/ijoc.1090.0325)\n", | ||
"\n", | ||
"### SRS Method\n", | ||
"The SRS method is iterative, and at each iteration the response surface model is updated and one point is selected for function evaluation from a set of randomly generated points, called candidate points. Given that $f$ has a unique global minimizer $x^*\\in D$, and the random candidate points and the probability distribution that generate them satisfy some suitable conditions, then SRS applied to $f$ on $D$ converges almost surely to $x^*$. Given $n$ the number of previosuly evaluated points, $A_n$ the set of previously evaluated points, and $s_n(x)$ the response surface model after $n$ function evalautions, the algorithm is described below:\n", | ||
"\n", | ||
"* **Inputs:**\n", | ||
" * $f$ the continuous real-valued function defined on a compact hypercube $D\\in\\mathbb{R}^d$\n", | ||
" * A response surface model, e.g. radial basis functions or neural networks\n", | ||
" * Initial evaluation points $I = (x_1, \\ldots, x_{n_0})$ (generated by using for example a space-filling method)\n", | ||
" * $t$, the number of candidate points in each iteration. $t\\sim \\mathcal{O}(d)$\n", | ||
" * $N_{max}$, the maximum number of function evaluations allowed\n", | ||
"* **Output:** The best decision point encountered by the algorithm\n", | ||
"* **Steps:**\n", | ||
" 1. *Evaluate the function* $f$ at each point in $I$. Set $n=n_0$ and $A_n = I$. Let $x_n$ be the point in $A_n$ with the best function value.\n", | ||
" 2. While $(n<N_{max}$:\n", | ||
" \n", | ||
" a. *Fit / update response surface model* $s_n(x)$ using data points $B_n = \\{(x_i, f(x_i)): i=1, \\ldots, n\\}$\n", | ||
"\n", | ||
" b. *Randomly generate $t$ candidate points* $\\Omega_n = \\{y_{n,1}, \\ldots, y_{n,t}\\}$ in $\\mathbb{R}^d$. For each $j=1, \\ldots, t$, if $y_{n,j}\\notin D$, then replace $y_{n,j}$ by the nearest point in $D$.\n", | ||
"\n", | ||
" c. *Select the next function evaluation point* using the information from the surface response model $s_n(x)$ and the data points $B_n$ to select the evaluation point $x_{n+1}$ deterministically from the $t$ candidate points in $\\Omega_n$.\n", | ||
"\n", | ||
" d. *Evaluate the function* $f$ at the new point $x_{n+1}$\n", | ||
"\n", | ||
" e. *Update information* $A_{n+1} = A_n\\cup x_{n+1},\\quad B_{n+1} = B_{n}\\cup \\{x_{n+1}, f(x_{n+1})\\}$. Let $x_{n+1}^*$ be the point in $A_{n+1}$ with the best function value. Reset $n = n+1$.\n", | ||
" \n", | ||
" 3. *Return the best solution* found, $x_{N_{MAX}}^*$\n", | ||
"\n", | ||
"For the steps 2.2, generation of candidate points, various methods can be used to generate the candidates, with 2 examples as:\n", | ||
"\n", | ||
"1. Uniform candidates: generated uniformly at random throughout $D$, referred to as type U candidate points.\n", | ||
"2. Normal Random candidates: generated in the vicinity of the current best solution $x_n$ obtained by adding random pertubations to $x_n$ that are $\\sim\\mathcal{N}(0, \\sigma^2_n I_d)$, where $\\text{inf}_{n\\geq n_0}\\sigma_n>0$. These are referred to as type N candidate points.\n", | ||
"\n", | ||
"### Metric SRS (MSRS)\n", | ||
"Metric SRS is a special case of SRS where the next point of function evaluation is chosen from the candidates based on the best weighted score for two criteria: estimated function value obtained from the response surface model, and the minimum distance from previously evaluated points. There can be two further versions of MSRS:\n", | ||
"\n", | ||
"1. Global MSRS: Global optimization versions\n", | ||
"2. Multistart Local MSRS: Local optimization using many intiali solutions started on a parallel computer to reach mutliple local optimums at the same time.\n", | ||
"\n", | ||
"Every candidate point in MSRS is given a score between 0 and 1, with 0 being given to the most desirable point. Ideally a good candidate should have a low estimated function value and should be far away from the previously evaluated points. To this extent, a distance metric $D\\in\\mathbb{R}^d$ is defined along with a set of nonnegative weights $\\{(w_n^R, w_n^D):n=n_0, n_0+1, \\ldots\\}$ such that $w_n^R+w_n^D=1, \\forall n\\geq n_0$ for the response surface and the distance criteria for $x\\in\\Omega_n$, the set of candidate points. Because of these changes, the step 2.c in the SRS method changes to the steps below:\n", | ||
"\n", | ||
"* *Estimate the function value of candidate points:* For each $x\\in\\Omega_n$, compute $s_n(x), s_n^{max}$ and $s_n^{min}$.\n", | ||
"* *Determine the minimum distance from previously evaluated points:* For each $x\\in\\Omega_n$, compute $\\Delta_n(x) = min_{1\\geq i\\geq n}D(x, x_i)$. Also compute $\\Delta_n^{max}$ and $\\Delta_n^{min}$.\n", | ||
"* *Compute the score $V_n^R$ for the Response Surface Criterion:* For each $x\\in\\Omega_n$, compute $V_n^R(x) = \\frac{s_n(x)-s_n^{min}}{s_n^{max}-s_n^{min}}$. If $s_n^{max}=s_n^{min}$, then $V_n^R(x) = 1$.\n", | ||
"* *Compute the score $V_n^D$ for the Distance Criterion:* For each $x\\in\\Omega_n$, compute $V_n^D(x) = \\frac{\\Delta_n^{max}-\\Delta_n(x)}{\\Delta_n^{max}-\\Delta_n^{min}}$. If $\\Delta_n^{max}=\\Delta_n^{min}$, then $V_n^D(x) = 1$.\n", | ||
"* *Compute the weighted score:* For each $x\\in\\Omega_n$, compute $\\mathcal{W}_n(x) = w_n^RV_n^R(x)+w_n^DV_n^D(x)$\n", | ||
"* *Select the next evaluation point:* Let $x_{n+1}$ be the point in $\\Omega_n$ that minimizes $\\mathcal{W}_n$.\n", | ||
"\n", | ||
"The MSRS apprach has teh following advantages:\n", | ||
"* Easy to implement compared to alternative optimization methods.\n", | ||
"* Avoids solving any difficult optimization subproblems.\n", | ||
"* Selection of points is based on which candidate points minimize $\\mathcal{W}_n$. The guiding principla here is the selection of points that fulfill dual goals.\n", | ||
"* If $w_n^R=0$, this is performing global search since selected points are ar away from previosuly evaluated points. If $w_n^R = 1$, then this is performing local search since the search is for the candidate which is minimum on the surrogate surface. This parameter can be changed over the course of the optimization to move from one approach to the other (for example starting with global search and then slowly converting it to more local search)." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## DYCORS Candidate Generation Algorithm\n", | ||
"DYCORS (DYnamic COordinate search using Response Surface models) is a framework for bound constrianed HEB (High dimensional, Expensive, Black-box) problems developed by [R. Regis, C.A. Shoemaker (2013)](https://www.tandfonline.com/doi/abs/10.1080/0305215X.2012.687731). On top of keeping a surrogate model of the objective function, DYCORS uses a dynamic coordinate search strategy for generating trial solutions (similar to the DDS Algorithm). The evaluation point is selected by perturbing only a subset of the coordinates of the current best solution. Below is presented the DYCORS-LMSRBF extension, which is the DYCORS concept applied to the LMSRBF method.\n", | ||
"\n", | ||
"DYCORS differes from MSRS in the following ways:\n", | ||
"* In MSRS, a candidate point is generated by applying normally distributed random pertubations on all coordinates ofthe current $x_{best}$.\n", | ||
"* In DYCORS, a candidate point is generated by applying normally distributed pertubations on only *some* of the coordinates of $x_{best}$.\n", | ||
"* In DYCORS, similar to DDS, the corrdinates to be perturbed are randomly selected, and the number of coordinates pertubed decreases with the number of function evaluations.\n", | ||
"\n", | ||
"### Inputs for DYCORS\n", | ||
"The additional inputs required for DYCORS are:\n", | ||
"* A strictly decreasing function $\\varphi(n)$ defined for all positive integers $n_0\\geq n\\geq N_{max}-1$, with values in $[0,1]$\n", | ||
"* The initial step size $\\sigma_{init}$ and the minimum step size $\\sigma_{min}$.\n", | ||
"* (optional) The tolerance for the number of consecutive failed iteration $\\mathcal{T}_{fail}$ and the threshold for the number of consecutive successful iterations $\\mathcal{T}_{success}$.\n", | ||
"\n", | ||
"### Steps for DYCORS\n", | ||
"The steps until 2.a remain the same as before. Below are the steps 2.b onwards.\n", | ||
"\n", | ||
"2. While iterations are not finished\n", | ||
"\n", | ||
" a. ...\n", | ||
" \n", | ||
" b. *Determine probability of perturbing a coordinate*, calculate $p_{select} = \\varphi(n)$.\n", | ||
"\n", | ||
" c. *Generate multiple candidate points*: Generate $\\Omega_n = \\{y_{n,1}, \\ldots, y_{n ,t}\\}$ as follows. For $j=1, \\ldots, t$ do:\n", | ||
" * Select coordinates to perturb: Generate $d$ uniform random numbers $w_1, \\ldots, w_d\\in [0,1]$. Let $I_{perturb}=\\{i:w_i<p_{select}\\}$. If $I_{perturb}=\\phi$, then select $j$ uniformly at random and set $I_{perturb} = \\{j\\}$.\n", | ||
" * Generate candidate point: Generate $y_{n,j} = x_{best}+z$ where $z^{(i)}=0\\forall i\\notin I_{perturb}$ and $z^{(i)}$ is a realization of $\\mathcal{N}(0, \\sigma_n)\\forall i\\in I_{perturb}$.\n", | ||
" * Ensure candidate point is in domain: If $y_{n,j}\\notin D$, then replace it by a point in $D$ obtained by performing successive reflection of $y_{n,j}$ about the closest point on the boundary of $D$.\n", | ||
" \n", | ||
" d. Select next evaluation point based on some criterion $x_{n+1}=$`select_evaluation_point`$(\\Omega_n, \\mathcal{B_n}, s_n(x))$.\n", | ||
"\n", | ||
" e. Perform function evaluation $f(x_{n+1})$\n", | ||
"\n", | ||
" f. Update counters: If $f(x_{n+1}<f_{best}$, then reset $C_{success}+=1$ and $C_{fail}=0$. Otherwise, reset $C_{fail}+=1$ and $C_{success}=0$.\n", | ||
"\n", | ||
" g. Update step size: $[\\sigma_{n+1}, C_{success}, C_{fail}]=$`adjust_step_size`$(\\sigma_n, C_{success}, \\mathcal{T}_{success}, C_{fail}, \\mathcal{T}_{fail})$\n", | ||
"\n", | ||
" h. Update best solution if required, and update $\\mathcal{A}_{n+1} = \\mathcal{A}_n \\cup \\{x_{n+1}\\}$ and $n = n+1$\n", | ||
"\n", | ||
"Note that $I_{perturb}$ changes every time a new candidate is sampled, to allow for diversity. A large number of possible criteria can be chosen to select the next evaluation point from the candidate points, and hence this is specified using the function `select_evaluation_point()`. $C_{success}$ and $C_{fail}$ are the number of consecutive successful and failed iterations respectively. At every iteration, the step size $\\sigma_n$ is adjusted using `adjust_step_size()`.\n", | ||
"\n", | ||
"In the case of DYCORS-LMSRBF, the `select_evaluation_point()` function selects the evaluation point based on a weighted score from two criteria: estimated function value from the RBF surrogate (RBF criterion); and minimum distance from previously evaluated points (distance criterion)." | ||
]
}
],
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": 3
},
"orig_nbformat": 2
},
"nbformat": 4,
"nbformat_minor": 2
}