-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
why is the Optuna CMA-ES sampler better than my custom cmaes code? #163
Comments
@FlorinAndrei Thank you for using cmaes library! Even when you have specified the |
In addition to @c-bata's comment, using CMA-ES with Margin for problems with only binary variables is not appropriate. |
@c-bata If I specify the TPESampler explicitly in Optuna, it gets stuck very quickly in a local extreme and it does not make much progress after that, on this problem. Also, I have not seen any message from Optuna that may indicate it switches the sampler. In fact, I've tried all samplers available in Optuna, and the best performance I get is with CMA-ES, which is what prompted me to use your library directly. @nomuramasahir0 With Optuna, I've tried the CMA-ES sampler with and without margin. Without margin, it makes essentially no progress. When I enable the margin, it performs as shown in the Optuna plot, which is the best performance I've seen yet for this problem. So I am still unable to explain or replicate the good performance I get from Optuna with the CMA-ES sampler. I will try the code example you've indicated. |
@c-bata @nomuramasahir0 I looked at the CMA-ES sampler in the Optuna code and compared it with my code. I've debugged the
With these changes applied, my code beats everything I've seen so far for this binary selection problem, including Optuna with CMA-ES, and a genetic algorithm I've tried, in terms of the best objective function value found. And it's very fast. My new code is included at the end. I will try (1+1)-CMA-ES with Margin later today and I'll compare it with what I have so far. New code: def cma_objective(fs):
features_use = ['const'] + [f for i, f in enumerate(features_select) if fs[i,] == 1]
lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
return lin_mod.bic
features_select = [f for f in X.columns if f != 'const']
cma_bounds = np.tile([0, 1], (len(features_select), 1))
cma_steps = np.ones(len(features_select))
n_max_resampling = 10 * len(features_select)
optimizer = CMAwM(
mean=np.full(len(features_select), 0.5),
sigma=1 / 6,
bounds=cma_bounds,
steps=cma_steps,
n_max_resampling=n_max_resampling,
)
pop_size = optimizer.population_size
gen_max = 1000
best_value = np.inf
best_gen = 0
best_sol_raw = None
history_values = np.full((gen_max,), np.nan)
history_values_best = np.full((gen_max,), np.nan)
rs = np.random.RandomState(seed=0)
for generation in tqdm(range(gen_max)):
best_value_gen = np.inf
sol = []
solutions = []
vals = np.full((pop_size,), np.nan)
for i in range(optimizer.population_size):
seed = rs.randint(1, 2**16) + generation
optimizer._rng.seed(seed)
fs_for_eval, fs_for_tell = optimizer.ask()
solutions.append(fs_for_eval)
value = cma_objective(fs_for_eval)
vals[i] = value
sol.append((fs_for_tell, value))
optimizer.tell(sol)
best_value_gen = vals.min()
if best_value_gen < best_value:
best_value = best_value_gen
best_gen = generation
best_sol_raw = solutions[np.argmin(vals)]
print(f'gen: {best_gen:5n}, new best objective: {best_value:.4f}')
history_values[generation] = best_value_gen
history_values_best[generation] = best_value
if optimizer.should_stop():
break
gen_completed = generation |
Thank you for the detailed situation. Maybe I've got the situation.
Yes, CMA with Margin is clearly better than the vanilla CMA (i.e. without Margin), but CMA with Margin itself is not originally developed for binary-only problems, and then the effectiveness for such problems has not been verified so far. I'm not sure, but the performance may be improved by initializing the optimizer itself (i.e., initializing the mean vector randomly, and the step-size), as in BIPOP-CMA-ES (example code), when restart occurs (i.e.,
Great. I look forward to seeing the new result. |
I've tried (1+1)-CMA-ES. It's tricky to do a direct comparison with CMA-ES with Margin, because the (1+1) code is quite different (generation count is not exposed, etc) and I didn't have time to dig too deeply into it. But I've set it up such that both codes run in about the same time, approx 1 min 30 sec on my Ryzen 7 5800X3D machine. Bottom line is: (1+1) is not able to find the same best value. It's actually worse than sequential search. It's quite possible I made a mistake, since the (1+1) library is not written for general consumption. This is a minimization problem - lower objective is better. Here are the performance results from various algorithms:
Comparison between CMA-ES with margin, and (1+1)-CMA-ES. I've adjusted the X axis such that it represents actual time (as opposed to generations, etc): (1+1)-CMA-ES code: import cma.objective_function.binary as f_bin
import cma.optimizer.cmaeswm_elitist as cma_e
import cma.util.sampler as sampler
import cma.util.log as logger
from cma.objective_function.base import *
features_select = [f for f in X.columns if f != 'const']
dimensions = len(features_select)
# max_eval = dimensions ** 3
max_eval = 20000
discrete_space = np.tile(np.arange(0, 2, 1), (dimensions, 1))
class Cma1p1Obj(ObjectiveFunction):
minimization_problem = True
def __init__(self, d, target_eval=None, max_eval=1e4):
self.d = d
if target_eval is None:
target_eval = self.d
super(Cma1p1Obj, self).__init__(target_eval, max_eval)
def __call__(self, fs):
self.eval_count += len(fs)
features_use = ['const'] + [f for i, f in enumerate(features_select) if fs[0, i] == 1]
lin_mod = sm.OLS(y, X[features_use], hasconst=True).fit()
evals = np.full((1), lin_mod.bic)
self._update_best_eval(evals)
return evals
objective = Cma1p1Obj(dimensions, max_eval=max_eval)
init_m, init_sigma = f_bin.initial_setting_for_gaussian(objective)
s = sampler.Sampler(objective, 1)
cma1p1_log_file = 'cmaes-bin-1p1.csv'
l = logger.DataLogger(cma1p1_log_file)
optimizer = cma_e.CMAESwM_elitist(
d=dimensions,
discrete_space=discrete_space,
sampler=s,
m=init_m,
sigma=init_sigma,
margin=None,
min_problem=objective.minimization_problem,
postprocess=True,
)
_, best_result, _ = optimizer.run(s, logger=l, verbose=False)
print(f'best result: {best_result}') If you want to try it yourself, this is the repo, run the main notebook, it only takes a few minutes, I've disabled everything not related to CMA: |
Thank you for the interesting result. The reason may be that the problem is multimodal. By the way, feature selection is an interesting application of CMA-ES. I'll check it when I have some time. |
@nomuramasahir0 Thank you for your explanation. That indeed makes sense of what I'm seeing here. I will add the plain CMA code to my stash of code snippets for feature selection. I am planning to write an article (or maybe a series of articles) for Towards Data Science, on the topic of feature selection, and cmaes will be one of the tools I will present in the articles. Thank you for your help! |
Awesome. Please let me know when the article is published. |
@nomuramasahir0 @c-bata My CMA-ES article went live 30 minutes ago on TDS. Repository: https://github.com/FlorinAndrei/fast_feature_selection |
Hi @FlorinAndrei Masahiro Nomura, Masashi Shibata. |
Awesome, thanks! |
Summary of the Question
I have a dataset with features in the X dataframe, and the target in the y series. I am trying to select the features in X such that, when fitting a model, I reach the extreme of some objective function. Specifically, the model is linear regression, and the objective is BIC (Bayesian Information Criterion) - I'm trying to select the features in X so as to minimize BIC for the model.
X has a very large number of features, so exhaustive search of all feature combinations is not doable. But if you create a list of binary flags
[1, 0, 0, 1, 1, 1, 0, 1, 0, 0, ...]
, one flag for each feature (0 means feature is not selected, 1 means feature is selected), then the problem becomes one of hyperparameter optimization: find the binary flag values that minimize BIC, by selecting the best features.I've tried to use Optuna with
CmaEsSampler()
. I've also tried thecmaes
library directly.For some reason, Optuna with
CmaEsSampler()
finds a better solution (lower BIC), and does not get stuck, but is slow. Thecmaes
library, at least my implementation, only finds a slightly less good solution (slightly higher BIC), and appears to get stuck in a local minimum, but it iterates much faster.I would like to use
cmaes
directly because it's so much faster, but I can't make it overcome the local minimum. What am I missing?Detailed Explanation
Optuna code:
My
cmaes
code, inspired from this link https://github.com/CyberAgentAILab/cmaes/blob/main/examples/cmaes_with_margin_binary.pyFull code - this is the notebook with all the code, both Optuna and
cmaes
, along with other things I've attempted, and all required context (data loading, etc):https://github.com/FlorinAndrei/feature_selection/blob/main/feature_selection.ipynb
Context and Environment
Python 3.11.7
cmaes 0.10.0
optuna 3.5.0
Ubuntu 22.04
Additional Information
Optuna history:
cmaes history:
Why is there so much more variance in the Optuna trials? Why is it able to maintain that variance across many trials? It seems like the Optuna code would continue to find even better combinations if I let it run even more.
The text was updated successfully, but these errors were encountered: