Support for Hyperparameter Tuning #125

kifuman · 2023-08-21T23:46:56Z

kifuman
Aug 21, 2023

thank you for creating this awesome library!
Is there any chance to get support for hyperparameter tuning?

Best regards,
Fabi

Toni-SM · 2023-08-23T09:48:30Z

Toni-SM
Aug 23, 2023
Maintainer

I know that there are currently no examples of hyperparameter tuning in the library documentation.
However, I think that it could be easily integrated with Optuna, for example.

It is a pending topic (which I have had in mind for a long long long time 😅🙈) for future releases.
I'll see if I can find some time for it :)

0 replies

kifuman · 2023-08-23T14:42:33Z

kifuman
Aug 23, 2023
Author

Thanks for your quick reply!

Is there any chance you can help me out with a small example for DDPG in a gymnasium environment?
e.g. for optimizing the learning rates of actor and critic

2 replies

Toni-SM Sep 15, 2023
Maintainer

Hi @kifuman

Sorry for late response, I'm visiting my native country where I only have limited internet access.

If you are still interested in this feature, I will try to make an example when I return home.

kifuman Sep 15, 2023
Author

Hi @Toni-SM
Yeah I am still interested. Your help would be much appreciated. TY!

Toni-SM · 2023-10-07T21:27:26Z

Toni-SM
Oct 7, 2023
Maintainer

Hi @kifuman

In the hyperparameter_optimization.zip file you can find a script that implement hyperparameter tuning, using Optuna, for the gymnasium Pendulum environment with DDPG. The script optimize the batch_size, learning_rate (for actor and critic) and the discount_factor. All the skrl screen output is disabled (set "disable_progressbar" : False to enable training progress visualization).

Note that it is necessary to use skrl-v1.1.0 (develop branch) since it uses the StepTrainer (formerly ManualTrainer) to control the training loop step-by-step and get the instantaneous reward to compute the optimization objective (metric).

The script will generate a database file that can be loaded with the Optuna Dashboard for visualization (check the Optuna documentation for information about the Optuna Dashboard) as follows:

optuna-dashboard sqlite:///hyperparameter_optimization.db

Feel free to continue the discussion if you have any questions.
From this discussion and with your feedback, I will include the hyperparameter optimization in the library documentation :)

hyperparameter_optimization.zip

import optuna
import logging
import numpy as np

# disable skrl logging
from skrl import logger
logger.setLevel(logging.WARNING)


def objective(trial: optuna.Trial):
    # parameters to optimize
    # https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html
    batch_size = trial.suggest_categorical("batch_size", [64, 128, 256])
    learning_rate = trial.suggest_float("learning_rate", low=1e-5, high=1e-2, log=True)
    discount_factor = trial.suggest_categorical("discount_factor", [0.98, 0.99, 0.999])

    # metrics
    episode_rewards = []
    instantaneous_rewards = []

    # reinforcement learning experiment
    # ---------------------------------

    import gymnasium as gym

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # import the skrl components to build the RL system
    from skrl.agents.torch.ddpg import DDPG, DDPG_DEFAULT_CONFIG
    from skrl.envs.wrappers.torch import wrap_env
    from skrl.memories.torch import RandomMemory
    from skrl.models.torch import DeterministicMixin, Model
    from skrl.resources.noises.torch import OrnsteinUhlenbeckNoise
    from skrl.trainers.torch import StepTrainer
    from skrl.utils import set_seed

    # define models (deterministic models) using mixin
    class Actor(DeterministicMixin, Model):
        def __init__(self, observation_space, action_space, device, clip_actions=False):
            Model.__init__(self, observation_space, action_space, device)
            DeterministicMixin.__init__(self, clip_actions)

            self.linear_layer_1 = nn.Linear(self.num_observations, 400)
            self.linear_layer_2 = nn.Linear(400, 300)
            self.action_layer = nn.Linear(300, self.num_actions)

        def compute(self, inputs, role):
            x = F.relu(self.linear_layer_1(inputs["states"]))
            x = F.relu(self.linear_layer_2(x))
            # Pendulum-v1 action_space is -2 to 2
            return 2 * torch.tanh(self.action_layer(x)), {}

    class Critic(DeterministicMixin, Model):
        def __init__(self, observation_space, action_space, device, clip_actions=False):
            Model.__init__(self, observation_space, action_space, device)
            DeterministicMixin.__init__(self, clip_actions)

            self.linear_layer_1 = nn.Linear(self.num_observations + self.num_actions, 400)
            self.linear_layer_2 = nn.Linear(400, 300)
            self.linear_layer_3 = nn.Linear(300, 1)

        def compute(self, inputs, role):
            x = F.relu(self.linear_layer_1(torch.cat([inputs["states"], inputs["taken_actions"]], dim=1)))
            x = F.relu(self.linear_layer_2(x))
            return self.linear_layer_3(x), {}

    # seed for reproducibility
    set_seed()  # e.g. `set_seed(42)` for fixed seed

    # load and wrap the gymnasium environment.
    # note: the environment version may change depending on the gymnasium version
    try:
        env = gym.make("Pendulum-v1")
    except (gym.error.DeprecatedEnv, gym.error.VersionNotFound) as e:
        env_id = [spec for spec in gym.envs.registry if spec.startswith("Pendulum-v")][0]
        print("Pendulum-v1 not found. Trying {}".format(env_id))
        env = gym.make(env_id)
    env = wrap_env(env)

    device = env.device


    # instantiate a memory as experience replay
    memory = RandomMemory(memory_size=10000, num_envs=env.num_envs, device=device, replacement=False)


    # instantiate the agent's models (function approximators).
    # DDPG requires 4 models, visit its documentation for more details
    # https://skrl.readthedocs.io/en/latest/api/agents/ddpg.html#models
    models = {}
    models["policy"] = Actor(env.observation_space, env.action_space, device)
    models["target_policy"] = Actor(env.observation_space, env.action_space, device)
    models["critic"] = Critic(env.observation_space, env.action_space, device)
    models["target_critic"] = Critic(env.observation_space, env.action_space, device)

    # initialize models' parameters (weights and biases)
    for model in models.values():
        model.init_parameters(method_name="normal_", mean=0.0, std=0.1)


    # configure and instantiate the agent (visit its documentation to see all the options)
    # https://skrl.readthedocs.io/en/latest/api/agents/ddpg.html#configuration-and-hyperparameters
    cfg = DDPG_DEFAULT_CONFIG.copy()
    cfg["exploration"]["noise"] = OrnsteinUhlenbeckNoise(theta=0.15, sigma=0.1, base_scale=1.0, device=device)
    cfg["discount_factor"] = discount_factor
    cfg["batch_size"] = batch_size
    cfg["random_timesteps"] = 100
    cfg["learning_starts"] = 100
    cfg["actor_learning_rate"] = learning_rate
    cfg["critic_learning_rate"] = learning_rate
    # skip logging to TensorBoard and write checkpoints (in timesteps)
    cfg["experiment"]["write_interval"] = 0
    cfg["experiment"]["checkpoint_interval"] = 0

    agent = DDPG(models=models,
                memory=memory,
                cfg=cfg,
                observation_space=env.observation_space,
                action_space=env.action_space,
                device=device)


    # configure and instantiate the RL trainer
    cfg_trainer = {"timesteps": 10000, 
                   "headless": True, 
                   "disable_progressbar": True,
                   "close_environment_at_exit": False}
    trainer = StepTrainer(cfg=cfg_trainer, env=env, agents=[agent])

    # train the agent
    for timestep in range(cfg_trainer["timesteps"]):
        # training step
        next_states, rewards, terminated, truncated, infos = trainer.train(timestep=timestep)
        # storage metrics
        instantaneous_rewards.append(rewards.item())
        if terminated.any() or truncated.any():
            episode_rewards.append(np.sum(instantaneous_rewards))
            instantaneous_rewards = []

    # close the environment
    env.close()

    # ---------------------------------

    return np.mean(episode_rewards)




# https://optuna.readthedocs.io/en/stable/reference/generated/optuna.create_study.html
storage = "sqlite:///hyperparameter_optimization.db"
sampler = optuna.samplers.TPESampler()
direction = "maximize"  # maximize episode reward

study = optuna.create_study(storage=storage,
                            sampler=sampler,
                            study_name="optimization",
                            direction=direction,
                            load_if_exists=True)

study.optimize(objective, n_trials=25)

print(f"The best trial obtains a normalized score of {study.best_trial.value}", study.best_trial.params)

1 reply

berttggg Oct 18, 2023

thank you so much for the nice and detailed example.
You have mentioned that we have to the StepTrainer for this library.
How about the multi_agent training?
StepTrainer does not include the multi_agent training API.

kifuman · 2023-10-15T13:40:42Z

kifuman
Oct 15, 2023
Author

Hi @Toni-SM ,

thank you so much for your example. Below i attached a snippet of my code. At first i create a environment for training and a SequentialTrainer. The trainer trains the agent on the training environemnt. For evaluation i create a seperate environment (necessary as i need to use a different callback here) to which i assign the trainer. I also change the amount of steps for evaluation.

# ....
# this is just a snippet
# more code above
def ddpg_param_objective(trial):

    # ....
    # this is just a snippet
    # more code above

    train_Callback = [trainCallback()]
    env_path = "runs/gem_optuna/CONT-CC-PMSM-v0_Trial_" + str(trial.number)
    env = create_env(directory_path=env_path, callbacks=train_Callback)
    device = env.device

    agent = DDPG(models=models,
                 memory=memory,
                 cfg=cfg,
                 observation_space=env.observation_space,
                 action_space=env.action_space,
                 device=device)

    # configure and instantiate the RL trainer
    cfg_trainer = {"timesteps": nb_training_steps, "headless": True}
    trainer = SequentialTrainer(cfg=cfg_trainer, env=env, agents=[agent])

    # start training
    trainer.train()
    env.close()

    # crete eval env with eval_Callback anc change trainer parameters for evaluation
    eval_Callback = [evalCallback(ref_i_q, ref_i_d)]
    eval_env = create_env(directory_path=env_path, callbacks=eval_Callback)
    trainer.env = eval_env
    trainer.timesteps = eval_total_steps

    # evaluate the trained agent, cumulative_error will be calculated by the eval_callback
    trainer.eval()
    eval_env.close()

    return cumulative_error

sampler = optuna.samplers.TPESampler()
study = optuna.create_study(direction='minimize',
                            sampler=sampler,
                            study_name='ddpg_hypermaram')

study.optimize(ddpg_param_objective, n_trials=5)

print('Number of finished trials: ', len(study.trials))

print('Best trial:', study.best_trial.number)
trial = study.best_trial
print('Value: ', trial.value)
print('Params: ')
for key, value in trial.params.items():
    print(f'    {key}: {value}')

The console output is the following:

[I 2023-10-15 15:21:49,768] A new study created in memory with name: ddpg_hypermaram
C:\Users\X\anaconda3\envs\skrl\lib\site-packages\gymnasium\core.py:311: UserWarning: WARN: env.device to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.device` for environment variables or `env.get_wrapper_attr('device')` that will search the reminding wrappers.
  logger.warn(
C:\Users\X\anaconda3\envs\skrl\lib\site-packages\gymnasium\core.py:311: UserWarning: WARN: env.num_envs to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.num_envs` for environment variables or `env.get_wrapper_attr('num_envs')` that will search the reminding wrappers.
  logger.warn(
C:\Users\X\anaconda3\envs\skrl\lib\site-packages\gymnasium\core.py:311: UserWarning: WARN: env.num_agents to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.num_agents` for environment variables or `env.get_wrapper_attr('num_agents')` that will search the reminding wrappers.
  logger.warn(
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:08<00:00, 118.60it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 817.12it/s]
[I 2023-10-15 15:21:58,899] Trial 0 finished with value: 816.7351988969523 and parameters: {'actor_num_layers': 1, 'neuron_size': 64, 'critic_num_layers': 2, 'critic_neuron_size': 32, 'actor_learning_rate': 1.0472650168715462e-05, 'critic_learning_rate': 1.665081475392472e-06, 'theta': 0.010554676676163215, 'sigma': 0.14793439953731685, 'exploration_end_after_step': 426619}. Best is trial 0 with value: 816.7351988969523.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:08<00:00, 122.33it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 834.28it/s] 
[I 2023-10-15 15:22:07,715] Trial 1 finished with value: 792.6855787223212 and parameters: {'actor_num_layers': 1, 'neuron_size': 32, 'critic_num_layers': 2, 'critic_neuron_size': 32, 'actor_learning_rate': 0.0004055866003670534, 'critic_learning_rate': 0.0005642719630354771, 'theta': 0.01652835376415755, 'sigma': 0.0532246998114288, 'exploration_end_after_step': 475493}. Best is trial 1 with value: 792.6855787223212. 
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:10<00:00, 97.71it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 793.39it/s]
[I 2023-10-15 15:22:18,665] Trial 2 finished with value: 793.9788300500115 and parameters: {'actor_num_layers': 1, 'neuron_size': 64, 'critic_num_layers': 2, 'critic_neuron_size': 64, 'actor_learning_rate': 0.00026974695000980033, 'critic_learning_rate': 0.00021223215517231922, 'theta': 0.018121304040132923, 'sigma': 0.14119734877256893, 'exploration_end_after_step': 471410}. Best is trial 1 with value: 792.6855787223212.
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:08<00:00, 118.39it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 773.05it/s]
[I 2023-10-15 15:22:27,814] Trial 3 finished with value: 578.9403001267382 and parameters: {'actor_num_layers': 1, 'neuron_size': 64, 'critic_num_layers': 2, 'critic_neuron_size': 32, 'actor_learning_rate': 2.4076185189331824e-05, 'critic_learning_rate': 7.105557885361469e-05, 'theta': 0.01934723689494763, 'sigma': 0.06036273091114802, 'exploration_end_after_step': 449298}. Best is trial 3 with value: 578.9403001267382.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:11<00:00, 85.78it/s]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:00<00:00, 754.73it/s] 
[I 2023-10-15 15:22:40,200] Trial 4 finished with value: 272.3939921358844 and parameters: {'actor_num_layers': 1, 'neuron_size': 64, 'critic_num_layers': 3, 'critic_neuron_size': 64, 'actor_learning_rate': 0.0003257410281712553, 'critic_learning_rate': 9.499999871404513e-05, 'theta': 0.0136550838255451, 'sigma': 0.039609598180733514, 'exploration_end_after_step': 409134}. Best is trial 4 with value: 272.3939921358844.
Number of finished trials:  5
Best trial: 4
Value:  272.3939921358844
Params:
    actor_num_layers: 1
    neuron_size: 64
    critic_num_layers: 3
    critic_neuron_size: 64
    actor_learning_rate: 0.0003257410281712553
    critic_learning_rate: 9.499999871404513e-05
    theta: 0.0136550838255451
    sigma: 0.039609598180733514
    exploration_end_after_step: 409134
[skrl:INFO] Closing environment
[skrl:INFO] Environment closed
[skrl:INFO] Closing environment
[skrl:INFO] Environment closed
[skrl:INFO] Closing environment
[skrl:INFO] Environment closed
[skrl:INFO] Closing environment
[skrl:INFO] Environment closed
[skrl:INFO] Closing environment
[skrl:INFO] Environment closed

Don't mind the values for the hyperparameters or the training length, I just created this for demonstration.
I want to close the environments after every trial. My problem is that that this does not seem to be happening, even though I close them in my code after .train() and .eval(). To me it looks like some are closed after the study is finished as the console print shows. Would be awesome if you could help me out on why this is happening, Thanks!

Best regards

0 replies

Toni-SM · 2023-11-05T15:39:35Z

Toni-SM
Nov 5, 2023
Maintainer

Hi @kifuman

Sorry for late replay.

You can disable the skrl's trainer environment closing feature be setting the close_environment_at_exit to False (default: True)

https://skrl.readthedocs.io/en/develop/api/trainers/sequential.html#configuration

1 reply

kifuman Nov 5, 2023
Author

I must have overlooked that, thank you so much for your continuous help!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Support for Hyperparameter Tuning #125

{{title}}

Replies: 5 comments 4 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

Support for Hyperparameter Tuning #125

kifuman Aug 21, 2023

Replies: 5 comments · 4 replies

Toni-SM Aug 23, 2023 Maintainer

kifuman Aug 23, 2023 Author

Toni-SM Sep 15, 2023 Maintainer

kifuman Sep 15, 2023 Author

Toni-SM Oct 7, 2023 Maintainer

berttggg Oct 18, 2023

kifuman Oct 15, 2023 Author

Toni-SM Nov 5, 2023 Maintainer

kifuman Nov 5, 2023 Author

kifuman
Aug 21, 2023

Replies: 5 comments 4 replies

Toni-SM
Aug 23, 2023
Maintainer

kifuman
Aug 23, 2023
Author

Toni-SM Sep 15, 2023
Maintainer

kifuman Sep 15, 2023
Author

Toni-SM
Oct 7, 2023
Maintainer

kifuman
Oct 15, 2023
Author

Toni-SM
Nov 5, 2023
Maintainer

kifuman Nov 5, 2023
Author