The OmniSafe Navigation Benchmark for model-based algorithms evaluates the effectiveness of OmniSafe's model-based algorithms across two different environments from the Safety-Gymnasium task suite. For each supported algorithm and environment, we offer the following:
- Default hyperparameters used for the benchmark and scripts that enable result replication.
- Graphs and raw data that can be utilized for research purposes.
- Detailed logs obtained during training.
- Suggestions and hints on fine-tuning the algorithm for achieving optimal results.
Supported algorithms are listed below:
- [NeurIPS 2018] Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models (PETS)
- [CoRL 2021] Learning Off-Policy with Online Planning (LOOP and SafeLOOP)
- [AAAI 2022] Conservative and Adaptive Penalty for Model-Based Safe Reinforcement Learning (CAP)
- [ICML 2022 Workshop] Constrained Model-based Reinforcement Learning with Robust Cross-Entropy Method (RCE)
- [NeurIPS 2018] Constrained Cross-Entropy Method for Safe Reinforcement Learning (CCE)
We highly recommend using Safety-Gymnasium to run the following experiments. To install it on a Linux machine, run:

```bash
pip install safety_gymnasium
```
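After installation, you can quickly check that the environments load correctly. The snippet below is a minimal sanity check using the Gymnasium-style API that Safety-Gymnasium exposes; note that `env.step` returns the safety cost alongside the reward.

```python
import safety_gymnasium

# Create one of the benchmark environments and run a single random step.
env = safety_gymnasium.make('SafetyPointGoal1-v0')
obs, info = env.reset(seed=0)
action = env.action_space.sample()
# Safety-Gymnasium returns the safety cost in addition to the reward.
obs, reward, cost, terminated, truncated, info = env.step(action)
env.close()
```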
You can set the main function of examples/benchmarks/experiment_grid.py as:

```python
if __name__ == '__main__':
    eg = ExperimentGrid(exp_name='Model-Based-Benchmarks')

    # Set up the algorithms.
    model_based_base_policy = ['LOOP', 'PETS']
    model_based_safe_policy = ['SafeLOOP', 'CCEPETS', 'CAPPETS', 'RCEPETS']
    eg.add('algo', model_based_base_policy + model_based_safe_policy)

    # You can use wandb to monitor the experiment.
    eg.add('logger_cfgs:use_wandb', [False])
    # You can use tensorboard to monitor the experiment.
    eg.add('logger_cfgs:use_tensorboard', [True])
    eg.add('train_cfgs:total_steps', [1000000])

    # Set up the environments.
    eg.add('env_id', [
        'SafetyPointGoal1-v0-modelbased',
        'SafetyCarGoal1-v0-modelbased',
    ])
    eg.add('seed', [0, 5, 10, 15, 20])

    # The total number of experiments must be divisible by num_pool;
    # choose num_pool according to your machine's resources.
    eg.run(train, num_pool=5)
```
After that, you can run the following commands to run the benchmark:

```bash
cd examples/benchmarks
python run_experiment_grid.py
```
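If you only want to reproduce a single algorithm rather than the whole grid, a minimal sketch using OmniSafe's high-level `Agent` interface looks like the following (the algorithm and environment names are the same ones used in the grid above):

```python
import omnisafe

# Train one model-based algorithm on one navigation environment.
env_id = 'SafetyPointGoal1-v0-modelbased'
agent = omnisafe.Agent('SafeLOOP', env_id)
agent.learn()
```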
You can set the `path` variable to the output directory produced by examples/benchmarks/experiment_grid.py, for example:

```python
path = '/home/username/omnisafe/omnisafe/examples/benchmarks/exp-x/Model-Based-Benchmarks'
```
You can also plot the results by running the following commands:

```bash
cd examples
python analyze_experiment_results.py
```
For a detailed usage of OmniSafe statistics tool, please refer to this tutorial.
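If you prefer to analyze the results programmatically, a sketch using OmniSafe's statistics tool is shown below. The exact keyword arguments of `draw_graph` may differ between OmniSafe versions, so treat them as assumptions and check the tutorial above.

```python
from omnisafe.common.statistics_tools import StatisticsTools

st = StatisticsTools()
# Point the tool at the experiment-grid output directory, e.g. the `path` above.
st.load_source('./exp-x/Model-Based-Benchmarks')
# Group the curves by algorithm and draw the reward/cost figures.
st.draw_graph(parameter='algo', values=None, compare_num=6, cost_limit=None)
```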
To demonstrate the high reliability of the algorithms implemented, OmniSafe offers performance insights within the Safety-Gymnasium environments. It should be noted that all data are procured under the constraint `cost_limit=1.00`. The results are presented in Table 1 and Figure 1.
Table 1: The performance of OmniSafe model-based algorithms, encompassing both reward and cost, was assessed within the Safety-Gymnasium environments. It is crucial to highlight that all model-based algorithms underwent evaluation following 1e6 training steps.
| Environment | PETS Reward | PETS Cost | LOOP Reward | LOOP Cost | SafeLOOP Reward | SafeLOOP Cost |
|---|---|---|---|---|---|---|
| SafetyCarGoal1-v0 | 33.07 ± 1.33 | 61.20 ± 7.23 | 25.41 ± 1.23 | 62.64 ± 8.34 | 22.09 ± 0.30 | 0.16 ± 0.15 |
| SafetyPointGoal1-v0 | 27.66 ± 0.07 | 49.16 ± 2.69 | 25.08 ± 1.47 | 55.23 ± 2.64 | 22.94 ± 0.72 | 0.04 ± 0.07 |

| Environment | CCEPETS Reward | CCEPETS Cost | RCEPETS Reward | RCEPETS Cost | CAPPETS Reward | CAPPETS Cost |
|---|---|---|---|---|---|---|
| SafetyCarGoal1-v0 | 27.60 ± 1.21 | 1.03 ± 0.29 | 29.08 ± 1.63 | 1.02 ± 0.88 | 23.33 ± 6.34 | 0.48 ± 0.17 |
| SafetyPointGoal1-v0 | 24.98 ± 0.05 | 1.87 ± 1.27 | 25.39 ± 0.28 | 2.46 ± 0.58 | 9.45 ± 8.62 | 0.64 ± 0.77 |
Figure 1: Training curves in Safety-Gymnasium environments, covering classical reinforcement learning algorithms and safe learning algorithms mentioned in Table 1.
(Figure panels: SafetyCarGoal1-v0 and SafetyPointGoal1-v0.)
In our experiments, we found that some hyperparameters are important for the performance of the algorithms:

- `action_repeat`: the number of times each action is repeated in the environment.
- `init_var`: the initial variance of the Gaussian distribution used to sample actions.
- `temperature`: the temperature factor for rescaling rewards in planning.
- `cost_temperature`: the temperature factor for rescaling costs in planning.
- `plan_horizon`: the planning horizon.
We have run some experiments to show the effect of these hyperparameters, and we log the best configuration for each algorithm in each environment. You can check them in `omnisafe/configs/model_based`.
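If you want to try different values without editing the YAML files, you can override them through `custom_cfgs` when constructing an agent. The nesting below (`algo_cfgs`/`planner_cfgs`) is an assumption for illustration; check the files in `omnisafe/configs/model_based` for the exact keys.

```python
import omnisafe

# Hypothetical override of the hyperparameters discussed above; verify the
# exact config keys against omnisafe/configs/model_based before running.
custom_cfgs = {
    'train_cfgs': {'total_steps': 1000000},
    'algo_cfgs': {'action_repeat': 5},
    'planner_cfgs': {'plan_horizon': 7, 'init_var': 4.0},
}
agent = omnisafe.Agent('PETS', 'SafetyPointGoal1-v0-modelbased', custom_cfgs=custom_cfgs)
agent.learn()
```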
In our experiments, we found that `action_repeat=5` always performs better than `action_repeat=1` in the navigation tasks for the CEM-based methods. The change in reward or observation per single action in a navigation task may be too small; `action_repeat=5` enlarges these changes and makes the dynamics model easier to train.
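The sketch below shows what action repetition amounts to (it is not OmniSafe's internal wrapper): the same action is applied several times and the rewards and costs are accumulated, so each transition the dynamics model sees is larger.

```python
def step_with_repeat(env, action, repeat=5):
    """Apply the same action `repeat` times, accumulating reward and cost."""
    total_reward, total_cost = 0.0, 0.0
    for _ in range(repeat):
        obs, reward, cost, terminated, truncated, info = env.step(action)
        total_reward += reward
        total_cost += cost
        if terminated or truncated:
            break
    return obs, total_reward, total_cost, terminated, truncated, info
```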
Importantly, we found that a high initial variance such as `init_var=4.0` performs better than a low one such as `init_var=0.01` in the PETS-based algorithms, whereas the opposite holds for policy-guided algorithms such as LOOP: LOOP needs a low variance such as `init_var=0.01` to keep the planning policy close to the neural policy.
Besides, the hyperparameters `temperature` and `cost_temperature` are also important. LOOP and SafeLOOP should fine-tune these two parameters for each environment, since they control how strongly the reward (and cost) magnitudes contribute to the planner's action mean and variance.
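To make that effect concrete, here is a minimal NumPy sketch (not OmniSafe's implementation) of an exponentially weighted CEM-style update: the temperature rescales the returns before they are turned into sample weights, so a larger temperature lets high-return action sequences dominate the new mean and variance, while a smaller temperature averages more uniformly.

```python
import numpy as np

def weighted_cem_update(returns, actions, temperature=10.0):
    """Temperature-weighted refit of the planner's sampling distribution.

    returns: array of shape (num_samples,)
    actions: array of shape (num_samples, horizon, act_dim)
    """
    # Subtract the max before exponentiating for numerical stability.
    scores = temperature * (returns - returns.max())
    weights = np.exp(scores)
    weights /= weights.sum()
    # Weighted mean and variance over the sampled action sequences.
    mean = np.einsum('n,nha->ha', weights, actions)
    var = np.einsum('n,nha->ha', weights, (actions - mean) ** 2)
    return mean, var
```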
Moreover, non-policy-guided algorithms such as PETS need a high `plan_horizon`, while policy-guided algorithms such as LOOP need only a low `plan_horizon` in MuJoCo environments; for a fair comparison, we use the same planning horizon for all algorithms in the navigation tasks.
If you find that other hyperparameters perform better, please feel free to open an issue or pull request.
| Algorithm | `action_repeat` | `init_var` | `plan_horizon` |
|---|---|---|---|
| PETS | 5 | 4.0 | 7 |
| LOOP | 5 | 0.01 | 7 |
| SafeLOOP | 5 | 0.075 | 7 |
| CCEPETS | 5 | 4.0 | 7 |
| CAPPETS | 5 | 4.0 | 7 |
| RCEPETS | 5 | 4.0 | 7 |
However, there are some differences between these algorithms; the temperature settings are listed below.

LOOP:

| Environment | `temperature` |
|---|---|
| SafetyPointGoal1-v0 | 10.0 |
| SafetyCarGoal1-v0 | 10.0 |

SafeLOOP:

| Environment | `temperature` | `cost_temperature` |
|---|---|---|
| SafetyPointGoal1-v0 | 10.0 | 100.0 |
| SafetyCarGoal1-v0 | 10.0 | 100.0 |