I tested different implementations of the PPO algorithm and found some discrepancies among them. I tested each implementation on 56 Atari environments, with five trials per (implementation, environment) pair. The table below shows an environment-wise one-way ANOVA used to determine the effect of implementation source on mean reward. Out of the 56 environments tested, the implementations differed significantly in nine environments with respect to Stable Baselines3, CleanRL, and Baselines (not the Baselines108 variant), as seen in the table.
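For reference, this kind of per-environment one-way ANOVA can be computed with scipy.stats.f_oneway. The sketch below assumes the per-trial mean rewards are already collected in a nested dict; the names and the 0.05 threshold are only illustrative, not the exact analysis script used here.

```python
# Minimal sketch of the environment-wise one-way ANOVA described above.
# Assumes `results[env_id][impl]` holds the per-trial mean rewards
# (five trials per implementation); names and threshold are illustrative.
from scipy.stats import f_oneway

def flag_significant_envs(results, alpha=0.05):
    flagged = {}
    for env_id, per_impl in results.items():
        groups = list(per_impl.values())      # one group of trial rewards per implementation
        f_stat, p_value = f_oneway(*groups)   # H0: all implementations share the same mean reward
        if p_value < alpha:
            flagged[env_id] = (f_stat, p_value)
    return flagged
```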
I believe there are inconsistencies among the implementations that cause the observed environment-dependent discrepancies. For example, I found an inconsistency (i.e., a bug) in Baselines' implementation where the frames per episode did not conform to the 108K limit of the v4 ALE specification, causing mean rewards to differ significantly in some environments. After correcting this, three of the nine environments previously flagged as statistically different were no longer different, as seen in the table under Baselines108. The remaining inconsistencies are likely environment-related, so I am now investigating parts of Stable Baselines3's implementation that might affect a subset of environments (similar to the frames-per-episode issue). I was wondering whether there are any specific differences in Stable Baselines3's implementation that might have contributed to the differences in performance. Any suggestions would be greatly appreciated :)
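For concreteness, one way to enforce the 108K-frame cap uniformly is a plain TimeLimit wrapper around the raw environment. This is only a sketch of the general idea, assuming a NoFrameskip env with the usual preprocessing applied on top; whether it matches the exact patch behind Baselines108 is an assumption.

```python
# Sketch: enforce the 108K ALE frame limit (30 minutes at 60 fps) explicitly.
# Wrapping the raw env means the cap is counted in emulator frames, before any
# frame-skip wrapper is applied on top. Names are illustrative.
import gym
from gym.wrappers import TimeLimit

def make_frame_capped_env(env_id="BreakoutNoFrameskip-v4"):
    env = gym.make(env_id).unwrapped                 # drop the registry's own TimeLimit
    env = TimeLimit(env, max_episode_steps=108_000)  # 108K frames per episode
    # Frame skip, frame stacking, and the rest of the Atari preprocessing would
    # be applied on top of this wrapper, so every implementation truncates
    # episodes at the same point.
    return env
```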
The deprecated value clipping (not used by default, not recommended) in SB3 PPO is done differently compared to Baselines/CleanRL.
Other differences might come from using PyTorch vs. TensorFlow (for instance, the Adam implementation might be slightly different; the same happened for A2C: #110).
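For concreteness, here is a sketch of what such a value-clipping difference can look like in PyTorch: one common variant takes the element-wise maximum of the clipped and unclipped squared errors, while another regresses only on the clipped value prediction. This illustrates the general pattern rather than reproducing either codebase line for line; the variable names are illustrative.

```python
# Sketch of two value-clipping variants for PPO's value loss (PyTorch).
# `values`, `old_values`, `returns` are 1-D tensors from the rollout buffer;
# `clip_coef` is the value-clipping range. Illustrative only.
import torch
import torch.nn.functional as F

def value_loss_max_of_clipped(values, old_values, returns, clip_coef=0.2):
    # Variant A: element-wise max of clipped and unclipped squared errors.
    unclipped = (values - returns) ** 2
    clipped_pred = old_values + torch.clamp(values - old_values, -clip_coef, clip_coef)
    clipped = (clipped_pred - returns) ** 2
    return 0.5 * torch.max(unclipped, clipped).mean()

def value_loss_clipped_only(values, old_values, returns, clip_coef=0.2):
    # Variant B: regress directly on the clipped value prediction (no max).
    clipped_pred = old_values + torch.clamp(values - old_values, -clip_coef, clip_coef)
    return F.mse_loss(clipped_pred, returns)
```

Since value clipping is not used by default in SB3, this should only matter if clip_range_vf was explicitly enabled in these runs. For the optimizer point, explicitly passing identical Adam hyperparameters (learning rate, epsilon, betas) on both sides is one way to rule that difference in or out.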
To Reproduce
Run Command:
The hyperparameters follow those of the original PPO implementation (without LSTM).
ppo_atari.py:
Relevant log output / Error message
No response
System Info
Checklist