PPO Agent playing LunarLander-v2
This is a trained model of a PPO agent playing LunarLander-v2 using the stable-baselines3 library. Training time was under 4 minutes on a MacBook M1 Pro.
Below is the base code, which scored ~240. For the final submission, I re-trained the model 2-3 times to get a slightly improved result.
Usage (with Stable-baselines3)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.vec_env import SubprocVecEnv
from huggingface_sb3 import load_from_hub
def fast_start_decay(initial_value, final_value, power=2):  # power-law decay
    """
    LR decreases FAST at the beginning, then slowly later.
    power > 1 makes the curve drop faster early.
    """
    def lr_func(progress_remaining):
        # progress_remaining goes 1 -> 0 over training
        t = 1 - progress_remaining  # 0 -> 1
        return initial_value - (initial_value - final_value) * (t ** (1 / power))
    return lr_func
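
# Sanity check of the schedule (values computed from the formula above):
# fast_start_decay(0.001, 0.0001, 2) gives lr = 0.001 at progress_remaining=1.0,
# ~0.00036 at 0.5, and 0.0001 at 0.0 -- most of the decay happens early, as intended.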
if __name__ == "__main__":  # SubprocVecEnv spawns worker processes, so guard the entry point
    env = make_vec_env("LunarLander-v2", n_envs=8, vec_env_cls=SubprocVecEnv)

    model = PPO(
        policy="MlpPolicy",
        env=env,
        n_steps=2048,
        batch_size=64,
        n_epochs=10,
        gamma=0.998,
        gae_lambda=0.97,
        learning_rate=fast_start_decay(0.001, 0.0001, 2),
        clip_range=fast_start_decay(0.8, 0.1, 3),
        ent_coef=0.01,
        tensorboard_log="./ppo_lunarlander_tb/",
    )
    model.learn(total_timesteps=int(8e5), tb_log_name="PPO_LunarLander")
...
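
The snippet above imports load_from_hub but never uses it; a minimal sketch of pulling a trained checkpoint back down and evaluating it could look like the following. The repo_id and filename are hypothetical placeholders, not the actual repo coordinates of this model.

from huggingface_sb3 import load_from_hub
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
from stable_baselines3.common.evaluation import evaluate_policy

# Hypothetical repo coordinates -- replace with the actual model repo and filename
checkpoint = load_from_hub(
    repo_id="<your-username>/ppo-LunarLander-v2",
    filename="ppo-LunarLander-v2.zip",
)
model = PPO.load(checkpoint)

# Evaluate on a fresh single environment
eval_env = make_vec_env("LunarLander-v2", n_envs=1)
mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=10, deterministic=True)
print(f"mean_reward={mean_reward:.2f} +/- {std_reward:.2f}")

To re-train on top of a loaded checkpoint (as was done for the final submission), one standard option in SB3 is model.set_env(env) followed by model.learn(..., reset_num_timesteps=False).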
Evaluation results
- mean_reward on LunarLander-v2 (self-reported): 277.16 +/- 20.55