# REINFORCE Agent playing CartPole-v1

This is a trained model of a REINFORCE agent playing CartPole-v1, implemented in PyTorch for the Deep Reinforcement Learning Course.

## Algorithm

REINFORCE is a policy gradient method that:

- Directly optimizes the policy π(a|s)
- Uses Monte Carlo sampling to estimate episode returns
- Updates parameters in the direction of higher expected return (see the sketch below)
- Belongs to the family of Policy Gradient methods

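For concreteness, here is a minimal sketch of the Monte Carlo policy-gradient update, assuming log-probabilities are collected during a rollout. The helper name `reinforce_loss` and the return normalization are illustrative choices, not taken from the released training code:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t."""
    G, returns = 0.0, []
    for r in reversed(rewards):          # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns reduces gradient variance (a common baseline trick).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs).reshape(-1) * returns).sum()
```
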
## Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | 496.53 |
| Std Reward | 26.20 |
| Min Reward | 257.00 |
| Max Reward | 500.00 |
| Mean Episode Length | 496.53 |
| Score (mean - std) | 470.33 |
| Evaluation Episodes | 100 |
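
These numbers are the kind produced by a simple evaluation loop. Here is a sketch under stated assumptions (greedy action selection over 100 episodes; `evaluate` is an illustrative helper, not part of the released code):

```python
import numpy as np
import torch

def evaluate(env, policy, n_episodes=100, max_steps=1000):
    """Return the mean and std of undiscounted episode returns."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            state_tensor = torch.from_numpy(state).float().unsqueeze(0)
            with torch.no_grad():
                action = torch.argmax(policy(state_tensor), dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    # CartPole-v1 gives +1 reward per step, so mean reward equals
    # mean episode length, as in the table above.
    return np.mean(returns), np.std(returns)
```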

## Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np

class Policy(nn.Module):
    """Two-layer MLP that maps a state to a distribution over actions."""

    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("reinforce_cartpole.pth", map_location=device)

# Rebuild the policy from the sizes stored in the checkpoint.
policy = Policy(checkpoint['s_size'], checkpoint['a_size'], checkpoint['hidden_size'])
policy.load_state_dict(checkpoint['policy_state_dict'])
policy.to(device)
policy.eval()

env = gym.make("CartPole-v1")
state, _ = env.reset()

for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0).to(device)
    with torch.no_grad():
        probs = policy(state_tensor)
        # Greedy action selection: pick the most probable action.
        action = torch.argmax(probs, dim=1).item()

    state, reward, terminated, truncated, _ = env.step(action)

    if terminated or truncated:
        state, _ = env.reset()

env.close()
```

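The loop above selects actions greedily. REINFORCE itself samples actions from the policy distribution during training; if you want the same stochastic behavior at inference time, the action-selection lines can be swapped for sampling. A drop-in sketch using `torch.distributions.Categorical`:

```python
from torch.distributions import Categorical

with torch.no_grad():
    probs = policy(state_tensor)
    action = Categorical(probs).sample().item()  # sample instead of argmax
```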

## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 2-layer MLP (128 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.01
- **Discount Factor (gamma)**: 0.99
- **Training Episodes**: 800
- **Max Steps per Episode**: 1000
- **Device**: cuda:0
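
To show how these settings fit together, here is a minimal training-loop sketch. It reuses the `Policy` class from the Usage section and the illustrative `reinforce_loss` helper sketched earlier; it approximates a typical REINFORCE loop rather than reproducing the exact script used to train this checkpoint:

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = Policy(env.observation_space.shape[0], env.action_space.n, h_size=128)
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

for episode in range(800):                        # training episodes
    state, _ = env.reset()
    log_probs, rewards = [], []
    for _ in range(1000):                         # max steps per episode
        probs = policy(torch.from_numpy(state).float().unsqueeze(0))
        dist = Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        if terminated or truncated:
            break
    loss = reinforce_loss(log_probs, rewards, gamma=0.99)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```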