# REINFORCE Agent playing CartPole-v1
This is a trained REINFORCE agent that plays CartPole-v1. It was built with PyTorch, following the Deep Reinforcement Learning Course.
## Algorithm
REINFORCE belongs to the family of policy gradient methods. It:
- Directly optimizes the policy π(a|s)
- Uses Monte Carlo sampling (complete episodes) to estimate returns
- Updates the policy parameters in the direction of higher expected return
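The Monte Carlo return estimate can be sketched in a few lines. This is a minimal illustration, not the training code behind this model: `discounted_returns` is a hypothetical helper name, and the default `gamma=0.99` is taken from the discount factor listed in this card. In REINFORCE, each return G_t then weights the log-probability of the action taken at step t, giving the loss -Σ log π(a_t|s_t) · G_t.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    # Walk the episode backwards so each return reuses the one after it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


# CartPole-v1 gives a reward of 1.0 for every step the pole stays up.
print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Implementations often normalize these returns (subtract the mean, divide by the standard deviation) before computing the loss to reduce variance. Whether this model was trained that way is not stated here.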
## Evaluation Results
| Metric | Value |
|---|---|
| Mean Reward | 496.53 |
| Std Reward | 26.20 |
| Min Reward | 257.00 |
| Max Reward | 500.00 |
| Mean Episode Length | 496.53 |
| Score (mean - std) | 470.33 |
| Evaluation Episodes | 100 |
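The table's metrics can be derived from a list of per-episode rewards, as the sketch below shows. The helper name `summarize` is hypothetical, and the use of the population standard deviation (the default behaviour of `numpy.std`) is an assumption about how the reported value was computed. The score row is simply mean minus standard deviation: 496.53 - 26.20 = 470.33.

```python
import statistics


def summarize(episode_rewards):
    """Compute summary statistics like those reported in the table above."""
    mean = statistics.fmean(episode_rewards)
    # Population standard deviation; assumed, not confirmed by the model card.
    std = statistics.pstdev(episode_rewards)
    return {
        "mean_reward": mean,
        "std_reward": std,
        "min_reward": min(episode_rewards),
        "max_reward": max(episode_rewards),
        "score": mean - std,
        "episodes": len(episode_rewards),
    }
```

Because CartPole-v1 pays a reward of 1 per step, the mean episode length equals the mean reward, which matches the table.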
## Usage
```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym


class Policy(nn.Module):
    def __init__(self, s_size, a_size, h_size=128):
        super(Policy, self).__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)


device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load the checkpoint and rebuild the policy with the saved layer sizes
checkpoint = torch.load("reinforce_cartpole.pth", map_location=device)
policy = Policy(checkpoint['s_size'], checkpoint['a_size'], checkpoint['hidden_size']).to(device)
policy.load_state_dict(checkpoint['policy_state_dict'])
policy.eval()

env = gym.make("CartPole-v1")
state, _ = env.reset()
for step in range(1000):
    # Add a batch dimension and move the observation to the policy's device
    state_tensor = torch.from_numpy(state).float().unsqueeze(0).to(device)
    with torch.no_grad():
        probs = policy(state_tensor)
    # Act greedily: take the most probable action
    action = torch.argmax(probs, dim=1).item()
    state, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        state, _ = env.reset()
env.close()
```
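The usage loop picks the most probable action with `argmax`. REINFORCE learns a stochastic policy, so sampling from the predicted distribution is another option. Below is a minimal sketch of both modes, assuming `probs` has shape `(1, n_actions)` as returned by the `Policy` above. The helper `select_action` is hypothetical and not part of this model's code.

```python
import torch
from torch.distributions import Categorical


def select_action(probs: torch.Tensor, greedy: bool = False) -> int:
    """Pick an action from a (1, n_actions) tensor of action probabilities."""
    if greedy:
        # Deterministic choice, as in the usage loop above.
        return int(torch.argmax(probs, dim=1).item())
    # Stochastic choice: draw an action index according to the probabilities.
    return int(Categorical(probs=probs).sample().item())
```

Greedy selection is usually preferred for evaluation because it removes sampling noise. Sampling reproduces the behaviour the policy had while it was being trained.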
## Training Configuration
- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 2-layer MLP (128 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.003
- **Discount Factor**: 0.99
- **Training Episodes**: 800
- **Device**: cuda:0
## Training Hyperparameters
- Episodes: 800
- Max steps per episode: 1000
- Learning rate: 0.01
- Gamma (discount factor): 0.99
- Hidden layer size: 128
- Optimizer: Adam