# REINFORCE Agent playing CartPole-v1

This is a trained model of a REINFORCE agent playing CartPole-v1, implemented in PyTorch for the Deep Reinforcement Learning Course.

## Algorithm

REINFORCE is a policy gradient method that:

- Directly optimizes the policy π(a|s)
- Uses Monte Carlo sampling to estimate episode returns
- Updates parameters in the direction of higher expected return (see the sketch below)
- Belongs to the family of Policy Gradient methods

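For concreteness, here is a minimal sketch of the Monte Carlo policy-gradient update, assuming log-probabilities are collected during a rollout. The helper name `reinforce_loss` and the return normalization are illustrative choices, not taken from the released training code:

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE loss: -sum_t log pi(a_t | s_t) * G_t."""
    G, returns = 0.0, []
    for r in reversed(rewards):          # discounted return-to-go
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns reduces gradient variance (a common baseline trick).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return -(torch.stack(log_probs).reshape(-1) * returns).sum()
```
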
## Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | 496.53 |
| Std Reward | 26.20 |
| Min Reward | 257.00 |
| Max Reward | 500.00 |
| Mean Episode Length | 496.53 |
| Score (mean - std) | 470.33 |
| Evaluation Episodes | 100 |
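
These numbers are the kind produced by a simple evaluation loop. Here is a sketch under stated assumptions (greedy action selection over 100 episodes; `evaluate` is an illustrative helper, not part of the released code):

```python
import numpy as np
import torch

def evaluate(env, policy, n_episodes=100, max_steps=1000):
    """Return the mean and std of undiscounted episode returns."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total = 0.0
        for _ in range(max_steps):
            state_tensor = torch.from_numpy(state).float().unsqueeze(0)
            with torch.no_grad():
                action = torch.argmax(policy(state_tensor), dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    # CartPole-v1 gives +1 reward per step, so mean reward equals
    # mean episode length, as in the table above.
    return np.mean(returns), np.std(returns)
```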

## Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym
import numpy as np

class Policy(nn.Module):
    """Two-layer MLP that maps a state to a distribution over actions."""

    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("reinforce_cartpole.pth", map_location=device)

# Rebuild the policy from the sizes stored in the checkpoint.
policy = Policy(checkpoint['s_size'], checkpoint['a_size'], checkpoint['hidden_size'])
policy.load_state_dict(checkpoint['policy_state_dict'])
policy.to(device)
policy.eval()

env = gym.make("CartPole-v1")
state, _ = env.reset()

for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0).to(device)
    with torch.no_grad():
        probs = policy(state_tensor)
        # Greedy action selection: pick the most probable action.
        action = torch.argmax(probs, dim=1).item()

    state, reward, terminated, truncated, _ = env.step(action)

    if terminated or truncated:
        state, _ = env.reset()

env.close()
```

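The loop above selects actions greedily. REINFORCE itself samples actions from the policy distribution during training; if you want the same stochastic behavior at inference time, the action-selection lines can be swapped for sampling. A drop-in sketch using `torch.distributions.Categorical`:

```python
from torch.distributions import Categorical

with torch.no_grad():
    probs = policy(state_tensor)
    action = Categorical(probs).sample().item()  # sample instead of argmax
```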

## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 2-layer MLP (128 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.01
- **Discount Factor (gamma)**: 0.99
- **Training Episodes**: 800
- **Max Steps per Episode**: 1000
- **Device**: cuda:0
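
To show how these settings fit together, here is a minimal training-loop sketch. It reuses the `Policy` class from the Usage section and the illustrative `reinforce_loss` helper sketched earlier; it approximates a typical REINFORCE loop rather than reproducing the exact script used to train this checkpoint:

```python
import gymnasium as gym
import torch
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = Policy(env.observation_space.shape[0], env.action_space.n, h_size=128)
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)

for episode in range(800):                        # training episodes
    state, _ = env.reset()
    log_probs, rewards = [], []
    for _ in range(1000):                         # max steps per episode
        probs = policy(torch.from_numpy(state).float().unsqueeze(0))
        dist = Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        if terminated or truncated:
            break
    loss = reinforce_loss(log_probs, rewards, gamma=0.99)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```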