|
|
---
library_name: reinforce
tags:
- CartPole-v1
- deep-reinforcement-learning
- reinforcement-learning
- policy-gradient
- reinforce
model-index:
- name: REINFORCE
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: CartPole-v1
      type: CartPole-v1
    metrics:
    - type: mean_reward
      value: 496.53 +/- 26.20
      name: mean_reward
      verified: false
---
|
|
|
|
|
# **REINFORCE** Agent playing **CartPole-v1**

This is a trained model of a **REINFORCE** agent playing **CartPole-v1**, implemented in PyTorch for the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit4).
|
|
|
|
|
## Algorithm

REINFORCE is a policy gradient method that:

- Directly optimizes the parameterized policy π(a|s)
- Uses Monte Carlo sampling of complete episodes to estimate returns
- Updates parameters in the direction of higher expected return
- Belongs to the family of Policy Gradient methods
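
In code, this update amounts to only a few lines. The sketch below is illustrative rather than the exact training code behind this checkpoint: it assumes per-step log-probabilities and rewards collected from one episode, and the return normalization is a common variance-reduction trick, not something confirmed by this card.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss for a single episode.

    log_probs: list of log pi(a_t|s_t) tensors, one per step
    rewards:   list of scalar rewards r_t
    """
    # Compute discounted returns-to-go G_t by iterating backwards over the episode.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick (an assumption here).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Ascent on expected return == descent on -sum_t log pi(a_t|s_t) * G_t.
    return -(torch.stack(log_probs) * returns).sum()
```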
|
|
|
|
|
## Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | 496.53 |
| Std Reward | 26.20 |
| Min Reward | 257.00 |
| Max Reward | 500.00 |
| Mean Episode Length | 496.53 steps |
| Score (mean - std) | 470.33 |
| Evaluation Episodes | 100 |
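
For reference, metrics of this kind are typically produced by a rollout loop shaped like the sketch below. This is an assumption about the evaluation script, not the script itself; only the episode count (100) and greedy action selection are taken from this card.

```python
import numpy as np
import torch

def evaluate(env, policy, n_episodes=100, max_steps=1000):
    """Roll out the greedy policy and return mean/std of episode rewards."""
    episode_rewards = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            state_tensor = torch.from_numpy(state).float().unsqueeze(0)
            with torch.no_grad():
                action = torch.argmax(policy(state_tensor), dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        episode_rewards.append(total_reward)
    return float(np.mean(episode_rewards)), float(np.std(episode_rewards))
```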
|
|
|
|
|
## Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym

class Policy(nn.Module):
    """Two-layer MLP that maps a state to a probability distribution over actions."""

    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("reinforce_cartpole.pth", map_location=device)

# Rebuild the network from the sizes stored in the checkpoint.
policy = Policy(checkpoint['s_size'], checkpoint['a_size'], checkpoint['hidden_size'])
policy.load_state_dict(checkpoint['policy_state_dict'])
policy.to(device)
policy.eval()

env = gym.make("CartPole-v1")
state, _ = env.reset()

for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0).to(device)
    with torch.no_grad():
        probs = policy(state_tensor)
    action = torch.argmax(probs, dim=1).item()  # greedy action

    state, reward, terminated, truncated, _ = env.step(action)

    if terminated or truncated:
        state, _ = env.reset()

env.close()
```
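
Note that the snippet above selects the greedy action (the `argmax` over the action probabilities). During training, REINFORCE samples actions from the policy distribution instead (e.g. with `torch.distributions.Categorical`); the argmax is a common choice for deterministic evaluation.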
|
|
|
|
|
|
|
|
## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 2-layer MLP (128 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.003
- **Discount Factor**: 0.99
- **Training Episodes**: 800
- **Device**: cuda:0
|
|
|
|
|
## Training Hyperparameters

- Episodes: 800
- Max steps per episode: 1000
- Learning rate: 0.01
- Gamma (discount factor): 0.99
- Hidden layer size: 128
- Optimizer: Adam
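
Wiring these hyperparameters together, a REINFORCE training loop for this setup typically looks like the sketch below. It reuses the `Policy` class from the Usage section and the `reinforce_loss` helper sketched under Algorithm; the environment-interaction details are assumptions, and since the two sections above report different learning rates (0.003 vs. 0.01), the value used here is only illustrative.

```python
import torch
import gymnasium as gym
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = Policy(s_size=4, a_size=2, h_size=128)  # CartPole-v1: 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)  # 0.003 is also reported above

for episode in range(800):
    log_probs, rewards = [], []
    state, _ = env.reset()
    for _ in range(1000):  # max steps per episode
        probs = policy(torch.from_numpy(state).float().unsqueeze(0))
        dist = Categorical(probs.squeeze(0))   # probs has shape (1, a_size)
        action = dist.sample()                 # sample, not argmax, during training
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        if terminated or truncated:
            break
    # One gradient step per episode on the Monte Carlo policy-gradient loss.
    loss = reinforce_loss(log_probs, rewards, gamma=0.99)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```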
|
|
|