|
|
---
library_name: reinforce
tags:
- CartPole-v1
- deep-reinforcement-learning
- reinforcement-learning
- policy-gradient
- reinforce
model-index:
- name: REINFORCE
  results:
  - task:
      type: reinforcement-learning
      name: reinforcement-learning
    dataset:
      name: CartPole-v1
      type: CartPole-v1
    metrics:
    - type: mean_reward
      value: 496.53 +/- 26.20
      name: mean_reward
      verified: false
---
|
|
|
|
|
# **REINFORCE** Agent playing **CartPole-v1**

This is a trained model of a **REINFORCE** agent playing **CartPole-v1**, implemented in PyTorch for the [Deep Reinforcement Learning Course](https://huggingface.co/deep-rl-course/unit4).
|
|
|
|
|
## Algorithm

REINFORCE is a policy gradient method that:

- Directly optimizes the parameterized policy π(a|s)
- Uses Monte Carlo sampling of complete episodes to estimate returns
- Updates parameters in the direction of higher expected return
- Belongs to the family of Policy Gradient methods
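
In code, this update amounts to only a few lines. The sketch below is illustrative rather than the exact training code behind this checkpoint: it assumes per-step log-probabilities and rewards collected from one episode, and the return normalization is a common variance-reduction trick, not something confirmed by this card.

```python
import torch

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """Monte Carlo policy-gradient loss for a single episode.

    log_probs: list of log pi(a_t|s_t) tensors, one per step
    rewards:   list of scalar rewards r_t
    """
    # Compute discounted returns-to-go G_t by iterating backwards over the episode.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalizing returns is a common variance-reduction trick (an assumption here).
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    # Ascent on expected return == descent on -sum_t log pi(a_t|s_t) * G_t.
    return -(torch.stack(log_probs) * returns).sum()
```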
|
|
|
|
|
## Evaluation Results

| Metric | Value |
|--------|-------|
| Mean Reward | 496.53 |
| Std Reward | 26.20 |
| Min Reward | 257.00 |
| Max Reward | 500.00 |
| Mean Episode Length | 496.53 steps |
| Score (mean - std) | 470.33 |
| Evaluation Episodes | 100 |
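
For reference, metrics of this kind are typically produced by a rollout loop shaped like the sketch below. This is an assumption about the evaluation script, not the script itself; only the episode count (100) and greedy action selection are taken from this card.

```python
import numpy as np
import torch

def evaluate(env, policy, n_episodes=100, max_steps=1000):
    """Roll out the greedy policy and return mean/std of episode rewards."""
    episode_rewards = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        total_reward = 0.0
        for _ in range(max_steps):
            state_tensor = torch.from_numpy(state).float().unsqueeze(0)
            with torch.no_grad():
                action = torch.argmax(policy(state_tensor), dim=1).item()
            state, reward, terminated, truncated, _ = env.step(action)
            total_reward += reward
            if terminated or truncated:
                break
        episode_rewards.append(total_reward)
    return float(np.mean(episode_rewards)), float(np.std(episode_rewards))
```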
|
|
|
|
|
## Usage

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import gymnasium as gym

class Policy(nn.Module):
    """Two-layer MLP that maps a state to a probability distribution over actions."""

    def __init__(self, s_size, a_size, h_size=128):
        super().__init__()
        self.fc1 = nn.Linear(s_size, h_size)
        self.fc2 = nn.Linear(h_size, a_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return F.softmax(x, dim=1)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
checkpoint = torch.load("reinforce_cartpole.pth", map_location=device)

# Rebuild the network from the sizes stored in the checkpoint.
policy = Policy(checkpoint['s_size'], checkpoint['a_size'], checkpoint['hidden_size'])
policy.load_state_dict(checkpoint['policy_state_dict'])
policy.to(device)
policy.eval()

env = gym.make("CartPole-v1")
state, _ = env.reset()

for step in range(1000):
    state_tensor = torch.from_numpy(state).float().unsqueeze(0).to(device)
    with torch.no_grad():
        probs = policy(state_tensor)
    action = torch.argmax(probs, dim=1).item()  # greedy action

    state, reward, terminated, truncated, _ = env.step(action)

    if terminated or truncated:
        state, _ = env.reset()

env.close()
```
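
Note that the snippet above selects the greedy action (the `argmax` over the action probabilities). During training, REINFORCE samples actions from the policy distribution instead (e.g. with `torch.distributions.Categorical`); the argmax is a common choice for deterministic evaluation.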
|
|
|
|
|
|
|
|
## Training Configuration

- **Algorithm**: REINFORCE (Policy Gradient)
- **Policy Network**: 2-layer MLP (128 hidden units)
- **Optimizer**: Adam
- **Learning Rate**: 0.003
- **Discount Factor**: 0.99
- **Training Episodes**: 800
- **Device**: cuda:0
|
|
|
|
|
## Training Hyperparameters

- Episodes: 800
- Max steps per episode: 1000
- Learning rate: 0.01
- Gamma (discount factor): 0.99
- Hidden layer size: 128
- Optimizer: Adam
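
Wiring these hyperparameters together, a REINFORCE training loop for this setup typically looks like the sketch below. It reuses the `Policy` class from the Usage section and the `reinforce_loss` helper sketched under Algorithm; the environment-interaction details are assumptions, and since the two sections above report different learning rates (0.003 vs. 0.01), the value used here is only illustrative.

```python
import torch
import gymnasium as gym
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
policy = Policy(s_size=4, a_size=2, h_size=128)  # CartPole-v1: 4-dim state, 2 actions
optimizer = torch.optim.Adam(policy.parameters(), lr=0.01)  # 0.003 is also reported above

for episode in range(800):
    log_probs, rewards = [], []
    state, _ = env.reset()
    for _ in range(1000):  # max steps per episode
        probs = policy(torch.from_numpy(state).float().unsqueeze(0))
        dist = Categorical(probs.squeeze(0))   # probs has shape (1, a_size)
        action = dist.sample()                 # sample, not argmax, during training
        log_probs.append(dist.log_prob(action))
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        if terminated or truncated:
            break
    # One gradient step per episode on the Monte Carlo policy-gradient loss.
    loss = reinforce_loss(log_probs, rewards, gamma=0.99)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```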
|
|
|