Reinforcement Learning for UR5 Robotic Arm Control using PPO, TD3, and SAC. This project implements and compares on-policy (PPO) and off-policy (TD3, SAC) deep reinforcement learning methods on a 6-DOF robotic manipulation task.
The agents are trained to control a UR5 robotic arm to reach randomly sampled target positions in 3D space.
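Interaction with the environment follows the standard Gymnasium API (reset/step with a 5-tuple return). The snippet below is only a minimal random-rollout sketch; the class name UR5Env is an assumption for illustration, so check env.py and test_env.py for the actual interface.

```python
# Minimal random-rollout sketch (illustrative; the real class name and
# constructor arguments live in env.py / test_env.py).
from env import UR5Env  # hypothetical import; adjust to the actual class

env = UR5Env()
obs, info = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()  # random 6-DOF joint command
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # target reached or episode timed out
        obs, info = env.reset()
env.close()
```

Results summary: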
| Algorithm | Success Rate | Episodic Return | Training Steps |
|---|---|---|---|
| TD3 Baseline | 92.3% | 337.76 | 1M |
| PPO v2.1 | 74.9% | 337.26 | 2M |
| PPO v2 | 62.8% | 336.2 | 2M |
| PPO Tuned | 61.9% | 340.65 | 500k |
| PPO Baseline | 5.7% | 37.16 | 1M |
| SAC v3 | 0.05% | -10.5 | 1M |
Key Findings:
- TD3 outperforms all other algorithms with a 92.3% success rate
- PPO achieved ~75% after extensive hyperparameter tuning
- SAC failed to learn in this environment (unstable Q-function)
- Reward shaping is critical: roughly a 13x improvement in success rate from the PPO baseline (5.7%) to the best tuned PPO (74.9%)
UR5/
├── env.py                      # Original UR5 Gymnasium environment
├── env_tuned.py                # Tuned environment (10cm threshold)
├── env_tuned_v3.py             # PPO v3 environment (8cm threshold)
├── env_tuned_sac.py            # SAC-specific environment
├── env_tuned_sac_v3.py         # SAC v3 environment (simplified rewards)
├── __init__.py                 # Package initialization
├── test_env.py                 # Environment testing script
├── ik.py                       # Inverse kinematics implementation
│
├── assets/                     # MuJoCo model files
│   ├── scene.xml               # Main scene with UR5 + environment
│   ├── ur5e.xml                # UR5e robot MJCF model
│   └── meshes/                 # Robot mesh files (.stl)
│
├── cleanrl/cleanrl/            # Training scripts
│   ├── ppo_continuous_action.py           # PPO baseline
│   ├── ppo_continuous_action_tuned.py     # PPO tuned (61.9%)
│   ├── ppo_continuous_action_v2.py        # PPO v2 (62.8%)
│   ├── ppo_continuous_action_v2_1.py      # PPO v2.1 (74.9%)
│   ├── ppo_continuous_action_v3.py        # PPO v3 (8cm, 41.3%)
│   ├── td3_continuous_action.py           # TD3 original
│   ├── td3_continuous_action_ur5.py       # TD3 UR5 (92.3%)
│   ├── sac_continuous_action.py           # SAC original
│   ├── sac_continuous_action_modified.py  # SAC adapted for UR5
│   ├── sac_continuous_action_tuned.py     # SAC tuned
│   ├── sac_continuous_action_v2.py        # SAC v2
│   ├── sac_continuous_action_v3.py        # SAC v3
│   └── ddpg_continuous_action.py          # DDPG implementation
│
├── runs/                       # TensorBoard training logs
├── wandb/                      # Weights & Biases experiment logs
└── videos/                     # Recorded episode videos
| Component | Specification |
|---|---|
| OS | Windows 10 |
| CPU | Intel Core i7-7700HQ @ 2.80GHz |
| GPU | NVIDIA GeForce GTX 1060 (6GB) |
| Python | 3.10.19 |
| PyTorch | 2.4.1 (CUDA) |
# Create conda environment
conda create -n ur5 python=3.10 -y
conda activate ur5
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install MuJoCo and Gymnasium
pip install mujoco gymnasium[mujoco] dm_control
# Install RL utilities
pip install wandb tyro tensorboard
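Before training, a quick sanity check can confirm that the core libraries import and that CUDA is visible. This is only a suggested check, not a script that ships with the repo:

```python
# Sanity check for the installation: print library versions and CUDA status.
import torch
import mujoco
import gymnasium

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MuJoCo:", mujoco.__version__)
print("Gymnasium:", gymnasium.__version__)
```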
Test the environment:
python test_env.py

TD3 (Best Performance - 92.3%):
python cleanrl/cleanrl/td3_continuous_action_ur5.py

PPO v2.1 (74.9%):
python cleanrl/cleanrl/ppo_continuous_action_v2_1.py

PPO Tuned (61.9%):
python cleanrl/cleanrl/ppo_continuous_action_tuned.py

Monitoring:
- TensorBoard: tensorboard --logdir runs/
- WandB: https://wandb.ai/supernova0417-korea-university/ur5
TD3 hyperparameters:
learning_rate = 3e-4
learning_starts = 25000
policy_noise = 0.2
exploration_noise = 0.1
tau = 0.005
batch_size = 256
total_timesteps = 1000000

PPO v2.1 hyperparameters:
learning_rate = 5e-5
ent_coef = 0.003  # Key tuning point!
num_envs = 8
num_steps = 2048
total_timesteps = 2000000

Reward shaping configuration (tuned environment, 10cm threshold):
reward_cfg = {
    "dist_success_thresh": 0.10,  # 10cm
    "time_penalty": 0.001,
    "success_bonus": 300.0,
    "progress_scale": 15.0,
}

| Algorithm | Version | Success Rate | Key Changes |
|---|---|---|---|
| TD3 | Baseline | 92.3% | Twin Q, delayed policy |
| PPO | v2.1 | 74.9% | ent_coef=0.003 |
| PPO | v2 | 62.8% | 2M steps, 8 envs |
| PPO | Tuned | 61.9% | Reward shaping |
| PPO | v3 | 41.3% | 8cm threshold (too hard) |
| PPO | Baseline | 5.7% | Original settings |
| SAC | v3 | 0.05% | Failed (Q divergence) |
| SAC | Baseline | 0% | Failed |
Key Insights:
- TD3 is the best algorithm for this task
  - Twin Q-networks prevent overestimation (the target computation is sketched below)
  - Delayed policy updates ensure stability
  - SAC's entropy term caused instability
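Those two ingredients show up directly in how TD3 builds its critic target: a noise-smoothed target action and the minimum over the twin Q-networks. The sketch below is illustrative; function and variable names may differ from td3_continuous_action_ur5.py.

```python
import torch

def td3_critic_target(rew, next_obs, done, target_actor, qf1_target, qf2_target,
                      gamma=0.99, policy_noise=0.2, noise_clip=0.5,
                      act_low=-1.0, act_high=1.0):
    """Clipped double-Q target: smooth the target action, then take min(Q1', Q2')."""
    with torch.no_grad():
        next_action = target_actor(next_obs)
        # Target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(act_low, act_high)
        # Twin critics: the minimum curbs Q-value overestimation
        q_min = torch.min(qf1_target(next_obs, next_action),
                          qf2_target(next_obs, next_action))
        return rew + gamma * (1.0 - done) * q_min.view(-1)
```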
- PPO converges around ~75%
  - ent_coef = 0.003 is optimal (between 0.005 and 0.001); the sketch below shows where it enters the loss
  - Further tuning shows diminishing returns
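For context, ent_coef weights the entropy bonus inside the combined CleanRL-style PPO loss, so a larger value keeps the policy exploring longer. A sketch with illustrative names:

```python
def ppo_total_loss(pg_loss, v_loss, entropy, ent_coef=0.003, vf_coef=0.5):
    """Combine the PPO loss terms. Entropy is subtracted, so a larger ent_coef
    rewards more stochastic (exploratory) policies; 0.003 worked best here."""
    return pg_loss - ent_coef * entropy + vf_coef * v_loss
```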
- SAC fundamentally fails in this environment
  - Q-function diverges to infinity
  - Entropy regularization seems problematic (the entropy-augmented target is sketched below)
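For reference, the entropy term enters SAC through its critic target (standard SAC formulation; the method and variable names below are illustrative, not this repo's exact code):

```python
import torch

def sac_critic_target(rew, next_obs, done, actor, qf1_target, qf2_target,
                      alpha, gamma=0.99):
    """Entropy-augmented target: min(Q1', Q2') minus alpha times the log-prob
    of the freshly sampled next action."""
    with torch.no_grad():
        next_action, next_log_pi = actor.sample(next_obs)  # illustrative: sampled action + log-prob
        q_min = torch.min(qf1_target(next_obs, next_action),
                          qf2_target(next_obs, next_action))
        soft_q = q_min - alpha * next_log_pi                # the entropy bonus enters the target here
        return rew + gamma * (1.0 - done) * soft_q.view(-1)
```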
- Reward shaping matters significantly
  - 10cm threshold is achievable
  - 8cm threshold causes a major performance drop
  - Progress-based rewards are essential (sketched below)
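The shaped reward implied by the reward_cfg above combines a progress term, a small time penalty, and a success bonus. The sketch below is illustrative only; the actual implementation lives in env_tuned.py.

```python
def shaped_reward(dist, prev_dist, cfg):
    """Progress-based shaping: pay for getting closer, charge a small per-step
    cost, and add a large bonus once inside the success threshold."""
    reward = cfg["progress_scale"] * (prev_dist - dist)  # positive when moving toward the target
    reward -= cfg["time_penalty"]                        # small per-step penalty
    success = dist < cfg["dist_success_thresh"]
    if success:
        reward += cfg["success_bonus"]
    return reward, success
```

References: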
- Fujimoto et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. (TD3)
- Schulman et al. (2017). Proximal Policy Optimization Algorithms. (PPO)
- Haarnoja et al. (2018). Soft Actor-Critic. (SAC)
- CleanRL - Clean implementation of RL algorithms.
This project is licensed under the MIT License - see the LICENSE file for details.
- Korea University - MECH485 Intelligent Robotics Course
- CleanRL for the excellent RL implementations
- MuJoCo for the physics simulation
Author: Jinkwon Park
Course: MECH485 - Intelligent Robotics, Korea University
Date: December 28, 2025