Reinforcement Learning for UR5 Robotic Arm Control using PPO, TD3, and SAC. This project implements and compares on-policy (PPO) and off-policy (TD3, SAC) deep reinforcement learning methods on a 6-DOF robotic manipulation task.
The agents are trained to control a UR5 robotic arm to reach randomly sampled target positions in 3D space.
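Interaction with the environment follows the standard Gymnasium API (reset/step with a 5-tuple return). The snippet below is only a minimal random-rollout sketch; the class name UR5Env is an assumption for illustration, so check env.py and test_env.py for the actual interface.

```python
# Minimal random-rollout sketch (illustrative; the real class name and
# constructor arguments live in env.py / test_env.py).
from env import UR5Env  # hypothetical import; adjust to the actual class

env = UR5Env()
obs, info = env.reset(seed=0)
for _ in range(1_000):
    action = env.action_space.sample()  # random 6-DOF joint command
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:         # target reached or episode timed out
        obs, info = env.reset()
env.close()
```

Results summary: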
| Algorithm | Success Rate | Episodic Return | Training Steps |
|---|---|---|---|
| TD3 Baseline | 92.3% | 337.76 | 1M |
| PPO v2.1 | 74.9% | 337.26 | 2M |
| PPO v2 | 62.8% | 336.2 | 2M |
| PPO Tuned | 61.9% | 340.65 | 500k |
| PPO Baseline | 5.7% | 37.16 | 1M |
| SAC v3 | 0.05% | -10.5 | 1M |
Key Findings:
- TD3 outperforms all other algorithms with a 92.3% success rate
- PPO achieved ~75% after extensive hyperparameter tuning
- SAC failed to learn in this environment (unstable Q-function)
- Reward shaping is critical: roughly a 13x improvement in success rate from the PPO baseline (5.7%) to the best tuned PPO (74.9%)
UR5/
├── env.py                      # Original UR5 Gymnasium environment
├── env_tuned.py                # Tuned environment (10cm threshold)
├── env_tuned_v3.py             # PPO v3 environment (8cm threshold)
├── env_tuned_sac.py            # SAC-specific environment
├── env_tuned_sac_v3.py         # SAC v3 environment (simplified rewards)
├── __init__.py                 # Package initialization
├── test_env.py                 # Environment testing script
├── ik.py                       # Inverse kinematics implementation
│
├── assets/                     # MuJoCo model files
│   ├── scene.xml               # Main scene with UR5 + environment
│   ├── ur5e.xml                # UR5e robot MJCF model
│   └── meshes/                 # Robot mesh files (.stl)
│
├── cleanrl/cleanrl/            # Training scripts
│   ├── ppo_continuous_action.py           # PPO baseline
│   ├── ppo_continuous_action_tuned.py     # PPO tuned (61.9%)
│   ├── ppo_continuous_action_v2.py        # PPO v2 (62.8%)
│   ├── ppo_continuous_action_v2_1.py      # PPO v2.1 (74.9%)
│   ├── ppo_continuous_action_v3.py        # PPO v3 (8cm, 41.3%)
│   ├── td3_continuous_action.py           # TD3 original
│   ├── td3_continuous_action_ur5.py       # TD3 UR5 (92.3%)
│   ├── sac_continuous_action.py           # SAC original
│   ├── sac_continuous_action_modified.py  # SAC adapted for UR5
│   ├── sac_continuous_action_tuned.py     # SAC tuned
│   ├── sac_continuous_action_v2.py        # SAC v2
│   ├── sac_continuous_action_v3.py        # SAC v3
│   └── ddpg_continuous_action.py          # DDPG implementation
│
├── runs/                       # TensorBoard training logs
├── wandb/                      # Weights & Biases experiment logs
└── videos/                     # Recorded episode videos
| Component | Specification |
|---|---|
| OS | Windows 10 |
| CPU | Intel Core i7-7700HQ @ 2.80GHz |
| GPU | NVIDIA GeForce GTX 1060 (6GB) |
| Python | 3.10.19 |
| PyTorch | 2.4.1 (CUDA) |
# Create conda environment
conda create -n ur5 python=3.10 -y
conda activate ur5
# Install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install MuJoCo and Gymnasium
pip install mujoco gymnasium[mujoco] dm_control
# Install RL utilities
pip install wandb tyro tensorboard
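Before training, a quick sanity check can confirm that the core libraries import and that CUDA is visible. This is only a suggested check, not a script that ships with the repo:

```python
# Sanity check for the installation: print library versions and CUDA status.
import torch
import mujoco
import gymnasium

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("MuJoCo:", mujoco.__version__)
print("Gymnasium:", gymnasium.__version__)
```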
Test the environment:
python test_env.py

TD3 (Best Performance - 92.3%):
python cleanrl/cleanrl/td3_continuous_action_ur5.py

PPO v2.1 (74.9%):
python cleanrl/cleanrl/ppo_continuous_action_v2_1.py

PPO Tuned (61.9%):
python cleanrl/cleanrl/ppo_continuous_action_tuned.py

Monitoring:
- TensorBoard: tensorboard --logdir runs/
- WandB: https://wandb.ai/supernova0417-korea-university/ur5
TD3 hyperparameters:
learning_rate = 3e-4
learning_starts = 25000
policy_noise = 0.2
exploration_noise = 0.1
tau = 0.005
batch_size = 256
total_timesteps = 1000000

PPO v2.1 hyperparameters:
learning_rate = 5e-5
ent_coef = 0.003  # Key tuning point!
num_envs = 8
num_steps = 2048
total_timesteps = 2000000

Reward shaping configuration (tuned environment, 10cm threshold):
reward_cfg = {
    "dist_success_thresh": 0.10,  # 10cm
    "time_penalty": 0.001,
    "success_bonus": 300.0,
    "progress_scale": 15.0,
}

| Algorithm | Version | Success Rate | Key Changes |
|---|---|---|---|
| TD3 | Baseline | 92.3% | Twin Q, delayed policy |
| PPO | v2.1 | 74.9% | ent_coef=0.003 |
| PPO | v2 | 62.8% | 2M steps, 8 envs |
| PPO | Tuned | 61.9% | Reward shaping |
| PPO | v3 | 41.3% | 8cm threshold (too hard) |
| PPO | Baseline | 5.7% | Original settings |
| SAC | v3 | 0.05% | Failed (Q divergence) |
| SAC | Baseline | 0% | Failed |
Key Insights:
- TD3 is the best algorithm for this task
  - Twin Q-networks prevent overestimation (the target computation is sketched below)
  - Delayed policy updates ensure stability
  - SAC's entropy term caused instability
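Those two ingredients show up directly in how TD3 builds its critic target: a noise-smoothed target action and the minimum over the twin Q-networks. The sketch below is illustrative; function and variable names may differ from td3_continuous_action_ur5.py.

```python
import torch

def td3_critic_target(rew, next_obs, done, target_actor, qf1_target, qf2_target,
                      gamma=0.99, policy_noise=0.2, noise_clip=0.5,
                      act_low=-1.0, act_high=1.0):
    """Clipped double-Q target: smooth the target action, then take min(Q1', Q2')."""
    with torch.no_grad():
        next_action = target_actor(next_obs)
        # Target policy smoothing: clipped Gaussian noise on the target action
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(act_low, act_high)
        # Twin critics: the minimum curbs Q-value overestimation
        q_min = torch.min(qf1_target(next_obs, next_action),
                          qf2_target(next_obs, next_action))
        return rew + gamma * (1.0 - done) * q_min.view(-1)
```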
- PPO converges around ~75%
  - ent_coef = 0.003 is optimal (between 0.005 and 0.001); the sketch below shows where it enters the loss
  - Further tuning shows diminishing returns
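For context, ent_coef weights the entropy bonus inside the combined CleanRL-style PPO loss, so a larger value keeps the policy exploring longer. A sketch with illustrative names:

```python
def ppo_total_loss(pg_loss, v_loss, entropy, ent_coef=0.003, vf_coef=0.5):
    """Combine the PPO loss terms. Entropy is subtracted, so a larger ent_coef
    rewards more stochastic (exploratory) policies; 0.003 worked best here."""
    return pg_loss - ent_coef * entropy + vf_coef * v_loss
```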
- SAC fundamentally fails in this environment
  - Q-function diverges to infinity
  - Entropy regularization seems problematic (the entropy-augmented target is sketched below)
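For reference, the entropy term enters SAC through its critic target (standard SAC formulation; the method and variable names below are illustrative, not this repo's exact code):

```python
import torch

def sac_critic_target(rew, next_obs, done, actor, qf1_target, qf2_target,
                      alpha, gamma=0.99):
    """Entropy-augmented target: min(Q1', Q2') minus alpha times the log-prob
    of the freshly sampled next action."""
    with torch.no_grad():
        next_action, next_log_pi = actor.sample(next_obs)  # illustrative: sampled action + log-prob
        q_min = torch.min(qf1_target(next_obs, next_action),
                          qf2_target(next_obs, next_action))
        soft_q = q_min - alpha * next_log_pi                # the entropy bonus enters the target here
        return rew + gamma * (1.0 - done) * soft_q.view(-1)
```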
- Reward shaping matters significantly
  - 10cm threshold is achievable
  - 8cm threshold causes a major performance drop
  - Progress-based rewards are essential (sketched below)
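The shaped reward implied by the reward_cfg above combines a progress term, a small time penalty, and a success bonus. The sketch below is illustrative only; the actual implementation lives in env_tuned.py.

```python
def shaped_reward(dist, prev_dist, cfg):
    """Progress-based shaping: pay for getting closer, charge a small per-step
    cost, and add a large bonus once inside the success threshold."""
    reward = cfg["progress_scale"] * (prev_dist - dist)  # positive when moving toward the target
    reward -= cfg["time_penalty"]                        # small per-step penalty
    success = dist < cfg["dist_success_thresh"]
    if success:
        reward += cfg["success_bonus"]
    return reward, success
```

References: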
- Fujimoto et al. (2018). Addressing Function Approximation Error in Actor-Critic Methods. (TD3)
- Schulman et al. (2017). Proximal Policy Optimization Algorithms. (PPO)
- Haarnoja et al. (2018). Soft Actor-Critic. (SAC)
- CleanRL - Clean implementation of RL algorithms.
This project is licensed under the MIT License - see the LICENSE file for details.
- Korea University - MECH485 Intelligent Robotics Course
- CleanRL for the excellent RL implementations
- MuJoCo for the physics simulation
Author: Jinkwon Park
Course: MECH485 - Intelligent Robotics, Korea University
Date: December 28, 2025