Master Thesis by Patrick Siebke, TU Darmstadt – November 15, 2024
This repository contains the code and experiments from my master thesis on Quasimetric Reinforcement Learning (QRL). The thesis addresses two central research questions in Goal-Conditioned Reinforcement Learning (GCRL):
- Impact of a Quasimetric Critic: How does modeling the optimal value function as a quasimetric affect learning? I compare a standard neural network critic with a quasimetric critic based on Interval Quasimetric Embedding (IQE), as well as a metric critic derived from IQE to enforce symmetry.
- Extended Goal Relabeling: Can integrating random off-trajectory goal relabeling with on-trajectory relabeling in Hindsight Experience Replay (HER) introduce a pessimism bias that improves both robustness and sample efficiency?
Experiments on robotics tasks from the Gymnasium Robotics suite — specifically Fetch and HandManipulate environments — demonstrate that these approaches can significantly enhance learning performance.
In goal-conditioned reinforcement learning, agents learn to achieve multiple goals from sparse reward signals. Two key challenges are:
- Sparse Rewards: Feedback is only provided when a goal is reached, which hinders efficient learning.
- Temporal Distance as a Quasimetric: The optimal value function, which represents the minimum cost or “distance” to reach a goal, is inherently asymmetric. This asymmetry — where the cost from state A to B may differ from the cost from B to A — can be captured by a quasimetric.
In this work, I make the following contributions:
- Compare Different Critic Architectures:
  - A conventional neural network critic (unconstrained function approximator).
  - A quasimetric critic using Interval Quasimetric Embedding (IQE) that can capture asymmetric distances.
  - A metric critic (Interval Metric Embedding, IME) that enforces symmetry by averaging directional costs.
- Extend HER Relabeling Strategies: I introduce mixed relabeling strategies that decouple the relabeling for the actor and the critic, combining random off-trajectory relabeling with traditional on-trajectory relabeling. This mixed strategy introduces a pessimism bias that helps the critic generalize to unseen goals.
The core idea is to approximate the optimal value function $ V^*(s, g) $ as a quasimetric $ d_\theta(s, g) $. This formulation naturally enforces:
- Triangle Inequality: The cost of any detour is at least as high as that of the direct transition.
- Asymmetry: It accurately captures that moving toward a goal can incur a different cost than moving away.
To implement this, the thesis explores:
- Interval Quasimetric Embedding (IQE): A latent embedding that satisfies the properties of a quasimetric.
- Interval Metric Embedding (IME): An approach where a symmetric metric is derived from the quasimetric by averaging the directional costs.
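As an illustration of these two embeddings, here is a minimal NumPy sketch of the IQE distance and its symmetrized IME variant. The `num_components` reshaping and the sum-based combination of components are simplifying assumptions for this sketch; the actual implementation in this repository (and in `torchqmet`) may combine components differently.

```python
import numpy as np

def interval_union_length(starts, ends):
    """Lebesgue measure of the union of intervals [starts[j], ends[j]]."""
    order = np.argsort(starts)
    starts, ends = starts[order], ends[order]
    total, cur_start, cur_end = 0.0, starts[0], ends[0]
    for s, e in zip(starts[1:], ends[1:]):
        if s > cur_end:                       # disjoint interval: close the current one
            total += cur_end - cur_start
            cur_start, cur_end = s, e
        else:                                 # overlapping interval: extend the current one
            cur_end = max(cur_end, e)
    return total + (cur_end - cur_start)

def iqe_distance(u, v, num_components=4):
    """IQE quasimetric between latent vectors u and v (sum over components)."""
    u = u.reshape(num_components, -1)
    v = v.reshape(num_components, -1)
    # each latent dimension j spawns the interval [u_j, max(u_j, v_j)];
    # a component's contribution is the length of the union of its intervals
    return float(sum(
        interval_union_length(ui, np.maximum(ui, vi)) for ui, vi in zip(u, v)
    ))

def ime_distance(u, v, num_components=4):
    """Symmetric IME variant: average of the two directional IQE costs."""
    return 0.5 * (iqe_distance(u, v, num_components)
                  + iqe_distance(v, u, num_components))
```

For example, with `u = [0, 1]`, `v = [2, 0]`, and a single component, `iqe_distance(u, v)` is 2.0 while `iqe_distance(v, u)` is 1.0, illustrating the asymmetry that the IME variant averages away (`ime_distance` = 1.5).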
HER is used to mitigate sparse rewards by relabeling goals based on future states. In my work, I extend HER by:
- Decoupling the relabeling process for the actor and the critic.
- Introducing random off-trajectory relabeling in conjunction with on-trajectory relabeling, which imposes a pessimism bias that helps learn a robust value function.
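The two points above can be sketched as follows, assuming achieved goals are stored both per episode and across the replay buffer. The function names, the `p_random` parameter, and the specific probabilities are illustrative assumptions, not the exact interface used in this repository.

```python
import numpy as np

def sample_relabeled_goal(episode_achieved, t, buffer_achieved, rng, p_random=0.3):
    """Relabel the goal of transition t with a mixed HER strategy (sketch).

    episode_achieved: (T, goal_dim) achieved goals of the transition's episode
    buffer_achieved:  (N, goal_dim) achieved goals across the whole replay buffer
    p_random:         probability of off-trajectory (random) relabeling
    """
    if rng.random() < p_random:
        # off-trajectory: a random achieved goal from the buffer, typically far from
        # this transition, which biases the critic's value estimates pessimistically
        return buffer_achieved[rng.integers(len(buffer_achieved))]
    # on-trajectory ("future" strategy): an achieved goal from a later step of the episode
    return episode_achieved[rng.integers(t, len(episode_achieved))]

def relabel_for_update(episode_achieved, t, buffer_achieved, rng):
    """Decoupled relabeling: independent goals for the critic and the actor update,
    here with random off-trajectory relabeling applied only on the critic side."""
    critic_goal = sample_relabeled_goal(episode_achieved, t, buffer_achieved,
                                        rng, p_random=0.3)
    actor_goal = sample_relabeled_goal(episode_achieved, t, buffer_achieved,
                                       rng, p_random=0.0)
    return critic_goal, actor_goal
```

Because the actor and critic goals are drawn independently, the critic can be trained with the pessimism-inducing random goals while the actor continues to be trained on reachable on-trajectory goals.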
The experimental evaluation is conducted on:
- Fetch Environments: FetchPush, FetchSlide, and FetchPickAndPlace, tasks that involve pushing, sliding, and pick-and-place operations with a 7-DoF robotic arm.
- HandManipulate Tasks: Involving complex manipulation with a simulated Shadow Dexterous Hand.
The experimental results indicate that:
- Sample Efficiency is Improved: Quasimetric critics based on IQE significantly outperform standard neural network critics.
- Learning Robustness is Enhanced: Mixed relabeling strategies that combine random and future goal relabeling yield more stable performance.
- Plots:
TODO add plots and videos
git clone https://github.com/pynator/QuasimetricRL.git
A Dockerfile is provided for a reproducible environment. To build and run the container use:
# Build the Docker image
docker build -t qrl .
# Run the Docker container
docker run -it --rm --gpus all -p 6006:6006 -v {path_to_project}/QuasimetricRL:/workspace qrl /bin/bash
If you prefer not to use Docker, you can install the required libraries using pip:
pip install torch
pip install mujoco==3.1.6
pip install git+https://github.com/Farama-Foundation/Gymnasium-Robotics.git
pip install tensorboard
pip install torch-tb-profiler
pip install moviepy
pip install wandb
To train a policy you can use the run.sh script.
./run.sh
To generate a video of the trained policy, use the video.sh script.
./video.sh
Interval Quasimetric Embedding (IQE) (Paper)
Interval Quasimetric Embedding (IQE) (Code - torchqmet)