Offline RL on cellworld_gym
We have 100,000 data points in the offline dataset.
We evaluate for 100 episodes; the reward is -1 when the agent is captured and +1 on success. The mean rewards are as follows:
- BC: -17.42
- CQL: -150.08
- IQL: -29.06
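For reference, a minimal sketch of what such an evaluation loop could look like (assuming a Gymnasium-style cellworld_gym env; `policy.get_action` is a placeholder for the agent's inference call, not the actual project API):

```python
import numpy as np

def evaluate(env, policy, n_episodes=100):
    """Return the mean episodic reward of `policy` over `n_episodes`."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy.get_action(state)  # placeholder inference call
            state, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward           # -1 on capture, +1 on success, per the setup above
            done = terminated or truncated
        returns.append(episode_return)
    return float(np.mean(returns))
```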
It seems I have identified an intriguing case where Behavior Cloning (BC) is outperforming other offline RL methods, like CQL and Implicit Q-learning, which are designed to handle suboptimal datasets. This suggests that something unusual may be happening with the offline RL algorithms in your setting, potentially due to issues with how pessimistic they are when evaluating and learning from the dataset.
To investigate this further, comparing the Q-values (which represent the expected future reward from a given state-action pair) between your offline RL policies and a less pessimistic baseline (like BC) could shed light on why this is happening. Here's a detailed plan for how to approach this:
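As a rough sketch of that comparison (the `q_value` method and the transition format are hypothetical; substitute whatever critic access each library actually exposes), one could sample state-action pairs from the offline dataset and log each agent's Q-estimates:

```python
import numpy as np

def compare_q_values(transitions, agents, n_samples=1000, seed=0):
    """Mean/std of Q-value estimates per agent on (state, action) pairs
    drawn from the offline dataset.

    `transitions` is assumed to be a sequence of (state, action, ...) tuples;
    `agents` maps a name (e.g. "BC", "CQL", "IQL") to an object with a
    hypothetical `q_value(state, action)` method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(transitions), size=min(n_samples, len(transitions)), replace=False)
    stats = {}
    for name, agent in agents.items():
        qs = [agent.q_value(transitions[i][0], transitions[i][1]) for i in idx]
        stats[name] = (float(np.mean(qs)), float(np.std(qs)))
    # Systematically lower means for CQL/IQL would point to over-pessimistic value estimates.
    return stats
```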
KL divergence (per method):
- IQL: 7.569894318469576e-07
- BC: 1.4248751999871712e-07
- DQN: 1.1685386452255908
- QRDQN: 1.1869605695309853
- Dreamer-v3: 1.1931324363223705
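The notes do not say which two distributions the KL is taken between; assuming it is the divergence between the dataset's (behavior) action distribution and each policy's action distribution at the same states, a sketch could look like this (hypothetical helpers, not the project API):

```python
import numpy as np

def mean_kl_to_behavior(states, behavior_probs, policy_probs_fn, eps=1e-8):
    """Mean KL(behavior || policy) over a batch of states.

    `behavior_probs[i]` is the empirical action distribution of the dataset
    at `states[i]`; `policy_probs_fn(state)` returns the policy's action
    probabilities. Both are assumed to be discrete distributions over the
    same action set.
    """
    kls = []
    for state, p in zip(states, behavior_probs):
        q = policy_probs_fn(state)
        p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
        q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))
```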
Train for 100 epochs, and evaluate for 20 episodes after each epoch.
After training, plot the results:
- There is a planning rate and an RL rate. At the starting point, the planning rate is 100% and the RL rate is 0%.
- We are using offline RL methods such as BC or Implicit Q-learning. During training, we gradually reduce the planning rate and move the agent completely to RL-based control.
- Mechanisms to reduce the planning rate: the simplest are linear decay and exponential decay; performance-based decay is another option (see the sketches below).
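Possible implementations of those schedules (a sketch; the decay horizon, rate, and success-rate threshold below are illustrative values, not tuned ones):

```python
def linear_decay(episode, decay_episodes, start=1.0, end=0.0):
    # Planning rate falls linearly from `start` to `end` over `decay_episodes`.
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

def exponential_decay(episode, decay_rate=0.995, start=1.0, end=0.0):
    # Planning rate decays geometrically from `start` toward `end`.
    return end + (start - end) * (decay_rate ** episode)

def performance_based_decay(current_rate, recent_success_rate,
                            threshold=0.6, step=0.05, end=0.0):
    # Hand over control only when the RL agent's recent success rate
    # clears a threshold; otherwise keep the planner in charge.
    if recent_success_rate >= threshold:
        return max(end, current_rate - step)
    return current_rate
```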
import random

initial_planning_rate = 1.0
final_planning_rate = 0.0
planning_rate = initial_planning_rate  # decays from initial_planning_rate to final_planning_rate over training

# initialize the buffer with tlppo trajectories
replay_buffer = collect_start_data(data_points=10000)

# train
for episode in range(total_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # mix planner and RL actions according to the current planning rate
        if random.uniform(0, 1) < planning_rate:
            action = tlppo.get_action(state)             # planner action (method names illustrative)
        else:
            action = offline_rl_agent.get_action(state)  # offline RL action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # open question: should we store every transition here, or only the planner's transitions?
        replay_buffer.add((state, action, reward, next_state, done))
        state = next_state
    offline_rl_agent.train(replay_buffer)
    # decay the planning rate toward final_planning_rate (linear decay shown; any schedule above works)
    planning_rate = max(final_planning_rate,
                        planning_rate - (initial_planning_rate - final_planning_rate) / total_episodes)
The planner is still not ready, so I used a DQN agent with a success rate of about 60% in its place.

