
Cellworld_Offline

Offline RL on cellworld_gym

Mouse data

We have 100,000 data points.


BC / CQL / IQL

Evaluate for 100 episodes, with reward -1 when captured and +1 on success. The mean rewards are as follows:

  • BC: -17.42

  • CQL: -150.08

  • IQL: -29.06
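A minimal sketch of this evaluation loop, assuming a gymnasium-style cellworld_gym environment and a policy object with a get_action method (the interface names here are placeholders, not necessarily the repo's API):

import numpy as np

def evaluate_policy(env, policy, n_episodes=100):
    # Roll the policy out and average the episode returns.
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy.get_action(state)  # placeholder policy interface
            state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated
            episode_return += reward
        returns.append(episode_return)
    return float(np.mean(returns))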

This is an intriguing case: Behavior Cloning (BC) outperforms offline RL methods such as CQL and Implicit Q-learning (IQL), which are designed to handle suboptimal datasets. This suggests something unusual is happening with the offline RL algorithms in this setting, potentially due to how pessimistic they are when evaluating and learning from the dataset.

To investigate this further, compare the Q-values (the expected future reward from a given state-action pair) of the offline RL policies against a less pessimistic baseline such as BC.
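A rough sketch of that comparison, assuming the dataset is available as arrays of states and actions and that each method exposes some q_value(state, action) estimate (BC has no Q-function of its own, so its estimate would have to come from a separate evaluator such as fitted Q evaluation; all names below are placeholders):

import numpy as np

def compare_q_values(dataset_states, dataset_actions, agents):
    # agents: dict mapping a label (e.g. "CQL", "IQL", "BC+FQE") to an object
    # with a hypothetical q_value(state, action) method.
    # Overly pessimistic offline RL agents tend to assign much lower Q-values
    # to the dataset's own state-action pairs.
    stats = {}
    for name, agent in agents.items():
        qs = np.array([agent.q_value(s, a)
                       for s, a in zip(dataset_states, dataset_actions)])
        stats[name] = {"mean_q": qs.mean(), "min_q": qs.min(), "max_q": qs.max()}
    return stats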


KL divergence

  • IQL: 7.569894318469576e-07
  • BC: 1.4248751999871712e-07
  • DQN: 1.1685386452255908
  • QRDQN: 1.1869605695309853
  • Dreamer-v3: 1.1931324363223705
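For reference, this is roughly how such a KL can be computed for a discrete action space, assuming it compares empirical action distributions (e.g. the mouse data's actions vs. a policy's actions over the same states); that is my reading of the numbers above, and the helper names are placeholders:

import numpy as np

def action_distribution(actions, n_actions):
    # Empirical distribution over a discrete action space, smoothed to avoid zeros.
    counts = np.bincount(actions, minlength=n_actions).astype(float) + 1e-8
    return counts / counts.sum()

def kl_divergence(p, q):
    # KL(p || q) for two discrete distributions.
    return float(np.sum(p * np.log(p / q)))

# kl = kl_divergence(action_distribution(mouse_actions, n_actions),
#                    action_distribution(policy_actions, n_actions))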

Improving Offline Learning


Train for 100 epochs, and evaluate for 20 episodes after each epoch.
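A sketch of that schedule, reusing the evaluate_policy helper from the earlier sketch and assuming a hypothetical agent with a train_epoch(dataset) and get_action(state) interface:

n_epochs, eval_episodes = 100, 20
eval_curve = []
for epoch in range(n_epochs):
    agent.train_epoch(dataset)  # hypothetical one-epoch offline update
    eval_curve.append(evaluate_policy(env, agent, n_episodes=eval_episodes))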


(Figure: plot after training.)

Discrete to continuous


Offline RL Implementation with Planner-Greedy Exploration

  • There is a planning rate and an RL rate. At the starting point, the planning rate is 100% and the RL rate is 0%.

  • We use offline RL methods such as BC or Implicit Q-learning. During training, we gradually reduce the planning rate and move the agent completely to RL-based control.

  • A mechanism to reduce the planning rate: the simplest options are linear decay and exponential decay; performance-based decay is another. (Sketches of these schedules follow the pseudocode below.)

import random

initial_planning_rate = 1.0
final_planning_rate = 0.0
planning_rate = initial_planning_rate  # annealed from initial -> final

# initialize the buffer with TLPPO trajectories
replay_buffer = collect_start_data(data_points=10000)

# train
for episode in range(total_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        if random.uniform(0, 1) < planning_rate:
            action = tlppo.get_action(state)  # planner action (method name is a placeholder)
        else:
            action = offline_rl_agent.get_action(state)  # offline RL action (placeholder)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # open question: save every transition, or only the planner's transitions?
        replay_buffer.add((state, action, reward, next_state, done))
        state = next_state
    offline_rl_agent.train(replay_buffer)
    # linear decay of the planning rate (alternative schedules sketched below)
    planning_rate = max(final_planning_rate,
                        planning_rate - (initial_planning_rate - final_planning_rate) / total_episodes)
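Possible implementations of the decay schedules mentioned above (a sketch, not the repo's actual code; the performance-based variant assumes a recent success rate is being tracked):

def linear_decay(episode, total_episodes, start=1.0, end=0.0):
    # Anneal the planning rate linearly from start to end.
    frac = min(episode / max(total_episodes - 1, 1), 1.0)
    return start + (end - start) * frac

def exponential_decay(episode, rate=0.995, start=1.0, end=0.0):
    # Multiplicative decay toward the final planning rate.
    return max(end, start * rate ** episode)

def performance_based_decay(current_rate, recent_success_rate,
                            target=0.6, step=0.05, end=0.0):
    # Hand control to the RL policy only once it performs well enough.
    if recent_success_rate >= target:
        return max(end, current_rate - step)
    return current_rate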

The planner is still not ready, so I used a DQN agent with a success rate of about 60% in its place.

Expert start

(Figure: expert start.)

Random start

(Figure: random start.)
