Offline RL on cellworld_gym
We have 100,000 data points in the offline dataset.
We evaluate for 100 episodes; the reward is -1 when the agent is captured and +1 on success. The mean rewards are as follows:
- BC: -17.42
- CQL: -150.08
- IQL: -29.06
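For reference, a minimal sketch of what such an evaluation loop could look like (assuming a Gymnasium-style cellworld_gym env; `policy.get_action` is a placeholder for the agent's inference call, not the actual project API):

```python
import numpy as np

def evaluate(env, policy, n_episodes=100):
    """Return the mean episodic reward of `policy` over `n_episodes`."""
    returns = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        done, episode_return = False, 0.0
        while not done:
            action = policy.get_action(state)  # placeholder inference call
            state, reward, terminated, truncated, _ = env.step(action)
            episode_return += reward           # -1 on capture, +1 on success, per the setup above
            done = terminated or truncated
        returns.append(episode_return)
    return float(np.mean(returns))
```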
It seems I have identified an intriguing case where Behavior Cloning (BC) is outperforming other offline RL methods, like CQL and Implicit Q-learning, which are designed to handle suboptimal datasets. This suggests that something unusual may be happening with the offline RL algorithms in your setting, potentially due to issues with how pessimistic they are when evaluating and learning from the dataset.
To investigate this further, comparing the Q-values (which represent the expected future reward from a given state-action pair) between your offline RL policies and a less pessimistic baseline (like BC) could shed light on why this is happening. Here's a detailed plan for how to approach this:
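As a rough sketch of that comparison (the `q_value` method and the transition format are hypothetical; substitute whatever critic access each library actually exposes), one could sample state-action pairs from the offline dataset and log each agent's Q-estimates:

```python
import numpy as np

def compare_q_values(transitions, agents, n_samples=1000, seed=0):
    """Mean/std of Q-value estimates per agent on (state, action) pairs
    drawn from the offline dataset.

    `transitions` is assumed to be a sequence of (state, action, ...) tuples;
    `agents` maps a name (e.g. "BC", "CQL", "IQL") to an object with a
    hypothetical `q_value(state, action)` method.
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(transitions), size=min(n_samples, len(transitions)), replace=False)
    stats = {}
    for name, agent in agents.items():
        qs = [agent.q_value(transitions[i][0], transitions[i][1]) for i in idx]
        stats[name] = (float(np.mean(qs)), float(np.std(qs)))
    # Systematically lower means for CQL/IQL would point to over-pessimistic value estimates.
    return stats
```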
KL divergence (per method):
- IQL: 7.569894318469576e-07
- BC: 1.4248751999871712e-07
- DQN: 1.1685386452255908
- QRDQN: 1.1869605695309853
- Dreamer-v3: 1.1931324363223705
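The notes do not say which two distributions the KL is taken between; assuming it is the divergence between the dataset's (behavior) action distribution and each policy's action distribution at the same states, a sketch could look like this (hypothetical helpers, not the project API):

```python
import numpy as np

def mean_kl_to_behavior(states, behavior_probs, policy_probs_fn, eps=1e-8):
    """Mean KL(behavior || policy) over a batch of states.

    `behavior_probs[i]` is the empirical action distribution of the dataset
    at `states[i]`; `policy_probs_fn(state)` returns the policy's action
    probabilities. Both are assumed to be discrete distributions over the
    same action set.
    """
    kls = []
    for state, p in zip(states, behavior_probs):
        q = policy_probs_fn(state)
        p = np.clip(np.asarray(p, dtype=float), eps, 1.0)
        q = np.clip(np.asarray(q, dtype=float), eps, 1.0)
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))
```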
Train for 100 epochs, and evaluate for 20 episodes after each epoch.
After training, plot the results:
- There is a planning rate and an RL rate. At the starting point, the planning rate is 100% and the RL rate is 0%.
- We are using offline RL methods such as BC or Implicit Q-learning. During training, we gradually reduce the planning rate and move the agent completely to RL-based control.
- Mechanisms to reduce the planning rate: the simplest are linear decay and exponential decay; performance-based decay is another option (see the sketches below).
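Possible implementations of those schedules (a sketch; the decay horizon, rate, and success-rate threshold below are illustrative values, not tuned ones):

```python
def linear_decay(episode, decay_episodes, start=1.0, end=0.0):
    # Planning rate falls linearly from `start` to `end` over `decay_episodes`.
    frac = min(episode / decay_episodes, 1.0)
    return start + frac * (end - start)

def exponential_decay(episode, decay_rate=0.995, start=1.0, end=0.0):
    # Planning rate decays geometrically from `start` toward `end`.
    return end + (start - end) * (decay_rate ** episode)

def performance_based_decay(current_rate, recent_success_rate,
                            threshold=0.6, step=0.05, end=0.0):
    # Hand over control only when the RL agent's recent success rate
    # clears a threshold; otherwise keep the planner in charge.
    if recent_success_rate >= threshold:
        return max(end, current_rate - step)
    return current_rate
```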
import random

initial_planning_rate = 1.0
final_planning_rate = 0.0
planning_rate = initial_planning_rate  # decays from initial_planning_rate to final_planning_rate over training

# initialize the buffer with tlppo trajectories
replay_buffer = collect_start_data(data_points=10000)

# train
for episode in range(total_episodes):
    state, _ = env.reset()
    done = False
    while not done:
        # mix planner and RL actions according to the current planning rate
        if random.uniform(0, 1) < planning_rate:
            action = tlppo.get_action(state)             # planner action (method names illustrative)
        else:
            action = offline_rl_agent.get_action(state)  # offline RL action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # open question: should we store every transition here, or only the planner's transitions?
        replay_buffer.add((state, action, reward, next_state, done))
        state = next_state
    offline_rl_agent.train(replay_buffer)
    # decay the planning rate toward final_planning_rate (linear decay shown; any schedule above works)
    planning_rate = max(final_planning_rate,
                        planning_rate - (initial_planning_rate - final_planning_rate) / total_episodes)
The planner is still not ready, so I used a DQN agent with a success rate of about 60% in its place.

