Changes from all commits (75 commits)
9fac1da
I'm seriously not gonna split this
ClashLuke Jul 8, 2020
4870b92
feat: add concatenation script for railways
ClashLuke Jul 8, 2020
2c4e1c9
feat: add slightly improved dqn agent
ClashLuke Jul 8, 2020
9d8a8fc
perf: use torch only (no numpy)
ClashLuke Jul 8, 2020
6669ed3
feat: add baseline model
ClashLuke Jul 8, 2020
2e83da3
feat: train all agents at the same time
ClashLuke Jul 8, 2020
9fde258
feat: add attention
ClashLuke Jul 8, 2020
6c6664d
feat: improve speed
ClashLuke Jul 9, 2020
e0ac945
feat: process one batch at a time instead of caching
ClashLuke Jul 9, 2020
29fa930
perf: use torch for of observation_utils.py
ClashLuke Jul 9, 2020
4b40ae8
perf: remove all numpy
ClashLuke Jul 9, 2020
bc29151
perf: use iterative solution instead of recursive one (~+20%)
ClashLuke Jul 9, 2020
e3e0920
perf: add multiprocessing (+30%)
ClashLuke Jul 9, 2020
c966052
perf: init removal of redundant vars
ClashLuke Jul 10, 2020
3bf53d4
Revert "perf: init removal of redundant vars"
ClashLuke Jul 10, 2020
3425712
perf: cythonize normalizer
ClashLuke Jul 10, 2020
6a85249
perf: cythonize tree observation
ClashLuke Jul 10, 2020
9491429
perf: reduce model size, use single qn
ClashLuke Jul 10, 2020
0418a67
perf: cythonize rail_env
ClashLuke Jul 10, 2020
a55ffaf
style: pycharm reforamt
ClashLuke Jul 10, 2020
4c805d3
perf: add typehints to railenv
ClashLuke Jul 10, 2020
e7b6a9f
fix: re-add support for unitialized transition
ClashLuke Jul 10, 2020
0ff41d2
style: move cat to correct folder
ClashLuke Jul 10, 2020
4b3b666
perf: use ppo
ClashLuke Jul 11, 2020
18e5597
feat: give ppo network previous state
ClashLuke Jul 11, 2020
af8952a
feat: railway_utils.py increase agent count over time
ClashLuke Jul 13, 2020
a904cf4
feat(train): add comfort functions to train.py
ClashLuke Jul 13, 2020
cd6207f
fix(model): use dropout instead of bernoulli, init properly
ClashLuke Jul 13, 2020
9fd919f
feat(train): add cli args for width/agents of env
ClashLuke Jul 13, 2020
d284c57
feat(rail-generator): cythonize
ClashLuke Jul 13, 2020
e72fe0d
style(train): remove unused commandline arguments
ClashLuke Jul 13, 2020
5537b95
feat(railway-utils): use custom generator
ClashLuke Jul 13, 2020
413d571
perf(generate-railways): undo cythonizing
ClashLuke Jul 13, 2020
a953f13
style: remove legacy code
ClashLuke Jul 13, 2020
1cbabce
fix(model): message box for agents (not features)
ClashLuke Jul 13, 2020
e9cf607
feat: add global, local env
ClashLuke Jul 17, 2020
59e385e
feat: re-enable cuda (after reducing input size)
ClashLuke Jul 17, 2020
36c6e4a
feat: add readme to keep folder structure
ClashLuke Jul 17, 2020
af2ec5c
fix: add garbage to gitignore
ClashLuke Jul 17, 2020
9c4e4b3
style: major cleanup
ClashLuke Jul 18, 2020
ea376d4
fix: remove NaN's
ClashLuke Jul 18, 2020
3830353
perf(agent): add JIT, remove unnecessary list comprehension
ClashLuke Jul 19, 2020
e83378e
perf(model): use instancenorm, add message box
ClashLuke Jul 19, 2020
4c35628
style: remove unused variables
ClashLuke Jul 19, 2020
bc1cb82
perf(observation): use int8
ClashLuke Jul 19, 2020
cdacc24
perf(rail-env): remove type enforcing
ClashLuke Jul 19, 2020
e0b244f
perf(train): add support for np.ndarray observation
ClashLuke Jul 19, 2020
ae8073c
feat: add finish rate
ClashLuke Jul 19, 2020
f50e92f
style(observation-utils): add return_array parameter (backwards compa…
ClashLuke Jul 19, 2020
dbee685
perf(train): mildly improve sum iterator
ClashLuke Jul 19, 2020
b0107f6
fix(train): use correct variables for running stats
ClashLuke Jul 19, 2020
55c72d5
style(model): remove unused deps
ClashLuke Jul 19, 2020
78936f9
perf(agent): first calculate divisor (+cleanup), then do big op
ClashLuke Jul 20, 2020
f695f65
perf(model): fix stride
ClashLuke Jul 20, 2020
325ae4e
fix(train): re-add support for model loading
ClashLuke Jul 20, 2020
3759e77
fix(cythonize): use gcc9 instead of 7 (9 doens't work with cuda10)
ClashLuke Jul 20, 2020
a11a470
perf(rail_env): remove rtol for 0 values
ClashLuke Jul 20, 2020
c2db1e5
feat(train): improve the interface by adding better outputs, _always_…
ClashLuke Jul 20, 2020
ae20951
feat(train): remove cross-batch mean calculation
ClashLuke Jul 21, 2020
696da85
feat: add global-state model (globalobs + agent-state)
ClashLuke Jul 24, 2020
c10860b
perf(agent): enforce sparse gradient (experiment)
ClashLuke Jul 24, 2020
c46b1c5
feat(interface): move depth-1
ClashLuke Jul 24, 2020
6ee0008
feat(interface): move depth-1
ClashLuke Jul 25, 2020
50503ea
perf(cython): add header instructions
ClashLuke Jul 25, 2020
7a69bbf
perf(model): remove unused variable
ClashLuke Jul 25, 2020
e0da5f2
perf(agent): use np.where instead of list-list-list-comprehension
ClashLuke Jul 25, 2020
dfe0a18
perf(agent): remove dead constants
ClashLuke Jul 25, 2020
7e3a92b
perf(model): f(rail) at every step
ClashLuke Jul 25, 2020
5962a31
feat(railway_utils): adapt new file schema
ClashLuke Jul 25, 2020
110fbc8
perf(model): add attention
ClashLuke Jul 25, 2020
eb1b7b4
fix(agent): step every n steps, not every n data types
ClashLuke Jul 25, 2020
32dc07c
style(train): improve maintainability
ClashLuke Jul 26, 2020
899e03f
feat(agent): readd dqn
ClashLuke Jul 26, 2020
4a6077f
fix(agent): readd replay buffer, loss functions
ClashLuke Jul 26, 2020
bb9db64
feat(agent): improve maintainability, add naf
ClashLuke Jul 27, 2020
21 changes: 21 additions & 0 deletions .gitignore
@@ -9,3 +9,24 @@ venv/
.DS_Store

*.pyc
/src/a.out
/index.html
/checkpoints/dqn/dqn0/loss.txt
/.idea/**
**model_checkpoint*
/.idea/modules.xml
/src/obs.so
/src/observation_utils.c
/src/observation_utils.h
/src/observation_utils.so
/.idea/other.xml
/.idea/inspectionProfiles/profiles_settings.xml
/.idea/inspectionProfiles/Project_Default.xml
/r/rail_networks_35x40x40.pkl
/r/rail_networks_sum.pkl
/r/schedules_35x40x40.pkl
/r/schedules_sum.pkl
/.idea/vcs.xml
*.c
*.so
*.out
77 changes: 57 additions & 20 deletions README.md
@@ -1,33 +1,70 @@
# flatland-training

This repo contains an optimized version of flatland-rl's `flatland.envs.observations.TreeObsForRailEnv`. Tree-based observations allow RL models to learn much more quickly than the global observations do, but flatland's built-in TreeObsForRailEnv is kind of slow, so I wrote a faster version! This repo also contains an optimized version of [https://gitlab.aicrowd.com/flatland/baselines/blob/master/utils/observation_utils.py](https://gitlab.aicrowd.com/flatland/baselines/blob/master/utils/observation_utils.py), which flattens and normalizes the tree observations into 1D numpy arrays that can be passed to a feed-forward network.
PyTorch solution for [flatland-2020](https://www.aicrowd.com/challenges/neurips-2020-flatland-challenge/)

## Implementation

# Setup
## Create venv
`python3.7 -m venv venv`\
`source venv/bin/activate`
This repository contains three major modules.

Verify python version is correct with: `python -V`\
Should return `Python 3.7.something`
### Getting Started

## Install Requirements
`pip install -r requirements.txt`
#### Setup

Before following along, please note that there is an `install.sh` script which executes all of the commands below.\
First, create a virtual environment using `python3.7 -m venv venv && source venv/bin/activate`.\
Then install the requirements with `python3 -m pip install -r requirements.txt`.\
It is recommended to verify that the installation was successful by first checking the Python version and then attempting an import of all required non-standard packages.
```bash
$ python3 --version
Python 3.7.6
$ python3 -c "import torch, torch_optimizer, numpy, cython, flatland, gym, tqdm; print('Successfully imported packages')"
Successfully imported packages
```
Lastly comes perhaps the most crucial step: compilation requires `gcc-7`, as no other version works. On Debian/Ubuntu, it can be installed with the apt package manager by running `apt install gcc-7`.\
Once that's done, the Python code can be compiled using Cython by moving into the source folder and executing cythonize.sh: `cd src && bash cythonize.sh`.
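
For reference, the build step boils down to compiling the `.pyx` modules in place with Cython while forcing `gcc-7` as the compiler. A rough sketch under those assumptions — the authoritative commands live in `src/cythonize.sh`:

```bash
# Illustrative only; the real build commands are in src/cythonize.sh.
cd src
CC=gcc-7 cythonize -i observation_utils.pyx rail_env.pyx   # build the Cython extensions in place
```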

# Generate Railways
This script will precompute a bunch of railway maps to make training faster.\
#### Generate Environments

`python src/generate_railways.py`
For better training performance, the environments used to train the network can optionally be generated _before_ training. This makes training much faster, since training data doesn't have to be regenerated repeatedly but is instead loaded once at startup.\
To generate the environments and their respective railways, run `python3 src/generate_railways.py --width 50`, which creates 50x50 grids of cities, rails and trains.

This will run for quite a long time, go get some tea...\
It's also fine to stop it after at least one round completes if you just want to test things out and make sure they run.\
If you don't care about the speedup, you can run `python src/train.py --load-railways=False` to generate railways on the fly during training instead.
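
Putting the two options side by side (commands taken from the instructions above; the exact spelling of `--load-railways` may differ between versions, so check `--help`):

```bash
# Option 1: pre-generate railways once, then train on the cached environments.
python3 src/generate_railways.py --width 50
python3 src/train.py

# Option 2: skip pre-generation and build railways on the fly (slower, but no waiting upfront).
python3 src/train.py --load-railways=False
```
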
#### Run Training

Finally, it's time to train the model. You can do so by running `python3 src/train.py`, which will train a basic CNN using the "local observation" method. ![https://flatland.aicrowd.com/getting-started/env.html](https://i.imgur.com/oo8EIYv.png)\
Global and tree observations are implemented as well, along with many configurable model parameters. To find out more about them, add the `--help` flag.

# Run Training
`python src/train.py`
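
As an illustration of the kind of switches exposed — the flag names below are assumptions based on the commit history, and `--help` prints the authoritative list:

```bash
python3 src/train.py --help                 # list every model and environment parameter
python3 src/train.py --width 40 --agents 4  # hypothetical flag names for grid size and agent count
```
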
### Structure

This will begin training one or more agents in the flatland environment.\
This file has a lot of parameters that can be set to do different things.\
To see all the options, use the `--help` command line argument.
Currently, the code is structured as a few large, monolithic files.

| Name | Description |
|----|----|
| agent.py | Reward and training algorithm, as well as some hyperparameters (such as batch size and learning rate) |
| generate_railways.py | Script to pre-compute and generate railways from the command line. See [#Generate Environments](#Generate-Environments) |
| model.py | PyTorch definition of the tree-observation and local-observation models |
| observation_utils.pyx | Agent observation utilities called by the environment to create training observations |
| rail_env.pyx | Cython port of flatland-rl's RailEnv |
| railway_utils.py | Utility script handling creation of, and iteration over, railways |
| train.py | Core training loop |

### Future Work

The current implementation has many holes. One of them is the very poor performance when controlling many (>10) agents at once.\
We are tackling this issue from multiple sides at once. If you would like to join the team, open an issue or a pull request, or join us on [Discord](https://discord.gg/mP72wbE).\
Our current approaches are listed below, followed by a short PPO sketch:

* **Observation, Model**:
* Tree observation, graph neural networks
* Tree observation, fully-connected networks
* Tree observation, transformer
* Local observation, cnn
* Global observation, cnn
* **Teaching algorithm**:
* PPO
* (Double-) DQN
* **Misc. Freebies**:
* Epsilon-Greedy
* Multiprocessing
* Inter-agent communication
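
For readers new to the teaching algorithms above, the core of the PPO update used in `agent.py` is the clipped probability ratio. The snippet below is a simplified, self-contained sketch; the real `_ppo_loss` works on tuple observations and uses a max-then-sum aggregation instead of the mean:

```python
import torch

def ppo_clip_loss(new_probs, old_probs, advantages, clip=0.2):
    """Clipped PPO objective, returned as a loss to minimize."""
    ratio = new_probs / (old_probs + 1e-5)                  # how much the new policy favors the taken actions
    clamped = torch.clamp(ratio, 1.0 - clip, 1.0 + clip)    # keep the update inside the trust region
    return -torch.min(ratio * advantages, clamped * advantages).mean()

# Toy usage with random action probabilities and advantages.
print(ppo_clip_loss(torch.rand(8), torch.rand(8), torch.randn(8)))
```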

If you are working on one of these tasks or would like to, please open an issue or pull request to let others know about it. Once reviewed, it will be added to the main repository.
1 change: 1 addition & 0 deletions checkpoints/dqn/dqn0/README.md
@@ -0,0 +1 @@
DQN checkpoints will be saved here
1 change: 0 additions & 1 deletion checkpoints/ppo/README.md

This file was deleted.

16 changes: 16 additions & 0 deletions install.sh
@@ -0,0 +1,16 @@
#!/usr/bin/env bash

function check_python {
python3 -c "import torch, torch_optimizer, numpy, cython, flatland, gym, tqdm; print('Successfully imported packages')" 2>/dev/null\
|| (python3.7 -m venv venv && source venv/bin/activate && python3 -m pip install -r requirements.txt) || check_python
}

check_python

echo "Checking for GCC-7"

gcc-7 --version > /dev/null || sudo apt install gcc-7

echo "Compiling source"
cd src && source cythonize.sh > /dev/null 2>/dev/null
cd ..
1 change: 0 additions & 1 deletion railroads/README.md

This file was deleted.

15 changes: 7 additions & 8 deletions requirements.txt
@@ -1,8 +1,7 @@
argparse==1.4.0
flatland-rl==2.2.1
numpy==1.18.5
torch==1.5.0
opencv-python==4.2.0.34
Pillow==7.1.2
tqdm==4.46.1
tensorboardX==2.0
torch
torch-optimizer
numpy
cython
flatland-rl
gym
tqdm
224 changes: 224 additions & 0 deletions src/agent.py
@@ -0,0 +1,224 @@
import math
import pickle
import typing

import numpy as np
import torch
from torch_optimizer import Yogi as Optimizer

try:
    from .model import QNetwork, ConvNetwork, init, GlobalStateNetwork, TripleClassificationHead, NAFHead
except ImportError:
    from model import QNetwork, ConvNetwork, init, GlobalStateNetwork, TripleClassificationHead, NAFHead
import os

BATCH_SIZE = 256
CLIP_FACTOR = 0.2
LR = 1e-4
UPDATE_EVERY = 1
CUDA = True
MINI_BACKWARD = False
DQN_TAU = 1e-3
EPOCHS = 1

device = torch.device("cuda:0" if CUDA and torch.cuda.is_available() else "cpu")


@torch.jit.script
def aggregate(loss: torch.Tensor):
    # Take the elementwise maximum over the first dimension, then sum the remaining entries.
    maximum, _ = loss.max(0)
    return maximum.sum()


@torch.jit.script
def mse(in_x, in_y):
    return aggregate((in_x - in_y).square())


@torch.jit.script
def dqn_target(rewards, targets_next, done):
    # Bellman target with a discount factor of 0.998; `done` masks out bootstrapping for finished agents.
    return rewards + 0.998 * targets_next * (1 - done)


class Agent(torch.nn.Module):
    def __init__(self, state_size, action_size, model_depth, hidden_factor, kernel_size, squeeze_heads, decoder_depth,
                 model_type=0, softmax=True, debug=True, loss_type='PPO'):
        super(Agent, self).__init__()
        self.action_size = action_size

        # Q-Network
        if model_type == 1:  # Global/Local
            network = ConvNetwork
        elif model_type == 0:  # Tree
            network = QNetwork
        else:  # Global State
            network = GlobalStateNetwork
        if loss_type in ('PPO', 'DQN'):
            tail = TripleClassificationHead(hidden_factor, action_size)
        else:
            tail = NAFHead(hidden_factor, action_size)
        self.policy = network(state_size,
                              hidden_factor,
                              model_depth,
                              kernel_size,
                              squeeze_heads,
                              decoder_depth,
                              tail=tail,
                              softmax=softmax).to(device)
        self.old_policy = network(state_size,
                                  hidden_factor,
                                  model_depth,
                                  kernel_size,
                                  squeeze_heads,
                                  decoder_depth,
                                  tail=tail,
                                  softmax=softmax,
                                  debug=False).to(device)
        if debug:
            print(self.policy)

            parameters = sum(np.prod(p.size()) for p in filter(lambda p: p.requires_grad, self.policy.parameters()))
            digits = int(math.log10(parameters))
            number_string = " kMGTPEZY"[digits // 3]

            print(f"[DEBUG/MODEL] Training with {parameters * 10 ** -(digits // 3 * 3):.1f}"
                  f"{number_string} parameters")
        self.policy.apply(init)
        try:
            self.policy = torch.jit.script(self.policy)
            self.old_policy = torch.jit.script(self.old_policy)
        except Exception:
            import traceback
            traceback.print_exc()
            print("NO JIT")
        self.old_policy.load_state_dict(self.policy.state_dict())
        self.optimizer = Optimizer(self.policy.parameters(), lr=LR, weight_decay=1e-2)

        # Replay memory
        self.stack = [[] for _ in range(6)]
        self.t_step = 0
        self.idx = 1
        self.tensor_stack = []
        self._policy_update = loss_type in ("PPO",)
        self._soft_update = loss_type in ("DQN", "NAF")

        self._action_index = torch.zeros(1)
        self._value_index = torch.zeros(1) + 1
        self._triangular_index = torch.zeros(1) + 2

        self.loss = getattr(self, f'_{loss_type.lower()}_loss')

    def _dqn_loss(self, states, actions, next_states, rewards, done):
        # Double DQN: the online policy selects the best next action, the target network evaluates it.
        actions = actions.argmax(1)
        expected = self.policy(self._action_index, self._action_index, *states).gather(1, actions)
        best_action = self.policy(self._action_index, self._action_index, *next_states).argmax(1)
        targets_next = self.old_policy(self._action_index, self._action_index,
                                       *next_states).gather(1, best_action.unsqueeze(1))
        targets = dqn_target(rewards, targets_next, done)
        loss = mse(expected, targets)
        return loss

    def _naf_loss(self, states, actions, next_states, rewards, done):
        targets_next = self.old_policy(self._value_index, self._action_index, next_states)
        state_action_values = self.policy(self._triangular_index, actions, states)
        targets = dqn_target(rewards, targets_next, done)
        loss = mse(state_action_values, targets)
        return loss

    def _ppo_loss(self, states, actions, next_states, rewards, done):
        _ = next_states
        _ = done
        actions = actions.argmax(1)
        states_clone = [st.clone().detach().requires_grad_(False) for st in states]
        old_responsible_outputs = self.old_policy(self._action_index, self._action_index,
                                                  *states_clone).gather(1, actions).detach_()
        responsible_outputs = self.policy(self._action_index, self._action_index, *states).gather(1, actions)
        ratio = responsible_outputs / (old_responsible_outputs + 1e-5)
        clamped_ratio = torch.clamp(ratio, 1. - CLIP_FACTOR, 1. + CLIP_FACTOR)
        loss = aggregate(torch.min(ratio * rewards, clamped_ratio * rewards)).neg()
        return loss

    def reset(self):
        self.policy.reset_cache()
        self.old_policy.reset_cache()

    def multi_act(self, state, argmax_only=True) -> typing.Union[typing.Tuple[np.ndarray, np.ndarray], np.ndarray]:
        self.policy.eval()
        with torch.no_grad():
            action_values = self.policy(self._action_index, self._action_index, *state).detach()
            argmax = action_values.argmax(1).cpu().numpy()
        if argmax_only:
            return argmax
        return action_values, argmax

    def step(self, state, action, agent_done, collision, next_state, step_reward=0, collision_reward=-2):
        agent_count = len(agent_done[0]) - 1
        self.stack[0].append(state)
        self.stack[1].append(action)
        self.stack[2].append([[done[idx] for idx in range(agent_count)] for done in agent_done])
        self.stack[3].append(collision)
        self.stack[4].append(next_state)

        if MINI_BACKWARD or len(self.stack[0]) >= UPDATE_EVERY:
            action = torch.cat(self.stack[1]).to(device).unsqueeze_(1)
            agent_done = np.array(self.stack[2])
            collision = np.array(self.stack[3])
            # Reward shaping: +1 for finished agents, collision_reward on collision, step_reward otherwise.
            reward = np.where(agent_done, 1, np.where(collision, collision_reward, step_reward))
            reward = torch.tensor(reward, device=device, dtype=torch.float).flatten(0, 1).unsqueeze_(1)
            state = tuple(torch.cat(st, 0) for st in zip(*self.stack[0]))
            next_state = tuple(torch.cat(st, 0) for st in zip(*self.stack[4]))
            agent_done = torch.as_tensor(agent_done, device=device, dtype=torch.int8).flatten(0, 1).unsqueeze_(1)
            self.stack = [[] for _ in range(6)]
            self.tensor_stack.append((state, action, reward, next_state, agent_done))
            if len(self.tensor_stack) >= EPOCHS:
                tensor_stack = (torch.cat(t, 0) if isinstance(t[0], torch.Tensor)
                                else tuple(torch.cat(sub_t, 0) for sub_t in zip(*t))
                                for t in zip(*self.tensor_stack))
                del self.tensor_stack[0]
                self.learn(*tensor_stack)

    def learn(self, states, actions, rewards, next_states, done):
        if MINI_BACKWARD:
            self.idx = (self.idx + 1) % UPDATE_EVERY

        self.policy.train()

        loss = self.loss(states, actions, next_states, rewards, done)

        if MINI_BACKWARD:
            loss = loss / UPDATE_EVERY

        loss.backward()

        if not MINI_BACKWARD or self.idx == 0:
            if self._policy_update:
                self.old_policy.load_state_dict(self.policy.state_dict())
            self.optimizer.step()
            self.optimizer.zero_grad()
            if self._soft_update:
                # Polyak averaging of the target network towards the online network.
                for target_param, local_param in zip(self.old_policy.parameters(), self.policy.parameters()):
                    target_param.data.copy_(DQN_TAU * local_param.data +
                                            (1.0 - DQN_TAU) * target_param.data)

    def save(self, path, *data):
        torch.save(self.policy.state_dict(), path / 'dqn/model_checkpoint.local')
        torch.save(self.old_policy.state_dict(), path / 'dqn/model_checkpoint.target')
        torch.save(self.optimizer.state_dict(), path / 'dqn/model_checkpoint.optimizer')
        with open(path / 'dqn/model_checkpoint.meta', 'wb') as file:
            pickle.dump(data, file)

    def load(self, path, *defaults):
        loc = {} if torch.cuda.is_available() else {'map_location': torch.device('cpu')}
        try:
            print("Loading model from checkpoint...")
            dqn = os.path.join(path, 'dqn')
            self.policy.load_state_dict(torch.load(os.path.join(dqn, 'model_checkpoint.local'), **loc))
            self.old_policy.load_state_dict(torch.load(os.path.join(dqn, 'model_checkpoint.target'), **loc))
            self.optimizer.load_state_dict(torch.load(os.path.join(dqn, 'model_checkpoint.optimizer'), **loc))
            with open(os.path.join(dqn, 'model_checkpoint.meta'), 'rb') as file:
                return pickle.load(file)
        except Exception as exc:
            import traceback
            traceback.print_exc()
            print(f"Got exception {exc} loading model data. Possibly no checkpoint found.")
            return defaults