10 changes: 5 additions & 5 deletions .github/workflows/pypi.yml
@@ -25,12 +25,12 @@ jobs:
java-version: '8.x' # The JDK version to make available on the path.
java-package: jdk # (jre, jdk, or jdk+fx) - defaults to jdk
architecture: x64 # (x64 or x86) - defaults to x64
- name: Build microrts
run: bash build.sh
- name: Run image
uses: abatilo/actions-poetry@v2.0.0
- name: Install Poetry
uses: snok/install-poetry@v1
with:
poetry-version: 1.1.7
virtualenvs-create: true
virtualenvs-in-project: true
installer-parallel: true
- name: Build a source tarball
run: pip install poetry-dynamic-versioning && poetry install --no-dev && poetry build --format sdist
- name: Upload artifact to S3
3 changes: 0 additions & 3 deletions .github/workflows/tests.yml
@@ -30,9 +30,6 @@ jobs:
java-package: jdk # (jre, jdk, or jdk+fx) - defaults to jdk
architecture: x64 # (x64 or x86) - defaults to x64

- name: Build microrts
run: bash build.sh

- name: Install Poetry
uses: snok/install-poetry@v1
with:
45 changes: 32 additions & 13 deletions README.md
@@ -64,30 +64,49 @@ Note that the experiments in the technical paper above are done with [`gym_micro

Here is a description of Gym-μRTS's observation and action spaces:

* **Observation Space.** (`Box(0, 1, (h, w, 27), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is a number of feature planes that have binary values. The observation space used in this paper uses 27 feature planes as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owner being Player 1, and currently not executing any actions, then the one-hot encoding features will look like the following:

`[0,1,0,0,0], [1,0,0,0,0], [1,0,0], [0,0,0,0,1,0,0,0], [1,0,0,0,0,0]`


The 27 values of each feature plane for the position in the map of such worker will thus be:

`[0,1,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0]`

* **Partial Observation Space.** (`Box(0, 1, (h, w, 29), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is a number of feature planes that have binary values. The observation space for partial observability uses 29 feature planes as shown in the following table. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, if there is a worker with hit points equal to 1, not carrying any resources, owner being Player 1, currently not executing any actions, and not visible to the opponent, then the one-hot encoding features will look like the following:

`[0,1,0,0,0], [1,0,0,0,0], [1,0,0], [0,0,0,0,1,0,0,0], [1,0,0,0,0,0], [1,0]`
* **Observation Space.** (`Box(0, 1, (h, w, 29), int32)`) Given a map of size `h x w`, the observation is a tensor of shape `(h, w, n_f)`, where `n_f` is the number of feature planes, each holding binary values. The observation space used in this paper uses 29 feature planes, as shown in the table below. A feature plane can be thought of as a concatenation of multiple one-hot encoded features. As an example, the unit at a cell could be encoded as follows:

* the unit has 1 hit point -> `[0,1,0,0,0]`
* the unit is not carrying any resources -> `[1,0,0,0,0]`
* the unit is owned by Player 1 -> `[0,1,0]`
* the unit is a worker -> `[0,0,0,0,1,0,0,0]`
* the unit is not executing any action -> `[1,0,0,0,0,0]`
* the unit is standing on a free terrain cell -> `[1,0]`

The 29 values of the feature planes at this worker's map position will thus be:

`[0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0]`

* **Partial Observation Space.** (`Box(0, 1, (h, w, 31), int32)`) Under the partial observation space, there are two additional planes indicating whether the unit is visible to the opponent: `[0,1]` if the unit is visible to the opponent, and `[1,0]` if it is not. Using the example above and assuming the worker unit is not visible, the 31 values of the feature planes at this worker's map position will thus be:

`[0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,1,0]`
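To make the encoding concrete, here is a minimal NumPy sketch (illustrative only; the `one_hot` helper is not part of the gym-microrts API) that assembles the 29 per-cell values for the worker above by concatenating its one-hot features in the order given by the observation features table:

```python
import numpy as np

def one_hot(index, size):
    """Return a one-hot vector of the given size (hypothetical helper)."""
    v = np.zeros(size, dtype=np.int32)
    v[index] = 1
    return v

# Example worker: 1 hit point, carrying no resources, owned by player 1
# (owner planes ordered "-, player 1, player 2"), unit type worker,
# executing no action, standing on a free terrain cell.
cell = np.concatenate([
    one_hot(1, 5),  # hit points = 1
    one_hot(0, 5),  # resources = 0
    one_hot(1, 3),  # owner = player 1
    one_hot(4, 8),  # unit type = worker
    one_hot(0, 6),  # current action = none
    one_hot(0, 2),  # terrain = free
])
assert cell.shape == (29,)
```

Appending the two visibility planes (e.g. `one_hot(0, 2)` for a unit not visible to the opponent) yields the 31-value partial-observation encoding in the same way.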

* **Action Space.** (`MultiDiscrete(concat(h * w * [[6 4 4 4 4 7 a_r^2]]))`) Given a map of size `h x w` and the maximum attack range `a_r=7`, the action is a (7hw)-dimensional vector of discrete values, as specified in the following table. The first 7 components of the action vector represent the action issued to the unit at `x=0,y=0`, the second 7 components represent the action issued to the unit at `x=0,y=1`, and so on. Within each group of 7 components, the first is the action type, and the rest are the parameters the different action types can take. Depending on which action type is selected, the game engine uses the corresponding parameters to execute the action. As an example, if the RL agent issues a move-south action to the worker at `x=0,y=1` in a 2x2 map, the action will be encoded in the following way:

`concat([0,0,0,0,0,0,0], [1,2,0,0,0,0,0], [0,0,0,0,0,0,0], [0,0,0,0,0,0,0])`
`=[0,0,0,0,0,0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]`
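The packing above can be sketched as follows (a minimal illustration, not the library's API), filling an `(h*w, 7)` array of per-cell actions and flattening it:

```python
import numpy as np

h, w = 2, 2
# Per-cell action components:
# [action_type, move_dir, harvest_dir, return_dir,
#  produce_dir, produce_type, relative_attack_pos]
actions = np.zeros((h * w, 7), dtype=np.int64)

# Issue "move south" to the worker at x=0, y=1; per the text above this is
# the second group of 7 components (cell index 1).
actions[1] = [1, 2, 0, 0, 0, 0, 0]  # action type 1 = move, direction 2 = south

flat = actions.reshape(-1)  # the (7*h*w,)-dimensional action vector
```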

![image](https://user-images.githubusercontent.com/5555347/120344517-a5bf7300-c2c7-11eb-81b6-172813ba8a0b.png)
<!-- ![image](https://user-images.githubusercontent.com/5555347/120344517-a5bf7300-c2c7-11eb-81b6-172813ba8a0b.png) -->

The observation features and action components are summarized below, where $a_r=7$ is the maximum attack range and `-` means not applicable.

| Observation Features | Planes | Description |
|-----------------------------|--------------------|----------------------------------------------------------|
| Hit Points | 5 | 0, 1, 2, 3, $\geq 4$ |
| Resources | 5 | 0, 1, 2, 3, $\geq 4$ |
| Owner | 3 | -, player 1, player 2 |
| Unit Types | 8 | -, resource, base, barrack, worker, light, heavy, ranged |
| Current Action | 6 | -, move, harvest, return, produce, attack |
| Terrain | 2 | free, wall |

| Action Components | Range | Description |
|-----------------------------|--------------------|----------------------------------------------------------|
| Source Unit | $[0,h \times w-1]$ | the location of the unit selected to perform an action |
| Action Type | $[0,5]$ | NOOP, move, harvest, return, produce, attack |
| Move Parameter | $[0,3]$ | north, east, south, west |
| Harvest Parameter | $[0,3]$ | north, east, south, west |
| Return Parameter | $[0,3]$ | north, east, south, west |
| Produce Direction Parameter | $[0,3]$ | north, east, south, west |
| Produce Type Parameter | $[0,6]$ | resource, base, barrack, worker, light, heavy, ranged |
| Relative Attack Position | $[0,a_r^2 - 1]$ | the relative location of the unit that will be attacked |
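As a worked example of the last row, the relative attack position can be read as an index into the $a_r \times a_r$ grid centred on the attacker. Assuming row-major ordering over that grid (an assumption for illustration; check the engine source before relying on it), the offset can be recovered as:

```python
def attack_offset(index, a_r=7):
    # Decode a relative attack position into a (dx, dy) offset from the
    # attacker, assuming row-major ordering over the a_r x a_r grid.
    dy, dx = divmod(index, a_r)
    return dx - a_r // 2, dy - a_r // 2

# Under this assumption, the centre cell (index 24 for a_r = 7) is the
# attacker's own position, and index 0 is the top-left corner.
```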


## Evaluation

17 changes: 8 additions & 9 deletions experiments/league.py
@@ -58,7 +58,7 @@ def parse_args():
help='the highest sigma of the trueskill evaluation')
parser.add_argument('--output-path', type=str, default=f"league.temp.csv",
help='the output path of the leaderboard csv')
parser.add_argument('--model-type', type=str, default=f"ppo_gridnet_large", choices=["ppo_gridnet_large", "ppo_gridnet"],
parser.add_argument('--model-type', type=str, default=f"ppo_gridnet", choices=["ppo_gridnet"],
help='the output path of the leaderboard csv')
parser.add_argument('--maps', nargs='+', default=["maps/16x16/basesWorkers16x16A.xml"],
help="the maps to do trueskill evaluations")
@@ -83,17 +83,13 @@ def parse_args():
dbpath = tmp_dbpath
db = SqliteDatabase(dbpath)

if args.model_type == "ppo_gridnet_large":
from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
if args.model_type == "ppo_gridnet":
from ppo_gridnet import Agent, MicroRTSStatsRecorder

from gym_microrts.envs.vec_env import MicroRTSBotVecEnv, MicroRTSGridModeVecEnv
else:
from ppo_gridnet import Agent, MicroRTSStatsRecorder

from gym_microrts.envs.vec_env import MicroRTSBotVecEnv
from gym_microrts.envs.vec_env import (
MicroRTSGridModeSharedMemVecEnv as MicroRTSGridModeVecEnv,
)
else:
raise ValueError(f"model_type {args.model_type} is not supported")


class BaseModel(Model):
@@ -189,6 +185,7 @@ def __init__(self, partial_obs: bool, match_up=None, map_path="maps/16x16/basesW
ai2s=built_in_ais,
map_paths=[map_path],
reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
autobuild_microrts=False,
)
self.agent = Agent(self.envs).to(self.device)
self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
@@ -202,6 +199,7 @@ def __init__(self, partial_obs: bool, match_up=None, map_path="maps/16x16/basesW
render_theme=2,
map_paths=[map_path],
reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
autobuild_microrts=False,
)
self.agent = Agent(self.envs).to(self.device)
self.agent.load_state_dict(torch.load(self.rl_ai, map_location=self.device))
@@ -217,6 +215,7 @@ def __init__(self, partial_obs: bool, match_up=None, map_path="maps/16x16/basesW
render_theme=2,
map_paths=[map_path],
reward_weight=np.array([10.0, 1.0, 1.0, 0.2, 1.0, 4.0]),
autobuild_microrts=False,
)
self.envs = MicroRTSStatsRecorder(self.envs)
self.envs = VecMonitor(self.envs)
38 changes: 27 additions & 11 deletions experiments/ppo_gridnet.py
@@ -19,9 +19,7 @@
from torch.utils.tensorboard import SummaryWriter

from gym_microrts import microrts_ai
from gym_microrts.envs.vec_env import (
MicroRTSGridModeSharedMemVecEnv as MicroRTSGridModeVecEnv,
)
from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv


def parse_args():
@@ -410,34 +408,44 @@ def on_evaluation_done(self, future):
)

for update in range(starting_update, args.num_updates + 1):
step_time = 0
inference_time = 0
get_mask_time = 0
# Annealing the rate if instructed to do so.
if args.anneal_lr:
frac = 1.0 - (update - 1.0) / args.num_updates
lrnow = lr(frac)
optimizer.param_groups[0]["lr"] = lrnow

# TRY NOT TO MODIFY: prepare the execution of the game.
rollout_time_start = time.time()
for step in range(0, args.num_steps):
# envs.render()
global_step += 1 * args.num_envs
obs[step] = next_obs
dones[step] = next_done

get_mask_time_start = time.time()
invalid_action_masks[step] = torch.tensor(envs.get_action_mask()).to(device)
get_mask_time += time.time() - get_mask_time_start

# ALGO LOGIC: put action logic here
inference_time_start = time.time()
with torch.no_grad():
invalid_action_masks[step] = torch.tensor(envs.get_action_mask()).to(device)
action, logproba, _, _, vs = agent.get_action_and_value(
next_obs, envs=envs, invalid_action_masks=invalid_action_masks[step], device=device
)
values[step] = vs.flatten()

actions[step] = action
logprobs[step] = logproba
try:
next_obs, rs, ds, infos = envs.step(action.cpu().numpy().reshape(envs.num_envs, -1))
next_obs = torch.Tensor(next_obs).to(device)
except Exception as e:
e.printStackTrace()
raise
cpu_action = action.cpu().numpy().reshape(envs.num_envs, -1)
inference_time += time.time() - inference_time_start

step_time_start = time.time()
next_obs, rs, ds, infos = envs.step(cpu_action)
step_time += time.time() - step_time_start

next_obs = torch.Tensor(next_obs).to(device)
rewards[step], next_done = torch.Tensor(rs).to(device), torch.Tensor(ds).to(device)

for info in infos:
Expand All @@ -449,6 +457,7 @@ def on_evaluation_done(self, future):
writer.add_scalar(f"charts/episodic_return/{key}", info["microrts_stats"][key], global_step)
break

training_time_start = time.time()
# bootstrap reward if not done. reached the batch limit
with torch.no_grad():
last_value = agent.get_value(next_obs).reshape(1, -1)
@@ -559,6 +568,13 @@ def on_evaluation_done(self, future):
if args.kle_stop or args.kle_rollback:
writer.add_scalar("debug/pg_stop_iter", i_epoch_pi, global_step)
writer.add_scalar("charts/sps", int(global_step / (time.time() - start_time)), global_step)
writer.add_scalar("charts/sps_step", int(args.num_envs * args.num_steps / step_time), global_step)
writer.add_scalar("charts/sps_inference", int(args.num_envs * args.num_steps / inference_time), global_step)
writer.add_scalar("charts/step_time", step_time, global_step)
writer.add_scalar("charts/inference_time", inference_time, global_step)
writer.add_scalar("charts/get_mask_time", get_mask_time, global_step)
writer.add_scalar("charts/rollout_time", time.time() - rollout_time_start, global_step)
writer.add_scalar("charts/training_time", time.time() - training_time_start, global_step)
print("SPS:", int(global_step / (time.time() - start_time)))

if eval_executor is not None:
12 changes: 4 additions & 8 deletions experiments/ppo_gridnet_eval.py
@@ -53,7 +53,7 @@ def parse_args():
help="the path to the agent's model")
parser.add_argument('--ai', type=str, default="",
help='the opponent AI to evaluate against')
parser.add_argument('--model-type', type=str, default=f"ppo_gridnet_large", choices=["ppo_gridnet_large", "ppo_gridnet"],
parser.add_argument('--model-type', type=str, default=f"ppo_gridnet", choices=["ppo_gridnet"],
help='the output path of the leaderboard csv')
args = parser.parse_args()
if not args.seed:
@@ -72,16 +72,12 @@ def parse_args():
if __name__ == "__main__":
args = parse_args()

if args.model_type == "ppo_gridnet_large":
from ppo_gridnet_large import Agent, MicroRTSStatsRecorder
if args.model_type == "ppo_gridnet":
from ppo_gridnet import Agent, MicroRTSStatsRecorder

from gym_microrts.envs.vec_env import MicroRTSGridModeVecEnv
else:
from ppo_gridnet import Agent, MicroRTSStatsRecorder

from gym_microrts.envs.vec_env import (
MicroRTSGridModeSharedMemVecEnv as MicroRTSGridModeVecEnv,
)
raise ValueError(f"model_type {args.model_type} is not supported")

# TRY NOT TO MODIFY: setup the environment
experiment_name = f"{args.gym_id}__{args.exp_name}__{args.seed}__{int(time.time())}"