Add werewolf game to kaggle environments #363

hannw · 2025-07-24T20:02:16Z

This pull request introduces a new game environment to Kaggle Environments: Werewolf. 🐺

This addition provides a new multiplayer environment for the community to develop and test agents in a social deduction game setting.

New additions:

A new environment in kaggle_environments/envs/werewolf, where the main entry point is werewolf.py and the config schema is in werewolf.json.
The main game engine is implemented in kaggle_environments/envs/werewolf/game, where the main orchestrator is the Moderator class in engine.py.
A harness module in kaggle_environments/envs/werewolf/harness.

chuckcoder · 2025-08-01T23:55:25Z

kaggle_environments/agent.py

-                err = err[0:max_log_length]
+        if self.debug:
+            start = perf_counter()
+            action = self.agent(*args)


It's not clear why there have to be two paths for self.debug here. It looks like the debug path just removes the try/catch and error logging?

Wrapping the execution loop in a catch-all try/except would prevent the debugger from functioning, which defeats the purpose of debug mode. That is why we separated the two modes here. I can add a comment to explain it.
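A minimal sketch of the trade-off described above, not the code in agent.py: `call_agent` and its return shape are hypothetical names used only for illustration. The debug path lets exceptions propagate so a debugger can stop at the failing frame, while the production path catches everything and returns the traceback as log text.

```python
import traceback
from time import perf_counter


def call_agent(agent, args, debug=False):
    """Hypothetical helper: returns (action, duration_seconds, error_text)."""
    start = perf_counter()
    if debug:
        # Debug path: no catch-all, so an exception reaches the debugger
        # with its original stack intact.
        action = agent(*args)
        return action, perf_counter() - start, None
    try:
        # Production path: never let an agent failure crash the episode.
        action = agent(*args)
        err = None
    except Exception:
        action = None
        err = traceback.format_exc()
    return action, perf_counter() - start, err
```

With `debug=False` a failing agent yields `action is None` plus the traceback text; with `debug=True` the same failure raises at the call site.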

chuckcoder · 2025-08-02T00:03:13Z

kaggle_environments/core.py

-        out = None
-        err = None
-        # Append any environmental logs to any agent logs we collected.
+    def __run_interpreter_debug(self, state, logs):


This function appears to duplicate a lot of the functionality of the original '__run_interpreter' and now shares code with '__run_interpreter_prod'. Could we either add a docstring explaining why both are needed and consolidate the shared code in a helper function, or merge them together?

Sounds good. Will address.

chuckcoder · 2025-08-04T18:36:35Z

setup.py

@@ -59,6 +59,7 @@ def get_version(rel_path):
        "shimmy >= 1.2.1",
        "Chessnut >= 0.4.1",
        "open_spiel >= 1.6.0",
+        "pydantic >= 2.11.4",


Do you also need these:
pytest
litellm
tenacity
pygame
termcolor

I got some errors when I tried running the test_werewolf.py script.

Will add litellm. litellm and tenacity are harness-specific and might be moved out of the package once ready, so I would wait until later. pygame and termcolor might be from other environments?

chuckcoder · 2025-08-04T19:59:57Z

kaggle_environments/envs/werewolf/game/roles.py

+
+
+def create_players_from_agents_config(agents_config: List[Dict]) -> List[Player]:
+    agents = [Agent(**agent_config) for agent_config in agents_config]


Are we sure agent ids are unique after this initialization? I don't see any check in Agent() initialization.
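One way to address the question is to validate the config list before building agents. The sketch below is an assumption-laden illustration: the `"id"` key and both function names are hypothetical, and the real Agent model may name its identifier field differently.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_ids(agents_config: List[Dict], key: str = "id") -> List[str]:
    """Return the ids that appear more than once (key name is an assumption)."""
    counts = Counter(cfg[key] for cfg in agents_config)
    return sorted(agent_id for agent_id, n in counts.items() if n > 1)


def validate_unique_ids(agents_config: List[Dict], key: str = "id") -> None:
    """Raise early, before any Agent objects are constructed."""
    duplicates = find_duplicate_ids(agents_config, key)
    if duplicates:
        raise ValueError(f"Duplicate agent ids: {duplicates}")
```

Calling `validate_unique_ids(agents_config)` as the first line of a factory such as `create_players_from_agents_config` would surface the problem at configuration time rather than mid-game.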

chuckcoder · 2025-08-08T23:35:53Z

kaggle_environments/envs/werewolf/game/protocols.py

+        if not self.done():
+            # Ensure _current_voter_index is within bounds before accessing
+            if self._current_voter_index < len(self._voter_queue):
+                return [self._voter_queue[self._current_voter_index]]


This looks like a bug and seems to prevent more than one player from voting for daytime elimination when I run the code. It seems like it will only return the player id at self._current_voter_index as an array, but should be doing something like return self._voter_queue[self._current_voter_index:] to get all of the voters?

I've resolved the bug. Let me push the fixes.
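The difference between the two behaviours discussed in this thread can be sketched as follows. The class below is a hypothetical stand-in that reuses the attribute names from the diff; it is not the fix that was pushed, which is not shown in this conversation.

```python
from typing import List, Sequence


class VoterQueueSketch:
    """Hypothetical stand-in for a voting protocol's queue handling."""

    def __init__(self, voters: Sequence[str]) -> None:
        self._voter_queue: List[str] = list(voters)
        self._current_voter_index = 0

    def done(self) -> bool:
        return self._current_voter_index >= len(self._voter_queue)

    def next_voter(self) -> List[str]:
        """Sequential behaviour: only the voter at the current index."""
        if self.done():
            return []
        return [self._voter_queue[self._current_voter_index]]

    def remaining_voters(self) -> List[str]:
        """Simultaneous behaviour: everyone who has not voted yet."""
        return self._voter_queue[self._current_voter_index:]

    def record_vote(self) -> None:
        """Advance past the voter who just acted."""
        self._current_voter_index += 1
```

Returning a single-element list is only a problem if the index is never advanced or if the protocol is meant to be simultaneous; which of the two methods is appropriate depends on the configured voting protocol.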

bovard

Good start, just make sure tests are passing and you handle TIMEOUT/ERROR states and we should be good to merge.

Also please ensure that the docker builds are passing with this environment

bovard · 2025-08-13T08:03:50Z

kaggle_environments/agent.py

+                err = err_buffer.getvalue()
+                # Get the maximum log length
+                # Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode
+                max_log_length = self.configuration.get('maxLogLength', 1024)


note this is updated to 10_000 now by default

Sounds good. Will rebase before I merge.
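For reference, the truncation under discussion amounts to a slice bounded by a configurable limit. This is a sketch rather than the code in agent.py: the function names are hypothetical, and the 10_000 default is taken from the reviewer's note above.

```python
def truncate_log(text: str, configuration: dict, default_max: int = 10_000) -> str:
    """Keep at most maxLogLength characters of a per-step log."""
    max_log_length = configuration.get("maxLogLength", default_max)
    return text[0:max_log_length]


def max_log_chars_per_episode(max_log_length: int, steps: int) -> int:
    """Upper bound on log characters one agent can emit in an episode."""
    return max_log_length * steps
```

With the new default, a 600-step episode is bounded by `max_log_chars_per_episode(10_000, 600)` = 6,000,000 characters per agent, compared with 600,000 under the old 1024-character limit's rounded "1k" figure, so the "~1MB per 600 step episode" comment would need updating accordingly.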

bovard · 2025-08-13T08:04:02Z

kaggle_environments/agent.py

+                out = out_buffer.getvalue()
+                err = err_buffer.getvalue()
+                # Get the maximum log length
+                # Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode


update comment as well

bovard · 2025-08-13T08:09:47Z

kaggle_environments/envs/werewolf/game/engine.py

+            entry_type=HistoryEntryType.PHASE_CHANGE,
+            public=True,
+        )
+        # announce await action to doctor


prefer if we can do all the voting in parallel (doctor saves, seer reveal, werewolf votes).

Given that some LLMs are taking 10 minutes a turn that could speed up a game of 8 rounds by almost 3 hours

I didn't look fully through the eventing and visibility system though, so let me know if this is infeasible.

In the replay we could still play it out as though it happened serially.

We have a config option to allow fully parallel discussion, thanks for the suggestion. The discussion protocol impacts the balance of the game (whether it's a 50/50 win rate between teams). We are planning to run some experiments to settle on a discussion protocol.
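To make the reviewer's estimate concrete, here is a rough sketch of collecting independent night actions concurrently, plus the arithmetic behind "almost 3 hours". Both functions are hypothetical and assume the three night roles do not depend on each other's results within a round; the engine's eventing and visibility rules may make that untrue.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def collect_night_actions(role_fns: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run one zero-argument callable per role concurrently."""
    with ThreadPoolExecutor(max_workers=max(1, len(role_fns))) as pool:
        futures = {role: pool.submit(fn) for role, fn in role_fns.items()}
        # result() blocks until each role's action is available.
        return {role: future.result() for role, future in futures.items()}


def minutes_saved(rounds: int, minutes_per_turn: int, roles_per_night: int) -> int:
    """Serial night time minus parallel night time, in minutes."""
    serial = rounds * roles_per_night * minutes_per_turn
    parallel = rounds * minutes_per_turn
    return serial - parallel
```

For 8 rounds, 10 minutes per turn, and 3 night roles, `minutes_saved(8, 10, 3)` gives 160 minutes, a little under 3 hours. The replay can still present the collected actions in a fixed order.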

bovard · 2025-08-13T08:12:14Z

kaggle_environments/envs/werewolf/werewolf.json

+                "name": {
+                    "description": "The name of the voting protocol to use.",
+                    "type": "string",
+                    "enum": ["SimultaneousMajority", "SequentialVoting"],


nice that you can select which to run here!

bovard · 2025-08-13T08:14:48Z

requirements.txt

@@ -1,6 +1,5 @@
-flask
-gym


I'm unsure of all the places we use this rather than the list in setup.py but did you verify this doesn't break the docker build and tests?

bovard · 2025-08-13T08:16:26Z

kaggle_environments/envs/werewolf/test_werewolf.py

+
+def test_discussion_protocol():
+    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
+    names = ["gemini-2.5-pro", "gemini-2.5-flash", "gpt-4.1", "o3", "o4-mini", "claude-4-sonnet", "grok-4"]


I don't think we can run this test, as you'll need the keys checked into the build pipeline (which we currently don't have). Mark this as skip.

Actually, this one uses random agents, so we should be fine. I will double check other tests that require keys.

bovard · 2025-08-13T08:16:42Z

kaggle_environments/envs/werewolf/test_werewolf.py

+@pytest.mark.skip('Slow test, meant for manual testing.')
+def test_llm_players():
+    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
+    names = ["gemini-2.5-flash-0", "random-0", "gemini-2.5-flash-1", "gemini-2.5-flash-2", "gemini-2.5-flash-3", "random-1", "random-2"]


same for this test

bovard · 2025-08-13T08:17:40Z

kaggle_environments/envs/werewolf/werewolf.js

@@ -0,0 +1,1298 @@
+function renderer({


not including this in my review

What's the ask?

bovard · 2025-08-13T08:19:14Z

kaggle_environments/envs/werewolf/werewolf.py

+           Each dict has: {observation, action, reward, status, info}
+    env:   the kaggle_environments.Environment object itself including the env.game_state
+    """
+    # --- Initialize Moderator and GameState if it's the start of an episode ---


very clean, looks good

bovard · 2025-08-13T08:20:04Z

kaggle_environments/envs/werewolf/werewolf.py

+    moderator: Moderator = env.moderator
+    game_state: GameState = env.game_state
+
+    # 1. Collect and parse actions from Kaggle agents


you'll need to handle ERROR and TIMEOUT states from players here

(usually just mark everyone as DONE and end the episode as a tie, or that is what we did for chess)

That makes sense. Will do.
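A minimal sketch of the suggested handling, using plain dicts in place of the per-agent state objects so that it runs on its own. The function name, the tie reward of 0, and the choice to leave the failing agent's status untouched are all assumptions, not the behaviour of the chess environment or of the final werewolf interpreter.

```python
from typing import Dict, List

FAILED_STATUSES = ("ERROR", "TIMEOUT")


def end_episode_on_failure(state: List[Dict]) -> bool:
    """If any agent failed, end the episode as a tie and report True."""
    if not any(agent["status"] in FAILED_STATUSES for agent in state):
        return False
    for agent in state:
        # Keep the failing status visible; mark everyone else as finished.
        if agent["status"] not in FAILED_STATUSES:
            agent["status"] = "DONE"
        agent["reward"] = 0  # assumed tie value
    return True
```

An interpreter could call this before parsing actions and return the state immediately when it reports True.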


1. Introduce EventBus in GameState to control fan-in and fan-out of game events. 2. refactor role specific event handlers to roles. Use decorator to register event handler. 3. action confirmation centrally handled. 4. Introduce PlayerID. 5. general improvement on symbol annotations. # Conflicts: # kaggle_environments/envs/werewolf/game/engine.py


hannw requested review from bovard, jhtschultz, bobfraser-google, lipovetz, pengjun-git, gulliantonio, chuckcoder and solchea July 24, 2025 20:02

chuckcoder reviewed Aug 1, 2025

chuckcoder reviewed Aug 2, 2025

chuckcoder reviewed Aug 4, 2025

chuckcoder reviewed Aug 8, 2025

bovard requested changes Aug 13, 2025

hannw added 16 commits August 13, 2025 16:41

added day and night voting to werewolf.js

f9da6c3

Add doctor save information to werewolf.js

280201b

Use action from steps to render input actions.

5b796e2

add logos to test_werewolf.py

20a6300

add player_thumbnails to config schema

c50eb68

delete obsolete observation prep

c2bff4c

temp commit test_engine.py

02d90ba

day of actions in event log rendered correctly now

dabde36

separate day and night event logs into two shades in werewolf.js

9617951

Event log rendering order completed correctly in werewolf.js

426c53c

add moderator announcement to event log in werewolf.js

2d86ca9

UI related fixes

5477cb5

1. add moderator observations to env.info 2. refactor action to record game phase timestamps 3. change some entry to moderator announcement type so UI will render those

add llm harness for werewolf

00aa200

fix the bug that reasoning and voting details were revealed to all players.

aef05e7

1. refactor DataEntry to have public_view method, so reasoning trace is hidden from public. 2. fix the bug of voting action add_history_entry calls in SimultaneousMajority.

moderator announce roles and their abilities

35e6ab9

support better end game logging

8d0d1bc

hannw and others added 30 commits August 31, 2025 00:16

refactor raw_observation consumer to use getter and setter

1b0433a

refactor llm harness to record self action and reasoning

901c2e7

Introduce StrEnum to be better json serializable and clearer types

13d9c35

add configs for llm experiment

9da18fe

add packages to requirements.txt

9acab3e

add pairwise zero sum game tournament

b049df1

add task shuffling to reduce LLM api load

4307ae7

reuse run_single_game_cli code

531b28f

add display name to the right panel player name tag

fe0a71d

revise self_play.py to run n games from a given config

e805b15

add "random" and "no exile" tie exile options in SimultaneousMajority voting

cb3b8f8

Add RoundByRoundBiddingDiscussion

b365e87

refactor _handle_night_await_actions() into smaller methods for readability.

11fa4b1

Remove check for GOOGLE_APPLICATION_CREDENTIALS as sdk looks for them rather than application

d53c035

Updating docs to remove internal OCTO project

8b5b305

Refactor protocols to be modules

ad5c048

1. provide factory methods and registry for proper configurable protocols. 2. refactor the protocols to be multiple modules.

Improve on player id annotations

20074c3

standardize variable names from history entry to event

f1449ad

fix harness phase error

69d9a33

add cost to litellm models

68f840d

remove general confirmation of action event to player

65e8491

The action event will clutter llm prompt

Fix StrEnum repr

a633b0c

Improve on prompting

64462fc

1. Providing human friendly name of protocols. 2. Improve on LLM prompts and wording of rule sets.

Improve the state machine transition to be explicit in engine.py

1236b8d

Add phase category as attribute of DetailedPhase

f8d7ec2

Resolve detailed_phase naming issue

d744d49

add utility to log git hash

91ef619

use wait random exponential to avoid "thundering herd" VM crash

ea3394e

Use threadpool instead

49a0290



