@hannw hannw commented Jul 24, 2025

This pull request introduces a new game environment to Kaggle Environments: Werewolf. 🐺

This addition provides a new multiplayer environment for the community to develop and test agents in a social deduction game setting.

New additions:

  1. A new environment in kaggle_environments/envs/werewolf, where the main entry point is werewolf.py and the config schema is in werewolf.json.
  2. The main game engine is implemented in kaggle_environments/envs/werewolf/game, where the main orchestrator is the Moderator class in engine.py.
  3. A harness module in kaggle_environments/envs/werewolf/harness.

err = err[0:max_log_length]
if self.debug:
    start = perf_counter()
    action = self.agent(*args)

Not clear why there have to be two paths for self.debug here. It looks like the debug path just removes the try/catch and the error logging?

Author

Wrapping the execution loop in a catch-all try/except would prevent the debugger from functioning, defeating the purpose of debug mode, which is why we separated the two modes here. I can add a comment to explain this.
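
A minimal sketch of the two paths described in this reply, not the code in the PR. The helper name `call_agent` and its return shape are hypothetical; only `debug`, `max_log_length`, and the buffer names echo the quoted diff.

```python
import io
import traceback
from contextlib import redirect_stderr, redirect_stdout


def call_agent(agent, args, debug=False, max_log_length=1024):
    """Hypothetical helper: returns (action, captured_error_log)."""
    if debug:
        # Debug path: no catch-all, so an exception propagates to the
        # caller and a debugger can stop at the original raise site.
        return agent(*args), ""

    # Production path: capture output and turn any exception into a log entry.
    out_buffer, err_buffer = io.StringIO(), io.StringIO()
    action = None
    try:
        with redirect_stdout(out_buffer), redirect_stderr(err_buffer):
            action = agent(*args)
    except Exception:
        traceback.print_exc(file=err_buffer)
    err = err_buffer.getvalue()[0:max_log_length]
    return action, err


def broken_agent(observation):
    # Stand-in agent that always fails, used only to exercise both paths.
    raise ValueError("bad move")
```

With `debug=False` the failure is swallowed and logged; with `debug=True` the `ValueError` reaches the caller unchanged, which is what a debugger needs.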

out = None
err = None
# Append any environmental logs to any agent logs we collected.
def __run_interpreter_debug(self, state, logs):

This function looks like it duplicates a lot of the functionality in the original '__run_interpreter' and now shares code with '__run_interpreter_prod'. Could we either add a docstring explaining why both are needed and consolidate the shared code in a helper function, or merge the two together?

Author

Sounds good. Will address.

@@ -59,6 +59,7 @@ def get_version(rel_path):
"shimmy >= 1.2.1",
"Chessnut >= 0.4.1",
"open_spiel >= 1.6.0",
"pydantic >= 2.11.4",

Do you also need these:
pytest
litellm
tenacity
pygame
termcolor

I got some errors when I tried running the test_werewolf.py script.

Author

Will add litellm. litellm and tenacity are harness-specific and might be moved out of the package once ready, so I would wait until later. pygame and termcolor might be from other environments?



def create_players_from_agents_config(agents_config: List[Dict]) -> List[Player]:
    agents = [Agent(**agent_config) for agent_config in agents_config]

Are we sure agent ids are unique after this initialization? I don't see any check for that in the Agent() initialization.
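
A minimal sketch of the kind of check being asked about, assuming each agent config is a dict with an "id" key. Both the key name and the helper `check_unique_agent_ids` are hypothetical and not taken from the PR.

```python
from collections import Counter
from typing import Dict, List


def check_unique_agent_ids(agents_config: List[Dict]) -> None:
    """Raise ValueError if two agent configs share the same "id" value."""
    counts = Counter(cfg["id"] for cfg in agents_config)
    duplicates = sorted(agent_id for agent_id, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"Duplicate agent ids: {duplicates}")
```

Calling this before the list comprehension would fail fast on a config that repeats an id, instead of letting two players share one.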

if not self.done():
    # Ensure _current_voter_index is within bounds before accessing
    if self._current_voter_index < len(self._voter_queue):
        return [self._voter_queue[self._current_voter_index]]

This looks like a bug: when I run the code, it seems to prevent more than one player from voting in the daytime elimination. It only returns the player id at self._current_voter_index, wrapped in a list. Should it instead do something like return self._voter_queue[self._current_voter_index:] to get all of the voters?
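
To make the difference concrete, here is a small standalone comparison of the two expressions. The function names and the sample queue are hypothetical. Which behavior is right depends on whether the protocol is meant to be sequential or simultaneous, so this is not a statement of what the eventual fix looked like.

```python
from typing import List


def current_voter_only(voter_queue: List[str], index: int) -> List[str]:
    # Mirrors the quoted line: a one-element list with the voter at `index`.
    return [voter_queue[index]] if index < len(voter_queue) else []


def remaining_voters(voter_queue: List[str], index: int) -> List[str]:
    # The slice suggested in the review: everyone who has not voted yet.
    return voter_queue[index:]


queue = ["alice", "bob", "carol"]
print(current_voter_only(queue, 0))  # ['alice']
print(remaining_voters(queue, 0))    # ['alice', 'bob', 'carol']
```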

Author

I've resolved the bug. Let me push the fixes.

Member

@bovard bovard left a comment

Good start, just make sure tests are passing and you handle TIMEOUT/ERROR states and we should be good to merge.

Also, please ensure that the docker builds pass with this environment.

err = err_buffer.getvalue()
# Get the maximum log length
# Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode
max_log_length = self.configuration.get('maxLogLength', 1024)
Member

note this is updated to 10_000 now by default
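
For reference, a sketch of how the truncation behaves with the new default mentioned here. The helper `truncate_log` is hypothetical; the `maxLogLength` key and the slice come from the quoted diff.

```python
def truncate_log(log: str, configuration: dict, default: int = 10_000) -> str:
    # Fall back to the default when the configuration does not set a limit.
    max_log_length = configuration.get("maxLogLength", default)
    return log[0:max_log_length]


steps = 600
print(1024 * steps)    # 614400 characters per episode under the old 1k default
print(10_000 * steps)  # 6000000 characters per episode under the 10_000 default
```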

Author

Sounds good. Will rebase before I merge.

out = out_buffer.getvalue()
err = err_buffer.getvalue()
# Get the maximum log length
# Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode
Member

update comment as well

Author

SG

entry_type=HistoryEntryType.PHASE_CHANGE,
public=True,
)
# announce await action to doctor
Member

prefer if we can do all the voting in parallel (doctor saves, seer reveal, werewolf votes).

Given that some LLMs are taking 10 minutes a turn that could speed up a game of 8 rounds by almost 3 hours
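
The estimate can be checked with a few lines of arithmetic. The 10 minutes per turn and 8 rounds come from this comment; treating the night as exactly three serial steps (doctor, seer, werewolves) is an assumption made for the sketch.

```python
MINUTES_PER_TURN = 10  # from the comment above
ROUNDS = 8             # from the comment above
NIGHT_STEPS = 3        # assumption: doctor, seer, werewolves each take one serial step

serial = NIGHT_STEPS * MINUTES_PER_TURN * ROUNDS  # every step waits for the previous one
parallel = 1 * MINUTES_PER_TURN * ROUNDS          # all night actions run at once
saved = serial - parallel

print(serial, parallel, saved)  # 240 80 160
print(round(saved / 60, 2))     # 2.67
```

Under these assumptions the saving is 160 minutes, which matches "almost 3 hours".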

Member

I didn't look fully through the eventing and visibility system though, so let me know if this is infeasible.

In the replay we could still play it out as though it happened serially.

Author

We have a config option to allow fully parallel discussion, thanks for the suggestion. The discussion protocol impacts the balance of the game (whether it's a 50/50 win rate between teams). We are planning to run some experiments to settle on a discussion protocol.

"name": {
    "description": "The name of the voting protocol to use.",
    "type": "string",
    "enum": ["SimultaneousMajority", "SequentialVoting"],
Member

nice that you can select which to run here!
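
One common way to back an enum like this is a name-to-class registry with a small factory. The sketch below is illustrative only: the two protocol names come from the schema above, while `VOTING_PROTOCOLS`, `register_protocol`, `create_voting_protocol`, and the empty class bodies are hypothetical and do not describe the PR's implementation.

```python
from typing import Callable, Dict, Type

# Hypothetical registry mapping the schema's "name" value to a class.
VOTING_PROTOCOLS: Dict[str, Type] = {}


def register_protocol(name: str) -> Callable[[Type], Type]:
    def decorator(cls: Type) -> Type:
        VOTING_PROTOCOLS[name] = cls
        return cls
    return decorator


@register_protocol("SimultaneousMajority")
class SimultaneousMajority:
    """Placeholder: all voters would act in the same step."""


@register_protocol("SequentialVoting")
class SequentialVoting:
    """Placeholder: voters would act one at a time."""


def create_voting_protocol(config: dict):
    # Look up the class by the configured name and instantiate it.
    name = config["name"]
    if name not in VOTING_PROTOCOLS:
        raise ValueError(
            f"Unknown voting protocol {name!r}; expected one of {sorted(VOTING_PROTOCOLS)}"
        )
    return VOTING_PROTOCOLS[name]()
```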

@@ -1,6 +1,5 @@
flask
gym
Member

I'm unsure of all the places we use this rather than the list in setup.py but did you verify this doesn't break the docker build and tests?


def test_discussion_protocol():
    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
    names = ["gemini-2.5-pro", "gemini-2.5-flash", "gpt-4.1", "o3", "o4-mini", "claude-4-sonnet", "grok-4"]
Member

I don't think we can run this test, as you'd need the keys checked into the build pipeline (which we currently don't have). Mark this as skip.

Author

SG

Author

Actually, this one uses random agents, so we should be fine. I will double-check the other tests that require keys.
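
For the tests that do need keys, one option is a conditional skip rather than an unconditional one, so they still run locally when a key is present. This is only a sketch: the environment variable name `GEMINI_API_KEY`, the helper `has_llm_key`, and the test name are hypothetical. `pytest.mark.skipif` is the standard pytest marker.

```python
import os

import pytest


def has_llm_key(env=os.environ, var="GEMINI_API_KEY"):
    # Hypothetical variable name; substitute whatever the harness actually reads.
    return bool(env.get(var))


requires_llm_key = pytest.mark.skipif(
    not has_llm_key(), reason="needs an LLM API key, which the build pipeline does not have"
)


@requires_llm_key
def test_llm_players_smoke():
    # Placeholder body; a real test would run an episode with LLM-backed agents.
    assert has_llm_key()
```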

@pytest.mark.skip('Slow test, meant for manual testing.')
def test_llm_players():
    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
    names = ["gemini-2.5-flash-0", "random-0", "gemini-2.5-flash-1", "gemini-2.5-flash-2", "gemini-2.5-flash-3", "random-1", "random-2"]
Member

same for this test

@@ -0,0 +1,1298 @@
function renderer({
Member

not including this in my review

Author

What's the ask?

Each dict has: {observation, action, reward, status, info}
env: the kaggle_environments.Environment object itself including the env.game_state
"""
# --- Initialize Moderator and GameState if it's the start of an episode ---
Member

very clean, looks good

moderator: Moderator = env.moderator
game_state: GameState = env.game_state

# 1. Collect and parse actions from Kaggle agents
Member

you'll need to handle ERROR and TIMEOUT states from players here

Member

(usually just mark everyone as DONE and end the episode as a tie, or that is what we did for chess)
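
A rough sketch of that handling, using plain dicts shaped like the per-agent state described in the interpreter docstring ({observation, action, reward, status, info}). The helper name, the reward of 0 for a tie, and leaving the failing agent's own status untouched are all assumptions, not the chess environment's actual code.

```python
from typing import Dict, List

FAILED_STATUSES = {"ERROR", "TIMEOUT"}


def end_episode_on_failure(state: List[Dict]) -> bool:
    """If any agent failed, end the episode for everyone as a tie.

    Returns True when the episode was ended.
    """
    if not any(agent["status"] in FAILED_STATUSES for agent in state):
        return False
    for agent in state:
        agent["reward"] = 0  # assumption: equal rewards encode a tie
        if agent["status"] not in FAILED_STATUSES:
            # Keep ERROR/TIMEOUT visible on the failing agent; finish the rest.
            agent["status"] = "DONE"
    return True
```

Running this check before parsing actions would let the interpreter return early instead of feeding a missing action into the moderator.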

Author

That makes sense. Will do.

hannw and others added 30 commits August 31, 2025 00:16
1. Introduce EventBus in GameState to control fan-in and fan-out of game events.
2. Refactor role-specific event handlers into roles. Use a decorator to register event handlers.
3. Handle action confirmation centrally.
4. Introduce PlayerID.
5. General improvements to symbol annotations.

# Conflicts:
#	kaggle_environments/envs/werewolf/game/engine.py
1. provide factory methods and registry for proper configurable protocols.
2. refactor the protocols to be multiple modules.
The action event will clutter llm prompt
1. Providing human friendly name of protocols.
2. Improve on LLM prompts and wording of rule sets.
5 participants