Add werewolf game to kaggle environments #363
err = err[0:max_log_length]
if self.debug:
    start = perf_counter()
    action = self.agent(*args)
Not clear why there have to be two paths for self.debug here, looks like debug path is just removing try/catch and error logging?
Wrapping execution loop in try except catch-all would prevent debugger from functioning, defeating the purpose for debugging mode, which is why we separated the two modes here. I can add a comment to address it.
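The split described in this reply can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the function name `act`, its return shape, and the `max_log_length` default are assumptions made for the example.

```python
import io
import traceback
from contextlib import redirect_stderr
from time import perf_counter


def act(agent, args, debug=False, max_log_length=10_000):
    """Call agent(*args) and return (action, err_log, duration_seconds).

    Hypothetical helper: the name and return shape are illustrative only.
    """
    start = perf_counter()
    if debug:
        # Debug path: no catch-all, so an exception propagates to the
        # debugger at the point where it was raised.
        action = agent(*args)
        return action, "", perf_counter() - start

    err_buffer = io.StringIO()
    try:
        with redirect_stderr(err_buffer):
            action = agent(*args)
    except Exception:
        # Production path: log the traceback instead of crashing the episode.
        traceback.print_exc(file=err_buffer)
        action = None
    err = err_buffer.getvalue()[0:max_log_length]
    return action, err, perf_counter() - start
```

With `debug=True` a failing agent raises straight through; otherwise the traceback ends up in the returned log and the action is `None`.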
kaggle_environments/core.py
Outdated
out = None
err = None
# Append any environmental logs to any agent logs we collected.
def __run_interpreter_debug(self, state, logs):
This function looks to duplicate a lot of functionality in original '__run_interpreter' and now share code with '__run_interpreter_prod'. Could we either give a docstring about why two are needed and consolidate shared code in a helper function or merge together?
Sounds good. Will address.
@@ -59,6 +59,7 @@ def get_version(rel_path):
"shimmy >= 1.2.1",
"Chessnut >= 0.4.1",
"open_spiel >= 1.6.0",
"pydantic >= 2.11.4",
Do you also need these:
pytest
litellm
tenacity
pygame
termcolor
As I got some errors when tried running test_werewolf.py script
Will add litellm. litellm and tenacity are harness specific, which might be moved out of the package once ready, so would wait until later. pygame and termcolor might be from other environments?
def create_players_from_agents_config(agents_config: List[Dict]) -> List[Player]:
    agents = [Agent(**agent_config) for agent_config in agents_config]
Are we sure agent ids are unique after this initialization? Don't see checking in Agent() initialization?
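One way to address this question is to validate ids before constructing any `Agent`. The sketch below assumes each agent config dict carries an `"id"` key; that key name and both helper names are hypothetical and not taken from the PR.

```python
from collections import Counter
from typing import Dict, List


def find_duplicate_ids(agents_config: List[Dict]) -> List[str]:
    """Return the ids that occur more than once, sorted for stable output."""
    # Assumption: every config dict has an "id" entry.
    counts = Counter(cfg["id"] for cfg in agents_config)
    return sorted(agent_id for agent_id, n in counts.items() if n > 1)


def validate_unique_ids(agents_config: List[Dict]) -> None:
    """Raise ValueError if any agent id is repeated."""
    duplicates = find_duplicate_ids(agents_config)
    if duplicates:
        raise ValueError(f"Duplicate agent ids: {duplicates}")
```

Calling `validate_unique_ids(agents_config)` at the top of the factory function would fail fast, before any player objects exist.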
if not self.done():
    # Ensure _current_voter_index is within bounds before accessing
    if self._current_voter_index < len(self._voter_queue):
        return [self._voter_queue[self._current_voter_index]]
This looks like a bug and seems to prevent more than one player from voting for daytime elimination when I run the code. It seems like it will only return the player id at self._current_voter_index as an array, but should be doing something like return self._voter_queue[self._current_voter_index:] to get all of the voters?
I've resolved the bug. Let me push the fixes.
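The difference the reviewer points at can be shown with two small standalone functions. This is only a sketch of the two behaviours, not the fix that was pushed; the function names are made up for the example, and which behaviour is correct depends on whether the voting protocol is meant to be sequential or simultaneous.

```python
from typing import List


def next_voter_only(voter_queue: List[str], current_index: int) -> List[str]:
    """Behaviour in the reviewed snippet: expose a single voter at a time."""
    if current_index < len(voter_queue):
        return [voter_queue[current_index]]
    return []


def all_remaining_voters(voter_queue: List[str], current_index: int) -> List[str]:
    """Reviewer's suggestion: expose every voter who has not voted yet."""
    return voter_queue[current_index:]
```

For a queue of three players with the index at 0, the first function returns one id and the second returns all three.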
Good start, just make sure tests are passing and you handle TIMEOUT/ERROR states and we should be good to merge.
Also please ensure that the docker builds are passing with this environment
kaggle_environments/agent.py
Outdated
err = err_buffer.getvalue()
# Get the maximum log length
# Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode
max_log_length = self.configuration.get('maxLogLength', 1024)
note this is updated to 10_000 now by default
Sounds good. Will rebase before I merge.
kaggle_environments/agent.py
Outdated
out = out_buffer.getvalue()
err = err_buffer.getvalue()
# Get the maximum log length
# Allow up to 1k (default) log characters per step which is ~1MB per 600 step episode
update comment as well
SG
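For reference, the truncation under the new default mentioned above might look like the sketch below. The `maxLogLength` key and the `10_000` default come from the thread; the helper names are hypothetical. At 10,000 characters per step, a 600-step episode can hold up to 6,000,000 characters per log stream, which is why the "~1MB" comment needs updating.

```python
DEFAULT_MAX_LOG_LENGTH = 10_000  # default per the review comment above


def truncate_log(text: str, configuration: dict) -> str:
    """Clip a per-step log to the configured maximum length."""
    max_log_length = configuration.get("maxLogLength", DEFAULT_MAX_LOG_LENGTH)
    return text[0:max_log_length]


def worst_case_log_chars(steps: int, max_log_length: int = DEFAULT_MAX_LOG_LENGTH) -> int:
    """Upper bound on characters kept for one log stream over an episode."""
    return steps * max_log_length
```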
    entry_type=HistoryEntryType.PHASE_CHANGE,
    public=True,
)
# announce await action to doctor
prefer if we can do all the voting in parallel (doctor saves, seer reveal, werewolf votes).
Given that some LLMs are taking 10 minutes a turn, that could speed up a game of 8 rounds by almost 3 hours
I didn't look fully through the eventing and visibility system though, so let me know if this is infeasible.
In the replay we could still play it out as though it happened serially.
We have a config option to allow fully parallel discussion, thanks for the suggestion. The discussion protocol impacts the balance of the game (whether it's a 50/50 win rate between teams). We are planning to run some experiments to settle on a discussion protocol.
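The reviewer's estimate works out as follows: three night actions at 10 minutes each take 30 minutes serially but about 10 minutes in parallel, saving roughly 20 minutes per round, or 160 minutes (about 2.7 hours) over 8 rounds. A minimal sketch of running the night actions concurrently is below; it uses `concurrent.futures` from the standard library, and the function names and the callables-per-role shape are assumptions, not the PR's eventing system.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_serial(actions: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Ask each role for its action one after another."""
    return {name: fn() for name, fn in actions.items()}


def run_parallel(actions: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Ask every role at once; total time is about the slowest single call."""
    with ThreadPoolExecutor(max_workers=len(actions)) as pool:
        futures = {name: pool.submit(fn) for name, fn in actions.items()}
        return {name: future.result() for name, future in futures.items()}


def minutes_saved(roles: int, minutes_per_turn: int, rounds: int) -> int:
    """Serial time minus parallel time across the whole game."""
    return (roles - 1) * minutes_per_turn * rounds
```

Results are collected into a dict keyed by role, so the replay can still present them in a fixed serial order.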
"name": {
    "description": "The name of the voting protocol to use.",
    "type": "string",
    "enum": ["SimultaneousMajority", "SequentialVoting"],
nice that you can select which to run here!
@@ -1,6 +1,5 @@
flask
gym
I'm unsure of all the places we use this rather than the list in setup.py
but did you verify this doesn't break the docker build and tests?
def test_discussion_protocol():
    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
    names = ["gemini-2.5-pro", "gemini-2.5-flash", "gpt-4.1", "o3", "o4-mini", "claude-4-sonnet", "grok-4"]
I don't think we can run this test, as you'll need the keys checked into the build pipeline (which we currently don't have). Mark this as skip
SG
Actually, this one uses random agents, so we should be fine. I will double check other tests that require keys.
@pytest.mark.skip('Slow test, meant for manual testing.')
def test_llm_players():
    roles = ["Werewolf", "Werewolf", "Doctor", "Seer", "Villager", "Villager", "Villager"]
    names = ["gemini-2.5-flash-0", "random-0", "gemini-2.5-flash-1", "gemini-2.5-flash-2", "gemini-2.5-flash-3", "random-1", "random-2"]
same for this test
@@ -0,0 +1,1298 @@
function renderer({
not including this in my review
What's the ask?
Each dict has: {observation, action, reward, status, info}
env: the kaggle_environments.Environment object itself including the env.game_state
"""
# --- Initialize Moderator and GameState if it's the start of an episode ---
very clean, looks good
moderator: Moderator = env.moderator
game_state: GameState = env.game_state

# 1. Collect and parse actions from Kaggle agents
you'll need to handle ERROR and TIMEOUT states from players here
(usually just mark everyone as DONE and end the episode as a tie, or that is what we did for chess)
That makes sense. Will do.
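The handling suggested in this thread could look roughly like the sketch below. It models each agent's state as a plain dict with `status` and `reward` keys, which is a simplification for the example; the function name is hypothetical. Leaving the failing player's own status untouched and using a reward of 0 to represent a tie are assumptions of this sketch.

```python
from typing import Dict, List

FAILURE_STATUSES = {"ERROR", "TIMEOUT"}


def end_episode_on_failure(state: List[Dict]) -> bool:
    """If any player errored or timed out, end the episode as a tie.

    Returns True when the episode was ended, False when play can continue.
    """
    if not any(agent["status"] in FAILURE_STATUSES for agent in state):
        return False
    for agent in state:
        # Assumption: keep ERROR/TIMEOUT visible on the failing player and
        # mark everyone else DONE.
        if agent["status"] not in FAILURE_STATUSES:
            agent["status"] = "DONE"
        # Assumption: a tie is represented by giving every player reward 0.
        agent["reward"] = 0
    return True
```

The interpreter would call this right after collecting actions and return the state immediately when it reports `True`.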
1. add moderator observations to env.info 2. refactor action to record game phase timestamps 3. change some entry to moderator announcement type so UI will render those
…ayers. 1. refactor DataEntry to have public_view method, so reasoning trace is hidden from public. 2. fix the bug of voting action add_history_entry calls in SimultaneousMajority.
… rather than application
1. Introduce EventBus in GameState to control fan-in and fan-out of game events. 2. refactor role specific event handlers to roles. Use decorator to register event handler. 3. action confirmation centrally handled. 4. Introduce PlayerID. 5. general improvement on symbol annotations. # Conflicts: # kaggle_environments/envs/werewolf/game/engine.py
1. provide factory methods and registry for proper configurable protocols. 2. refactor the protocols to be multiple modules.
The action event will clutter llm prompt
1. Providing human friendly name of protocols. 2. Improve on LLM prompts and wording of rule sets.
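One of the commit messages above mentions an EventBus with handlers registered through a decorator. A minimal sketch of that general pattern follows; it is not the PR's implementation, and the class body, the `on` and `dispatch` method names, and the event names are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List


class EventBus:
    """Routes events to the handlers registered for each event type."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[Any], Any]]] = defaultdict(list)

    def on(self, event_type: str) -> Callable:
        """Decorator that registers a handler for event_type."""
        def register(handler: Callable[[Any], Any]) -> Callable[[Any], Any]:
            self._handlers[event_type].append(handler)
            return handler
        return register

    def dispatch(self, event_type: str, payload: Any) -> List[Any]:
        """Fan out one event to every registered handler, in registration order."""
        return [handler(payload) for handler in self._handlers.get(event_type, [])]


bus = EventBus()


@bus.on("night_start")  # hypothetical event name
def doctor_handler(payload):
    return f"doctor sees {payload}"


@bus.on("night_start")
def seer_handler(payload):
    return f"seer sees {payload}"
```

Dispatching `"night_start"` calls both handlers in turn, while an event type with no handlers returns an empty list.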
This pull request introduces a new game environment to Kaggle Environments: Werewolf. 🐺
This addition provides a new multiplayer environment for the community to develop and test agents in a social deduction game setting.
New additions:
kaggle_environments/envs/werewolf, where the main entry point is werewolf.py and the config schema is in werewolf.json.
kaggle_environments/envs/werewolf/game, where the main orchestrator is the Moderator class in engine.py.
kaggle_environments/envs/werewolf/harness.