Skip to content

Add minimal multi turn examples based on the new rollout functions (#493)#577

Draft
GuanxingLu wants to merge 1317 commits intoradixark:mainfrom
GuanxingLu:feature/multi_turn_and_agent_examples
Draft

Add minimal multi turn examples based on the new rollout functions (#493)#577
GuanxingLu wants to merge 1317 commits intoradixark:mainfrom
GuanxingLu:feature/multi_turn_and_agent_examples

Conversation

@GuanxingLu
Copy link
Contributor

No description provided.

@gemini-code-assist
Copy link
Contributor

Summary of Changes

Hello @GuanxingLu, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the framework's capabilities for tool-enabled language model training by introducing comprehensive examples for multi-turn and agentic generation. It provides a robust and secure environment for tool execution, alongside refined logging mechanisms to capture the nuances of multi-step interactions. These additions aim to streamline the development and evaluation of models that leverage external tools for complex tasks.

Highlights

  • New Retool-v2 Examples: Introduced a new retool_v2 example directory containing scripts for multi-turn and agentic tool-enabled RL training, complete with a detailed README.
  • Safe Python Sandbox and Tool Registry: Added tool_sandbox.py which provides a PythonSandbox for secure code execution with memory/time limits and module restrictions, and a ToolRegistry for managing and executing tools like a code interpreter.
  • Enhanced Rollout Functions: Implemented new rollout functions for multi-turn and agentic tool calling, allowing models to interact with tools over multiple steps.
  • Improved Logging and Metadata Tracking: Updated logging utilities to support multi-turn data and pass rate logging, and added round_number and tool_call_count to sample metadata for better tracking of tool interactions.
Changelog
  • examples/retool_v2/README.md
    • Added a new README file detailing the retool_v2 example, including its overview, file structure, usage instructions for setup, model conversion, multi-turn and agentic RL training, tool format, and safety features.
  • examples/retool_v2/run_agentic.sh
    • Added a new shell script to execute agentic RL training, configuring custom generation functions, tool specifications, and execution paths, along with various training and performance arguments.
  • examples/retool_v2/run_multi_turn.sh
    • Added a new shell script to execute multi-turn RL training, setting up custom generation functions, tool specifications, and execution paths, including evaluation arguments and specific GRPO and optimizer configurations.
  • examples/retool_v2/tool_sandbox.py
    • Added a new Python module that implements a PythonSandbox for safe code execution, a ToolRegistry for managing and executing tools, and a reward_func that incorporates tool usage into the reward calculation.
  • miles/backends/training_utils/log_utils.py
    • Modified the log_rollout_data function to pass the parallel_state argument to log_passrate when args.log_passrate is enabled.
  • miles/rollout/base_types.py
    • Removed a trailing newline character from the end of the file.
  • miles/rollout/generate_hub/multi_turn.py
    • Added round_number to the sample metadata, tracking the current turn in multi-turn generation.
  • miles/rollout/generate_utils/tool_call_utils.py
    • Updated sample metadata to include tool_call_count, incrementing it based on the number of tool messages processed.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces valuable examples for multi-turn and agentic tool usage, complete with a Python sandbox for code execution. The overall structure is good, but there are several areas for improvement to enhance robustness, security, and usability. My review focuses on making the shell scripts more portable by removing hardcoded paths, improving the security and correctness of the Python sandbox, and increasing the robustness of shell script logic. I've provided specific suggestions for each of these points in the comments below.

Comment on lines +114 to +115
def _check_code_safety(self, code: str) -> tuple[bool, str]:
"""Check code safety by scanning for dangerous patterns"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _check_code_safety method relies on a blacklist of dangerous patterns. This approach is fundamentally insecure as it can often be bypassed with obfuscation techniques (e.g., using string concatenation like 'ev' + 'al', or character encodings). For any production-like environment, this poses a significant security risk. It would be much safer to use a proper sandboxing library or execute the code in a more isolated environment like a dedicated, heavily restricted container.

Comment on lines +218 to +222
# Set memory limit (4GB)
try:
resource.setrlimit(resource.RLIMIT_AS, (4 * 1024 * 1024 * 1024, -1))
except Exception:
pass
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The memory limit for the sandboxed process is hardcoded to 4GB, which ignores the self.memory_limit attribute passed to the PythonSandbox constructor. This is a bug that prevents the memory limit from being configurable. You should parse the self.memory_limit string (e.g., '4GB', '512MB') into bytes and use that value to set the resource limit.

Comment on lines +28 to +40
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
```

### 2. Convert Model to Megatron-LM Format

```bash
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
${MODEL_ARGS[@]} \
--hf-checkpoint /root/Qwen3-4B \
--rotary-base 5000000 \
--save /root/Qwen3-4B_torch_dist
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The setup instructions contain hardcoded paths like /root/dapo-math-17k and /root/Qwen3-4B. This makes the example less portable and assumes a specific user environment, likely a particular Docker container. To improve usability for a wider audience, consider using environment variables for these paths or adding a note to instruct users to replace them with their own local paths.

Comment on lines +3 to +10
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cleanup block at the start of the script is quite aggressive, immediately using pkill -9. It's also repetitive. It's generally better to first attempt a graceful shutdown (e.g., pkill without -9) before forcing termination. The repetition of pkill -9 ray and pkill -9 python suggests that processes might not be terminating correctly on the first attempt, which could be investigated. For cleaner code, you could wrap this logic in a function.

Comment on lines +37 to +45
--hf-checkpoint /root/Qwen3-4B
--ref-load /root/Qwen3-4B_torch_dist
--save /root/Qwen3-4B_miles/retool_v2_agentic/
--save-interval 20
--rotary-base 1000000
)

ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script contains hardcoded paths like /root/Qwen3-4B and /root/dapo-math-17k/dapo-math-17k.jsonl. This limits portability and reusability. It's better to use environment variables to define base paths for models and data, with /root as a default value. This would make the script more flexible for users with different environment setups.

Suggested change
--hf-checkpoint /root/Qwen3-4B
--ref-load /root/Qwen3-4B_torch_dist
--save /root/Qwen3-4B_miles/retool_v2_agentic/
--save-interval 20
--rotary-base 1000000
)
ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
--hf-checkpoint "${DATA_DIR:-/root}/Qwen3-4B"
--ref-load "${DATA_DIR:-/root}/Qwen3-4B_torch_dist"
--save "${DATA_DIR:-/root}/Qwen3-4B_miles/retool_v2_agentic/"
--save-interval 20
--rotary-base 1000000
)
ROLLOUT_ARGS=(
--prompt-data "${DATA_DIR:-/root}/dapo-math-17k/dapo-math-17k.jsonl"

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev-retool-v2
--wandb-group qwen3-4B-multi-turn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The wandb group is named qwen3-4B-multi-turn, but this script is for agentic training (run_agentic.sh). This appears to be a copy-paste error from run_multi_turn.sh. For better clarity and experiment tracking, you should consider changing it to qwen3-4B-agentic or something similar that reflects the script's purpose.

Suggested change
--wandb-group qwen3-4B-multi-turn
--wandb-group qwen3-4B-agentic

Comment on lines +111 to +117
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
}
}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Constructing JSON by embedding shell variables directly into a string is fragile. If a variable like ${SCRIPT_DIR} contains characters that need to be escaped in JSON, the string will become invalid. Using a here document is a more robust and readable way to create the JSON string.

Suggested change
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
}
}"
RUNTIME_ENV_JSON=$(cat <<EOF
{
"env_vars": {
"PYTHONPATH": "/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"NCCL_NVLS_ENABLE": "${HAS_NVLINK}"
}
}
EOF
)

Comment on lines +41 to +48
--hf-checkpoint /root/Qwen3-4B
--ref-load /root/Qwen3-4B_torch_dist
--save /root/Qwen3-4B_miles/retool_v2_multi_turn/
--save-interval 1000
)

ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The script contains hardcoded paths like /root/Qwen3-4B and /root/dapo-math-17k/dapo-math-17k.jsonl. This limits portability. It's recommended to use environment variables to define base paths for models and data, with /root as a default value, to make the script more flexible.

Suggested change
--hf-checkpoint /root/Qwen3-4B
--ref-load /root/Qwen3-4B_torch_dist
--save /root/Qwen3-4B_miles/retool_v2_multi_turn/
--save-interval 1000
)
ROLLOUT_ARGS=(
--prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
--hf-checkpoint "${DATA_DIR:-/root}/Qwen3-4B"
--ref-load "${DATA_DIR:-/root}/Qwen3-4B_torch_dist"
--save "${DATA_DIR:-/root}/Qwen3-4B_miles/retool_v2_multi_turn/"
--save-interval 1000
)
ROLLOUT_ARGS=(
--prompt-data "${DATA_DIR:-/root}/dapo-math-17k/dapo-math-17k.jsonl"

Comment on lines +131 to +137
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
}
}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Constructing JSON by embedding shell variables directly into a string is fragile. If a variable like ${SCRIPT_DIR} contains characters that need to be escaped in JSON, the string will become invalid. Using a here document is a more robust and readable way to create the JSON string.

Suggested change
RUNTIME_ENV_JSON="{
\"env_vars\": {
\"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
\"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
\"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
}
}"
RUNTIME_ENV_JSON=$(cat <<EOF
{
"env_vars": {
"PYTHONPATH": "/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles",
"CUDA_DEVICE_MAX_CONNECTIONS": "1",
"NCCL_NVLS_ENABLE": "${HAS_NVLINK}"
}
}
EOF
)

Comment on lines +141 to +148
r"type\s*\(",
r"isinstance\s*\(",
r"issubclass\s*\(",
r"super\s*\(",
r"property\s*\(",
r"staticmethod\s*\(",
r"classmethod\s*\(",
r"__\w+__", # double underscore methods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The safety check blacklists many fundamental Python features, including type(), isinstance(), property(), staticmethod(), classmethod(), and all dunder methods (__\w+__). This is overly restrictive and will break a lot of valid, safe, object-oriented code, which severely limits the utility of the code interpreter. The blacklist should be more targeted to allow for common and safe language constructs.

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new examples for multi-turn and agentic tool use, including new run scripts, a README, and a Python sandbox for code execution. However, a critical sandbox escape vulnerability has been identified in the newly added tool_sandbox.py file. The current denylist-based sandboxing mechanism is insecure and can be bypassed, potentially leading to Remote Code Execution (RCE). Beyond this, the Python sandbox implementation is overly restrictive, and there are also areas for improvement in the example scripts and documentation for clarity and correctness.

Comment on lines +114 to +167
def _check_code_safety(self, code: str) -> tuple[bool, str]:
"""Check code safety by scanning for dangerous patterns"""
# Check for dangerous operations
dangerous_patterns = [
r"import\s+os",
r"import\s+sys",
r"import\s+subprocess",
r"import\s+shutil",
r"import\s+glob",
r"import\s+pathlib",
r"__import__",
r"eval\s*\(",
r"exec\s*\(",
r"open\s*\(",
r"file\s*\(",
r"input\s*\(",
r"raw_input\s*\(",
r"compile\s*\(",
r"execfile\s*\(",
r"getattr\s*\(",
r"setattr\s*\(",
r"delattr\s*\(",
r"hasattr\s*\(",
r"globals\s*\(",
r"locals\s*\(",
r"vars\s*\(",
r"dir\s*\(",
r"type\s*\(",
r"isinstance\s*\(",
r"issubclass\s*\(",
r"super\s*\(",
r"property\s*\(",
r"staticmethod\s*\(",
r"classmethod\s*\(",
r"__\w+__", # double underscore methods
]

for pattern in dangerous_patterns:
if re.search(pattern, code, re.IGNORECASE):
return False, f"Code contains dangerous pattern: {pattern}"

# Check imported modules
import_pattern = r"import\s+(\w+)"
from_pattern = r"from\s+(\w+)"

imports = re.findall(import_pattern, code)
froms = re.findall(from_pattern, code)

all_imports = set(imports + froms)
for imp in all_imports:
if imp not in self.allowed_modules:
return False, f"Import of '{imp}' is not allowed"

return True, "Code is safe"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-critical critical

The PythonSandbox.execute_code method, specifically the _check_code_safety function, relies on a denylist of regular expressions (e.g., r"__\w+__") to detect dangerous patterns. This denylist approach is fundamentally insecure and prone to bypasses, creating a critical sandbox escape vulnerability that could lead to Remote Code Execution (RCE). The pattern r"__\w+__" is also overly restrictive, blocking valid Python constructs like __init__ or __name__ == "__main__", which severely limits the utility of the interpreter. An attacker could craft malicious code to bypass these regex checks and achieve arbitrary code execution, leading to system compromise. Remediation: Replace the denylist-based validation with a more robust sandboxing solution. Consider using technologies like: - Containers: Run the code in a minimal, isolated Docker container with no network access and limited resources. - gVisor or Firecracker: Use a container runtime or microVM that provides a secure kernel abstraction layer. - RestrictedPython: A tool that restricts the available modules and objects within a Python environment.

Comment on lines +24 to +25
cd miles
pip install -e . --no-deps
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The setup instructions have a directory navigation issue. After cd miles, the shell's working directory is changed, which will cause subsequent commands that expect to be run from the repository root to fail. Using a subshell (...) for the installation steps will prevent changing the current directory and ensure the following commands execute correctly from the repository root.

Suggested change
cd miles
pip install -e . --no-deps
(cd miles && pip install -e . --no-deps)

Comment on lines +3 to +10
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The process cleanup logic at the beginning of the script is redundant. The pkill -9 ray and pkill -9 python commands are repeated. This can be simplified for better readability and maintainability. Also, pkill can accept multiple process names.

Suggested change
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
pkill -9 sglang
ray stop --force
sleep 3 # Wait for processes to terminate
pkill -9 ray python # Force kill any remaining processes

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev-retool-v2
--wandb-group qwen3-4B-multi-turn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The wandb-group appears to be copied from run_multi_turn.sh. For better experiment tracking and to avoid confusion, this should be unique to the agentic run.

Suggested change
--wandb-group qwen3-4B-multi-turn
--wandb-group qwen3-4B-agentic

Comment on lines +4 to +11
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The process cleanup logic at the beginning of the script is redundant. The pkill -9 ray and pkill -9 python commands are repeated. This can be simplified for better readability and maintainability. Also, pkill can accept multiple process names.

Suggested change
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
pkill -9 sglang
ray stop --force
sleep 3 # Wait for processes to terminate
pkill -9 ray python # Force kill any remaining processes

Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces new examples for multi-turn and agentic generation with tool usage, which is a great addition. The changes include new run scripts, a README file, and a Python tool sandbox, along with updates to track multi-turn and tool call metadata. My review focuses on improving the clarity of the documentation, correcting a potential configuration error, and addressing a significant issue in the tool sandbox's security implementation that could block valid code.

r"property\s*\(",
r"staticmethod\s*\(",
r"classmethod\s*\(",
r"__\w+__", # double underscore methods
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The regex pattern r"__\w+__" for checking code safety is too broad and will block valid Python code that uses dunder methods, such as __init__ in class definitions. This will cause the sandbox to reject legitimate code, limiting its usability. It's better to remove this overly restrictive pattern.

### 1. Setup and download datasets

```bash
cd miles
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cd miles command is likely to cause confusion or errors. Users typically run setup commands like pip install -e . from the project's root directory. Changing directories may break subsequent commands that rely on relative paths from the root.

WANDB_ARGS=(
--use-wandb
--wandb-project miles-dev-retool-v2
--wandb-group qwen3-4B-multi-turn
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The wandb-group is set to qwen3-4B-multi-turn, which appears to be a copy-paste from the run_multi_turn.sh script. Using a group name specific to the agentic run will improve experiment tracking and clarity.

Suggested change
--wandb-group qwen3-4B-multi-turn
--wandb-group qwen3-4B-agentic

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we can use python script, and maybe run_agentic.sh and run_multi_turn.sh can become one single file w/ a simple bool flag and a short if-else

Copy link
Collaborator

@fzyzcjy fzyzcjy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the general direction LGTM, i.e. almost zero lines of code change, except for copy-pasting the sandbox code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants