Add minimal multi turn examples based on the new rollout functions (#493) #577

GuanxingLu wants to merge 1317 commits into radixark:main from …and_agent_examples
Conversation
Co-authored-by: PopSoda2002 <zhouhp.me@gmail.com>
Summary of Changes

Hello @GuanxingLu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the framework's capabilities for tool-enabled language model training by introducing comprehensive examples for multi-turn and agentic generation. It provides a robust and secure environment for tool execution, alongside refined logging mechanisms to capture the nuances of multi-step interactions. These additions aim to streamline the development and evaluation of models that leverage external tools for complex tasks.
Code Review
This pull request introduces valuable examples for multi-turn and agentic tool usage, complete with a Python sandbox for code execution. The overall structure is good, but there are several areas for improvement to enhance robustness, security, and usability. My review focuses on making the shell scripts more portable by removing hardcoded paths, improving the security and correctness of the Python sandbox, and increasing the robustness of shell script logic. I've provided specific suggestions for each of these points in the comments below.
```python
def _check_code_safety(self, code: str) -> tuple[bool, str]:
    """Check code safety by scanning for dangerous patterns"""
```
The _check_code_safety method relies on a blacklist of dangerous patterns. This approach is fundamentally insecure as it can often be bypassed with obfuscation techniques (e.g., using string concatenation like 'ev' + 'al', or character encodings). For any production-like environment, this poses a significant security risk. It would be much safer to use a proper sandboxing library or execute the code in a more isolated environment like a dedicated, heavily restricted container.
```python
# Set memory limit (4GB)
try:
    resource.setrlimit(resource.RLIMIT_AS, (4 * 1024 * 1024 * 1024, -1))
except Exception:
    pass
```
The memory limit for the sandboxed process is hardcoded to 4GB, which ignores the self.memory_limit attribute passed to the PythonSandbox constructor. This is a bug that prevents the memory limit from being configurable. You should parse the self.memory_limit string (e.g., '4GB', '512MB') into bytes and use that value to set the resource limit.
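A minimal sketch of the fix the review suggests, assuming the limit strings look like `'4GB'` or `'512MB'` (the helper name `parse_memory_limit` is illustrative, not from the PR):

```python
import re

def parse_memory_limit(limit: str) -> int:
    """Convert a human-readable size such as '4GB' or '512MB' into bytes."""
    units = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}
    match = re.fullmatch(r"(\d+)\s*(KB|MB|GB)", limit.strip(), re.IGNORECASE)
    if not match:
        raise ValueError(f"Unrecognized memory limit: {limit!r}")
    value, unit = match.groups()
    return int(value) * units[unit.upper()]

# In the sandbox child process, the parsed value would replace the hardcoded
# constant, e.g.:
# resource.setrlimit(resource.RLIMIT_AS, (parse_memory_limit(self.memory_limit), -1))
```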
```bash
hf download --repo-type dataset zhuzilin/dapo-math-17k --local-dir /root/dapo-math-17k
hf download Qwen/Qwen3-4B --local-dir /root/Qwen3-4B
```

### 2. Convert Model to Megatron-LM Format

```bash
source scripts/models/qwen3-4B.sh
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
    ${MODEL_ARGS[@]} \
    --hf-checkpoint /root/Qwen3-4B \
    --rotary-base 5000000 \
    --save /root/Qwen3-4B_torch_dist
```
The setup instructions contain hardcoded paths like /root/dapo-math-17k and /root/Qwen3-4B. This makes the example less portable and assumes a specific user environment, likely a particular Docker container. To improve usability for a wider audience, consider using environment variables for these paths or adding a note to instruct users to replace them with their own local paths.
```bash
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
The cleanup block at the start of the script is quite aggressive, immediately using pkill -9. It's also repetitive. It's generally better to first attempt a graceful shutdown (e.g., pkill without -9) before forcing termination. The repetition of pkill -9 ray and pkill -9 python suggests that processes might not be terminating correctly on the first attempt, which could be investigated. For cleaner code, you could wrap this logic in a function.
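The graceful-then-forceful pattern the reviewer suggests can be sketched in Python (a sketch only; the PR's scripts are shell, and `stop_process` is a hypothetical helper):

```python
import signal
import subprocess

def stop_process(proc: subprocess.Popen, grace_period: float = 3.0) -> None:
    """Send SIGTERM first; escalate to SIGKILL only if the process survives."""
    proc.terminate()  # graceful shutdown request (SIGTERM)
    try:
        proc.wait(timeout=grace_period)
    except subprocess.TimeoutExpired:
        proc.kill()  # last resort (SIGKILL)
        proc.wait()

# A long-running child that exits cleanly on SIGTERM:
proc = subprocess.Popen(["sleep", "60"])
stop_process(proc)
```

The same escalation in shell would be `pkill ray; sleep 3; pkill -9 ray`, avoiding the immediate `-9`.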
```bash
    --hf-checkpoint /root/Qwen3-4B
    --ref-load /root/Qwen3-4B_torch_dist
    --save /root/Qwen3-4B_miles/retool_v2_agentic/
    --save-interval 20
    --rotary-base 1000000
)

ROLLOUT_ARGS=(
    --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
```
The script contains hardcoded paths like /root/Qwen3-4B and /root/dapo-math-17k/dapo-math-17k.jsonl. This limits portability and reusability. It's better to use environment variables to define base paths for models and data, with /root as a default value. This would make the script more flexible for users with different environment setups.
Suggested change:

```diff
-    --hf-checkpoint /root/Qwen3-4B
-    --ref-load /root/Qwen3-4B_torch_dist
-    --save /root/Qwen3-4B_miles/retool_v2_agentic/
+    --hf-checkpoint "${DATA_DIR:-/root}/Qwen3-4B"
+    --ref-load "${DATA_DIR:-/root}/Qwen3-4B_torch_dist"
+    --save "${DATA_DIR:-/root}/Qwen3-4B_miles/retool_v2_agentic/"
     --save-interval 20
     --rotary-base 1000000
 )
 ROLLOUT_ARGS=(
-    --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
+    --prompt-data "${DATA_DIR:-/root}/dapo-math-17k/dapo-math-17k.jsonl"
```
```bash
WANDB_ARGS=(
    --use-wandb
    --wandb-project miles-dev-retool-v2
    --wandb-group qwen3-4B-multi-turn
```
The wandb group is named qwen3-4B-multi-turn, but this script is for agentic training (run_agentic.sh). This appears to be a copy-paste error from run_multi_turn.sh. For better clarity and experiment tracking, you should consider changing it to qwen3-4B-agentic or something similar that reflects the script's purpose.
Suggested change:

```diff
-    --wandb-group qwen3-4B-multi-turn
+    --wandb-group qwen3-4B-agentic
```
```bash
RUNTIME_ENV_JSON="{
  \"env_vars\": {
    \"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
  }
}"
```
Constructing JSON by embedding shell variables directly into a string is fragile. If a variable like ${SCRIPT_DIR} contains characters that need to be escaped in JSON, the string will become invalid. Using a here document is a more robust and readable way to create the JSON string.
Suggested change:

```diff
-RUNTIME_ENV_JSON="{
-  \"env_vars\": {
-    \"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
-    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
-    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
-  }
-}"
+RUNTIME_ENV_JSON=$(cat <<EOF
+{
+  "env_vars": {
+    "PYTHONPATH": "/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles",
+    "CUDA_DEVICE_MAX_CONNECTIONS": "1",
+    "NCCL_NVLS_ENABLE": "${HAS_NVLINK}"
+  }
+}
+EOF
+)
```
```bash
    --hf-checkpoint /root/Qwen3-4B
    --ref-load /root/Qwen3-4B_torch_dist
    --save /root/Qwen3-4B_miles/retool_v2_multi_turn/
    --save-interval 1000
)

ROLLOUT_ARGS=(
    --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
```
The script contains hardcoded paths like /root/Qwen3-4B and /root/dapo-math-17k/dapo-math-17k.jsonl. This limits portability. It's recommended to use environment variables to define base paths for models and data, with /root as a default value, to make the script more flexible.
Suggested change:

```diff
-    --hf-checkpoint /root/Qwen3-4B
-    --ref-load /root/Qwen3-4B_torch_dist
-    --save /root/Qwen3-4B_miles/retool_v2_multi_turn/
+    --hf-checkpoint "${DATA_DIR:-/root}/Qwen3-4B"
+    --ref-load "${DATA_DIR:-/root}/Qwen3-4B_torch_dist"
+    --save "${DATA_DIR:-/root}/Qwen3-4B_miles/retool_v2_multi_turn/"
     --save-interval 1000
 )
 ROLLOUT_ARGS=(
-    --prompt-data /root/dapo-math-17k/dapo-math-17k.jsonl
+    --prompt-data "${DATA_DIR:-/root}/dapo-math-17k/dapo-math-17k.jsonl"
```
```bash
RUNTIME_ENV_JSON="{
  \"env_vars\": {
    \"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
  }
}"
```
Constructing JSON by embedding shell variables directly into a string is fragile. If a variable like ${SCRIPT_DIR} contains characters that need to be escaped in JSON, the string will become invalid. Using a here document is a more robust and readable way to create the JSON string.
Suggested change:

```diff
-RUNTIME_ENV_JSON="{
-  \"env_vars\": {
-    \"PYTHONPATH\": \"/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles\",
-    \"CUDA_DEVICE_MAX_CONNECTIONS\": \"1\",
-    \"NCCL_NVLS_ENABLE\": \"${HAS_NVLINK}\"
-  }
-}"
+RUNTIME_ENV_JSON=$(cat <<EOF
+{
+  "env_vars": {
+    "PYTHONPATH": "/root/Megatron-LM/:${SCRIPT_DIR}:/root/miles",
+    "CUDA_DEVICE_MAX_CONNECTIONS": "1",
+    "NCCL_NVLS_ENABLE": "${HAS_NVLINK}"
+  }
+}
+EOF
+)
```
```python
        r"type\s*\(",
        r"isinstance\s*\(",
        r"issubclass\s*\(",
        r"super\s*\(",
        r"property\s*\(",
        r"staticmethod\s*\(",
        r"classmethod\s*\(",
        r"__\w+__",  # double underscore methods
```
The safety check blacklists many fundamental Python features, including type(), isinstance(), property(), staticmethod(), classmethod(), and all dunder methods (__\w+__). This is overly restrictive and will break a lot of valid, safe, object-oriented code, which severely limits the utility of the code interpreter. The blacklist should be more targeted to allow for common and safe language constructs.
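To illustrate both failure modes of this denylist, the snippet below (an illustration, not code from the PR) shows the dunder pattern rejecting an ordinary class definition while a trivially obfuscated `eval` slips past the `eval` pattern:

```python
import re

# Perfectly ordinary code that the denylist rejects:
benign = '''
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

if __name__ == "__main__":
    print(Point(1, 2).x)
'''
assert re.search(r"__\w+__", benign) is not None  # false positive: blocked

# Trivial obfuscation the same denylist misses entirely:
obfuscated = "f = ''.join(['e', 'v', 'a', 'l'])"
assert re.search(r"eval\s*\(", obfuscated) is None  # false negative: allowed
```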
Code Review
This pull request introduces new examples for multi-turn and agentic tool use, including new run scripts, a README, and a Python sandbox for code execution. However, a critical sandbox escape vulnerability has been identified in the newly added tool_sandbox.py file. The current denylist-based sandboxing mechanism is insecure and can be bypassed, potentially leading to Remote Code Execution (RCE). Beyond this, the Python sandbox implementation is overly restrictive, and there are also areas for improvement in the example scripts and documentation for clarity and correctness.
```python
def _check_code_safety(self, code: str) -> tuple[bool, str]:
    """Check code safety by scanning for dangerous patterns"""
    # Check for dangerous operations
    dangerous_patterns = [
        r"import\s+os",
        r"import\s+sys",
        r"import\s+subprocess",
        r"import\s+shutil",
        r"import\s+glob",
        r"import\s+pathlib",
        r"__import__",
        r"eval\s*\(",
        r"exec\s*\(",
        r"open\s*\(",
        r"file\s*\(",
        r"input\s*\(",
        r"raw_input\s*\(",
        r"compile\s*\(",
        r"execfile\s*\(",
        r"getattr\s*\(",
        r"setattr\s*\(",
        r"delattr\s*\(",
        r"hasattr\s*\(",
        r"globals\s*\(",
        r"locals\s*\(",
        r"vars\s*\(",
        r"dir\s*\(",
        r"type\s*\(",
        r"isinstance\s*\(",
        r"issubclass\s*\(",
        r"super\s*\(",
        r"property\s*\(",
        r"staticmethod\s*\(",
        r"classmethod\s*\(",
        r"__\w+__",  # double underscore methods
    ]

    for pattern in dangerous_patterns:
        if re.search(pattern, code, re.IGNORECASE):
            return False, f"Code contains dangerous pattern: {pattern}"

    # Check imported modules
    import_pattern = r"import\s+(\w+)"
    from_pattern = r"from\s+(\w+)"

    imports = re.findall(import_pattern, code)
    froms = re.findall(from_pattern, code)

    all_imports = set(imports + froms)
    for imp in all_imports:
        if imp not in self.allowed_modules:
            return False, f"Import of '{imp}' is not allowed"

    return True, "Code is safe"
```
The PythonSandbox.execute_code method, specifically the _check_code_safety function, relies on a denylist of regular expressions (e.g., r"__\w+__") to detect dangerous patterns. This denylist approach is fundamentally insecure and prone to bypasses, creating a critical sandbox escape vulnerability that could lead to Remote Code Execution (RCE). The pattern r"__\w+__" is also overly restrictive, blocking valid Python constructs like __init__ or __name__ == "__main__", which severely limits the utility of the interpreter. An attacker could craft malicious code to bypass these regex checks and achieve arbitrary code execution, leading to system compromise.

Remediation: replace the denylist-based validation with a more robust sandboxing solution. Consider technologies such as:

- Containers: run the code in a minimal, isolated Docker container with no network access and limited resources.
- gVisor or Firecracker: use a container runtime or microVM that provides a secure kernel abstraction layer.
- RestrictedPython: a tool that restricts the available modules and objects within a Python environment.
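As a complement to process-level isolation, an AST-based allowlist check is structurally more robust than matching source text, since parsing sees through obfuscation like `''.join(['e','v','a','l'])`. A minimal stdlib sketch (the allowlist contents and function name are illustrative, and this alone is still not a security boundary):

```python
import ast

ALLOWED_MODULES = {"math", "json", "re"}  # illustrative allowlist

def check_code_safety_ast(code: str) -> tuple[bool, str]:
    """Allowlist imports by inspecting the parsed AST instead of raw text."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return False, f"Syntax error: {exc}"
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in ALLOWED_MODULES:
                    return False, f"Import of '{alias.name}' is not allowed"
        elif isinstance(node, ast.ImportFrom):
            module = (node.module or "").split(".")[0]
            if module not in ALLOWED_MODULES:
                return False, f"Import from '{node.module}' is not allowed"
    return True, "Code is safe"
```

Unlike the regex version, this never flags `__init__` or `__name__`, and import checks cannot be dodged by whitespace tricks; real isolation still requires running the code in a separate, resource-limited process or container.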
```bash
cd miles
pip install -e . --no-deps
```
The setup instructions have a directory navigation issue. After cd miles, the shell's working directory is changed, which will cause subsequent commands that expect to be run from the repository root to fail. Using a subshell (...) for the installation steps will prevent changing the current directory and ensure the following commands execute correctly from the repository root.
Suggested change:

```diff
-cd miles
-pip install -e . --no-deps
+(cd miles && pip install -e . --no-deps)
```
```bash
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
The process cleanup logic at the beginning of the script is redundant. The pkill -9 ray and pkill -9 python commands are repeated. This can be simplified for better readability and maintainability. Also, pkill can accept multiple process names.
Suggested change:

```diff
-pkill -9 sglang
-sleep 3
-ray stop --force
-pkill -9 ray
-pkill -9 python
-sleep 3
-pkill -9 ray
-pkill -9 python
+pkill -9 sglang
+ray stop --force
+sleep 3  # Wait for processes to terminate
+pkill -9 ray python  # Force kill any remaining processes
```
```bash
WANDB_ARGS=(
    --use-wandb
    --wandb-project miles-dev-retool-v2
    --wandb-group qwen3-4B-multi-turn
```
```bash
pkill -9 sglang
sleep 3
ray stop --force
pkill -9 ray
pkill -9 python
sleep 3
pkill -9 ray
pkill -9 python
```
The process cleanup logic at the beginning of the script is redundant. The pkill -9 ray and pkill -9 python commands are repeated. This can be simplified for better readability and maintainability. Also, pkill can accept multiple process names.
Suggested change:

```diff
-pkill -9 sglang
-sleep 3
-ray stop --force
-pkill -9 ray
-pkill -9 python
-sleep 3
-pkill -9 ray
-pkill -9 python
+pkill -9 sglang
+ray stop --force
+sleep 3  # Wait for processes to terminate
+pkill -9 ray python  # Force kill any remaining processes
```
Code Review
This pull request introduces new examples for multi-turn and agentic generation with tool usage, which is a great addition. The changes include new run scripts, a README file, and a Python tool sandbox, along with updates to track multi-turn and tool call metadata. My review focuses on improving the clarity of the documentation, correcting a potential configuration error, and addressing a significant issue in the tool sandbox's security implementation that could block valid code.
```python
        r"property\s*\(",
        r"staticmethod\s*\(",
        r"classmethod\s*\(",
        r"__\w+__",  # double underscore methods
```
### 1. Setup and download datasets

```bash
cd miles
```
```bash
WANDB_ARGS=(
    --use-wandb
    --wandb-project miles-dev-retool-v2
    --wandb-group qwen3-4B-multi-turn
```
nit: maybe we can use a Python script, and run_agentic.sh and run_multi_turn.sh could become a single file with a simple bool flag and a short if-else
fzyzcjy left a comment
the general direction LGTM, i.e. almost zero lines of code change, except for copy-pasting the sandbox code