Skip to content

[Distributed] Replace os.system with os.kill in launch/main.py#78522

Open
whatyourname12345 wants to merge 12 commits intoPaddlePaddle:developfrom
whatyourname12345:fix-os-kill-robustness
Open

[Distributed] Replace os.system with os.kill in launch/main.py#78522
whatyourname12345 wants to merge 12 commits intoPaddlePaddle:developfrom
whatyourname12345:fix-os-kill-robustness

Conversation

@whatyourname12345
Copy link
Copy Markdown
Contributor

PR Category

Distributed Strategy

PR Types

Improvements

Description

python/paddle/distributed/launch/main.py 中用于清理进程的 os.system("kill -9 " + pid) 替换为 Python 原生 API os.kill(int(pid), 9),并增加了更完善的进程状态检测与异常防护机制。

主要收益与架构考量:

  1. 消除安全隐患(CWE-78):弃用 os.system 拼接字符串执行 Shell 命令,从根本上防止了潜在的命令注入风险,提升了分布式拉起脚本的安全性。
  2. 提升执行性能os.kill 直接发起底层系统调用,避免了通过 os.system 强行 Fork 子进程去执行 /bin/kill 命令的不必要系统开销,逻辑更加轻量和 Pythonic。
  3. 增强鲁棒性与严格语义对齐
    • 原先的 os.system 遇到无效 PID 或目标进程已退出时,通常只返回非零退出码,不会中断主程序的 auto-tuner 循环。
    • 本次重构增加了前置的空值与 .isdigit() 校验,并对 os.kill 可能在竞态条件下引发的 ProcessLookupError(目标进程已退出)和 PermissionError(无权限)进行了精准捕获与忽略。

是否引起精度变化

@paddle-bot
Copy link
Copy Markdown

paddle-bot bot commented Mar 29, 2026

你的PR提交成功,感谢你对开源项目的贡献!
请关注后续CI自动化测试结果,详情请参考Paddle-CI手册
Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Mar 29, 2026
@luotao1 luotao1 added the HappyOpenSource 快乐开源活动issue与PR label Mar 30, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Mar 30, 2026

Codecov Report

❌ Patch coverage is 94.44444% with 1 line in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@1a009f4). Learn more about missing BASE report.

Files with missing lines Patch % Lines
python/paddle/distributed/launch/main.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #78522   +/-   ##
==========================================
  Coverage           ?   94.44%           
==========================================
  Files              ?        2           
  Lines              ?       18           
  Branches           ?        0           
==========================================
  Hits               ?       17           
  Misses             ?        1           
  Partials           ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Copy Markdown
Contributor

@luotao1 luotao1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Image Please add unittest.

@whatyourname12345
Copy link
Copy Markdown
Contributor Author

我将进程清理循环提取到一个辅助函数 _terminate_other_processes 中,以将其与 launch() 上下文解耦。我还使用 unittest.mock.patch 添加了一个稳健的单元测试,以模拟 os.kill 期间的 ProcessLookupError 和 PermissionError。这保证了防御性逻辑可以被 Codecov 完全覆盖,同时不会使测试变得脆弱。

@whatyourname12345
Copy link
Copy Markdown
Contributor Author

/re-run all-failed

zrr1999
zrr1999 previously approved these changes Apr 1, 2026
Copy link
Copy Markdown
Member

@zrr1999 zrr1999 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

只是写法上的改进,功能上没什么问题

print("------------------------------------------------")


def terminate_other_processes(processes: list[str], self_pid: str) -> None:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def terminate_other_processes(processes: list[str], self_pid: str) -> None:
def terminate_processes(processes: Sequence[int]) -> bool:

这里改成接受 int 序列,然后返回kill 成功和失败吧(任意一个失败就视为失败)。

筛选逻辑提到 launch

terminate_processes([int(pid.strip()) for pid in processes if not pid.isdigit() or pid == self_pid])

空字符串 isdigit 返回为False,所以就没必要再判断是不是空了。

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我已经按这个方向完成了重构。另外,在上述的列表推导式逻辑好像写反了,我改成了:if pid.strip().isdigit() and pid.strip() != self_pid

Copy link
Copy Markdown
Contributor Author

@whatyourname12345 whatyourname12345 Apr 3, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @zrr1999 @luotao1,方便的话请帮忙Review一下,如果没问题的话,希望能帮忙Bypass CI并合并

@zrr1999 zrr1999 requested a review from SigureMo April 5, 2026 02:44
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

我看了这次 PR 的目标、diff 和新增测试,改动范围比较聚焦:

  • python/paddle/distributed/launch/main.py 将 auto_tuner 清理进程的逻辑收敛到公共辅助函数;
  • python/paddle/distributed/utils/launch_utils.pyos.kill 替代了字符串拼接的 os.system("kill -9 " + pid),同时对无效 PID、目标进程已退出和权限不足这几类情况做了更稳妥的处理;
  • test/legacy_test/test_launch_main_kill.py 补了对应单测,覆盖了 PID 过滤以及 ProcessLookupError / PermissionError 分支。

从实现和兼容性上看,这次变更没有改变原有 auto_tuner 的主流程语义,只是把原先的 shell kill 改成了更安全、也更 Pythonic 的实现;我目前没有看到阻塞合入的问题。

补充说明一下,当前唯一失败的 Linux-IXUCA / Build and Test 日志停在 Paddle-iluvatar patch apply 失败,和这次 Python 侧的改动没有直接关系。

print("------------------------------------------------")


def filter_pids(processes: list[str], self_pid: str) -> list[int]:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def filter_pids(processes: list[str], self_pid: str) -> list[int]:
def filter_pids(processes: Sequence[str], self_pid: str) -> list[int]:

]


def cleanup_processes(processes: list[str], self_pid: str) -> bool:
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def cleanup_processes(processes: list[str], self_pid: str) -> bool:
def cleanup_processes(processes: Sequence[str], self_pid: str) -> bool:

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这就不该转成 str 吧?

pid = str(process.strip())
if pid != self_pid:
os.system("kill -9 " + pid)
launch_utils.cleanup_processes(processes, self_pid)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

传进去之前就转成 int,我不觉得应该有一个接口签名应该是 (processes: list[str], self_pid: str),进程 id 只应该是 int,在传进去之前就处理好它

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

传进去之前就转成 int,我不觉得应该有一个接口签名应该是 (processes: list[str], self_pid: str),进程 id 只应该是 int,在传进去之前就处理好它

同意。这里现在更明确的 blocker 已经不是 list[str] / Sequence[str] 这种注解细节,而是接口边界本身放错了层级:当前实现把原始 shell 输出解析和真正的进程终止揉在了一起,导致 PID 在 helper 层仍然是 stringly-typed。

按当前 review 结论,我认为应当在调用侧先完成过滤与 str -> int 转换,再把 Sequence[int] 传给终止逻辑;如果还需要保留过滤辅助函数,再单独承担 Sequence[str] -> list[int] 的职责会更清晰。

Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

跟进 @SigureMo 新增的 4 条 review comments 后,我同意这里还需要再收一轮接口设计,当前版本不太适合继续维持我前面的 approve 结论。

核心原因有两点:

  • pid 在语义上应该尽早收敛为 int,不建议在工具层继续暴露 (processes: list[str], self_pid: str) 这一类 stringly-typed 接口;
  • 目前 cleanup_processes() 把“原始字符串解析”和“真正的进程终止”揉在一起了,边界有些反直觉。更合理的方向是:调用侧先完成字符串到 PID 的过滤/转换,再把 Sequence[int] 传给 terminate_processes();如果 filter_pids() 保留,入参也更适合写成 Sequence[str]

也就是说,这组评论指出的不是功能性 bug,而是一个真实的接口建模问题;我建议按这个方向把类型边界收敛后再看。

补充说明:CI 现状没有变化,除 Linux-IXUCA / Build and Test 外其余检查已通过;IXUCA 失败仍然是 Paddle-iluvatar patch apply 的问题,与这里的 Python 改动无直接关系。

@SigureMo SigureMo changed the title [Improvement] Replace os.system with os.kill in launch/main.py [Improvement] Replace os.system with os.kill in launch/main.py Apr 5, 2026
@SigureMo SigureMo changed the title [Improvement] Replace os.system with os.kill in launch/main.py [Distributed] Replace os.system with os.kill in launch/main.py Apr 5, 2026
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

我这边此前卡住的两个设计点已经解决了:

  • self_pid 与终止逻辑已经统一收敛为 int
  • 调用侧先通过 filter_pids() 完成过滤与 str -> int 转换,再把 Sequence[int] 传给 terminate_processes(),不再暴露之前那个 stringly-typed 的 (processes: list[str], self_pid: str) 接口边界。

对应测试也同步更新到了新的 int PID / signal.SIGKILL 语义。从我这边看,之前的设计 blocker 已经解除。filter_pids() 里最新那条评论更像实现形式上的建议,不影响我对核心问题已修复的判断。

补充一下,当前还有部分 CI 在跑;已出现的 Linux-IXUCA / Build and Test 失败看起来仍然是 Paddle-iluvatar patch apply 问题,和这次 Python 侧改动没有直接关系。

Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
SigureMo
SigureMo previously approved these changes Apr 5, 2026
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

我补看了最新这个 commit:这次改动很小,核心是把 terminate_processes() 里的信号选择改成 getattr(signal, "SIGKILL", signal.SIGTERM),避免在 Windows 环境下直接访问 signal.SIGKILL 触发属性错误;对应测试也同步改成按同样逻辑断言预期信号。

从我这边看:

  • 之前关于 PID 类型边界的设计 blocker 没有回退;
  • 这次修正主要是补平台兼容性,不影响前面已经理顺的 str -> int 边界;
  • 目前没有看到新的阻塞问题。

补充说明:当前 CI 仍在重新跑,所以 PR 整体状态还没有最终稳定;这条 approve 只针对代码层面的最新改动结论。

Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这次新提交的方向我认可:把 Windows fallback 从 getattr(...) 改成显式的平台分支,语义更清楚。

但当前版本还有一个明确阻塞项,需要先修掉:

  • python/paddle/distributed/utils/launch_utils.py:578 引入了 platform.system(),但文件顶部没有同步 import platform
  • 最新 Pre Commit 已因此失败,ruff 报错为 F821 Undefined name 'platform'

也就是说,这次提交已经在方向上回应了 review,但当前 head 还没有到可合入状态。请先补上缺失导入并让 codestyle 通过。

Copy link
Copy Markdown
Contributor

@ShigureNyako ShigureNyako left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTMeow 🐾

我这边上一轮卡住的是一个很具体的问题:platform.system() 已经引入到实现里,但缺少 import platform,导致 Pre Commit 里的 ruff 直接报 F821 Undefined name 'platform'

最新这个 commit 已经把缺失导入补齐,并且测试里的预期信号判断也同步改成了与实现一致的显式平台分支。就我提出的这个 blocker 而言,现在已经修复。

补充说明:这条 approve 只表示我这边最新代码问题已解除;PR 整体是否可合并,仍取决于其他 reviewer 状态和 CI 最终结果。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contributor External developers HappyOpenSource 快乐开源活动issue与PR

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants