
Need MLIR-level MC2 collective / comm-context surface and better 8-rank routing diagnostics #254

@zhoubot


Summary

PTOAS can already lower local routing primitives like TGATHER / TSCATTER, but MC2 routing ports still have no MLIR-level surface for the distributed communication context itself.

Current state

In the local 910B migration workspace:

  • moe_distribute_dispatch and moe_distribute_combine now compile through PTO-DSL -> PTOAS -> bisheng for the real A2 contract (epWorldSize=8, H=7168).
  • The PTO source contains no explicit synchronization, and PTOAS insert-sync is active.
  • The actual 8-rank routing benchmark still times out before any rank report is emitted, so failures are currently only visible at the outer harness level.

Gaps exposed by bring-up

  • No MLIR-level op or abstraction for HCCL / parallel-group / window context used by MC2 collectives.
  • No direct way to represent the communication boundary in PTO IR, so MC2 ports fall back to host-managed HCCL steps outside PTOAS.
  • Diagnostics for multi-rank routing failures are too indirect; once ranks stall, PTOAS provides no comm-specific breadcrumbs to distinguish context/setup issues from lowered-kernel issues.
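To make the diagnostics gap concrete, here is a minimal sketch of the kind of attribution that per-rank breadcrumbs would allow. Everything in it is an assumption for illustration: the phase names, the `classify_stall` helper, and the breadcrumb format are hypothetical and are not existing PTOAS, HCCL, or harness output.

```python
from collections import Counter

# Hypothetical phase names, ordered from earliest to latest.
# These are illustrative only; PTOAS does not emit them today.
PHASES = ["start", "comm_context_ready", "kernel_launched", "kernel_done"]


def classify_stall(breadcrumbs, world_size=8):
    """Attribute each rank's stall from the latest phase it recorded.

    breadcrumbs: dict mapping rank -> list of phase names logged by that rank.
    Ranks with no entry are treated as having logged nothing.
    """
    context_ready = PHASES.index("comm_context_ready")
    kernel_done = PHASES.index("kernel_done")
    verdicts = {}
    for rank in range(world_size):
        seen = breadcrumbs.get(rank, [])
        # Index of the furthest known phase reached, or -1 if none was logged.
        last = max((PHASES.index(p) for p in seen if p in PHASES), default=-1)
        if last < context_ready:
            verdicts[rank] = "context/setup"
        elif last < kernel_done:
            # Context was ready, so the stall is in launch or the lowered kernel.
            verdicts[rank] = "lowered-kernel"
        else:
            verdicts[rank] = "completed"
    return verdicts


def summarize(verdicts):
    """Count how many ranks fall into each category."""
    return dict(Counter(verdicts.values()))
```

With breadcrumbs like these, a timeout in the 8-rank benchmark could be reported as, for example, "6 ranks stalled before the comm context was ready" instead of a single harness-level timeout.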

Requested work

  • Add PTO IR / lowering surface for MC2 collective context and collective-style operations.
  • Improve diagnostics around comm-lowered kernels so 8-rank routing failures can be attributed earlier and more precisely.
  • Keep A2/A3 autosync behavior explicit for scalar-pipe comm instructions such as TSCATTER.
