Fix: resolve performance profiling deadlock with dynamic task count u…#94

Open
ChaoZheng109 wants to merge 1 commit intoChaoWao:mainfrom
ChaoZheng109:pref-bug

Conversation

@ChaoZheng109
Contributor

…pdates

Fixes a spinlock deadlock in AICPU performance profiling in which Device threads hang waiting for the Host to read buffers. The root cause was that the Host exited collection early because it read total_tasks=0 while orchestration was still running in parallel with scheduling.

Changes:

  • Use 0xFFFFFFFF as uninitialized marker for total_tasks instead of 0
  • AICPU dynamically updates total_tasks from orchestrator's current_task_index
  • Host polls and refreshes expected_tasks during collection
  • Avoid duplicate atomic reads by reusing visible_tasks
  • Log only on the first update and on orchestration completion
  • Increase profiling timeout from 2s to 30s for large graphs
  • Add timeout detection in switch_perf_buffer spinlock
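
For reviewers, the timeout detection in the last item can be pictured roughly as below. This is only a sketch and not the code in this PR: `BufState`, `wait_buffer_free`, and the use of `std::chrono` are hypothetical stand-ins, since the actual body of switch_perf_buffer and the device clock source are not shown here.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical buffer states; the real buffer layout is not shown in this PR text.
enum class BufState : uint32_t { Free = 0, Full = 1 };

// Spin until the Host marks the buffer Free, or give up once `timeout` elapses.
// Returns true if the buffer became free, false if the wait timed out.
inline bool wait_buffer_free(const std::atomic<BufState>& state,
                             std::chrono::milliseconds timeout) {
    assert(timeout.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (state.load(std::memory_order_acquire) != BufState::Free) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;  // caller can log and skip the sample instead of hanging
        }
    }
    return true;
}
```

With a bounded wait, a Host that stops reading early shows up as a reported timeout on the Device side rather than a silent hang.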

This enables real-time visibility into orchestrator progress and prevents deadlock when orchestration and scheduling run in parallel.
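
A minimal sketch of the sentinel and polling idea, for illustration only. `PerfHeader`, `publish_task_count`, `collection_complete`, and the `orchestration_done` flag are assumed names and are not taken from the diff; only the 0xFFFFFFFF marker and the total_tasks, current_task_index, and visible_tasks names come from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Sentinel from the PR description: total_tasks has not been published yet.
constexpr uint32_t kTasksUninitialized = 0xFFFFFFFFu;

// Hypothetical shared header; the real struct is not shown in this PR text.
struct PerfHeader {
    std::atomic<uint32_t> total_tasks{kTasksUninitialized};
};

// Device side: publish the orchestrator's current task index.
// Returns true if the stored value changed, which is when a log line might be emitted.
inline bool publish_task_count(PerfHeader& h, uint32_t current_task_index) {
    assert(current_task_index != kTasksUninitialized);
    const uint32_t prev =
        h.total_tasks.exchange(current_task_index, std::memory_order_release);
    return prev != current_task_index;
}

// Host side: decide whether collection may stop.
// `orchestration_done` is an assumed flag meaning the count is final.
inline bool collection_complete(const PerfHeader& h, uint32_t collected,
                                bool orchestration_done) {
    // Read once and reuse the value, rather than loading the atomic twice.
    const uint32_t visible_tasks = h.total_tasks.load(std::memory_order_acquire);
    if (visible_tasks == kTasksUninitialized) {
        return false;  // nothing published yet: keep polling, do not exit early
    }
    return orchestration_done && collected >= visible_tasks;
}
```

The point of the sentinel is that 0 can be a legitimate task count, while an unpublished count must never be mistaken for "nothing left to collect".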
