Modular runtime for building and executing task dependency graphs on Ascend devices with coordinated AICPU and AICore execution.
The PTO Runtime consists of three separate programs that communicate through well-defined APIs:
┌─────────────────────────────────────────────────────────────┐
│                     Python Application                      │
│                  (examples/basic/main.py)                   │
└─────────────────────────┬───────────────────────────────────┘
                          │
         ┌────────────────┼────────────────┐
         │                │                │
  Python Bindings     (ctypes)        Device I/O
runtime_bindings.py
         │                │                │
         ▼                ▼                ▼
┌──────────────────┐              ┌──────────────────┐
│   Host Runtime   │              │   Binary Data    │
│   (src/host/)    │              │ (AICPU + AICore) │
├──────────────────┤              └──────────────────┘
│  DeviceRunner    │                       │
│  Runtime         │                       │
│  MemoryAllocator │               Loaded at runtime
│  C API           │                       │
└────────┬─────────┘                       │
         │                                 │
         └────────────────┬────────────────┘
                          │
                          ▼
            ┌────────────────────────────┐
            │  Ascend Device (Hardware)  │
            ├────────────────────────────┤
            │  AICPU: Task Scheduler     │ (src/aicpu/)
            │  AICore: Compute Kernels   │ (src/aicore/)
            └────────────────────────────┘
Host Runtime (src/host/): C++ library - Device orchestration and management
- DeviceRunner: Singleton managing device operations
- Runtime: Task dependency runtime data structure
- MemoryAllocator: Device tensor memory management
- pto_runtime_c_api.h: Pure C API for Python bindings
- Compiled to shared library (.so) at runtime
Key Responsibilities:
- Allocate/free device memory
- Host ↔ Device data transfer
- AICPU kernel launching and configuration
- AICore kernel registration and loading
- Runtime execution workflow coordination
AICPU Kernel (src/aicpu/): Device program - Task scheduler running on the AICPU processor
- kernel.cpp: Kernel entry points and handshake protocol
- execute.cpp: Task scheduler implementation
- Compiled to device binary at build time
Key Responsibilities:
- Initialize handshake protocol with AICore cores
- Identify initially ready tasks (fanin=0)
- Dispatch ready tasks to idle AICore cores
- Track task completion and update dependencies
- Continue until all tasks complete
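The scheduling loop listed above can be sketched as a host-side simulation in plain Python. This is not the code in execute.cpp: the `schedule` function, the dict-based graph, and the round-based completion model are assumptions made for illustration only.

```python
from collections import deque


def schedule(fanout, num_cores):
    """Simulate fanin-based dispatch and return the tasks run in each round.

    fanout maps every task id to the list of its successor task ids.
    """
    # Fanin is the number of unfinished predecessors of each task.
    fanin = {task: 0 for task in fanout}
    for successors in fanout.values():
        for succ in successors:
            fanin[succ] += 1

    # Initially ready tasks are those with fanin == 0.
    ready = deque(sorted(t for t, n in fanin.items() if n == 0))
    rounds = []
    completed = 0
    while completed < len(fanout):
        if not ready:
            raise ValueError("dependency cycle: no task is ready")
        # Dispatch at most one ready task per idle core.
        running = [ready.popleft() for _ in range(min(num_cores, len(ready)))]
        rounds.append(running)
        # Model completion: each finished task decrements the fanin of its
        # successors, and a successor becomes ready when its fanin hits 0.
        for task in running:
            completed += 1
            for succ in fanout[task]:
                fanin[succ] -= 1
                if fanin[succ] == 0:
                    ready.append(succ)
    return rounds


# Hypothetical 4-task graph: task 0 feeds tasks 1 and 2, which both feed task 3.
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(schedule(graph, num_cores=2))  # [[0], [1, 2], [3]]
```

With a single core the same graph runs strictly one task per round, which shows that the fanin counters, not the core count, decide what is allowed to run.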
AICore Kernel (src/aicore/): Device program - Computation kernels executing on AICore processors
- kernel.cpp: Task execution kernels (add, mul, etc.)
- Compiled to object file (.o) at build time
Key Responsibilities:
- Wait for task assignment via handshake buffer
- Read task arguments and kernel address
- Execute kernel using PTO ISA
- Signal task completion
- Poll for next task or quit signal
Three layers of APIs enable the separation: a C++ API on the host, a pure C API for bindings, and a Python API, shown below in that order:
DeviceRunner& runner = DeviceRunner::Get();
runner.Init(device_id, num_cores, aicpu_bin, aicore_bin, pto_isa_root);
runner.AllocateTensor(bytes);
runner.CopyToDevice(device_ptr, host_ptr, bytes);
runner.Run(runtime);
runner.Finalize();

int DeviceRunner_Init(device_id, num_cores, aicpu_binary, aicpu_size,
aicore_binary, aicore_size, pto_isa_root);
int DeviceRunner_Run(runtime_handle, launch_aicpu_num);
int InitRuntime(runtime_handle);
int FinalizeRuntime(runtime_handle);
int DeviceRunner_Finalize();

Runtime = load_runtime(host_binary)
runtime = Runtime()
runtime.initialize()
launch_runtime(runtime, aicpu_thread_num=1, block_dim=1,
device_id=device_id, aicpu_binary=aicpu_bytes,
aicore_binary=aicore_bytes)
runtime.finalize()

runtime/
├── src/
│   ├── host/                         # Host runtime program
│   │   ├── devicerunner.h/cpp        # Device management
│   │   ├── memoryallocator.h/cpp     # Memory allocation
│   │   ├── kernel_compiler.h/cpp     # Runtime kernel compilation
│   │   ├── binary_loader.h/cpp       # Binary loading utilities
│   │   ├── pto_runtime_c_api.h/cpp   # C API for bindings
│   │   └── function_cache.h          # Kernel binary cache
│   ├── aicpu/                        # AICPU kernel (device program)
│   │   ├── kernel.cpp                # Entry points & handshake
│   │   ├── execute.cpp               # Task scheduler
│   │   └── device_log.h/cpp          # Device logging
│   ├── aicore/                       # AICore kernel (device program)
│   │   └── kernel.cpp                # Task execution kernels
│   └── common/                       # Shared structures
│       └── kernel_args.h             # Kernel argument structures
│
├── python/                           # Language bindings
│   ├── runtime_bindings.py           # ctypes wrapper (C → Python)
│   ├── binary_compiler.py            # Multi-platform compiler
│   └── toolchain.py                  # Toolchain configuration
│
├── examples/basic/                   # Complete working example
│   ├── main.py                       # Python orchestration
│   ├── host/runtimemaker.cpp         # C++ runtime builder & validator
│   ├── aicpu/execute.cpp             # Example scheduler
│   ├── runtime/                      # Task runtime definitions
│   │   ├── runtime.h/cpp             # Task runtime and handshake structures
│   │   └── kernel_args.h
│   └── kernels/aiv/                  # Example kernels
│       ├── kernel_add.cpp
│       ├── kernel_add_scalar.cpp
│       └── kernel_mul.cpp
│
└── CMakeLists.txt                    # Build configuration

- CMake 3.15+
- CANN toolkit with:
  - ccec compiler (AICore Bisheng CCE)
  - Cross-compiler for AICPU (aarch64-target-linux-gnu-gcc/g++)
- Standard C/C++ compiler (gcc/g++) for host
- Python 3 with development headers
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
export ASCEND_HOME_PATH=/usr/local/Ascend/ascend-toolkit/latest

The BinaryCompiler class handles compilation of all three components separately:
from binary_compiler import BinaryCompiler
compiler = BinaryCompiler()
# Compile each component to independent binaries
aicore_binary = compiler.compile("aicore", include_dirs, source_dirs) # → .o file
aicpu_binary = compiler.compile("aicpu", include_dirs, source_dirs) # → .so file
host_binary = compiler.compile("host", include_dirs, source_dirs) # → .so file

Toolchains used:
- AICore: Bisheng CCE (ccec compiler) → .o object file
- AICPU: aarch64 cross-compiler → .so shared object
- Host: Standard gcc/g++ → .so shared library
Each component is compiled independently with its own toolchain, allowing modular development.
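The target-to-toolchain mapping above can be captured as a small lookup table. The sketch below is illustrative only and is not how BinaryCompiler is implemented: `BuildPlan`, `BUILD_PLANS`, and `plan_for` are hypothetical names, and checking ASCEND_HOME_PATH only for the "aicore" target is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class BuildPlan:
    toolchain: str  # toolchain family named in the list above
    extension: str  # file extension of the produced binary


# Mapping taken from the toolchain list; the table itself is hypothetical.
BUILD_PLANS = {
    "aicore": BuildPlan("Bisheng CCE (ccec)", ".o"),
    "aicpu": BuildPlan("aarch64 cross-compiler", ".so"),
    "host": BuildPlan("gcc/g++", ".so"),
}


def plan_for(target, env=None):
    """Return the build plan for a target, validating the environment first."""
    env = os.environ if env is None else env
    if target not in BUILD_PLANS:
        raise ValueError(f"unknown target: {target!r}")
    # Assumption: only the AICore build needs ASCEND_HOME_PATH to find ccec.
    if target == "aicore" and not env.get("ASCEND_HOME_PATH"):
        raise EnvironmentError("ASCEND_HOME_PATH is not set")
    return BUILD_PLANS[target]


print(plan_for("host", env={}))
```

Failing early on a missing environment variable gives a clearer error than a compiler that cannot be found halfway through a build.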
from runtime_bindings import load_runtime, launch_runtime
from binary_compiler import BinaryCompiler
# Compile all binaries
compiler = BinaryCompiler()
aicore_bin = compiler.compile("aicore", [...include_dirs...], [...source_dirs...])
aicpu_bin = compiler.compile("aicpu", [...include_dirs...], [...source_dirs...])
host_bin = compiler.compile("host", [...include_dirs...], [...source_dirs...])
# Load and initialize runtime
Runtime = load_runtime(host_bin)
runtime = Runtime()
runtime.initialize() # C++ builds runtime and allocates tensors
# Execute runtime on device
launch_runtime(runtime,
aicpu_thread_num=1,
block_dim=1,
device_id=9,
aicpu_binary=aicpu_bin,
aicore_binary=aicore_bin)
runtime.finalize() # Verify and cleanup

cd runtime/examples/basic
python3 main.py [device_id]

This example:
- Compiles AICPU, AICore, and Host binaries using BinaryCompiler
- Loads the host runtime library
- Initializes DeviceRunner with compiled binaries
- Creates a task runtime: f = (a + b + 1)(a + b + 2) with 4 tasks and dependencies
- Executes on device (AICPU scheduling, AICore computing)
- Validates results and cleans up
Expected output:
=== Creating and Initializing Runtime ===
Formula: (a + b + 1)(a + b + 2)
=== Executing Runtime on Device ===
=== Validating Results and Cleaning Up ===
✓ SUCCESS: All 16384 elements are correct (42.0)
Formula verified: (a + b + 1)(a + b + 2) = (2+3+1)*(2+3+2) = 42
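The expected value can be reproduced on the host without any device. The sketch below assumes the four tasks are one add, two scalar adds, and one multiply; that split is inferred from the example kernel names (kernel_add, kernel_add_scalar, kernel_mul) and is not confirmed by the source. The `reference` helper is hypothetical.

```python
def reference(a, b):
    """Host-side reference for f = (a + b + 1)(a + b + 2), one element."""
    t0 = a + b      # task 0: add (assumed)
    t1 = t0 + 1     # task 1: add scalar (assumed), depends on task 0
    t2 = t0 + 2     # task 2: add scalar (assumed), depends on task 0
    return t1 * t2  # task 3: mul (assumed), depends on tasks 1 and 2


NUM_ELEMENTS = 16384  # element count reported in the expected output
a = [2.0] * NUM_ELEMENTS
b = [3.0] * NUM_ELEMENTS
f = [reference(x, y) for x, y in zip(a, b)]

assert all(value == 42.0 for value in f)
print(f"All {len(f)} elements are correct ({f[0]})")
```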
Python main.py
  │
  ├─→ BinaryCompiler.compile("host", ...) → host_binary (.so)
  ├─→ BinaryCompiler.compile("aicpu", ...) → aicpu_binary (.so)
  ├─→ BinaryCompiler.compile("aicore", ...) → aicore_binary (.o)
  │
  └─→ load_runtime(host_binary)
       └─→ RuntimeLibraryLoader(host_binary)
            └─→ CDLL(host_binary) ← Loads .so into memory

runner.init(device_id, num_cores, aicpu_binary, aicore_binary, pto_isa_root)
  │
  ├─→ DeviceRunner_Init (C API)
  │    ├─→ Initialize CANN device
  │    ├─→ Allocate device streams
  │    ├─→ Load AICPU binary to device
  │    ├─→ Register AICore kernel binary
  │    └─→ Create handshake buffers (one per core)
  │
  └─→ DeviceRunner singleton ready

runtime.initialize()
  │
  └─→ InitRuntime (C API)
       └─→ InitRuntimeImpl (C++)
            ├─→ Compile kernels at runtime (CompileAndLoadKernel)
            │    ├─→ KernelCompiler calls ccec
            │    ├─→ Load .o to device GM memory
            │    └─→ Update kernel function address table
            │
            ├─→ Allocate device tensors via MemoryAllocator
            ├─→ Copy input data to device
            ├─→ Build task runtime with dependencies
            └─→ Return Runtime pointer

launch_runtime(runtime, aicpu_thread_num=1, block_dim=1, device_id=device_id,
               aicpu_binary=aicpu_bytes, aicore_binary=aicore_bytes)
  │
  └─→ launch_runtime (C API)
       │
       ├─→ Copy Runtime to device memory
       │
       ├─→ LaunchAiCpuKernel (init kernel)
       │    └─→ Execute on AICPU: Initialize handshake
       │
       ├─→ LaunchAiCpuKernel (main scheduler kernel)
       │    └─→ Execute on AICPU: Task scheduler loop
       │         ├─→ Find initially ready tasks
       │         ├─→ Loop: dispatch tasks, wait for completion
       │         └─→ Continue until all tasks done
       │
       ├─→ LaunchAicoreKernel
       │    └─→ Execute on AICore cores: Task workers
       │         ├─→ Wait for task assignment
       │         ├─→ Execute kernel
       │         └─→ Signal completion, repeat
       │
       └─→ rtStreamSynchronize (wait for completion)

runtime.finalize()
  │
  └─→ FinalizeRuntime (C API)
       └─→ FinalizeRuntimeImpl (C++)
            ├─→ Copy results from device to host
            ├─→ Verify correctness (compare with expected values)
            ├─→ Free all device tensors
            ├─→ Delete runtime
            └─→ Return success/failure

AICPU and AICore cores coordinate via handshake buffers (one per core):
struct Handshake {
volatile uint32_t aicpu_ready; // AICPU→AICore: scheduler ready
volatile uint32_t aicore_done; // AICore→AICPU: core ready
volatile uint64_t task; // AICPU→AICore: task pointer
volatile int32_t task_status; // Task state: 1=busy, 0=done
volatile int32_t control; // AICPU→AICore: 1=quit
};

Flow:
- AICPU finds a ready task
- AICPU writes task pointer to handshake buffer and sets aicpu_ready
- AICore polls buffer, sees task, reads from device memory
- AICore sets task_status = 1 (busy) and executes
- AICore sets task_status = 0 (done) and aicore_done
- AICPU reads result and continues
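The flow above can be walked through with a single-threaded Python model of one handshake buffer. It mirrors the field names of the struct, but the helper functions, the rule for clearing flags after each task, and the fake task addresses are assumptions for illustration, not the device implementation.

```python
from dataclasses import dataclass


@dataclass
class Handshake:
    aicpu_ready: int = 0  # AICPU→AICore: scheduler ready
    aicore_done: int = 0  # AICore→AICPU: core ready
    task: int = 0         # AICPU→AICore: task pointer (0 means no task)
    task_status: int = 0  # 1=busy, 0=done
    control: int = 0      # AICPU→AICore: 1=quit


def aicpu_dispatch(hs, task_ptr):
    """AICPU side: publish a task pointer, then raise the ready flag."""
    hs.task = task_ptr
    hs.aicore_done = 0
    hs.aicpu_ready = 1


def aicore_step(hs, kernels, log):
    """AICore side: one polling step. Returns False once quit is seen."""
    if hs.control == 1:
        return False
    if hs.aicpu_ready == 1 and hs.task != 0:
        hs.task_status = 1               # busy
        log.append(kernels[hs.task]())   # execute the assigned kernel
        hs.aicpu_ready = 0               # assumed: the request is consumed
        hs.task_status = 0               # done
        hs.aicore_done = 1
    return True


def aicpu_collect(hs):
    """AICPU side: True if the core finished; frees the buffer for reuse."""
    if hs.aicore_done == 1 and hs.task_status == 0:
        hs.task = 0
        hs.aicore_done = 0
        return True
    return False


# Hypothetical kernel table keyed by made-up device addresses.
kernels = {0x1000: lambda: "add", 0x2000: lambda: "mul"}
hs, log = Handshake(), []
for ptr in (0x1000, 0x2000):
    aicpu_dispatch(hs, ptr)
    aicore_step(hs, kernels, log)
    assert aicpu_collect(hs)

hs.control = 1  # tell the worker to quit
assert aicore_step(hs, kernels, log) is False
print(log)  # ['add', 'mul']
```

On real hardware both sides run concurrently and the volatile fields are polled, so write ordering matters; this model only shows the order of state changes.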
DeviceRunner: Singleton managing device operations
- Allocate/free device tensor memory
- Copy data between host and device
- Launch AICPU and AICore kernels
- Manage handshake buffers
- Coordinate runtime execution
Runtime: Task dependency runtime
- Add tasks with arguments and function IDs
- Add dependencies between tasks (fanin/fanout)
- Query task information and dependency structure
- Calculate topologically ready tasks
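The Runtime responsibilities above amount to a small graph structure with fanin counters and fanout lists. The sketch below is a Python analogue, not the C++ class: `TaskGraph` and its method names are hypothetical, while the three limits reuse the values of the RUNTIME_MAX_* macros documented in this README.

```python
RUNTIME_MAX_TASKS = 1024   # maximum number of tasks
RUNTIME_MAX_ARGS = 16      # maximum arguments per task
RUNTIME_MAX_FANOUT = 512   # maximum successors per task


class TaskGraph:
    """Hypothetical Python analogue of the task dependency structure."""

    def __init__(self):
        self.tasks = []

    def add_task(self, func_id, args):
        """Append a task and return its id."""
        if len(self.tasks) >= RUNTIME_MAX_TASKS:
            raise OverflowError("too many tasks")
        if len(args) > RUNTIME_MAX_ARGS:
            raise ValueError("too many arguments")
        self.tasks.append(
            {"func_id": func_id, "args": list(args), "fanin": 0, "fanout": []}
        )
        return len(self.tasks) - 1

    def add_successor(self, src, dst):
        """Record that dst depends on src."""
        fanout = self.tasks[src]["fanout"]
        if len(fanout) >= RUNTIME_MAX_FANOUT:
            raise OverflowError("too many successors")
        fanout.append(dst)
        self.tasks[dst]["fanin"] += 1

    def initial_ready_tasks(self):
        """Tasks with no predecessors can start immediately."""
        return [i for i, t in enumerate(self.tasks) if t["fanin"] == 0]


# Usage with made-up function ids and argument names.
g = TaskGraph()
t0 = g.add_task(func_id=0, args=["a", "b"])
t1 = g.add_task(func_id=1, args=["t0"])
g.add_successor(t0, t1)
print(g.initial_ready_tasks())  # [0]
```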
MemoryAllocator: Device memory management
- Allocate blocks from device GM memory
- Track allocations automatically
- Free with automatic cleanup on finalization
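Automatic tracking with cleanup on finalization can be modelled as a thin wrapper around raw allocate and free calls. This is a sketch under assumptions: `TrackingAllocator` is a hypothetical name, and the fake backend below stands in for the real device memory calls, which are not shown in this README.

```python
import itertools


class TrackingAllocator:
    """Records every live allocation so finalize() can release leftovers."""

    def __init__(self, raw_alloc, raw_free):
        self._raw_alloc = raw_alloc
        self._raw_free = raw_free
        self._live = {}  # pointer -> size in bytes

    def allocate(self, nbytes):
        ptr = self._raw_alloc(nbytes)
        self._live[ptr] = nbytes
        return ptr

    def free(self, ptr):
        if ptr not in self._live:
            raise KeyError(f"pointer {ptr:#x} is not a live allocation")
        del self._live[ptr]
        self._raw_free(ptr)

    def finalize(self):
        """Free everything the caller forgot to release."""
        for ptr in list(self._live):
            self.free(ptr)


# Fake backend: hands out made-up addresses and logs what gets freed.
freed = []
addresses = itertools.count(0x1000, 0x1000)
allocator = TrackingAllocator(lambda nbytes: next(addresses), freed.append)

p0 = allocator.allocate(64)
p1 = allocator.allocate(128)
allocator.free(p0)
allocator.finalize()  # releases p1
print(freed)  # [4096, 8192]
```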
pto_runtime_c_api: Pure C interface
- Enables Python ctypes bindings
- Wraps C++ classes as opaque pointers
- Error codes: 0=success, negative=failure
- All memory management in C++
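On the Python side, the error-code convention (0 for success, negative for failure) is easy to turn into exceptions. The sketch below is not the contents of runtime_bindings.py: `RuntimeCallError`, `check`, `bind_runtime_calls`, and `call` are hypothetical helpers, and treating the runtime handle as `c_void_p` is an assumption based on the "opaque pointers" note above.

```python
import ctypes


class RuntimeCallError(RuntimeError):
    """Raised when a C API call returns a negative error code."""


def check(name, rc):
    """Convert the C API convention (negative means failure) to an exception."""
    if rc < 0:
        raise RuntimeCallError(f"{name} failed with error code {rc}")
    return rc


def bind_runtime_calls(lib):
    """Declare signatures for the two handle-based calls named in this README.

    Assumed signature: int f(void *runtime_handle).
    """
    for name in ("InitRuntime", "FinalizeRuntime"):
        fn = getattr(lib, name)
        fn.argtypes = [ctypes.c_void_p]
        fn.restype = ctypes.c_int
    return lib


def call(lib, name, handle):
    """Invoke a bound function and raise on failure."""
    return check(name, getattr(lib, name)(handle))
```

In real use, `lib` would be the object returned by `ctypes.CDLL` for the compiled host library. Declaring `argtypes` and `restype` up front lets ctypes reject a wrongly typed argument before the call reaches C++.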
src/aicpu/kernel.cpp: Kernel entry points
- Initialization kernel: Sets up handshake protocol
- Main scheduler kernel: Task scheduling loop
- Handshake initialization and management
src/aicpu/execute.cpp: Task scheduler
- Ready task identification
- Task dispatch to cores
- Dependency tracking and updates
- Loop until completion
src/aicore/kernel.cpp: Computation kernels
- Task execution implementations
- Kernel function pointers indexed by func_id
- Memory access and PTO ISA operations
- Handshake buffer polling
Compile and load kernels at runtime without rebuilding:
// In host code
runner.CompileAndLoadKernel(func_id, "path/to/kernel.cpp", core_type);

This compiles the kernel source using ccec, loads the binary to device memory, and registers it for task dispatch.
Full Python API with ctypes:
- No C++ knowledge required
- NumPy integration for arrays
- Easy data transfer between host and device
- Three programs compile independently
- Clear API boundaries
- Develop components in parallel
- Runtime linking via binary loading
In src/runtime/runtime/runtime.h:
#define RUNTIME_MAX_TASKS 1024 // Maximum number of tasks
#define RUNTIME_MAX_ARGS 16 // Maximum arguments per task
#define RUNTIME_MAX_FANOUT 512 // Maximum successors per task

runner.init(
device_id=0, # Device ID (0-15)
num_cores=3, # Number of cores for handshake
aicpu_binary=..., # AICPU .so binary
aicore_binary=..., # AICore .o binary
pto_isa_root="/path/to/pto-isa" # PTO-ISA headers location
)

- Device IDs: 0-15 (the examples typically use device 9)
- Handshake cores: Usually 3 (1c2v configuration: 1 cube unit, 2 vector units)
- Kernel compilation: Requires the ASCEND_HOME_PATH environment variable
- Memory management: MemoryAllocator automatically tracks allocations
- Python requirement: NumPy for efficient array operations
Device logs are written to ~/ascend/log/debug/device-<id>/
Device code logs through these macros:
- DEV_INFO: Informational messages
- DEV_DEBUG: Debug messages
- DEV_WARN: Warnings
- DEV_ERROR: Error messages
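To inspect the most recent device logs, a small helper can list the files in the log directory named above, newest first. The `device_log_files` function and its `root` parameter are hypothetical, and no assumption is made about log file names or their internal format.

```python
from pathlib import Path


def device_log_files(device_id, root=None):
    """Return log files for one device, most recently modified first."""
    if root is None:
        root = Path.home() / "ascend" / "log" / "debug"
    log_dir = Path(root) / f"device-{device_id}"
    if not log_dir.is_dir():
        return []  # nothing has been logged for this device yet
    files = [path for path in log_dir.rglob("*") if path.is_file()]
    return sorted(files, key=lambda path: path.stat().st_mtime, reverse=True)


for path in device_log_files(9)[:5]:
    print(path)
```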
./ci.sh

- src/host/ - Host runtime implementation details
- src/aicpu/ - AICPU scheduler implementation
- src/aicore/ - AICore kernel implementation
- examples/basic/ - Complete working example
- python/ - Python bindings and compiler