A Python/PyTorch implementation of the automatic task parallelization pipeline described in:
S. Das and L. Rauchwerger, "Automatic Task Parallelization of Dataflow Graphs in ML/DL Models," IEEE IPDPS 2024, pp. 728–739.
Deep learning inference at batch size 1 has no data-level parallelism, but many model architectures (GoogLeNet, Inception-v3) contain branching operator graphs that expose task-level parallelism. This project implements the full pipeline from the paper:
- Graph extraction — uses `torch.fx` symbolic tracing to build a weighted operator DAG
- Distance-to-end — computes critical-path costs for every node (Algorithm 1)
- Linear clustering — partitions nodes into sequential chains along critical paths (Algorithm 2)
- Iterative cluster merging — reduces cluster count while preserving parallelism (Algorithms 3–4)
- Parallel execution — generates one Python function per cluster, forks one process per cluster, transfers tensors via pre-allocated shared memory buffers and signal queues (Algorithm 7)
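The distance-to-end and linear-clustering steps can be sketched in a few lines. This is a minimal, torch-free illustration of the ideas behind Algorithms 1–2 on a hypothetical toy DAG — the node names, costs, and branching shape are invented for the example, not taken from the paper or this codebase:

```python
# Toy weighted operator DAG: node -> (cost, successors). The costs and the
# inception-style branching shape are illustrative only.
graph = {
    "conv0": (3, ["b1", "b2"]),
    "b1":    (5, ["cat"]),
    "b2":    (2, ["cat"]),
    "cat":   (1, []),
}

def distance_to_end(graph):
    """Critical-path cost from each node to the graph's end (Algorithm 1 idea)."""
    memo = {}
    def d(n):
        if n not in memo:
            cost, succs = graph[n]
            memo[n] = cost + max((d(s) for s in succs), default=0)
        return memo[n]
    for n in graph:
        d(n)
    return memo

def linear_clusters(graph, dist):
    """Peel off sequential chains along critical paths (Algorithm 2 idea):
    start at the unassigned node with the largest distance-to-end and
    follow its most critical unassigned successor until the chain ends."""
    unassigned = set(graph)
    clusters = []
    while unassigned:
        node = max(unassigned, key=dist.get)
        chain = []
        while node is not None:
            chain.append(node)
            unassigned.discard(node)
            nxt = [s for s in graph[node][1] if s in unassigned]
            node = max(nxt, key=dist.get) if nxt else None
        clusters.append(chain)
    return clusters

dist = distance_to_end(graph)
clusters = linear_clusters(graph, dist)
print(dist)      # conv0 lies on the longest path (through b1)
print(clusters)  # the shorter b2 branch becomes its own cluster
```

Each chain becomes a candidate cluster; the merging passes (Algorithms 3–4) then reduce the cluster count toward the available core count.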
```
pip install torch torchvision
```

Tested on Python 3.12, PyTorch 2.x, Ubuntu 24.04 (WSL2). Requires Linux or WSL — the parallel execution uses `fork`, which is not natively available on Windows.
```
python3 parallel_inference.py
```

This will benchmark all five models and save results to `results.json`.
| Model | Nodes | Clusters (pre-merge) | Clusters (post-merge) |
|---|---|---|---|
| SqueezeNet | 66 | 9 | 2 |
| GoogLeNet | 197 | 28 | 4 |
| ResNet-50 | 175 | 5 | 2 |
| DenseNet-121 | 431 | 1 | 1 |
| Inception-v3 | 314 | 36 | 6 |
Benchmarked on an AMD Ryzen 7 3800X (8 cores, 16 logical processors), 30 runs per model:
| Model | Sequential (ms) | Parallel (ms) | Speedup |
|---|---|---|---|
| SqueezeNet | 37.8 ± 1.9 | 38.8 ± 2.2 | 0.97× |
| GoogLeNet | 70.8 ± 1.9 | 58.7 ± 1.8 | 1.21× |
| ResNet-50 | 125.5 ± 3.4 | 123.7 ± 2.6 | 1.01× |
| DenseNet-121 | 93.6 ± 3.4 | 109.0 ± 18.6 | 0.86× |
| Inception-v3 | 103.0 ± 3.0 | 76.1 ± 1.8 | 1.35× |
Models with substantial parallel branches (GoogLeNet, Inception-v3) achieve genuine speedup. Models with dense connectivity (DenseNet-121) or tiny residual branches (ResNet-50) show near break-even or overhead.
- GIL bypass — uses `multiprocessing.Process` with `fork` rather than threads, which cannot run CPU-bound work in parallel due to Python's GIL
- Shared-memory transfers — inter-cluster tensors are written into pre-allocated `share_memory_()` buffers before forking; only a single integer signal passes through the queue per transfer, avoiding tensor serialization overhead
- Code generation — each cluster is compiled into a standalone Python function (`fparcluster_i`) with hardcoded operator calls and queue synchronization, eliminating runtime graph interpretation overhead
- Persistent workers — processes are spawned once per model and reused across all timed runs, removing process startup cost from measurements
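The fork + shared-buffer + integer-signal pattern can be sketched without torch. In this simplified analogue, `multiprocessing.Array` stands in for a pre-allocated `share_memory_()` tensor, a doubling loop stands in for a cluster's operator chain, and the names (`cluster_worker`, `ready_q`, `done_q`) are illustrative rather than the repo's actual identifiers:

```python
import multiprocessing as mp

def cluster_worker(buf, ready_q, done_q):
    """Persistent worker: loops on integer signals instead of re-forking."""
    while True:
        tag = ready_q.get()          # one int per transfer, no tensor pickling
        if tag < 0:                  # shutdown sentinel
            return
        for i in range(len(buf)):    # stand-in for this cluster's operators
            buf[i] = buf[i] * 2.0
        done_q.put(tag)              # signal: result is in shared memory

ctx = mp.get_context("fork")         # fork keeps pre-allocated buffers mapped
buf = ctx.Array("d", 4)              # shared buffer, analogue of share_memory_()
ready_q, done_q = ctx.Queue(), ctx.Queue()
p = ctx.Process(target=cluster_worker, args=(buf, ready_q, done_q))
p.start()                            # spawned once, reused across all runs

for run in range(2):
    for i in range(4):
        buf[i] = float(i)            # "transfer" input by writing in place
    ready_q.put(run)                 # signal: buffer is ready
    assert done_q.get() == run       # wait for the cluster to finish
result = list(buf)

ready_q.put(-1)                      # shut the worker down
p.join()
print(result)
```

Because the buffers are allocated before the fork, both processes see the same memory pages, so the queues only ever carry small integers — the same reason the real pipeline avoids serializing tensors between clusters.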
S. Das and L. Rauchwerger, "Automatic Task Parallelization of Dataflow Graphs in ML/DL Models," in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 728–739, May 2024.