Parallel Inference via Linear Task Clustering

A Python/PyTorch implementation of the automatic task parallelization pipeline described in:

S. Das and L. Rauchwerger, "Automatic Task Parallelization of Dataflow Graphs in ML/DL Models," IEEE IPDPS 2024, pp. 728–739.

Overview

Deep learning inference at batch size 1 has no data-level parallelism, but many model architectures (GoogLeNet, Inception-v3) contain branching operator graphs that expose task-level parallelism. This project implements the full pipeline from the paper:

  1. Graph extraction — uses torch.fx symbolic tracing to build a weighted operator DAG
  2. Distance-to-end — computes critical-path costs for every node (Algorithm 1)
  3. Linear clustering — partitions nodes into sequential chains along critical paths (Algorithm 2)
  4. Iterative cluster merging — reduces cluster count while preserving parallelism (Algorithms 3–4)
  5. Parallel execution — generates one Python function per cluster, forks one process per cluster, transfers tensors via pre-allocated shared memory buffers and signal queues (Algorithm 7)
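Steps 2–3 can be sketched as a reverse-topological dynamic program followed by a greedy chain walk. This is an illustrative sketch, not the repository's code: the DAG shape, node names, and costs below are made up, and it represents the operator graph as a predecessor map with a per-node cost dictionary.

```python
from graphlib import TopologicalSorter

def _successors(preds, cost):
    """Invert a predecessor map into a successor map."""
    succ = {n: [] for n in cost}
    for n, ps in preds.items():
        for p in ps:
            succ[p].append(n)
    return succ

def distance_to_end(preds, cost):
    """Critical-path cost from each node to the end of the graph:
    dte[n] = cost[n] + max over successors (0 at sinks)."""
    succ = _successors(preds, cost)
    order = list(TopologicalSorter(preds).static_order())  # sources first
    dte = {}
    for n in reversed(order):  # sinks first, so successors are done
        dte[n] = cost[n] + max((dte[s] for s in succ[n]), default=0)
    return dte

def linear_clusters(preds, cost):
    """Greedy linear clustering: peel off sequential chains, always
    following the unassigned successor with the largest distance-to-end,
    so the true critical path becomes the first cluster."""
    dte = distance_to_end(preds, cost)
    succ = _successors(preds, cost)
    assigned, clusters = set(), []
    for start in sorted(cost, key=lambda n: -dte[n]):
        if start in assigned:
            continue
        chain, n = [], start
        while n is not None:
            chain.append(n)
            assigned.add(n)
            nxt = [s for s in succ[n] if s not in assigned]
            n = max(nxt, key=dte.get) if nxt else None
        clusters.append(chain)
    return clusters
```

On a diamond graph a→{b, c}→d with costs {a: 1, b: 5, c: 2, d: 1}, the chain through b is the critical path, so it becomes the first cluster and c is split into its own cluster that can run concurrently with b.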

Requirements

pip install torch torchvision

Tested on Python 3.12, PyTorch 2.x, Ubuntu 24.04 (WSL2). Requires Linux or WSL: the parallel executor relies on the fork start method, which is not natively available on Windows.
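A defensive check along these lines (an illustration, not necessarily what the repository does) lets a script fail fast on platforms where fork is unavailable:

```python
import multiprocessing as mp

def require_fork():
    """Return a fork-based multiprocessing context, or raise early
    on platforms (e.g. native Windows) that do not support fork."""
    if "fork" not in mp.get_all_start_methods():
        raise RuntimeError(
            "This pipeline requires the 'fork' start method; run on Linux or WSL."
        )
    return mp.get_context("fork")
```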

Usage

python3 parallel_inference.py

This will benchmark all five models and save results to results.json.

Models Evaluated

Model        | Nodes | Clusters (pre-merge) | Clusters (post-merge)
------------ | ----- | -------------------- | ---------------------
SqueezeNet   | 66    | 9                    | 2
GoogLeNet    | 197   | 28                   | 4
ResNet-50    | 175   | 5                    | 2
DenseNet-121 | 431   | 1                    | 1
Inception-v3 | 314   | 36                   | 6

Results

Benchmarked on an AMD Ryzen 7 3800X (8 cores, 16 logical processors), 30 runs per model:

Model        | Sequential (ms) | Parallel (ms) | Speedup
------------ | --------------- | ------------- | -------
SqueezeNet   | 37.8 ± 1.9      | 38.8 ± 2.2    | 0.97×
GoogLeNet    | 70.8 ± 1.9      | 58.7 ± 1.8    | 1.21×
ResNet-50    | 125.5 ± 3.4     | 123.7 ± 2.6   | 1.01×
DenseNet-121 | 93.6 ± 3.4      | 109.0 ± 18.6  | 0.86×
Inception-v3 | 103.0 ± 3.0     | 76.1 ± 1.8    | 1.35×

Models with substantial parallel branches (GoogLeNet, Inception-v3) achieve genuine speedup. Models with dense connectivity (DenseNet-121) or tiny residual branches (ResNet-50) break even at best or pay a net overhead.

Implementation Notes

  • GIL bypass — uses multiprocessing.Process with fork rather than threads, since Python's GIL prevents threads from running CPU-bound work in parallel
  • Shared memory transfers — inter-cluster tensors are written into pre-allocated share_memory_() buffers before forking; only a single integer signal passes through the queue per transfer, avoiding tensor serialization overhead
  • Code generation — each cluster is compiled into a standalone Python function (fparcluster_i) with hardcoded operator calls and queue synchronization, eliminating runtime graph interpretation overhead
  • Persistent workers — processes are spawned once per model and reused across all timed runs, removing process startup cost from measurements
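The shared-memory transfer pattern above can be sketched without torch, substituting a stdlib shared array for a share_memory_() tensor. This is a simplified analogue, not the repository's code; buffer size and values are illustrative:

```python
import multiprocessing as mp

def _producer(buf, q):
    # Write results directly into the pre-allocated shared buffer,
    # then send a single integer signal; no tensor bytes cross the queue.
    for i in range(len(buf)):
        buf[i] = i * 1.5
    q.put(1)

def demo():
    ctx = mp.get_context("fork")
    buf = ctx.Array("d", 4, lock=False)  # shared buffer, allocated before forking
    q = ctx.Queue()                      # carries only the ready signal
    p = ctx.Process(target=_producer, args=(buf, q))
    p.start()
    q.get()        # consumer blocks until the producer signals
    out = buf[:]   # read the data in place, no serialization
    p.join()
    return out

if __name__ == "__main__":
    print(demo())  # [0.0, 1.5, 3.0, 4.5]
```

Because the buffer exists before the fork, both processes address the same memory, so the queue round-trip costs a few bytes regardless of tensor size.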

Reference

S. Das and L. Rauchwerger, "Automatic Task Parallelization of Dataflow Graphs in ML/DL Models," in 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp. 728–739, May 2024.
