Parallel Image Processing with Dask


A high-performance parallel image-processing pipeline that uses Dask for shared-memory parallelism (SMP) on multi-core systems. The implementation demonstrates efficient use of CPU cores for batch image-processing tasks with automatic workload distribution.

🎯 Overview

Modern image processing applications require efficient handling of large image datasets within strict time constraints. This project implements automatic parallelization of image processing tasks using Dask, achieving:

  • 1.99× speedup for I/O-bound operations (simple resize)
  • 3.56× speedup for CPU-intensive operations (filters + transformations)
  • Zero failures processing 10,000+ images
  • 72% time reduction for complex processing tasks

Features

  • Automatic Parallelization: Leverages all available CPU cores without manual thread management
  • Dual Processing Modes: Optimized schedulers for both I/O-bound and CPU-bound workloads
  • Batch Processing: Handles thousands of images efficiently
  • Performance Metrics: Built-in benchmarking and comparison tools
  • Validation Tools: Automated verification of processing accuracy
  • Error Handling: Robust error management with detailed reporting
  • Cross-Platform: Works on Windows, Linux, and macOS

📋 Table of Contents

  • Installation
  • Quick Start
  • Usage
  • Project Structure
  • Performance Results
  • Configuration
  • Technical Details
  • Troubleshooting

🔧 Installation

Prerequisites

  • Python 3.7 or higher
  • pip package manager

Setup

  1. Clone the repository

    git clone https://github.com/yourusername/parallel-image-processing.git
    cd parallel-image-processing
  2. Install required dependencies

    pip install -r requirements.txt

    Or install manually:

    pip install dask pillow numpy
  3. Verify installation

    python -c "import dask, PIL, numpy; print('All dependencies installed successfully!')"

Quick Start

Process 10,000 images in 4 simple steps:

```shell
# Step 1: Generate test dataset (10,000 images)
python dummy_image_gen.py

# Step 2: Run I/O-bound processing (basic resize)
python new.py

# Step 3: Run CPU-intensive processing (filters + enhancements)
python cpu_intensive.py

# Step 4: Verify results
python verify.py
```

Expected Output:

```
Total images found: 10000
Sequential Processing Time: 26.79 seconds
Parallel Processing Time: 13.44 seconds
Speedup Achieved: 1.99x faster
✅ All checks passed! Processing was successful.
```

📖 Usage

Basic Usage

Process your own images:

  1. Place your images in a folder (e.g., my_images/)
  2. Update the input folder in the script:
    input_folder = "my_images"
  3. Run the processing script:
    python new.py

Advanced Usage

Customize processing parameters:

```python
# In new.py or cpu_intensive.py

# Change output dimensions
img_resized = img.resize((512, 512), Image.Resampling.LANCZOS)

# Adjust number of workers
compute(*tasks, scheduler='threads', num_workers=16)

# Change output directory
output_folder = "my_output_folder"
```

Processing Modes

Mode 1: I/O-Bound (Fast, Simple)

python new.py
  • Best for: Large batches, simple transformations
  • Operations: Load → Resize → Save
  • Scheduler: Thread-based
  • Speedup: ~2× faster

Mode 2: CPU-Intensive (Slower, High Quality)

python cpu_intensive.py
  • Best for: Quality enhancement, complex filters
  • Operations: Load → Resize → Filters → Matrix Operations → Save
  • Scheduler: Process-based
  • Speedup: ~3.5× faster
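The two modes build the same task graph and differ only in which Dask scheduler executes it. A toy sketch of that distinction, with a plain function standing in for the real image work:

```python
from dask import delayed, compute

def work(x):
    # Stand-in for per-image processing
    return x * x

tasks = [delayed(work)(i) for i in range(8)]

# Mode 1: thread scheduler — low overhead, good when tasks spend time in I/O
io_results = compute(*tasks, scheduler='threads', num_workers=4)

# Mode 2: process scheduler — bypasses the GIL, good for pure-Python CPU work
# (on Windows this call must be made under an `if __name__ == '__main__':` guard)
cpu_results = compute(*tasks, scheduler='processes', num_workers=4)

print(io_results)  # (0, 1, 4, 9, 16, 25, 36, 49)
```

Both calls return the same results; only the execution backend (threads sharing one interpreter vs. separate worker processes) changes.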

πŸ“ Project Structure

```
parallel-image-processing/
├── dummy_image_gen.py          # Test dataset generator
├── new.py                      # I/O-bound parallel processing
├── cpu_intensive.py            # CPU-intensive parallel processing
├── verify.py                   # Result validation script
├── requirements.txt            # Python dependencies
├── README.md                   # This file
│
├── images/                     # Input images (generated)
│   ├── test_0.jpg
│   ├── test_1.jpg
│   └── ...
│
├── processed_seq/              # Sequential processing output
├── processed_par/              # Parallel processing output
├── processed_seq_intensive/    # Sequential CPU-intensive output
└── processed_par_intensive/    # Parallel CPU-intensive output
```

📊 Performance Results

Test Environment

  • CPU: 28 cores
  • Dataset: 10,000 images (300-600px, random colors)
  • Output: 256×256 pixels
  • Platform: Windows with Python 3.13

Benchmark Results

I/O-Bound Processing (Simple Resize)

| Metric     | Sequential | Parallel  | Improvement    |
|------------|------------|-----------|----------------|
| Time       | 26.79 s    | 13.44 s   | 13.35 s saved  |
| Speedup    | 1.0×       | 1.99×     | 99% faster     |
| Throughput | 373 img/s  | 744 img/s | +99%           |
| Efficiency | 100%       | 7.1%      | -              |

CPU-Intensive Processing (Filters + Transformations)

| Metric     | Sequential | Parallel  | Improvement    |
|------------|------------|-----------|----------------|
| Time       | 51.23 s    | 14.38 s   | 36.85 s saved  |
| Speedup    | 1.0×       | 3.56×     | 256% faster    |
| Throughput | 195 img/s  | 695 img/s | +256%          |
| Efficiency | 100%       | 12.7%     | -              |
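Speedup and parallel efficiency in the tables above follow the standard definitions (speedup = sequential time / parallel time; efficiency = speedup / worker count). A quick check of the reported numbers, assuming all 28 cores of the test machine were used as workers:

```python
def speedup(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, workers):
    # Fraction of ideal linear speedup actually achieved
    return speedup(t_seq, t_par) / workers

# I/O-bound run: 26.79 s sequential vs 13.44 s parallel on 28 cores
print(round(speedup(26.79, 13.44), 2))               # 1.99
print(round(efficiency(26.79, 13.44, 28) * 100, 1))  # 7.1

# CPU-intensive run: 51.23 s vs 14.38 s
print(round(speedup(51.23, 14.38), 2))               # 3.56
print(round(efficiency(51.23, 14.38, 28) * 100, 1))  # 12.7
```

The low efficiency figures indicate the workload saturates well before 28 workers; disk I/O, not CPU, is the limiting factor.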

Speedup Comparison

```
I/O-Bound:      ████████████ 1.99×
CPU-Intensive:  █████████████████████ 3.56× (79% better!)
```

βš™οΈ Configuration

Adjust Processing Parameters

Target Image Size:

```python
# Change in process_image() function
img_resized = img.resize((512, 512), Image.Resampling.LANCZOS)
```

Number of Workers:

```python
# Auto-detect (recommended)
num_workers = os.cpu_count()

# Manual setting
num_workers = 16
```

Scheduler Type:

```python
# For I/O-bound tasks
scheduler = 'threads'

# For CPU-bound tasks
scheduler = 'processes'
```

Supported Image Formats

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • BMP (.bmp)
  • GIF (.gif)
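File discovery can be restricted to these formats by filtering on extension. A stdlib-only sketch of such a helper (the function name is illustrative, not from the scripts):

```python
from pathlib import Path

SUPPORTED = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}

def find_images(folder):
    """Return all supported image files in a folder, in sorted order."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED
    )
```

Lower-casing the suffix means `photo.JPG` and `photo.jpg` are both accepted.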

🔬 Technical Details

Architecture

```
Input Layer  →  Task Generation  →  Parallel Execution  →  Output Layer
     ↓                ↓                     ↓                    ↓
File Discovery   Dask Delayed          Scheduler         Result Aggregation
```

Parallelization Strategy

  1. Lazy Task Graph: Create delayed tasks without immediate execution

    tasks = [delayed(process_image)(path, output) for path in images]
  2. Parallel Execution: Execute all tasks concurrently

    results = compute(*tasks, scheduler='threads', num_workers=28)
  3. Automatic Load Balancing: Dask distributes work across available cores
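Putting the three steps together, a minimal end-to-end sketch of the pattern (a toy `process_image` stands in for the real Pillow work so the graph structure stays visible):

```python
import os
from dask import delayed, compute

def process_image(path, output_folder):
    # Toy stand-in: the real function would load, transform, and save the image
    return os.path.join(output_folder, os.path.basename(path))

images = [f"images/test_{i}.jpg" for i in range(100)]

# 1. Lazy task graph: nothing executes yet
tasks = [delayed(process_image)(path, "processed_par") for path in images]

# 2. Parallel execution: Dask walks the graph with a pool of workers
results = compute(*tasks, scheduler='threads', num_workers=os.cpu_count())

# 3. Load balancing is automatic: idle workers pull the next pending task
print(len(results))  # 100
```

Swapping `scheduler='threads'` for `scheduler='processes'` switches this same graph to the CPU-intensive mode without touching the task-building code.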

Scheduler Comparison

| Feature    | Thread Scheduler | Process Scheduler |
|------------|------------------|-------------------|
| Best for   | I/O operations   | CPU computations  |
| Overhead   | Low              | High              |
| GIL impact | Limited by GIL   | Bypasses GIL      |
| Memory     | Shared           | Replicated        |
| Setup      | Simple           | Requires `if __name__ == '__main__'` |

πŸ› Troubleshooting

Common Issues

1. RuntimeError: freeze_support() on Windows

Problem: Process scheduler fails with "freeze_support" error

Solution: Wrap code in if __name__ == '__main__': block

```python
if __name__ == '__main__':
    main()
```

2. ModuleNotFoundError: No module named 'PIL'

Problem: Pillow not installed

Solution:

pip install pillow

3. Low Speedup (<1.5×)

Problem: Task is I/O-bound, disk is bottleneck

Solution:

  • Use faster storage (SSD/NVMe)
  • Reduce number of workers to avoid I/O contention
  • Consider CPU-intensive processing mode

4. High Memory Usage

Problem: Processing large images exhausts RAM

Solution:

  • Process in smaller batches
  • Reduce number of workers
  • Use thread scheduler (shared memory)
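Batching keeps only one chunk of tasks in flight at a time, which bounds peak memory. A sketch of the idea (batch size and names are illustrative, not from the scripts):

```python
from dask import delayed, compute

def process_image(path):
    # Toy stand-in for the real per-image work
    return path.upper()

def process_in_batches(paths, batch_size=500):
    """Build and run the task graph one batch at a time to bound memory use."""
    results = []
    for start in range(0, len(paths), batch_size):
        batch = paths[start:start + batch_size]
        tasks = [delayed(process_image)(p) for p in batch]
        # Thread scheduler shares memory between workers, avoiding per-process copies
        results.extend(compute(*tasks, scheduler='threads'))
    return results

out = process_in_batches([f"img_{i}.jpg" for i in range(7)], batch_size=3)
print(len(out))  # 7
```

Smaller batches trade a little scheduling overhead for a hard cap on how many images are held in memory at once.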

Performance Tips

✅ Use SSD storage for faster I/O
✅ Match workers to cores (don't over-provision)
✅ Choose the correct scheduler for the workload type
✅ Monitor resource usage during execution
✅ Profile bottlenecks before optimizing
