
Parallel K-Means in Go

A parallel implementation of K-Means clustering in Go, exploring different parallelization strategies and their performance characteristics.

This project implements Lloyd's algorithm for K-Means clustering and evaluates three execution models:

  • Sequential baseline
  • BSP-style parallelization
  • Work-stealing parallelization

The goal is to study how algorithm structure, synchronization strategy, and scheduling policies influence performance in parallel systems.


Problem

K-Means clustering partitions a set of points into K clusters by iteratively assigning each point to the nearest centroid and recomputing cluster centers.

Given:

  • N points
  • dimension d
  • K clusters

The algorithm repeatedly performs two steps:

  1. Assign each point to the nearest centroid
  2. Recompute centroids as the average of the assigned points

The assignment step dominates runtime because every point must be compared against every centroid, and each distance computation costs O(d).

Time complexity per iteration:

O(N * K * d)

Because each point assignment is independent, K-Means is highly parallelizable.
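
To make the cost concrete, here is a minimal Go sketch of the assignment kernel (names and types are illustrative, not taken from this repository):

import "math"

// nearestCentroid returns the index of the centroid closest to p
// under squared Euclidean distance. One call costs O(K * d), so
// assigning all N points costs O(N * K * d) per iteration.
func nearestCentroid(p []float64, centroids [][]float64) int {
    best, bestDist := 0, math.MaxFloat64
    for k, c := range centroids {
        dist := 0.0
        for i := range p {
            diff := p[i] - c[i]
            dist += diff * diff
        }
        if dist < bestDist {
            best, bestDist = k, dist
        }
    }
    return best
}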


Algorithm

This project implements Lloyd's algorithm:

  1. Initialize K centroids
  2. Assign each point to its nearest centroid
  3. Update centroid positions
  4. Repeat until convergence

Instead of separating steps 2 and 3, the implementation fuses assignment and accumulation into a single pass.

Instead of:

assign points
then recompute centroids

the implementation does:

assign point
update centroid accumulator

This avoids an additional pass over the dataset.
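
A minimal sketch of the fused pass, reusing the illustrative nearestCentroid above and assuming sums (K x d) and counts (length K) have been zeroed beforehand:

// One fused sweep: assign each point and immediately fold it into
// the running sums and counts, so no second pass over the data is needed.
func assignAndAccumulate(points, centroids, sums [][]float64, counts []int) {
    for _, p := range points {
        k := nearestCentroid(p, centroids)
        for i, v := range p {
            sums[k][i] += v
        }
        counts[k]++
    }
}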


Implementation Overview

Three execution models are implemented.

Sequential Baseline

The sequential implementation directly follows the algorithm described above.

Key design decisions:

  • assignment and centroid update are fused
  • centroid buffers are reused between iterations
  • accumulators are reset instead of reallocated

These decisions simplify the later parallel implementations.
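
For example, the reset can zero the reusable buffers in place instead of reallocating them (again an illustrative sketch):

// resetAccumulators clears the buffers in place; avoiding per-iteration
// allocations keeps garbage-collector pressure out of the hot loop.
func resetAccumulators(sums [][]float64, counts []int) {
    for k := range sums {
        for i := range sums[k] {
            sums[k][i] = 0
        }
        counts[k] = 0
    }
}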


Parallel Design

The dominant computation is the assignment loop over all points.

for each point:
    find closest centroid
    update centroid accumulator

This loop has complexity:

O(N * K * d)

Since each point can be processed independently, it is ideal for parallelization.


BSP Parallel Implementation

The first parallel implementation follows a BSP-style execution model.

The dataset is divided evenly across worker goroutines.

Each worker processes its assigned points and accumulates partial centroid statistics.

Avoiding Synchronization Bottlenecks

Updating global centroids directly would require heavy synchronization.

Instead, each worker maintains thread-local accumulators.

Example structure:

worker accumulator:
    K * dimension sums
    K counts

Workers accumulate results locally.

After all workers finish, the main thread performs a reduction step to combine the accumulators and compute new centroids.

Execution flow:

points
  -> worker threads
      -> local accumulators
          -> reduction
              -> centroid update

This avoids fine-grained synchronization during the main computation.
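
A condensed sketch of one such superstep, reusing the illustrative nearestCentroid from earlier. For brevity it spawns goroutines each iteration, whereas the actual implementation keeps workers alive across iterations (see the synchronization section below):

import "sync"

// accum is one worker's private accumulator: K*d sums plus K counts.
type accum struct {
    sums   [][]float64
    counts []int
}

// bspIteration statically splits the points across workers, lets each
// fill its own accumulator, and returns the partials for reduction.
func bspIteration(points, centroids [][]float64, workers int) []accum {
    kc, d := len(centroids), len(centroids[0])
    locals := make([]accum, workers)
    chunk := (len(points) + workers - 1) / workers
    var wg sync.WaitGroup
    for w := 0; w < workers; w++ {
        locals[w] = accum{sums: make([][]float64, kc), counts: make([]int, kc)}
        for j := range locals[w].sums {
            locals[w].sums[j] = make([]float64, d)
        }
        lo, hi := w*chunk, (w+1)*chunk
        if hi > len(points) {
            hi = len(points)
        }
        if lo > hi {
            lo = hi
        }
        wg.Add(1)
        go func(a *accum, pts [][]float64) {
            defer wg.Done()
            for _, p := range pts {
                k := nearestCentroid(p, centroids)
                for i, v := range p {
                    a.sums[k][i] += v // private write: no contention
                }
                a.counts[k]++
            }
        }(&locals[w], points[lo:hi])
    }
    wg.Wait()
    return locals // the main goroutine reduces these into new centroids
}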


Synchronization Strategy

Two synchronization mechanisms coordinate execution:

WaitGroup

Used to detect when all workers complete their portion of the loop.

Condition Variable

Workers sleep between iterations and are woken when the next iteration begins.

This design ensures:

  • no race conditions
  • no busy waiting
  • clear iteration boundaries
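
One way these two mechanisms could fit together is a generation counter guarded by a sync.Cond. This is a sketch of the idea, not the repository's exact code:

import "sync"

// barrier lets workers sleep between iterations and be woken in a
// batch when the main goroutine starts the next one.
type barrier struct {
    mu  sync.Mutex
    c   *sync.Cond
    gen int // current iteration number
}

func newBarrier() *barrier {
    b := &barrier{}
    b.c = sync.NewCond(&b.mu)
    return b
}

// waitNext blocks (no busy waiting) until the generation moves past
// seen, then returns the new generation.
func (b *barrier) waitNext(seen int) int {
    b.mu.Lock()
    defer b.mu.Unlock()
    for b.gen == seen {
        b.c.Wait()
    }
    return b.gen
}

// release advances the generation and wakes every sleeping worker;
// a sync.WaitGroup (not shown) detects when they all finish.
func (b *barrier) release() {
    b.mu.Lock()
    b.gen++
    b.mu.Unlock()
    b.c.Broadcast()
}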

Work Stealing Implementation

The second parallel implementation introduces work stealing.

Instead of statically dividing work among threads, the assignment loop is split into small tasks.

Each worker owns a task queue.

Workers execute tasks from their own queue first.

If a worker runs out of tasks, it attempts to steal tasks from other workers.

Motivation

In BSP scheduling:

runtime ≈ slowest worker's time

In work-stealing scheduling:

runtime ≈ average worker's time

As thread count increases, load imbalance becomes more likely.

Work stealing reduces this imbalance by redistributing unfinished tasks.


Task Queue Design

The work stealing implementation uses a lightweight queue.

Each queue is implemented as:

a slice of tasks
an atomic left index

Popping a task atomically advances the left index to claim the next task.

Tasks themselves remain unchanged across iterations.

Only the left index is reset between iterations.

This design avoids expensive memory operations.
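
A sketch of such a queue, with an illustrative task type and an atomic left index:

import "sync/atomic"

// task is a contiguous range of point indices to assign.
type task struct{ lo, hi int }

// queue hands out tasks by atomically advancing a left index; the
// tasks slice itself never changes between iterations.
type queue struct {
    tasks []task
    left  atomic.Int64
}

// pop claims the next task; the bool is false once the queue is drained.
// A thief calls pop on another worker's queue in exactly the same way.
func (q *queue) pop() (task, bool) {
    i := q.left.Add(1) - 1 // atomically claim slot i
    if int(i) >= len(q.tasks) {
        return task{}, false
    }
    return q.tasks[i], true
}

// reset rewinds the index so the same tasks are replayed next
// iteration with no reallocation.
func (q *queue) reset() { q.left.Store(0) }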


Execution Flow

Each worker repeatedly performs:

pop task from own queue
if empty:
    attempt stealing
if all queues empty:
    sleep until next iteration

Workers only use atomic operations when accessing queues.

All other operations are local.
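
Putting the pieces together, a worker's loop might look like the following sketch, where victim selection is simplified to a linear scan and process stands in for the fused assign-and-accumulate step:

// runWorker drains its own queue first, then scans the other queues
// for leftovers; only pop uses atomics, everything else is local.
func runWorker(id int, queues []*queue, process func(task)) {
    for {
        t, ok := queues[id].pop() // own queue first
        for v := 0; !ok && v < len(queues); v++ {
            if v != id {
                t, ok = queues[v].pop() // attempt to steal
            }
        }
        if !ok {
            return // every queue is empty: wait for the next iteration
        }
        process(t)
    }
}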


Performance Evaluation

Experiments were run on randomly generated datasets.

Configuration:

  • 200000 points
  • dimension = 5
  • K clusters
  • fixed random seed
  • work stealing task size = 1000 points

Three execution modes were tested:

seq     sequential baseline
bsp     static parallel partition
steal   work stealing scheduler

Sequential runtime is used as the speedup baseline.


Results

Speedup results show strong scalability for both parallel implementations.

[Speedup graph: speedup vs. thread count]

Observations:

  • both parallel methods achieve substantial speedup
  • performance scales nearly linearly with thread count
  • work stealing slightly outperforms BSP at higher thread counts

Analysis

Both implementations share the same core computation, so their performance curves are similar.

However, their scheduling strategies differ.

BSP Behavior

BSP runtime depends on the slowest worker.

Even small load imbalance can delay the entire iteration.

Work Stealing Behavior

Work stealing distributes tasks dynamically.

Idle workers can assist overloaded workers.

As thread count increases, the difference between average and worst-case worker time grows, making work stealing more effective.


Remaining Bottlenecks

Most computation is parallelized.

The remaining sequential sections include:

  • accumulator reduction
  • centroid update
  • accumulator reset

These operations scale with:

O(K * d * threads)

Since K is small relative to N, their cost is minor.
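
For reference, the sequential tail of an iteration might look like this sketch, continuing the illustrative accum type from the BSP example:

// reduceAndUpdate folds all partial accumulators into the first one,
// then rewrites the centroids in place: O(K * d * threads) work,
// independent of N.
func reduceAndUpdate(locals []accum, centroids [][]float64) {
    total := locals[0]
    for _, a := range locals[1:] {
        for k := range total.sums {
            for i := range total.sums[k] {
                total.sums[k][i] += a.sums[k][i]
            }
            total.counts[k] += a.counts[k]
        }
    }
    for k, c := range centroids {
        if total.counts[k] == 0 {
            continue // empty cluster: keep its old centroid
        }
        for i := range c {
            c[i] = total.sums[k][i] / float64(total.counts[k])
        }
    }
}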


Memory Bandwidth Considerations

K-Means has relatively low arithmetic intensity.

Each point assignment requires:

  • reading coordinates
  • computing distances
  • updating accumulators

As thread count increases, memory bandwidth may become the limiting factor.

This may eventually cap achievable speedup.


Running the Program

Run using Go:

go run . <mode> <threads>

Example:

go run . seq 1
go run . bsp 8
go run . steal 8

Modes:

seq
bsp
steal

Summary

This project explores how different parallel scheduling strategies influence the performance of a simple machine learning algorithm.

Key ideas demonstrated:

  • fused assignment and centroid update
  • thread-local accumulators to reduce synchronization
  • BSP-style parallelization
  • work stealing for dynamic load balancing
  • performance analysis of parallel algorithms

The results show that even for embarrassingly parallel problems, scheduling policies can influence scalability as the number of workers increases.
