A parallel implementation of K-Means clustering in Go, exploring different parallelization strategies and their performance characteristics.
This project implements Lloyd's algorithm for K-Means clustering and evaluates three execution models:
- Sequential baseline
- BSP-style parallelization
- Work-stealing parallelization
The goal is to study how algorithm structure, synchronization strategy, and scheduling policies influence performance in parallel systems.
K-Means clustering partitions a set of points into K clusters by iteratively assigning each point to the nearest centroid and recomputing cluster centers.
Given:
- N points
- dimension d
- K clusters
The algorithm repeatedly performs two steps:
- Assign each point to the nearest centroid
- Recompute centroids as the average of the assigned points
The assignment step dominates runtime because every point must be compared against every centroid.
Time complexity per iteration:
O(N * K * d)
Because each point assignment is independent, K-Means is highly parallelizable.
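For a concrete picture of the per-point cost, the nearest-centroid step might look like the sketch below. This is an illustration, not the project's actual code; squared distances suffice because the square root does not change the ordering.

```go
import "math"

// nearestCentroid returns the index of the centroid closest to point p,
// comparing squared Euclidean distances (the square root is monotonic,
// so it is never needed for the comparison).
func nearestCentroid(p []float64, centroids [][]float64) int {
	best, bestDist := 0, math.MaxFloat64
	for k, c := range centroids {
		var dist float64
		for j := range p {
			diff := p[j] - c[j]
			dist += diff * diff
		}
		if dist < bestDist {
			best, bestDist = k, dist
		}
	}
	return best
}
```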
This project implements Lloyd's algorithm:
- Initialize K centroids
- Assign each point to its nearest centroid
- Update centroid positions
- Repeat until convergence
Instead of separating steps 2 and 3, the implementation fuses assignment and accumulation into a single pass.
Instead of:
    assign points
    then recompute centroids
the implementation does:
    assign point
    update centroid accumulator
This avoids an additional pass over the dataset.
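In code, the fused pass looks roughly like the following sketch. The names (points, centroids, assignments, sums, counts, nearestCentroid) are illustrative rather than the project's actual identifiers, and sums/counts are assumed to be zeroed before the loop.

```go
// One fused iteration: assignment and accumulation in a single pass.
for i, p := range points {
	c := nearestCentroid(p, centroids)
	assignments[i] = c
	for j, v := range p {
		sums[c][j] += v // accumulate immediately instead of in a second pass
	}
	counts[c]++
}

// New centroids follow directly from the accumulators.
for c := range centroids {
	if counts[c] == 0 {
		continue // keep the old centroid if the cluster is empty
	}
	for j := range centroids[c] {
		centroids[c][j] = sums[c][j] / float64(counts[c])
	}
}
```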
Three execution models are implemented.
The sequential implementation directly follows the algorithm described above.
Key design decisions:
- assignment and centroid update are fused
- centroid buffers are reused between iterations
- accumulators are reset instead of reallocated
These decisions simplify the later parallel implementations.
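A minimal sketch of the reuse pattern, assuming accumulators named sums and counts: they are allocated once up front and zeroed in place at the start of every iteration.

```go
// Allocated once, before the first iteration (k clusters, dim coordinates).
sums := make([][]float64, k)
for c := range sums {
	sums[c] = make([]float64, dim)
}
counts := make([]int, k)

// At the start of every iteration: zero in place instead of reallocating.
for c := range sums {
	for j := range sums[c] {
		sums[c][j] = 0
	}
}
for c := range counts {
	counts[c] = 0
}
```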
The dominant computation is the assignment loop over all points.
for each point:
    find closest centroid
    update centroid accumulator
This loop has complexity:
O(N * K * d)
Since each point can be processed independently, it is ideal for parallelization.
The first parallel implementation follows a BSP-style execution model.
The dataset is divided evenly across worker goroutines.
Each worker processes its assigned points and accumulates partial centroid statistics.
Updating global centroids directly would require heavy synchronization.
Instead, each worker maintains thread-local accumulators.
Example structure:
worker accumulator:
    K x d coordinate sums
    K point counts
Workers accumulate results locally.
After all workers finish, the main thread performs a reduction step to combine the accumulators and compute new centroids.
Execution flow:
points
-> worker threads
-> local accumulators
-> reduction
-> centroid update
This avoids fine-grained synchronization during the main computation.
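A rough sketch of one such iteration, reusing the hypothetical nearestCentroid helper from above and a sync.WaitGroup as the completion barrier; the project's actual worker structure may differ.

```go
import "sync"

// One BSP-style iteration: each worker accumulates into private buffers,
// then the coordinating goroutine reduces them into the new centroids.
func bspIteration(points, centroids [][]float64, workers int) [][]float64 {
	k, dim := len(centroids), len(centroids[0])
	sums := make([][][]float64, workers) // sums[w][c][j]
	counts := make([][]int, workers)     // counts[w][c]

	chunk := (len(points) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		sums[w] = make([][]float64, k)
		for c := range sums[w] {
			sums[w][c] = make([]float64, dim)
		}
		counts[w] = make([]int, k)

		lo := w * chunk
		if lo > len(points) {
			lo = len(points)
		}
		hi := lo + chunk
		if hi > len(points) {
			hi = len(points)
		}
		wg.Add(1)
		go func(w int, part [][]float64) {
			defer wg.Done()
			for _, p := range part {
				c := nearestCentroid(p, centroids)
				for j, v := range p {
					sums[w][c][j] += v
				}
				counts[w][c]++
			}
		}(w, points[lo:hi])
	}
	wg.Wait() // barrier: every worker has finished its partition

	// Sequential reduction: merge per-worker accumulators into new centroids.
	next := make([][]float64, k)
	for c := 0; c < k; c++ {
		next[c] = make([]float64, dim)
		total := 0
		for w := 0; w < workers; w++ {
			for j := 0; j < dim; j++ {
				next[c][j] += sums[w][c][j]
			}
			total += counts[w][c]
		}
		if total == 0 {
			copy(next[c], centroids[c]) // keep the old centroid for empty clusters
		} else {
			for j := 0; j < dim; j++ {
				next[c][j] /= float64(total)
			}
		}
	}
	return next
}
```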
Two synchronization mechanisms coordinate execution:
- a completion barrier, used to detect when all workers have finished their portion of the loop
- an iteration signal: workers sleep between iterations and are woken when the next iteration begins (sketched below)
This design ensures:
- no race conditions
- no busy waiting
- clear iteration boundaries
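One plausible way to realize these two mechanisms in Go is a sync.WaitGroup for completion detection paired with a sync.Cond for waking workers. The sketch below is an assumption about the approach, not the project's code.

```go
import "sync"

// barrier pairs a completion signal with an iteration wake-up, so workers
// can sleep between iterations instead of spinning.
type barrier struct {
	mu        sync.Mutex
	cond      *sync.Cond
	iteration int
	done      sync.WaitGroup
}

func newBarrier(workers int) *barrier {
	b := &barrier{}
	b.cond = sync.NewCond(&b.mu)
	b.done.Add(workers)
	return b
}

// workerWait is called by a worker after finishing its portion of an
// iteration; it blocks (without busy waiting) until the next one starts.
func (b *barrier) workerWait(seen int) int {
	b.done.Done()
	b.mu.Lock()
	for b.iteration == seen {
		b.cond.Wait()
	}
	next := b.iteration
	b.mu.Unlock()
	return next
}

// nextIteration is called by the coordinator: wait for all workers,
// then re-arm the barrier and wake everyone for the next iteration.
func (b *barrier) nextIteration(workers int) {
	b.done.Wait() // all workers completed their portion
	// ... reduction and centroid update would run here ...
	b.done.Add(workers) // re-arm before releasing workers
	b.mu.Lock()
	b.iteration++
	b.mu.Unlock()
	b.cond.Broadcast() // wake sleeping workers
}
```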
The second parallel implementation introduces work stealing.
Instead of statically dividing work among threads, the assignment loop is split into small tasks.
Each worker owns a task queue.
Workers execute tasks from their own queue first.
If a worker runs out of tasks, it attempts to steal tasks from other workers.
In BSP scheduling:
runtime ≈ slowest worker
In work-stealing scheduling:
runtime ≈ average worker time
As thread count increases, load imbalance becomes more likely.
Work stealing reduces this imbalance by redistributing unfinished tasks.
The work-stealing implementation uses a lightweight queue.
Each queue is implemented as:
- a slice of tasks
- a left pointer into that slice
Popping a task increments the pointer atomically.
Tasks themselves remain unchanged across iterations.
Only the pointer is reset between iterations.
This design avoids expensive memory operations.
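A sketch of such a queue, assuming tasks are simply half-open index ranges over the point array; the names and layout are illustrative, not taken from the project's source.

```go
import "sync/atomic"

// task is a half-open range of point indices to process.
type task struct {
	start, end int
}

// taskQueue is a fixed slice of tasks plus an atomically advanced cursor.
// The tasks themselves never change between iterations; only the cursor
// is reset, which keeps per-iteration setup cheap.
type taskQueue struct {
	tasks []task
	next  atomic.Int64
}

// pop claims the next unclaimed task; ok is false once the queue is drained.
// The same method works for the owner and for thieves, so stealing is just
// popping from someone else's queue.
func (q *taskQueue) pop() (t task, ok bool) {
	i := q.next.Add(1) - 1 // atomically claim an index
	if int(i) >= len(q.tasks) {
		return task{}, false
	}
	return q.tasks[i], true
}

// reset re-arms the queue for the next iteration without reallocating.
func (q *taskQueue) reset() {
	q.next.Store(0)
}
```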
Each worker repeatedly performs:
    pop task from own queue
    if empty:
        attempt stealing
    if all queues empty:
        sleep until next iteration
Workers only use atomic operations when accessing queues.
All other operations are local.
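Putting the pieces together, a worker's main loop might look like the sketch below; process stands in for the fused assign-and-accumulate step, and the final return corresponds to sleeping until the next iteration begins.

```go
// runWorker drains its own queue first, then tries to steal from the others.
// It returns once every queue is empty.
func runWorker(self int, queues []*taskQueue, process func(task)) {
	for {
		// Fast path: a task from the worker's own queue.
		if t, ok := queues[self].pop(); ok {
			process(t)
			continue
		}
		// Own queue is empty: attempt to steal from another worker.
		stole := false
		for i := range queues {
			if i == self {
				continue
			}
			if t, ok := queues[i].pop(); ok {
				process(t)
				stole = true
				break
			}
		}
		if !stole {
			return // all queues empty; sleep until the next iteration
		}
	}
}
```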
Experiments were run on randomly generated datasets.
Configuration:
- 200000 points
- dimension = 5
- K clusters
- fixed random seed
- work-stealing task size = 1000 points
Three execution modes were tested:
- seq: sequential baseline
- bsp: static parallel partition
- steal: work-stealing scheduler
Sequential runtime is used as the speedup baseline.
Speedup results show strong scalability for both parallel implementations.
Observations:
- both parallel methods achieve substantial speedup
- performance scales nearly linearly with thread count
- work stealing slightly outperforms BSP at higher thread counts
Both implementations share the same core computation, so their performance curves are similar.
However, their scheduling strategies differ.
BSP runtime depends on the slowest worker.
Even small load imbalance can delay the entire iteration.
Work stealing distributes tasks dynamically.
Idle workers can assist overloaded workers.
As thread count increases, the difference between average and worst-case worker time grows, making work stealing more effective.
Most computation is parallelized.
The remaining sequential sections include:
- accumulator reduction
- centroid update
- accumulator reset
These operations scale as:
O(K * d * threads)
Since K is small relative to N, their cost is minor.
K-Means has relatively low arithmetic intensity.
Each point assignment requires:
- reading coordinates
- computing distances
- updating accumulators
As thread count increases, memory bandwidth may become the limiting factor.
This may eventually cap achievable speedup.
Run using Go:
    go run . <mode> <threads>
Example:
    go run . seq 1
    go run . bsp 8
    go run . steal 8
Modes:
- seq
- bsp
- steal
This project explores how different parallel scheduling strategies influence the performance of a simple machine learning algorithm.
Key ideas demonstrated:
- fused assignment and centroid update
- thread-local accumulators to reduce synchronization
- BSP-style parallelization
- work stealing for dynamic load balancing
- performance analysis of parallel algorithms
The results show that even for embarrassingly parallel problems, scheduling policies can influence scalability as the number of workers increases.
