A parallel implementation of K-Means clustering in Go, exploring different parallelization strategies and their performance characteristics.
This project implements Lloyd's algorithm for K-Means clustering and evaluates three execution models:
- Sequential baseline
- BSP-style parallelization
- Work-stealing parallelization
The goal is to study how algorithm structure, synchronization strategy, and scheduling policies influence performance in parallel systems.
K-Means clustering partitions a set of points into K clusters by iteratively assigning each point to the nearest centroid and recomputing cluster centers.
Given:
- N points
- dimension d
- K clusters
The algorithm repeatedly performs two steps:
- Assign each point to the nearest centroid
- Recompute centroids as the average of the assigned points
The assignment step dominates runtime because every point must be compared against every centroid.
Time complexity per iteration:
O(N * K * d)
Because each point assignment is independent, K-Means is highly parallelizable.
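For a concrete picture of the per-point cost, the nearest-centroid step might look like the sketch below. This is an illustration, not the project's actual code; squared distances suffice because the square root does not change the ordering.

```go
import "math"

// nearestCentroid returns the index of the centroid closest to point p,
// comparing squared Euclidean distances (the square root is monotonic,
// so it is never needed for the comparison).
func nearestCentroid(p []float64, centroids [][]float64) int {
	best, bestDist := 0, math.MaxFloat64
	for k, c := range centroids {
		var dist float64
		for j := range p {
			diff := p[j] - c[j]
			dist += diff * diff
		}
		if dist < bestDist {
			best, bestDist = k, dist
		}
	}
	return best
}
```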
This project implements Lloyd's algorithm:
- Initialize K centroids
- Assign each point to its nearest centroid
- Update centroid positions
- Repeat until convergence
Instead of separating steps 2 and 3, the implementation fuses assignment and accumulation into a single pass.
Instead of:
    assign points
    then recompute centroids
the implementation does:
    assign point
    update centroid accumulator
This avoids an additional pass over the dataset.
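In code, the fused pass looks roughly like the following sketch. The names (points, centroids, assignments, sums, counts, nearestCentroid) are illustrative rather than the project's actual identifiers, and sums/counts are assumed to be zeroed before the loop.

```go
// One fused iteration: assignment and accumulation in a single pass.
for i, p := range points {
	c := nearestCentroid(p, centroids)
	assignments[i] = c
	for j, v := range p {
		sums[c][j] += v // accumulate immediately instead of in a second pass
	}
	counts[c]++
}

// New centroids follow directly from the accumulators.
for c := range centroids {
	if counts[c] == 0 {
		continue // keep the old centroid if the cluster is empty
	}
	for j := range centroids[c] {
		centroids[c][j] = sums[c][j] / float64(counts[c])
	}
}
```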
Three execution models are implemented.
The sequential implementation directly follows the algorithm described above.
Key design decisions:
- assignment and centroid update are fused
- centroid buffers are reused between iterations
- accumulators are reset instead of reallocated
These decisions simplify the later parallel implementations.
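A minimal sketch of the reuse pattern, assuming accumulators named sums and counts: they are allocated once up front and zeroed in place at the start of every iteration.

```go
// Allocated once, before the first iteration (k clusters, dim coordinates).
sums := make([][]float64, k)
for c := range sums {
	sums[c] = make([]float64, dim)
}
counts := make([]int, k)

// At the start of every iteration: zero in place instead of reallocating.
for c := range sums {
	for j := range sums[c] {
		sums[c][j] = 0
	}
}
for c := range counts {
	counts[c] = 0
}
```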
The dominant computation is the assignment loop over all points.
for each point:
    find closest centroid
    update centroid accumulator
This loop has complexity:
O(N * K * d)
Since each point can be processed independently, it is ideal for parallelization.
The first parallel implementation follows a BSP-style execution model.
The dataset is divided evenly across worker goroutines.
Each worker processes its assigned points and accumulates partial centroid statistics.
Updating global centroids directly would require heavy synchronization.
Instead, each worker maintains thread-local accumulators.
Example structure:
worker accumulator:
    K x d coordinate sums
    K point counts
Workers accumulate results locally.
After all workers finish, the main thread performs a reduction step to combine the accumulators and compute new centroids.
Execution flow:
points
-> worker threads
-> local accumulators
-> reduction
-> centroid update
This avoids fine-grained synchronization during the main computation.
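A rough sketch of one such iteration, reusing the hypothetical nearestCentroid helper from above and a sync.WaitGroup as the completion barrier; the project's actual worker structure may differ.

```go
import "sync"

// One BSP-style iteration: each worker accumulates into private buffers,
// then the coordinating goroutine reduces them into the new centroids.
func bspIteration(points, centroids [][]float64, workers int) [][]float64 {
	k, dim := len(centroids), len(centroids[0])
	sums := make([][][]float64, workers) // sums[w][c][j]
	counts := make([][]int, workers)     // counts[w][c]

	chunk := (len(points) + workers - 1) / workers
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		sums[w] = make([][]float64, k)
		for c := range sums[w] {
			sums[w][c] = make([]float64, dim)
		}
		counts[w] = make([]int, k)

		lo := w * chunk
		if lo > len(points) {
			lo = len(points)
		}
		hi := lo + chunk
		if hi > len(points) {
			hi = len(points)
		}
		wg.Add(1)
		go func(w int, part [][]float64) {
			defer wg.Done()
			for _, p := range part {
				c := nearestCentroid(p, centroids)
				for j, v := range p {
					sums[w][c][j] += v
				}
				counts[w][c]++
			}
		}(w, points[lo:hi])
	}
	wg.Wait() // barrier: every worker has finished its partition

	// Sequential reduction: merge per-worker accumulators into new centroids.
	next := make([][]float64, k)
	for c := 0; c < k; c++ {
		next[c] = make([]float64, dim)
		total := 0
		for w := 0; w < workers; w++ {
			for j := 0; j < dim; j++ {
				next[c][j] += sums[w][c][j]
			}
			total += counts[w][c]
		}
		if total == 0 {
			copy(next[c], centroids[c]) // keep the old centroid for empty clusters
		} else {
			for j := 0; j < dim; j++ {
				next[c][j] /= float64(total)
			}
		}
	}
	return next
}
```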
Two synchronization mechanisms coordinate execution:
- a completion barrier, used to detect when all workers have finished their portion of the loop
- an iteration signal: workers sleep between iterations and are woken when the next iteration begins (sketched below)
This design ensures:
- no race conditions
- no busy waiting
- clear iteration boundaries
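One plausible way to realize these two mechanisms in Go is a sync.WaitGroup for completion detection paired with a sync.Cond for waking workers. The sketch below is an assumption about the approach, not the project's code.

```go
import "sync"

// barrier pairs a completion signal with an iteration wake-up, so workers
// can sleep between iterations instead of spinning.
type barrier struct {
	mu        sync.Mutex
	cond      *sync.Cond
	iteration int
	done      sync.WaitGroup
}

func newBarrier(workers int) *barrier {
	b := &barrier{}
	b.cond = sync.NewCond(&b.mu)
	b.done.Add(workers)
	return b
}

// workerWait is called by a worker after finishing its portion of an
// iteration; it blocks (without busy waiting) until the next one starts.
func (b *barrier) workerWait(seen int) int {
	b.done.Done()
	b.mu.Lock()
	for b.iteration == seen {
		b.cond.Wait()
	}
	next := b.iteration
	b.mu.Unlock()
	return next
}

// nextIteration is called by the coordinator: wait for all workers,
// then re-arm the barrier and wake everyone for the next iteration.
func (b *barrier) nextIteration(workers int) {
	b.done.Wait() // all workers completed their portion
	// ... reduction and centroid update would run here ...
	b.done.Add(workers) // re-arm before releasing workers
	b.mu.Lock()
	b.iteration++
	b.mu.Unlock()
	b.cond.Broadcast() // wake sleeping workers
}
```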
The second parallel implementation introduces work stealing.
Instead of statically dividing work among threads, the assignment loop is split into small tasks.
Each worker owns a task queue.
Workers execute tasks from their own queue first.
If a worker runs out of tasks, it attempts to steal tasks from other workers.
In BSP scheduling:
runtime ≈ slowest worker
In work-stealing scheduling:
runtime ≈ average worker time
As thread count increases, load imbalance becomes more likely.
Work stealing reduces this imbalance by redistributing unfinished tasks.
The work-stealing implementation uses a lightweight queue.
Each queue is implemented as:
- a slice of tasks
- a left pointer into that slice
Popping a task increments the pointer atomically.
Tasks themselves remain unchanged across iterations.
Only the pointer is reset between iterations.
This design avoids expensive memory operations.
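A sketch of such a queue, assuming tasks are simply half-open index ranges over the point array; the names and layout are illustrative, not taken from the project's source.

```go
import "sync/atomic"

// task is a half-open range of point indices to process.
type task struct {
	start, end int
}

// taskQueue is a fixed slice of tasks plus an atomically advanced cursor.
// The tasks themselves never change between iterations; only the cursor
// is reset, which keeps per-iteration setup cheap.
type taskQueue struct {
	tasks []task
	next  atomic.Int64
}

// pop claims the next unclaimed task; ok is false once the queue is drained.
// The same method works for the owner and for thieves, so stealing is just
// popping from someone else's queue.
func (q *taskQueue) pop() (t task, ok bool) {
	i := q.next.Add(1) - 1 // atomically claim an index
	if int(i) >= len(q.tasks) {
		return task{}, false
	}
	return q.tasks[i], true
}

// reset re-arms the queue for the next iteration without reallocating.
func (q *taskQueue) reset() {
	q.next.Store(0)
}
```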
Each worker repeatedly performs:
    pop task from own queue
    if empty:
        attempt stealing
    if all queues empty:
        sleep until next iteration
Workers only use atomic operations when accessing queues.
All other operations are local.
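Putting the pieces together, a worker's main loop might look like the sketch below; process stands in for the fused assign-and-accumulate step, and the final return corresponds to sleeping until the next iteration begins.

```go
// runWorker drains its own queue first, then tries to steal from the others.
// It returns once every queue is empty.
func runWorker(self int, queues []*taskQueue, process func(task)) {
	for {
		// Fast path: a task from the worker's own queue.
		if t, ok := queues[self].pop(); ok {
			process(t)
			continue
		}
		// Own queue is empty: attempt to steal from another worker.
		stole := false
		for i := range queues {
			if i == self {
				continue
			}
			if t, ok := queues[i].pop(); ok {
				process(t)
				stole = true
				break
			}
		}
		if !stole {
			return // all queues empty; sleep until the next iteration
		}
	}
}
```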
Experiments were run on randomly generated datasets.
Configuration:
- 200000 points
- dimension = 5
- K clusters
- fixed random seed
- work-stealing task size = 1000 points
Three execution modes were tested:
- seq: sequential baseline
- bsp: static parallel partition
- steal: work-stealing scheduler
Sequential runtime is used as the speedup baseline.
Speedup results show strong scalability for both parallel implementations.
Observations:
- both parallel methods achieve substantial speedup
- performance scales nearly linearly with thread count
- work stealing slightly outperforms BSP at higher thread counts
Both implementations share the same core computation, so their performance curves are similar.
However, their scheduling strategies differ.
BSP runtime depends on the slowest worker.
Even small load imbalance can delay the entire iteration.
Work stealing distributes tasks dynamically.
Idle workers can assist overloaded workers.
As thread count increases, the difference between average and worst-case worker time grows, making work stealing more effective.
Most computation is parallelized.
The remaining sequential sections include:
- accumulator reduction
- centroid update
- accumulator reset
These operations scale as:
O(K * d * threads)
Since K is small relative to N, their cost is minor.
K-Means has relatively low arithmetic intensity.
Each point assignment requires:
- reading coordinates
- computing distances
- updating accumulators
As thread count increases, memory bandwidth may become the limiting factor.
This may eventually cap achievable speedup.
Run using Go:
    go run . <mode> <threads>
Example:
    go run . seq 1
    go run . bsp 8
    go run . steal 8
Modes:
- seq
- bsp
- steal
This project explores how different parallel scheduling strategies influence the performance of a simple machine learning algorithm.
Key ideas demonstrated:
- fused assignment and centroid update
- thread-local accumulators to reduce synchronization
- BSP-style parallelization
- work stealing for dynamic load balancing
- performance analysis of parallel algorithms
The results show that even for embarrassingly parallel problems, scheduling policies can influence scalability as the number of workers increases.
