K-Means Clustering from Scratch in 5 lines of code (Python)

In this short post, I will walk you through implementing the K-means clustering algorithm from scratch in the most efficient way.

The post assumes that you already know the theory of the algorithm.

Let’s start by creating some short dataset to use:

import numpy as np
import random
import matplotlib.pyplot as plt
import pandas as pd
group1 = [(random.randint(0, 100), random.randint(0, 100)) for i in range(50)]
group2 = [(random.randint(150, 250), random.randint(250, 350)) for i in range(50)]

I created two distinct groups so it’s clear for us to see the clusters. Let’s plot them and see:

x_data = np.concatenate([np.array(group1)[:, 0], np.array(group2)[:, 0]])
y_data = np.concatenate([np.array(group1)[:,1], np.array(group2)[:, 1]])
all_data = np.concatenate((x_data.reshape(-1, 1), y_data.reshape(-1, 1)), axis=1)
plt.scatter(x_data, y_data)

Cool. We have two distinct groups that should be easily clustered.

Now, let’s start with the first step in the K-Means algorithm: initialize the clusters randomly. From the plot, we know we need two clusters. So let’s pick two random coordinates for them:

k = 2
centroids= [(random.randint(min(x_data), max(x_data)), random.randint(min(y_data), max(y_data))) for i in range(k)]

Let’s take a look at them:

plt.scatter(x_data, y_data)
plt.scatter(np.array(centroids)[:, 0], np.array(centroids)[:, 1], c= [i for i in range(len(centroids))], s=200)

Looks good. We have the two groups and the two clusters that were initialized randomly.

Now, the second step in the algorithm is to calculate the distance between each point in the dataset and each of the clusters. Let’s do that using NumPy:

distances = np.concatenate([np.linalg.norm(all_data - centroid, axis=1).reshape(-1, 1) for centroid in centroids], 

Notice that in the code above I wanted to generalize the code to any number of centroids. That’s why I applied the function np.linalg.norm over the entire dataset and all the available centroids and then I just concatenated all these side by side. That will provide us with K values for each point in the dataset where the ith value represents the distance between that point and the ith cluster. Now, all we need to do is just to take the minimum of each of these values and assign the point to the label that corresponds to the index of the value (which corresponds to the centroid). Let’s do that:

labels = np.argmin(distances, axis=1)

Now, we already assigned the labels of each point in the dataset to the nearest cluster to it. Let’s take a look at them:

Looking good. The next step is to change the coordinate of the cluster to be the mean of the points that have their labels. This step can be implemented in various different ways. However, I’m determined not to use for loops and be as efficient as possible. So I will do it using Pandas. I will create a panda data frame using the data and the labels. Then, I will just group all values by their labels and get their mean. That should do the trick in 1 line of code and it’s super-efficient:

pd.DataFrame(np.concatenate([all_data, labels.reshape(-1, 1)], axis=1)).groupby(2).mean().values

Note that the above line of code provides the mean of all the data points that have the same labels. Since we have K labels, then it should give K coordinates. These are the new coordinates for the K centroids:

centroids =pd.DataFrame(np.concatenate([all_data, labels.reshape(-1, 1)], axis=1)).groupby(2).mean().values

That was the final step. All we need to do is just repeat that step a few times until convergence happens:

for i in range(100):distances = np.concatenate([np.linalg.norm(all_data - centroid, axis=1).reshape(-1, 1) for centroid in centroids], 
labels = np.argmin(distances, axis=1)
centroids =pd.DataFrame(np.concatenate([all_data, labels.reshape(-1, 1)], axis=1)).groupby(2).mean().values

and here are the results:

Now, as promised let’s put all of that code together to get the algorithm in 5 lines of code:

Let’s now run the algorithm on three groups instead of two:

centroids, labels = scratch_k_means(3, all_data)

Let’s plot the results:

plt.scatter(all_data[:, 0], all_data[:, 1], c=labels)

Looking great. Let’s try to run K-Means from sklearn on the same dataset and compare the two results:

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, random_state=0).fit(all_data)

Let’s print the coordinates of the centroids of both:

The coordinates of the centroids from the two algorithms are identical as expected.

Hope this was helpful. Find the 5-lines Kmeans code here

Data Science and Modeling and Simulation Enthusiast| Computational Sciences and Physics Student | gaber@minerva.kgi.edu

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store