import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
X_gauss = np.loadtxt("2D_gauss_clusters.txt", delimiter=",", skiprows=1)
X_gauss
Let's now use matplotlib to explore our 2D data. Since we will likely need a scatter plot multiple times, we can define a simple function to handle it.
def plot_2d_scatter(X):
    """Display a 2D scatter plot
    :param X: input data points, array
    :return: fig, ax, objects
    """
    fig, ax = plt.subplots(figsize=(6, 6), dpi=90)
    ax.scatter(X[:,0], X[:,1])
    return fig, ax # use them for further modifications

_, _ = plot_2d_scatter(X_gauss) # the two underscores let you discard the returned values
Gaussian, globular clusters are easily visible in the Euclidean space. At first sight, you can count up to 15 clusters. Most of them have a regular shape, while some others are more elongated. Given this distribution, we know from theory that the K-means algorithm can achieve good clustering performance (provided that the right number of clusters is chosen).
Let's now implement our own version of the K-means algorithm, using a custom Python class. Its basic building blocks will be a fit_predict method exposed to the users of the class, and a dump_to_file method to save the cluster labels to a CSV file (see Exercise 1.4).

class KMeans:
    def __init__(self, n_clusters, max_iter=100):
        self.n_clusters = n_clusters
        self.max_iter = max_iter
        self.centroids = None
        self.labels = None

    def __plot_clusters(self, X, labels, c=None):
        fig, ax = plt.subplots(figsize=(6,6), dpi=90)
        ax.scatter(X[:,0], X[:,1], c=labels, cmap='Set3')
        if c is not None:
            ax.scatter(c[:,0], c[:,1], marker="*", color="red")
        return fig, ax

    def __init_random_centroids(self, X):
        c_idx = np.random.randint(0, X.shape[0], self.n_clusters)
        c = X[c_idx] # fancy-indexing
        return c

    def fit_predict(self, X, plot_clusters=False, plot_steps=5):
        """Run the K-means clustering on X.
        :param X: input data points, array, shape = (N,C).
        :return: labels : array, shape = N.
        """
        # For each point, store the positional index of the closest centroid
        nearest = np.empty(X.shape[0])
        c_out = self.__init_random_centroids(X)

        for j in range(self.max_iter):
            nearest_prev = nearest

            # For each point in X, compute the squared distance from each centroid
            # dist will have shape: (n_clusters, X.shape[0])
            dist = np.array([np.sum((X-c_out[i])**2, axis=1) for i in range(self.n_clusters)])

            # Find the nearest centroid for each point using argmin
            # nearest will have shape: (X.shape[0],)
            nearest = dist.argmin(axis=0)

            if plot_clusters and (j % plot_steps == 0):
                fig, ax = self.__plot_clusters(X, nearest, c_out)
                ax.set_title(f"Iteration {j}")

            # Early stopping if the cluster assignments have not changed
            if np.array_equal(nearest, nearest_prev):
                print(f"Early stopping at iteration {j}!")
                break

            # For each cluster, compute the average coordinates considering only points currently
            # assigned to it. Then, use them as the new centroid coordinates.
            for i in range(self.n_clusters):
                c_temp = X[nearest == i].mean(axis=0) # indexing with a mask
                if not np.isnan(c_temp).any(): # handle the case of an empty cluster
                    c_out[i] = c_temp

        self.centroids = c_out
        self.labels = nearest
        return self.labels

    def dump_to_file(self, filename):
        """Dump the evaluated labels to a CSV file."""
        with open(filename, 'w') as fp:
            fp.write('Id,ClusterId\n')
            for i, label in enumerate(self.labels):
                fp.write(f'{i},{label:d}\n')
There are several things worth noting in the code above. The nearest array keeps track, for each point, of the index of the centroid of the cluster that point is assigned to. In other words, if the ith item of nearest has value j, we know that the ith point of the dataset is assigned to the jth cluster. Using this convention, the choice of the cluster Ids is straightforward: the ith cluster will have ID = i. The dist array stores, row by row, the squared distances between each centroid and every point; argmin along its first axis then finds, for each point, the Id of the closest centroid. Finally, array_equal lets you efficiently compare the current and previous cluster assignments (if we wanted to compare centroid coordinates with a tolerance, we could have used the numpy.allclose method instead). We strongly suggest you stop and try to fully understand the first two points. For your convenience, we commented the shapes of the NumPy arrays involved at their respective lines.
Let's now try to use our class and save our labels to a file.
np.random.seed(0)
kmeans_model = KMeans(10)
l_gauss = kmeans_model.fit_predict(X_gauss)
l_gauss
The algorithm stopped after 11 iterations since the cluster assignments did not change between iterations 10 and 11. In other words, at the 11th iteration, no point changed its closest centroid (i.e. its cluster membership).
Note also line 1 of the previous cell: it initializes NumPy's random generator with a specific seed. With a fixed seed, the same points will be chosen as initial centroids across different runs. This is common practice whenever you want your results to be reproducible. Many scikit-learn classes offer this functionality via the random_state parameter.
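As a small illustration (assuming scikit-learn is installed; it is not used elsewhere in this notebook), the scikit-learn implementation would be seeded like this:
# Sketch: fixing the initialization of scikit-learn's KMeans via random_state
from sklearn.cluster import KMeans as SkKMeans

sk_model = SkKMeans(n_clusters=10, random_state=0)
sk_labels = sk_model.fit_predict(X_gauss)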
kmeans_model.dump_to_file('gauss_labels.csv')
Let's now analyze the Chameleon dataset, just as before.
X_cham = np.loadtxt("chameleon_clusters.txt", delimiter=",", skiprows=1)
X_cham
_, _ = plot_2d_scatter(X_cham)
As you can notice, there are six interleaved areas with a high density of points and irregular shapes. Also, the dataset looks much noisier (note the superimposed sine wave). Given this distribution, we know from theory that the K-means algorithm might struggle even when the right number of clusters is chosen.
It is time to enhance our plotting toolkit. We hence define a new function that highlights the final centroids. Specifically, we can use a different marker style and color for them.
def plot_centroids(X, c, title=None):
    fig, ax = plot_2d_scatter(X)
    ax.scatter(c[:,0], c[:,1], marker="*", color="red")
    if title:
        ax.set_title(title)
    return fig, ax
Let's now try to inspect the final centroids obtained in both our datasets.
np.random.seed(0)
k_gauss = KMeans(10)
k_cham = KMeans(10)
k_gauss.fit_predict(X_gauss)
k_cham.fit_predict(X_cham)
plot_centroids(X_gauss, k_gauss.centroids, title="Gaussian, K=10")
plot_centroids(X_cham, k_cham.centroids, title="Chameleon, K=10")
We can visually see that the cluster identification might be improved by choosing a different K. Our initial choice is too low for the Gaussian data and too high for the Chameleon one.
np.random.seed(0)
k_gauss = KMeans(15)
k_cham = KMeans(6)
k_gauss.fit_predict(X_gauss)
k_cham.fit_predict(X_cham)
plot_centroids(X_gauss, k_gauss.centroids, title="Gaussian, K=15")
plot_centroids(X_cham, k_cham.centroids, title="Chameleon, K=6")
Here something interesting happened. On one hand, the Gaussian dataset was clustered in a better way (again, visually speaking). Yet, there are two pairs of centroids that are close enough to lie in the same globular cluster. This is likely due to the random initialization of the centroids (in the literature, there exist other initialization approaches besides randomly picking points). On the other hand, the centroids identified in the Chameleon dataset did not match the real data distribution. This is exactly what we expected: the initial distribution has shapes and sizes that the K-means algorithm can hardly fit.
For the sake of completeness, you can run multiple executions with different random seeds. Results are likely to remain the same for the Chameleon clusters. For the first dataset, instead, a different random initialization could lead to a better result, as shown below.
np.random.seed(6)
k_gauss = KMeans(15)
k_cham = KMeans(6)
k_gauss.fit_predict(X_gauss)
k_cham.fit_predict(X_cham)
plot_centroids(X_gauss, k_gauss.centroids, title="Gaussian, K=15")
plot_centroids(X_cham, k_cham.centroids, title="Chameleon, K=6")
We might be interested in the intermediate positions of the centroids, besides the final ones. We included this plotting functionality in the KMeans class defined above, via the plot_clusters and plot_steps parameters of fit_predict.
The plot is created in __plot_clusters. Take a close look at the c and cmap parameters of scatter: c accepts a list used to specify which points must share the same color (points at indices holding equal values in that list will be drawn with the same color), while cmap lets you choose a color map among the predefined ones.
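A minimal, self-contained sketch of this behavior (with made-up coordinates and labels) could be:
# Four toy points: the first two share label 0, the last two share label 1,
# so each pair gets its own color from the 'Set3' color map.
fig, ax = plt.subplots(figsize=(4, 4), dpi=90)
ax.scatter([0, 1, 2, 3], [0, 1, 0, 1], c=[0, 0, 1, 1], cmap='Set3')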
Let's test it with Gaussian clusters.
np.random.seed(6)
k_gauss = KMeans(15)
_ = k_gauss.fit_predict(X_gauss, True, 3)
So far, we have talked about "good" and "bad" clustering results in relation to the capability of K-means to adapt to different data distributions. Our discussion was mainly based on the final centroid positions. Such a visual approach would not be feasible with many more points and dimensions.
Whenever numerical data is processed, several distance-based metrics can be used to assess the quality of our clustering. Let's now focus on the silhouette measure. We can define a couple of functions to compute its value (namely compute_a and compute_b). Note that we will use the Euclidean distance.
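As a reminder, for each point x with mean intra-cluster distance a(x) and mean distance b(x) to the points of the closest other cluster, the silhouette is defined as
$$ s(x) = \frac{b(x) - a(x)}{\max\{a(x),\, b(x)\}} $$
which is exactly what the functions below compute point by point.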
def compute_a(x, x_label, X, labels):
    c_idx = np.where(labels == x_label) # we are including the point itself...
    a = np.mean(np.linalg.norm(x - X[c_idx], axis=1)) # ...because this distance will be 0 when computed from itself
    return a

def compute_b(x, x_label, X, labels):
    bs = []
    other_labels = np.unique(labels[labels != x_label]) # array with all cluster Ids different from the current one
    for ol in other_labels:
        c_idx = np.where(labels == ol)
        b = np.mean(np.linalg.norm(x - X[c_idx], axis=1))
        bs.append((ol, b))
    neigh, b = min(bs, key=lambda x: x[1])
    return neigh, b
def silhouette_samples(X, labels):
    """Evaluate the silhouette for each point and return them as a list.
    :param X: input data points, array, shape = (N,C).
    :param labels: the list of cluster labels, shape = N.
    :return: silhouette : array, shape = N
    """
    silhouette = np.zeros(X.shape[0])
    for idx in range(X.shape[0]):
        x = X[idx]
        x_label = labels[idx]
        # compute a
        a = compute_a(x, x_label, X, labels)
        # compute b
        _, b = compute_b(x, x_label, X, labels)
        silhouette[idx] = (b - a) / np.max([a, b])
    return silhouette

def silhouette_score(X, labels):
    """Evaluate the silhouette for each point and return the mean.
    :param X: input data points, array, shape = (N,C).
    :param labels: the list of cluster labels, shape = N.
    :return: silhouette : float
    """
    return np.mean(silhouette_samples(X, labels))
Let's see how our clustering performs.
np.random.seed(6)
k_gauss = KMeans(15)
k_cham = KMeans(6)
l_gauss = k_gauss.fit_predict(X_gauss)
l_cham = k_cham.fit_predict(X_cham)
sil_gauss = silhouette_score(X_gauss, l_gauss)
sil_cham = silhouette_score(X_cham, l_cham)
print('Gaussian clusters, average silhouette:', sil_gauss)
print('Chameleon clusters, average silhouette:', sil_cham)
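If scikit-learn is available, you could also cross-check these values against sklearn.metrics.silhouette_score (an optional sanity check, not part of the exercise). Small differences are expected, since our compute_a also averages over the point itself.
# Optional cross-check (assumes scikit-learn is installed)
from sklearn import metrics
print('Gaussian clusters, scikit-learn silhouette:', metrics.silhouette_score(X_gauss, l_gauss))
print('Chameleon clusters, scikit-learn silhouette:', metrics.silhouette_score(X_cham, l_cham))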
As expected from our initial inspection, we achieved a much higher silhouette score on the first dataset.
Let's now analyze how the Silhouette values are distributed between our points. To do so, we can plot them in ascending order on the x-axis.
def plot_silhouette(silhouette, title=None):
    fig, ax = plt.subplots(figsize=(6,6), dpi=90)
    ax.plot(np.sort(silhouette))
    ax.set_ylabel("Silhouette")
    ax.set_ylim(-1,1)
    ax.set_xlabel("Records")
    if title:
        ax.set_title(title)
    return fig, ax

_, _ = plot_silhouette(silhouette_samples(X_gauss, l_gauss), "Silhouette for gaussian points")
_, _ = plot_silhouette(silhouette_samples(X_cham, l_cham), "Silhouette for chameleon points")
Again, as expected, more points are shifted towards the maximum silhouette value (i.e. 1) when clustering the first dataset rather than the second one.
Until now, we have analyzed results achieved using the right number of clusters as K. Even with the best educational purposes, there is a trick here: we did that because we knew the approximate number of clusters thanks to a prior visual inspection. In many real-world tasks, your data can span more than two or three dimensions and you cannot visualize it. Hence, it is commonplace to search for the K value that leads to the best possible clustering subdivision. As far as the silhouette measure is concerned, the higher the average silhouette, the better the intra-cluster cohesion and the inter-cluster separation.
Let's analyze how our average silhouette varies with different K values for the Gaussian clusters.
np.random.seed(6)
scores = []
K_values = list(range(2, 21))
for k in K_values:
    print(f"Computing K:{k}")
    kmeans_model = KMeans(k)
    labels = kmeans_model.fit_predict(X_gauss)
    scores.append(silhouette_score(X_gauss, labels))

fig, ax = plt.subplots(figsize=(6,6), dpi=90)
ax.plot(scores, marker="o")
ax.set_ylabel("Mean Silhouette")
ax.set_ylim(-1,1)
ax.set_xlabel("K")
ax.set_xticks(list(range(len(K_values))))
_ = ax.set_xticklabels(K_values)
The figure shows a clear trend. The average silhouette value increases almost monotonically as K goes from 2 to 14. Then a plateau is reached, with no further improvements after K=15. This is what we expected from our previous analysis.
Note that here the highest value is achieved with K=14. This is possible, even though the right number of clusters is 15, since we are working with Euclidean distances and the clusters are not completely separated from each other.
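If you prefer to extract the best K programmatically rather than reading it off the chart, a short follow-up on the scores computed above is enough:
# Pick the K with the highest mean silhouette from the sweep above
best_k = K_values[int(np.argmax(scores))]
print(f"Best K according to the mean silhouette: {best_k}")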