Introduction to K-Means Clustering Algorithm in Python

Halima Tusyakdiah
8 min read · Jan 4, 2020

Introduction:

K-Means is an unsupervised algorithm commonly used in data mining and pattern recognition. It aims to minimize a cluster performance index based on the squared-error criterion: the algorithm searches for K partitions of the data that satisfy this criterion. First, it chooses points to represent the initial cluster centers (often the first K samples of the input). Next, it assigns each remaining sample to its nearest center according to the minimum-distance criterion, which produces an initial partition. If the partition is unsatisfactory, the cluster centers are recomputed from the current assignments, and the assign-and-update steps are repeated until a stable, reasonable partition is reached. (source: https://www.researchgate.net/publication/271616608_A_Clustering_Method_Based_on_K-Means_Algorithm)

A description of the algorithm can be found: https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf
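To make the iterative procedure concrete, here is a minimal NumPy sketch of the two alternating steps (assign each point to its nearest center, then recompute each center as the mean of its points). It is illustrative only and ignores edge cases such as empty clusters, which scikit-learn's KMeans handles for you:

import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """A bare-bones K-means loop (ignores the empty-cluster edge case)."""
    rng = np.random.default_rng(seed)
    # Initialize centers by picking k distinct samples at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # the partition has stabilized
        centers = new_centers
    return labels, centers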

Problem:

We want to understand which customers the mall can most easily target [Target Customers], so that this insight can be handed to the marketing team to plan its strategy accordingly.

Dataset:

Download the dataset from www.kaggle.com (url: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python)

Description of variables:
*CustomerID: Unique ID assigned to the customer
*Gender: Gender of the customer
*Age: Age of the customer
*Annual Income (k$): Annual Income of the customer
*Spending Score (1–100): Score assigned by the mall based on customer behavior and spending nature

Steps to solve this problem:
→Importing Libraries
→Importing Data
→Data Visualization
→Clustering using K-Means
→Selection of Clusters
→Plotting the Cluster Boundary and Clusters
→Visualization of cluster result

Let’s look at the steps for applying the K-means clustering algorithm in Python:

Step 1: Import Libraries

First, we need to import some packages in Python; this may take a few minutes, so be patient. :)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from yellowbrick.cluster import KElbowVisualizer
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
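Note that yellowbrick does not ship with scikit-learn, so the import above can fail on a fresh environment; if it does, installing it first (for example with pip install yellowbrick) should fix it.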

Step 2: Import the Dataset

I described the data above; you can also download it here: DATASET

# Read the data
dataset = pd.read_csv("Mall_Customers.csv")
dataset.head()

So in the dataset we have the Customer ID, Gender, Age, Annual Income, and Spending Score of customers purchasing at the mall. Now let’s look at some basic information about the dataset.

dataset.info()
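Before going further, it is worth a quick completeness check (an optional step, not in the original walkthrough); the Kaggle file should contain 200 rows with no missing values:

# Count missing values per column; all counts should be zero
print(dataset.isnull().sum())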

Step 3: Visualize the relationship between the variables

Eliminating the unnecessary columns leaves only two: Annual Income and Spending Score.

# Drop the columns we do not need
data = dataset.drop(["CustomerID", "Gender", "Age"], axis=1)
data.head()
sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", data=data, s=30, color="red", alpha=0.8)

The scatter plot above shows the relation between Annual Income and Spending Score for all 200 points. There is a visible pattern, but it is not yet clear how many groups there are.

Step 4: Determine the value of K

Generating an Array of Features.

# Select the variables to be clustered
data_x = dataset.iloc[:, 3:5]
data_x.head()
# Convert the data frame to an array
x_array = np.array(data_x)
print(x_array)

Next, standardize the scale of the variables. This is especially important when the variables take large values or have very different ranges.

# Rescale the variables to the [0, 1] range
scaler = MinMaxScaler()
x_scaled = scaler.fit_transform(x_array)
x_scaled

To determine the value of K, I use two methods: the Elbow Method using WCSS (the within-cluster sum of squares) and cluster quality using the Silhouette Coefficient.

The Elbow Method is based on the principle that while clustering performance improves (i.e. WCSS decreases) as k increases, the rate of improvement usually diminishes; the point where the curve flattens (the "elbow") suggests an appropriate k.

# Elbow Method to minimize WCSS (within-cluster sum of squares)
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    Sum_of_squared_distances.append(km.inertia_)

# Plot the elbow curve
plt.plot(K, Sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()

Observing the curve, there is no sudden change in WCSS after K = 5, so we choose K = 5 as an appropriate number of clusters.

Then I use the Silhouette Coefficient method. The silhouette coefficient of a point measures how well it is assigned to its own cluster and how far it is from other clusters. A coefficient close to 1 means the point sits in an appropriate cluster, while a coefficient close to −1 implies the point is in the wrong cluster.

model = KMeans(random_state=123) 
# Instantiate the KElbowVisualizer with the number of clusters and the metric
visualizer = KElbowVisualizer(model, k=(2,6), metric='silhouette', timings=False)
# Fit the data and visualize
visualizer.fit(x_scaled)
visualizer.poof()

Plotting the silhouette coefficient against K, we see the highest coefficient, about 0.57, at 5 clusters.
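As a quick optional sanity check (a sketch, not part of the original walkthrough), you can compute that average score for K = 5 directly; the exact value depends on the random initialization, but it should be close to the visualizer's reading:

# Cross-check: average silhouette score for K = 5 on the scaled data
labels = KMeans(n_clusters=5, random_state=123).fit_predict(x_scaled)
print(silhouette_score(x_scaled, labels))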

Step 5: Plotting the Cluster Boundary and Clusters

Now apply K-means with the chosen K = 5 to the dataset and display the cluster centers.

# Fit K-means with the chosen number of clusters
kmeans = KMeans(n_clusters=5, random_state=123)
kmeans.fit(x_scaled)
# Display the cluster centers
print(kmeans.cluster_centers_)
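Because the model was fitted on min-max-scaled data, these centers lie in the [0, 1] range; to read them in the original units (k$ and score points), you can invert the scaling:

# Map the scaled cluster centers back to the original units
print(scaler.inverse_transform(kmeans.cluster_centers_))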

Showing cluster results.

# Display the cluster label of each point
print(kmeans.labels_)

Add a cluster-label column to the dataset data frame.

# Add a "kluster" column to the dataset data frame
dataset["kluster"] = kmeans.labels_
dataset.head()
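As a quick check on the segmentation (not in the original walkthrough), you can count how many of the 200 customers fall into each cluster:

# Count how many customers ended up in each cluster
print(dataset["kluster"].value_counts())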

Okay, now let’s see which data points ended up in each cluster by visualizing the result.

# Visualize the clustering result
plt.scatter(x_scaled[kmeans.labels_==0,0], x_scaled[kmeans.labels_==0,1], s=80, c='magenta', label='Careful')
plt.scatter(x_scaled[kmeans.labels_==1,0], x_scaled[kmeans.labels_==1,1], s=80, c='yellow', label='Standard')
plt.scatter(x_scaled[kmeans.labels_==2,0], x_scaled[kmeans.labels_==2,1], s=80, c='green', label='Target')
plt.scatter(x_scaled[kmeans.labels_==3,0], x_scaled[kmeans.labels_==3,1], s=80, c='cyan', label='Careless')
plt.scatter(x_scaled[kmeans.labels_==4,0], x_scaled[kmeans.labels_==4,1], s=80, c='burlywood', label='Sensible')
plt.scatter(kmeans.cluster_centers_[:,0], kmeans.cluster_centers_[:,1], marker="o", alpha=0.9, s=250, c='red', label='Centroids')
plt.title('Cluster of Clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()

From the plot above, we can see that the customers have been clustered into 5 groups:

Cluster 1: High income, low spending = Careful

Cluster 2: Medium income, medium spending = Standard

Cluster 3: High income, high spending = Target

Cluster 4: Low income, high spending = Careless

Cluster 5: Low income, low spending = Sensible
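If you want these segment names in the data frame itself, here is a small sketch; note that K-means label numbers are arbitrary (they depend on initialization), so check kmeans.cluster_centers_ to confirm which numeric label corresponds to which segment before reusing this mapping:

# Hypothetical mapping from numeric labels to segment names; verify the
# label-to-segment correspondence against your own cluster centers first
segment_names = {0: "Careful", 1: "Standard", 2: "Target",
                 3: "Careless", 4: "Sensible"}
dataset["segment"] = dataset["kluster"].map(segment_names)
dataset.head()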

Then you can also draw the cluster silhouette plot for each candidate K, from 2 to 6 clusters, with the following code (based on the scikit-learn silhouette analysis example):

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1 to 1, but in this
    # example all values lie within [-0.1, 1].
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters + 1) * 10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(x_array) + (n_clusters + 1) * 10])

    # Initialize the clusterer with the n_clusters value and a random
    # generator seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(x_array)

    # The silhouette_score gives the average value over all samples,
    # giving a perspective on the density and separation of the clusters.
    silhouette_avg = silhouette_score(x_array, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette score for each sample
    sample_silhouette_values = silhouette_samples(x_array, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for the next plot
        y_lower = y_upper + 10  # 10 for the blank space between clusters

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line marks the average silhouette score of all values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")
    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(x_array[:, 0], x_array[:, 1], marker='.', s=30, lw=0,
                alpha=0.7, c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at the cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()

Note that the last code section only visualizes two-dimensional clusters; if you are clustering in more than two dimensions, you cannot run it as-is. You can still reuse it after applying a dimension-reduction technique.
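For example, here is a minimal sketch (not part of the original tutorial, and assuming the column names shown in the head() output) that clusters on three features and then projects them onto two principal components with PCA so the result can still be plotted in 2-D:

from sklearn.decomposition import PCA

# Illustrative example: cluster on three numeric columns instead of two
features = dataset[["Age", "Annual Income (k$)", "Spending Score (1-100)"]]
scaled = MinMaxScaler().fit_transform(features)
labels_3d = KMeans(n_clusters=5, random_state=123).fit_predict(scaled)

# Reduce the scaled features to two principal components for plotting
coords = PCA(n_components=2).fit_transform(scaled)
plt.scatter(coords[:, 0], coords[:, 1], c=labels_3d, cmap="viridis", s=30)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("K-means clusters projected onto two principal components")
plt.show()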

THX U :)

Reference:

https://www.researchgate.net/publication/271616608_A_Clustering_Method_Based_on_K-Means_Algorithm

https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf


Halima Tusyakdiah

Statistics student at the Islamic University of Indonesia