# Introduction to K-Means Clustering Algorithm in Python

## Introduction:

K-Means is an unsupervised learning algorithm that is widely used in data mining and pattern recognition. It is built on a squared-error criterion: the algorithm looks for a partition of the data into K clusters that minimizes this error. First, choose K points as the initial cluster centers (a common simple choice is the first K samples). Second, assign each remaining sample to its nearest center according to the minimum-distance criterion, which gives an initial classification. If that classification is not yet reasonable, recompute each cluster center and reassign the samples, iterating until the classification is stable. (source: https://www.researchgate.net/publication/271616608_A_Clustering_Method_Based_on_K-Means_Algorithm)

A more detailed description of the algorithm can be found here: https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf
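The procedure described above (pick initial centers, assign each point to its nearest center, recompute the centers, repeat) can be sketched in a few lines of NumPy. This is only a minimal illustration, not scikit-learn's implementation. The function name `simple_kmeans`, the use of the first K points as initial centers, and the toy data are assumptions made for this example.

```python
import numpy as np


def simple_kmeans(points, k, max_iter=100):
    """Minimal K-means sketch: the first k points are the initial centers."""
    points = np.asarray(points, dtype=float)
    centers = points[:k].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(max_iter):
        # Assignment step: distance from every point to every center
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points
        new_centers = centers.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old center if a cluster is empty
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break  # the centers stopped moving
        centers = new_centers
    return labels, centers


# Toy data: two well-separated groups (hypothetical values)
toy_points = np.array([[1.0, 1.0], [9.0, 9.0], [1.0, 2.0], [2.0, 1.0],
                       [8.0, 9.0], [9.0, 8.0]])
toy_labels, toy_centers = simple_kmeans(toy_points, k=2)
print(toy_labels)
print(toy_centers)
```

Starting from the first K points is simple but sensitive to the order of the data, which is one reason library implementations usually offer smarter initialization.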

**Problem**:

We want to identify which customers are most likely to convert (the target customers), so that the marketing team can be given this insight and plan its strategy accordingly.

**Dataset:**

Download the dataset from www.kaggle.com (url: https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python)

Variable descriptions:

* CustomerID: Unique ID assigned to the customer

* Gender: Gender of the customer

* Age: Age of the customer

* Annual Income (k$): Annual income of the customer

* Spending Score (1–100): Score assigned by the mall based on customer behavior and spending nature

Steps to solve this problem:

→Importing Libraries

→Importing Data

→Data Visualization

→Clustering using K-Means

→Selection of Clusters

→Plotting the Cluster Boundary and Clusters

→Visualization of cluster result

Let’s walk through the steps of applying K-means clustering in Python:

Step 1: Import Libraries

First, import the packages we need. This may take a little while the first time, so be patient :)

import pandas as pd

import numpy as np

import matplotlib.pyplot as plt

import seaborn as sns

sns.set(style="darkgrid")

from sklearn.cluster import KMeans

from sklearn.preprocessing import MinMaxScaler

from yellowbrick.cluster import KElbowVisualizer

from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.cm as cm

Step 2: Import the Dataset

The dataset was described above; it can be downloaded from the Kaggle link given in the Dataset section.

`# Read the data`

dataset = pd.read_csv("Mall_Customers.csv")

dataset.head()

The dataset contains the customer ID, gender, age, annual income, and spending score of each customer shopping at the mall. Next, let's look at the summary information for the dataset.

`dataset.info()`

Step 5: Visualize the relationship plot

After dropping the columns we do not need, only 2 columns remain: Annual Income and Spending Score.

`# Drop the unnecessary columns`

data = dataset.drop(["CustomerID","Gender","Age"], axis = 1)

data.head()

`sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", data=data, s=30, color="red", alpha = 0.8)`

The scatter plot above shows the relationship between Annual Income and Spending Score for all 200 customers. Some grouping is visible, but it is still unclear how many groups there are.

Step 6: Determination of k value

Generating an Array of Features.

`# Select the variables to cluster`

data_x = dataset.iloc[:, 3:5]

data_x.head()

`# Convert the data frame to an array`

x_array = np.array(data_x)

print(x_array)
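Selecting by position with `iloc[:, 3:5]` depends on the column order of the CSV file. As an alternative sketch, the same two columns can be selected by name. The small data frame below is a made-up stand-in with the dataset's column names, and the variable names are only illustrative.

```python
import pandas as pd

# Tiny stand-in frame with the same column names as the mall dataset
# (the rows are made-up values, used only to show the selection)
sample = pd.DataFrame({
    "CustomerID": [1, 2, 3],
    "Gender": ["Male", "Female", "Female"],
    "Age": [19, 35, 52],
    "Annual Income (k$)": [15, 60, 120],
    "Spending Score (1-100)": [39, 50, 80],
})

feature_columns = ["Annual Income (k$)", "Spending Score (1-100)"]
by_name = sample[feature_columns]   # selection by column name
by_position = sample.iloc[:, 3:5]   # selection by position, as in the step above

print(by_name.equals(by_position))
features = by_name.to_numpy()       # convert the data frame to an array
print(features.shape)
```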

Rescale the variables to a common range. K-means is based on distances, so a variable with much larger values would otherwise dominate the result. `MinMaxScaler` maps each variable to the range [0, 1].

`# Rescale the variables`

scaler = MinMaxScaler()  # rescales each feature to [0, 1]

x_scaled = scaler.fit_transform(x_array)

x_scaled
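To make the scaling step concrete, here is a small sketch of what min-max scaling does to each column: subtract the column minimum and divide by the column range. With its default settings, `MinMaxScaler` applies this formula per column. The numbers below are hypothetical values, not rows taken from the dataset.

```python
import numpy as np

# Hypothetical income and spending values, one row per customer
raw = np.array([[15.0, 39.0],
                [76.0, 6.0],
                [137.0, 94.0]])

col_min = raw.min(axis=0)
col_max = raw.max(axis=0)

# Min-max scaling: (value - column minimum) / (column maximum - column minimum)
scaled_by_hand = (raw - col_min) / (col_max - col_min)
print(scaled_by_hand)
```

After scaling, the smallest value in each column becomes 0 and the largest becomes 1, so both variables contribute on the same scale when distances are computed.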

To determine the value of K, I use two methods: the Elbow Method using WCSS (within-cluster sum of squares) and cluster quality measured with the Silhouette Coefficient.

**Elbow Method using WCSS:** This is based on the principle that while clustering performance as measured by WCSS improves (i.e. WCSS decreases) as k increases, the rate of improvement usually slows down. The k at which the curve bends (the "elbow") is a reasonable choice.

`# Elbow method: compute the WCSS (within-cluster sum of squares) for each k`

Sum_of_squared_distances = []

K = range(1, 15)

for k in K:

    km = KMeans(n_clusters=k)

    km = km.fit(x_scaled)

    Sum_of_squared_distances.append(km.inertia_)

`# Plot the elbow curve`

plt.plot(K, Sum_of_squared_distances, 'bx-')

plt.xlabel('k')

plt.ylabel('Sum_of_squared_distances')

plt.title('Elbow Method For Optimal k')

plt.show()

In the plot, WCSS drops sharply up to K=5 and changes only gradually after that, so we choose K=5 as an appropriate number of clusters.
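The WCSS values plotted in the elbow curve come from the `inertia_` attribute of a fitted `KMeans` model. As a rough check of what that number means, the sketch below recomputes it by hand as the sum of squared distances from each point to the center of its own cluster. The six points are hypothetical, and `n_init=10` and `random_state=0` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two tight groups
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# WCSS: squared distance from each point to the center of its own cluster
own_centers = km.cluster_centers_[km.labels_]
wcss_by_hand = ((points - own_centers) ** 2).sum()

print(wcss_by_hand, km.inertia_)
```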

Next, I use the Silhouette Coefficient. The silhouette coefficient of a data point measures how well it fits its own cluster compared with how far it is from the other clusters. A value close to 1 means the point sits in an appropriate cluster, while a value close to −1 implies that the point is probably in the wrong cluster.

model = KMeans(random_state=123)

# Instantiate the KElbowVisualizer with the number of clusters and the metric

visualizer = KElbowVisualizer(model, k=(2,6), metric='silhouette', timings=False)

`# Fit the data and visualize`

visualizer.fit(x_scaled)

visualizer.show()  # called poof() in Yellowbrick versions before 1.0

Plotting the Silhouette Coefficient against *K*, we see the highest coefficient, about 0.57, at 5 clusters.
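For intuition about what the silhouette coefficient measures, the sketch below computes it directly from its definition: for each point, `a` is the mean distance to the other points in its own cluster, `b` is the smallest mean distance to the points of any other cluster, and the score is `(b - a) / max(a, b)`. The function name `silhouette_by_hand` and the toy data are assumptions for this illustration. It expects at least two clusters and is not a replacement for scikit-learn's `silhouette_samples`.

```python
import numpy as np


def silhouette_by_hand(points, labels):
    """Silhouette value of every point, computed from the definition."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # a cluster of one point gets a score of 0
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores


# Hypothetical 1-D example: two pairs of points far apart
toy = np.array([[0.0], [1.0], [10.0], [11.0]])
toy_pair_labels = np.array([0, 0, 1, 1])
toy_scores = silhouette_by_hand(toy, toy_pair_labels)
print(toy_scores)
```

All four scores come out close to 1 because each point is much nearer to its own pair than to the other pair.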

Step 7: Plotting the Cluster Boundary and Clusters

Fit K-means with K = 5 to the scaled data and display the cluster centers.

`# Fit K-means with the chosen number of clusters`

kmeans = KMeans(n_clusters=5, random_state=123)

kmeans.fit(x_scaled)

`# Show the cluster centers`

print(kmeans.cluster_centers_)

Show the cluster assigned to each customer.

`# Show the cluster labels`

print(kmeans.labels_)

Add the cluster labels as a new column in the dataset data frame.

`# Add a "kluster" column to the dataset data frame`

dataset["kluster"] = kmeans.labels_

dataset.head()
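Because the model was fitted on scaled data, the cluster centers printed above are expressed in the [0, 1] range rather than in k$ and score points. One way to read them in the original units is to pass them through the scaler's `inverse_transform`. The sketch below shows the idea on hypothetical values; the variable names are illustrative and independent of the tutorial's variables.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical income (k$) and spending-score values
raw = np.array([[20.0, 10.0], [22.0, 14.0], [24.0, 12.0],
                [100.0, 90.0], [102.0, 86.0], [104.0, 88.0]])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(raw)
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# Centers live in the scaled [0, 1] space; map them back to original units
centers_original = scaler.inverse_transform(model.cluster_centers_)
print(np.round(centers_original, 2))
```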

Now let's see which data points ended up in which cluster by visualizing the clusters.

`# Visualize the clusters`

plt.scatter(x_scaled[kmeans.labels_==0,0],x_scaled[kmeans.labels_==0,1],s=80,c='magenta',label='Careful')

plt.scatter(x_scaled[kmeans.labels_==1,0],x_scaled[kmeans.labels_==1,1],s=80,c='yellow',label='Standard')

plt.scatter(x_scaled[kmeans.labels_==2,0],x_scaled[kmeans.labels_==2,1],s=80,c='green',label='Target')

plt.scatter(x_scaled[kmeans.labels_==3,0],x_scaled[kmeans.labels_==3,1],s=80,c='cyan',label='Careless')

plt.scatter(x_scaled[kmeans.labels_==4,0],x_scaled[kmeans.labels_==4,1],s=80,c='burlywood',label='Sensible')

plt.scatter(kmeans.cluster_centers_[:,0],kmeans.cluster_centers_[:,1],marker = "o", alpha = 0.9,s=250,c='red',label='Centroids')

plt.title('Cluster of Clients')

plt.xlabel('Annual Income (scaled)')

plt.ylabel('Spending Score (scaled)')

plt.legend()

plt.show()

From the plot above, we can see that the customers have been grouped into 5 clusters:

**Cluster 1** - High income, low spending = Careful

**Cluster 2** - Medium income, medium spending = Standard

**Cluster 3** - High income, high spending = Target

**Cluster 4** - Low income, high spending = Careless

**Cluster 5** - Low income, low spending = Sensible
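The label numbers K-means assigns are arbitrary and can change between runs, so it is safer to read each cluster's character from its averages than to assume that a given number always means "Target" or "Careful". A minimal sketch of such a profile with pandas is shown below. The rows are made up, and the output column names `mean_income`, `mean_spending`, and `customers` are only illustrative.

```python
import pandas as pd

# Made-up rows with the tutorial's column names, plus a "kluster" label
labelled = pd.DataFrame({
    "Annual Income (k$)": [20, 24, 90, 94, 88, 92],
    "Spending Score (1-100)": [80, 76, 10, 14, 90, 86],
    "kluster": [0, 0, 1, 1, 2, 2],
})

# Average income and spending per cluster, plus the number of customers
profile = labelled.groupby("kluster").agg(
    mean_income=("Annual Income (k$)", "mean"),
    mean_spending=("Spending Score (1-100)", "mean"),
    customers=("Annual Income (k$)", "count"),
)
print(profile)
```

In this toy example, cluster 2 has both high income and high spending, so it is the one that would be named "Target".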

You can also draw a silhouette plot for each candidate number of clusters, here from 2 to 6. The code below is adapted from the scikit-learn silhouette analysis example:

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:

    # Create a subplot with 1 row and 2 columns

    fig, (ax1, ax2) = plt.subplots(1, 2)

    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot

    # The silhouette coefficient can range from -1, 1 but in this example all

    # lie within [-0.1, 1]

    ax1.set_xlim([-0.1, 1])

    # The (n_clusters+1)*10 is for inserting blank space between silhouette

    # plots of individual clusters, to demarcate them clearly.

    ax1.set_ylim([0, len(x_array) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator

    # seed of 10 for reproducibility.

    clusterer = KMeans(n_clusters=n_clusters, random_state=10)

    cluster_labels = clusterer.fit_predict(x_array)

    # The silhouette_score gives the average value for all the samples.

    # This gives a perspective into the density and separation of the formed

    # clusters

    silhouette_avg = silhouette_score(x_array, cluster_labels)

    print("For n_clusters =", n_clusters,

          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample

    sample_silhouette_values = silhouette_samples(x_array, cluster_labels)

    y_lower = 10

    for i in range(n_clusters):

        # Aggregate the silhouette scores for samples belonging to

        # cluster i, and sort them

        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]

        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)

        ax1.fill_betweenx(np.arange(y_lower, y_upper),

                          0, ith_cluster_silhouette_values,

                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle

        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot

        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")

    ax1.set_xlabel("The silhouette coefficient values")

    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values

    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks

    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed

    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)

    ax2.scatter(x_array[:, 0], x_array[:, 1], marker='.', s=30, lw=0, alpha=0.7,

                c=colors, edgecolor='k')

    # Labeling the clusters

    centers = clusterer.cluster_centers_

    # Draw white circles at cluster centers

    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',

                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):

        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,

                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")

    ax2.set_xlabel("Feature space for the 1st feature")

    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "

                  "with n_clusters = %d" % n_clusters),

                 fontsize=14, fontweight='bold')

plt.show()

If you cluster on more than two variables, the last code section cannot be used as-is to visualize the clusters, because its scatter plot only works for two-dimensional data. In that case, first apply a dimension reduction technique to bring the data down to two dimensions, then reuse the plotting code.
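As one possible way to do that, the sketch below clusters hypothetical three-feature data and then projects it to two dimensions with PCA for plotting. PCA is only one of several dimension reduction techniques and is an assumption here, as are the random data, the variable names, and the choice of 3 clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical scaled data with three features per customer
rng = np.random.default_rng(0)
features_3d = rng.random((60, 3))

# Cluster on all three features
labels_3d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features_3d)

# Project to two principal components for plotting only;
# the clustering itself was done on all three features
projected = PCA(n_components=2).fit_transform(features_3d)
print(projected.shape)
```

The two columns of `projected` can then be used as the x and y coordinates of a scatter plot, colored by `labels_3d`.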

THX U :)

Reference:

https://www.researchgate.net/publication/271616608_A_Clustering_Method_Based_on_K-Means_Algorithm

https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf