# Introduction to K-Means Clustering Algorithm in Python

## Introduction:

A description of the algorithm can be found here: https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf
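For orientation, the core loop of K-means can be sketched in a few lines of plain Python. This is only a toy illustration of the assign-then-update idea, not the implementation scikit-learn uses; the function names, the toy points, and the choice of random data points as starting centroids are assumptions made for this example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, n_iter=100, seed=0):
    """Toy K-means: returns (centroids, labels)."""
    rng = random.Random(seed)
    # Assumption: start from k distinct data points chosen at random
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda j: squared_distance(p, centroids[j]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # nothing moved, so the assignment is stable
        centroids = new_centroids
    return centroids, labels

# Hypothetical toy data: two loose groups of three points each
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = k_means(points, k=2)
print(centroids)
print(labels)
```

K-means can settle in a local optimum depending on where the centroids start, which is one reason library implementations typically try several initializations.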

Problem:

We want to understand which customers are most likely to convert (the target customers), so that this insight can be passed to the marketing team and the strategy planned accordingly.

Dataset:

Variable descriptions:

* CustomerID: Unique ID assigned to the customer
* Gender: Gender of the customer
* Age: Age of the customer
* Annual Income (k$): Annual income of the customer
* Spending Score (1–100): Score assigned by the mall based on customer behavior and spending nature

Steps to solve this problem:
→Importing Libraries
→Importing Data
→Data Visualization
→Clustering using K-Means
→Selection of Clusters
→Plotting the Cluster Boundary and Clusters
→Visualization of cluster result

Let’s walk through how to apply the K-means clustering algorithm in Python:

Step 1: Import Libraries

First, we import the packages we need in Python. This may take a little while, so please be patient. :)

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import silhouette_samples, silhouette_score
from yellowbrick.cluster import KElbowVisualizer

sns.set(style="darkgrid")
```

Step 2: Import the Dataset

```python
# Read the data
dataset = pd.read_csv("Mall_Customers.csv")
dataset.head()
```

The dataset contains the Customer ID, Gender, Age, Annual Income, and Spending Score of each customer shopping at the mall. Next, let's look at the information about the dataset.

`dataset.info()`

Step 5: Visualize the relationship plot

After dropping the unnecessary columns, only 2 columns remain: Annual Income and Spending Score.

```python
# Drop the columns we do not need
data = dataset.drop(["CustomerID", "Gender", "Age"], axis=1)
data.head()
```

```python
sns.scatterplot(x="Annual Income (k$)", y="Spending Score (1-100)", data=data, s=30, color="red", alpha=0.8)
```

The scatter plot above shows the relationship between Annual Income and Spending Score for all 200 customers. There is some visible pattern, but it is still uncertain how many groups there are.

Step 6: Determination of k value

Generating an Array of Features.

```python
# Select the variables to cluster
data_x = dataset.iloc[:, 3:5]
data_x.head()

# Convert the data frame to an array
x_array = np.array(data_x)
print(x_array)
```

Rescale the variables. When the variables take large values or sit on different scales, they should be brought onto a common scale before clustering. Here `MinMaxScaler` rescales each feature to the range [0, 1].

```python
# Rescale the variables
scaler = MinMaxScaler()  # used to scale each feature
x_scaled = scaler.fit_transform(x_array)
x_scaled
```
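To make the rescaling step concrete, here is a minimal sketch of min-max scaling for a single column in plain Python. The helper name and the sample incomes are made up for illustration; mapping a constant column to 0.0 is a choice made in this sketch.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption for this sketch: a constant column maps to 0.0
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]

# Hypothetical annual incomes in k$
incomes = [15, 40, 75, 137]
scaled_incomes = min_max_scale(incomes)
print(scaled_incomes)
```

The smallest value maps to 0 and the largest to 1, so a feature with a wide range no longer dominates the distance calculation.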

To determine the value of K, I use two methods: the elbow method using WCSS (within-cluster sum of squares) and cluster quality using the silhouette coefficient.

The elbow method is based on the principle that while clustering performance as measured by WCSS improves (i.e. WCSS decreases) as k increases, the rate of improvement usually slows down.

```python
# Elbow method: compute WCSS (within-cluster sum of squares) for each k
sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = KMeans(n_clusters=k)
    km = km.fit(x_scaled)
    sum_of_squared_distances.append(km.inertia_)

# Plot the elbow curve
plt.plot(K, sum_of_squared_distances, 'bx-')
plt.xlabel('k')
plt.ylabel('Sum_of_squared_distances')
plt.title('Elbow Method For Optimal k')
plt.show()
```

If we look for the point after which WCSS no longer drops sharply, we find it at K=5. So we will choose K=5 as an appropriate number of clusters.
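To make WCSS concrete, the following sketch computes it by hand for a tiny example: the sum of squared distances from every point to the centroid of its cluster. The points, labels, and centroids are hypothetical and chosen so the arithmetic is easy to check.

```python
def wcss(points, labels, centroids):
    """Within-cluster sum of squares: squared distance of each point to its own centroid."""
    return sum(
        sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
        for p, lab in zip(points, labels)
    )

# Hypothetical toy data: two pairs of points far apart on the x-axis
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# k = 1: a single centroid at the overall mean
wcss_k1 = wcss(points, [0, 0, 0, 0], [(5.0, 1.0)])

# k = 2: one centroid per pair
wcss_k2 = wcss(points, [0, 0, 1, 1], [(0.0, 1.0), (10.0, 1.0)])

print(wcss_k1, wcss_k2)
```

Going from one cluster to two removes most of the within-cluster spread in this toy case. On an elbow plot, that kind of sharp drop followed by much smaller gains is what marks the elbow.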

Next, I use the silhouette coefficient method. The silhouette coefficient of a data point measures how well it fits its own cluster and how far it is from other clusters. A silhouette coefficient close to 1 means the data point is in an appropriate cluster, and a value close to −1 implies the data point is in the wrong cluster.

```python
model = KMeans(random_state=123)

# Instantiate the KElbowVisualizer with the range of k and the metric
visualizer = KElbowVisualizer(model, k=(2, 6), metric='silhouette', timings=False)

# Fit the data and visualize
visualizer.fit(x_scaled)
visualizer.show()  # called poof() in older Yellowbrick releases
```

Plotting the silhouette coefficient against K, we see the highest coefficient, about 0.57, at 5 clusters.
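The silhouette coefficient of a single point can be written as s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. Below is a minimal plain-Python sketch of that formula. The function name and toy points are assumptions for this example, and it expects at least two clusters.

```python
import math

def silhouette_values(points, labels):
    """Per-point silhouette coefficients; labels must contain at least two clusters."""
    clusters = set(labels)
    values = []
    for i, p in enumerate(points):
        own = labels[i]
        # Distances to the other members of the point's own cluster
        same = [math.dist(p, q) for j, q in enumerate(points)
                if labels[j] == own and j != i]
        if not same:
            values.append(0.0)  # convention: a single-point cluster scores 0
            continue
        a = sum(same) / len(same)
        # Mean distance to each other cluster; keep the nearest one
        b = min(
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == other)
            / labels.count(other)
            for other in clusters if other != own
        )
        values.append((b - a) / max(a, b))
    return values

# Hypothetical toy data: two tight pairs far apart
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette_values(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_values(points, [0, 1, 0, 1])   # pairs split across clusters
print(good)
print(bad)
```

Keeping each pair together gives values close to 1, while splitting the pairs across clusters gives negative values, which matches the interpretation described above.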

Step 7: Plotting the Cluster Boundary and Clusters

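Apply K=5 to the dataset and display the cluster centers.

```python
# Fit K-means with the chosen number of clusters
# (random_state is fixed only for reproducibility; the seed value is an assumption)
kmeans = KMeans(n_clusters=5, random_state=123)
kmeans.fit(x_scaled)

# Show the cluster centers
print(kmeans.cluster_centers_)
```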

Show the cluster results.

```python
# Show the cluster label of each data point
print(kmeans.labels_)
```

Add the cluster results as a column in the dataset data frame.

```python
# Add a "kluster" column to the dataset data frame
dataset["kluster"] = kmeans.labels_
dataset.head()
```

Now let's see which data points fell into which cluster by visualizing the clusters.

```python
# Visualize the clusters
# Note: K-means numbers its clusters arbitrarily, so check the centroids
# before relying on the label-to-name mapping below
segments = [
    (0, 'magenta', 'Careful'),
    (1, 'yellow', 'Standard'),
    (2, 'green', 'Target'),
    (3, 'cyan', 'Careless'),
    (4, 'burlywood', 'Sensible'),
]
for label, color, name in segments:
    mask = kmeans.labels_ == label
    plt.scatter(x_scaled[mask, 0], x_scaled[mask, 1], s=80, c=color, label=name)

# Plot the centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            marker="o", alpha=0.9, s=250, c='red', label='Centroids')
plt.title('Cluster of Clients')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.show()
```

From the plot above, we can see that the customer data has been grouped into 5 clusters:

Cluster 1: High income, low spending = Careful

Cluster 2: Medium income, medium spending = Standard

Cluster 3: High income, high spending = Target

Cluster 4: Low income, high spending = Careless

Cluster 5: Low income, low spending = Sensible
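Because K-means assigns cluster numbers arbitrarily, it can help to derive the segment name from each centroid's position instead of from its index. The sketch below does this for centroids in min-max scaled space. The thresholds, the helper name, and the example centroids are all hypothetical, and for simplicity "Standard" is decided from the spending score alone.

```python
def name_segment(income, spending, low=0.35, high=0.65, income_split=0.5):
    """Map a centroid (both coordinates scaled to [0, 1]) to a segment name.

    The thresholds are illustrative assumptions, not values from the analysis.
    """
    if low <= spending <= high:
        return "Standard"
    high_income = income >= income_split
    high_spending = spending > high
    if high_income:
        return "Target" if high_spending else "Careful"
    return "Careless" if high_spending else "Sensible"

# Hypothetical centroids as (scaled income, scaled spending score)
example_centroids = [(0.60, 0.17), (0.33, 0.50), (0.59, 0.83), (0.09, 0.80), (0.09, 0.20)]
names = [name_segment(income, spending) for income, spending in example_centroids]
print(names)
```

With real results, the same function could be applied to each row of `kmeans.cluster_centers_`, after checking that the thresholds make sense for the data.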

You can also use the cluster silhouette plot, here for 2 to 6 clusters. The code is:

```python
range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)

    # The 1st subplot is the silhouette plot.
    # The silhouette coefficient can range from -1 to 1; the x-axis is
    # limited to [-0.1, 1] here.
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters + 1) * 10 inserts blank space between the silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(x_array) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(x_array)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the
    # formed clusters.
    silhouette_avg = silhouette_score(x_array, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(x_array, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for the next plot (10 rows of blank space)
        y_lower = y_upper + 10

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for the average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the y-axis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(x_array[:, 0], x_array[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Label the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    # Write each cluster number inside its circle
    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()
```

If you are clustering in more than two dimensions, you cannot run the last code section directly to visualize the clusters, because it only works for two-dimensional data. You can still use it after applying a dimensionality reduction technique.
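One common dimensionality reduction technique is principal component analysis (PCA). The sketch below projects higher-dimensional data onto its first two principal components using NumPy's SVD, so the result can be passed to a 2-D scatter plot. The function name and the random 4-feature data are assumptions for illustration; `sklearn.decomposition.PCA` is a ready-made alternative.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the first n_components principal components."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Hypothetical data: 200 samples with 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))

X_2d = pca_project(X)
print(X_2d.shape)
```

The cluster labels would still come from K-means fitted on the full feature set. The 2-D projection is used only for plotting.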

Thank you :)

Reference:

https://github.com/andrewxiechina/DataScience/blob/master/K-Means/cs229-notes7a%202.pdf

Halima Tusyakdiah, statistics student at the Islamic University of Indonesia