K-nearest Neighbors Classification in RStudio

Halima Tusyakdiah
6 min read · Jan 12, 2020

Introduction:

The k-nearest neighbors (KNN) algorithm is a supervised machine learning algorithm that can be used for both classification and regression problems, although in industry it is mainly used for classification. Two properties describe KNN well −

  • Lazy learning algorithm − KNN is a lazy learning algorithm because it has no specialized training phase; it uses all of the training data at classification time.
  • Non-parametric learning algorithm − KNN is also non-parametric because it assumes nothing about the distribution of the underlying data.

How the KNN Algorithm Works

The KNN algorithm uses ‘feature similarity’ to predict the values of new data points: a new data point is assigned a value based on how closely it matches the points in the training set.

Example of the kNN Algorithm

Let’s consider 10 ‘drinking items’ rated on two parameters, “sweetness” and “fizziness”, each on a scale of 1 to 10. These are perception-based ratings and so may vary between individuals; I use my own ratings (yours might differ) to carry this illustration forward. The ratings of a few items look somewhat like this:

“Sweetness” measures the perceived sugar content of an item; “fizziness” measures the presence of bubbles due to the carbon dioxide in the drink. Again, these ratings are based on personal perception and are strictly relative.

From the figure above, it is clear that we have bucketed the 10 items into 4 groups: ‘COLD DRINKS’, ‘ENERGY DRINKS’, ‘HEALTH DRINKS’ and ‘HARD DRINKS’. The question is: into which group does ‘Maaza’ fall? This will be determined by calculating distances.

Calculating the distance between ‘Maaza’ and its nearest neighbors (‘ACTIV’, ‘Vodka’, ‘Pepsi’ and ‘Monster’) requires a distance measure, the most popular being the Euclidean distance: the straight-line distance between two points, the distance you would measure with a ruler.

Using the coordinates of Maaza (8, 2) and Vodka (2, 1), the distance between ‘Maaza’ and ‘Vodka’ can be calculated as:

dist(Maaza, Vodka) = √((8 − 2)² + (2 − 1)²) = √37 ≈ 6.08
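
The same calculation can be checked in R with the built-in dist() function (a quick sketch using just the two coordinate pairs above):

## Euclidean distance between Maaza (8, 2) and Vodka (2, 1)
drinks <- rbind(Maaza = c(8, 2), Vodka = c(2, 1))
dist(drinks, method = "euclidean")
## the result is about 6.08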

Since the distance between Maaza and ACTIV is the smallest, we infer that Maaza is most similar in nature to ACTIV, which belongs to the group of Health Drinks.

If k = 1, the algorithm considers only the single nearest neighbor to Maaza, i.e. ACTIV; if k = 3, the algorithm considers the 3 nearest neighbors to Maaza (ACTIV, Vodka, Monster) and takes a vote over their groups, and ACTIV is still the nearest of the three. (source: https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/)
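
As a minimal sketch of this idea with R’s class package (the ratings of the four labelled drinks are my own illustrative assumptions, since the original figure is not reproduced here):

library(class)
## assumed sweetness/fizziness ratings for four labelled drinks
train_x <- rbind(c(9, 1), c(2, 1), c(8, 9), c(3, 9))      ## ACTIV, Vodka, Pepsi, Monster
train_y <- factor(c("Health", "Hard", "Cold", "Energy"))  ## their drink groups
## classify Maaza (8, 2) by its single nearest neighbour (k = 1)
knn(train_x, rbind(c(8, 2)), cl = train_y, k = 1)         ## ACTIV is nearest, so "Health"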

DATASET:

You can download the data at the following link → https://www.kaggle.com/leonardopena/top50spotify2019
This dataset holds the top 50 most-listened-to songs in the world on Spotify in 2019. It contains 50 songs and 13 descriptive variables; when read into R the file also has a leading index column, so the variables below occupy columns 2 to 14.

Variable Information:

  1. Track.Name: Name of the Track
  2. Artist.Name: Name of the Artist
  3. Genre: the genre of the track
  4. Beats.Per.Minute: The tempo of the song.
  5. Energy: The energy of a song; the higher the value, the more energetic the song.
  6. Danceability: The higher the value, the easier it is to dance to this song.
  7. Loudness..dB..: The higher the value, the louder the song.
  8. Liveness: The higher the value, the more likely the song is a live recording.
  9. Valence.: The higher the value, the more positive the mood of the song.
  10. Length.: The duration of the song.
  11. Acousticness..: The higher the value, the more acoustic the song is.
  12. Speechiness.: The higher the value, the more spoken words the song contains.
  13. Popularity: The higher the value, the more popular the song is.

My previous Medium post, in which I explored this data, can be seen at the following link: HERE
Okay, let’s start the classification using the KNN method.

Step 1: Import the Dataset

## read the dataset (adjust the file path to your machine)
spotify <- read.csv(file = "D:\\KULIAH\\BIML\\KNN\\top50.csv", sep = ",")
View(spotify)

The number of rows in the dataset:

nrow(spotify)

Descriptive statistics for all variables:

summary(spotify)

Step 2: Data Normalization

You should always normalize the data set so that the output remains unbiased: because each variable is on a different scale, variables with large ranges would otherwise dominate the distance calculation and lead to a biased outcome.

## the min-max normalization function
normalize <- function(x) { (x - min(x)) / (max(x) - min(x)) }
## normalize the 9 numeric predictor columns (5 to 13); the text columns are
## excluded and column 14 (Popularity) is kept aside as the class label
spotify_norm <- as.data.frame(lapply(spotify[, 5:13], normalize))
head(spotify_norm)
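
As a side note (not part of the original workflow), z-score standardization with R’s built-in scale() function is a common alternative to min-max normalization:

## alternative: z-score standardization of the same predictor columns
spotify_std <- as.data.frame(scale(spotify[, 5:13]))
head(spotify_std)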

Step 3: Data Splitting

The kNN algorithm is applied to the training set and its results are verified on the test set. How to split the data is up to the researcher; here I use 80% of the data for training and 20% for testing.

## fix the random seed so the split is reproducible
set.seed(123)
## draw a random sample of 80% of the row indices of the dataset
data_split <- sample(1:nrow(spotify_norm), 0.8 * nrow(spotify_norm))
## extract the training set
train <- spotify_norm[data_split, ]
head(train)
dim(train)
## extract the testing set
test <- spotify_norm[-data_split, ]

After obtaining the training and testing sets, we create separate vectors for the class label, Popularity, so that the final predictions can be compared with the actual values.

## extract column 14 (Popularity) of the training rows; it will be the 'cl' argument of the knn function
target_category <- spotify[data_split, 14]
## extract column 14 (Popularity) of the test rows, to measure the accuracy later
test_category <- spotify[-data_split, 14]
target_category
test_category

Step 4: Making Predictions

Next, we build a machine learning model using the training set. To apply the KNN algorithm we need the ‘class’ package provided by R, which contains the knn() function; install it first with install.packages("class") if you do not already have it.

## load the class package
library(class)
## run the knn function with k = 5
test_pred <- knn(train, test, cl = target_category, k = 5)
test_pred

From the output above we get the classification results using k = 5. The value of k is chosen by the researcher; a common starting point is the square root of the number of training observations. A quick way to compare several candidate values of k is sketched below.
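This loop is not part of the original walkthrough; it is a small sketch that reuses the train, test, target_category and test_category objects defined above to report the test accuracy for a few values of k:

## try several values of k and print the test accuracy of each
for (k in c(1, 3, 5, 7, 9)) {
  pred <- knn(train, test, cl = target_category, k = k)
  cat("k =", k, "accuracy:", mean(pred == test_category) * 100, "%\n")
}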
Then, we compare the actual and the predicted values.

df_pred <- data.frame(test_category, test_pred)
df_pred

Looking at the classification output, we can see that some predictions still do not match the actual values.

Step 5: Accuracy

After building the model, we can check the accuracy of the predictions using a confusion matrix.

## create the confusion matrix
conf_mat <- table(test_category, test_pred)
conf_mat

For a neater presentation of the confusion matrix, you can use the following script.

## evaluate the model's performance with a cross-tabulation
install.packages("gmodels")   ## only needed once
library(gmodels)
CrossTable(x = test_category, y = test_pred, prop.chisq = FALSE)

Checking accuracy

## this function divides the number of correct predictions by the total
## number of predictions, which tells us how accurate the model is
accuracy <- function(x) { sum(diag(x)) / sum(x) * 100 }
accuracy(conf_mat)
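
The same figure can be obtained in one line directly from the prediction vectors:

## proportion of predictions that exactly match the actual values, in percent
mean(test_pred == test_category) * 100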

The test set contains 10 observations (20% of the 50 songs), and the accuracy obtained here is low. That is not surprising: Popularity takes many distinct values, so with such a small dataset, exact matches between the predicted and actual labels are rare.


Halima Tusyakdiah

Statistics student at the Islamic University of Indonesia