Exploratory Data Analysis for Top 50 Spotify Songs in Python

Halima Tusyakdiah
4 min read · Jan 11, 2020

https://ultimagz.com/iptek/setelah-13-tahun-spotify-akhirnya-mendapat-untung/

Case Study: Top 50 Spotify Songs — 2019.

Introduction to Exploratory Data Analysis (EDA)

Exploratory Data Analysis refers to the critical process of performing initial investigations on data so as to discover patterns, spot anomalies, test hypotheses, and check assumptions with the help of summary statistics and graphical representations. (Source: https://towardsdatascience.com/exploratory-data-analysis-8fc1cb20fd15)

Dataset:
Download the dataset from Kaggle (URL: https://www.kaggle.com/leonardopena/top50spotify2019)

Steps…
First, install the Python packages used in this article, then import them. The imports may take a little while to run.
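
Before running the imports, it can help to check which packages are already installed. The snippet below is a minimal sketch using only the standard library; the helper name missing_packages and the REQUIRED list are assumptions made for this illustration (the list simply mirrors the packages imported in this article).

```python
import importlib.util

# Top-level import names of the packages used in this article (assumed list).
REQUIRED = ["numpy", "pandas", "seaborn", "matplotlib", "plotly", "wordcloud"]


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    # These would need to be installed first, for example with pip.
    print("Missing packages:", " ".join(missing))
else:
    print("All packages are available.")
```

Anything reported as missing can then be installed with your usual package manager before continuing.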

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Use seaborn's dark grid theme for all plots
sns.set(style="darkgrid", color_codes=True)
import statistics as stat
import plotly.express as px

Import the Dataset.

# Read the Kaggle CSV using the ISO-8859-1 encoding
spotify = pd.read_csv("top50.csv", encoding="ISO-8859-1")
# Show the first five rows
spotify.head()

The shape attribute gives the size of the data, i.e. the number of rows and columns.

spotify.shape

Check the data types of the variables

spotify.info()
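
To make the size and data type checks concrete, here is a small sketch on a made-up DataFrame. The rows are invented and only mimic a few column names from top50.csv; the isnull().sum() line is an extra check for missing values that the article does not run itself.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from top50.csv).
toy = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Genre": ["pop", "latin", "pop"],
    "Energy": [55, 81, 80],
    "Popularity": [79, 92, 85],
})

n_rows, n_cols = toy.shape    # shape is an attribute: (rows, columns)
print(n_rows, n_cols)
print(toy.dtypes)             # one data type per column
print(toy.isnull().sum())     # number of missing values per column
```

On the real data, the same three lines can be run on spotify instead of toy.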

Select the integer variables by column position

spotify_int = spotify.iloc[:, 4:14]
spotify_int.head()
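
Selecting by position breaks if the column order changes. As an alternative sketch (not what the article uses), select_dtypes picks columns by data type instead. The toy DataFrame is made up; note that on the real file this approach may also keep any leading row-number column that was read in as a regular integer column.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from top50.csv).
toy = pd.DataFrame({
    "Track.Name": ["Song A", "Song B", "Song C"],
    "Genre": ["pop", "latin", "pop"],
    "Energy": [55, 81, 80],
    "Popularity": [79, 92, 85],
})

# Keep only the numeric columns, whatever their position.
toy_int = toy.select_dtypes(include="number")
print(list(toy_int.columns))
```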

Display descriptive statistics for the integer variables

spotify_int.describe()

Plot the pairwise relationships between all integer variables.

sns.pairplot(spotify_int);

Show a bar graph of Popularity by Track Name

spotify.plot(
    y='Popularity', x='Track.Name', kind='bar', figsize=(26, 6), legend=True,
    title="Popularity Vs Track Name", fontsize=18, stacked=True,
    color=['y', 'r', 'b', 'y', 'r', 'b', 'y'],
)
plt.ylabel('Popularity', fontsize=18)
plt.xlabel('Track Name', fontsize=18)
plt.show()

Count of songs per artist

plt.figure(figsize=(10,10))
sns.countplot(y='Artist.Name', data=spotify, order=spotify["Artist.Name"].value_counts().index)
plt.show()

Count by Genre

spotify['Genre'].value_counts().plot.bar()
plt.title('Count by Genre')
plt.ylabel('quantity')
plt.show()
print(spotify.groupby('Genre').size())
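
Counting songs per genre pairs naturally with a per-genre summary of another variable. The sketch below, on made-up values, computes both the count and the mean Popularity for each genre in one groupby call; the name genre_summary is only for this illustration.

```python
import pandas as pd

# Made-up rows for illustration only (not taken from top50.csv).
toy = pd.DataFrame({
    "Genre": ["pop", "latin", "pop", "latin", "edm"],
    "Popularity": [80, 90, 84, 88, 70],
})

# Count of songs and mean popularity per genre, most popular genre first.
genre_summary = (
    toy.groupby("Genre")["Popularity"]
    .agg(["size", "mean"])
    .sort_values("mean", ascending=False)
)
print(genre_summary)
```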

Create a word cloud based on music genre

from wordcloud import WordCloud
# Join the genre labels into one string so only genre words are counted
genre_text = " ".join(spotify["Genre"])
# Create the wordcloud object
wordcloud = WordCloud(width=700, height=600, margin=3).generate(genre_text)
# Display the generated image:
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
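
A word cloud scales each word by how often it occurs in the text. As a rough sketch of the counting behind the picture, the standard library's Counter can tally the words directly. The genre labels here are made up, and the exact handling in WordCloud (stop words, two-word phrases) depends on its settings, so the numbers are only an approximation of what the image reflects.

```python
from collections import Counter

# Made-up genre labels for illustration only (not taken from top50.csv).
genres = ["dance pop", "pop", "latin", "dance pop", "canadian hip hop"]

# Split every label into single words and count them.
words = " ".join(genres).split()
counts = Counter(words)

print(counts.most_common(2))
```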

Visualize the relationship between genre and popularity using a swarm plot

plt.figure(figsize=(10,5))
swarmplot=sns.swarmplot(x="Genre",y="Popularity",data=spotify, s=13)
swarmplot.set_xticklabels(swarmplot.get_xticklabels(),rotation=90)
swarmplot.set_title("Relationship between Genre & Popularity")

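Visualize the relationship between Beats Per Minute and artists, colored by Popularity. This plot uses the artist DataFrame, the subset of selected artists built by the filter script further down, so run that script first.

sns.catplot(x="Artist.Name", y="Beats.Per.Minute", hue="Popularity", s=15, data=artist, kind="swarm")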

Box plot of the relationship between Loudness..dB.. and Energy

sns.catplot(x = "Loudness..dB..", y = "Energy", kind = "box", data = spotify)
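
A box plot summarizes each group by its quartiles and marks points far from the box as outliers. The sketch below computes those quantities by hand on made-up Energy values, assuming the common 1.5 × IQR rule for the whiskers (the default in matplotlib and seaborn).

```python
import pandas as pd

# Made-up Energy values for illustration only (not taken from top50.csv).
energy = pd.Series([40, 50, 55, 60, 65, 70, 95])

q1 = energy.quantile(0.25)   # lower edge of the box
median = energy.median()     # line inside the box
q3 = energy.quantile(0.75)   # upper edge of the box
iqr = q3 - q1                # height of the box

# Points outside these fences are drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = energy[(energy < lower_fence) | (energy > upper_fence)]

print(q1, median, q3, outliers.tolist())
```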

Spearman correlation coefficients for all integer variables

pd.set_option('display.precision', 3)
# Use the integer columns selected earlier so text columns are skipped
corr = spotify_int.corr(method='spearman')
print(corr)
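
Spearman correlation measures how well the relationship between two variables follows a monotonic trend, and it is equivalent to the Pearson correlation computed on the ranks of the values. The sketch below checks this on made-up numbers; the column names mimic the dataset, but the values are invented.

```python
import pandas as pd

# Made-up values for illustration only (not taken from top50.csv).
toy = pd.DataFrame({
    "Energy": [30, 45, 60, 75, 90],
    "Loudness..dB..": [-11, -9, -6, -5, -3],
    "Acousticness..": [70, 20, 40, 10, 5],
})

# Spearman correlation as computed by pandas.
spearman = toy.corr(method="spearman")

# The same result, obtained as Pearson correlation on the ranks.
pearson_on_ranks = toy.rank().corr(method="pearson")

print(spearman.round(3))
```

In this toy data, Energy and loudness rise together perfectly, while Energy and acousticness move mostly in opposite directions.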

Marginal plot between Acousticness and Beats Per Minute.

sns.jointplot(x="Beats.Per.Minute", y="Acousticness..", data=spotify, kind="kde");

Script to filter several artists

artist = spotify[spotify["Artist.Name"].isin(["Ed Sheeran", "J Balvin", "Ariana Grande", "Marshmello", "The Chainsmokers", "Shawn Mendes"])]
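
The filter works in two steps: isin builds a True/False mask with one entry per row, and indexing with that mask keeps only the rows marked True. A minimal sketch on made-up rows (the names toy, selected, mask and subset are for this illustration only):

```python
import pandas as pd

# Made-up rows for illustration only (not taken from top50.csv).
toy = pd.DataFrame({
    "Artist.Name": ["Ed Sheeran", "Lizzo", "J Balvin", "Ed Sheeran"],
    "Popularity": [87, 90, 88, 84],
})

selected = ["Ed Sheeran", "J Balvin"]

# Step 1: True where the artist is in the selected list.
mask = toy["Artist.Name"].isin(selected)

# Step 2: keep only the rows where the mask is True.
subset = toy[mask]

print(mask.tolist())
print(subset)
```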

Okay, maybe that's all for this EDA discussion. These are just a few visualizations of the Top 50 Spotify Songs data; there is much more that can be explored in depth.
Thanks :)

Written by Halima Tusyakdiah

Statistics student at Islamic University of Indonesia