Popularity of Songs on Spotify

Andrew Lee
4 min read · Apr 26, 2021

Introduction

Spotify is one of the largest music streaming platforms in the world, boasting 155 million premium subscribers and 345 million monthly active users. With that comes a treasure trove of data on users’ preferences when it comes to music. I wanted to dive deeper into some of this data to find insights on what makes certain songs more popular than others.

Description of the Data

The Spotify dataset was found on Kaggle, and contains audio features on ~175,000 songs from 1921–2021. For this analysis, I restricted the dataset to songs that were released from 2010–2020 (I love music from the 2010s).
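As a rough illustration of that restriction, the filter could look like the sketch below. The tiny DataFrame stands in for the Kaggle file, and the variable names `songs` and `songs_2010s` are my own placeholders, not names from the original notebook.

```python
import pandas as pd

# Tiny stand-in for the Kaggle data (the real file has ~175,000 rows).
songs = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "year": [1985, 2010, 2015, 2021],
    "popularity": [40, 65, 70, 5],
})

# Keep only songs released from 2010 through 2020 (both ends inclusive).
songs_2010s = songs[songs["year"].between(2010, 2020)].reset_index(drop=True)
print(songs_2010s)
```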

Features

Primary

id (Id of track generated by Spotify)

Continuous Variables:

acousticness (Ranges from 0 to 1)

danceability (Ranges from 0 to 1)

energy (Ranges from 0 to 1)

duration_ms (Integer typically ranging from 200k to 300k)

instrumentalness (Ranges from 0 to 1)

valence (Ranges from 0 to 1)

popularity (Ranges from 0 to 100)

tempo (Float typically ranging from 50 to 150)

liveness (Ranges from 0 to 1)

loudness (Float typically ranging from -60 to 0)

speechiness (Ranges from 0 to 1)

year (Ranges from 1921 to 2020)

Dummy

mode (0 = Minor, 1 = Major)

explicit (0 = No explicit content, 1 = Explicit content)

Categorical Variables

key (All keys on octave encoded as values ranging from 0 to 11, starting on C as 0, C# as 1 and so on…)

artists (List of artists mentioned)

release_date (Date of release mostly in yyyy-mm-dd format, however precision of date may vary)

name (Name of the song)

DataFrame

Statistical Methods

Before proposing a hypothesis, I wanted to see which feature of this dataset had the highest correlation with ‘popularity’, which measures the popularity of a song on a scale of 0–100.

After careful examination, I found that ‘instrumentalness’ had the highest absolute correlation to song ‘popularity’ with a correlation of -0.4195. Instrumentalness is a measure from 0 to 1, where a value closer to 1 indicates a greater likelihood the track contains no vocal content. Instrumentalness would be interesting to analyze because I will be able to figure out if there is a statistically significant relationship between a song’s vocal content and its popularity.

Correlation Table

Correlation Table showing correlations between columns
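One way to rank features by their correlation with ‘popularity’ is sketched here. The six rows of data are invented purely for illustration and will not reproduce the -0.4195 figure. Sorting by absolute value matters because a strong negative correlation would otherwise end up at the bottom of the list.

```python
import pandas as pd

# Invented toy data; the real analysis uses the filtered Spotify DataFrame.
df = pd.DataFrame({
    "popularity":       [70, 65, 60, 20, 15, 30],
    "instrumentalness": [0.0, 0.1, 0.0, 0.9, 0.8, 0.7],
    "energy":           [0.8, 0.5, 0.7, 0.6, 0.4, 0.5],
})

# Pearson correlation of every column with 'popularity' (excluding itself).
corr_with_pop = df.corr()["popularity"].drop("popularity")

# Order by absolute value so strong negative correlations rank first too.
order = corr_with_pop.abs().sort_values(ascending=False).index
ranked = corr_with_pop.reindex(order)
print(ranked)
```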

Hypothesis

Null Hypothesis: Songs with low ‘instrumentalness’ values and songs with high ‘instrumentalness’ values have THE SAME mean ‘popularity’ scores (‘Instrumentalness’ does not affect song ‘Popularity’).

Alternative Hypothesis: Songs with low ‘instrumentalness’ values and songs with high ‘instrumentalness’ values have DIFFERENT mean ‘popularity’ scores (‘Instrumentalness’ affects song ‘Popularity’).

Significance Level: 5% (α = 0.05, corresponding to a 95% confidence level)

When preparing the data for analysis, I created a new feature ‘instrumental_level’ that provided a value of 0 if a song’s ‘instrumentalness’ value was less than or equal to 0.5, and 1 otherwise. Doing this was helpful for the two-sample statistical t-test I performed in the next step.

Create a new feature ‘instrumental_level’
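A minimal sketch of that step, assuming the column is named ‘instrumentalness’ as in the feature list. The four example values are made up to show how the 0.5 boundary is handled.

```python
import numpy as np
import pandas as pd

# Made-up values chosen to sit on both sides of the 0.5 cutoff.
df = pd.DataFrame({"instrumentalness": [0.0, 0.5, 0.51, 0.93]})

# 0 when instrumentalness <= 0.5, 1 otherwise.
df["instrumental_level"] = np.where(df["instrumentalness"] <= 0.5, 0, 1)
print(df)
```

Note that a value of exactly 0.5 falls in the low group, which matches the "less than or equal to" rule described above.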

I separated the dataset based on ‘instrumental_level’ (either 0 or 1) into separate DataFrames and compared the ‘popularity’ values of these two DataFrames in a two-sample t-test. This type of statistical test will help me figure out if ‘instrumentalness’ has a significant relationship with song ‘popularity’.

Two-sample t-test
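The split-and-test step might be sketched as follows with `scipy.stats.ttest_ind`. The data here is synthetic, so the numbers will not match the article's results. The post does not say whether equal variances were assumed, so `equal_var=False` (Welch's t-test) is my own assumption, as are the names `low_pop` and `high_pop`.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in data: the low-instrumentalness group is more popular.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "instrumental_level": [0] * 500 + [1] * 100,
    "popularity": np.concatenate([
        rng.normal(60, 10, 500),   # instrumental_level == 0
        rng.normal(40, 10, 100),   # instrumental_level == 1
    ]),
})

# Separate the 'popularity' values by group.
low_pop = df.loc[df["instrumental_level"] == 0, "popularity"]
high_pop = df.loc[df["instrumental_level"] == 1, "popularity"]

# Two-sided two-sample t-test; equal_var=False is an assumption (Welch).
t_stat, p_value = stats.ttest_ind(low_pop, high_pop, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```

A positive t-statistic here means the first group passed in (low instrumentalness) has the higher mean.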


Results

The resulting t-statistic is 68.88 and the p-value is reported as 0 (in practice, a value too small to display rather than exactly zero). Because this is far below the 0.05 threshold, I reject the null hypothesis.

Recall that the null hypothesis was that songs with low ‘instrumentalness’ values and songs with high ‘instrumentalness’ values have the same mean ‘popularity’ scores on Spotify. In rejecting the null hypothesis at the 5% significance level, I conclude that the difference in mean ‘popularity’ between the two groups is statistically significant, which indicates a relationship between ‘instrumentalness’ and the ‘popularity’ of songs on Spotify.

Below is a graph of 2,674 observations from the DataFrame plotting ‘instrumentalness’ and ‘popularity’ with a line of best fit.

Scatter Plot

Number of data points reduced to 10% of total observations for clarity.
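Drawing a 10% sample and fitting a straight line through it can be sketched like this. The data is synthetic, and `random_state=42` is an arbitrary seed I chose so the sample is reproducible. The plotting call itself is left out to keep the example dependency-light.

```python
import numpy as np
import pandas as pd

# Synthetic data with a negative trend plus noise.
rng = np.random.default_rng(42)
n = 1000
instr = rng.uniform(0, 1, n)
df = pd.DataFrame({
    "instrumentalness": instr,
    "popularity": 60 - 25 * instr + rng.normal(0, 10, n),
})

# Keep 10% of the rows so the scatter plot is not overcrowded.
sample = df.sample(frac=0.1, random_state=42)

# Least-squares line of best fit: popularity ~ slope * instrumentalness + intercept.
slope, intercept = np.polyfit(
    sample["instrumentalness"].to_numpy(),
    sample["popularity"].to_numpy(),
    deg=1,
)
print(f"{len(sample)} points, slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The fitted `slope` and `intercept` can then be passed to any plotting library to draw the line over the sampled points.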

Below is a model summary of a linear model containing ‘popularity’ as the dependent variable and ‘instrumentalness’ as the independent variable.
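The original summary output is not reproduced here. As a stand-in, a single-predictor linear model can be fitted with `scipy.stats.linregress`, which is not necessarily the tool used for the article's summary. The data below is synthetic, so the coefficients are illustrative only.

```python
import numpy as np
from scipy import stats

# Synthetic data: popularity falls as instrumentalness rises.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 500)                  # instrumentalness (independent)
y = 60 - 25 * x + rng.normal(0, 10, 500)    # popularity (dependent)

# Ordinary least-squares fit of y on x.
result = stats.linregress(x, y)
print(f"slope     = {result.slope:.3f}")
print(f"intercept = {result.intercept:.3f}")
print(f"R-squared = {result.rvalue ** 2:.3f}")
print(f"p-value   = {result.pvalue:.3g}")
```

With a single predictor, `result.rvalue` equals the Pearson correlation between the two columns, so its square is the share of variance in ‘popularity’ that the line accounts for.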

Conclusion

In conclusion, there is a statistically significant relationship between ‘instrumentalness’ and ‘popularity’ of songs on Spotify. When comparing the mean ‘popularity’ score of songs with high instrumentalness to that of songs with low instrumentalness, I found that the two means differ by more than chance alone would plausibly explain.

When applying a two-sample t-test to investigate the statistical significance of this difference, I found that the t-statistic is 68.88 and the p-value is 0. This indicates that at the 5% significance level (95% confidence), I reject the null hypothesis and conclude that there is a statistically significant relationship between ‘instrumentalness’ and ‘popularity’ of songs on Spotify.

Further research should examine the distribution plots more closely, because songs with low instrumentalness and songs with high instrumentalness show similarly shaped ‘popularity’ distributions.
