This project analyzes the Spotify Music Dataset to uncover key trends in popular music. Using Python, we perform data cleaning, visualization, and pattern discovery to understand how factors like danceability, energy, and duration impact song popularity.
Key Objectives:
Identify the most popular genres on Spotify.
Analyze the relationship between danceability and energy.
Examine the distribution of song durations.
Discover the top artists with the most songs on Spotify.
Tools:
Python (Pandas, Matplotlib, Seaborn) – Data manipulation & visualization
Jupyter Notebook – Interactive coding & analysis
GitHub Repository – Full project documentation & code
Dataset:
https://www.kaggle.com/datasets/maharshipandya/-spotify-tracks-dataset
Let's first load the dataset and do some data cleaning.
import pandas as pd
# Load the dataset
df = pd.read_csv("dataset.csv")
df.head()
Next, check for missing values and remove duplicate records.
print(df.isnull().sum()) # Check missing values
df.drop_duplicates(inplace=True) # Remove duplicate rows
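One thing worth checking here: drop_duplicates compares every column, so if the CSV carries a saved row-index column, no two rows will ever match and nothing gets removed. The sketch below is only an illustration on a tiny made-up sample; the helper name count_duplicates and the column name "Unnamed: 0" (the name pandas gives an unnamed header column) are assumptions, not something taken from the project code.

```python
import pandas as pd


def count_duplicates(frame, ignore=()):
    """Count rows that repeat an earlier row, skipping the columns in `ignore`."""
    cols = [c for c in frame.columns if c not in ignore]
    return int(frame.duplicated(subset=cols).sum())


# Tiny made-up sample; "Unnamed: 0" mimics a saved row index (assumed name).
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "track_name": ["Song A", "Song A", "Song B"],
    "artists": ["X", "X", None],
})

missing_per_column = sample.isnull().sum()           # missing values per column
dupes_all_columns = count_duplicates(sample)         # index column hides the repeat
dupes_ignoring_index = count_duplicates(sample, ignore=["Unnamed: 0"])

print(missing_per_column)
print(dupes_all_columns, dupes_ignoring_index)
```

On the real data, the same helper could be called on df with whichever identifier-like columns you want to leave out of the comparison.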
While looking at the dataset, I saw that the track_id column contains opaque identifier strings that add nothing to the analysis. It's a good idea to remove such columns before starting. I also fill the missing artist, album, and track names with a placeholder.
# Drop track_id: it is only an identifier and has nothing to analyze
df.drop(columns=["track_id"], inplace=True)
# Fill missing text fields with a placeholder value
df.fillna({"artists": "Not Available", "album_name": "Not Available", "track_name": "Not Available"}, inplace=True)
Now let's analyze the data and try to identify some patterns. First, I'm going to find the top 5 genres on Spotify by number of tracks.
import matplotlib.pyplot as plt
import seaborn as sns
# Get the top 5 genres
top_5_genres = df["track_genre"].value_counts().head(5)
plt.figure(figsize=(10, 5))
sns.countplot(y=df["track_genre"], order=top_5_genres.index, hue=df["track_genre"], palette="coolwarm", legend=False)
plt.title("Top 5 Most Popular Genres on Spotify")
plt.xlabel("Count")
plt.ylabel("Genre")
plt.show()
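The plot above ranks genres by how many tracks each one has. If "most popular" is meant in the sense of the popularity score, a different ranking may be more informative. Here is a minimal sketch of that idea, assuming the dataset has a numeric popularity column (the column name and the helper top_genres_by_popularity are assumptions for illustration), run on made-up numbers:

```python
import pandas as pd


def top_genres_by_popularity(frame, n=5):
    """Rank genres by their mean popularity score, highest first."""
    return (
        frame.groupby("track_genre")["popularity"]
        .mean()
        .sort_values(ascending=False)
        .head(n)
    )


# Made-up sample, not real Spotify values.
sample = pd.DataFrame({
    "track_genre": ["pop", "pop", "jazz", "jazz", "ambient"],
    "popularity": [80, 60, 40, 50, 10],
})

ranking = top_genres_by_popularity(sample, n=2)
print(ranking)
```

The mean is only one possible choice; the median would be less sensitive to a handful of very popular tracks in a genre.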
Now I'm going to look at the distribution of song durations, converted from milliseconds to minutes.
plt.figure(figsize=(8, 5))
sns.histplot(df["duration_ms"] / 60000, bins=30, kde=True, color="purple")
plt.title("Distribution of Song Durations")
plt.xlabel("Duration (Minutes)")
plt.ylabel("Count")
plt.show()
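The histogram shows the shape of the distribution, but not the average itself. A rough sketch of how the mean, the median, and the share of songs in a given range could be computed is below; the helper name duration_summary and the sample values are made up for illustration.

```python
import pandas as pd


def duration_summary(duration_ms, low=2, high=5):
    """Summarize durations in minutes; `low` and `high` bounds are inclusive."""
    minutes = duration_ms / 60000  # milliseconds to minutes
    return {
        "mean_min": float(minutes.mean()),
        "median_min": float(minutes.median()),
        "share_in_range": float(minutes.between(low, high).mean()),
    }


# Made-up durations of 2, 3, 4, and 10 minutes.
sample_ms = pd.Series([120000, 180000, 240000, 600000])
summary = duration_summary(sample_ms)
print(summary)
```

With the real data, passing df["duration_ms"] to the same function would give the numbers behind the histogram.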
Now let's see which artists have the most songs on Spotify.
top_artists = df["artists"].value_counts().head(10)
plt.figure(figsize=(10, 5))
sns.barplot(x=top_artists.index, y=top_artists.values, hue=top_artists.index, palette="Blues_r", legend=False)
plt.xticks(rotation=45)
plt.title("Top 10 Artists with Most Songs on Spotify")
plt.xlabel("Artist")
plt.ylabel("Number of Songs")
plt.show()
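The count above treats each value in the artists column as a single artist. If collaborations are stored as one string with a separator, each collaboration is counted as its own "artist". The sketch below assumes a ";" separator (an assumption to verify against the dataset description) and uses a made-up sample; count_individual_artists is a hypothetical helper.

```python
import pandas as pd


def count_individual_artists(artists, sep=";"):
    """Split multi-artist strings on `sep` and count each artist separately."""
    return (
        artists.dropna()
        .str.split(sep)
        .explode()
        .str.strip()
        .value_counts()
    )


# Made-up sample: "A;B" stands for a collaboration between artists A and B.
sample = pd.Series(["A;B", "A", "B;C", "A", None])
counts = count_individual_artists(sample)
print(counts)
```

Note that any placeholder used to fill missing names, such as "Not Available", would be counted like an artist, so it may be worth filtering it out first.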
Key Findings:
Acoustic, afrobeat, alternative rock, alternative, and ambient are the top 5 genres by number of tracks.
Most songs are between 2 and 5 minutes long.
The Beatles are the top artist by number of songs, followed by George Jones and Stevie Wonder.
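One of the objectives, the relationship between danceability and energy, is not covered by the plots in this post. A minimal sketch of how it could be quantified is a Pearson correlation between the two columns, assuming they are named danceability and energy; the helper name and the sample values are made up.

```python
import pandas as pd


def danceability_energy_corr(frame):
    """Pearson correlation between the danceability and energy columns."""
    return float(frame["danceability"].corr(frame["energy"]))


# Made-up, perfectly linear sample, so the correlation comes out at 1.
sample = pd.DataFrame({
    "danceability": [0.2, 0.4, 0.6, 0.8],
    "energy": [0.3, 0.5, 0.7, 0.9],
})

r = danceability_energy_corr(sample)
print(round(r, 3))
```

A scatter plot, for example with sns.scatterplot, would complement the single number, since a correlation coefficient can hide non-linear patterns.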
GitHub Link:
https://github.com/isuri-balasooriya2/TheMathLab/tree/main/Spotify_Music_Analysis
Conclusion & Future Work:
This project demonstrates my ability to analyze and visualize real-world music data, providing insights into trends in popular songs. I hope to extend this project to include predicting song popularity using machine learning models.