Project Overview
The aim of this project is to practice data analysis on a large dataset. IMDB user reviews consist of natural language, and by using Python libraries I aim to clean, format and analyze the data to identify the general sentiment conveyed. Finally, I plan to visualize the data for better understanding.
Project Objectives:
Clean and prepare a large dataset for analysis
Further clean the data using re, nltk tokenization and nltk stopword removal
Use TextBlob to derive a sentiment score and categorize each review as positive, negative or neutral
Visualize the data to examine patterns
Tools:
Python (Libraries used: Pandas, Matplotlib, Seaborn, re, nltk)
Jupyter Notebook
GitHub Repository
Dataset: Kaggle “IMDB Movie Reviews Dataset” -
https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
This dataset has 50K records, which makes it one of the biggest datasets I’ve worked with so far. First, let’s load the dataset using Python.
import pandas as pd
df = pd.read_csv("IMDB Dataset.csv")
df.head()
Working with a larger dataset, I’ve learned a lot about the tooling as well as the data analysis itself. The first hurdle was uploading the dataset to GitHub, since the web interface only accepts files up to 25MB. I was able to resolve this by using the command line. If you are in a similar situation, you can use the commands below to upload a larger file. First, copy your dataset file into the repository folder on your local machine.
git add .
git commit -m "Adding large file"
git push origin main
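For reference, the 25MB cap only applies to uploads through the web interface; a regular git push accepts files up to GitHub’s 100MB per-file limit, and anything larger than that would need Git LFS.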
Okay, now we have our dataset ready. Let’s clean the data. First I’m going to check whether there are any empty fields and drop them.
df.isnull().sum()   # check for empty fields
df = df.dropna()    # drop any rows that contain them
My dataset already had a column named sentiment, which I’m going to remove so that I can derive the sentiment myself later.
df = df.drop("sentiment", axis=1)
Now we have a basically cleaned dataset, but further cleaning is required. We are dealing with movie reviews written in natural language, and on a platform like IMDB users type their reviews in many different ways. Reviews can include special characters, symbols or numbers. To make analysis easier, I’m going to do the following:
Convert the text to lowercase
Remove special characters and numbers
Tokenization
Removing stopwords
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
nltk.download('punkt')
# Build the stopword set once so the lookup inside the loop is fast
stop_words = set(stopwords.words('english'))

# Function to clean text
def clean_text(text):
    text = text.lower()                       # Convert to lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text)   # Remove special characters & numbers
    tokens = word_tokenize(text)              # Tokenization
    tokens = [word for word in tokens if word not in stop_words]  # Remove stopwords
    return " ".join(tokens)
# Apply cleaning function
df["cleaned_review"] = df["review"].apply(clean_text)
df.head()
You will notice that we used two new libraries in the code above: re and nltk.
re is a built-in Python module that provides support for regular expressions. Regular expressions can be used to match characters, repeat patterns, perform string substitutions, and check whether a string contains a specified search pattern.
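For example, the re.sub call in clean_text above replaces everything that isn’t a letter or whitespace with an empty string. A minimal illustration on a made-up review snippet:
import re

sample = "This movie was 10/10 -- absolutely brilliant!!!"
print(re.sub(r'[^a-zA-Z\s]', '', sample.lower()))
# digits, slashes and punctuation are stripped; the leftover extra spaces are harmless once we tokenize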
nltk, or the Natural Language Toolkit, is a suite of libraries and programs for symbolic and statistical natural language processing of English, written in Python. We are going to rely heavily on nltk for this dataset.
First we converted our text to lowercase to make things easier. Then we removed special characters and numbers using re. After that we performed tokenization, which means breaking a piece of text into smaller, meaningful units called "tokens". These can be individual words, punctuation marks, or even characters, allowing machines to understand and analyze the text by dividing it into manageable parts. The reason we are using tokenization will become clear in the next step.
Removing stopwords is the process of removing very common words from text data, which is done to improve the performance of natural language processing (NLP) tasks. Words like “and” and “is” don’t add much value to the overall text or its meaning, so it’s important to remove them from your data before analyzing it.
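As a quick illustration of these two steps, here is what the same nltk functions do to a made-up sentence:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

sample = "the acting is wonderful and the plot keeps you guessing"
tokens = word_tokenize(sample)                     # split the sentence into word tokens
stop_words = set(stopwords.words('english'))
print([w for w in tokens if w not in stop_words])  # ['acting', 'wonderful', 'plot', 'keeps', 'guessing']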
Now we have a properly cleaned dataset ready to be analyzed.
from textblob import TextBlob
# Function to get sentiment score
def get_sentiment(text):
    return TextBlob(text).sentiment.polarity
# Apply sentiment function
df["sentiment_score"] = df["cleaned_review"].apply(get_sentiment)
# Categorize as Positive, Negative, or Neutral
df["sentiment"] = df["sentiment_score"].apply(lambda x: "Positive" if x > 0 else ("Negative" if x < 0 else "Neutral"))
df.head()
TextBlob is a Python library for processing text. It’s used for natural language processing tasks and can help you understand a text’s grammatical structure. One of its features is sentiment analysis, which is what we have used here: the polarity score ranges from -1 (most negative) to 1 (most positive), with 0 being neutral.
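A minimal check on two made-up reviews shows how the polarity score behaves (the exact values depend on the TextBlob version and its lexicon):
from textblob import TextBlob

print(TextBlob("an absolutely brilliant and moving film").sentiment.polarity)  # should be positive (> 0)
print(TextBlob("a boring, terrible waste of time").sentiment.polarity)         # should be negative (< 0)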
Once the sentiment score is determined for each record in the dataset, it’s time to visualize the results.
import matplotlib.pyplot as plt
import seaborn as sns
plt.figure(figsize=(6,4))
sns.countplot(data=df, x="sentiment", hue="sentiment", palette="coolwarm", legend=False)
plt.title("Distribution of Sentiments in Movie Reviews")
plt.xlabel("Sentiment")
plt.ylabel("Count")
plt.show()
# Calculate review length
df["review_length"] = df["review"].apply(lambda x: len(x.split()))
# Plot sentiment trends by review length
plt.figure(figsize=(10, 5))
sns.boxplot(data=df, x="sentiment", y="review_length", hue="sentiment", palette="coolwarm", legend=False)
plt.xlabel("Sentiment")
plt.ylabel("Review Length (Word Count)")
plt.title("Sentiment Trend Across Review Length")
plt.show()
The above code looks for a trend in sentiment based on review length.
Key Findings:
The dataset contained positive, negative and neutral reviews, but most leaned towards positive or negative
The average length of negative reviews turned out to be lower than the average length of positive reviews (a quick check is shown below)
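Both observations can be verified directly from the dataframe; a small sketch, assuming the df built above is still in memory:
# Number of reviews in each sentiment category
print(df["sentiment"].value_counts())

# Average review length (in words) for each sentiment category
print(df.groupby("sentiment")["review_length"].mean())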
GitHub Link:
https://github.com/isuri-balasooriya2/TheMathLab/tree/main/Sentiment_Analysis
Conclusion & Future Work:
The main objective of this project was to take a large dataset and prepare it for analysis. I managed this by using re, nltk tokenization and nltk stopword removal. I then applied NLP with Python libraries to analyze the user reviews and derive a sentiment for each, categorized as positive, negative or neutral.