IMDB Sentiment Analysis

Project Overview

This project trains a sentiment analysis model that classifies movie reviews from the IMDB dataset as either positive or negative. The dataset contains 50,000 movie reviews, each labeled with its sentiment. The objective is to apply Natural Language Processing (NLP) techniques to preprocess the text data, vectorize it, and then train a machine learning model for sentiment classification.

Features

  • Clean and preprocess textual data (removing HTML tags, punctuation, stop words, etc.).
  • Vectorize the text using TF-IDF.
  • Train a Logistic Regression model to classify the sentiment of the reviews.
  • Evaluate the model using metrics like accuracy, precision, recall, and F1-score.
  • Visualize model performance using a confusion matrix and ROC curve.
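
The cleaning step in the first bullet can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual code: the function name `clean_review` and the tiny `STOP_WORDS` set are assumptions made for the example. The project itself lists NLTK for preprocessing, which provides a much fuller stop word list.

```python
import html
import re
import string

# Illustrative stop word set (an assumption for this sketch); it is far
# smaller than the English stop word list available through NLTK.
STOP_WORDS = {"a", "an", "the", "is", "was", "it", "this", "and", "of", "to", "in"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_review(text: str) -> str:
    """Lowercase a review and strip HTML tags, punctuation, and stop words."""
    text = html.unescape(text)             # decode entities such as &amp;
    text = re.sub(r"<[^>]+>", " ", text)   # drop tags such as <br />
    text = text.lower().translate(PUNCT_TO_SPACE)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)


print(clean_review("This movie was GREAT!<br /><br />Loved it &amp; the cast."))
# -> movie great loved cast
```

Replacing punctuation with spaces rather than deleting it keeps words apart when a tag or symbol was the only separator between them.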

Technologies Used

  • Python: Core programming language.
  • Jupyter Notebook: For developing and running the analysis interactively.
  • Scikit-learn: For model building and evaluation.
  • Pandas: For data manipulation.
  • Matplotlib: For data visualization.
  • NLTK (Natural Language Toolkit): For text preprocessing.
  • Joblib: For saving the trained machine learning model.
  • Wordcloud: For creating word cloud visualizations.
  • Git: Version control for tracking changes.
  • GitHub: Repository for storing and sharing the project.

Data

The dataset used in this project is the IMDB Dataset of 50K Movie Reviews, available on Kaggle.

  • The dataset consists of two columns:
    • review: The text of the movie review.
    • sentiment: The label indicating whether the review is positive or negative.
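
A minimal sketch of loading a file with this two-column layout in pandas and encoding the labels as integers. The in-memory sample stands in for the Kaggle download so the snippet runs on its own. The `label` column name and the file name mentioned in the comment are assumptions, not taken from the notebook.

```python
import io

import pandas as pd

# Two-row stand-in for the Kaggle CSV. With the real download you would pass
# its path instead, e.g. pd.read_csv("IMDB Dataset.csv") (file name assumed).
sample_csv = io.StringIO(
    "review,sentiment\n"
    '"A wonderful, moving film.",positive\n'
    '"Dull plot and wooden acting.",negative\n'
)

df = pd.read_csv(sample_csv)

# Encode the text labels as integers: positive -> 1, negative -> 0.
df["label"] = df["sentiment"].map({"positive": 1, "negative": 0})

print(df[["sentiment", "label"]])
print(df["sentiment"].value_counts())
```

Checking `value_counts()` on the full dataset is a quick way to confirm how the two classes are balanced before training.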

Installation

To run this project locally, follow these steps. Running it in Jupyter Notebook gives the best visualization.

  1. Clone the repository:
    git clone https://github.com/Sashank-Singh/IMDB-Sentiment-Analysis.git
    cd IMDB-Sentiment-Analysis

  2. Install the required dependencies:
    pip install nltk scikit-learn joblib matplotlib wordcloud pandas


## Usage

### Running the Notebook:

The Jupyter Notebook is organized into several sections to guide you through the entire process of sentiment analysis:

1. **Data Preprocessing**:
   - Cleaning the dataset by removing HTML tags, special characters, and stop words.
   - Applying lemmatization to reduce words to their base form.
   
2. **Feature Extraction using TF-IDF**:
   - Converting text data into numerical features using Term Frequency-Inverse Document Frequency (TF-IDF).

3. **Model Training and Evaluation**:
   - Training a Logistic Regression model on the vectorized data.
   - Evaluating the model using metrics such as accuracy, precision, recall, and F1-score.
   
4. **Visualizations**:
   - Displaying a Confusion Matrix and ROC Curve to understand the model's performance in detail.
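
Steps 2 to 4 can be sketched end to end with scikit-learn. This is an illustrative outline under stated assumptions rather than the notebook's code: the toy corpus, the 75/25 split, `random_state=42`, and `max_iter=1000` are all choices made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned reviews (1 = positive, 0 = negative).
reviews = [
    "an excellent film with a wonderful cast",
    "i loved every minute of this great movie",
    "brilliant acting and a touching story",
    "a fantastic and enjoyable experience",
    "wonderful direction and superb performances",
    "great story and excellent pacing",
    "i really enjoyed this delightful film",
    "superb movie with a brilliant script",
    "a terrible film with an awful cast",
    "i hated every minute of this boring movie",
    "poor acting and a dull story",
    "a horrible and tedious experience",
    "awful direction and weak performances",
    "boring story and terrible pacing",
    "i really disliked this dreadful film",
    "weak movie with a horrible script",
]
labels = [1] * 8 + [0] * 8

# Stratify so both classes appear in the held-out set.
X_train, X_test, y_train, y_test = train_test_split(
    reviews, labels, test_size=0.25, stratify=labels, random_state=42
)

# TF-IDF features feed directly into the Logistic Regression classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

print("accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred, labels=[0, 1]))
print(classification_report(y_test, y_pred, zero_division=0))
print("ROC AUC:", roc_auc_score(y_test, y_score))
```

Keeping the vectorizer inside the pipeline means it is fitted on the training split only, so no vocabulary or document-frequency information leaks from the test reviews. For plots, recent scikit-learn versions provide `ConfusionMatrixDisplay.from_predictions` and `RocCurveDisplay.from_predictions`, which take the same `y_test`, `y_pred`, and `y_score` values.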

### Train and Test the Model:

The notebook is meant to be modified. You can swap in a different classifier or change the preprocessing steps, then re-run the training and evaluation cells to compare the results against the Logistic Regression baseline.
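
One way to experiment with other classifiers is to keep the TF-IDF step fixed and vary only the estimator behind it. The sketch below is hypothetical: the helper `build_model`, the candidate list, and the toy data are assumptions for illustration and do not come from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_model(classifier):
    """Wrap any scikit-learn classifier behind the same TF-IDF step."""
    return make_pipeline(TfidfVectorizer(), classifier)


# Candidate estimators to compare (an illustrative selection).
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(),
}

# Toy training data (1 = positive, 0 = negative).
train_reviews = [
    "a great and enjoyable film",
    "wonderful acting and a great story",
    "i loved this excellent movie",
    "a terrible and boring film",
    "awful acting and a dull story",
    "i hated this horrible movie",
]
train_labels = [1, 1, 1, 0, 0, 0]
held_out = ["an enjoyable story with excellent acting", "a dull and horrible film"]

predictions = {}
for name, clf in candidates.items():
    model = build_model(clf)
    model.fit(train_reviews, train_labels)
    predictions[name] = model.predict(held_out).tolist()
    print(name, predictions[name])
```

On the real dataset you would score each candidate on the same held-out split, using the same metrics as the baseline, so the comparison stays like for like.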
