Conversation

@MadhurJain06
Adds a new end-to-end NLP example demonstrating binary sentiment classification on the IMDB dataset using a Sequential model with an LSTM layer.

This submission addresses previous review comments by:

Refactoring data loading to use keras.utils.get_file for portability (no reliance on local files or Kaggle setup).

Replacing keras.preprocessing.text.Tokenizer and pad_sequences with the modern TextVectorization layer, making the model end-to-end (it accepts raw strings for inference).

Correcting several other minor issues.
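
The end-to-end pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the layer sizes, vocabulary size, and sample texts are assumptions chosen to keep it small; the key point is that TextVectorization is the first layer of the Sequential model, so inference takes raw strings directly.

```python
import keras
from keras import layers

# Assumed toy hyperparameters, not the values used in the PR
max_tokens = 1000
sequence_length = 20

vectorize_layer = layers.TextVectorization(
    max_tokens=max_tokens, output_sequence_length=sequence_length
)
# adapt() builds the vocabulary from raw text
vectorize_layer.adapt(["a tiny sample review", "another short review"])

model = keras.Sequential([
    vectorize_layer,                     # raw strings in, token ids out
    layers.Embedding(max_tokens, 8),
    layers.LSTM(4),
    layers.Dense(1, activation="sigmoid"),
])

# Inference on a plain Python string, no separate tokenization step
preds = model.predict(["This movie was fantastic."])
```

Because the sigmoid output is a probability, `preds` has shape (1, 1) with a value in [0, 1].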

@gemini-code-assist

Summary of Changes

Hello @MadhurJain06, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive new example for Natural Language Processing within the repository. Its primary goal is to showcase a modern and portable approach to binary sentiment classification using the IMDB dataset. By leveraging keras.utils.get_file for data handling and the TextVectorization layer, the example provides an end-to-end solution that is easy to run and understand, accepting raw text inputs directly into the model.

Highlights

  • New NLP Example: A new end-to-end NLP example has been added, demonstrating binary sentiment classification on the IMDB dataset.
  • Portable Data Loading: Data loading now uses keras.utils.get_file for improved portability, removing reliance on local files or specific setups.
  • Modern Text Preprocessing: The example utilizes the modern TextVectorization layer, replacing older keras.preprocessing utilities, to create an end-to-end model that accepts raw strings for inference.
  • LSTM Model Architecture: The model architecture is a Sequential model incorporating an LSTM layer for sentiment analysis.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist (bot) left a comment


Code Review

This pull request adds a valuable end-to-end example for sentiment analysis using TextVectorization and an LSTM model. The code is well-structured and demonstrates modern Keras practices. However, I've found a critical issue in the data loading logic that prevents the script from running correctly. I've also included a few suggestions to improve code consistency and cleanup, which will enhance the quality of the example.

dataset_path = keras.utils.get_file(
    "aclImdb_v1.tar.gz", data_url, untar=True, cache_dir=".", cache_subdir=""
)
main_dir = os.path.join(dataset_path, "aclImdb")

critical

The path to the main data directory, main_dir, is constructed incorrectly. The dataset_path variable holds the path to the downloaded archive file (e.g., ./aclImdb_v1.tar.gz), but the aclImdb directory is extracted into the parent directory of the archive, not within the archive file path itself. This will cause the script to fail with a FileNotFoundError. The path should be constructed using the directory part of dataset_path.

main_dir = os.path.join(os.path.dirname(dataset_path), "aclImdb")
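
The difference can be verified with a quick stdlib-only check, assuming (as described above) that keras.utils.get_file returns the archive path, e.g. "./aclImdb_v1.tar.gz":

```python
import os

# Hypothetical archive path as described in the review comment
dataset_path = "./aclImdb_v1.tar.gz"

# Joining onto the archive path treats the file itself as a directory:
wrong = os.path.join(dataset_path, "aclImdb")
# Joining onto its parent directory points at the extracted dataset:
right = os.path.join(os.path.dirname(dataset_path), "aclImdb")

print(wrong)  # the archive file with "aclImdb" appended as if it were a directory
print(right)  # → ./aclImdb
```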

import pandas as pd
from keras import layers
from keras.models import Sequential
from keras.layers import TextVectorization # Modern Keras text preprocessing

medium

This direct import of TextVectorization is redundant because the layers module, which contains it, is already imported on line 15. To improve code clarity and avoid unnecessary imports, you can remove this line and use layers.TextVectorization instead (see related comment on line 81).

embedding_dim = 128 # Size of the output vector for each word

# 1. Create the TextVectorization layer
vectorize_layer = TextVectorization(

medium

For consistency with how other layers like layers.Embedding are used in this file, TextVectorization should be accessed via the imported layers module. This change, combined with removing the redundant import on line 17, makes the code more consistent and readable.

vectorize_layer = layers.TextVectorization(

Comment on lines 147 to 168
def predict_sentiment(review):
    """Predicts sentiment for a raw text review using the end-to-end model."""
    # The model accepts a list/array of raw strings directly
    prediction = model.predict([review])

    # Sigmoid output is a probability
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    probability = prediction[0][0]
    return sentiment, probability

# Examples
print("\n### Predicting Values")
examples = [
    "This movie was fantastic. I loved it.",
    "This movie was not that good",
    "Great movie but could have added a better action scene",
    "Mid movie"
]

for review in examples:
    sentiment, prob = predict_sentiment(review)
    print(f"Review: '{review[:30]}...' -> Sentiment: {sentiment} ({prob:.2f})")

medium

The current implementation calls model.predict() inside a loop for each example review. This is inefficient as it doesn't take advantage of batch processing. It's better practice to pass the entire list of examples to model.predict() once, which will perform inference on all of them in a single batch. This is more performant and demonstrates a more typical inference workflow.

def predict_sentiment(review):
    """Predicts sentiment for a raw text review using the end-to-end model."""
    # The model accepts a list/array of raw strings directly
    prediction = model.predict([review])
    
    # Sigmoid output is a probability
    probability = prediction[0][0]
    sentiment = "positive" if probability > 0.5 else "negative"
    return sentiment, probability

# Examples
print("\n### Predicting Values")
examples = [
    "This movie was fantastic. I loved it.",
    "This movie was not that good",
    "Great movie but could have added a better action scene",
    "Mid movie"
]

# Predict on the batch of examples for efficiency
predictions = model.predict(examples)

for review, prediction in zip(examples, predictions):
    prob = prediction[0]
    sentiment = "positive" if prob > 0.5 else "negative"
    print(f"Review: '{review[:30]}...' -> Sentiment: {sentiment} ({prob:.2f})")

Comment on lines 170 to 172
# Clean up the downloaded directory
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)

medium

The cleanup logic only removes the extracted dataset directory. The downloaded archive file (aclImdb_v1.tar.gz) remains in the working directory. For a cleaner example that doesn't leave artifacts, the script should also remove this downloaded file.

# Clean up the downloaded directory and archive
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
if os.path.exists(dataset_path):
    os.remove(dataset_path)
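
The suggested cleanup can be exercised end to end on hypothetical stand-in paths (a temporary directory instead of the real download location), which confirms it removes both artifacts:

```python
import os
import shutil
import tempfile

# Stand-in paths mimicking the dataset artifacts (assumed names)
work = tempfile.mkdtemp()
main_dir = os.path.join(work, "aclImdb")
dataset_path = os.path.join(work, "aclImdb_v1.tar.gz")
os.makedirs(main_dir)
open(dataset_path, "w").close()

# Same pattern as the suggestion: remove extracted directory and archive
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
if os.path.exists(dataset_path):
    os.remove(dataset_path)

print(os.path.exists(main_dir), os.path.exists(dataset_path))  # → False False
```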

@MadhurJain06 (Author) Dec 2, 2025

I've now pushed the updated code. All automated and preliminary review comments have been addressed, including:

  • Fixing the data path construction (os.path.dirname).
  • Using layers.TextVectorization for consistent layer access.
  • Optimizing prediction with batch inference.

All CI checks should now be passing. Ready for your review!
