[NLP] New example: IMDB Sentiment Analysis with TextVectorization and LSTM #2215

base: master

Conversation
Summary of Changes (Gemini Code Assist): This pull request introduces a comprehensive new example for Natural Language Processing within the repository. Its primary goal is to showcase a modern and portable approach to binary sentiment classification on the IMDB dataset.
Code Review
This pull request adds a valuable end-to-end example for sentiment analysis using TextVectorization and an LSTM model. The code is well-structured and demonstrates modern Keras practices. However, I've found a critical issue in the data loading logic that prevents the script from running correctly. I've also included a few suggestions to improve code consistency and cleanup, which will enhance the quality of the example.
examples/nlp/imdb_lstm_sentiment.py (Outdated)

```python
dataset_path = keras.utils.get_file(
    "aclImdb_v1.tar.gz", data_url, untar=True, cache_dir=".", cache_subdir=""
)
main_dir = os.path.join(dataset_path, "aclImdb")
```
The path to the main data directory, main_dir, is constructed incorrectly. The dataset_path variable holds the path to the downloaded archive file (e.g., ./aclImdb_v1.tar.gz), but the aclImdb directory is extracted into the parent directory of the archive, not within the archive file path itself. This will cause the script to fail with a FileNotFoundError. The path should be constructed using the directory part of dataset_path.
Suggested change:

```python
main_dir = os.path.join(os.path.dirname(dataset_path), "aclImdb")
```
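To see why the directory part is needed, here is a minimal stdlib sketch (the paths are illustrative; `get_file` with `untar=True` extracts next to the archive):

```python
import os

# get_file returns the path to the downloaded archive itself,
# e.g. "./aclImdb_v1.tar.gz".
dataset_path = "./aclImdb_v1.tar.gz"

# Joining the archive path with "aclImdb" yields a path *under* the archive
# file, which does not exist on disk and raises FileNotFoundError later:
wrong = os.path.join(dataset_path, "aclImdb")
print(wrong)  # ./aclImdb_v1.tar.gz/aclImdb

# The extracted directory lives next to the archive, so join with its parent:
right = os.path.join(os.path.dirname(dataset_path), "aclImdb")
print(right)  # ./aclImdb
```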
examples/nlp/imdb_lstm_sentiment.py (Outdated)

```python
import pandas as pd
from keras import layers
from keras.models import Sequential
from keras.layers import TextVectorization  # Modern Keras text preprocessing
```
examples/nlp/imdb_lstm_sentiment.py (Outdated)

```python
embedding_dim = 128  # Size of the output vector for each word

# 1. Create the TextVectorization layer
vectorize_layer = TextVectorization(
```
For consistency with how other layers like layers.Embedding are used in this file, TextVectorization should be accessed via the imported layers module. This change, combined with removing the redundant import on line 17, makes the code more consistent and readable.
Suggested change:

```python
vectorize_layer = layers.TextVectorization(
```

examples/nlp/imdb_lstm_sentiment.py (Outdated)

```python
def predict_sentiment(review):
    """Predicts sentiment for a raw text review using the end-to-end model."""
    # The model accepts a list/array of raw strings directly
    prediction = model.predict([review])

    # Sigmoid output is a probability
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    probability = prediction[0][0]
    return sentiment, probability


# Examples
print("\n### Predicting Values")
examples = [
    "This movie was fantastic. I loved it.",
    "This movie was not that good",
    "Great movie but could have added a better action scene",
    "Mid movie"
]

for review in examples:
    sentiment, prob = predict_sentiment(review)
    print(f"Review: '{review[:30]}...' -> Sentiment: {sentiment} ({prob:.2f})")
```
The current implementation calls model.predict() inside a loop for each example review. This is inefficient as it doesn't take advantage of batch processing. It's better practice to pass the entire list of examples to model.predict() once, which will perform inference on all of them in a single batch. This is more performant and demonstrates a more typical inference workflow.
Suggested change:

```python
def predict_sentiment(review):
    """Predicts sentiment for a raw text review using the end-to-end model."""
    # The model accepts a list/array of raw strings directly
    prediction = model.predict([review])
    # Sigmoid output is a probability
    probability = prediction[0][0]
    sentiment = "positive" if probability > 0.5 else "negative"
    return sentiment, probability


# Examples
print("\n### Predicting Values")
examples = [
    "This movie was fantastic. I loved it.",
    "This movie was not that good",
    "Great movie but could have added a better action scene",
    "Mid movie"
]

# Predict on the batch of examples for efficiency
predictions = model.predict(examples)
for review, prediction in zip(examples, predictions):
    prob = prediction[0]
    sentiment = "positive" if prob > 0.5 else "negative"
    print(f"Review: '{review[:30]}...' -> Sentiment: {sentiment} ({prob:.2f})")
```
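The batched pattern can be illustrated without Keras at all. In this sketch, `StubModel` and its fixed scores are hypothetical stand-ins mimicking the `[[p], [p], ...]` shape `model.predict()` returns for a batch:

```python
# Hypothetical stand-in for the trained Keras model: predict() takes a list of
# raw strings and returns one sigmoid-style probability per input.
class StubModel:
    def predict(self, reviews):
        # Fixed illustrative scores, one per review in the batch.
        scores = [0.91, 0.12, 0.67, 0.40]
        return [[s] for s in scores[: len(reviews)]]


model = StubModel()
examples = [
    "This movie was fantastic. I loved it.",
    "This movie was not that good",
    "Great movie but could have added a better action scene",
    "Mid movie",
]

# One predict() call covers the whole batch instead of one call per review.
predictions = model.predict(examples)
for review, prediction in zip(examples, predictions):
    prob = prediction[0]
    sentiment = "positive" if prob > 0.5 else "negative"
    print(f"Review: '{review[:30]}...' -> Sentiment: {sentiment} ({prob:.2f})")
```

With a real Keras model the batched call also lets the backend vectorize inference across the whole batch, which is where the performance win comes from.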
examples/nlp/imdb_lstm_sentiment.py (Outdated)

```python
# Clean up the downloaded directory
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
```
The cleanup logic only removes the extracted dataset directory. The downloaded archive file (aclImdb_v1.tar.gz) remains in the working directory. For a cleaner example that doesn't leave artifacts, the script should also remove this downloaded file.
Suggested change:

```python
# Clean up the downloaded directory and archive
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
if os.path.exists(dataset_path):
    os.remove(dataset_path)
```
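The two-step cleanup can be exercised safely in a temporary directory. This sketch uses illustrative stand-ins for the example's archive and extracted directory:

```python
import os
import shutil
import tempfile

# Illustrative stand-ins for the dataset artifacts, created inside a
# temporary directory so the sketch is safe to run anywhere.
workdir = tempfile.mkdtemp()
dataset_path = os.path.join(workdir, "aclImdb_v1.tar.gz")  # downloaded archive
main_dir = os.path.join(workdir, "aclImdb")                # extracted directory

open(dataset_path, "w").close()
os.makedirs(os.path.join(main_dir, "train"))

# Remove both the extracted directory tree and the archive file.
if os.path.exists(main_dir):
    shutil.rmtree(main_dir)
if os.path.exists(dataset_path):
    os.remove(dataset_path)

print(os.listdir(workdir))  # []
os.rmdir(workdir)
```

`shutil.rmtree` handles the directory tree, while the plain file needs `os.remove`; the `os.path.exists` guards make the cleanup idempotent.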
I've now pushed the code again. All automated and preliminary review comments have been addressed, including:
- Fixing the data path construction (`os.path.dirname`).
- Using `layers.TextVectorization` for consistent layer access.
- Optimizing prediction with batch inference.
All CI checks should now be passing. Ready for your review!
Adds a new end-to-end NLP example demonstrating binary sentiment classification on the IMDB dataset using a Sequential model with an LSTM layer.
This submission addresses previous review comments by: