The WordStream-Extension project builds upon the original WordStream tool, enhancing it with new features, datasets, and interactive visualizations. Key components include the integration of additional datasets, improvements to the existing visualization interface, and the addition of dynamic interactivity using D3.js.
- Data Acquisition and Preprocessing: Integrated new datasets and applied text preprocessing and sentiment analysis techniques.
- Extended Visualizations: Introduced new tabs for visualizing sentiment data, including SentimentCloud and SentimentStream.
- Enhanced Interactivity: Utilized D3.js for dynamic sliders and color schemes to improve user experience.
- Rotten Tomatoes Movie Reviews
- CNN News Articles
- Reddit Posts from the r/datasets Subreddit
- Data Collection: Acquired datasets from trusted sources, ensuring diversity in the text corpora.
- Data Cleaning: Performed text normalization by:
- Lowercasing text
- Removing special characters
- Handling missing values for consistency across datasets
- Keyword Extraction: Used the spaCy library in Python to identify and extract relevant keywords for visualization.
- Sentiment Analysis: Applied the VADER Sentiment Analyzer to compute sentiment scores, categorizing text into positive, neutral, and negative sentiments.
- Additional Data Transformation: Aggregated data by year to calculate average sentiment and keyword frequency, facilitating temporal analysis.
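The cleaning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual preprocessing code: the `text` field name, the helper names, and the choice to treat anything other than ASCII letters, digits, and whitespace as a "special character" are all assumptions made for the example.

```python
import re
from typing import Optional

# Assumption: "special characters" means anything other than
# ASCII letters, digits, and whitespace.
_SPECIAL = re.compile(r"[^a-z0-9\s]")
_SPACES = re.compile(r"\s+")


def normalize_text(text: Optional[str]) -> str:
    """Lowercase, strip special characters, and collapse whitespace."""
    if text is None:  # treat a missing value as empty text
        return ""
    lowered = text.lower()
    no_special = _SPECIAL.sub(" ", lowered)
    return _SPACES.sub(" ", no_special).strip()


def clean_records(records):
    """Normalize the hypothetical 'text' field and drop rows left empty."""
    cleaned = []
    for row in records:
        text = normalize_text(row.get("text"))
        if text:
            cleaned.append({**row, "text": text})
    return cleaned


if __name__ == "__main__":
    sample = [
        {"year": 2019, "text": "A GREAT film!!!"},
        {"year": 2020, "text": None},
    ]
    print(clean_records(sample))
```

Dropping rows with missing text is only one way to handle missing values; filling them with a placeholder would keep row counts stable across datasets.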
Some additional datasets (e.g., social media posts and fact-check articles) were added but proved difficult to visualize in WordStream. As a result, they are only accessible through the SentimentCloud and SentimentStream tabs.
- SentimentCloud: An interactive word cloud displaying sentiment scores for various datasets.
- SentimentStream: A temporal sentiment analysis tool that displays the evolution of sentiments over time.
- Dynamic Sliders: Interactive sliders allow users to adjust sentiment thresholds and word rankings in real time.
- Diverging Color Schemes: Switched from categorical to diverging color schemes to better represent the spectrum of sentiments.
For detailed information about the classes and functions used in the implementation, refer to the Code Documentation.
To run the application locally, follow these steps:
- Ensure you have Python installed on your machine.
- Open a terminal and execute the following command to start a simple HTTP server: `python -m http.server 8000`
- Navigate to http://localhost:8000 in your web browser to access the application.
We expanded the original WordStream with several new datasets to enhance analysis capabilities:
- Rotten Tomatoes Movie Reviews
- CNN News Articles
- Reddit Posts from the r/datasets Subreddit
Each dataset underwent extensive preprocessing:
- Data Cleaning: Standardized text formats and removed inconsistencies.
- Keyword Extraction: Identified significant terms to be visualized.
- Sentiment Analysis: Calculated sentiment scores to classify words by sentiment.
- Data Aggregation: Yearly sentiment averages and keyword frequencies were compiled to support temporal visualization.
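The yearly aggregation step might look something like the sketch below. The record layout (`year`, `sentiment`, `keywords`) and the function name are hypothetical and chosen for illustration; the project's real data format may differ.

```python
from collections import Counter, defaultdict


def aggregate_by_year(records):
    """Return {year: {"avg_sentiment": float, "keyword_counts": Counter}}.

    Each record is assumed to carry a 'year', a numeric 'sentiment'
    score, and a list of extracted 'keywords'.
    """
    scores = defaultdict(list)
    counts = defaultdict(Counter)
    for row in records:
        year = row["year"]
        scores[year].append(row["sentiment"])
        counts[year].update(row["keywords"])
    # Sort by year so downstream temporal views receive ordered data.
    return {
        year: {
            "avg_sentiment": sum(vals) / len(vals),
            "keyword_counts": counts[year],
        }
        for year, vals in sorted(scores.items())
    }


if __name__ == "__main__":
    demo = [
        {"year": 2021, "sentiment": 0.25, "keywords": ["cast"]},
        {"year": 2020, "sentiment": 0.5, "keywords": ["film", "plot"]},
        {"year": 2020, "sentiment": -0.5, "keywords": ["film"]},
    ]
    for year, summary in aggregate_by_year(demo).items():
        print(year, summary["avg_sentiment"], dict(summary["keyword_counts"]))
```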
The SentimentCloud tab presents an interactive word cloud that visualizes sentiment scores:
- Color-Coded Words: Red for negative sentiment and blue for positive sentiment.
- Dataset Selection: Choose from multiple datasets to visualize sentiment dynamics.
- Category Filtering: Filter words by categories such as 'person', 'organization', or 'country'.
- Sentiment Threshold Sliders: Adjust sliders to customize sentiment classification.
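To illustrate how category filtering and the threshold sliders could interact, here is a small sketch of the underlying logic. It assumes each word carries a `category` and a `score` in the range [-1, 1]; the ±0.05 defaults follow the cutoffs commonly suggested for VADER compound scores, and the project's actual slider defaults are not stated here.

```python
def classify(score, neg_threshold=-0.05, pos_threshold=0.05):
    """Label a score as positive, negative, or neutral.

    The two thresholds stand in for the slider values in the UI.
    """
    if score >= pos_threshold:
        return "positive"
    if score <= neg_threshold:
        return "negative"
    return "neutral"


def filter_words(words, categories=None, neg_threshold=-0.05, pos_threshold=0.05):
    """Keep words in the selected categories and attach a sentiment label.

    'words' is a list of dicts with hypothetical keys 'text',
    'category', and 'score'. Passing categories=None keeps everything.
    """
    selected = []
    for word in words:
        if categories is not None and word["category"] not in categories:
            continue
        label = classify(word["score"], neg_threshold, pos_threshold)
        selected.append({**word, "label": label})
    return selected


if __name__ == "__main__":
    words = [
        {"text": "senate", "category": "organization", "score": -0.4},
        {"text": "france", "category": "country", "score": 0.3},
        {"text": "smith", "category": "person", "score": 0.01},
    ]
    print(filter_words(words, categories={"person", "country"}))
```

Moving a slider then amounts to calling `filter_words` again with new threshold values and redrawing the cloud.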
The SentimentStream tab offers temporal analysis alongside sentiment visualization:
- Yearly Word Clouds: Separate word clouds for each year, displaying positive and negative sentiments.
- Interactive Line Charts: Click on a word to generate a line chart showing sentiment evolution over time.
- Multiple Category Comparison: Select multiple categories to compare sentiment trends across different datasets.
- Adjustable Sentiment Thresholds: Similar to the SentimentCloud tab, enabling dynamic customization.
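The data behind a per-word line chart can be derived from the yearly aggregates. The sketch below assumes a hypothetical mapping of year to per-word sentiment scores and simply skips years in which the word does not appear; how the application handles such gaps is not specified.

```python
def sentiment_series(yearly_scores, word):
    """Return [(year, score), ...] for one word, ordered by year.

    'yearly_scores' is assumed to map each year to a dict of
    {word: sentiment score}. Years without the word are omitted.
    """
    return [
        (year, scores[word])
        for year, scores in sorted(yearly_scores.items())
        if word in scores
    ]


if __name__ == "__main__":
    yearly = {
        2021: {"film": 0.2, "cast": -0.1},
        2019: {"film": -0.3},
        2020: {"cast": 0.4},
    }
    print(sentiment_series(yearly, "film"))
```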
Sentiment scores are represented with a diverging color scheme:
- Positive Sentiments: Shades of blue, with intensity proportional to sentiment score.
- Negative Sentiments: Shades of red, similarly scaled.
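As a rough illustration of a diverging mapping, the sketch below interpolates linearly from a neutral white toward blue for positive scores and toward red for negative ones. The endpoint colors, the white midpoint, and the function name are assumptions for this example; the application builds its color scales with D3.js, and the exact scheme it uses is not stated here.

```python
NEUTRAL = (255, 255, 255)   # assumed midpoint color (white)
NEGATIVE = (178, 24, 43)    # hypothetical red endpoint
POSITIVE = (33, 102, 172)   # hypothetical blue endpoint


def _lerp(start, end, t):
    """Linearly interpolate one color channel and round to an int."""
    return round(start + (end - start) * t)


def sentiment_color(score):
    """Map a score in [-1, 1] to a hex color on a red-white-blue scale.

    Intensity grows with the absolute score; out-of-range scores are clamped.
    """
    clamped = max(-1.0, min(1.0, score))
    target = POSITIVE if clamped >= 0 else NEGATIVE
    t = abs(clamped)
    r, g, b = (_lerp(n, c, t) for n, c in zip(NEUTRAL, target))
    return f"#{r:02x}{g:02x}{b:02x}"


if __name__ == "__main__":
    for s in (-1.0, -0.5, 0.0, 0.5, 1.0):
        print(s, sentiment_color(s))
```

A diverging scale of this kind keeps neutral words visually quiet while strongly polar words stand out, which a categorical palette cannot convey.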
- Rotten Tomatoes Movie Reviews Dataset
- CNN News Articles Dataset
- Reddit Dataset
- Social Media Sentiments Dataset
- Fake and Real News Dataset
