This project focuses on designing and implementing a complete data pipeline to handle streaming tweets, process the data, and visualize the results in a user-friendly web application. The system integrates several components to ensure real-time data ingestion, processing, and visualization.
- A producer program reads data from the `tweets.json` file and streams it to an Apache Kafka topic named `raw-tweets`.
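The producer's job reduces to reading `tweets.json` line by line and handing each record to Kafka. A minimal sketch of that loop, assuming one JSON object per line; the `send` callable is an indirection (it would be `producer.produce` with confluent-kafka) so the logic can be exercised without a running broker:

```python
import json
from typing import Callable, Iterable

def stream_tweets(lines: Iterable[str],
                  send: Callable[[str, bytes], None],
                  topic: str = "raw-tweets") -> int:
    """Parse each JSON line and hand it to send(topic, payload)."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue                              # skip blank lines
        tweet = json.loads(line)                  # validate the JSON
        send(topic, json.dumps(tweet).encode("utf-8"))
        count += 1
    return count
```

With confluent-kafka this would be wired up as `stream_tweets(open("tweets.json"), lambda t, v: producer.produce(t, v))` followed by `producer.flush()`.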
The pipeline includes three consumers to handle data processing:
- Consumer 1: Written in Scala with Spark, it extracts key fields (e.g., text, creation time, geo-coordinates, hashtags) from tweets in the `raw-tweets` Kafka topic and writes the processed data to the `transform-tweets` topic.
- Consumer 2: Performs sentiment analysis using Spark NLP or TextBlob on tweets from the `transform-tweets` topic, adding sentiment data and forwarding results to the `sentiment-tweets` topic.
- Consumer 3: A Python consumer that writes tweets from the `sentiment-tweets` topic into an Elasticsearch database for querying and visualization.
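Consumer 1 is Scala/Spark in the project, but the field extraction it performs can be sketched in plain Python. The tweet field names below follow the classic Twitter API payload shape and are assumptions:

```python
import json

def extract_fields(raw: bytes) -> dict:
    """Keep only the fields the downstream consumers need."""
    tweet = json.loads(raw)
    return {
        "text": tweet.get("text"),
        "created_at": tweet.get("created_at"),
        # GeoJSON point: [longitude, latitude], or None if absent
        "coordinates": (tweet.get("coordinates") or {}).get("coordinates"),
        "hashtags": [h["text"]
                     for h in tweet.get("entities", {}).get("hashtags", [])],
    }
```

The result would be re-serialized and produced to `transform-tweets`.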
- Stores processed tweets and their metadata in an Elasticsearch database.
- Implements an efficient schema design to support fast querying for tweet annotations and visualization.
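One possible index mapping matching that design, expressed as the dict a Python Elasticsearch client would send; the index and field names are assumptions, but the types follow directly from the features: `geo_point` enables the map view, `date` enables hourly/daily histograms, and `keyword` keeps hashtags exact-match and aggregatable:

```python
# Hypothetical mapping for a "tweets" index.
TWEETS_MAPPING = {
    "mappings": {
        "properties": {
            "text":       {"type": "text"},       # full-text keyword search
            "created_at": {"type": "date"},       # trend diagram buckets
            "location":   {"type": "geo_point"},  # longitude/latitude on the map
            "hashtags":   {"type": "keyword"},    # exact-match, aggregatable
            "sentiment":  {"type": "float"},      # score for the gauge
        }
    }
}
```

With the official client this would be applied once via `es.indices.create(index="tweets", body=TWEETS_MAPPING)`.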
The web application provides an interactive interface with the following features:
- Keyword Search: Allows users to enter a keyword to query tweets.
- Geospatial Visualization: Renders tweets containing the keyword on a map based on their geo-coordinates (longitude/latitude).
- Trend Diagram: Displays a temporal distribution of tweets over time with hourly and daily aggregation levels.
- Sentiment Analysis Gauge: Visualizes the average sentiment score of tweets over a specified period.
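A single Elasticsearch query can back the last three features at once: a full-text match for the keyword, a date histogram for the trend diagram, and an average metric for the sentiment gauge. A sketch of such a query body (field names like `created_at` and `sentiment` are assumptions about the index schema):

```python
def build_query(keyword: str, interval: str = "hour") -> dict:
    """Query body: keyword match + trend histogram + average sentiment."""
    return {
        "query": {"match": {"text": keyword}},
        "aggs": {
            # temporal distribution, hourly or daily buckets
            "trend": {"date_histogram": {"field": "created_at",
                                         "calendar_interval": interval}},
            # average sentiment score over the matched tweets
            "avg_sentiment": {"avg": {"field": "sentiment"}},
        },
    }
```

The Django backend would pass this to `es.search(index="tweets", body=build_query("flood", "day"))` and feed the hits' geo-coordinates to the map.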
This pipeline demonstrates the integration of modern data engineering tools and techniques to build a robust, scalable, and user-friendly system for real-time data processing and visualization.
For each Consumer, Producer and Backend-Client (Linux/macOS):

```shell
# cd into the directory that you want to run
make init
source venv/bin/activate
make install
make run
```

NOTE: For the Elasticsearch Consumer and Backend-Client, don't forget to change the password to your Elasticsearch password.

Open the Scala Consumer as a project in IntelliJ.
As for the web application:

```shell
npm i
npm start
```

On Windows, for each Consumer, Producer and Backend-Client, first recreate the virtual environment:

```shell
Remove-Item -Recurse -Force venv   # rmdir venv for cmd
py -m venv venv                    # py, python or python3
.\venv\Scripts\activate
```

To install dependencies and run each one of them:
```shell
# PythonConsumer
pip install nltk pandas scikit-learn textblob confluent-kafka
py consumer.py               # py, python or python3

# ElasticSearchConsumer
pip install confluent-kafka elasticsearch python-dotenv
py elasticSeachConsumer.py   # py, python or python3

# Producer
pip install confluent-kafka
py producer.py               # py, python or python3

# Backend-Client
pip install django django-cors-headers elasticsearch python-dotenv
py manage.py runserver       # py, python or python3
```

NOTE: For the Elasticsearch Consumer and Backend-Client, don't forget to change the password to your Elasticsearch password.
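The password lives in a `.env` file loaded with python-dotenv in the project; the same idea with only the standard library, so the consumer fails fast when it is missing (the variable name `ELASTIC_PASSWORD` is an assumption):

```python
import os

def es_password() -> str:
    """Read the Elasticsearch password from the environment, failing fast."""
    pwd = os.environ.get("ELASTIC_PASSWORD")
    if not pwd:
        raise RuntimeError("Set ELASTIC_PASSWORD before starting the consumer")
    return pwd
```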
Open the Scala Consumer as a project in IntelliJ.
As for the web application:

```shell
npm i
npm start
```

To install Kafka and ZooKeeper:

```shell
curl "https://dlcdn.apache.org/kafka/3.8.1/kafka_2.12-3.8.1.tgz" --output ~/kafka.tgz
tar -xvzf ~/kafka.tgz
rm ~/kafka.tgz
mv kafka_2.12-3.8.1 ~/kafka
```

Then to run the servers:
```shell
~/kafka/bin/zookeeper-server-start.sh ~/kafka/config/zookeeper.properties
~/kafka/bin/kafka-server-start.sh ~/kafka/config/server.properties
```

On Windows it's a bit different: download Kafka via this link, extract it, place the folder in C:\ and name it kafka, then run the servers via:

```shell
C:\kafka\bin\windows\zookeeper-server-start.bat C:\kafka\config\zookeeper.properties
C:\kafka\bin\windows\kafka-server-start.bat C:\kafka\config\server.properties
```