elisa-negrini/BDT-project


Stock Market Big Data Platform

Project Startup Guide

This project analyzes stock market trends in real time by combining financial data, macroeconomic indicators, and sentiment analysis from news and social media. The goal is to provide a one-minute-ahead forecast for a customizable selection of companies' stock prices. The prediction is displayed on an interactive dashboard, which also allows real-time monitoring of market price anomalies. This repository provides two configurations to launch the Stock Market Trend Analysis project:

  • Predefined model: continuous streaming with a pre-trained model.

  • Customizable full model training: historical data download, model training from scratch, and weekly model retraining.

Prerequisites

To start and use the project, ensure you have Docker installed on your system. Furthermore, it is essential that Docker is configured with the following minimum resources:

  • RAM: Minimum 10 GB RAM (allocated to Docker)
  • CPU: Minimum 12 CPUs (allocated to Docker)

Essential Setup

1. Clone the repository

First, clone this repository to your local system:

 git clone https://github.com/elisa-negrini/BDT-project.git
 cd BDT-project

2. Ensure Git LFS is installed

After cloning the repository, make sure to run the following commands to properly fetch large files (e.g., model.onnx) managed by Git LFS:

 sudo apt install git-lfs   # Linux (Debian/Ubuntu)
 brew install git-lfs       # macOS
 git lfs install            # all platforms (Git for Windows already ships with LFS)
 git lfs pull               # fetch LFS-tracked files such as model.onnx

3. Download the .env file

Download the provided .env file and place it in the root directory of this repository. This file contains the necessary credentials and configuration settings.

4. Alpaca Credentials

To use real stock market streaming and historical data, you need to configure your Alpaca credentials. You can obtain an API_KEY_ALPACA and an API_SECRET_ALPACA by registering through the Alpaca Trading API: https://alpaca.markets/. Alternatively, you can send an email to samuele.viola@studenti.unitn.it to receive updated credentials.

Once obtained, modify the environment variables in the .env file with your credentials (API_KEY_ALPACA and API_SECRET_ALPACA).
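A minimal sketch of the relevant portion of the .env file (variable names are the ones named above; the project's other settings are elided):

```
API_KEY_ALPACA=your_alpaca_api_key
API_SECRET_ALPACA=your_alpaca_secret_key
```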

NOTE: Stock Market Data Availability

The Alpaca stock market stream carries real data Monday to Friday from 9:30 AM to 4:00 PM US Eastern Time (ET), which corresponds to 3:30 PM to 10:00 PM Central European Summer Time (CEST). Outside these hours, synthetic data is generated to keep the flow alive, so up-to-date API_KEY_ALPACA and API_SECRET_ALPACA credentials are not strictly necessary. The other data streams (macroeconomic data, company fundamentals, Bluesky sentiment, and news sentiment) are always real.
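The real-versus-synthetic switch boils down to a market-hours check. A minimal sketch, assuming a timestamp already expressed in US Eastern Time and ignoring exchange holidays:

```python
from datetime import datetime, time

def is_market_open(ts_eastern: datetime) -> bool:
    """Return True during regular NYSE hours: Mon-Fri, 9:30 AM-4:00 PM ET."""
    if ts_eastern.weekday() >= 5:  # Saturday (5) or Sunday (6)
        return False
    return time(9, 30) <= ts_eastern.time() < time(16, 0)

# Outside these hours the platform falls back to synthetic data.
print(is_market_open(datetime(2025, 6, 11, 10, 0)))  # True (a Wednesday, mid-morning)
```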

1. Predefined Model for continuous streaming

This configuration is designed to start the real-time data stream and use a pre-trained model for prediction. It's the ideal option for those who want to see the project in action without having to manage the initial model training.

Startup

To start only the streaming data flow and prediction, run the following command in your terminal:

 docker-compose -f docker-compose-stream.yml up --build -d 

Dashboard Visualization

Once the streaming configuration is up and running, you can access the dashboard at http://localhost:8501.

From the dropdown menu at the top, you can select a company and view both its stock price trend and the future prediction generated by the model. The prediction appears in the dashboard automatically about 5 minutes after the first data point has been registered, along with the real-time detection of potential market anomalies.

2. Customizable full model training

This comprehensive setup first downloads approximately 13 million rows of historical data (~4GB) and then trains the model from scratch. Once this initial training is completed, you will manually transition to a continuous streaming mode with real-time model retraining enabled. This provides a more realistic setup for an evolving model.

The prediction behaviour is the same as in the first configuration; as noted above, this configuration also performs weekly retraining of the model.

Startup

To download historical data, perform initial model training, and then start continuous streaming with retraining, follow these steps:

  1. Start the historical data download and initial training:
 docker-compose -f docker-compose-historical.yml up --build -d 

Monitor the container logs using the command

 docker-compose logs -f ml_model_train 

to determine when the initial training process has finished. This can take a significant amount of time (around 5 hours) depending on your system's resources and data volume. The following line appears in the container's logs once the process has completed successfully, after which the container shuts down: "Model training process finished successfully. Exiting."

  2. Once initial training is completed, shut down the docker-compose-historical.yml configuration:
 docker-compose -f docker-compose-historical.yml down 

It is crucial to perform this shutdown to release resources and prepare for the next step.

  3. Start the continuous streaming with retraining:
 docker-compose -f docker-compose-retrain.yml up --build -d 

This will launch the application in a mode where it streams real-time data and continuously retrains the model using the historical data you've already downloaded.

Dashboard Visualization

Once the streaming configuration is launched, you can access the dashboard at http://localhost:8501. From the dropdown menu, you can select a company to view its stock price trend, the model's future prediction, and real-time detection of potential market anomalies.

Company Configuration

The project considers a specific set of companies. You can modify these companies by customizing the companies_info.csv file located in the postgres/ folder. If you want to change the companies to be considered in the project, it is necessary to restart docker-compose-historical.yml to re-download the historical data and retrain the models on the new companies. You can choose from all the companies listed on the New York Stock Exchange (NYSE) and NASDAQ.

Additionally, you must remove the persistent PostgreSQL volume that stores previous company data. Otherwise, the changes will not take effect. Run the following command before restarting the configuration:

 docker volume rm bdt-project_pgdata 

Important: With Alpaca's free API plan, you cannot exceed 30 companies.

In the companies_info.csv file:

  • Modify the "ticker_id" column with a unique identification number.
  • Modify the "ticker" column with the ticker of the company you wish to add.
  • Modify the "company_name" column with the full company name (this name will appear in the dashboard and will be used for sentiment analysis searches).
  • Use the "related_words" column to add a second keyword to search for that company (e.g., a company nickname or a closely related term, like "Facebook" for "META").
  • Set "is_active" to TRUE or FALSE to include or exclude a company from the project.
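A small validation sketch for edits to companies_info.csv, using the column names listed above and the 30-company cap from Alpaca's free plan (the exact header layout of the real file is an assumption):

```python
import csv
import io

def validate_companies(csv_text: str, max_active: int = 30) -> list:
    """Return a list of problems found in a companies_info.csv payload."""
    problems = []
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ids = [r["ticker_id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("ticker_id values must be unique")
    active = [r for r in rows if r["is_active"].strip().upper() == "TRUE"]
    if len(active) > max_active:
        problems.append(f"more than {max_active} active companies")
    return problems

sample = """ticker_id,ticker,company_name,related_words,is_active
1,AAPL,Apple Inc.,iPhone,TRUE
2,META,Meta Platforms,Facebook,TRUE
"""
print(validate_companies(sample))  # an empty list means the file looks OK
```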

After modifying and saving the companies_info.csv file, you need to update the fundamental data for the companies. To do this, run the following commands:

  1. Build the Docker image for updating company fundamentals.
 docker build -f company_fundamentals.Dockerfile -t company_fundamentals-image . 
  2. Run the image to update fundamentals using .env variables.
 docker run --rm --env-file .env company_fundamentals-image 

API Limits for Fundamental Data: The API for fundamental data has a limit of 250 requests per day. Each company requires 3 API calls. Therefore, you cannot modify the full set of 30 companies more than 2 times on the same day.
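The stated limit works out as a simple request-budget calculation:

```python
DAILY_LIMIT = 250          # fundamental-data API requests per day
CALLS_PER_COMPANY = 3      # API calls needed per company
COMPANIES = 30             # full set under Alpaca's free plan

calls_per_refresh = CALLS_PER_COMPANY * COMPANIES      # 90 calls per full refresh
full_refreshes_per_day = DAILY_LIMIT // calls_per_refresh
print(full_refreshes_per_day)  # 2: at most two full refreshes fit in one day
```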

Shutting Down a Docker Compose

To shut down any Docker Compose configuration, use the command:

 docker-compose -f docker-compose-filename.yml down 

For example, to shut down the historical configuration:

 docker-compose -f docker-compose-historical.yml down 

Complete Overview

The primary objective of this project is the design and implementation of a big data platform for real-time analysis and prediction of stock market trends. The system integrates streaming and historical data with sentiment analysis and predictive modeling to support informed investment decisions.

Key Objectives

To achieve this goal, the following key objectives have been defined:

  • Data Ingestion: Ingestion of real-time and historical data from sources such as Alpaca (stock data), Bluesky (social media data), and Finnhub (news).

  • Sentiment Extraction: Sentiment extraction from news and social media data using FinBERT and Apache Flink.

  • Data Storage: Storage of non-aggregated streams in MinIO and aggregated data in PostgreSQL.

  • Unified Aggregation Pipeline: Development of a unified aggregation pipeline for historical and streaming data, implemented with Flink.

  • Prediction Model: Creation of a model for short-term price forecasting based on LSTM models.

  • Visualization Dashboard: Development of an interactive dashboard with Streamlit to visualize trends, predictions, and anomaly detection.

  • Data Recoverability: Storage of data at different stages to ensure recovery capability.

Problem Statement and Motivation

This project stems from the difficulty of predicting stock market trends and making informed investment decisions due to the vast and rapidly changing volume of market data. A big data system with real-time processing is essential for extracting meaningful insights. Traditional approaches often lack comprehensive sentiment analysis.

This system aims to empower investors with advanced decision-making tools through timely detection of trends and risks, leading to improved investment outcomes and reduced exposure to losses. Additionally, it enables the unlocking of new insights from unstructured data (news and social media).

Data Exploration

The project utilizes various data sources, both streaming and historical:

Streaming Data:

  • Stock Market (Alpaca): timestamp, ticker, price, size, exchange. Frequency: less than 1 observation/second/ticker.
  • Macroeconomics: timestamp, gdp_real, cpi, ffr, t10y, t2y, spread_10y_2y, unemployment. Frequency: daily, monthly, or quarterly.
  • Bluesky: timestamp, ticker, text. Frequency: undefined.
  • News (Finnhub): timestamp, ticker, title, description. Frequency: undefined.

Historical Data:

  • Stock Market (Alpaca): timestamp, ticker, open, close, size, exchange. Frequency: 1 observation/minute/ticker.
  • Macroeconomics: timestamp, gdp_real, cpi, ffr, t10y, t2y, spread_10y_2y, unemployment. Frequency: daily, monthly, or quarterly.
  • Company Fundamentals: calendar year, eps, freeCashFlow, revenue, netIncome, balance_totalDebt, balance_totalStockholdersEquity. Frequency: 1 observation/year.

System Architecture

The system comprises several modules that manage data flow from ingestion to visualization:

[Figure: system architecture overview]

  • Stream Data Module: Ingests data from macroeconomics, Bluesky, news, and Alpaca stock data.

  • Kafka: Serves as a message broker for data streaming.

  • Flink Global Job / Flink Aggregator Job / Flink Main Job: Handle transformation and aggregation of streaming and historical data.

  • Raw Storage (Data Lake): MinIO is used for high-performance, S3-compatible object storage for managing large raw datasets.

  • Sentiment Analysis: News and Bluesky data are processed for sentiment analysis, with results stored in Sentiment Storage.

  • Stream Aggregated Data: Data is preprocessed and aggregated for the prediction using the model.

  • Prediction Model: Utilizes the LSTM model to generate predictions based on aggregated data.

  • PostgreSQL: Used as a robust relational database for reliable storage of structured data, including company configurations and processed results.

  • Historical Data Storage: Stores historical stock, company fundamental, and macroeconomic data.

  • Dashboard (Streamlit): Provides interactive visualization of real-time data, predictions, and historical trends.

Data Flow

[Figure: data flow across layers]

The data flows through the different layers as follows:

  1. Ingestion Layer: All streaming data (stock market, macroeconomics, Bluesky, news) and historical data (company fundamentals, macroeconomics, stock market) are ingested.
  2. Sentiment Analysis Layer: Bluesky and news data are routed for sentiment analysis and stored in MinIO.
  3. Aggregation Layer: Streaming and historical data are aggregated and stored in PostgreSQL. For company fundamentals, only the latest record is used.
  4. Model Training Layer: Aggregated data from PostgreSQL is used for model training.
  5. Prediction Layer: Predictions are generated using the ML models and stream aggregated data.
  6. Dashboard: All predictions and relevant data are displayed on the dashboard.
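The "only the latest record is used" rule for company fundamentals (step 3) can be sketched as follows (hypothetical record layout; the actual Flink job differs):

```python
def latest_fundamentals(records):
    """Keep only the most recent calendar_year entry per ticker."""
    latest = {}
    for rec in records:  # rec: dict with at least "ticker" and "calendar_year"
        key = rec["ticker"]
        if key not in latest or rec["calendar_year"] > latest[key]["calendar_year"]:
            latest[key] = rec
    return latest

rows = [
    {"ticker": "AAPL", "calendar_year": 2022, "eps": 6.11},
    {"ticker": "AAPL", "calendar_year": 2023, "eps": 6.13},
]
print(latest_fundamentals(rows)["AAPL"]["calendar_year"])  # 2023
```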

Technologies Used

| Objective | Technology | Role |
| --- | --- | --- |
| Containerization/Orchestration | Docker | Reproducible, consistent environments; simple configuration and deployment management of multi-service applications. |
| Streaming Data Management | Kafka (with Zookeeper) | Robust, efficient, scalable real-time data streaming between all project components. |
| Stream Processing | Apache Flink | Powerful real-time stream processing, enabling complex data transformations and stateful computations on live data streams. |
| Data Storage | MinIO | Flexible, high-performance, S3-compatible object storage for managing large datasets. |
| Data Storage | PostgreSQL | Robust relational database for reliable storage of structured data, including company configurations and processed results. |
| Models | Quantized FinBERT | Specialized pre-trained model for accurate sentiment analysis of financial text. |
| Models | LSTM | Neural network for time-series prediction, used for one-minute-ahead stock price forecasts. |
| Live Insight Visualization | Streamlit | Interactive dashboard for visualizing real-time data, predictions, and historical trends. |

Results

The dashboard (accessible via Streamlit) allows interactive visualization of actual and predicted price trends. It includes a dropdown menu to select the company (ticker), trend lines for actual price (blue) and predicted price (red), an opening price line as a reference level, and display of the opening price level with percentage and absolute change.

[Demo: dashboard screen recording (GIF)]

Anomaly Detection: Anomalies, defined as rapid price changes (upward or downward), are highlighted on the dashboard. Negative changes are in red, positive ones in green. The anomaly detection threshold is customizable through environment variables. The dashboard also displays the last detected anomaly and its details, with gradient shading highlighting areas where anomalies were detected.
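A minimal sketch of such a threshold check (the environment-variable name ANOMALY_THRESHOLD_PCT is an assumption for illustration, not the project's actual setting):

```python
import os

# Hypothetical env var; the real project reads its own setting from .env.
THRESHOLD_PCT = float(os.getenv("ANOMALY_THRESHOLD_PCT", "1.0"))

def classify_anomaly(prev_price: float, price: float,
                     threshold_pct: float = THRESHOLD_PCT):
    """Flag rapid moves; 'positive' renders green, 'negative' red on the dashboard."""
    change_pct = (price - prev_price) / prev_price * 100
    if change_pct >= threshold_pct:
        return "positive"
    if change_pct <= -threshold_pct:
        return "negative"
    return None  # no anomaly

print(classify_anomaly(100.0, 102.0, 1.0))  # positive (a +2% jump)
```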

[Figure: anomaly detection highlighted on the dashboard]

Aggregated Data: Each row of aggregated data represents a (ticker, timestamp) pair with all features used by the predictive model. Features include:

  • Stock price-derived features: price_mean_1min, price_mean_5min, price_std_5min, price_mean_30min, price_std_30min, size_tot_1min, size_tot_5min, size_tot_30min.

  • Sentiment features: sentiment_bluesky_mean_2hours, sentiment_bluesky_mean_1day, sentiment_news_mean_1day, sentiment_news_mean_3days, sentiment_general_bluesky_mean_2hours, sentiment_general_bluesky_mean_1day.

  • Temporal features: minutes_since_open, day_of_week, day_of_month, week_of_year, month_of_year, market_open_spike_flag, market_close_spike_flag.

  • Fundamental features:

    --> eps (Earnings Per Share): Measures a company's profit allocated to each outstanding share of common stock. Higher EPS indicates greater profitability.

    --> free_cash_flow: The cash a company generates after accounting for capital expenditures. It shows how much cash is available for debt repayment, dividends, or reinvestment.

    --> profit_margin: The percentage of revenue that turns into profit. It reflects how efficiently a company controls costs relative to its sales.

    --> debt_to_equity: A ratio comparing a company’s total debt to its shareholders’ equity. It indicates the level of financial leverage and risk.

  • Macroeconomic features:

    --> gdp_real: The inflation-adjusted value of all goods and services produced in the economy. It reflects overall economic growth.

    --> cpi (Consumer Price Index): Measures the average change in prices paid by consumers. It’s a key indicator of inflation.

    --> ffr (Federal Funds Rate): The interest rate at which banks lend to each other overnight. It influences overall monetary policy and borrowing costs.

    --> t10y (10-Year Treasury Yield): The return on U.S. government bonds maturing in 10 years. It’s a benchmark for long-term interest rates.

    --> t2y (2-Year Treasury Yield): The return on U.S. government bonds maturing in 2 years. It reflects short-term interest rate expectations.

    --> spread_10y_2y: The difference between 10-year and 2-year Treasury yields. Negative values may signal a potential recession.

    --> unemployment: The percentage of the labor force that is jobless and actively seeking work. It indicates labor market health.

  • Target: y1 — Next-minute price (prediction target).
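The price-derived features above can be sketched with simple trailing windows (a pure-Python illustration assuming one tick per minute; the actual pipeline computes these in Flink):

```python
from collections import deque

class RollingFeatures:
    """Trailing-window aggregates over per-minute (price, size) observations."""

    def __init__(self):
        self.w5 = deque(maxlen=5)    # last 5 minutes
        self.w30 = deque(maxlen=30)  # last 30 minutes

    def update(self, price: float, size: float) -> dict:
        self.w5.append((price, size))
        self.w30.append((price, size))
        p5 = [p for p, _ in self.w5]
        mean5 = sum(p5) / len(p5)
        var5 = sum((p - mean5) ** 2 for p in p5) / len(p5)
        return {
            "price_mean_1min": price,  # one tick per minute assumed
            "price_mean_5min": mean5,
            "price_std_5min": var5 ** 0.5,
            "size_tot_5min": sum(s for _, s in self.w5),
            "size_tot_30min": sum(s for _, s in self.w30),
        }
```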

System Performance

The system demonstrates robust performance characteristics during operation, as evidenced by comprehensive monitoring metrics:

[Figure: system monitoring metrics]

CPU Usage:

  • Initial spike below 1000% due to many containers being created simultaneously during system startup.
  • After the bootstrap phase, usage stabilizes between 200% and 500% (roughly 2–5 out of 12 available cores).
  • Occasional spikes are caused by intensive phases of the active Flink jobs or the retraining module.

Memory Usage:

  • Memory grows rapidly to ~8.85 GB at startup and then remains consistently high, due to the many in-memory buffers associated with the time-windowing logic used in stream processing.
  • Buffers remain filled and are not frequently emptied, due to the stateful nature of multiple active Flink jobs.

Lessons Learned

During project development, several challenges and successes were encountered:

Major Challenges:

  • LSTM model setup: managing different granularity between historical and streaming data.
  • Broadcast state: joining global data with ticker-specific data.
  • Module parallelization.
  • Using exclusively real data.
  • Sentiment analysis setup: for example, sentiment score polarization.

What didn't work well:

  • Broadcast state.
  • Historical data retrieval for sentiment analysis.
  • Retrain.

What worked well:

  • API retrieval and configuration.
  • Dashboard configuration.
  • Teamwork.
  • Docker Compose and Kafka container configuration.

Limitations and Future Work

Current system limitations:

  • No predictions outside market hours: replaced with simulated data.
  • API limitations: some APIs expire or offer limited access.
  • No historical news or social media data: the current model is not trained on historical sentiment scores.

What would break first when the system scales and why:

  • Streaming data aggregation: not parallelizable without broadcast state.

Potential improvements:

  • Enable predictions even when the stock market is closed.
  • Improve models for more reliable predictions.
  • Add customizable options for short and long-term predictions.
  • Incorporate more sources for sentiment analysis.
  • Expand the system to more stock markets.

What would be done with more time/resources:

  • Train the model on sentiment as well.
  • Provide predictions over different time horizons.
  • Implement robust predictions even when the stock market is closed.
  • Parallelize historical data aggregation.
  • Offer personalized portfolio insights.

NOTE: for more details on the specific modules please check the README.md of the different folders.

References and Acknowledgments

References:

  • Prasad, A., & Seetharaman, A. (2021). Importance of machine learning in making investment decision in stock market. Vikalpa, 46(4), 209-222.
  • Bhandari, H. N., Rimal, B., Pokhrel, N. R., Rimal, R., Dahal, K. R., & Khatri, R. K. (2022). Predicting stock market index using LSTM. Machine Learning with Applications, 9, 100320.

Contributors:

  • Pietro Lovato, Machine Learning researcher at the University of Verona.
