Deep Learning Project Report: Hull Tactical Market Prediction

Team Members: Aryan Gosain, Kunal Ranjan, Rishit Anand, Varun SG

1. Executive Summary

This project aimed to develop a deep learning model to predict the daily excess returns of the S&P 500 index and generate daily portfolio allocation decisions. Originating from the "Hull Tactical Market Prediction" Kaggle competition, the project addresses the challenge of forecasting in a low signal-to-noise environment.

The team evaluated multiple architectures, including CNN-LSTM and Temporal Convolutional Networks (TCN), before developing a final Enhanced FT-Transformer (Feature Tokenizer Transformer). This model achieved a Sharpe Ratio of 1.699 over a 24-year validation period, significantly outperforming the S&P 500 benchmark (Sharpe ~0.67–1.05) and previous deep learning baselines.

2. Problem Statement

The objective was to predict the daily excess returns of the S&P 500 and output a daily portfolio allocation decision to maximize risk-adjusted returns (Sharpe Ratio).

  • Input Data: 8,990 trading days $\times$ 98 features.
  • Features: Market Dynamics ($M^*$), Macroeconomic ($E^*$), Interest Rates ($I^*$), Price/Valuation ($P^*$), Volatility ($V^*$), Sentiment ($S^*$), and Dummy variables ($D^*$).
  • Output: An allocation value between $[0, 2]$:
    • $0$: No investment (Cash/Risk-free).
    • $1$: Full investment in S&P 500.
    • $2$: 2x Leveraged investment.
  • Target Variable: market_forward_excess_returns (Forward returns relative to expectations, winsorized using Median Absolute Deviation).

3. Theoretical Challenges

Financial time series prediction presents unique challenges compared to domains like computer vision or NLP:

  1. Low Signal-to-Noise Ratio: The daily volatility of the S&P 500 (~1%) is roughly 25 times larger than the expected daily return (~0.04%). The "signal" is buried in "noise".
  2. The $R^2$ Paradox: In this domain, a negative Out-of-Sample $R^2$ does not necessarily imply a failed model. A model with low or negative $R^2$ can still generate significant economic profits (high Sharpe Ratio) if the timing variance is managed correctly.
  3. Efficient Market Hypothesis: "Shallow" models often outperform "deep" models because financial data lacks the high-fidelity signal required to train complex deep networks without overfitting.
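
To make the $R^2$ paradox concrete, here is a toy NumPy simulation of our own (not project code): a forecaster with the right directional signal but a badly miscalibrated scale produces a negative out-of-sample $R^2$, yet sizing positions in $[0, 2]$ with it still earns a positive Sharpe ratio.

```python
# Toy simulation (illustrative only): a forecaster with negative out-of-sample
# R^2 can still earn a positive Sharpe ratio when used for position sizing.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
signal = rng.normal(0, 1, n)                                   # hypothetical predictor
true_ret = 0.0004 + 0.0005 * signal + rng.normal(0, 0.01, n)   # ~1% daily vol, ~0.04% mean

pred = 0.002 * signal                           # right sign, badly miscalibrated scale
ss_res = np.sum((true_ret - pred) ** 2)
ss_tot = np.sum((true_ret - true_ret.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                        # typically negative here

position = np.clip(1 + 100 * pred, 0, 2)        # map forecasts to [0, 2] allocations
strat_ret = position * true_ret
sharpe = np.sqrt(252) * strat_ret.mean() / strat_ret.std()
print(f"OOS R^2: {r2:.3f}, annualized Sharpe: {sharpe:.2f}")   # R^2 < 0, Sharpe > 0
```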

4. Methodology & Research-Based Architecture

4.1. Model Architectures

We experimented with several deep learning architectures, evolving from standard sequence models to specialized tabular transformers. Our choices were guided both by our own experimentation and by adapting architectures proposed in the financial-modelling literature to our use case.

A. CNN-LSTM Ensemble

Reference: Kaijian He et al., "Financial Time Series Forecasting with the Deep Learning Ensemble Model," Mathematics 11, no. 4 (2023): 1054 [7].

Approach: Inspired by the architecture proposed by He et al. [7], we combined the spatial feature extraction capabilities of Convolutional Neural Networks (CNNs) with the temporal sequence modeling of Long Short-Term Memory (LSTM) networks. We modified the design by training an ensemble of 3 versions and adding residual connections so the model can capture both linear and non-linear components of the signal.

  • Architecture:
    • Branch A (CNN-LSTM): A 1D Convolutional layer (64 filters, kernel size 3) extracts local short-term patterns (e.g., 3-day trends). This is followed by Batch Normalization and an LSTM layer (64 units) to capture longer-term dependencies.
    • Branch B (Residual Skip): A direct skip connection processes the raw input through a dense layer, preserving the original signal and aiding gradient flow.
    • Merge: The branches are concatenated and passed through dense layers to produce the final prediction.
  • Ensemble Strategy: To improve robustness, we trained an ensemble of 3 identical models with different random seeds and averaged their predictions.
  • Performance: Adjusted Sharpe Ratio: 0.696
    Figures: the architecture from [7], and our modified architecture (a minimal code sketch follows below).
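
A minimal PyTorch sketch of the two-branch design described above; the layer sizes follow the text, while the 64-day lookback, the dense-head sizes, and other details are our assumptions:

```python
# Minimal sketch of the two-branch CNN-LSTM with a residual skip connection.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_features: int, seq_len: int):
        super().__init__()
        # Branch A: Conv1d over time (64 filters, kernel 3) -> BatchNorm -> LSTM(64)
        self.conv = nn.Conv1d(n_features, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(64)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        # Branch B: residual skip on the flattened raw input (preserves the signal)
        self.skip = nn.Linear(n_features * seq_len, 64)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        a = self.bn(self.conv(x.transpose(1, 2))).transpose(1, 2)
        a, _ = self.lstm(a)
        a = a[:, -1]                             # last LSTM hidden state
        b = self.skip(x.flatten(1))
        return self.head(torch.cat([a, b], dim=-1))

# Ensemble strategy: average predictions of 3 identically configured models
# trained with different random seeds.
models = [CNNLSTM(n_features=94, seq_len=64) for _ in range(3)]
x = torch.randn(8, 64, 94)
pred = torch.stack([m(x) for m in models]).mean(dim=0)   # (8, 1)
```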


B. Temporal Convolutional Network (TCN)

Reference: Li, Jianlong, Siyuan Wang, Zhihang Zhu, and Bingyan Han. "Stock Prediction Based on Deep Learning and its Application in Pairs Trading." In 2022 International Symposium on Networks, Computers and Communications (ISNCC), 1-6. IEEE, 2022. [5]

The TCN, inspired by the work presented in [5], uses dilated causal convolutions to capture long-range temporal dependencies without the computational overhead of recurrent networks. We chose it because TCNs have repeatedly been reported to outperform LSTMs for financial modelling in the recent literature.

  • Architecture:
    • Dilated Convolutions: The model uses a stack of residual blocks with exponentially increasing dilation factors ($d = 1, 2, 4, 8$). This allows the network to have a large receptive field, seeing 15+ timesteps into the past with only a few layers.
    • Causal Padding: Ensures that predictions at time $t$ depend only on inputs from time $t$ and earlier, preventing data leakage.
    • Configuration: 128 filters per layer, kernel size 2, 2 stacks.
  • Performance: Adjusted Sharpe Ratio: 0.749 (a sketch of one TCN residual block follows below)
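
A minimal PyTorch sketch of one dilated causal residual block under the stated configuration (128 filters, kernel size 2); the padding scheme and activation placement are our assumptions:

```python
# Minimal sketch of a TCN residual block with dilated causal convolutions.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        # Left-pad so the output at time t only sees inputs at times <= t (causal)
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (batch, channels, time)
        y = self.relu(self.conv1(nn.functional.pad(x, (self.pad, 0))))
        y = self.relu(self.conv2(nn.functional.pad(y, (self.pad, 0))))
        return self.relu(x + y)                  # residual connection

# Stack with exponentially increasing dilation factors
blocks = nn.Sequential(*[TCNBlock(128, d) for d in (1, 2, 4, 8)])
x = torch.randn(8, 128, 64)                     # (batch, filters, timesteps)
print(blocks(x).shape)                          # torch.Size([8, 128, 64])
```

With kernel size 2, two convolutions per block, and dilations $1, 2, 4, 8$, the receptive field spans $1 + 2(1+2+4+8) = 31$ timesteps, comfortably covering the 15+ steps mentioned above.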


C. Vanilla Transformer

Reference: Chen, Qizhao. "Comparing Different Transformer Model Structures for Stock Prediction." arXiv. April 23, 2025. [6]

We applied the standard Transformer architecture, inspired by [6] and originally designed for NLP, to financial time series, so that self-attention can relate every time step to every other within the lookback window.

  • Architecture:
    • Input Embedding: A linear projection maps the 94 input features to a d_model of 256.
    • Positional Encoding: Learnable embeddings are added to inject sequence order information.
    • Encoder: A stack of 4 TransformerEncoderLayers with 8 attention heads and a feedforward dimension of 512.
    • Attention Mechanism: Multi-head self-attention captures global temporal dependencies across the 64-day lookback window.
  • Performance: Adjusted Sharpe Ratio: 0.809
    Figures: the architecture from [6], and our architecture (a minimal code sketch follows below).
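
A minimal PyTorch sketch matching the stated configuration (d_model 256, 4 layers, 8 heads, feedforward 512, 64-day lookback); reading the prediction off the last position is our assumption:

```python
# Minimal sketch of the vanilla Transformer encoder for time series.
import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    def __init__(self, n_features=94, d_model=256, n_heads=8, n_layers=4,
                 d_ff=512, seq_len=64):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)                 # input embedding
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))   # learnable positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                        # x: (batch, 64, 94)
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h[:, -1])               # predict from the final time step

model = TimeSeriesTransformer()
print(model(torch.randn(8, 64, 94)).shape)       # torch.Size([8, 1])
```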


4.2. Feature Engineering Pipeline

Before feeding data into our final transformer models, we implemented a rigorous preprocessing pipeline to expand and select the most relevant features.

  1. Interaction Feature Generation: We expanded the original 94 features by calculating pairwise interactions (Addition, Subtraction, Multiplication) for all feature pairs ($f_1+f_2, f_1-f_2, f_1 \times f_2$). This resulted in a massive set of 13,207 features.
  2. Selection via XGBoost: To handle this high dimensionality, we trained an XGBoost Regressor on the expanded set to calculate gain-based feature importance.
  3. Filtering: We selected the top 150 features based on importance scores. This subset was used as the input for both FT-Transformer models (a sketch of the pipeline follows below).
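
A sketch of this expansion-and-selection pipeline; the dummy data, column naming, and XGBoost settings are stand-ins, not the project's exact code:

```python
# Sketch of the interaction-feature expansion and gain-based XGBoost selection.
import numpy as np
import pandas as pd
from itertools import combinations
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
features_df = pd.DataFrame(rng.normal(size=(1000, 94)),
                           columns=[f"f{i}" for i in range(94)])
y = rng.normal(size=1000)   # stand-in for market_forward_excess_returns

def expand_interactions(df: pd.DataFrame) -> pd.DataFrame:
    out = {c: df[c] for c in df.columns}          # keep the 94 originals
    for f1, f2 in combinations(df.columns, 2):    # C(94, 2) = 4,371 pairs
        out[f"{f1}+{f2}"] = df[f1] + df[f2]
        out[f"{f1}-{f2}"] = df[f1] - df[f2]
        out[f"{f1}*{f2}"] = df[f1] * df[f2]
    return pd.DataFrame(out)                      # 94 + 3 * 4,371 = 13,207 columns

X = expand_interactions(features_df)
model = XGBRegressor(n_estimators=100, importance_type="gain")
model.fit(X, y)
top150 = X.columns[np.argsort(model.feature_importances_)[::-1][:150]]
```

Note that the count checks out: 94 original features plus 3 interactions for each of the $\binom{94}{2} = 4{,}371$ pairs gives exactly 13,207 features.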

4.3. FT-Transformers

Reference: Gorishniy, Yury, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. "Revisiting Deep Learning Models for Tabular Data." arXiv:2106.11959 (2021). [1]

The core of our final solution is the Feature Tokenizer Transformer (FT-Transformer) [1], which is specifically designed to handle tabular data more effectively than standard MLPs or ResNets. We modified this architecture with the enhancements described in Section 4.3.B, on top of the preprocessing pipeline from Section 4.2.

A. Baseline FT-Transformer

  • Feature Tokenization: Unlike standard transformers that treat time steps as tokens, the FT-Transformer treats each feature as a separate token. Each of the 150 features is projected into a 128-dimensional embedding space.
  • Architecture:
    • 3 Transformer Encoder layers.
    • ReGLU (Rectified Gated Linear Unit) activation in the feed-forward networks.
    • [CLS] Token: A special token is appended to aggregate information from all features for the final prediction.
  • Performance: Adjusted Sharpe Ratio: 0.684

Figures: the architecture from [1], and our architecture (a minimal sketch of the feature tokenizer follows below).
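
A minimal sketch of the feature-tokenization idea (per-feature learned embeddings plus a [CLS] token); we use PyTorch's stock encoder layer here, so the ReGLU feed-forward from [1] is simplified to the default activation, and the head count is an assumption:

```python
# Minimal sketch of FT-Transformer-style feature tokenization: each of the
# 150 features becomes its own token via a learned per-feature embedding.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, n_features=150, d_token=128):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_features, d_token) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_features, d_token))
        self.cls = nn.Parameter(torch.zeros(1, 1, d_token))   # [CLS] aggregation token

    def forward(self, x):                        # x: (batch, n_features)
        tokens = x.unsqueeze(-1) * self.weight + self.bias    # (batch, 150, 128)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1)   # (batch, 151, 128)

tok = FeatureTokenizer()
layer = nn.TransformerEncoderLayer(128, 8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)          # 3 encoder layers
h = encoder(tok(torch.randn(8, 150)))
prediction_input = h[:, 0]                       # prediction is read off the [CLS] token
```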

B. Enhanced FT-Transformer (Final Model)

We significantly improved the baseline model by introducing three key enhancements:

  1. PLR Embeddings (Periodic Linear Representations):

    • Instead of simple linear projections, we used Fourier Features [9] to embed continuous values.
    • Formula: $\varphi(x_i) = [\sin(2\pi x_i \cdot v_i), \cos(2\pi x_i \cdot v_i)]$
    • This maps scalar inputs into a higher-dimensional space, allowing the model to capture high-frequency periodic patterns often missed by standard linear embeddings.
  2. SwiGLU Activation:

    • We replaced ReGLU with SwiGLU (Swish-Gated Linear Unit) [10], popularized by LLaMA.
    • Formula: $\text{SwiGLU}(x) = (x W_1) \odot \text{Swish}(x W_2)$
    • SwiGLU provides smoother gradients and better optimization stability compared to ReLU-based activations.
  3. SAM Optimizer (Sharpness-Aware Minimization):

    • We trained the model using SAM [11], which minimizes both the loss value and the sharpness of the loss landscape.
    • By finding a "flat minimum," the model generalizes significantly better to unseen, noisy financial data, reducing the risk of overfitting.
  • Performance: Adjusted Sharpe Ratio: 1.699 (sketches of these components follow below)
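
Minimal sketches of the first two enhancements; the number of frequencies and the down-projection in the SwiGLU block are our assumptions:

```python
# Sketches of the PLR (periodic) embedding and the SwiGLU feed-forward block.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic embedding: phi(x) = [sin(2*pi*v*x), cos(2*pi*v*x)], v learnable."""
    def __init__(self, n_features: int, n_frequencies: int = 8):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(n_features, n_frequencies))

    def forward(self, x):                        # x: (batch, n_features)
        angles = 2 * torch.pi * x.unsqueeze(-1) * self.freq
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (x W1) * Swish(x W2), then project back to d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w3(self.w1(x) * nn.functional.silu(self.w2(x)))  # silu == Swish

emb = PLREmbedding(150)(torch.randn(8, 150))     # (8, 150, 16)
out = SwiGLU(128, 256)(torch.randn(8, 151, 128)) # (8, 151, 128)
```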
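And a simplified single-batch SAM step, following the two-step ascent/descent scheme of [11]; real implementations typically handle per-layer gradient scaling and closures more carefully:

```python
# Simplified SAM training step: ascend to a nearby "sharp" point in weight
# space, compute gradients there, then update the original weights with them.
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    # 1) gradients at the current weights
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    # 2) perturb weights along the gradient direction (worst-case ascent)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()
    # 3) gradients at the perturbed weights drive the actual update
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)                            # restore the original weights
    base_opt.step()                              # descend using the perturbed gradients
    base_opt.zero_grad()

# usage: opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
#        sam_step(model, torch.nn.MSELoss(), x_batch, y_batch, opt)
```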

5. Experimentation & Results

5.1. Evaluation Metric

The primary metric used to evaluate model performance was the Sharpe Ratio, which measures the excess return of an investment per unit of risk (volatility).

$$Sharpe~Ratio = \frac{E[R - R_f]}{\sigma}$$

  • Interpretation:
    • > 1.0: Generally considered "good".
    • > 2.0: Considered "very good" to "outstanding".
    • A higher ratio indicates that the strategy performs better relative to the risk taken.
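
For reference, a minimal computation of the annualized Sharpe ratio from daily excess returns (the adjusted variant reported in our results builds on this core formula):

```python
# Annualized Sharpe ratio from daily excess returns, using the standard
# sqrt(252)-trading-day annualization convention.
import numpy as np

def sharpe_ratio(daily_excess_returns: np.ndarray) -> float:
    """E[R - Rf] / sigma, annualized with sqrt(252)."""
    return np.sqrt(252) * daily_excess_returns.mean() / daily_excess_returns.std()
```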

5.2. Comparative Performance

We evaluated several architectures to establish a baseline before finalizing the FT-Transformer.

| Model | Description | Sharpe Ratio |
| --- | --- | --- |
| CNN-LSTM | Hybrid spatial-temporal model with residual connections. | 0.696 |
| TCN | Temporal Convolutional Network with dilated convolutions. | 0.749 |
| Vanilla Transformer | Standard Transformer Encoder for time series. | 0.809 |
| Enhanced FT-Transformer | Final model: PLR embeddings + SwiGLU + SAM optimizer. | 1.699 |

5.3. Final Results Analysis

The final Enhanced FT-Transformer achieved a Sharpe Ratio of 1.699 over the 24-year validation period.

  • Economic Interpretation: A Sharpe ratio of 1.699 means the strategy earned 1.699 units of annualized excess return (over a risk-free benchmark such as government bonds) per unit of volatility, i.e., its returns comfortably compensated for the risk taken.

A. Benchmarking Against S&P 500

To validate the model's effectiveness, we compared it against the historical Sharpe Ratios of the S&P 500 index over multiple time horizons.

Key Finding: Our model outperformed the long-term market average by ~2.5x and the recent bull market by ~1.6x.

| Time Horizon | S&P 500 Sharpe | Context |
| --- | --- | --- |
| 30 Years | 0.67 | The long-term market average. |
| 10 Years | 0.72 | The market average over the last decade. |
| 3 Years | 1.05 | Recent "bull market" conditions. |

Note: For reference, our baseline CNN-LSTM model was only able to outperform the long-term market average by approximately 1.04x, highlighting the significant jump in performance achieved by the FT-Transformer.

B. Real-World Viability & Context

It is crucial to contextualize these results within the constraints of real-world finance versus theoretical competition metrics.

  • Kaggle vs. Reality: While some models on the competition leaderboard achieved Sharpe scores in the range of 5–17, such models likely relied on "pure math" exploits of the metric, look-ahead bias, or direct optimization of the Sharpe ratio rather than genuine predictive signal.
  • Realistic Upper Bounds: In the real world, it is considered next to impossible for Sharpe ratios to cross 3.0 over a sustained period.
  • Industry Benchmark: For a realistic comparison, the highest-scoring Multi-Asset Fund in India currently holds a Sharpe Ratio of 1.88 (over the last three years).
  • Conclusion: Our model's score of 1.699 places it in a highly competitive bracket for a fully automated strategy. The results reinforce the finding that shallow models (or carefully regularized ones) often prove to be better than deep models for financial data due to the noise present in the dataset.

6. Lessons Learned

  1. Shallow vs. Deep: Contrary to fields like Computer Vision, "shallow" models (or carefully regularized deep models like FT-Transformer) often perform better in finance due to the extreme noise.
  2. Feature Engineering is Critical: The jump in performance was largely driven by the interaction features and the specific embedding of continuous variables (PLR), rather than just increasing model depth.
  3. Optimization Matters: Using SAM to smooth the loss landscape was essential for generalization in a noisy domain.
  4. Target Scaling: Scaling or normalizing the target variable proved helpful when loss values were too small to drive gradient updates.

7. Future Work & Improvements

  1. Allocation Scaling: Experiment with the scaling factor used to convert predictions into portfolio allocations (prediction-to-signal conversion) to optimize risk-adjusted returns; a hypothetical sketch follows after this list.
  2. Classification Approach: Try converting the problem from a regression task to a classification problem. Instead of predicting continuous returns, categorize outputs into classes such as "Don't Buy" (Cash), "Buy" (1x), and "Leveraged Buy" (2x), and use these classes for signal conversion.
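
As a hypothetical illustration of the prediction-to-signal conversion in item 1 (the scale_factor name and its value are ours, not the project's):

```python
# Hypothetical prediction-to-allocation conversion: scale model outputs into
# the [0, 2] allocation range; scale_factor is the tunable parameter.
import numpy as np

def to_allocation(predictions: np.ndarray, scale_factor: float = 100.0) -> np.ndarray:
    # Center at 1.0 (fully invested), tilt by the predicted excess return,
    # and clip to the allowed [0, 2] leverage band.
    return np.clip(1.0 + scale_factor * predictions, 0.0, 2.0)
```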

8. References

  1. Gorishniy, Y., et al. "Revisiting Deep Learning Models for Tabular Data." arXiv:2106.11959 (2021). https://arxiv.org/abs/2106.11959
  2. Gu, S., Kelly, B., & Xiu, D. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies (2020).
  3. Campbell, J. Y., & Thompson, S. B. "Predicting Excess Stock Returns Out of Sample." The Review of Financial Studies (2008).
  4. Kelly, B., et al. "The Virtue of Complexity in Return Prediction." The Journal of Finance (2024).
  5. Li, J., et al. "Stock Prediction Based on Deep Learning and its Application in Pairs Trading." IEEE (2022).
  6. Chen, Qizhao. "Comparing Different Transformer Model Structures for Stock Prediction." arXiv. April 23, 2025. https://arxiv.org/abs/2504.16361
  7. Kaijian He et al., “Financial Time Series Forecasting with the Deep Learning Ensemble Model,” Mathematics 11, no. 4 (2023): 1054, https://doi.org/10.3390/math11041054
  8. "Temporal Convolutional Networks for the Classification of Satellite Image Time Series," IEEE Xplore. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9851776
  10. Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202 (2020).
  11. Foret, P., et al. "Sharpness-Aware Minimization for Efficiently Improving Generalization." arXiv:2010.01412 (2021).

Video here -> https://youtu.be/4uuSSBy_0ZM?si=7JGf0MOYgrbtNZZw
