Deep Learning Project Report: Hull Tactical Market Prediction

Team Members: Aryan Gosain, Kunal Ranjan, Rishit Anand, Varun SG

1. Executive Summary

This project aimed to develop a deep learning model to predict the daily excess returns of the S&P 500 index and generate daily portfolio allocation decisions. Originating from the "Hull Tactical Market Prediction" Kaggle competition, the project addresses the challenge of forecasting in a low signal-to-noise environment.

The team evaluated multiple architectures, including CNN-LSTM and Temporal Convolutional Networks (TCN), before developing a final Enhanced FT-Transformer (Feature Tokenizer Transformer). This model achieved a Sharpe Ratio of 1.699 over a 24-year validation period, significantly outperforming the S&P 500 benchmark (Sharpe ~0.67–1.05) and previous deep learning baselines.

2. Problem Statement

The objective was to predict the daily excess returns of the S&P 500 and output a daily portfolio allocation decision to maximize risk-adjusted returns (Sharpe Ratio).

  • Input Data: 8,990 trading days $\times$ 98 features.
  • Features: Market Dynamics ($M^*$), Macroeconomic ($E^*$), Interest Rates ($I^*$), Price/Valuation ($P^*$), Volatility ($V^*$), Sentiment ($S^*$), and Dummy variables ($D^*$).
  • Output: An allocation value between $[0, 2]$:
    • $0$: No investment (Cash/Risk-free).
    • $1$: Full investment in S&P 500.
    • $2$: 2x Leveraged investment.
  • Target Variable: market_forward_excess_returns (Forward returns relative to expectations, winsorized using Median Absolute Deviation).

3. Theoretical Challenges

Financial time series prediction presents unique challenges compared to domains like computer vision or NLP:

  1. Low Signal-to-Noise Ratio: The daily volatility of the S&P 500 (~1%) is roughly 25 times larger than the expected daily return (~0.04%). The "signal" is buried in "noise".
  2. The $R^2$ Paradox: In this domain, a negative Out-of-Sample $R^2$ does not necessarily imply a failed model. A model with low or negative $R^2$ can still generate significant economic profits (high Sharpe Ratio) if the timing variance is managed correctly.
  3. Efficient Market Hypothesis: "Shallow" models often outperform "deep" models because financial data lacks the high-fidelity signal required to train complex deep networks without overfitting.
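
To make the $R^2$ paradox concrete, here is a toy NumPy simulation of our own (not project code): a forecaster with the right directional signal but a badly miscalibrated scale produces a negative out-of-sample $R^2$, yet sizing positions in $[0, 2]$ with it still earns a positive Sharpe ratio.

```python
# Toy simulation (illustrative only): a forecaster with negative out-of-sample
# R^2 can still earn a positive Sharpe ratio when used for position sizing.
import numpy as np

rng = np.random.default_rng(0)
n = 5_000
signal = rng.normal(0, 1, n)                                   # hypothetical predictor
true_ret = 0.0004 + 0.0005 * signal + rng.normal(0, 0.01, n)   # ~1% daily vol, ~0.04% mean

pred = 0.002 * signal                           # right sign, badly miscalibrated scale
ss_res = np.sum((true_ret - pred) ** 2)
ss_tot = np.sum((true_ret - true_ret.mean()) ** 2)
r2 = 1 - ss_res / ss_tot                        # typically negative here

position = np.clip(1 + 100 * pred, 0, 2)        # map forecasts to [0, 2] allocations
strat_ret = position * true_ret
sharpe = np.sqrt(252) * strat_ret.mean() / strat_ret.std()
print(f"OOS R^2: {r2:.3f}, annualized Sharpe: {sharpe:.2f}")   # R^2 < 0, Sharpe > 0
```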

4. Methodology & Research-Based Architecture

4.1. Model Architectures

We experimented with several deep learning architectures, evolving from standard sequence models to specialized tabular transformers. Our choices were guided both by our own experimentation and by adapting architectures proposed in the financial-modelling literature to our use case.

A. CNN-LSTM Ensemble

Reference: Kaijian He et al., "Financial Time Series Forecasting with the Deep Learning Ensemble Model," Mathematics 11, no. 4 (2023): 1054 [7].

Approach: Inspired by the architecture proposed by He et al. [7], we combined the spatial feature extraction capabilities of Convolutional Neural Networks (CNNs) with the temporal sequence modeling of Long Short-Term Memory (LSTM) networks. We modified the design by training an ensemble of 3 versions and adding residual connections so the model can capture both linear and non-linear components of the signal.

  • Architecture:
    • Branch A (CNN-LSTM): A 1D Convolutional layer (64 filters, kernel size 3) extracts local short-term patterns (e.g., 3-day trends). This is followed by Batch Normalization and an LSTM layer (64 units) to capture longer-term dependencies.
    • Branch B (Residual Skip): A direct skip connection processes the raw input through a dense layer, preserving the original signal and aiding gradient flow.
    • Merge: The branches are concatenated and passed through dense layers to produce the final prediction.
  • Ensemble Strategy: To improve robustness, we trained an ensemble of 3 identical models with different random seeds and averaged their predictions.
  • Performance: Adjusted Sharpe Ratio: 0.696
    Figures: the architecture from [7], and our modified architecture (a minimal code sketch follows below).
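
A minimal PyTorch sketch of the two-branch design described above; the layer sizes follow the text, while the 64-day lookback, the dense-head sizes, and other details are our assumptions:

```python
# Minimal sketch of the two-branch CNN-LSTM with a residual skip connection.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, n_features: int, seq_len: int):
        super().__init__()
        # Branch A: Conv1d over time (64 filters, kernel 3) -> BatchNorm -> LSTM(64)
        self.conv = nn.Conv1d(n_features, 64, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm1d(64)
        self.lstm = nn.LSTM(64, 64, batch_first=True)
        # Branch B: residual skip on the flattened raw input (preserves the signal)
        self.skip = nn.Linear(n_features * seq_len, 64)
        self.head = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                        # x: (batch, seq_len, n_features)
        a = self.bn(self.conv(x.transpose(1, 2))).transpose(1, 2)
        a, _ = self.lstm(a)
        a = a[:, -1]                             # last LSTM hidden state
        b = self.skip(x.flatten(1))
        return self.head(torch.cat([a, b], dim=-1))

# Ensemble strategy: average predictions of 3 identically configured models
# trained with different random seeds.
models = [CNNLSTM(n_features=94, seq_len=64) for _ in range(3)]
x = torch.randn(8, 64, 94)
pred = torch.stack([m(x) for m in models]).mean(dim=0)   # (8, 1)
```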


B. Temporal Convolutional Network (TCN)

Reference: Li, Jianlong, Siyuan Wang, Zhihang Zhu, and Bingyan Han. "Stock Prediction Based on Deep Learning and its Application in Pairs Trading." In 2022 International Symposium on Networks, Computers and Communications (ISNCC), 1-6. IEEE, 2022. [5]

The TCN, inspired by the work presented in [5], uses dilated causal convolutions to capture long-range temporal dependencies without the computational overhead of recurrent networks. We chose it because TCNs have repeatedly been reported to outperform LSTMs for financial modelling in the recent literature.

  • Architecture:
    • Dilated Convolutions: The model uses a stack of residual blocks with exponentially increasing dilation factors ($d = 1, 2, 4, 8$). This allows the network to have a large receptive field, seeing 15+ timesteps into the past with only a few layers.
    • Causal Padding: Ensures that predictions at time $t$ depend only on inputs from time $t$ and earlier, preventing data leakage.
    • Configuration: 128 filters per layer, kernel size 2, 2 stacks.
  • Performance: Adjusted Sharpe Ratio: 0.749 (a sketch of one TCN residual block follows below)
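
A minimal PyTorch sketch of one dilated causal residual block under the stated configuration (128 filters, kernel size 2); the padding scheme and activation placement are our assumptions:

```python
# Minimal sketch of a TCN residual block with dilated causal convolutions.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels: int, dilation: int, kernel_size: int = 2):
        super().__init__()
        # Left-pad so the output at time t only sees inputs at times <= t (causal)
        self.pad = (kernel_size - 1) * dilation
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                        # x: (batch, channels, time)
        y = self.relu(self.conv1(nn.functional.pad(x, (self.pad, 0))))
        y = self.relu(self.conv2(nn.functional.pad(y, (self.pad, 0))))
        return self.relu(x + y)                  # residual connection

# Stack with exponentially increasing dilation factors
blocks = nn.Sequential(*[TCNBlock(128, d) for d in (1, 2, 4, 8)])
x = torch.randn(8, 128, 64)                     # (batch, filters, timesteps)
print(blocks(x).shape)                          # torch.Size([8, 128, 64])
```

With kernel size 2, two convolutions per block, and dilations $1, 2, 4, 8$, the receptive field spans $1 + 2(1+2+4+8) = 31$ timesteps, comfortably covering the 15+ steps mentioned above.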


C. Vanilla Transformer

Reference: Chen, Qizhao. "Comparing Different Transformer Model Structures for Stock Prediction." arXiv. April 23, 2025. [6]

We applied the standard Transformer architecture, inspired by [6] and originally designed for NLP, to financial time series, so that self-attention can relate every time step to every other within the lookback window.

  • Architecture:
    • Input Embedding: A linear projection maps the 94 input features to a d_model of 256.
    • Positional Encoding: Learnable embeddings are added to inject sequence order information.
    • Encoder: A stack of 4 TransformerEncoderLayers with 8 attention heads and a feedforward dimension of 512.
    • Attention Mechanism: Multi-head self-attention captures global temporal dependencies across the 64-day lookback window.
  • Performance: Adjusted Sharpe Ratio: 0.809
    Figures: the architecture from [6], and our architecture (a minimal code sketch follows below).
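
A minimal PyTorch sketch matching the stated configuration (d_model 256, 4 layers, 8 heads, feedforward 512, 64-day lookback); reading the prediction off the last position is our assumption:

```python
# Minimal sketch of the vanilla Transformer encoder for time series.
import torch
import torch.nn as nn

class TimeSeriesTransformer(nn.Module):
    def __init__(self, n_features=94, d_model=256, n_heads=8, n_layers=4,
                 d_ff=512, seq_len=64):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)                 # input embedding
        self.pos = nn.Parameter(torch.zeros(1, seq_len, d_model))   # learnable positional encoding
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=d_ff, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, x):                        # x: (batch, 64, 94)
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h[:, -1])               # predict from the final time step

model = TimeSeriesTransformer()
print(model(torch.randn(8, 64, 94)).shape)       # torch.Size([8, 1])
```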


4.2. Feature Engineering Pipeline

Before feeding data into our final transformer models, we implemented a rigorous preprocessing pipeline to expand and select the most relevant features.

  1. Interaction Feature Generation: We expanded the original 94 features by calculating pairwise interactions (Addition, Subtraction, Multiplication) for all feature pairs ($f_1+f_2, f_1-f_2, f_1 \times f_2$). This resulted in a massive set of 13,207 features.
  2. Selection via XGBoost: To handle this high dimensionality, we trained an XGBoost Regressor on the expanded set to calculate gain-based feature importance.
  3. Filtering: We selected the top 150 features based on importance scores. This subset was used as the input for both FT-Transformer models (a sketch of the pipeline follows below).
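
A sketch of this expansion-and-selection pipeline; the dummy data, column naming, and XGBoost settings are stand-ins, not the project's exact code:

```python
# Sketch of the interaction-feature expansion and gain-based XGBoost selection.
import numpy as np
import pandas as pd
from itertools import combinations
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
features_df = pd.DataFrame(rng.normal(size=(1000, 94)),
                           columns=[f"f{i}" for i in range(94)])
y = rng.normal(size=1000)   # stand-in for market_forward_excess_returns

def expand_interactions(df: pd.DataFrame) -> pd.DataFrame:
    out = {c: df[c] for c in df.columns}          # keep the 94 originals
    for f1, f2 in combinations(df.columns, 2):    # C(94, 2) = 4,371 pairs
        out[f"{f1}+{f2}"] = df[f1] + df[f2]
        out[f"{f1}-{f2}"] = df[f1] - df[f2]
        out[f"{f1}*{f2}"] = df[f1] * df[f2]
    return pd.DataFrame(out)                      # 94 + 3 * 4,371 = 13,207 columns

X = expand_interactions(features_df)
model = XGBRegressor(n_estimators=100, importance_type="gain")
model.fit(X, y)
top150 = X.columns[np.argsort(model.feature_importances_)[::-1][:150]]
```

Note that the count checks out: 94 original features plus 3 interactions for each of the $\binom{94}{2} = 4{,}371$ pairs gives exactly 13,207 features.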

4.3. FT-Transformers

Reference: Gorishniy, Yury, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. "Revisiting Deep Learning Models for Tabular Data." arXiv:2106.11959 (2021). [1]

The core of our final solution is the Feature Tokenizer Transformer (FT-Transformer) [1], which is specifically designed to handle tabular data more effectively than standard MLPs or ResNets. We modified this architecture with the enhancements described in Section 4.3.B, on top of the preprocessing pipeline from Section 4.2.

A. Baseline FT-Transformer

  • Feature Tokenization: Unlike standard transformers that treat time steps as tokens, the FT-Transformer treats each feature as a separate token. Each of the 150 features is projected into a 128-dimensional embedding space.
  • Architecture:
    • 3 Transformer Encoder layers.
    • ReGLU (Rectified Gated Linear Unit) activation in the feed-forward networks.
    • [CLS] Token: A special token is appended to aggregate information from all features for the final prediction.
  • Performance: Adjusted Sharpe Ratio: 0.684

Figures: the architecture from [1], and our architecture (a minimal sketch of the feature tokenizer follows below).
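
A minimal sketch of the feature-tokenization idea (per-feature learned embeddings plus a [CLS] token); we use PyTorch's stock encoder layer here, so the ReGLU feed-forward from [1] is simplified to the default activation, and the head count is an assumption:

```python
# Minimal sketch of FT-Transformer-style feature tokenization: each of the
# 150 features becomes its own token via a learned per-feature embedding.
import torch
import torch.nn as nn

class FeatureTokenizer(nn.Module):
    def __init__(self, n_features=150, d_token=128):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_features, d_token) * 0.02)
        self.bias = nn.Parameter(torch.zeros(n_features, d_token))
        self.cls = nn.Parameter(torch.zeros(1, 1, d_token))   # [CLS] aggregation token

    def forward(self, x):                        # x: (batch, n_features)
        tokens = x.unsqueeze(-1) * self.weight + self.bias    # (batch, 150, 128)
        cls = self.cls.expand(x.size(0), -1, -1)
        return torch.cat([cls, tokens], dim=1)   # (batch, 151, 128)

tok = FeatureTokenizer()
layer = nn.TransformerEncoderLayer(128, 8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)          # 3 encoder layers
h = encoder(tok(torch.randn(8, 150)))
prediction_input = h[:, 0]                       # prediction is read off the [CLS] token
```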

B. Enhanced FT-Transformer (Final Model)

We significantly improved the baseline model by introducing three key enhancements:

  1. PLR Embeddings (Periodic Linear Representations):

    • Instead of simple linear projections, we used Fourier Features [9] to embed continuous values.
    • Formula: $\varphi(x_i) = [\sin(2\pi x_i \cdot v_i), \cos(2\pi x_i \cdot v_i)]$
    • This maps scalar inputs into a higher-dimensional space, allowing the model to capture high-frequency periodic patterns often missed by standard linear embeddings.
  2. SwiGLU Activation:

    • We replaced ReGLU with SwiGLU (Swish-Gated Linear Unit) [10], popularized by LLaMA.
    • Formula: $\text{SwiGLU}(x) = (x W_1) \odot \text{Swish}(x W_2)$
    • SwiGLU provides smoother gradients and better optimization stability compared to ReLU-based activations.
  3. SAM Optimizer (Sharpness-Aware Minimization):

    • We trained the model using SAM [11], which minimizes both the loss value and the sharpness of the loss landscape.
    • By finding a "flat minimum," the model generalizes significantly better to unseen, noisy financial data, reducing the risk of overfitting.
  • Performance: Adjusted Sharpe Ratio: 1.699 (sketches of these components follow below)
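
Minimal sketches of the first two enhancements; the number of frequencies and the down-projection in the SwiGLU block are our assumptions:

```python
# Sketches of the PLR (periodic) embedding and the SwiGLU feed-forward block.
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Periodic embedding: phi(x) = [sin(2*pi*v*x), cos(2*pi*v*x)], v learnable."""
    def __init__(self, n_features: int, n_frequencies: int = 8):
        super().__init__()
        self.freq = nn.Parameter(torch.randn(n_features, n_frequencies))

    def forward(self, x):                        # x: (batch, n_features)
        angles = 2 * torch.pi * x.unsqueeze(-1) * self.freq
        return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward: (x W1) * Swish(x W2), then project back to d_model."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_model, d_ff, bias=False)
        self.w3 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.w3(self.w1(x) * nn.functional.silu(self.w2(x)))  # silu == Swish

emb = PLREmbedding(150)(torch.randn(8, 150))     # (8, 150, 16)
out = SwiGLU(128, 256)(torch.randn(8, 151, 128)) # (8, 151, 128)
```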
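And a simplified single-batch SAM step, following the two-step ascent/descent scheme of [11]; real implementations typically handle per-layer gradient scaling and closures more carefully:

```python
# Simplified SAM training step: ascend to a nearby "sharp" point in weight
# space, compute gradients there, then update the original weights with them.
import torch

def sam_step(model, loss_fn, x, y, base_opt, rho=0.05):
    # 1) gradients at the current weights
    loss_fn(model(x), y).backward()
    grads = [p.grad for p in model.parameters() if p.grad is not None]
    norm = torch.norm(torch.stack([g.norm() for g in grads]))
    # 2) perturb weights along the gradient direction (worst-case ascent)
    eps = []
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            e = rho * p.grad / (norm + 1e-12)
            p.add_(e)
            eps.append((p, e))
    model.zero_grad()
    # 3) gradients at the perturbed weights drive the actual update
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in eps:
            p.sub_(e)                            # restore the original weights
    base_opt.step()                              # descend using the perturbed gradients
    base_opt.zero_grad()

# usage: opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
#        sam_step(model, torch.nn.MSELoss(), x_batch, y_batch, opt)
```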

5. Experimentation & Results

5.1. Evaluation Metric

The primary metric used to evaluate model performance was the Sharpe Ratio, which measures the excess return of an investment per unit of risk (volatility).

$$Sharpe~Ratio = \frac{E[R - R_f]}{\sigma}$$

  • Interpretation:
    • > 1.0: Generally considered "good".
    • > 2.0: Considered "very good" to "outstanding".
    • A higher ratio indicates that the strategy performs better relative to the risk taken.
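
For reference, a minimal computation of the annualized Sharpe ratio from daily excess returns (the adjusted variant reported in our results builds on this core formula):

```python
# Annualized Sharpe ratio from daily excess returns, using the standard
# sqrt(252)-trading-day annualization convention.
import numpy as np

def sharpe_ratio(daily_excess_returns: np.ndarray) -> float:
    """E[R - Rf] / sigma, annualized with sqrt(252)."""
    return np.sqrt(252) * daily_excess_returns.mean() / daily_excess_returns.std()
```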

5.2. Comparative Performance

We evaluated several architectures to establish a baseline before finalizing the FT-Transformer.

| Model | Description | Sharpe Ratio |
| --- | --- | --- |
| CNN-LSTM | Hybrid spatial-temporal model with residual connections. | 0.696 |
| TCN | Temporal Convolutional Network with dilated convolutions. | 0.749 |
| Vanilla Transformer | Standard Transformer Encoder for time series. | 0.809 |
| Enhanced FT-Transformer | Final model: PLR embeddings + SwiGLU + SAM optimizer. | 1.699 |

5.3. Final Results Analysis

The final Enhanced FT-Transformer achieved a Sharpe Ratio of 1.699 over the 24-year validation period.

  • Economic Interpretation: A Sharpe ratio of 1.699 means the strategy earned 1.699 units of annualized excess return (over a risk-free benchmark such as government bonds) per unit of volatility, i.e., its returns comfortably compensated for the risk taken.

A. Benchmarking Against S&P 500

To validate the model's effectiveness, we compared it against the historical Sharpe Ratios of the S&P 500 index over multiple time horizons.

Key Finding: Our model outperformed the long-term market average by ~2.5x and the recent bull market by ~1.6x.

| Time Horizon | S&P 500 Sharpe | Context |
| --- | --- | --- |
| 30 Years | 0.67 | The long-term market average. |
| 10 Years | 0.72 | The market average over the last decade. |
| 3 Years | 1.05 | Recent "bull market" conditions. |

Note: For reference, our baseline CNN-LSTM model was only able to outperform the long-term market average by approximately 1.04x, highlighting the significant jump in performance achieved by the FT-Transformer.

B. Real-World Viability & Context

It is crucial to contextualize these results within the constraints of real-world finance versus theoretical competition metrics.

  • Kaggle vs. Reality: While some models on the competition leaderboard achieved Sharpe scores in the range of 5–17, such models likely relied on "pure math" exploits of the metric, look-ahead bias, or direct optimization of the Sharpe ratio rather than genuine predictive signal.
  • Realistic Upper Bounds: In the real world, it is considered next to impossible for Sharpe ratios to cross 3.0 over a sustained period.
  • Industry Benchmark: For a realistic comparison, the highest-scoring Multi-Asset Fund in India currently holds a Sharpe Ratio of 1.88 (over the last three years).
  • Conclusion: Our model's score of 1.699 places it in a highly competitive bracket for a fully automated strategy. The results reinforce the finding that shallow models (or carefully regularized ones) often prove to be better than deep models for financial data due to the noise present in the dataset.

6. Lessons Learned

  1. Shallow vs. Deep: Contrary to fields like Computer Vision, "shallow" models (or carefully regularized deep models like FT-Transformer) often perform better in finance due to the extreme noise.
  2. Feature Engineering is Critical: The jump in performance was largely driven by the interaction features and the specific embedding of continuous variables (PLR), rather than just increasing model depth.
  3. Optimization Matters: Using SAM to smooth the loss landscape was essential for generalization in a noisy domain.
  4. Target Scaling: Scaling or normalizing the target variable proved helpful when loss values were too small to drive gradient updates.

7. Future Work & Improvements

  1. Allocation Scaling: Experiment with the scaling factor used to convert predictions into portfolio allocations (prediction-to-signal conversion) to optimize risk-adjusted returns; a hypothetical sketch follows after this list.
  2. Classification Approach: Try converting the problem from a regression task to a classification problem. Instead of predicting continuous returns, categorize outputs into classes such as "Don't Buy" (Cash), "Buy" (1x), and "Leveraged Buy" (2x), and use these classes for signal conversion.
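
As a hypothetical illustration of the prediction-to-signal conversion in item 1 (the scale_factor name and its value are ours, not the project's):

```python
# Hypothetical prediction-to-allocation conversion: scale model outputs into
# the [0, 2] allocation range; scale_factor is the tunable parameter.
import numpy as np

def to_allocation(predictions: np.ndarray, scale_factor: float = 100.0) -> np.ndarray:
    # Center at 1.0 (fully invested), tilt by the predicted excess return,
    # and clip to the allowed [0, 2] leverage band.
    return np.clip(1.0 + scale_factor * predictions, 0.0, 2.0)
```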

8. References

  1. Gorishniy, Y., et al. "Revisiting Deep Learning Models for Tabular Data." arXiv:2106.11959 (2021). https://arxiv.org/abs/2106.11959
  2. Gu, S., Kelly, B., & Xiu, D. "Empirical Asset Pricing via Machine Learning." The Review of Financial Studies (2020).
  3. Campbell, J. Y., & Thompson, S. B. "Predicting Excess Stock Returns Out of Sample." The Review of Financial Studies (2008).
  4. Kelly, B., et al. "The Virtue of Complexity in Return Prediction." The Journal of Finance (2024).
  5. Li, J., et al. "Stock Prediction Based on Deep Learning and its Application in Pairs Trading." IEEE (2022).
  6. Chen, Qizhao. "Comparing Different Transformer Model Structures for Stock Prediction." arXiv. April 23, 2025. https://arxiv.org/abs/2504.16361
  7. Kaijian He et al., “Financial Time Series Forecasting with the Deep Learning Ensemble Model,” Mathematics 11, no. 4 (2023): 1054, https://doi.org/10.3390/math11041054
  8. "Temporal Convolutional Networks for the Classification of Satellite Image Time Series," IEEE Xplore. https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9851776
  10. Shazeer, N. "GLU Variants Improve Transformer." arXiv:2002.05202 (2020).
  11. Foret, P., et al. "Sharpness-Aware Minimization for Efficiently Improving Generalization." arXiv:2010.01412 (2021).

Video here -> https://youtu.be/4uuSSBy_0ZM?si=7JGf0MOYgrbtNZZw
