This report presents a comprehensive analysis of the Graph Attention Long Short-Term Memory (GAT-LSTM) architecture for spatiotemporal wheat price forecasting. The model combines Graph Attention Networks (GATs) for capturing spatial dependencies between agricultural markets (mandis) with Long Short-Term Memory (LSTM) networks for temporal sequence modeling. We detail the mathematical foundations, architectural components, training methodology, and theoretical justifications for design decisions.
Objective: Predict future price sequences for agricultural commodities across multiple spatially-distributed markets.
Input:
- Historical price sequence: $\mathbf{X} \in \mathbb{R}^{T \times d}$, where $T = 60$ days and $d = 6$ features
- Mandi identifier: $m \in \{1, 2, ..., M\}$, where $M$ is the total number of mandis
- Crop identifier: $c \in \{1, 2, ..., C\}$, where $C$ is the total number of crops
- Spatial graph: $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $|\mathcal{V}| = M$ nodes
Output:
- Future price sequence: $\mathbf{Y} \in \mathbb{R}^{H}$, where $H = 10$ days is the forecast horizon
Mathematical Formulation:

$$\hat{\mathbf{Y}} = f_\theta(\mathbf{X}, m, c, \mathcal{G})$$

where $f_\theta$ is the GAT-LSTM model with learnable parameters $\theta$. Learning $f_\theta$ must address several challenges:
- Temporal Dependencies: Prices exhibit autocorrelation, trends, and seasonality
- Spatial Dependencies: Nearby mandis influence each other through trade and arbitrage
- Multi-step Forecasting: Predicting sequences (not single points) with compounding uncertainty
- Heterogeneity: Different mandis and crops have distinct characteristics
The GAT-LSTM architecture consists of four main components:
Input → Embeddings → Spatial Encoding (GAT) → Temporal Encoding (LSTM Encoder)
→ Multi-step Prediction (LSTM Decoder) → Output
Purpose: Transform discrete identifiers into continuous vector representations.
Mandi Embedding:

$$\mathbf{e}_m = \mathbf{E}_{\text{mandi}}[m] \in \mathbb{R}^{d_m}$$

where $\mathbf{E}_{\text{mandi}} \in \mathbb{R}^{M \times d_m}$ is a learnable embedding matrix and $d_m = 32$.

Crop Embedding:

$$\mathbf{e}_c = \mathbf{E}_{\text{crop}}[c] \in \mathbb{R}^{d_c}$$

where $\mathbf{E}_{\text{crop}} \in \mathbb{R}^{C \times d_c}$ is a learnable embedding matrix and $d_c = 16$.

Rationale: Embeddings allow the model to learn that similar mandis (e.g., geographically close, with similar climate) should have similar representations. This is more expressive than one-hot encoding.

Combined Embedding:

$$\mathbf{e} = [\mathbf{e}_m \,\|\, \mathbf{e}_c] \in \mathbb{R}^{d_m + d_c}$$

where $\|$ denotes concatenation, giving a 48-dimensional static context vector per sample.
Input Features (per timestep $t$):
- $p_t$: Price (Imp_Price)
- $month_t$: Month (1-12)
- $dow_t$: Day of week (0-6)
- $doy_t$: Day of year (1-365)
- $temp_t$: Mean temperature
- $rain_t$: Mean rainfall

Feature Vector:

$$\mathbf{x}_t = [p_t, month_t, dow_t, doy_t, temp_t, rain_t] \in \mathbb{R}^{6}$$

Sequence:

$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T] \in \mathbb{R}^{T \times 6}$$

Combined Representation (broadcast embedding to all timesteps):

$$\tilde{\mathbf{x}}_t = [\mathbf{x}_t \,\|\, \mathbf{e}] \in \mathbb{R}^{54}$$
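As a concrete illustration, here is a minimal PyTorch sketch of the embedding lookup and broadcast concatenation; module and variable names are illustrative, not the project's actual identifiers:

```python
import torch
import torch.nn as nn

M, C = 520, 1            # number of mandis / crops
T, d = 60, 6             # sequence length / features per timestep

mandi_emb = nn.Embedding(M, 32)   # E_mandi in R^{M x 32}
crop_emb = nn.Embedding(C, 16)    # E_crop in R^{C x 16}

def combine(x, mandi_id, crop_id):
    """x: (batch, T, d) features; mandi_id, crop_id: (batch,) integer indices."""
    e = torch.cat([mandi_emb(mandi_id), crop_emb(crop_id)], dim=-1)   # (batch, 48)
    e = e.unsqueeze(1).expand(-1, x.size(1), -1)                      # broadcast to (batch, T, 48)
    return torch.cat([x, e], dim=-1)                                  # (batch, T, 54)

x = torch.randn(8, T, d)
mandi_id = torch.randint(0, M, (8,))
crop_id = torch.zeros(8, dtype=torch.long)
print(combine(x, mandi_id, crop_id).shape)   # torch.Size([8, 60, 54])
```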
Purpose: Process temporal sequence and extract temporal patterns.
An LSTM cell at timestep $t$ performs the following computations.

Forget Gate (what to forget from previous memory):

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \tilde{\mathbf{x}}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$$

Input Gate (what new information to store):

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \tilde{\mathbf{x}}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$$

Candidate Memory (new information):

$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \tilde{\mathbf{x}}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)$$

Cell State Update (combine old and new):

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

Output Gate (what to output):

$$\mathbf{o}_t = \sigma(\mathbf{W}_o \tilde{\mathbf{x}}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$$

Hidden State:

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

where:
- $\sigma(\cdot)$ is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
- $\odot$ is element-wise multiplication (Hadamard product)
- $\mathbf{W}_* \in \mathbb{R}^{d_h \times d_{in}}$ and $\mathbf{U}_* \in \mathbb{R}^{d_h \times d_h}$ are weight matrices
- $\mathbf{b}_* \in \mathbb{R}^{d_h}$ are bias vectors
- $d_h = 128$ is the hidden dimension
For a stacked encoder with $L = 2$ layers, the output of the first layer feeds the second.

Layer 1:

$$\mathbf{h}_t^{(1)}, \mathbf{c}_t^{(1)} = \text{LSTM}^{(1)}(\tilde{\mathbf{x}}_t, \mathbf{h}_{t-1}^{(1)}, \mathbf{c}_{t-1}^{(1)})$$

Layer 2:

$$\mathbf{h}_t^{(2)}, \mathbf{c}_t^{(2)} = \text{LSTM}^{(2)}(\mathbf{h}_t^{(1)}, \mathbf{h}_{t-1}^{(2)}, \mathbf{c}_{t-1}^{(2)})$$

Final Encoder Output:

$$\mathbf{H} = [\mathbf{h}_1^{(2)}, ..., \mathbf{h}_T^{(2)}] \in \mathbb{R}^{T \times d_h}$$

Final Hidden and Cell States: $(\mathbf{h}_T^{(1)}, \mathbf{h}_T^{(2)})$ and $(\mathbf{c}_T^{(1)}, \mathbf{c}_T^{(2)})$ are passed to the decoder as its initial states.
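A minimal PyTorch sketch of the two-layer encoder, assuming the combined 54-dimensional representation from above (names and shapes are illustrative):

```python
import torch
import torch.nn as nn

# Two-layer LSTM encoder over the combined representation (batch_first convention).
encoder = nn.LSTM(input_size=54, hidden_size=128, num_layers=2,
                  batch_first=True, dropout=0.2)

x_tilde = torch.randn(8, 60, 54)       # (batch, T, 54)
H, (h_T, c_T) = encoder(x_tilde)       # H: (batch, T, 128) -- all hidden states
# h_T, c_T: (num_layers, batch, 128) -- final hidden/cell state of each layer,
# used to initialize the decoder.
print(H.shape, h_T.shape, c_T.shape)
```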
Vanishing Gradient Problem: Traditional RNNs suffer from vanishing gradients when learning long-term dependencies, because backpropagation through time multiplies Jacobians across timesteps:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_t} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T} \prod_{k=t+1}^{T} \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}}$$

If $\left\| \frac{\partial \mathbf{h}_k}{\partial \mathbf{h}_{k-1}} \right\| < 1$, the product shrinks exponentially with sequence length (e.g., $0.9^{60} \approx 0.002$ over a 60-day window), so early timesteps receive almost no gradient signal.

LSTM Solution: The cell state $\mathbf{c}_t$ provides an additive update path, $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$, so $\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} \approx \mathbf{f}_t$.

Since the forget gate $\mathbf{f}_t$ can stay close to 1, gradients can flow across many timesteps without vanishing.
Purpose: Generate multi-step forecasts sequentially.
Initialization: The decoder starts from the encoder's final states,

$$\mathbf{h}_0^{\text{dec}} = \mathbf{h}_T^{\text{enc}}, \quad \mathbf{c}_0^{\text{dec}} = \mathbf{c}_T^{\text{enc}},$$

with the last observed price as the first decoder input, $\hat{y}^{(0)} = p_T$.

Autoregressive Generation (for $h = 1, ..., H$):

$$\mathbf{h}_h^{\text{dec}}, \mathbf{c}_h^{\text{dec}} = \text{LSTM}^{\text{dec}}(\hat{y}^{(h-1)}, \mathbf{h}_{h-1}^{\text{dec}}, \mathbf{c}_{h-1}^{\text{dec}})$$

$$\hat{y}^{(h)} = \mathbf{w}_o^\top \mathbf{h}_h^{\text{dec}} + b_o$$

where $\mathbf{w}_o \in \mathbb{R}^{d_h}$ and $b_o \in \mathbb{R}$ are the output layer parameters.

Final Predictions:

$$\hat{\mathbf{Y}} = [\hat{y}^{(1)}, \hat{y}^{(2)}, ..., \hat{y}^{(H)}] \in \mathbb{R}^{H}$$

Dependency Modeling: Future prices depend on previous predictions; each step conditions on the prefix $\hat{y}^{(1)}, ..., \hat{y}^{(h-1)}$ rather than being predicted independently.

Error Propagation: While autoregressive models can accumulate errors over the horizon, they capture sequential dependencies better than independent per-step predictions.
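A hedged sketch of the autoregressive decoding loop in PyTorch, assuming the decoder is fed only the previous (scaled) price and is initialized with the encoder's final states; variable names are illustrative:

```python
import torch
import torch.nn as nn

H_steps, d_h = 10, 128
decoder = nn.LSTM(input_size=1, hidden_size=d_h, num_layers=2, batch_first=True)
out_layer = nn.Linear(d_h, 1)          # 128 * 1 + 1 = 129 parameters

def forecast(last_price, h_T, c_T):
    """last_price: (batch, 1) scaled price p_T; h_T, c_T: (2, batch, 128) encoder states."""
    y_prev, state, preds = last_price, (h_T, c_T), []
    for _ in range(H_steps):
        out, state = decoder(y_prev.unsqueeze(1), state)   # feed previous prediction
        y_prev = out_layer(out[:, -1, :])                  # (batch, 1) next-step price
        preds.append(y_prev)
    return torch.cat(preds, dim=1)                         # (batch, H)

h_T = torch.zeros(2, 8, d_h)
c_T = torch.zeros(2, 8, d_h)
print(forecast(torch.randn(8, 1), h_T, c_T).shape)         # torch.Size([8, 10])
```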
Note: The spatial component is simplified in the current implementation, but the design is as follows.

k-Nearest Neighbors Graph: For each mandi $i$, create edges to its $k$ nearest mandis by geographic distance.

Haversine Distance (great-circle distance):

$$d_{ij} = 2R \arcsin\left(\sqrt{\sin^2\left(\tfrac{\phi_j - \phi_i}{2}\right) + \cos\phi_i \cos\phi_j \sin^2\left(\tfrac{\lambda_j - \lambda_i}{2}\right)}\right)$$

where $R \approx 6371$ km is the Earth's radius, and $\phi$, $\lambda$ are latitude and longitude in radians.

Edge Set:

$$\mathcal{E} = \{(i, j) : j \in \text{kNN}(i)\}$$

Edge Weights (inverse distance):

$$w_{ij} = \frac{1}{d_{ij} + \epsilon}$$

Normalized:

$$\tilde{w}_{ij} = \frac{w_{ij}}{\sum_{k \in \mathcal{N}(i)} w_{ik}}$$
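The graph construction above can be sketched in a few lines of NumPy; the choice $k = 5$ and the sample coordinates below are purely illustrative, as the report does not fix them:

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, R=6371.0):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = np.radians(lat1), np.radians(lat2)
    dphi, dlmb = np.radians(lat2 - lat1), np.radians(lon2 - lon1)
    a = np.sin(dphi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(dlmb / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

def knn_edges(coords, k=5, eps=1e-6):
    """coords: (M, 2) array of (lat, lon); returns a list of (i, j, normalized weight)."""
    edges = []
    for i in range(len(coords)):
        dists = np.array([haversine(*coords[i], *coords[j]) for j in range(len(coords))])
        dists[i] = np.inf                      # exclude self-loops
        nbrs = np.argsort(dists)[:k]           # k nearest mandis
        w = 1.0 / (dists[nbrs] + eps)          # inverse-distance weights
        w = w / w.sum()                        # normalize over the neighborhood
        edges += [(i, int(j), float(wj)) for j, wj in zip(nbrs, w)]
    return edges

coords = np.array([[26.9, 75.8], [28.6, 77.2], [19.1, 72.9], [22.6, 88.4]])  # sample (lat, lon)
print(knn_edges(coords, k=2)[:4])
```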
Attention Coefficient (importance of node $j$ to node $i$):

$$e_{ij} = \text{LeakyReLU}\left(\mathbf{a}^\top [\mathbf{W}\mathbf{h}_i \,\|\, \mathbf{W}\mathbf{h}_j]\right)$$

where:
- $\mathbf{W} \in \mathbb{R}^{d' \times d}$ is a learnable transformation
- $\mathbf{a} \in \mathbb{R}^{2d'}$ is the attention mechanism
- $\|$ denotes concatenation

Normalized Attention (softmax over neighbors):

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$$

Aggregated Features:

$$\mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} \mathbf{W}\mathbf{h}_j\right)$$

K Attention Heads (outputs concatenated):

$$\mathbf{h}_i' = \big\Vert_{k=1}^{K}\, \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(k)} \mathbf{W}^{(k)}\mathbf{h}_j\right)$$

where $\alpha_{ij}^{(k)}$ and $\mathbf{W}^{(k)}$ are the attention coefficients and transformation of the $k$-th head.
Rationale: Multiple heads capture different types of spatial relationships (e.g., geographic proximity, trade volume, price correlation).
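Since the report notes the GAT is simplified in the current implementation, the following is only a minimal single-head sketch of the attention equations above, using a dense adjacency mask and an illustrative class name; a multi-head layer would concatenate the outputs of several such heads:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GATHead(nn.Module):
    """Single attention head over a dense adjacency mask (illustrative only)."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.W = nn.Linear(d_in, d_out, bias=False)    # shared transformation W
        self.a = nn.Linear(2 * d_out, 1, bias=False)   # attention vector a

    def forward(self, h, adj):
        """h: (M, d_in) node features; adj: (M, M) mask with self-loops included."""
        z = self.W(h)                                  # (M, d_out)
        n = z.size(0)
        # e_ij = LeakyReLU(a^T [W h_i || W h_j]) for every pair (i, j)
        pairs = torch.cat([z.unsqueeze(1).expand(-1, n, -1),
                           z.unsqueeze(0).expand(n, -1, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1), negative_slope=0.2)
        e = e.masked_fill(adj == 0, float("-inf"))     # attend only to neighbors
        alpha = torch.softmax(e, dim=-1)               # softmax over N(i)
        return F.elu(alpha @ z)                        # aggregated features h_i'

h = torch.randn(4, 16)                                 # 4 mandis, 16-dim features
adj = torch.eye(4) + torch.ones(4, 4).tril()           # toy neighborhood with self-loops
print(GATHead(16, 8)(h, adj).shape)                    # torch.Size([4, 8])
```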
Mean Squared Error (MSE):

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N H} \sum_{n=1}^{N} \sum_{h=1}^{H} \left(y_n^{(h)} - \hat{y}_n^{(h)}\right)^2$$

where:
- $N$ is the batch size
- $H = 10$ is the forecast horizon
- $y_n^{(h)}$ is the actual price at horizon $h$ for sample $n$
- $\hat{y}_n^{(h)}$ is the predicted price
Why MSE?
- Differentiable (required for gradient descent)
- Penalizes large errors more (quadratic)
- Corresponds to Gaussian likelihood assumption
Adam Optimizer:

Parameter Update:

$$\theta_{t+1} = \theta_t - \alpha \frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon}$$

where $\mathbf{g}_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient and:

First Moment Estimate (mean of gradients):

$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1)\,\mathbf{g}_t$$

Second Moment Estimate (variance of gradients):

$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2)\,\mathbf{g}_t^2$$

Bias Correction:

$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1 - \beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1 - \beta_2^t}$$

Hyperparameters:
- Learning rate: $\alpha = 10^{-3}$
- $\beta_1 = 0.9$ (momentum term)
- $\beta_2 = 0.999$ (RMSprop-style scaling)
- $\epsilon = 10^{-8}$ (numerical stability)
Why Adam?
- Adaptive learning rates per parameter
- Combines momentum (SGD) and RMSprop
- Works well for sparse gradients (embeddings)
Modified Loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MSE}} + \lambda \sum_{i} \theta_i^2$$

where $\lambda$ is the weight decay (L2 regularization) coefficient.

Effect: Prevents weights from growing too large, reducing overfitting.
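In PyTorch, weight decay is typically passed directly to the optimizer rather than added to the loss by hand; a sketch with the Adam hyperparameters listed above (the model is a placeholder, and the weight-decay value is assumed because the report does not state it):

```python
import torch
import torch.nn as nn

model = nn.LSTM(54, 128, num_layers=2)   # placeholder; stands in for the full GAT-LSTM

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,                 # alpha
    betas=(0.9, 0.999),      # beta_1, beta_2
    eps=1e-8,
    weight_decay=1e-5,       # lambda -- assumed value; the report does not state it
)
```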
Training: Randomly zero out activations with probability $p = 0.2$:

$$\tilde{\mathbf{h}} = \mathbf{m} \odot \mathbf{h}, \quad \mathbf{m} \sim \text{Bernoulli}(1 - p)$$

Inference: Scale activations by the keep probability:

$$\tilde{\mathbf{h}} = (1 - p)\,\mathbf{h}$$

Why Dropout?
- Prevents co-adaptation of neurons
- Ensemble effect (implicitly averages over $2^n$ thinned sub-networks)
Clip Gradient Norm:

$$\mathbf{g} \leftarrow \mathbf{g} \cdot \min\left(1, \frac{\tau}{\|\mathbf{g}\|_2}\right)$$

where $\mathbf{g}$ is the gradient vector and $\tau$ is the maximum allowed norm.

Why? Prevents exploding gradients in RNNs.
ReduceLROnPlateau:

If the validation loss does not improve for a set number of epochs (the scheduler patience), reduce the learning rate:

$$\alpha \leftarrow \gamma \cdot \alpha$$

where $\gamma < 1$ is the reduction factor.

Rationale: Fine-tune learning as the model converges.

Criterion: Stop training if the validation loss does not improve for a fixed number of consecutive epochs (the early-stopping patience).
Why? Prevents overfitting and saves computation.
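A schematic training loop combining gradient clipping, ReduceLROnPlateau, and early stopping; the toy model, data, patience, clipping threshold, and scheduler factor are all placeholders, since the report does not specify these values:

```python
import torch
import torch.nn as nn
from torch.nn.utils import clip_grad_norm_

# Toy stand-ins so the loop runs end to end; the real model and data come from the pipeline.
model = nn.Sequential(nn.Linear(54, 128), nn.ReLU(), nn.Linear(128, 10))
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5, patience=5)

x_tr, y_tr = torch.randn(256, 54), torch.randn(256, 10)
x_va, y_va = torch.randn(64, 54), torch.randn(64, 10)

best_val, patience, bad_epochs = float("inf"), 10, 0    # early-stopping bookkeeping
for epoch in range(100):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_tr), y_tr)
    loss.backward()
    clip_grad_norm_(model.parameters(), max_norm=1.0)   # gradient clipping
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = criterion(model(x_va), y_va).item()
    scheduler.step(val_loss)                            # ReduceLROnPlateau
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                      # early stopping
            break
```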
StandardScaler (z-score normalization):

$$x' = \frac{x - \mu}{\sigma}$$

where $\mu$ and $\sigma$ are the mean and standard deviation computed on the training set.
Why?
- Ensures all features have similar scales
- Improves gradient descent convergence
- Prevents features with large magnitudes from dominating
Applied to:
- Input features: $\mathbf{X}$
- Target prices: $\mathbf{Y}$

Denormalization (for predictions):

$$\hat{y} = \hat{y}' \cdot \sigma_y + \mu_y$$

where $\hat{y}'$ is the model output in normalized space and $\mu_y$, $\sigma_y$ are the target statistics from the training set.
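A short scikit-learn sketch of fitting the scalers on training data and inverting the transform on predictions (the arrays are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit scalers on the training split only, then reuse them for val/test and
# for mapping predictions back to price units.
X_train = np.random.rand(1000, 6)           # placeholder feature matrix
y_train = np.random.rand(1000, 10) * 2000   # placeholder target prices

x_scaler = StandardScaler().fit(X_train)
y_scaler = StandardScaler().fit(y_train)

X_scaled = x_scaler.transform(X_train)
y_scaled = y_scaler.transform(y_train)

# Denormalization of model outputs: y_hat = y_hat' * sigma_y + mu_y
y_pred_scaled = np.zeros((5, 10))           # stand-in model predictions
y_pred = y_scaler.inverse_transform(y_pred_scaled)
print(y_pred.shape)
```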
Train/Validation/Test Split: 80% / 10% / 10% by date
Why Temporal? Prevents data leakage - model shouldn't see future data during training.
Critical: Sort by date before splitting, so that all training samples strictly precede the validation and test samples.
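A minimal pandas sketch of the chronological 80/10/10 split (placeholder data):

```python
import pandas as pd

# Chronological 80/10/10 split -- sort first so later dates never leak into training.
df = pd.DataFrame({"date": pd.date_range("2015-01-01", periods=1000),
                   "price": range(1000)})            # placeholder data
df = df.sort_values("date").reset_index(drop=True)

n = len(df)
train = df.iloc[: int(0.8 * n)]
val = df.iloc[int(0.8 * n): int(0.9 * n)]
test = df.iloc[int(0.9 * n):]
print(len(train), len(val), len(test))               # 800 100 100
```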
Embeddings:
- Mandi: $M \times 32$
- Crop: $C \times 16$
LSTM Encoder (per layer):
- Input to hidden: $4 \times (d_{\text{in}} \times d_h + d_h \times d_h + d_h)$
- Layer 1: $4 \times (54 \times 128 + 128 \times 128 + 128) = 94,720$
- Layer 2: $4 \times (128 \times 128 + 128 \times 128 + 128) = 66,048$
LSTM Decoder (2 layers):
- Similar to the encoder: $2 \times 66,048 = 132,096$
Output Layer: $128 \times 1 + 1 = 129$
Total: ~597,745 parameters (for $M = 520$ mandis and $C = 1$ crop).
LSTM Forward Pass:

$$O(T \cdot d_h \cdot (d_{in} + d_h)) \text{ per sequence}$$

Training Complexity (per epoch):

$$O(N_{\text{train}} \cdot T \cdot d_h \cdot (d_{in} + d_h))$$

where $N_{\text{train}}$ is the number of training sequences.

GPU Speedup: ~10-15x due to parallelization across the batch dimension.
Theorem (Cybenko, 1989): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$, under mild assumptions on the activation function.

Implication: Our multi-layer LSTM can, in principle, approximate any continuous temporal mapping.
Encoder-Decoder Framework (Sutskever et al., 2014):

Encoder: Compresses the input sequence $\mathbf{X}$ into fixed-size context vectors $(\mathbf{h}_T, \mathbf{c}_T)$.

Decoder: Generates the output sequence $\hat{\mathbf{Y}}$ step by step, conditioned on that context.

Motivation (Bahdanau et al., 2015): The fixed-size encoding is a bottleneck that limits performance on long sequences.

Solution: Attention allows the decoder to focus on relevant parts of the input at each output step:

$$\mathbf{c}_h = \sum_{t=1}^{T} \alpha_{ht} \mathbf{h}_t$$

where $\alpha_{ht}$ are attention weights obtained by a softmax over alignment scores between the decoder state at step $h$ and each encoder hidden state $\mathbf{h}_t$.
Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{N H} \sum_{n=1}^{N} \sum_{h=1}^{H} \left(y_n^{(h)} - \hat{y}_n^{(h)}\right)^2$$

Root Mean Squared Error (RMSE):

$$\text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{N H} \sum_{n=1}^{N} \sum_{h=1}^{H} \left|y_n^{(h)} - \hat{y}_n^{(h)}\right|$$

Mean Absolute Percentage Error (MAPE):

$$\text{MAPE} = \frac{100\%}{N H} \sum_{n=1}^{N} \sum_{h=1}^{H} \left|\frac{y_n^{(h)} - \hat{y}_n^{(h)}}{y_n^{(h)}}\right|$$

R² Score (Coefficient of Determination):

$$R^2 = 1 - \frac{\sum_{n,h} \left(y_n^{(h)} - \hat{y}_n^{(h)}\right)^2}{\sum_{n,h} \left(y_n^{(h)} - \bar{y}\right)^2}$$

where $\bar{y}$ is the mean of the actual prices.
Interpretation:
- $R^2 = 1$: Perfect predictions
- $R^2 = 0$: Model performs as well as predicting the mean
- $R^2 < 0$: Model is worse than predicting the mean
Per-Day RMSE:

$$\text{RMSE}^{(h)} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} \left(y_n^{(h)} - \hat{y}_n^{(h)}\right)^2}$$

Expected Behavior:

$$\text{RMSE}^{(1)} \le \text{RMSE}^{(2)} \le \dots \le \text{RMSE}^{(H)}$$

(Uncertainty compounds over the forecast horizon.)
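The metrics above can be computed directly with NumPy; a small sketch with placeholder data and a hypothetical `evaluation_metrics` helper:

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """y_true, y_pred: (N, H) arrays of actual and predicted prices."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = 100.0 * np.mean(np.abs(err) / np.abs(y_true))    # assumes no zero prices
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    per_day_rmse = np.sqrt(np.mean(err ** 2, axis=0))       # RMSE^(h), one value per horizon step
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape,
            "R2": r2, "per_day_RMSE": per_day_rmse}

y_true = 1500 + 2000 * np.random.rand(100, 10)              # placeholder prices
y_pred = y_true + 50 * np.random.randn(100, 10)
print(evaluation_metrics(y_true, y_pred)["per_day_RMSE"].shape)   # (10,)
```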
- Simplified Spatial Component: GAT not fully implemented; using only temporal LSTM
- Fixed Horizon: Model trained for specific 10-day horizon
- Single Crop: Currently optimized for wheat only
- No Uncertainty Quantification: Point predictions without confidence intervals
- Full GAT Implementation: Incorporate spatial attention across all mandis
- Probabilistic Forecasting: Use mixture density networks or Bayesian LSTMs
- Multi-task Learning: Jointly predict prices for multiple crops
- Exogenous Variables: Include policy changes, global commodity prices
- Transformer Architecture: Replace LSTM with self-attention for longer sequences
The GAT-LSTM architecture provides a principled approach to spatiotemporal price forecasting by:
- Embeddings: Learning mandi and crop representations
- LSTM Encoder: Capturing temporal dependencies with gated memory
- LSTM Decoder: Generating multi-step forecasts autoregressively
- Regularization: Preventing overfitting through dropout, weight decay, and early stopping
The mathematical foundations ensure the model can learn complex patterns while remaining trainable through gradient descent. The architecture balances expressiveness (universal approximation) with practical considerations (computational efficiency, overfitting prevention).
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780.
- Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems, 27.
- Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929-1958.
| Symbol | Description | Dimension |
|---|---|---|
| $\mathbf{X}$ | Input sequence | $T \times d$ |
| $\mathbf{Y}$ | Output sequence | $H$ |
| $T$ | Input sequence length | 60 |
| $H$ | Forecast horizon | 10 |
| $d$ | Input feature dimension | 6 |
| $M$ | Number of mandis | 520 |
| $C$ | Number of crops | 1 |
| $d_m$ | Mandi embedding dimension | 32 |
| $d_c$ | Crop embedding dimension | 16 |
| $d_h$ | LSTM hidden dimension | 128 |
| $L$ | Number of LSTM layers | 2 |
| $\mathbf{h}_t$ | Hidden state at time $t$ | $d_h$ |
| $\mathbf{c}_t$ | Cell state at time $t$ | $d_h$ |
| $\theta$ | Model parameters | ~597K |
| $\alpha$ | Learning rate | $10^{-3}$ |
| $\lambda$ | Weight decay | |
| $p$ | Dropout probability | 0.2 |
