GAT-LSTM for Spatiotemporal Price Forecasting: Technical Report

Abstract

This report presents a comprehensive analysis of the Graph Attention Long Short-Term Memory (GAT-LSTM) architecture for spatiotemporal wheat price forecasting. The model combines Graph Attention Networks (GATs) for capturing spatial dependencies between agricultural markets (mandis) with Long Short-Term Memory (LSTM) networks for temporal sequence modeling. We detail the mathematical foundations, architectural components, training methodology, and theoretical justifications for design decisions.

1. Introduction

1.1 Problem Formulation

Objective: Predict future price sequences for agricultural commodities across multiple spatially-distributed markets.

Input:

Historical price sequence: $\mathbf{X} \in \mathbb{R}^{T \times d}$ where $T=60$ days, $d=6$ features
Mandi identifier: $m \in \{1, 2, ..., M\}$ where $M$ is total mandis
Crop identifier: $c \in \{1, 2, ..., C\}$ where $C$ is total crops
Spatial graph: $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ with $|\mathcal{V}| = M$ nodes

Output:

Future price sequence: $\mathbf{Y} \in \mathbb{R}^{H}$ where $H=10$ forecast horizon

Mathematical Formulation:

$$\hat{\mathbf{Y}} = f_\theta(\mathbf{X}, m, c, \mathcal{G})$$

where $f_\theta$ is our learned function parameterized by $\theta$.

1.2 Challenges

Temporal Dependencies: Prices exhibit autocorrelation, trends, and seasonality
Spatial Dependencies: Nearby mandis influence each other through trade and arbitrage
Multi-step Forecasting: Predicting sequences (not single points) with compounding uncertainty
Heterogeneity: Different mandis and crops have distinct characteristics

2. Model Architecture

2.1 Overview

The GAT-LSTM architecture consists of four main components:

Input → Embeddings → Spatial Encoding (GAT) → Temporal Encoding (LSTM Encoder) 
     → Multi-step Prediction (LSTM Decoder) → Output

Architecture Diagram:

3. Mathematical Foundations

3.1 Embedding Layer

Purpose: Transform discrete identifiers into continuous vector representations.

Mandi Embedding:

$$\mathbf{e}_m = \mathbf{E}_m[m] \in \mathbb{R}^{d_m}$$

where $\mathbf{E}_m \in \mathbb{R}^{M \times d_m}$ is the learnable embedding matrix, $d_m = 32$.

Crop Embedding:

$$\mathbf{e}_c = \mathbf{E}_c[c] \in \mathbb{R}^{d_c}$$

where $\mathbf{E}_c \in \mathbb{R}^{C \times d_c}$ is the learnable embedding matrix, $d_c = 16$.

Rationale: Embeddings allow the model to learn that similar mandis (e.g., geographically close, similar climate) should have similar representations. This is more expressive than one-hot encoding.

Combined Embedding:

$$\mathbf{e} = [\mathbf{e}_m; \mathbf{e}_c] \in \mathbb{R}^{48}$$

where $[;]$ denotes concatenation.

3.2 Feature Combination

Input Features (per timestep $t$):

$p_t$: Price (Imp_Price)
$month_t$: Month (1-12)
$dow_t$: Day of week (0-6)
$doy_t$: Day of year (1-365)
$temp_t$: Mean temperature
$rain_t$: Mean rainfall

Feature Vector:

$$\mathbf{x}_t = [p_t, month_t, dow_t, doy_t, temp_t, rain_t]^T \in \mathbb{R}^6$$

Sequence:

$$\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, ..., \mathbf{x}_T] \in \mathbb{R}^{T \times 6}$$

Combined Representation (broadcast embedding to all timesteps):

$$\mathbf{Z}_t = [\mathbf{x}_t; \mathbf{e}] \in \mathbb{R}^{54}$$

$$\mathbf{Z} = [\mathbf{Z}_1, \mathbf{Z}_2, ..., \mathbf{Z}_T] \in \mathbb{R}^{T \times 54}$$

3.3 LSTM Encoder

Purpose: Process temporal sequence and extract temporal patterns.

3.3.1 LSTM Cell Mathematics

An LSTM cell at timestep $t$ computes:

Forget Gate (what to forget from previous memory):

$$\mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{Z}_t + \mathbf{U}_f \mathbf{h}_{t-1} + \mathbf{b}_f)$$

Input Gate (what new information to store):

$$\mathbf{i}_t = \sigma(\mathbf{W}_i \mathbf{Z}_t + \mathbf{U}_i \mathbf{h}_{t-1} + \mathbf{b}_i)$$

Candidate Memory (new information):

$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c \mathbf{Z}_t + \mathbf{U}_c \mathbf{h}_{t-1} + \mathbf{b}_c)$$

Cell State Update (combine old and new):

$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$

Output Gate (what to output):

$$\mathbf{o}_t = \sigma(\mathbf{W}_o \mathbf{Z}_t + \mathbf{U}_o \mathbf{h}_{t-1} + \mathbf{b}_o)$$

Hidden State:

$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

where:

$\sigma(\cdot)$ is the sigmoid function: $\sigma(x) = \frac{1}{1 + e^{-x}}$
$\odot$ is element-wise multiplication (Hadamard product)
$\mathbf{W}_*, \mathbf{U}_* \in \mathbb{R}^{d_h \times d_{in}}$ are weight matrices
$\mathbf{b}_* \in \mathbb{R}^{d_h}$ are bias vectors
$d_h = 128$ is the hidden dimension

3.3.2 Multi-layer LSTM

For $L=2$ layers:

Layer 1:

$$\mathbf{h}_t^{(1)}, \mathbf{c}_t^{(1)} = \text{LSTM}^{(1)}(\mathbf{Z}_t, \mathbf{h}_{t-1}^{(1)}, \mathbf{c}_{t-1}^{(1)})$$

Layer 2:

$$\mathbf{h}_t^{(2)}, \mathbf{c}_t^{(2)} = \text{LSTM}^{(2)}(\mathbf{h}_t^{(1)}, \mathbf{h}_{t-1}^{(2)}, \mathbf{c}_{t-1}^{(2)})$$

Final Encoder Output:

$$\mathbf{H} = [\mathbf{h}_1^{(2)}, \mathbf{h}_2^{(2)}, ..., \mathbf{h}_T^{(2)}] \in \mathbb{R}^{T \times d_h}$$

Final Hidden and Cell States:

$$\mathbf{h}_{\text{enc}} = \mathbf{h}_T^{(2)}, \quad \mathbf{c}_{\text{enc}} = \mathbf{c}_T^{(2)}$$

3.3.3 Why LSTM?

Vanishing Gradient Problem: Traditional RNNs suffer from vanishing gradients when learning long-term dependencies:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{h}_1} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}_T} \prod_{t=2}^{T} \frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}$$

If $\|\frac{\partial \mathbf{h}_t}{\partial \mathbf{h}_{t-1}}\| < 1$, the gradient vanishes exponentially.

LSTM Solution: The cell state $\mathbf{c}_t$ provides a "highway" for gradients:

$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t$$

Since $\mathbf{f}_t$ is learned, the model can preserve gradients when needed.

3.4 LSTM Decoder (Autoregressive)

Purpose: Generate multi-step forecasts sequentially.

3.4.1 Decoder Mechanism

Initialization:

$$\mathbf{h}_0^{\text{dec}} = \mathbf{h}_{\text{enc}}, \quad \mathbf{c}_0^{\text{dec}} = \mathbf{c}_{\text{enc}}$$

$$\mathbf{d}_0 = \mathbf{h}_T^{(2)} \quad \text{(last encoder output)}$$

Autoregressive Generation (for $k = 1, 2, ..., H$):

$$\mathbf{h}_k^{\text{dec}}, \mathbf{c}_k^{\text{dec}} = \text{LSTM}_{\text{dec}}(\mathbf{d}_{k-1}, \mathbf{h}_{k-1}^{\text{dec}}, \mathbf{c}_{k-1}^{\text{dec}})$$

$$\hat{y}_k = \mathbf{W}_{\text{out}} \mathbf{h}_k^{\text{dec}} + b_{\text{out}}$$

$$\mathbf{d}_k = \mathbf{h}_k^{\text{dec}}$$

where $\mathbf{W}_{\text{out}} \in \mathbb{R}^{1 \times d_h}$ is the output projection.

Final Predictions:

$$\hat{\mathbf{Y}} = [\hat{y}_1, \hat{y}_2, ..., \hat{y}_H]^T \in \mathbb{R}^H$$

3.4.2 Why Autoregressive?

Dependency Modeling: Future prices depend on previous predictions:

$$p(y_{t+k} | \mathbf{X}) = p(y_{t+k} | y_{t+1}, ..., y_{t+k-1}, \mathbf{X})$$

Error Propagation: While autoregressive models can accumulate errors, they better capture sequential dependencies than independent predictions.

3.5 Graph Attention Network (GAT) - Theoretical

Note: Simplified in current implementation, but designed for:

3.5.1 Spatial Graph Construction

k-Nearest Neighbors Graph:

For each mandi $i$ with coordinates $(lat_i, lon_i)$:

Haversine Distance (great-circle distance):

$$d(i, j) = 2r \arcsin\left(\sqrt{\sin^2\left(\frac{\Delta lat}{2}\right) + \cos(lat_i)\cos(lat_j)\sin^2\left(\frac{\Delta lon}{2}\right)}\right)$$

where $r = 6371$ km (Earth's radius).

Edge Set:

$$\mathcal{E} = \{(i, j) : j \in \text{k-NN}(i)\}$$

Edge Weights (inverse distance):

$$w_{ij} = \frac{1}{d(i,j) + \epsilon}$$

Normalized:

$$\tilde{w}_{ij} = \frac{w_{ij}}{\sum_{k \in \mathcal{N}(i)} w_{ik}}$$

3.5.2 Graph Attention Mechanism

Attention Coefficient (importance of node $j$ to node $i$):

$$e_{ij} = \text{LeakyReLU}(\mathbf{a}^T [\mathbf{W}\mathbf{h}_i \| \mathbf{W}\mathbf{h}_j])$$

where:

$\mathbf{W} \in \mathbb{R}^{d' \times d}$ is a learnable transformation
$\mathbf{a} \in \mathbb{R}^{2d'}$ is the attention mechanism
$\|$ denotes concatenation

Normalized Attention (softmax over neighbors):

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}$$

Aggregated Features:

$$\mathbf{h}_i' = \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} \mathbf{W} \mathbf{h}_j\right)$$

3.5.3 Multi-head Attention

K Attention Heads:

$$\mathbf{h}_i' = \|_{k=1}^K \sigma\left(\sum_{j \in \mathcal{N}(i)} \alpha_{ij}^k \mathbf{W}^k \mathbf{h}_j\right)$$

where $\|$ denotes concatenation, $K=4$ heads.

Rationale: Multiple heads capture different types of spatial relationships (e.g., geographic proximity, trade volume, price correlation).

4. Training Methodology

4.1 Loss Function

Mean Squared Error (MSE):

$$\mathcal{L}(\theta) = \frac{1}{N \cdot H} \sum_{n=1}^N \sum_{h=1}^H (y_n^{(h)} - \hat{y}_n^{(h)})^2$$

where:

$N$ is batch size
$H = 10$ is forecast horizon
$y_n^{(h)}$ is actual price at horizon $h$ for sample $n$
$\hat{y}_n^{(h)}$ is predicted price

Why MSE?

Differentiable (required for gradient descent)
Penalizes large errors more (quadratic)
Corresponds to Gaussian likelihood assumption

4.2 Optimization

Adam Optimizer:

Parameter Update:

$$\theta_{t+1} = \theta_t - \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

where:

First Moment Estimate (mean of gradients):

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

Second Moment Estimate (variance of gradients):

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

Bias Correction:

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

Hyperparameters:

Learning rate: $\alpha = 10^{-3}$
$\beta_1 = 0.9$ (momentum)
$\beta_2 = 0.999$ (RMSprop)
$\epsilon = 10^{-8}$ (numerical stability)

Why Adam?

Adaptive learning rates per parameter
Combines momentum (SGD) and RMSprop
Works well for sparse gradients (embeddings)

4.3 Regularization

4.3.1 L2 Regularization (Weight Decay)

Modified Loss:

$$\mathcal{L}_{\text{reg}}(\theta) = \mathcal{L}(\theta) + \lambda \sum_{i} \theta_i^2$$

where $\lambda = 10^{-5}$.

Effect: Prevents weights from growing too large, reducing overfitting.

4.3.2 Dropout

Training: Randomly zero out activations with probability $p = 0.2$:

$$\mathbf{h}_{\text{drop}} = \mathbf{h} \odot \mathbf{m}, \quad \mathbf{m} \sim \text{Bernoulli}(1-p)$$

Inference: Scale activations:

$$\mathbf{h}_{\text{test}} = (1-p) \mathbf{h}$$

Why Dropout?

Prevents co-adaptation of neurons
Ensemble effect (averaging over $2^n$ networks)

4.3.3 Gradient Clipping

Clip Gradient Norm:

$$\mathbf{g} \leftarrow \frac{\mathbf{g}}{\max(1, \|\mathbf{g}\| / \tau)}$$

where $\tau = 1.0$ is the threshold.

Why? Prevents exploding gradients in RNNs.

4.4 Learning Rate Scheduling

ReduceLROnPlateau:

If validation loss doesn't improve for $p = 5$ epochs:

$$\alpha \leftarrow \alpha \cdot \gamma$$

where $\gamma = 0.5$ (reduction factor).

Rationale: Fine-tune learning as model converges.

4.5 Early Stopping

Criterion: Stop training if validation loss doesn't improve for $p_{\text{stop}} = 10$ epochs.

Why? Prevents overfitting and saves computation.

5. Data Preprocessing

5.1 Feature Normalization

StandardScaler (z-score normalization):

$$\tilde{x}_i = \frac{x_i - \mu}{\sigma}$$

where:

$$\mu = \frac{1}{N}\sum_{i=1}^N x_i, \quad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^N (x_i - \mu)^2}$$

Why?

Ensures all features have similar scales
Improves gradient descent convergence
Prevents features with large magnitudes from dominating

Applied to:

Input features: $\mathbf{X}$
Target prices: $\mathbf{Y}$

Denormalization (for predictions):

$$x = \tilde{x} \cdot \sigma + \mu$$

5.2 Temporal Splitting

Train/Validation/Test Split: 80% / 10% / 10% by date

Why Temporal? Prevents data leakage - model shouldn't see future data during training.

Critical: Sort by date before splitting:

$$\mathcal{D}_{\text{train}} = \{(\mathbf{X}_i, \mathbf{Y}_i) : t_i < t_{\text{split1}}\}$$

$$\mathcal{D}_{\text{val}} = \{(\mathbf{X}_i, \mathbf{Y}_i) : t_{\text{split1}} \leq t_i < t_{\text{split2}}\}$$

$$\mathcal{D}_{\text{test}} = \{(\mathbf{X}_i, \mathbf{Y}_i) : t_i \geq t_{\text{split2}}\}$$

6. Model Complexity Analysis

6.1 Parameter Count

Embeddings:

Mandi: $M \times 32$
Crop: $C \times 16$

LSTM Encoder (per layer):

Input to hidden: $4 \times (d_{\text{in}} \times d_h + d_h \times d_h + d_h)$
Layer 1: $4 \times (54 \times 128 + 128 \times 128 + 128) = 94,720$
Layer 2: $4 \times (128 \times 128 + 128 \times 128 + 128) = 66,048$

LSTM Decoder (2 layers):

Similar to encoder: $2 \times 66,048 = 132,096$

Output Layer:

$128 \times 1 + 1 = 129$

Total: ~597,745 parameters (for $M=520$, $C=1$)

6.2 Computational Complexity

LSTM Forward Pass: $\mathcal{O}(T \cdot d_h^2)$

Training Complexity (per epoch):

$$\mathcal{O}(N \cdot T \cdot d_h^2)$$

where $N \approx 920,000$ samples, $T = 60$, $d_h = 128$.

GPU Speedup: ~10-15x due to parallelization across batch dimension.

7. Theoretical Justifications

7.1 Universal Approximation

Theorem (Cybenko, 1989): A feedforward network with a single hidden layer containing a finite number of neurons can approximate any continuous function on compact subsets of $\mathbb{R}^n$.

Implication: Our multi-layer LSTM can theoretically approximate any temporal mapping.

7.2 Sequence-to-Sequence Learning

Encoder-Decoder Framework (Sutskever et al., 2014):

$$p(\mathbf{Y} | \mathbf{X}) = \prod_{h=1}^H p(y_h | y_1, ..., y_{h-1}, \mathbf{X})$$

Encoder: Compresses $\mathbf{X}$ into fixed-size representation $\mathbf{h}_{\text{enc}}$

Decoder: Generates $\mathbf{Y}$ conditioned on $\mathbf{h}_{\text{enc}}$

7.3 Attention Mechanism

Motivation (Bahdanau et al., 2015): Fixed-size encoding bottleneck limits performance.

Solution: Attention allows decoder to focus on relevant parts of input:

$$\mathbf{c}_k = \sum_{t=1}^T \alpha_{kt} \mathbf{h}_t$$

where $\alpha_{kt}$ is attention weight for timestep $t$ at decoding step $k$.

8. Evaluation Metrics

8.1 Point Metrics

Mean Squared Error (MSE):

$$\text{MSE} = \frac{1}{N \cdot H} \sum_{n=1}^N \sum_{h=1}^H (y_n^{(h)} - \hat{y}_n^{(h)})^2$$

Root Mean Squared Error (RMSE):

$$\text{RMSE} = \sqrt{\text{MSE}}$$

Mean Absolute Error (MAE):

$$\text{MAE} = \frac{1}{N \cdot H} \sum_{n=1}^N \sum_{h=1}^H |y_n^{(h)} - \hat{y}_n^{(h)}|$$

Mean Absolute Percentage Error (MAPE):

$$\text{MAPE} = \frac{100\%}{N \cdot H} \sum_{n=1}^N \sum_{h=1}^H \left|\frac{y_n^{(h)} - \hat{y}_n^{(h)}}{y_n^{(h)}}\right|$$

8.2 Goodness-of-Fit

R² Score (Coefficient of Determination):

$$R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}$$

where $\bar{y}$ is the mean of actual values.

Interpretation:

$R^2 = 1$: Perfect predictions
$R^2 = 0$: Model performs as well as predicting mean
$R^2 < 0$: Model worse than predicting mean

8.3 Horizon-Specific Metrics

Per-Day RMSE:

$$\text{RMSE}_h = \sqrt{\frac{1}{N} \sum_{n=1}^N (y_n^{(h)} - \hat{y}_n^{(h)})^2}$$

Expected Behavior: $\text{RMSE}_1 < \text{RMSE}_2 < ... < \text{RMSE}_{10}$

(Uncertainty compounds over forecast horizon)

9. Limitations and Future Work

9.1 Current Limitations

Simplified Spatial Component: GAT not fully implemented; using only temporal LSTM
Fixed Horizon: Model trained for specific 10-day horizon
Single Crop: Currently optimized for wheat only
No Uncertainty Quantification: Point predictions without confidence intervals

9.2 Future Enhancements

Full GAT Implementation: Incorporate spatial attention across all mandis
Probabilistic Forecasting: Use mixture density networks or Bayesian LSTMs
Multi-task Learning: Jointly predict prices for multiple crops
Exogenous Variables: Include policy changes, global commodity prices
Transformer Architecture: Replace LSTM with self-attention for longer sequences

10. Conclusion

The GAT-LSTM architecture provides a principled approach to spatiotemporal price forecasting by:

Embeddings: Learning mandi and crop representations
LSTM Encoder: Capturing temporal dependencies with gated memory
LSTM Decoder: Generating multi-step forecasts autoregressively
Regularization: Preventing overfitting through dropout, weight decay, and early stopping

The mathematical foundations ensure the model can learn complex patterns while remaining trainable through gradient descent. The architecture balances expressiveness (universal approximation) with practical considerations (computational efficiency, overfitting prevention).

References

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735-1780.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., & Bengio, Y. (2017). Graph attention networks. arXiv preprint arXiv:1710.10903.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems, 27.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.

Appendix: Notation Summary

Symbol	Description	Dimension
$\mathbf{X}$	Input sequence	$T \times d$
$\mathbf{Y}$	Output sequence	$H$
$T$	Input sequence length	60
$H$	Forecast horizon	10
$d$	Input feature dimension	6
$M$	Number of mandis	520
$C$	Number of crops	1
$d_m$	Mandi embedding dimension	32
$d_c$	Crop embedding dimension	16
$d_h$	LSTM hidden dimension	128
$L$	Number of LSTM layers	2
$\mathbf{h}_t$	Hidden state at time $t$	$d_h$
$\mathbf{c}_t$	Cell state at time $t$	$d_h$
$\theta$	Model parameters	~597K
$\alpha$	Learning rate	$10^{-3}$
$\lambda$	Weight decay	$10^{-5}$
$p$	Dropout probability	0.2

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
models		models
scraper		scraper
README.md		README.md
gat_lstm_arch.jpg		gat_lstm_arch.jpg
local_server.py		local_server.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

GAT-LSTM for Spatiotemporal Price Forecasting: Technical Report

Abstract

1. Introduction

1.1 Problem Formulation

1.2 Challenges

2. Model Architecture

2.1 Overview

Architecture Diagram:

3. Mathematical Foundations

3.1 Embedding Layer

3.2 Feature Combination

3.3 LSTM Encoder

3.3.1 LSTM Cell Mathematics

3.3.2 Multi-layer LSTM

3.3.3 Why LSTM?

3.4 LSTM Decoder (Autoregressive)

3.4.1 Decoder Mechanism

3.4.2 Why Autoregressive?

3.5 Graph Attention Network (GAT) - Theoretical

3.5.1 Spatial Graph Construction

3.5.2 Graph Attention Mechanism

3.5.3 Multi-head Attention

4. Training Methodology

4.1 Loss Function

4.2 Optimization

4.3 Regularization

4.3.1 L2 Regularization (Weight Decay)

4.3.2 Dropout

4.3.3 Gradient Clipping

4.4 Learning Rate Scheduling

4.5 Early Stopping

5. Data Preprocessing

5.1 Feature Normalization

5.2 Temporal Splitting

6. Model Complexity Analysis

6.1 Parameter Count

6.2 Computational Complexity

7. Theoretical Justifications

7.1 Universal Approximation

7.2 Sequence-to-Sequence Learning

7.3 Attention Mechanism

8. Evaluation Metrics

8.1 Point Metrics

8.2 Goodness-of-Fit

8.3 Horizon-Specific Metrics

9. Limitations and Future Work

9.1 Current Limitations

9.2 Future Enhancements

10. Conclusion

References

Appendix: Notation Summary

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages