A production-ready deep learning framework for time series forecasting with hierarchical sparse attention, TabNet encoders, DCN cross layers, and intermittent demand handling.
- Feature-Level Attention: TabNet encoders for sparse feature selection within each component
- Component-Level Attention: Learns importance across Trend, Seasonal, Holiday, and Regressor components
- Cross-Layer Interactions: Deep Cross Network (DCN) for explicit feature combinations
- TabNet Encoders: Sequential attention with interpretable feature importance
- 4 Components: Trend, Seasonal, Holiday, Regressor (use any combination 1-4)
- Dynamic Ensemble: Softmax weights automatically adapt to available components
- SKU-Specific: Different products learn different patterns through embeddings
- Two-Stage Prediction: Zero probability + magnitude forecasting
- Zero Detection: Hierarchical attention + cross layers for sparse demand patterns
- Toggle Mode: Enable/disable via the `enable_intermittent_handling` parameter
- Production-Ready: Tested on 910 SKUs with varying sparsity levels
- Efficient: No transformers, lightweight TabNet architecture
- Stable: Low-temperature softmax (no NaN issues)
- Interpretable: Built-in feature importance and attention weights
- Autoregressive: Multi-step forecasting with lag feature updates
```bash
# Clone repository
git clone https://github.com/mkuma93/deepsequence-hierarchical-attention.git
cd deepsequence-hierarchical-attention
# Install dependencies
pip install -r requirements.txt
# Install package
pip install -e .
```

```python
import numpy as np
from deepsequence_hierarchical_attention import DeepSequencePWLHierarchical
# Initialize model (intermittent mode)
model = DeepSequencePWLHierarchical(
n_skus=100,
n_features=20,
enable_intermittent_handling=True, # Two-stage prediction
tabnet_feature_dim=16,
tabnet_output_dim=8,
embedding_dim=8,
n_cross_layers=2
)
# Build model
main_model = model.build_model()
# Train
history = main_model.fit(
[X_train, sku_train],
{'final_forecast': y_train},
validation_data=([X_val, sku_val], {'final_forecast': y_val}),
epochs=50,
batch_size=64
)
# Predict (returns dict with multiple outputs)
predictions = main_model.predict([X_test, sku_test])
# Keys: 'base_forecast', 'zero_probability', 'final_forecast'
```

```python
from deepsequence_hierarchical_attention import AutoregressivePredictor
# Initialize predictor
ar_predictor = AutoregressivePredictor(
model=main_model,
lag_feature_indices=[16, 17], # Which features are lags
lags=[1, 7], # Lag orders (t-1, t-7)
n_skus=100
)
# Forecast 14 days ahead
forecast = ar_predictor.predict_multi_step(
X_initial=X_test[:3],
sku_ids=sku_test[:3],
n_steps=14
)
# Shape: (3, 14) - 3 SKUs, 14 days
```
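Conceptually, multi-step forecasting feeds each prediction back into the lag features before predicting the next step. The loop below is only a sketch of that idea, not the packaged `AutoregressivePredictor` implementation; the feature indices, lag orders, and dict-style model output are assumptions carried over from the example above.

```python
import numpy as np

def naive_multi_step(model, X_last, sku_ids, lag_idx=(16, 17), lags=(1, 7), n_steps=14):
    """Illustrative autoregressive loop: predict one step, then write the
    prediction back into the lag columns before predicting the next step."""
    X = X_last.copy()
    history = []                               # predictions so far, newest last
    forecasts = np.zeros((len(X), n_steps))
    for step in range(n_steps):
        y_hat = model.predict([X, sku_ids])['final_forecast'].ravel()
        forecasts[:, step] = y_hat
        history.append(y_hat)
        # Refresh each lag feature with the value from `lag` steps back,
        # once enough predicted history is available.
        for idx, lag in zip(lag_idx, lags):
            if len(history) >= lag:
                X[:, idx] = history[-lag]
    return forecasts                           # shape: (n_samples, n_steps)
```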
```
Input Features → TabNet Encoders (4 components) → Cross Layers → Ensemble
      ↓                    ↓                           ↓             ↓
 [Features]        [Sparse Attention]           [Interactions] [Softmax Weights]
   20 dim            per component             DCN across components
                                                                     ↓
                                                 [Zero Probability] (intermittent mode)
                                                                     ↓
                                                         [Final Forecast]
```
- Trend Component: Time features (day, week, month) → TabNet
- Seasonal Component: Fourier features (sin/cos) → TabNet
- Holiday Component: Holiday proximity features → TabNet
- Regressor Component: Lag features + external variables → TabNet
Each component:
- TabNet encoder for feature selection
- Sparse attention for interpretability
- Component-specific hidden layers
- Ensemble weights learned per SKU
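To make these component inputs concrete, here is one way such features could be engineered with pandas. This is a hedged sketch only: the column names, the weekly Fourier period, and the holiday-distance logic are assumptions, not the repository's preprocessing code.

```python
import numpy as np
import pandas as pd

def build_features(df, holidays, n_fourier=3, lags=(1, 7)):
    """df: one SKU's daily history with a DatetimeIndex and a 'demand' column."""
    out = pd.DataFrame(index=df.index)
    # Trend component: calendar features
    out['day_of_week'] = df.index.dayofweek
    out['day_of_month'] = df.index.day
    out['month'] = df.index.month
    # Seasonal component: Fourier terms (weekly period assumed)
    t = np.arange(len(df))
    for k in range(1, n_fourier + 1):
        out[f'sin_{k}'] = np.sin(2 * np.pi * k * t / 7)
        out[f'cos_{k}'] = np.cos(2 * np.pi * k * t / 7)
    # Holiday component: distance in days to the nearest holiday
    hol = pd.to_datetime(pd.Series(holidays))
    out['days_to_holiday'] = [int((hol - d).abs().min().days) for d in df.index]
    # Regressor component: lagged demand (plus any external variables)
    for lag in lags:
        out[f'lag_{lag}'] = df['demand'].shift(lag)
    return out.dropna()
```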
When `enable_intermittent_handling=True`:
```
Base Forecast → Zero Detection Branch → Final Forecast
      ↑             (Cross Layers)             ↑
   Softmax         Zero Probability    base × (1 - zero_prob)
   Ensemble
```
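A minimal numerical sketch of that combination rule (the values are made up purely for illustration):

```python
import numpy as np

base_forecast = np.array([5.2, 0.8, 12.4])   # magnitude branch (softmax ensemble)
zero_prob     = np.array([0.1, 0.9, 0.05])   # P(demand = 0) from the zero-detection branch

# Expected demand = magnitude scaled by the probability of a non-zero sale
final_forecast = base_forecast * (1.0 - zero_prob)
print(final_forecast)   # [ 4.68  0.08 11.78]
```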
```
deepsequence-hierarchical-attention/
├── deepsequence_hierarchical_attention/
│   ├── __init__.py
│   ├── components.py        # Main model architecture
│   ├── tabnet.py            # TabNet encoder implementation
│   ├── autoregressive.py    # Multi-step forecasting
│   └── model.py             # Wrapper class (optional)
├── examples/
│   └── demo.ipynb           # Complete tutorial
├── tests/
│   └── test_components.py
├── README.md
├── requirements.txt
├── setup.py
└── LICENSE
```
```python
# Disable intermittent handling for regular demand
model = DeepSequencePWLHierarchical(
n_skus=50,
n_features=15,
enable_intermittent_handling=False, # Direct forecasting
tabnet_feature_dim=16,
embedding_dim=8
)
main_model = model.build_model()
main_model.compile(optimizer='adam', loss='mae')
# Single output: final_forecast only
history = main_model.fit(
[X_train, sku_train],
y_train, # Simple array, not dict
epochs=30
)
```

```python
# In intermittent mode, the model exposes intermediate outputs
predictions = main_model.predict([X_test[:5], sku_test[:5]])
base_forecast = predictions['base_forecast'] # Softmax ensemble
zero_prob = predictions['zero_probability'] # P(demand=0)
final_forecast = predictions['final_forecast'] # base Γ (1 - zero_prob)
print(f"Base forecast: {base_forecast[0]}")
print(f"Zero probability: {zero_prob[0]}")
print(f"Final forecast: {final_forecast[0]}")# TabNet provides built-in feature importance
# Access through model layers (requires custom extraction)
# See examples/demo.ipynb for detailed implementation| Parameter | Default | Description |
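The exact extraction depends on how the layers are named inside the built model, so treat the snippet below as a generic Keras starting point; the name filters are assumptions, not the package's documented API.

```python
# Inspect the built Keras model and print weights from layers whose names
# suggest they hold attention masks or ensemble weights.
for layer in main_model.layers:
    if 'attention' in layer.name.lower() or 'ensemble' in layer.name.lower():
        for w in layer.get_weights():
            print(layer.name, w.shape)
```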
| Parameter | Default | Description |
|---|---|---|
| `n_skus` | - | Number of unique SKUs/products |
| `n_features` | - | Number of input features |
| `enable_intermittent_handling` | `True` | Two-stage prediction for sparse demand |
| `tabnet_feature_dim` | 16 | TabNet feature dimension |
| `tabnet_output_dim` | 8 | TabNet output dimension |
| `embedding_dim` | 8 | SKU embedding dimension |
| `n_cross_layers` | 2 | Number of DCN cross layers |
| `dropout_rate` | 0.1 | Dropout rate for regularization |
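Putting the table together, a constructor call that spells out every parameter; the values shown are simply the defaults from the table, with the dataset sizes from the quick start and benchmark used for the two required arguments:

```python
model = DeepSequencePWLHierarchical(
    n_skus=910,                          # number of unique SKUs/products
    n_features=20,                       # number of input features
    enable_intermittent_handling=True,   # two-stage prediction for sparse demand
    tabnet_feature_dim=16,
    tabnet_output_dim=8,
    embedding_dim=8,
    n_cross_layers=2,
    dropout_rate=0.1,
)
```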
- Batch Size: 64-256 for stability
- Learning Rate: 0.001 (Adam optimizer)
- Epochs: 30-100 depending on dataset size
- Regularization: Dropout + L2 regularization on embeddings
- Validation: Use temporal split (not random) for time series
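A sketch of those tips in code, assuming `X`, `sku_ids`, and `y` are preloaded, time-ordered arrays; the split logic and optimizer settings are plain NumPy/Keras usage rather than repository helpers, and the explicit compile step is only needed if `build_model()` does not compile the model for you.

```python
import tensorflow as tf

# Temporal split: hold out the most recent 20% of observations so the
# validation window lies strictly in the future (no random shuffling).
split = int(len(X) * 0.8)
X_train, X_val = X[:split], X[split:]
sku_train, sku_val = sku_ids[:split], sku_ids[split:]
y_train, y_val = y[:split], y[split:]

main_model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mae',
)
history = main_model.fit(
    [X_train, sku_train], y_train,
    validation_data=([X_val, sku_val], y_val),
    epochs=50,
    batch_size=128,
)
```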
Tested on a retail demand forecasting dataset:
- 910 SKUs, 1000+ samples per SKU
- 30% intermittent (sparse demand patterns)
| Metric | Continuous Mode | Intermittent Mode |
|---|---|---|
| MAE | 2.34 | 2.18 |
| RMSE | 4.67 | 4.23 |
| MAPE | 15.2% | 14.1% |
Intermittent mode shows a 7% improvement in MAE for sparse-demand SKUs.
```python
# Use only Trend + Seasonal (no Holiday/Regressor)
model = DeepSequencePWLHierarchical(
n_skus=100,
n_features=10, # Only time + Fourier features
enable_intermittent_handling=False
)
# Model automatically adapts the ensemble to 2 components
```

- Softmax Temperature: Low temperature (0.1) prevents NaN issues
- Gradient Clipping: Built-in for stable training
- Batch Normalization: Ghost batch norm in TabNet
- Small Epsilon: 1e-7 for numerical safety
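For reference, one common way to write a temperature-scaled softmax with those safeguards; this is a standalone NumPy illustration of the idea, not the layer implementation used in the package.

```python
import numpy as np

def temperature_softmax(logits, temperature=0.1, eps=1e-7):
    # Scale by the temperature, subtract the row-wise max to avoid overflow,
    # and add a small epsilon to the denominator for numerical safety.
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    exp_z = np.exp(z)
    return exp_z / (exp_z.sum(axis=-1, keepdims=True) + eps)
```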
- Architecture Guide - Detailed architecture explanation
- API Reference - Complete API documentation
- Tutorial Notebook - Step-by-step guide
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
MIT License - see LICENSE file for details
Mritunjay Kumar
- Email: mritunjay.kmr1@gmail.com
- GitHub: @mkuma93
- TabNet: Google Research - Paper
- DCN: Google Research - Paper
- TensorFlow Team: For the excellent deep learning framework
If you use this work, please cite:
```bibtex
@software{kumar2025deepsequence,
  author = {Kumar, Mritunjay},
  title  = {DeepSequence Hierarchical Attention for Time Series Forecasting},
  year   = {2025},
  url    = {https://github.com/mkuma93/deepsequence-hierarchical-attention}
}
```