Skuld, one of three Norns — Urðr (past), Verðandi (present), Skuld (future).
- Key Metrics
- Quick Start
- Project Structure
- Documentation
- Configuration
- Testing
- Pipeline Overview
- Data Leakage Prevention
- Development Philosophy
- References
A comprehensive machine learning pipeline for Learning-to-Rank asset prediction, focusing on the New Zealand Exchange (NZX). The system consists of two main components:
- Java Data Ingestion (java/): Fetches and consolidates data from multiple sources
- Python ML Pipeline (python/ml-pipeline/): Implements ranking-based asset prediction using LightGBM
┌─────────────────────┐ ┌─────────────────────┐
│ JAVA INGESTION │ │ PYTHON ML PIPELINE │
│ │ │ │
│ Yahoo Finance │ │ Feature Engineer │
│ NZ Statistics │ ─────► │ LGBMRanker Model │ ─────► Predictions
│ RBNZ Data │ CSV │ Rolling Backtest │ & Metrics
│ Macroeconomic │ │ Portfolio Analysis │
└─────────────────────┘ └─────────────────────┘
| Metric | Value | Description |
|---|---|---|
| Mean IC | 0.27+ | Correlation between predictions and actual returns |
| ICIR | 4.0+ | Information ratio (consistency measure) |
| Sharpe Ratio | 3.0+ | Risk-adjusted returns (annualized) |
| Quintile Spread | 5-10% | Return gap between top and bottom quintiles |
I am aware these results look unrealistically strong compared with typical values. The extensive leakage tests have not found any leakage, so one possible explanation is that the NZX is simply a very inefficient market. I will continue to investigate this.
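To make the metric definitions concrete, here is a minimal sketch of how IC, ICIR and quintile spread are commonly computed. It assumes IC is a Spearman rank correlation per cross-section and ICIR is mean IC divided by its sample standard deviation with no annualization; the project's own evaluation code may differ on both points. All function names are hypothetical, and the code is plain Python so that it runs without dependencies.

```python
from statistics import mean, stdev


def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def information_coefficient(predictions, realized_returns):
    """Assumed definition: Spearman rank correlation for one date."""
    return pearson(ranks(predictions), ranks(realized_returns))


def icir(ic_series):
    """Assumed definition: mean IC over its sample standard deviation."""
    return mean(ic_series) / stdev(ic_series)


def quintile_spread(predictions, realized_returns):
    """Mean return of the top-scored fifth minus the bottom-scored fifth.

    Needs at least five assets so that each quintile is non-empty.
    """
    paired = sorted(zip(predictions, realized_returns), key=lambda p: p[0])
    n = len(paired) // 5
    bottom = mean(r for _, r in paired[:n])
    top = mean(r for _, r in paired[-n:])
    return top - bottom
```

In this sketch the IC would be computed once per test date, and `icir` applied to the resulting series of per-date values.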
- Java 17+ (for data ingestion)
- Maven 3.6+ (for building Java)
- Python 3.11+ (for ML pipeline)
- uv (Python package manager) — Install uv
git clone https://github.com/oneye5/skuld.git
cd skuld
cd java
mvn clean compile
mvn exec:java -Dexec.mainClass="lazic.Main"
This fetches data from Yahoo Finance and NZ economic sources, outputting to data/data_long.csv.
cd python/ml-pipeline
uv sync # Install dependencies
uv run python scripts/run_model_evaluation.py # Run evaluation
skuld/
├── README.md # This file
├── data/
│ └── data_long.csv # Raw data (generated by Java)
├── docs/
│ ├── RANKING_PIPELINE_GUIDE.md # Detailed pipeline documentation
│ ├── ANNUAL_STATISTICS.md # Annual performance metrics
│ ├── CLUSTERING.md # Clustering methodology
│ ├── FEATURES.md # Feature engineering guide
│ ├── TESTING.md # Test coverage documentation
│ └── DATA_LEAKAGE.md # Leakage prevention guide
├── java/ # Data ingestion
│ ├── pom.xml # Maven configuration
│ ├── docs/
│ │ ├── DATA_SOURCES.md # Data source documentation
│ │ └── ARCHITECTURE.md # Java architecture
│ └── src/main/java/lazic/
│ ├── Main.java # Entry point
│ ├── sources/ # Data source implementations
│ └── utils/ # Utilities
└── python/ml-pipeline/ # ML pipeline
├── pyproject.toml # Python dependencies
├── config/ # Configuration
├── core/ # Core utilities
├── features/ # Feature engineering
├── learner/ # ML models
├── evaluation/ # Metrics & backtesting
├── pipeline/ # Main pipeline
├── scripts/ # Entry points
└── tests/ # Unit tests
| Document | Description |
|---|---|
| Project TODO | Active tasks and project roadmap |
| Document | Description |
|---|---|
| Ranking Pipeline Guide | Complete pipeline documentation |
| Annual Statistics | Performance metrics and Monte Carlo |
| Clustering | Statistical sector clustering |
| Features | Feature engineering reference |
| Document | Description |
|---|---|
| Testing | Test coverage and methodology |
| Data Leakage | Leakage prevention strategies |
| Data Audit Reports | Third-party price validation |
| Java Data Sources | Data source configuration |
| Java Architecture | Ingestion architecture |
Key settings in python/ml-pipeline/config/settings.py:
# Target Definition
FORWARD_RETURN_DAYS = 365 # Prediction horizon
RETURN_TYPE = "simple" # simple or log returns
# Model Settings
RANKER_N_ESTIMATORS = 150 # LightGBM iterations
RANKER_NUM_LEAVES = 127 # Tree complexity
# Portfolio Settings
PORTFOLIO_TOP_N = 10 # Long positions
PORTFOLIO_BOTTOM_N = 10 # Short positions
TRANSACTION_COST_BPS = 10 # Trading costs
Data sources are configured in java/src/main/java/lazic/sources/config/Tickers.java.
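As a rough illustration of how the portfolio settings above could interact, the sketch below computes one period's long-short return net of costs. The function name, the equal weighting, and the `turnover` parameter (2.0 meaning both legs are fully re-established each period) are assumptions for illustration and are not taken from the project; only the three constants mirror settings.py.

```python
from statistics import mean

# Values mirrored from config/settings.py.
PORTFOLIO_TOP_N = 10
PORTFOLIO_BOTTOM_N = 10
TRANSACTION_COST_BPS = 10


def long_short_return(
    scores,
    realized_returns,
    top_n=PORTFOLIO_TOP_N,
    bottom_n=PORTFOLIO_BOTTOM_N,
    cost_bps=TRANSACTION_COST_BPS,
    turnover=2.0,
):
    """Net return of an equally weighted long-short book for one period.

    scores and realized_returns are dicts keyed by ticker. The cost
    model (turnover times cost in basis points) is an assumption.
    """
    if len(scores) < top_n + bottom_n:
        raise ValueError("not enough assets for both legs")
    ranked = sorted(scores, key=scores.get, reverse=True)
    longs = ranked[:top_n]
    shorts = ranked[-bottom_n:]
    gross = mean(realized_returns[t] for t in longs) - mean(
        realized_returns[t] for t in shorts
    )
    cost = turnover * cost_bps / 10_000  # 1 bp = 0.01%
    return gross - cost
```

With the default settings, a fully turned-over book would pay 2.0 * 10 bps = 0.2% per rebalance under this cost model.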
cd python/ml-pipeline
uv run pytest # Run all tests
uv run pytest -x -q # Stop on first failure
uv run pytest tests/test_leakage_*.py # Run leakage tests
Test Coverage: 300+ tests covering:
- Data leakage prevention (comprehensive)
- Feature engineering correctness
- Pipeline integration
- Metric calculations
- Portfolio simulation
cd java
mvn test
┌────────────────────────────────────────────────────────────────────────────┐
│ RANKING PIPELINE FLOW │
└────────────────────────────────────────────────────────────────────────────┘
Raw Data (data_long.csv)
│
▼
┌───────────────────┐
│ Long → Wide │ Convert ticker+timestamp rows to wide format
└───────────────────┘
│
▼
┌───────────────────┐
│ Anomaly Detection │ Filter price discontinuities (splits, recycled tickers)
└───────────────────┘
│
▼
┌───────────────────┐
│ Forward Returns │ Compute 365-day lookahead returns (target)
└───────────────────┘
│
▼
┌───────────────────┐
│ Feature Engineer │ Technical indicators, alpha factors, ratios
└───────────────────┘
│
├──────────────────────────────────────────────────────┐
│ │
▼ ▼
┌───────────────────┐ ┌───────────────────┐
│ Rolling Window 1 │ ... Rolling Window N │ Cluster Analysis │
│ │ │ (leakage-safe) │
│ • Train/Test Split│ └───────────────────┘
│ • Scaler Fit │ │
│ • LGBMRanker │ │
│ • Predictions │ ◄──────────────────────────────────────┘
└───────────────────┘
│
▼
┌───────────────────┐
│ Aggregate Results │ Combine all window predictions
└───────────────────┘
│
▼
┌───────────────────┐
│ Evaluation │ IC, ICIR, Quintile Returns, Hit Rate
└───────────────────┘
│
▼
┌───────────────────┐
│ Portfolio Backtest│ Long-short simulation with costs
└───────────────────┘
│
▼
┌───────────────────┐
│ Output Reports │ Metrics, plots, predictions CSV
└───────────────────┘
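The first and third stages of the flow above (long to wide, then forward returns) could look roughly like the following sketch. The row layout `(ticker, timestamp, field, value)`, the function names, and the rule of using the first price on or after the horizon date are all assumptions for illustration; the two constants mirror settings.py.

```python
import math
from bisect import bisect_left
from datetime import date, timedelta

# Values mirrored from config/settings.py.
FORWARD_RETURN_DAYS = 365
RETURN_TYPE = "simple"


def long_to_wide(rows):
    """Pivot (ticker, timestamp, field, value) rows into one record
    per (ticker, timestamp), with one key per field."""
    wide = {}
    for ticker, timestamp, field, value in rows:
        wide.setdefault((ticker, timestamp), {})[field] = value
    return wide


def forward_returns(prices, horizon_days=FORWARD_RETURN_DAYS,
                    return_type=RETURN_TYPE):
    """Forward return per date for ONE ticker.

    prices is a list of (date, price) sorted by date. Dates with no
    price at or beyond the horizon get None; those rows have no known
    target and should be dropped rather than filled.
    """
    dates = [d for d, _ in prices]
    out = {}
    for d, price in prices:
        idx = bisect_left(dates, d + timedelta(days=horizon_days))
        if idx == len(dates):
            out[d] = None
            continue
        ratio = prices[idx][1] / price
        out[d] = math.log(ratio) if return_type == "log" else ratio - 1.0
    return out
```

Because the target at date t depends on a price roughly a year later, the most recent year of data never has a label in this sketch.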
The pipeline implements multiple safeguards:
- Temporal Validation: Test data strictly after training data
- Scaler Isolation: Fitted on training data only
- Cross-Sectional Features: Computed per-timestamp after split
- Forward Fill Safety: Sorted by [TICKER, TIMESTAMP] before fill
- Cluster Assignment: Using only historical data
See Data Leakage Guide for details.
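Three of the safeguards above (temporal validation, scaler isolation, forward fill safety) can be sketched in a few lines. This is an illustrative sketch in plain Python, not the project's implementation: the function names and the dict-based row layout with `ticker`, `timestamp` and `value` keys are assumptions.

```python
from statistics import mean, pstdev


def temporal_split(rows, split_time):
    """Train on rows strictly before split_time, test on the rest."""
    train = [r for r in rows if r["timestamp"] < split_time]
    test = [r for r in rows if r["timestamp"] >= split_time]
    return train, test


def fit_scaler(train_values):
    """Estimate mean and standard deviation from training data only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # guard against zero variance
    return mu, sigma


def transform(values, scaler):
    """Apply a scaler fitted elsewhere; never refit on test data."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


def forward_fill(rows):
    """Fill missing values with the last known value of the SAME ticker.

    Sorting by (ticker, timestamp) first ensures a fill only ever uses
    an earlier observation and never crosses a ticker boundary.
    """
    filled = []
    last_ticker, last_value = None, None
    for row in sorted(rows, key=lambda r: (r["ticker"], r["timestamp"])):
        if row["ticker"] != last_ticker:
            last_ticker, last_value = row["ticker"], None
        if row["value"] is not None:
            last_value = row["value"]
        filled.append({**row, "value": last_value})
    return filled
```

The key design point is that `fit_scaler` only ever sees the training side of `temporal_split`, and the same fitted `(mu, sigma)` pair is reused to transform the test window.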
- Research-Driven: Features based on academic finance literature
- Test First: 300+ comprehensive tests for correctness
- Avoid Leakage: Multiple safeguards and dedicated tests
- Reproducible: Experiment tracking with git commit hashes
- Worst-Case Assumptions: Config values such as fees are set higher than they would be in a live implementation (e.g. ignoring Sharesies plans and fee caps in favor of the high flat percentage fee)
The model appears to prefer volatile, illiquid, dividend-paying assets with consistent trends. Some of the top industries are manufacturing, agriculture and seafood.
- LightGBM Documentation
- Gu, Kelly, Xiu (2020) - "Empirical Asset Pricing via Machine Learning"
- Ang et al. (2006) - "The Cross-Section of Volatility and Expected Returns"