Skuld - Cross-Sectional Stock Ranking Pipeline

Skuld is one of the three Norns of Norse mythology: Urðr (past), Verðandi (present), and Skuld (future).


Table of Contents

  1. Key Metrics
  2. Quick Start
  3. Project Structure
  4. Documentation
  5. Configuration
  6. Testing
  7. Pipeline Overview
  8. Data Leakage Prevention
  9. Development Philosophy
  10. References

A comprehensive machine learning pipeline for Learning-to-Rank asset prediction, focusing on the New Zealand Exchange (NZX). The system consists of two main components:

  1. Java Data Ingestion (java/) — Fetches and consolidates data from multiple sources
  2. Python ML Pipeline (python/ml-pipeline/) — Implements ranking-based asset prediction using LightGBM
┌─────────────────────┐          ┌─────────────────────┐
│   JAVA INGESTION    │          │  PYTHON ML PIPELINE │
│                     │          │                     │
│  Yahoo Finance      │          │  Feature Engineer   │
│  NZ Statistics      │  ─────►  │  LGBMRanker Model   │  ─────►  Predictions
│  RBNZ Data          │  CSV     │  Rolling Backtest   │          & Metrics
│  Macroeconomic      │          │  Portfolio Analysis │
└─────────────────────┘          └─────────────────────┘

Key Metrics (Latest Results)

| Metric | Value | Description |
| --- | --- | --- |
| Mean IC | 0.27+ | Correlation between predictions and actual returns |
| ICIR | 4.0+ | Information ratio (consistency measure) |
| Sharpe Ratio | 3.0+ | Risk-adjusted returns (annualized) |
| Quintile Spread | 5-10% | Return gap between top and bottom quintiles |

I am aware these results seem unrealistic compared to typical values. Extensive testing suggests no leakage; perhaps the NZX is simply a very inefficient market. I will continue to investigate this.
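The headline metrics can be sanity-checked in a few lines of pandas. The sketch below is illustrative only — the column names `timestamp`, `prediction`, and `forward_return` are assumptions, not the pipeline's actual schema. It computes the cross-sectional Spearman IC per date, then the ICIR as the mean IC divided by its standard deviation:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

def information_coefficient(df: pd.DataFrame) -> pd.Series:
    """Cross-sectional Spearman IC: rank-correlate predictions with
    realized forward returns, separately for each timestamp."""
    return df.groupby("timestamp")[["prediction", "forward_return"]].apply(
        lambda g: spearmanr(g["prediction"], g["forward_return"])[0]
    )

# Toy cross-sections: 3 dates x 5 tickers with a noisy but informative signal
rng = np.random.default_rng(0)
frames = []
for t in pd.date_range("2024-01-01", periods=3):
    true_returns = rng.normal(size=5)
    frames.append(pd.DataFrame({
        "timestamp": t,
        "prediction": true_returns + rng.normal(scale=0.5, size=5),
        "forward_return": true_returns,
    }))
df = pd.concat(frames, ignore_index=True)

ic = information_coefficient(df)
icir = ic.mean() / ic.std()  # ICIR: average IC scaled by its volatility
print(f"Mean IC: {ic.mean():.2f}  ICIR: {icir:.2f}")
```

A mean IC near zero means the model's rankings carry no information; a perfectly monotone relationship per date gives IC = 1.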

Quick Start

Prerequisites

  • Java 17+ (for data ingestion)
  • Maven 3.6+ (for building Java)
  • Python 3.11+ (for ML pipeline)
  • uv (Python package manager) — Install uv

1. Clone and Setup

git clone https://github.com/oneye5/skuld.git
cd skuld

2. Run Data Ingestion (Java)

cd java
mvn clean compile
mvn exec:java -Dexec.mainClass="lazic.Main"

This fetches data from Yahoo Finance and NZ economic sources, outputting to data/data_long.csv.

3. Run ML Pipeline (Python)

cd python/ml-pipeline
uv sync                                    # Install dependencies
uv run python scripts/run_model_evaluation.py    # Run evaluation

Project Structure

skuld/
├── README.md                          # This file
├── data/
│   └── data_long.csv                 # Raw data (generated by Java)
├── docs/
│   ├── RANKING_PIPELINE_GUIDE.md     # Detailed pipeline documentation
│   ├── ANNUAL_STATISTICS.md          # Annual performance metrics
│   ├── CLUSTERING.md                 # Clustering methodology
│   ├── FEATURES.md                   # Feature engineering guide
│   ├── TESTING.md                    # Test coverage documentation
│   └── DATA_LEAKAGE.md              # Leakage prevention guide
├── java/                             # Data ingestion
│   ├── pom.xml                       # Maven configuration
│   ├── docs/
│   │   ├── DATA_SOURCES.md          # Data source documentation
│   │   └── ARCHITECTURE.md          # Java architecture
│   └── src/main/java/lazic/
│       ├── Main.java                 # Entry point
│       ├── sources/                  # Data source implementations
│       └── utils/                    # Utilities
└── python/ml-pipeline/               # ML pipeline
    ├── pyproject.toml                # Python dependencies
    ├── config/                       # Configuration
    ├── core/                         # Core utilities
    ├── features/                     # Feature engineering
    ├── learner/                      # ML models
    ├── evaluation/                   # Metrics & backtesting
    ├── pipeline/                     # Main pipeline
    ├── scripts/                      # Entry points
    └── tests/                        # Unit tests 

Documentation

Development

| Document | Description |
| --- | --- |
| Project TODO | Active tasks and project roadmap |

Core Guides

| Document | Description |
| --- | --- |
| Ranking Pipeline Guide | Complete pipeline documentation |
| Annual Statistics | Performance metrics and Monte Carlo |
| Clustering | Statistical sector clustering |
| Features | Feature engineering reference |

Technical Documentation

| Document | Description |
| --- | --- |
| Testing | Test coverage and methodology |
| Data Leakage | Leakage prevention strategies |
| Data Audit Reports | Third-party price validation |
| Java Data Sources | Data source configuration |
| Java Architecture | Ingestion architecture |

Configuration

Python ML Pipeline

Key settings in python/ml-pipeline/config/settings.py:

# Target Definition
FORWARD_RETURN_DAYS = 365        # Prediction horizon
RETURN_TYPE = "simple"           # simple or log returns

# Model Settings  
RANKER_N_ESTIMATORS = 150        # LightGBM iterations
RANKER_NUM_LEAVES = 127          # Tree complexity

# Portfolio Settings
PORTFOLIO_TOP_N = 10             # Long positions
PORTFOLIO_BOTTOM_N = 10          # Short positions
TRANSACTION_COST_BPS = 10        # Trading costs
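
To make the horizon and cost settings concrete, here is a small sketch of how such values are typically applied. The function names are hypothetical, not the pipeline's internals: simple returns are P1/P0 - 1, log returns are ln(P1/P0), and one basis point is 0.01% of notional.

```python
import math

# Illustrative application of the settings above; names here are
# hypothetical, not the pipeline's actual internals.
FORWARD_RETURN_DAYS = 365
RETURN_TYPE = "simple"
TRANSACTION_COST_BPS = 10

def forward_return(price_now: float, price_future: float,
                   kind: str = RETURN_TYPE) -> float:
    """Return over the horizon: simple (P1/P0 - 1) or log (ln(P1/P0))."""
    if kind == "simple":
        return price_future / price_now - 1.0
    return math.log(price_future / price_now)

def round_trip_cost(notional: float, bps: int = TRANSACTION_COST_BPS) -> float:
    """One basis point = 0.01% of notional; charged on entry and exit."""
    return 2.0 * notional * bps / 10_000.0

print(forward_return(100.0, 112.0))   # simple 12% gain over the horizon
print(round_trip_cost(10_000.0))      # 20.0 currency units round trip
```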

Java Data Ingestion

Data sources configured in java/src/main/java/lazic/sources/config/Tickers.java.

Testing

Python Tests

cd python/ml-pipeline
uv run pytest                         # Run all tests
uv run pytest -x -q                   # Stop on first failure
uv run pytest tests/test_leakage_*.py # Run leakage tests

Test Coverage: 300+ tests covering:

  • Data leakage prevention (comprehensive)
  • Feature engineering correctness
  • Pipeline integration
  • Metric calculations
  • Portfolio simulation
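
A minimal version of the temporal-validation idea behind the leakage tests might look like the following. The real tests live in tests/test_leakage_*.py; this standalone sketch only shows the invariant being checked:

```python
import pandas as pd

def assert_no_temporal_overlap(train: pd.DataFrame, test: pd.DataFrame) -> None:
    """Every test timestamp must lie strictly after the last train timestamp."""
    assert test["timestamp"].min() > train["timestamp"].max(), "temporal leakage"

train = pd.DataFrame({"timestamp": pd.to_datetime(["2022-01-03", "2022-06-30"])})
test = pd.DataFrame({"timestamp": pd.to_datetime(["2022-07-01", "2022-12-30"])})
assert_no_temporal_overlap(train, test)  # passes: test starts after train ends
```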

Java Tests

cd java
mvn test

Pipeline Overview

┌────────────────────────────────────────────────────────────────────────────┐
│                          RANKING PIPELINE FLOW                              │
└────────────────────────────────────────────────────────────────────────────┘

Raw Data (data_long.csv)
        │
        ▼
┌───────────────────┐
│ Long → Wide       │  Convert ticker+timestamp rows to wide format
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Anomaly Detection │  Filter price discontinuities (splits, recycled tickers)
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Forward Returns   │  Compute 365-day lookahead returns (target)
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Feature Engineer  │  Technical indicators, alpha factors, ratios
└───────────────────┘
        │
        ├──────────────────────────────────────────────────────┐
        │                                                       │
        ▼                                                       ▼
┌───────────────────┐                               ┌───────────────────┐
│ Rolling Window 1  │  ...  Rolling Window N       │  Cluster Analysis │
│                   │                               │  (leakage-safe)   │
│ • Train/Test Split│                               └───────────────────┘
│ • Scaler Fit      │                                        │
│ • LGBMRanker      │                                        │
│ • Predictions     │ ◄──────────────────────────────────────┘
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Aggregate Results │  Combine all window predictions
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Evaluation        │  IC, ICIR, Quintile Returns, Hit Rate
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Portfolio Backtest│  Long-short simulation with costs
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Output Reports    │  Metrics, plots, predictions CSV
└───────────────────┘
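
The first three stages above (long-to-wide conversion and forward-return computation) can be sketched in pandas. This is a toy illustration with assumed column names, not the pipeline's code; a one-day horizon is used for brevity where the pipeline uses 365 days:

```python
import pandas as pd

# Toy long-format rows like data_long.csv might contain; the column
# names (ticker, timestamp, close) are assumptions for illustration.
long_df = pd.DataFrame({
    "ticker":    ["AIA"] * 3 + ["FPH"] * 3,
    "timestamp": list(pd.to_datetime(["2023-01-01", "2023-01-02",
                                      "2023-01-03"])) * 2,
    "close":     [7.0, 7.1, 7.3, 22.0, 21.8, 22.4],
})

# Long -> Wide: one row per timestamp, one column per ticker
wide = long_df.pivot(index="timestamp", columns="ticker", values="close")

# Forward returns: look AHEAD k rows per ticker (the lookahead target);
# the final k rows are NaN because their future price is unknown.
k = 1
fwd = wide.shift(-k) / wide - 1.0
print(fwd.round(4))
```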

Data Leakage Prevention

The pipeline implements multiple safeguards:

  1. Temporal Validation — Test data strictly after training data
  2. Scaler Isolation — Fitted on training data only
  3. Cross-Sectional Features — Computed per-timestamp after split
  4. Forward Fill Safety — Sorted by [TICKER, TIMESTAMP] before fill
  5. Cluster Assignment — Using only historical data

See Data Leakage Guide for details.
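
As a concrete illustration of safeguards 1 and 2: the scaler's statistics must come from training rows only, then be reused unmodified on test rows. A minimal scikit-learn sketch (assuming a StandardScaler, which is an assumption, not something this README specifies):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # past data only
X_test  = rng.normal(loc=5.0, scale=2.0, size=(20, 3))   # strictly later data

scaler = StandardScaler().fit(X_train)   # statistics come from TRAIN only
X_train_s = scaler.transform(X_train)
X_test_s  = scaler.transform(X_test)     # test reuses the train mean/scale

# Fitting on train+test concatenated would leak the test-set
# distribution into the features the model trains on.
```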

Development Philosophy

  1. Research-Driven — Features based on academic finance literature
  2. Test First — 300+ comprehensive tests for correctness
  3. Avoid Leakage — Multiple safeguards and dedicated tests
  4. Reproducible — Experiment tracking with git commit hashes
  5. Worst-Case Assumptions — Config values such as fees are set higher than they would be in practice (e.g. ignoring Sharesies plans and fee caps, opting for the higher flat percentage fee)

Model Trends

The model appears to favor volatile, illiquid, dividend-paying assets with consistent trends. Some of the top industries are manufacturing, agriculture, and seafood.

References

  • LightGBM Documentation
  • Gu, Kelly, Xiu (2020) - "Empirical Asset Pricing via Machine Learning"
  • Ang et al. (2006) - "The Cross-Section of Volatility and Expected Returns"

About

Builds upon the nzx-predictor proof-of-concept project. The project aims to predict future asset prices using machine learning.
