Skuld - Cross-Sectional Stock Ranking Pipeline

Skuld is one of the three Norns: Urðr (past), Verðandi (present), Skuld (future).

┌─────────────────────┐          ┌─────────────────────┐
│   JAVA INGESTION    │          │  PYTHON ML PIPELINE │
│                     │          │                     │
│  Yahoo Finance      │          │  Feature Engineer   │
│  NZ Statistics      │  ─────►  │  LGBMRanker Model   │  ─────►  Predictions
│  RBNZ Data          │  CSV     │  Rolling Backtest   │          & Metrics
│  Macroeconomic      │          │  Portfolio Analysis │
└─────────────────────┘          └─────────────────────┘

Key Metrics (Latest Results)

Metric	Value	Description
Mean IC	0.27+	Correlation between predictions and actual returns
ICIR	4.0+	Information ratio (consistency measure)
Sharpe Ratio	3.0+	Risk-adjusted returns (annualized)
Quintile Spread	5-10%	Return gap between top and bottom quintiles
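
As a rough illustration of the first two rows of the table, the sketch below computes a per-date IC and summarises it as Mean IC and ICIR. It assumes the IC is a Spearman rank correlation and that ICIR is the mean IC divided by its sample standard deviation without annualisation; the README does not state either, and all function names are hypothetical. Only the standard library is used so the example stays self-contained.

```python
from math import sqrt
from statistics import mean, stdev


def ranks(values):
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def information_coefficient(predictions, realized):
    """Assumed definition: Spearman rank correlation for one date's cross-section."""
    return pearson(ranks(predictions), ranks(realized))


def ic_summary(cross_sections):
    """cross_sections: list of (predictions, realized_returns) pairs, one per date."""
    ics = [information_coefficient(p, r) for p, r in cross_sections]
    mean_ic = mean(ics)
    icir = mean_ic / stdev(ics)  # assumed: not annualised
    return mean_ic, icir
```

Calling `ic_summary` with one `(predictions, realized_returns)` pair per evaluation date returns the two summary numbers; at least two dates with differing ICs are needed for the standard deviation to be defined and non-zero.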

I am aware these results seem unrealistic compared to typical values. The many tests suggest no leakage, so perhaps the NZX is just a very inefficient market. I will continue to investigate this.

Quick Start

Prerequisites

Java 17+ (for data ingestion)
Maven 3.6+ (for building Java)
Python 3.11+ (for ML pipeline)
uv (Python package manager); see the uv installation instructions

1. Clone and Setup

git clone https://github.com/oneye5/skuld.git
cd skuld

2. Run Data Ingestion (Java)

cd java
mvn clean compile
mvn exec:java -Dexec.mainClass="lazic.Main"

This fetches data from Yahoo Finance and NZ economic sources, outputting to data/data_long.csv.
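
To give a feel for what a long-format file looks like and how it can be pivoted into one row per ticker and timestamp, here is a minimal sketch. The column names `TICKER`, `TIMESTAMP`, `FIELD` and `VALUE`, the tickers and the numbers are all made up for illustration; the real schema of data/data_long.csv may differ, and the actual pipeline may well use a dataframe library rather than the standard library.

```python
import csv
import io
from collections import defaultdict


def long_to_wide(rows):
    """Pivot long rows into one dict per (TICKER, TIMESTAMP) pair."""
    wide = defaultdict(dict)
    for row in rows:
        key = (row["TICKER"], row["TIMESTAMP"])
        wide[key][row["FIELD"]] = float(row["VALUE"])
    # Sort by ticker, then timestamp (ISO dates sort correctly as strings).
    return [
        {"TICKER": ticker, "TIMESTAMP": ts, **fields}
        for (ticker, ts), fields in sorted(wide.items())
    ]


# Tiny in-memory stand-in for the CSV file (hypothetical schema and values).
sample = io.StringIO(
    "TICKER,TIMESTAMP,FIELD,VALUE\n"
    "AAA,2024-01-02,close,1.50\n"
    "AAA,2024-01-02,volume,1000\n"
    "AAA,2024-01-03,close,1.55\n"
    "BBB,2024-01-02,close,4.20\n"
)
wide_rows = long_to_wide(csv.DictReader(sample))
```

Fields missing for a given ticker and date are simply absent from that row, which leaves the decision about filling or dropping them to a later step.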

3. Run ML Pipeline (Python)

cd python/ml-pipeline
uv sync                                    # Install dependencies
uv run python scripts/run_model_evaluation.py    # Run evaluation

Project Structure

skuld/
├── README.md                          # This file
├── data/
│   └── data_long.csv                 # Raw data (generated by Java)
├── docs/
│   ├── RANKING_PIPELINE_GUIDE.md     # Detailed pipeline documentation
│   ├── ANNUAL_STATISTICS.md          # Annual performance metrics
│   ├── CLUSTERING.md                 # Clustering methodology
│   ├── FEATURES.md                   # Feature engineering guide
│   ├── TESTING.md                    # Test coverage documentation
│   └── DATA_LEAKAGE.md              # Leakage prevention guide
├── java/                             # Data ingestion
│   ├── pom.xml                       # Maven configuration
│   ├── docs/
│   │   ├── DATA_SOURCES.md          # Data source documentation
│   │   └── ARCHITECTURE.md          # Java architecture
│   └── src/main/java/lazic/
│       ├── Main.java                 # Entry point
│       ├── sources/                  # Data source implementations
│       └── utils/                    # Utilities
└── python/ml-pipeline/               # ML pipeline
    ├── pyproject.toml                # Python dependencies
    ├── config/                       # Configuration
    ├── core/                         # Core utilities
    ├── features/                     # Feature engineering
    ├── learner/                      # ML models
    ├── evaluation/                   # Metrics & backtesting
    ├── pipeline/                     # Main pipeline
    ├── scripts/                      # Entry points
    └── tests/                        # Unit tests

Documentation

Development

Document	Description
Project TODO	Active tasks and project roadmap

Core Guides

Document	Description
Ranking Pipeline Guide	Complete pipeline documentation
Annual Statistics	Performance metrics and Monte Carlo
Clustering	Statistical sector clustering
Features	Feature engineering reference

Technical Documentation

Document	Description
Testing	Test coverage and methodology
Data Leakage	Leakage prevention strategies
Data Audit Reports	Third-party price validation
Java Data Sources	Data source configuration
Java Architecture	Ingestion architecture

Configuration

Python ML Pipeline

Key settings in python/ml-pipeline/config/settings.py:

# Target Definition
FORWARD_RETURN_DAYS = 365        # Prediction horizon
RETURN_TYPE = "simple"           # simple or log returns

# Model Settings  
RANKER_N_ESTIMATORS = 150        # LightGBM iterations
RANKER_NUM_LEAVES = 127          # Tree complexity

# Portfolio Settings
PORTFOLIO_TOP_N = 10             # Long positions
PORTFOLIO_BOTTOM_N = 10          # Short positions
TRANSACTION_COST_BPS = 10        # Trading costs
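
To show how the portfolio settings above could interact, the following sketch computes one period's net return for an equal-weight long-short book. The three constants mirror the settings shown; the function name, the equal weighting and the `turnover=2.0` default (both legs fully rebuilt each period) are assumptions, not a description of the project's backtester.

```python
# Values mirror the settings shown above; everything else is illustrative.
PORTFOLIO_TOP_N = 10
PORTFOLIO_BOTTOM_N = 10
TRANSACTION_COST_BPS = 10


def long_short_return(scores, returns, top_n=PORTFOLIO_TOP_N,
                      bottom_n=PORTFOLIO_BOTTOM_N,
                      cost_bps=TRANSACTION_COST_BPS, turnover=2.0):
    """Net return of an equal-weight long-short book for one period.

    scores, returns: dicts keyed by ticker.
    turnover: notional traded per unit of capital (assumed 2.0).
    """
    if top_n < 1 or bottom_n < 1:
        raise ValueError("both legs need at least one position")
    if top_n + bottom_n > len(scores):
        raise ValueError("not enough tickers for non-overlapping legs")
    ranked = sorted(scores, key=scores.get, reverse=True)
    longs, shorts = ranked[:top_n], ranked[-bottom_n:]
    long_leg = sum(returns[t] for t in longs) / top_n
    short_leg = sum(returns[t] for t in shorts) / bottom_n
    gross = long_leg - short_leg
    cost = turnover * cost_bps / 10_000  # 1 basis point = 0.01%
    return gross - cost
```

With the defaults, costs subtract 0.2% per rebalance under the assumed turnover, which is small relative to a 365-day horizon but grows quickly if the book is rebalanced more often.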

Java Data Ingestion

Data sources are configured in java/src/main/java/lazic/sources/config/Tickers.java.

Testing

Python Tests

cd python/ml-pipeline
uv run pytest                         # Run all tests
uv run pytest -x -q                   # Stop on first failure
uv run pytest tests/test_leakage_*.py # Run leakage tests

Test Coverage: 300+ tests covering:

Data leakage prevention (comprehensive)
Feature engineering correctness
Pipeline integration
Metric calculations
Portfolio simulation
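
One common shape for a leakage test is a perturbation check: corrupt everything after a cutoff and assert that features at or before the cutoff are unchanged. The sketch below shows the idea on a hypothetical trailing-mean feature; neither function is taken from the project's test suite.

```python
def trailing_mean(prices, window):
    """Feature that uses only the current and previous window-1 observations."""
    out = []
    for i in range(len(prices)):
        lo = max(0, i - window + 1)
        chunk = prices[lo:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def test_feature_ignores_future_prices():
    prices = [1.0, 1.1, 1.2, 1.3, 1.4, 1.5]
    cutoff = 3
    baseline = trailing_mean(prices, window=3)
    # Corrupt everything after the cutoff; a leakage-free feature
    # must be identical at and before the cutoff.
    tampered = prices[:cutoff + 1] + [p * 100 for p in prices[cutoff + 1:]]
    perturbed = trailing_mean(tampered, window=3)
    assert perturbed[:cutoff + 1] == baseline[:cutoff + 1]
    # Sanity check: the corruption is visible after the cutoff.
    assert perturbed[cutoff + 1:] != baseline[cutoff + 1:]
```

A feature that peeked ahead, for example a centred moving average, would fail the first assertion.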

Java Tests

cd java
mvn test

Pipeline Overview

┌────────────────────────────────────────────────────────────────────────────┐
│                          RANKING PIPELINE FLOW                              │
└────────────────────────────────────────────────────────────────────────────┘

Raw Data (data_long.csv)
        │
        ▼
┌───────────────────┐
│ Long → Wide       │  Convert ticker+timestamp rows to wide format
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Anomaly Detection │  Filter price discontinuities (splits, recycled tickers)
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Forward Returns   │  Compute 365-day lookahead returns (target)
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Feature Engineer  │  Technical indicators, alpha factors, ratios
└───────────────────┘
        │
        ├──────────────────────────────────────────────────────┐
        │                                                       │
        ▼                                                       ▼
┌───────────────────┐                               ┌───────────────────┐
│ Rolling Window 1  │  ...  Rolling Window N       │  Cluster Analysis │
│                   │                               │  (leakage-safe)   │
│ • Train/Test Split│                               └───────────────────┘
│ • Scaler Fit      │                                        │
│ • LGBMRanker      │                                        │
│ • Predictions     │ ◄──────────────────────────────────────┘
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Aggregate Results │  Combine all window predictions
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Evaluation        │  IC, ICIR, Quintile Returns, Hit Rate
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Portfolio Backtest│  Long-short simulation with costs
└───────────────────┘
        │
        ▼
┌───────────────────┐
│ Output Reports    │  Metrics, plots, predictions CSV
└───────────────────┘
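
The Forward Returns step in the flow above defines the target. A minimal sketch for a single ticker follows, assuming the horizon is measured in calendar days and that the exit price is the first observation on or after the horizon date; the README does not specify either choice, and the function name is hypothetical.

```python
from bisect import bisect_left
from datetime import date, timedelta
from math import log


def forward_returns(dates, prices, horizon_days=365, return_type="simple"):
    """Forward return per observation for one ticker.

    dates must be sorted ascending. Observations with no price at or
    beyond the horizon get None, since their target is not yet known.
    """
    if return_type not in ("simple", "log"):
        raise ValueError("return_type must be 'simple' or 'log'")
    out = []
    for d, p in zip(dates, prices):
        j = bisect_left(dates, d + timedelta(days=horizon_days))
        if j == len(dates):
            out.append(None)
            continue
        ratio = prices[j] / p
        out.append(ratio - 1.0 if return_type == "simple" else log(ratio))
    return out
```

Rows whose target is `None` cannot be used for training, so with a 365-day horizon the most recent year of data only ever contributes features for prediction.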

Data Leakage Prevention

The pipeline implements multiple safeguards:

Temporal Validation — Test data strictly after training data
Scaler Isolation — Fitted on training data only
Cross-Sectional Features — Computed per-timestamp after split
Forward Fill Safety — Sorted by [TICKER, TIMESTAMP] before fill
Cluster Assignment — Using only historical data

See the Data Leakage guide (docs/DATA_LEAKAGE.md) for details.
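
The first two safeguards, temporal validation and scaler isolation, can be sketched as follows. The window sizes, the optional `gap` between train and test, and all names are hypothetical; whether the project leaves a gap to account for the overlap of 365-day labels with the test period is not stated here.

```python
from statistics import mean, pstdev


def rolling_windows(timestamps, train_size, test_size, gap=0):
    """Yield (train, test) timestamp lists; test is always strictly after train."""
    unique = sorted(set(timestamps))
    start = 0
    while start + train_size + gap + test_size <= len(unique):
        train = unique[start:start + train_size]
        test_start = start + train_size + gap
        test = unique[test_start:test_start + test_size]
        yield train, test
        start += test_size  # slide forward by one test block


def fit_scaler(train_values):
    """Fit mean and standard deviation on training values only."""
    mu = mean(train_values)
    sigma = pstdev(train_values) or 1.0  # avoid division by zero
    return mu, sigma


def transform(values, scaler):
    """Apply a previously fitted scaler to train or test values."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]
```

In each window the scaler would be fitted on the training rows and then applied unchanged to the test rows, so no statistic from the test period reaches the model.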

Development Philosophy

Research-Driven — Features based on academic finance literature
Test First — 300+ comprehensive tests for correctness
Avoid Leakage — Multiple safeguards and dedicated tests
Reproducible — Experiment tracking with git commit hashes
Worst case scenario — Config values such as fees are set higher than they would be if implemented in practice (e.g. ignoring Sharesies plans and fee caps, and opting for the high flat percentage fee)

Model trends

The model appears to prefer volatile, illiquid, dividend-paying assets with consistent trends. Some of the top industries are manufacturing, agriculture and seafood.

References

LightGBM Documentation
Gu, Kelly, Xiu (2020) - "Empirical Asset Pricing via Machine Learning"
Ang et al. (2006) - "The Cross-Section of Volatility and Expected Returns"
