Intelligent Scheduling System for Large Language Model Benchmarking

SpectraBench is an advanced benchmarking tool developed by the AI Platform Team at KISTI Large-scale AI Research Center (Korea Institute of Science and Technology Information). The Large-scale AI Research Center was officially launched in March 2024, building upon KISTI's generative large language model 'KONI (KISTI Open Natural Intelligence)' unveiled in December 2023. The AI Platform Team is responsible for developing AI model and Agent service technologies.

SpectraBench optimizes LLM evaluation through smart scheduling, using machine learning to determine the best execution order and resource allocation, dramatically reducing evaluation time and preventing out-of-memory errors.

✨ Why SpectraBench?

  • 🚀 Intelligent scheduling - Optimal execution order through machine learning
  • 💾 Smart resource management - Prevents memory errors with adaptive allocation
  • 📊 Maximized efficiency - Better hardware utilization than sequential evaluation
  • 🤖 Adaptive learning - Continuously improves with usage
  • 🔧 Flexible configuration - Easy switching between test and production setups
  • 📈 3-stage evolution - Intelligent → Hybrid → Adaptive progression

🎯 Supported Models (22+ Models)

Korean Language Models

  • EXAONE 3.5 (2.4B, 7.8B, 32B Instruct)
  • HyperCLOVA-X SEED (0.5B, 1.5B Text Instruct)
  • Korean Bllossom (3B, 8B)
  • Kanana 1.5 (2.1B base/instruct, 8B base)
  • KISTI-KONI (8B Instruct)
  • KULLM3
  • ETRI Eagle (3B preview)
  • SaltLux Luxia (21.4B alignment v1.0/v1.2, Ko-Llama3-8B)

International Models

  • LLaMA 3.1 (8B)
  • LLaMA-DNA (8B Instruct)
  • Gemma 3 (4B, 12B Instruct)
  • Mistral 7B (v0.3)
  • Qwen 3 (8B)

All models from 0.5B to 32B parameters are fully supported with automatic size detection and resource optimization.
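
For illustration, automatic size detection can be as simple as parsing the parameter count embedded in a Hugging Face model ID. The helper below is a hypothetical sketch under that assumption, not SpectraBench's actual detection code:

import re

def parse_model_size_b(model_id: str) -> float | None:
    """Extract the parameter count (in billions) from a model ID.

    Hypothetical helper: matches tokens like '8B', '2.4B', or '0.5B'.
    SpectraBench's real detection logic may differ.
    """
    match = re.search(r"(\d+(?:\.\d+)?)[bB]\b", model_id)
    return float(match.group(1)) if match else None

print(parse_model_size_b("LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"))  # 7.8
print(parse_model_size_b("meta-llama/Llama-3.1-8B"))               # 8.0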

📊 Supported Benchmarks (21+ Tasks)

Korean Tasks

  • KMMLU / KMMLU Hard - Korean language understanding
  • Haerae - Korean cultural knowledge
  • KoBEST - Korean benchmark suite
  • CSATQA - Korean college entrance exam
  • KorMedMCQA - Korean medical knowledge
  • Belebele - Multilingual reading comprehension

Reasoning & Knowledge

  • MMLU / MMLU Pro - Multitask language understanding
  • BBH - Big-Bench Hard reasoning
  • GPQA - Graduate-level science questions
  • AGIEval - Human-centric benchmark

Math & Science

  • GSM8K - Math reasoning
  • HRM8K - Advanced math problems
  • ARC Challenge/Easy - Science questions

Language Understanding

  • HellaSwag - Commonsense reasoning
  • PIQA - Physical interaction reasoning
  • TriviaQA - Factual knowledge
  • Natural Questions Open - Open-domain QA

Code Generation

  • HumanEval - Python code generation

Total: 462 possible model-task combinations (22 models × 21 tasks)

🚀 Quick Start

Automated Installation & Execution

For users who want to quickly run experiments:

# 1. Clone repository
git clone https://github.com/gwleee/SpectraBench.git
cd SpectraBench

# 2. Automated installation (recommended)
chmod +x install.sh
./install.sh

# 3. Run experiments
chmod +x run_experiments.sh

# Quick test (15-30 minutes)
./run_experiments.sh --quick

# Full comparison (4-8 hours)
./run_experiments.sh --comparison

Manual Installation

If you prefer manual setup:

git clone https://github.com/gwleee/SpectraBench.git
cd SpectraBench

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Configure environment variables (optional)
cp .env.example .env
# Edit .env with your API keys

Basic Usage

Interactive Mode (Standard Approach)

# Activate environment
source venv/bin/activate

# Run interactive benchmark
python entrance.py

# Choose execution mode:
# 1. Baseline mode (Traditional sequential)
# 2. Optimized mode (Intelligent scheduling) 
# 3. Comparison mode (Both modes for analysis)

# Choose dataset size:
# - Test mode (limit=2): Quick validation for new models/tasks
# - Full mode (limit=None): Complete benchmark evaluation

Automated Execution Scripts

The repository includes automated scripts for easy experiment execution:

| Script | Purpose | Duration | Best For |
|--------|---------|----------|----------|
| install.sh | Complete environment setup | 5-15 min | First-time setup |
| run_experiments.sh | Basic test run | 15-30 min | Quick validation |
| run_experiments.sh --quick | Fast test (limit=2) | 15-30 min | Development/testing |
| run_experiments.sh --comparison | Full comparison | 4-8 hours | Comprehensive evaluation |

Dataset Limit Configuration

For faster testing and validation, SpectraBench supports configurable data limits:

# In evaluation_lm.py or entrance.py
FULL_RUN = False  # Uses limit=2 for quick testing
FULL_RUN = True   # Uses full dataset

# You can modify the limit value in the code:
limit = 2    # Quick test (default for development)
limit = 10   # Small validation 
limit = 50   # Medium test
limit = None # Full dataset (production)

When to use different limits:

  • limit=2: New model testing, debugging, proof-of-concept (minutes)
  • limit=10-50: Development validation, parameter tuning (hours)
  • limit=None: Production evaluation, research papers (full runtime)
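
As a rough illustration of how such a limit typically gates dataset loading, here is a sketch using the Hugging Face datasets library; the helper name and the hard-coded task are assumptions, not SpectraBench's API:

from datasets import load_dataset

def load_limited(dataset_name: str, config: str, limit: int | None = 2):
    """Load an evaluation split, truncated to `limit` examples.

    Illustrative only: limit=None mirrors FULL_RUN=True (full dataset),
    while a small integer mirrors the quick-test setting.
    """
    ds = load_dataset(dataset_name, config, split="test")
    return ds if limit is None else ds.select(range(min(limit, len(ds))))

sample = load_limited("gsm8k", "main", limit=2)  # two examples for a smoke test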

Advanced Usage

# Use extended configuration (rename files first)
mv "code/config/models copy.yaml" code/config/models.yaml
mv "code/config/tasks copy.yaml" code/config/tasks.yaml
python entrance.py

# Run experimental validation
python code/experiments/phase1_validation_experiment.py  # Limit optimization
python code/experiments/phase2_threshold_optimization.py  # Threshold tuning

# Generate performance comparison report
python code/experiments/comparison_analyzer.py

Configuration

SpectraBench provides two configuration sets for different use cases:

Standard Configuration (Faster Testing)

  • models.yaml - 13 carefully selected models for efficient testing
  • tasks.yaml - 10 core benchmarks for quick evaluation
  • Total combinations: 130 (13 × 10)
  • Dataset limit: Configurable (default limit=2 for testing)
  • Recommended for: Development, testing, proof-of-concept, new model validation

Extended Configuration (Comprehensive Evaluation)

  • models copy.yaml - 22 models covering the full Korean AI ecosystem
  • tasks copy.yaml - 21 benchmarks for thorough assessment
  • Total combinations: 462 (22 × 21)
  • Dataset limit: Full dataset (limit=None) or configurable for testing
  • Recommended for: Research, production evaluation, comprehensive comparison

# Standard models.yaml (13 models)
models:
  - name: "LLaMA 3.1 8B"
    id: "meta-llama/Llama-3.1-8B"
  - name: "EXAONE-3.5-32B-Instruct"
    id: "LGAI-EXAONE/EXAONE-3.5-32B-Instruct"
  # ... 11 more models

# Extended models copy.yaml (22 models)  
models:
  - name: "LLaMA 3.1 8B"
    id: "meta-llama/Llama-3.1-8B"
  - name: "EXAONE-3.5-7.8B-Instruct"    # Additional model
    id: "LGAI-EXAONE/EXAONE-3.5-7.8B-Instruct"
  - name: "KISTI-KONI"                  # Additional model
    id: "KISTI-KONI/KONI-Llama3.1-8B-Instruct-20241024"
  # ... 20 more models including Bllossom, Kanana variants, KULLM3, etc.

# Standard tasks.yaml (10 tasks)
harness_tasks:
  - kmmlu
  - mmlu
  - arc_challenge
  - hellaswag
  # ... 6 more core tasks

# Extended tasks copy.yaml (21 tasks)
harness_tasks:
  - kmmlu
  - kmmlu_hard                         # Additional task
  - belebele                           # Additional task
  - mmlu_pro                           # Additional task
  - triviaqa                           # Additional task
  - nq_open                            # Additional task
  - agieval                            # Additional task
  # ... 14 more comprehensive tasks

Switching Between Configurations

To use the extended configuration, simply rename the files:

# Switch to comprehensive evaluation
mv code/config/models.yaml code/config/models_standard.yaml
mv code/config/tasks.yaml code/config/tasks_standard.yaml
mv code/config/models\ copy.yaml code/config/models.yaml
mv code/config/tasks\ copy.yaml code/config/tasks.yaml

# Switch back to standard configuration
mv code/config/models.yaml code/config/models_extended.yaml
mv code/config/tasks.yaml code/config/tasks_extended.yaml
mv code/config/models_standard.yaml code/config/models.yaml
mv code/config/tasks_standard.yaml code/config/tasks.yaml

Advanced Configuration

The system uses code/config/scheduler_config.yaml for fine-tuning:

# Dynamic threshold management
stage_transitions:
  min_learning_data: 50    # Intelligent → Hybrid
  stable_learning_data: 200  # Hybrid → Adaptive
  
# Domain-specific optimization
domain_thresholds:
  large_models:     # >8B parameters
    min_learning_data: 75
    stable_learning_data: 300
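
To make these thresholds concrete, here is a minimal sketch of a stage selector driven by the amount of accumulated execution history; the function and its defaults mirror the YAML above but are otherwise hypothetical:

def select_stage(num_records: int,
                 min_learning_data: int = 50,
                 stable_learning_data: int = 200) -> str:
    """Map accumulated execution records to a scheduling stage.

    Hypothetical sketch mirroring scheduler_config.yaml: below
    min_learning_data use rules only, in between blend ML with rules,
    above stable_learning_data go fully ML-driven.
    """
    if num_records < min_learning_data:
        return "intelligent"
    if num_records < stable_learning_data:
        return "hybrid"
    return "adaptive"

print(select_stage(120))  # -> 'hybrid'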

🎮 How It Works

SpectraBench uses a 3-stage evolution approach with ConfigManager integration:

  1. 🧠 Intelligent Stage - Rule-based scheduling with smart heuristics
  2. 🔄 Hybrid Stage - ML predictions + rule-based decisions (weighted combination)
  3. 🤖 Adaptive Stage - Fully ML-driven with Random Forest scheduling

Key Features:

  • Dynamic model size detection - Automatically handles 0.5B to 32B models
  • Memory-aware scheduling - Prevents OOM with intelligent batch sizing (see the sketch after this list)
  • Task-specific optimization - Zero-shot detection for appropriate tasks
  • Real-time resource monitoring - GPU memory, utilization tracking
  • Performance learning - Builds prediction models from execution history
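
As an illustration of the memory-aware scheduling bullet, the toy heuristic below sizes a batch from free GPU memory; the fp16 weight estimate and per-sample activation cost are assumptions, and a real scheduler would profile these instead:

import torch

def safe_batch_size(model_size_b: float, per_sample_gb: float = 0.5) -> int:
    """Pick a batch size that should fit in the currently free VRAM.

    Toy heuristic (not SpectraBench's algorithm): assume ~2 GB of
    weights per billion parameters in fp16, keep a fixed safety
    headroom, and spend the rest at an assumed per-sample cost.
    """
    free_bytes, _total = torch.cuda.mem_get_info()  # requires a CUDA device
    free_gb = free_bytes / 1024**3
    usable_gb = free_gb - model_size_b * 2 - 2.0  # weights + 2GB headroom
    return max(1, int(usable_gb / per_sample_gb))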

⚡ Configuration Comparison

Standard Configuration (130 combinations)

  • Purpose: Quick validation, development, testing
  • Models: 13 carefully selected representative models
  • Tasks: 10 core benchmarks
  • Best for: Proof-of-concept, algorithm testing, development cycles

Extended Configuration (462 combinations)

  • Purpose: Comprehensive research, production evaluation
  • Models: 22 models covering the full Korean AI ecosystem
  • Tasks: 21 benchmarks for thorough assessment
  • Best for: Research papers, production deployment, comprehensive comparison

The intelligent scheduling system automatically optimizes both configurations for maximum efficiency.

📈 Key Benefits

SpectraBench's intelligent scheduling provides significant improvements over traditional sequential evaluation:

| Benefit | Description |
|---------|-------------|
| ⏱️ Faster Execution | Optimized task ordering reduces total time |
| 💾 Memory Efficiency | Smart resource management prevents crashes |
| 🎯 Better Utilization | Maximizes GPU usage through intelligent scheduling |
| Higher Success Rate | Adaptive batch sizing reduces failure rates |
| 🧠 Learning System | Gets smarter with each evaluation run |

🔬 Research Pipeline

SpectraBench development follows a systematic 3-phase research approach:

Phase 1: Limit Convergence Analysis (Completed)

python code/experiments/phase1_validation_experiment.py

Purpose: Determines optimal data limits for accuracy convergence analysis.
Results: Integrated into system configurations for efficient testing.

Phase 2: Threshold Optimization (Completed)

python code/experiments/phase2_threshold_optimization.py

Purpose: Optimizes stage transition thresholds based on Phase 1 results.
Results: Applied to code/config/scheduler_config.yaml for intelligent scheduling.

Main Experiments (Ready to Use)

./run_experiments.sh --comparison

Purpose: Executes comprehensive benchmarking using optimized configurations.
Results: Generates performance comparisons and evaluation metrics.

Note: Phase 1 and Phase 2 optimizations have been completed and their results are integrated into the system. Users can directly run main experiments.

🔬 Additional Analysis Tools

Performance Comparison

python code/experiments/comparison_analyzer.py

Detailed baseline vs optimized analysis with visualizations.

Advanced Threshold Testing (Experimental)

python code/experiments/threshold_optimization.py

Comprehensive threshold combination testing with simulation.

🛠️ System Requirements

This system has been successfully deployed and tested on the following environment:

Hardware Environment

  • GPU: 2× NVIDIA A100 (80GB VRAM)
  • RAM: 444GB
  • CPU: Intel Xeon Processor (Icelake, 48 cores)
  • Storage: 1.9TB ext4 for fast model loading and execution data storage

Software Environment

  • OS: Ubuntu 22.04.5 LTS (Kernel 5.15.0-139-generic)
  • Python: 3.10.16
  • PyTorch: 2.7.1
  • CUDA: 12.8
  • Transformers: 4.51.3

⚠️ Important Notes for Different Hardware

For smaller GPU configurations:

  • 8-16GB VRAM: Small models only (0.5B-3B parameters)
  • 24GB VRAM: Models up to 8B parameters
  • 40GB+ VRAM: Most models, though the largest (21.4B, 32B) may still cause OOM

Recommendations:

  • Start with standard configuration (13 models) before trying extended (22 models)
  • Monitor GPU memory usage and adjust model selection accordingly
  • Large models (>20B parameters) require substantial VRAM and may need single-GPU execution
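
A quick way to apply the VRAM tiers above programmatically (hypothetical cutoffs matching the list, not shipped code):

import torch

def max_supported_model_b(device: int = 0) -> float:
    """Translate total VRAM into the largest advisable model size.

    Cutoffs follow the tiers listed above; treat them as rough guidance.
    """
    total_gb = torch.cuda.get_device_properties(device).total_memory / 1024**3
    if total_gb < 24:
        return 3.0   # 8-16GB: small models only (0.5B-3B)
    if total_gb < 40:
        return 8.0   # 24GB: up to 8B models
    return 32.0      # 40GB+: most models; 21.4B/32B may still OOM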

See requirements.txt for complete dependency list.

🚀 Quick Start Guide

To quickly run experiments and validate results:

Quick Validation (15-30 minutes)

git clone https://github.com/gwleee/SpectraBench.git
cd SpectraBench
./install.sh
./run_experiments.sh --quick

Full Evaluation (4-8 hours)

./run_experiments.sh --comparison

The automated scripts handle all configuration and provide detailed logs for analysis. Results are saved in experiments_results/ with performance tracking in data/performanceDB/performance_history.db.
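
If you want to inspect the performance database directly, a schema-agnostic sqlite3 session is enough (generic sketch; the table layout is whatever SpectraBench writes):

import sqlite3

# List every table in the performance history DB with its row count,
# without assuming anything about the schema.
conn = sqlite3.connect("data/performanceDB/performance_history.db")
for (name,) in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
):
    rows = conn.execute(f"SELECT COUNT(*) FROM {name}").fetchone()[0]
    print(f"{name}: {rows} rows")
conn.close()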

📄 License

Apache License 2.0 - see LICENSE for details.

🏛️ Citation

@software{spectrabench2025,
  title={SpectraBench: A Three-Stage Evolution Framework for Intelligent Large Language Model Evaluation},
  author={KISTI Large-scale AI Research Center},
  year={2025},
  doi={10.5281/zenodo.16865650},
  url={https://github.com/gwleee/SpectraBench},
  license={Apache-2.0},
  note={Supports 22+ models and 21+ benchmarks with intelligent scheduling}
}

Developed with ❤️ by Gunwoo Lee from the AI Platform Team (Leader: Ryong Lee) at KISTI Large-scale AI Research Center (Director: Kyong-Ha Lee)

Supporting the Korean AI ecosystem with intelligent benchmarking tools and automated, Docker-based multi-version evaluation
