This repository contains a purely local setup for running LLM experiments on formalization tasks at three levels of formality: high-formal (SQL), semi-formal (entity/process extraction), and low-formal (management/policy scenarios), all using HuggingFace models. It supports both NVIDIA GPUs (CUDA) and Apple Silicon (MPS/Metal).
```bash
conda create -n llm-formalization python=3.12 -y
conda activate llm-formalization
```

Important: PyTorch requires Python <= 3.12. Do not use 3.13 or 3.14.
Apple Silicon (M1/M2/M3/M4):
```bash
pip install torch torchvision torchaudio
```

Verify MPS is available:

```bash
python - << 'EOF'
import torch
print("MPS available:", torch.backends.mps.is_available())
print("PyTorch version:", torch.__version__)
EOF
```

You should see `MPS available: True`.
Note: 4-bit quantization (BitsAndBytes) is not supported on Apple Silicon. Models run in FP16, which works well on M4 Max with its large unified memory.
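For reference, FP16 loading on MPS with the standard transformers API looks roughly like this (a minimal sketch, not the repo's `LocalChatModel`; the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example; any 7B-8B instruct model

tokenizer = AutoTokenizer.from_pretrained(model_id)
# FP16 halves memory vs. FP32; .to("mps") moves the weights onto the Metal GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("mps")
```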
NVIDIA GPU (RTX 3090 etc.):
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

Verify CUDA is available:

```bash
python - << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
EOF
```

You should see your GPU listed.
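For contrast with the MPS note above, the `--use-4bit` path used in the test commands below typically corresponds to a BitsAndBytes configuration like this (a sketch against the standard transformers API, not necessarily the repo's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example hub name or local path

# NF4 4-bit weights with FP16 compute; shrinks an 8B model to roughly 5-6 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```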
```bash
pip install -r requirements.txt
```

Run the verification script to check that everything is installed correctly:

```bash
python scripts/verify_setup.py
```

Test that a model can be loaded and can generate text:

```bash
# Test with a HuggingFace hub model (will download on first run)
python scripts/test_model.py --model meta-llama/Meta-Llama-3-8B-Instruct --use-4bit

# Or test with a local model
python scripts/test_model.py --model models/llama3-8b --use-4bit

# Use FP16 instead of 4-bit
python scripts/test_model.py --model models/mistral-7b --fp16
```

Log in to HuggingFace (if needed for gated models):

```bash
huggingface-cli login
```

Download models (optional; they download automatically on first use):

```bash
# Example: download ahead of time
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir models/llama3-8b
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir models/mistral-7b
```
High-formal tasks (SQL):

- Prepare your data: Create `data/high_formal/sql_tasks.csv` with columns:
  - `id`: Task identifier
  - `schema`: Database schema description
  - `question`: Natural language question
  - `gold_sql`: Ground truth SQL query

  See `data/high_formal/sql_tasks.csv.example` for a sample format (a hypothetical row is also sketched after this list).
- Run experiments:

  ```bash
  python scripts/run_high_formal_local.py
  ```
- Evaluate results:

  ```bash
  python scripts/eval_high_formal.py
  ```
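For illustration, here is one way to produce a file in that format with Python's `csv` module (the row content is hypothetical; defer to `sql_tasks.csv.example` for the authoritative format):

```python
import csv

# Hypothetical example row matching the columns listed above.
row = {
    "id": "hf_001",
    "schema": "CREATE TABLE employees (id INT, name TEXT, salary INT);",
    "question": "How many employees earn more than 50000?",
    "gold_sql": "SELECT COUNT(*) FROM employees WHERE salary > 50000;",
}

with open("data/high_formal/sql_tasks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
```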
Semi-formal tasks (entity/process extraction):

- Prepare your data: Create `data/semi_formal/semi_formal_tasks.csv` with columns:
  - `id`: Task identifier
  - `text`: Input text description
  - `task_type`: "entity" or "process"
  - `gold_extraction`: Ground truth extraction

  See `data/semi_formal/semi_formal_tasks.csv.example` for a sample format.
- Run experiments:

  ```bash
  python scripts/run_semi_formal_local.py
  ```
- Evaluate results (uses semantic similarity; a sketch of the idea follows this list):

  ```bash
  python scripts/eval_semi_formal.py
  ```
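The exact metric lives in `scripts/eval_semi_formal.py`; a typical semantic-similarity comparison, sketched here with sentence-transformers as an assumption about the approach (the encoder choice and strings are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

gold = "Entities: customer, order, invoice"
pred = "Extracted entities: customer, order, and invoice"

# Embed both strings and compare them with cosine similarity in [-1, 1].
emb = encoder.encode([gold, pred], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"similarity = {score:.3f}")
```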
Low-formal tasks (management/policy):

- Prepare your data: Create `data/low_formal/low_formal_tasks.csv` with columns:
  - `id`: Task identifier
  - `scenario`: Business scenario description
  - `question`: Optional question about the scenario

  See `data/low_formal/low_formal_tasks.csv.example` for a sample format.
- Run experiments:

  ```bash
  python scripts/run_low_formal_local.py
  ```
- Manual evaluation: Low-formal tasks require human evaluation. Review the generated responses in the output JSONL file and add ratings manually (a helper sketch follows this list).
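To page through responses for rating, something like this works (the output path and field names are assumptions; check your `OUT_PATH` and the run script's actual keys):

```python
import json

# Hypothetical path and keys; adjust to your OUT_PATH and output schema.
with open("data/results_raw/low_formal_results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        print(rec.get("id"), "->", str(rec.get("response", ""))[:200])
```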
To measure output consistency across multiple runs:
- Run consistency evaluation (K=5 runs per task by default):

  ```bash
  # For high-formal tasks
  python scripts/run_consistency_eval.py
  ```

  Edit the script to change:
  - `DATA_PATH`: Path to your task CSV
  - `TASK_TYPE`: "high_formal", "semi_formal", or "low_formal"
  - `K_RUNS`: Number of runs per task (default: 5)
- Analyze consistency metrics:

  ```bash
  python scripts/eval_consistency.py
  ```

  This will compute:
  - Consistency scores (the frequency of the most common output; sketched below)
  - Number of unique outputs per task
  - The distribution of consistency across tasks
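The score itself is simple; here is a minimal, self-contained sketch (not the repo's exact implementation):

```python
from collections import Counter

def consistency_score(outputs):
    """Fraction of runs that produced the single most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Example: 5 runs on one task, 3 of which agree.
runs = ["SELECT 1;", "SELECT 1;", "SELECT 2;", "SELECT 1;", "SELECT 3;"]
print(consistency_score(runs))  # 0.6
print(len(set(runs)))           # 3 unique outputs
```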
Edit the respective script files to change the following (illustrated below):

- `MODEL_DIR`: Path to the model (local path or HuggingFace hub name)
- `LOAD_IN_4BIT`: Use 4-bit quantization (`True`) or FP16 (`False`)
- `DATA_PATH`: Path to the input CSV
- `OUT_PATH`: Path to the output JSONL
- `K_RUNS`: Number of runs for consistency evaluation (default: 5)
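For concreteness, a settings block might look like this (constant names from the list above; the values and file names are illustrative):

```python
# Illustrative values only; edit the constants at the top of each run script.
MODEL_DIR = "meta-llama/Meta-Llama-3-8B-Instruct"  # local path or hub name
LOAD_IN_4BIT = True                                # False -> FP16
DATA_PATH = "data/high_formal/sql_tasks.csv"
OUT_PATH = "data/results_raw/high_formal_llama3.jsonl"
K_RUNS = 5                                         # runs per task for consistency
```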
```
ER26/
├── scripts/
│ ├── local_model.py # LocalChatModel class
│ ├── run_high_formal_local.py # Run SQL experiments
│ ├── eval_high_formal.py # Evaluate SQL results
│ ├── run_semi_formal_local.py # Run entity/process extraction
│ ├── eval_semi_formal.py # Evaluate semi-formal (semantic similarity)
│ ├── run_low_formal_local.py # Run management/policy tasks
│ ├── run_consistency_eval.py # Run K iterations for consistency (H2)
│ ├── eval_consistency.py # Analyze consistency metrics
│ ├── verify_setup.py # Verify installation
│ └── test_model.py # Test model loading
├── data/
│ ├── high_formal/ # SQL tasks
│ │ └── sql_tasks.csv.example # Example data format
│ ├── semi_formal/ # Entity/process tasks
│ │ └── semi_formal_tasks.csv.example
│ ├── low_formal/ # Management/policy tasks
│ │ └── low_formal_tasks.csv.example
│ └── results_raw/ # Experiment outputs
├── models/ # Local model storage (optional)
├── requirements.txt # Python dependencies
└── README.md # This file
```
Apple Silicon M4 Max (64GB+ unified memory):
- Strong model (8B): `meta-llama/Meta-Llama-3-8B-Instruct` (FP16; fits easily)
- Baseline model (7B): `mistralai/Mistral-7B-Instruct-v0.3` (FP16)
- No quantization needed; unified memory is shared between CPU and GPU
NVIDIA RTX 3090 (24GB VRAM):
- Strong model (8B): `meta-llama/Meta-Llama-3-8B-Instruct` (use 4-bit quantization)
- Baseline model (7B): `mistralai/Mistral-7B-Instruct-v0.3` (can use FP16)
- High-formal tasks: Run SQL experiments and evaluate with exact match
- Semi-formal tasks: Run extraction experiments and evaluate with semantic similarity
- Low-formal tasks: Run generation experiments and evaluate with human ratings
- Run consistency evaluation for each task type (K=5 runs per task)
- Analyze consistency scores to measure output stability
- Compare consistency across different formalization levels
To compare different models:
- Change `MODEL_DIR` in the respective script
- Update `OUT_PATH` to include the model name
- Run the same experiments with different models
- Compare the results in `data/results_raw/` (a summarizing sketch follows this list)
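A quick way to see what you have per model (assumes your `OUT_PATH` filenames encode the model name, e.g. `high_formal_<model>.jsonl`; the glob pattern is an assumption):

```python
from pathlib import Path

# Count records in each result file under data/results_raw/.
for path in sorted(Path("data/results_raw").glob("*.jsonl")):
    n = sum(1 for _ in path.open())
    print(f"{path.name}: {n} results")
```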
- Apple Silicon: Models run in FP16 on MPS. The M4 Max has plenty of unified memory for 7B-8B models. No quantization needed.
- 4-bit quantization: Use for 8B+ models on NVIDIA GPUs to fit in 24GB VRAM
- FP16: Use for 7B models or when you have headroom (always used on Apple Silicon)
- Temperature: Lower (0.3-0.5) for consistency, higher (0.7-1.0) for diversity (see the sketch after this list)
- Batch processing: Results are written incrementally, so you can stop/resume safely
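The temperature tip maps directly onto transformers' sampling parameters. A self-contained sketch (gpt2 stands in for your actual model so the example runs quickly; swap in your `MODEL_DIR`):

```python
from transformers import pipeline

# Tiny model purely to illustrate the sampling knobs.
generator = pipeline("text-generation", model="gpt2")

out = generator(
    "Translate this request into SQL:",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.4,  # 0.3-0.5 favors consistency; 0.7-1.0 favors diversity
    top_p=0.9,
)
print(out[0]["generated_text"])
```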