This repository contains a purely local setup for running LLM experiments on formalization tasks at three levels of formality: high-formal (SQL), semi-formal (entity/process extraction), and low-formal (management/policy scenarios), all using HuggingFace models. It supports both NVIDIA GPUs (CUDA) and Apple Silicon (MPS/Metal).
```bash
conda create -n llm-formalization python=3.12 -y
conda activate llm-formalization
```

Important: PyTorch requires Python <= 3.12. Do not use 3.13 or 3.14.
Apple Silicon (M1/M2/M3/M4):
```bash
pip install torch torchvision torchaudio
```

Verify MPS is available:

```bash
python - << 'EOF'
import torch
print("MPS available:", torch.backends.mps.is_available())
print("PyTorch version:", torch.__version__)
EOF
```

You should see `MPS available: True`.
Note: 4-bit quantization (BitsAndBytes) is not supported on Apple Silicon. Models run in FP16, which works well on M4 Max with its large unified memory.
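For reference, FP16 loading on MPS with the standard transformers API looks roughly like this (a minimal sketch, not the repo's `LocalChatModel`; the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.3"  # example; any 7B-8B instruct model

tokenizer = AutoTokenizer.from_pretrained(model_id)
# FP16 halves memory vs. FP32; .to("mps") moves the weights onto the Metal GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("mps")
```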
NVIDIA GPU (RTX 3090 etc.):
```bash
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
```

Verify CUDA is available:

```bash
python - << 'EOF'
import torch
print("CUDA available:", torch.cuda.is_available())
print("Device:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")
EOF
```

You should see your GPU listed.
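For contrast with the MPS note above, the `--use-4bit` path used in the test commands below typically corresponds to a BitsAndBytes configuration like this (a sketch against the standard transformers API, not necessarily the repo's exact code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # example hub name or local path

# NF4 4-bit weights with FP16 compute; shrinks an 8B model to roughly 5-6 GB of VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
```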
```bash
pip install -r requirements.txt
```

Run the verification script to check that everything is installed correctly:

```bash
python scripts/verify_setup.py
```

Test that a model can be loaded and can generate text:

```bash
# Test with a HuggingFace hub model (will download on first run)
python scripts/test_model.py --model meta-llama/Meta-Llama-3-8B-Instruct --use-4bit

# Or test with a local model
python scripts/test_model.py --model models/llama3-8b --use-4bit

# Use FP16 instead of 4-bit
python scripts/test_model.py --model models/mistral-7b --fp16
```

Log in to HuggingFace (if needed for gated models):

```bash
huggingface-cli login
```

Download models (optional; they download automatically on first use):

```bash
# Example: download ahead of time
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct --local-dir models/llama3-8b
huggingface-cli download mistralai/Mistral-7B-Instruct-v0.3 --local-dir models/mistral-7b
```
High-formal tasks (SQL):

- Prepare your data: Create `data/high_formal/sql_tasks.csv` with columns:
  - `id`: Task identifier
  - `schema`: Database schema description
  - `question`: Natural language question
  - `gold_sql`: Ground truth SQL query

  See `data/high_formal/sql_tasks.csv.example` for a sample format (a hypothetical row is also sketched after this list).
- Run experiments:

  ```bash
  python scripts/run_high_formal_local.py
  ```
- Evaluate results:

  ```bash
  python scripts/eval_high_formal.py
  ```
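For illustration, here is one way to produce a file in that format with Python's `csv` module (the row content is hypothetical; defer to `sql_tasks.csv.example` for the authoritative format):

```python
import csv

# Hypothetical example row matching the columns listed above.
row = {
    "id": "hf_001",
    "schema": "CREATE TABLE employees (id INT, name TEXT, salary INT);",
    "question": "How many employees earn more than 50000?",
    "gold_sql": "SELECT COUNT(*) FROM employees WHERE salary > 50000;",
}

with open("data/high_formal/sql_tasks.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
```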
Semi-formal tasks (entity/process extraction):

- Prepare your data: Create `data/semi_formal/semi_formal_tasks.csv` with columns:
  - `id`: Task identifier
  - `text`: Input text description
  - `task_type`: "entity" or "process"
  - `gold_extraction`: Ground truth extraction

  See `data/semi_formal/semi_formal_tasks.csv.example` for a sample format.
- Run experiments:

  ```bash
  python scripts/run_semi_formal_local.py
  ```
- Evaluate results (uses semantic similarity; a sketch of the idea follows this list):

  ```bash
  python scripts/eval_semi_formal.py
  ```
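The exact metric lives in `scripts/eval_semi_formal.py`; a typical semantic-similarity comparison, sketched here with sentence-transformers as an assumption about the approach (the encoder choice and strings are illustrative):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # example encoder

gold = "Entities: customer, order, invoice"
pred = "Extracted entities: customer, order, and invoice"

# Embed both strings and compare them with cosine similarity in [-1, 1].
emb = encoder.encode([gold, pred], convert_to_tensor=True)
score = util.cos_sim(emb[0], emb[1]).item()
print(f"similarity = {score:.3f}")
```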
Low-formal tasks (management/policy):

- Prepare your data: Create `data/low_formal/low_formal_tasks.csv` with columns:
  - `id`: Task identifier
  - `scenario`: Business scenario description
  - `question`: Optional question about the scenario

  See `data/low_formal/low_formal_tasks.csv.example` for a sample format.
- Run experiments:

  ```bash
  python scripts/run_low_formal_local.py
  ```
- Manual evaluation: Low-formal tasks require human evaluation. Review the generated responses in the output JSONL file and add ratings manually (a helper sketch follows this list).
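To page through responses for rating, something like this works (the output path and field names are assumptions; check your `OUT_PATH` and the run script's actual keys):

```python
import json

# Hypothetical path and keys; adjust to your OUT_PATH and output schema.
with open("data/results_raw/low_formal_results.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        print(rec.get("id"), "->", str(rec.get("response", ""))[:200])
```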
To measure output consistency across multiple runs:
- Run consistency evaluation (K=5 runs per task by default):

  ```bash
  # For high-formal tasks
  python scripts/run_consistency_eval.py
  ```

  Edit the script to change:
  - `DATA_PATH`: Path to your task CSV
  - `TASK_TYPE`: "high_formal", "semi_formal", or "low_formal"
  - `K_RUNS`: Number of runs per task (default: 5)
- Analyze consistency metrics:

  ```bash
  python scripts/eval_consistency.py
  ```

  This will compute:
  - Consistency scores (the frequency of the most common output; sketched below)
  - Number of unique outputs per task
  - The distribution of consistency across tasks
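The score itself is simple; here is a minimal, self-contained sketch (not the repo's exact implementation):

```python
from collections import Counter

def consistency_score(outputs):
    """Fraction of runs that produced the single most common output."""
    counts = Counter(outputs)
    return counts.most_common(1)[0][1] / len(outputs)

# Example: 5 runs on one task, 3 of which agree.
runs = ["SELECT 1;", "SELECT 1;", "SELECT 2;", "SELECT 1;", "SELECT 3;"]
print(consistency_score(runs))  # 0.6
print(len(set(runs)))           # 3 unique outputs
```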
Edit the respective script files to change the following (illustrated below):

- `MODEL_DIR`: Path to the model (local path or HuggingFace hub name)
- `LOAD_IN_4BIT`: Use 4-bit quantization (`True`) or FP16 (`False`)
- `DATA_PATH`: Path to the input CSV
- `OUT_PATH`: Path to the output JSONL
- `K_RUNS`: Number of runs for consistency evaluation (default: 5)
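For concreteness, a settings block might look like this (constant names from the list above; the values and file names are illustrative):

```python
# Illustrative values only; edit the constants at the top of each run script.
MODEL_DIR = "meta-llama/Meta-Llama-3-8B-Instruct"  # local path or hub name
LOAD_IN_4BIT = True                                # False -> FP16
DATA_PATH = "data/high_formal/sql_tasks.csv"
OUT_PATH = "data/results_raw/high_formal_llama3.jsonl"
K_RUNS = 5                                         # runs per task for consistency
```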
```
ER26/
├── scripts/
│ ├── local_model.py # LocalChatModel class
│ ├── run_high_formal_local.py # Run SQL experiments
│ ├── eval_high_formal.py # Evaluate SQL results
│ ├── run_semi_formal_local.py # Run entity/process extraction
│ ├── eval_semi_formal.py # Evaluate semi-formal (semantic similarity)
│ ├── run_low_formal_local.py # Run management/policy tasks
│ ├── run_consistency_eval.py # Run K iterations for consistency (H2)
│ ├── eval_consistency.py # Analyze consistency metrics
│ ├── verify_setup.py # Verify installation
│ └── test_model.py # Test model loading
├── data/
│ ├── high_formal/ # SQL tasks
│ │ └── sql_tasks.csv.example # Example data format
│ ├── semi_formal/ # Entity/process tasks
│ │ └── semi_formal_tasks.csv.example
│ ├── low_formal/ # Management/policy tasks
│ │ └── low_formal_tasks.csv.example
│ └── results_raw/ # Experiment outputs
├── models/ # Local model storage (optional)
├── requirements.txt # Python dependencies
└── README.md # This file
```
Apple Silicon M4 Max (64GB+ unified memory):
- Strong model (8B): `meta-llama/Meta-Llama-3-8B-Instruct` (FP16; fits easily)
- Baseline model (7B): `mistralai/Mistral-7B-Instruct-v0.3` (FP16)
- No quantization needed; unified memory is shared between CPU and GPU
NVIDIA RTX 3090 (24GB VRAM):
- Strong model (8B): `meta-llama/Meta-Llama-3-8B-Instruct` (use 4-bit quantization)
- Baseline model (7B): `mistralai/Mistral-7B-Instruct-v0.3` (can use FP16)
- High-formal tasks: Run SQL experiments and evaluate with exact match
- Semi-formal tasks: Run extraction experiments and evaluate with semantic similarity
- Low-formal tasks: Run generation experiments and evaluate with human ratings
- Run consistency evaluation for each task type (K=5 runs per task)
- Analyze consistency scores to measure output stability
- Compare consistency across different formalization levels
To compare different models:
- Change `MODEL_DIR` in the respective script
- Update `OUT_PATH` to include the model name
- Run the same experiments with different models
- Compare the results in `data/results_raw/` (a summarizing sketch follows this list)
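A quick way to see what you have per model (assumes your `OUT_PATH` filenames encode the model name, e.g. `high_formal_<model>.jsonl`; the glob pattern is an assumption):

```python
from pathlib import Path

# Count records in each result file under data/results_raw/.
for path in sorted(Path("data/results_raw").glob("*.jsonl")):
    n = sum(1 for _ in path.open())
    print(f"{path.name}: {n} results")
```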
- Apple Silicon: Models run in FP16 on MPS. The M4 Max has plenty of unified memory for 7B-8B models. No quantization needed.
- 4-bit quantization: Use for 8B+ models on NVIDIA GPUs to fit in 24GB VRAM
- FP16: Use for 7B models or when you have headroom (always used on Apple Silicon)
- Temperature: Lower (0.3-0.5) for consistency, higher (0.7-1.0) for diversity (see the sketch after this list)
- Batch processing: Results are written incrementally, so you can stop/resume safely
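The temperature tip maps directly onto transformers' sampling parameters. A self-contained sketch (gpt2 stands in for your actual model so the example runs quickly; swap in your `MODEL_DIR`):

```python
from transformers import pipeline

# Tiny model purely to illustrate the sampling knobs.
generator = pipeline("text-generation", model="gpt2")

out = generator(
    "Translate this request into SQL:",
    max_new_tokens=40,
    do_sample=True,
    temperature=0.4,  # 0.3-0.5 favors consistency; 0.7-1.0 favors diversity
    top_p=0.9,
)
print(out[0]["generated_text"])
```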