Autoformulation of Mathematical Optimization Models Using LLMs

This repository contains the code for the paper

Autoformulation of Mathematical Optimization Models Using LLMs, published at ICML 2025.

Overview

The repository implements autoformulation of mathematical optimization models, building each formulation step by step. It includes scripts for two primary experimental tasks:

MCTS: Formulates mathematical optimization models step by step, including pruning, ranking, self-evaluation, etc.
DFS: A way to construct a dense tree for ablation studies.

API keys

For OpenAI models, place your key in utils.py.

Running Experiments

Batch Trace Generation Pipeline

Generate MCTS traces for multiple problems using the batch_generate_traces.py script:

Basic Usage

# Run on a range of problems (e.g., --start 0 --end 50)
python batch_generate_traces.py --start 0 --end 50 --output-dir "results_dir" --engine "gpt-4o"

# Run on specific problem indices
python batch_generate_traces.py --indices 0 5 10 15 --output-dir "results_dir" --engine "gpt-4o"

# Run on entire dataset
python batch_generate_traces.py --all --output-dir "results_dir" --engine "gpt-4o"
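
The three commands above select problems by range, by explicit indices, or by taking the whole dataset. A minimal sketch of how such flags could resolve to a list of indices is given below. The flag names come from the commands above; the function `select_indices`, the `dataset_size` argument, the precedence between flags, and treating `--end` as exclusive are all assumptions made for illustration and may differ from what `batch_generate_traces.py` actually does.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Flag names mirror the commands above; defaults are assumed.
    parser = argparse.ArgumentParser(description="Problem selection sketch")
    parser.add_argument("--start", type=int, default=None)
    parser.add_argument("--end", type=int, default=None)
    parser.add_argument("--indices", type=int, nargs="+", default=None)
    parser.add_argument("--all", action="store_true")
    return parser


def select_indices(args: argparse.Namespace, dataset_size: int) -> list[int]:
    """Resolve the selection flags to a sorted list of problem indices."""
    if args.all:
        return list(range(dataset_size))
    if args.indices is not None:
        return sorted(set(args.indices))
    if args.start is not None and args.end is not None:
        # Assumption: --end is exclusive, as in Python's range().
        return list(range(args.start, min(args.end, dataset_size)))
    raise ValueError("pass --all, --indices, or both --start and --end")


if __name__ == "__main__":
    demo = build_parser().parse_args(["--indices", "10", "0", "5"])
    print(select_indices(demo, dataset_size=123))  # prints [0, 5, 10]
```

Checking the real script's `--help` output is the reliable way to confirm whether `--end` is inclusive before launching a long run.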

Model Configuration Options

GPT-4o (OpenAI):

python batch_generate_traces.py --start 0 --end 50 \
    --output-dir "gpt4o_results" \
    --engine "gpt-4o" \
    --n-used-gpt 4  # 4-shot approach

Tinker (Fine-tuned models):

python batch_generate_traces.py --start 0 --end 50 \
    --output-dir "tinker_results" \
    --engine "tinker://682b7b52-9080-58b6-9be7-c91c44b0bdcd:train:0/sampler_weights/final" \
    --n-used-gpt 1  # 1-shot approach

Ollama (Local models):

python batch_generate_traces.py --start 0 --end 50 \
    --output-dir "ollama_results" \
    --engine "ollama:deepseek-r1:8b" \
    --n-used-gpt 1
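
The `--engine` values above follow three visible patterns: a plain model name for OpenAI, a `tinker://` URI, and an `ollama:` prefix. The sketch below shows one way a string could be routed to a backend by its prefix. `parse_engine`, the `Engine` tuple, and the backend labels are hypothetical names for this example; the repository's own dispatch logic may be organized differently.

```python
from typing import NamedTuple


class Engine(NamedTuple):
    backend: str  # "openai", "tinker", or "ollama" (labels chosen for this sketch)
    model: str    # model name or checkpoint URI handed to that backend


def parse_engine(engine: str) -> Engine:
    """Classify an --engine string by its prefix."""
    if engine.startswith("tinker://"):
        # Keep the full URI: it identifies the fine-tuned sampler weights.
        return Engine("tinker", engine)
    if engine.startswith("ollama:"):
        # Strip only the leading "ollama:" so tags like "deepseek-r1:8b" survive.
        return Engine("ollama", engine[len("ollama:"):])
    # Assumption: anything else is an OpenAI model name such as "gpt-4o".
    return Engine("openai", engine)


if __name__ == "__main__":
    print(parse_engine("ollama:deepseek-r1:8b"))
```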

MCTS Parameters

--n-used-gpt: Number of LLM generations per step (default: 4)
- 1 for 1-shot (~31 LLM calls per problem, faster)
- 4 for 4-shot (~124 LLM calls per problem, higher quality)
--n-reward: Number of reward evaluations (default: 3)
--n-top: Number of top candidates to keep (default: 2)
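
The two call counts quoted above (about 31 and about 124 per problem) are consistent with the number of calls growing linearly in `--n-used-gpt`. Under that assumption, a rough budget for a batch can be estimated as follows. The helper `estimate_llm_calls` is hypothetical, and the linear scaling is inferred from those two figures only, so treat the result as an approximation.

```python
CALLS_PER_PROBLEM_1_SHOT = 31  # approximate figure quoted above


def estimate_llm_calls(n_problems: int, n_used_gpt: int) -> int:
    """Rough call budget, assuming calls scale linearly with --n-used-gpt."""
    if n_problems < 0 or n_used_gpt < 1:
        raise ValueError("n_problems must be >= 0 and n_used_gpt >= 1")
    return n_problems * n_used_gpt * CALLS_PER_PROBLEM_1_SHOT


if __name__ == "__main__":
    for shots in (1, 4):
        print(f"{shots}-shot, 123 problems: ~{estimate_llm_calls(123, shots)} calls")
```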

Example Experiments

1-shot GPT-4o (Cost-efficient):

python batch_generate_traces.py --start 0 --end 123 \
    --output-dir "gpt4o_single_call_results" \
    --engine "gpt-4o" \
    --n-used-gpt 1

4-shot Tinker v3 (Best fine-tuned model):

python batch_generate_traces.py --start 0 --end 123 \
    --output-dir "tinker_4shot_results_v3" \
    --engine "tinker://682b7b52-9080-58b6-9be7-c91c44b0bdcd:train:0/sampler_weights/final" \
    --n-used-gpt 4

Evaluation Metrics Calculation

After generating traces, evaluate the results using analyze_tinker_results.py:

python analyze_tinker_results.py --results-dir "your_results_directory"

Metrics Explained

The script calculates four key metrics:

Success Rate: Percentage of problems with at least one optimal solution
- Formula: (Problems with ≥1 optimal) / (Total problems attempted)
Completion Rate: Percentage of problems that finished MCTS exploration
- Formula: (Problems completed) / (Total problems attempted)
Quality on Completed: Average accuracy for problems that completed
- Formula: Average(optimal solutions / total attempts) for completed problems only
Overall Correct: Same as Success Rate (problems solved / total attempted)
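
The formulas above can be sketched in a few lines of Python. The `ProblemResult` record and `compute_metrics` function are illustrative names, not the data structures used by `analyze_tinker_results.py`; the sketch assumes each problem records whether MCTS completed, how many attempts were made, and how many of them reached the optimum.

```python
from dataclasses import dataclass


@dataclass
class ProblemResult:
    completed: bool   # MCTS exploration finished
    n_optimal: int    # attempts that matched the ground-truth optimum
    n_attempts: int   # total attempts recorded for the problem


def compute_metrics(results: list[ProblemResult]) -> dict[str, float]:
    """Return the four metrics as percentages; incomplete problems count as failures."""
    total = len(results)
    if total == 0:
        raise ValueError("no problems to evaluate")
    solved = sum(1 for r in results if r.n_optimal >= 1)
    completed = [r for r in results if r.completed]
    # Per-problem accuracy is averaged over completed problems only.
    scored = [r for r in completed if r.n_attempts > 0]
    quality = (
        sum(r.n_optimal / r.n_attempts for r in scored) / len(scored)
        if scored else 0.0
    )
    success_rate = 100.0 * solved / total
    return {
        "success_rate": success_rate,
        "completion_rate": 100.0 * len(completed) / total,
        "quality_on_completed": 100.0 * quality,
        "overall_correct": success_rate,  # defined above as equal to success rate
    }
```

Because Success Rate divides by all attempted problems while Quality on Completed divides only by completed ones, a model that completes few problems can show high quality and low success at the same time.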

Example Evaluation

# Evaluate GPT-4o results
python analyze_tinker_results.py --results-dir "gpt4o_single_call_results"

# Evaluate Tinker results
python analyze_tinker_results.py --results-dir "tinker_4shot_results_v3"

# Evaluate baseline
python analyze_tinker_results.py --results-dir "NL4OPT_results"

Sample Output

Found 123 problem directories
================================================================================
Problem 0: 1/1 optimal (100.0%)
  Ground Truth: 1160.0, Best Found: 1160.0
...
================================================================================

SUMMARY:
Total problems attempted: 123
Problems completed: 87 (70.7%)
Problems failed/incomplete: 36 (29.3%)

Problems with at least one optimal solution: 76/123 (61.8%)
Average success rate (completed only): 68.3%

================================================================================
METRICS BREAKDOWN (treating incomplete as failures):
================================================================================
1. Success Rate (problems solved):          61.8%
2. Completion Rate (problems finished):     70.7%
3. Quality on Completed (avg accuracy):     68.3%
4. Overall Correct (same as #1):            61.8%
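
The per-problem lines in the output above have a regular shape, so they can be collected for further analysis with a small parser. This is a sketch that assumes the line format shown in the sample stays stable; `parse_problem_lines` is a hypothetical helper and not part of the repository.

```python
import re

# Matches lines such as "Problem 0: 1/1 optimal (100.0%)".
PROBLEM_LINE = re.compile(r"^Problem (\d+): (\d+)/(\d+) optimal")


def parse_problem_lines(report: str) -> dict[int, tuple[int, int]]:
    """Map problem index -> (optimal solutions, total attempts)."""
    results: dict[int, tuple[int, int]] = {}
    for line in report.splitlines():
        match = PROBLEM_LINE.match(line.strip())
        if match:
            idx, optimal, total = (int(g) for g in match.groups())
            results[idx] = (optimal, total)
    return results


if __name__ == "__main__":
    sample = "Problem 0: 1/1 optimal (100.0%)\n  Ground Truth: 1160.0, Best Found: 1160.0"
    print(parse_problem_lines(sample))  # prints {0: (1, 1)}
```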

Experimental Results Summary

Model      Configuration  Success Rate  Completion Rate  Quality on Completed
GPT-4o     1-shot         67.3%         80.8%            83.3%
GPT-4o     4-shot         92.7%         N/A              90.1%
Tinker v3  1-shot         14.6%         17.9%            81.8%
Tinker v3  4-shot         61.8%         70.7%            68.3%

NL4OPT

To run the whole NL4OPT dataset:

sh run_all_NLP.sh

The code has not been fully cleaned up yet, but all the components are included.

Citation

@inproceedings{astorgaautoformulation,
  title={Autoformulation of Mathematical Optimization Models Using LLMs},
  author={Astorga, Nicol{\'a}s and Liu, Tennison and Xiao, Yuanzhang and van der Schaar, Mihaela},
  booktitle={Forty-second International Conference on Machine Learning}
}
