This repository contains a literature review and a simple implementation of [OptiMUS: Optimization Modeling Using Solvers and large language models](https://github.com/teshnizi/OptiMUS).
The papers associated with the project are:
- V0.1: OptiMUS: Optimization Modeling Using MIP Solvers and large language models.
- V0.2: OptiMUS: Scalable Optimization Modeling with (MI)LP Solvers and Large Language Models.
- V0.3: OptiMUS-0.3: Using Large Language Models to Model and Solve Optimization Problems at Scale.
This literature review focuses mainly on the V0.1 article.
Optimization problems are common in many fields, such as operations, economics, engineering, and computer science. However, optimization modeling, that is, transforming a business problem into a mathematical optimization problem, requires expert knowledge. Automating optimization modeling would therefore allow people who cannot afford access to optimization experts to improve their work efficiency using optimization techniques. The V0.1 article explores the capabilities and limitations of large language models (LLMs) in optimization, aiming to extend access to optimization across application domains. Its main contributions are:
- They introduced a novel dataset, NLP4LP, consisting of 52 (currently 269) LP and MILP optimization problems. To construct the dataset, they introduced a standardized format for representing optimization problems in natural language.
- They presented OptiMUS, an LLM-based agent that formulates and solves optimization problems.
- They developed techniques to improve the quality of OptiMUS, such as debugging, automated testing, and data augmentation.
You can download the dataset from https://huggingface.co/datasets/udell-lab/NLP4LP.
Please note that NLP4LP is intended and licensed for research use only. The dataset is licensed under CC BY-NC 4.0 (non-commercial use only), and models trained on the dataset should not be used outside of research purposes.
OptiMUS starts with a structured description of the optimization problem, called SNOP, and a separate data file. It first transforms the SNOP into (1) a mathematical formulation of the problem and (2) tests that check the validity of a purported solution. Next, OptiMUS translates the mathematical formulation into code for a solver (e.g., Gurobi, CPLEX) and combines that code with the problem data to solve the problem. If the code raises an error or fails a test, OptiMUS revises the code (debugging) and repeats until the problem is solved or the maximum number of iterations is reached.
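The generate, run, test, and revise loop described above can be sketched as follows. This is a minimal illustration, not the OptiMUS implementation: `solve_with_debugging`, `write_code`, and `fake_llm` are hypothetical names, the LLM is replaced by a stub, and the generated code is assumed to define a function `solve(data)` that returns a dictionary.

```python
import traceback
from typing import Callable, Optional, Sequence


def solve_with_debugging(
    formulation: str,
    data: dict,
    write_code: Callable[[str, Optional[str], Optional[str]], str],
    tests: Sequence[Callable[[dict], bool]],
    max_iterations: int = 3,
) -> Optional[dict]:
    """Generate solver code, run it, and request a revision on failure."""
    code: Optional[str] = None
    feedback: Optional[str] = None
    for _ in range(max_iterations):
        # Hypothetical LLM call: receives the previous code and the feedback.
        code = write_code(formulation, code, feedback)
        namespace: dict = {}
        try:
            # Illustration only: real generated code should run in a sandbox.
            exec(code, namespace)
            solution = namespace["solve"](data)
        except Exception:
            # Syntax or runtime error: pass the traceback back for debugging.
            feedback = traceback.format_exc()
            continue
        failed = [t.__name__ for t in tests if not t(solution)]
        if not failed:
            return solution
        feedback = "failed tests: " + ", ".join(failed)
    return None  # iteration budget exhausted


def fake_llm(formulation, previous_code, feedback):
    """Stand-in for an LLM: buggy on the first call, fixed after feedback."""
    if feedback is None:
        return "def solve(data):\n    return {'x': data['missing']}\n"
    return (
        "def solve(data):\n"
        "    return {'x': min(data['capacity'], data['demand'])}\n"
    )


def x_within_capacity(solution: dict) -> bool:
    return solution["x"] <= 10


result = solve_with_debugging(
    "maximize x subject to x <= capacity, x <= demand",
    {"capacity": 10, "demand": 7},
    fake_llm,
    [x_within_capacity],
)
print(result)
```

In this toy run the first attempt raises a `KeyError`, the traceback is fed back to the stub, and the second attempt passes the test.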
Left: natural-language problem description. Right: SNOP (Source: https://arxiv.org/pdf/2310.06116)
SNOP is no longer required, because the V0.2 article introduced a preprocessor that converts natural-language descriptions into structured form.
They evaluated OptiMUS on the NLP4LP dataset using the GPT-3.5 and GPT-4 models. The experiments covered the following five modes:
- Prompt (baseline): The problem is described in a few prompts and the LLM is asked to formulate the problem and write code to solve it using a given optimization solver.
- Prompt + Debug: In addition to the above, if the code raises syntax or runtime errors, the errors along with the code and the problem info are passed to the LLM, which is prompted to debug the code.
- Prompt + Debug + AutoTests: In addition to the above, when the code successfully runs and produces an output, automatically-generated tests are run to check the validity of the output.
- Prompt + Debug + Supervised Tests: Same as the above, except that the automatically-generated tests are all manually revised and fixed by experts if necessary.
- Prompt + Debug + Supervised Tests + Augmentation (OptiMUS): In addition to the above, each problem is rephrased using an LLM five times, and then the whole pipeline is applied to each of the rephrased versions independently.
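The augmentation step in the last mode can be sketched as follows. The names `run_augmented`, `rephrase`, and `pipeline` are hypothetical, and both callables are stubs here. The sketch only returns one outcome per rephrasing; how those outcomes are combined into a final answer is not covered by this summary, so it is left to the caller.

```python
from typing import Callable, List, Optional


def run_augmented(
    problem: str,
    rephrase: Callable[[str, int], str],
    pipeline: Callable[[str], Optional[dict]],
    n_rephrasings: int = 5,
) -> List[Optional[dict]]:
    """Run the full pipeline independently on each rephrased problem."""
    variants = [rephrase(problem, i) for i in range(n_rephrasings)]
    # One outcome per variant; None marks a variant that was not solved.
    return [pipeline(variant) for variant in variants]


def stub_rephrase(text: str, index: int) -> str:
    """Stand-in for an LLM rephrasing call."""
    return f"{text} (variant {index})"


def stub_pipeline(text: str) -> Optional[dict]:
    """Stand-in for the prompt, debug, and test pipeline."""
    return {"x": 7} if "variant 3" in text else None


outcomes = run_augmented("maximize x", stub_rephrase, stub_pipeline)
solved_any = any(outcome is not None for outcome in outcomes)
```

Giving the model several independent attempts is what makes this mode more forgiving: a single rephrasing that leads to a valid run is enough for the caller to have a candidate solution.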
They assessed the models on two metrics: success rate (the fraction of outputs that satisfy all constraints and attain the optimal solution) and execution rate (the fraction of generated programs that are executable and produce an output).
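As a small worked example of the two metrics, the sketch below computes both rates from per-problem outcomes. The `RunOutcome` record and its field names are assumptions made for illustration, and the four outcomes are made-up toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class RunOutcome:
    executed: bool  # the generated code ran and produced an output
    feasible: bool  # the output satisfies all constraints
    optimal: bool   # the output attains the known optimal value


def rates(outcomes: Sequence[RunOutcome]) -> Tuple[float, float]:
    """Return (success_rate, execution_rate) over a set of problems."""
    if not outcomes:
        raise ValueError("at least one outcome is required")
    executed = sum(o.executed for o in outcomes)
    # A success must execute, be feasible, and be optimal.
    succeeded = sum(o.executed and o.feasible and o.optimal for o in outcomes)
    return succeeded / len(outcomes), executed / len(outcomes)


toy_outcomes = [
    RunOutcome(executed=True, feasible=True, optimal=True),
    RunOutcome(executed=True, feasible=True, optimal=False),
    RunOutcome(executed=True, feasible=False, optimal=False),
    RunOutcome(executed=False, feasible=False, optimal=False),
]
success_rate, execution_rate = rates(toy_outcomes)
print(success_rate, execution_rate)  # 0.25 0.75
```

By construction the success rate can never exceed the execution rate, since a program that does not run cannot produce a valid solution.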
Success and Execution Rates of the Five Modes (Source: https://arxiv.org/pdf/2310.06116)
With GPT-4, the success rate improves with each additional feature, and OptiMUS (Prompt + Debug + Supervised Tests + Augmentation) achieves the highest success and execution rates among the five modes. With the weaker model (GPT-3.5), however, the results fluctuate across modes. Because GPT-3.5 makes errors more frequently than GPT-4, its debugging and testing rounds can introduce new mistakes and increase the number of incorrect programs. Even so, augmentation (OptiMUS) significantly improves the success rate by giving the model multiple attempts to solve the problem with different rephrasings.
Since the article compared GPT-3.5 and GPT-4, I was curious about the performance of the newest model, GPT-5, on optimization modeling. Because there was an issue accessing the GPT-5 API, I used GPT-5-mini instead and compared it with GPT-4. I selected the first five instances of the NLP4LP dataset as the test set. I applied the Prompt + Debug mode, and the performance was already strong enough that I did not proceed with additional methods such as AutoTests, Supervised Tests, or Augmentation.
An interesting finding is that debugging was triggered by errors as often with GPT-5-mini as with GPT-4, or even more often. This difference likely stems from the distinct purposes of the two models. GPT-4, as a high-performance general-purpose model, excels at solving complex problems, reasoning, and maintaining logical consistency. In contrast, GPT-5-mini is a lightweight and faster model, better suited for short responses, repetitive tasks, and simple code generation. This suggests that a higher model number does not necessarily mean superiority; rather, each model has its own appropriate use cases.
Similar to SNOP, we can design a structured format tailored to transportation. Problem types can be categorized as Network Flow Optimization, Traffic Assignment, Fleet Assignment, Facility Location, Routing, and others. This taxonomy allows the system to recognize the nature of the problem and generate appropriate formulations automatically.
Each problem type can be mapped to the most suitable solver or algorithm. For example, routing problems can be linked to specialized graph algorithms (e.g., A*, K-shortest path) rather than generic MILP formulations. This reduces computation time and increases solution reliability, as the LLM does not need to explore unnecessary solution methods.
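The idea of routing each problem type to a suitable method can be sketched as below. Everything here is illustrative: `ProblemType`, `METHOD_BY_TYPE`, `choose_method`, and the fallback to a generic MILP formulation are assumptions, and Dijkstra's algorithm stands in for the A* and K-shortest-path examples mentioned above because it fits in a few lines.

```python
import heapq
from enum import Enum
from typing import Dict, List, Tuple


class ProblemType(Enum):
    NETWORK_FLOW = "network_flow"
    TRAFFIC_ASSIGNMENT = "traffic_assignment"
    FLEET_ASSIGNMENT = "fleet_assignment"
    FACILITY_LOCATION = "facility_location"
    ROUTING = "routing"


# Hypothetical dispatch table: only routing gets a specialized method here.
METHOD_BY_TYPE: Dict[ProblemType, str] = {ProblemType.ROUTING: "graph_search"}


def choose_method(problem_type: ProblemType) -> str:
    """Pick a specialized method if one is registered, else generic MILP."""
    return METHOD_BY_TYPE.get(problem_type, "milp")


Graph = Dict[str, List[Tuple[str, float]]]


def shortest_path(graph: Graph, source: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm; edge weights must be non-negative."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, weight in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + weight, neighbor, path + [neighbor]))
    return float("inf"), []  # target unreachable


# Toy road network with travel times as edge weights.
roads: Graph = {
    "A": [("B", 4.0), ("C", 1.0)],
    "B": [("D", 5.0)],
    "C": [("B", 2.0), ("D", 8.0)],
}

if choose_method(ProblemType.ROUTING) == "graph_search":
    cost, path = shortest_path(roads, "A", "D")
    print(cost, path)  # 8.0 ['A', 'C', 'B', 'D']
```

A table-driven dispatch keeps the choice of method outside the LLM prompt, so adding a new problem type only requires registering one more entry.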
Transportation problems are highly dynamic, with conditions changing over time. By integrating real-time data sources such as Google Maps or Apple Maps APIs, the system can capture up-to-date travel times, congestion, and road closures. Incorporating these data into optimization models significantly improves realism and practical applicability.