110 changes: 4 additions & 106 deletions milp-evolve/README.md
@@ -1,22 +1,4 @@
---
language: English
license: cdla-2.0
multilinguality: monolingual
size_categories:
- 100K<n<1M
source_datasets: original
task_categories:
- multi-task
- regression
- reinforcement-learning
- other
task_ids:
- integrality-gap-prediction
- learning-to-branch
- language-milp-alignment
---

# Dataset Card for MILP-Evolve
# Dataset Card for [MILP-Evolve](https://huggingface.co/datasets/microsoft/MILP-Evolve)

## Table of Contents
- [Dataset Card for MILP-Evolve](#dataset-card-for-milp-evolve)
@@ -25,10 +7,6 @@ task_ids:
- [Dataset Summary](#dataset-summary)
- [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
- [Languages](#languages)
- [Dataset Structure](#dataset-structure)
- [Data Instances](#data-instances)
- [Data Fields](#data-fields)
- [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
- [Curation Rationale](#curation-rationale)
- [Source Data](#source-data)
@@ -52,12 +30,10 @@ task_ids:
## Dataset Description

- **Homepage:** [The OptiGuide Project](https://www.microsoft.com/en-us/research/project/optiguide-genai-for-supply-chain-optimization/?msockid=1a1ccce4197d663e1c2bdd4318e1678d)
- **Repository:** [MILP-Evolve](https://github.com/microsoft/MILP-Evolve)
- **Paper:** [arXiv](https://arxiv.org/abs/2410.08288)
- **Repository:** [MILP-Evolve](https://github.com/microsoft/OptiGuide/tree/main/milp-evolve)
- **Dataset:** [Hugging Face](https://huggingface.co/datasets/microsoft/MILP-Evolve)
- **Paper:** [arXiv](https://arxiv.org/abs/2410.08288), [OpenReview](https://openreview.net/forum?id=6yENDA7J4G)
- **Authors:** Beibin Li, Ishai Menache, Sirui Li, Janardhan Kulkarni, Cathy Wu
- **Point of Contact:**
- [beibin.li@microsoft.com](mailto:beibin.li@microsoft.com)
- Sirui Li

### Dataset Summary

@@ -75,84 +51,6 @@ MILP-Evolve is a large-scale dataset of Mixed Integer Linear Programming (MILP)

The dataset is primarily in English. It includes Python code for MILP formulations, natural language descriptions in English, and standard MILP file formats like MPS.

## Dataset Structure

### Data Instances

An example of a data instance for integrality gap prediction:

```json
{
"milp_class": "ConferenceRoomScheduling",
"instance_id": "CRS_001",
"num_variables": 500,
"num_constraints": 300,
"density": 0.05,
"integrality_gap": 0.12
}
```

An example for learning to branch:

```json
{
"milp_instance": "SetCover_123",
"branching_decisions": [
{"node": 1, "variable": "x_5", "decision": "branch_up"},
{"node": 2, "variable": "x_12", "decision": "branch_down"}
// ...
],
"solve_time": 150.5
}
```

An example for language-MILP alignment:

```json
{
"milp_instance": "ResourceAllocation_456",
"description": "Optimize the allocation of resources to maximize profit while adhering to budget constraints.",
"graph_features": {
"nodes": [...],
"edges": [...],
"node_features": [...],
"edge_features": [...]
}
}
```

### Data Fields

- **milp_class:** String identifier for the MILP class.
- **instance_id:** Unique identifier for each MILP instance.
- **num_variables:** Integer count of variables in the instance.
- **num_constraints:** Integer count of constraints in the instance.
- **density:** Float representing the density of the constraint matrix.
- **integrality_gap:** Float value of the integrality gap (for integrality gap prediction task).
- **branching_decisions:** List of branching decisions taken during the solve (for learning to branch task).
- **solve_time:** Float representing the time taken to solve the instance.
- **description:** Natural language description of the MILP instance (for language-MILP alignment task).
- **graph_features:** Graph representation of the MILP instance, including nodes, edges, and associated features.
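For readers wiring these fields into code, here is a minimal sketch of an integrality-gap record as a `TypedDict`. The field names come from the list above; the concrete types are assumptions inferred from the field descriptions:

```python
from typing import TypedDict


class GapInstance(TypedDict):
    # Fields used by the integrality gap prediction task (types assumed)
    milp_class: str         # string identifier for the MILP class
    instance_id: str        # unique identifier for the instance
    num_variables: int      # number of variables in the instance
    num_constraints: int    # number of constraints in the instance
    density: float          # density of the constraint matrix
    integrality_gap: float  # prediction target


# Example record mirroring the data instance shown above
record: GapInstance = {
    "milp_class": "ConferenceRoomScheduling",
    "instance_id": "CRS_001",
    "num_variables": 500,
    "num_constraints": 300,
    "density": 0.05,
    "integrality_gap": 0.12,
}
```

Typed access like this is purely illustrative; the dataset itself ships as JSON/pickle files, not as Python objects.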

### Data Splits

For **Integrality Gap Prediction**:

- **Training Set:** 38,256 instances from 643 classes.
- **Validation Set:** 9,564 instances from the same classes as training.
- **Test Set:** 11,584 instances from 157 unseen classes.

For **Learning to Branch**:

- **Training Set:** Data from 579 classes, totaling 26,502 instances.
- **Validation Set:** Data from 59 classes, totaling 512 instances.
- **Test Set:** Data from 162 classes, totaling 4,756 instances.

For **Language-MILP Alignment**:

- **Training Set:** Instances from 80% of the classes.
- **Test Set:** Instances from the remaining 20% of the classes.
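The split files for the language-MILP alignment task can be read back with a small helper. This is a sketch under one assumption worth flagging: `contrast_class_split.py` calls `pickle.dump` on an uncompressed file even though the file name ends in `.pkl.gz`, so the loader tries gzip first and falls back to plain pickle:

```python
import gzip
import pickle


def load_split(path):
    """Load a train/test split file such as train_ours_data.pkl.gz.

    Tries gzip first, then falls back to plain pickle, because the writer
    in contrast_class_split.py pickles directly into a file with a .gz
    suffix. The dual handling is a defensive assumption, not documented
    behavior.
    """
    try:
        with gzip.open(path, "rb") as f:
            return pickle.load(f)
    except OSError:  # gzip.BadGzipFile is a subclass of OSError
        with open(path, "rb") as f:
            return pickle.load(f)
```

Usage: `data = load_split("save_dir/contrast/train_ours_data.pkl.gz")` returns the dict mapping instance paths to their associated texts.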

## Dataset Creation

### Curation Rationale
Binary file removed milp-evolve/data/.DS_Store
Binary file not shown.
Binary file removed milp-evolve/data/milp_code/.DS_Store
Binary file not shown.
Binary file removed milp-evolve/data/milp_code/evolve_tab1/.DS_Store
Binary file not shown.
Binary file not shown.
Binary file removed milp-evolve/data/milp_code/evolve_tab2/.DS_Store
Binary file not shown.
22 changes: 15 additions & 7 deletions milp-evolve/setup.md
@@ -275,6 +275,7 @@ CUDA_VISIBLE_DEVICES=0 python -u branching_test.py --n_cpus $N_CPUS \
- Extract MILP input features: We run the following code to extract the MILP input features. The code is similar to `gap_collect.py`, except that here we only solve each MILP instance to the root-node LP relaxation to collect the input features; we do not need to solve the MILP instance to optimality (that is only required when collecting gap data).

```script
export N_CPUS=60
export PARENT_DATA_DIR=save_dir/contrast_data # location to save the milp input features
export PARENT_INSTANCES_DIR=save_dir/instances/mps/code_v1 # location where the MILP instances are saved

@@ -296,12 +297,17 @@ python contrast_mps_conv.py --parent_code_dir $PARENT_CODE_DIR --parent_instance

We then run the following code to split the multi-modal dataset (MILP and text) into disjoint train and test splits. In particular, `$MULTIMODAL_DATA_FILE` is a json file that contains a list with the format `[{"milp": path to the input features of the milp instance, "text_path": path to the text description of the milp instance}, ...]` to split into train and test sets with disjoint MILP classes. The train/test splits are saved as `{out_dir}/train_{out_suffix}_data.pkl.gz` and `{out_dir}/test_{out_suffix}_data.pkl.gz`, which are used to train and test the language-MILP contrastive model.

```script
export MULTIMODAL_DATA_FILE=[json file that provides the associated milp and text paths]
export OUT_DIR=save_dir/contrast
export OUT_SUFFIX=ours

python contrast_class_split.py --multimodal_data_file $MULTIMODAL_DATA_FILE --out_dir $OUT_DIR --out_suffix $OUT_SUFFIX
```
export PARENT_CODE_DIR=milp_code_v1/code
export PARENT_DATA_DIR=save_dir/contrast/data
export PARENT_DESC_DIR=save_dir/contrast/conv
export PARENT_SAVE_DIR=save_dir/contrast
export MULTIMODAL_DATA_FILE=save_dir/contrast/data_ours.json
export OUT_SUFFIX=ours_

python contrast_class_split.py --parent_code_dir $PARENT_CODE_DIR --parent_data_dir $PARENT_DATA_DIR \
--parent_desc_dir $PARENT_DESC_DIR --parent_save_dir $PARENT_SAVE_DIR \
--multimodal_data_file $MULTIMODAL_DATA_FILE --out_suffix $OUT_SUFFIX
```
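As a sketch of how `$MULTIMODAL_DATA_FILE` can be produced, the helper below writes the paired-paths listing described above. One discrepancy is worth checking: the text above names the first key `"milp"`, while the `build_dataset` function in `contrast_class_split.py` emits `"image"`; the default key here follows the script, but verify it against your version:

```python
import json


def write_multimodal_index(pairs, out_file, milp_key="image"):
    """Write the JSON index consumed by contrast_class_split.py.

    pairs: list of (milp_feature_path, text_description_path) tuples.
    milp_key: key for the MILP feature path; "image" matches the
    build_dataset output, while the setup text calls the key "milp".
    """
    data = [{milp_key: feat, "text_path": desc} for feat, desc in pairs]
    with open(out_file, "w") as f:
        json.dump(data, f, indent=2)
    return data
```

The paths themselves are not validated here; `contrast_class_split.py` already drops entries whose files do not exist.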

</details>
@@ -319,4 +325,6 @@ export TEXT_TYPES="description only"

python contrast_train_test.py --epochs $EPOCH --dataset $DATASET --eval_epochs $EVAL_EPOCHS --print_iters $PRINT_ITERS --text_types $TEXT_TYPES
```
</details>
</details>


Binary file removed milp-evolve/src/.DS_Store
Binary file not shown.
Binary file removed milp-evolve/src/milp_evolve_llm/.DS_Store
Binary file not shown.
Binary file removed milp-evolve/src/multi_class_learning/.DS_Store
Binary file not shown.
145 changes: 105 additions & 40 deletions milp-evolve/src/multi_class_learning/contrast_class_split.py
@@ -1,25 +1,18 @@
import glob
import json
import os
import pdb
import sys
import argparse
import pickle
import re
import argparse
from collections import defaultdict

import numpy as np

# First, randomly determine the IDs for training and testing
total_length = 10000 # assume we have 10000 classes, which is above the actual number
train_ids = np.random.choice(total_length, 8000, replace=False)
test_ids = [i for i in range(total_length) if i not in train_ids]

def add_data(filename, text):
global TRAIN_DATA, TEST_DATA
_id = milp_id(filename)
if _id in train_ids:
TRAIN_DATA[filename].append(text)
else:
TEST_DATA[filename].append(text)

############# helper functions
def milp_id(path):
x = re.findall(r"milp_(\d+)-", path)
if x:
@@ -35,10 +28,31 @@ def milp_id(path):
z = re.findall(r"(\d+)_algo", path)
if z:
return int(z[0])

z = re.findall(r"(\d+)", path)
if z:
return int(z[0])

raise ValueError("Cannot find the MILP ID for " + path)

def add_data(filename, text, train_ids):
    global TRAIN_DATA, TEST_DATA
    _id = milp_id(filename)
    if _id in train_ids:
        TRAIN_DATA[filename].append(text)
    else:
        TEST_DATA[filename].append(text)

####### helper function
def _remove_heading_spaces(solve_code):
    while True:
        lines = solve_code.split("\n")
        # Check if every line is empty, a comment, or starts with leading spaces
        if all(line.startswith(" ") or line == "" or line.startswith("#") for line in lines):
            # Remove two leading spaces from each line
            solve_code = "\n".join([line[2:] if line.startswith(" ") else line for line in lines])
        else:
            break
    return solve_code

def parse_code(code_filename):
if not os.path.exists(code_filename):
@@ -62,57 +76,108 @@ def parse_code(code_filename):
solve_code = solve_imp[0]
_code = _remove_heading_spaces(solve_code)
ans.append(_code)
# pdb.set_trace()
return ans
#######


def _remove_heading_spaces(solve_code):
while True:
lines = solve_code.split("\n")
# Check if all non-empty lines have leading spaces
if all(line.startswith(" ") or line == "" or line.startswith("#") for line in lines):
# Remove two leading spaces from each line
solve_code = "\n".join([line[2:] if line.startswith(" ") else line for line in lines])
else:
break
return solve_code

####### Now, Load the Data #####
### aggregate data and description
def build_dataset(parent_data_dir, parent_desc_dir, multimodal_data_file, desc_suffix=""):
desc_path_glob = os.path.join(parent_desc_dir, f"*/desc_*{desc_suffix}.txt")

def aggregate_data(multimodal_data_file,
out_dir="save_dir/contrast", out_suffix=""):
descs = glob.glob(desc_path_glob)

count = 0
data = []
for desc in descs:
gz_path = desc.replace(parent_desc_dir, parent_data_dir).replace("desc", "data").replace(desc_suffix, "").replace(".txt", ".pkl.gz")

if not os.path.exists(gz_path):
continue

count += 1
data.append({
"id": str(count), "image": gz_path, "text_path": desc,
"conversations": [{
"from": "human",
"value": "<image>\nDescribe the data."
}, {
"from": "gpt",
"value": open(desc, "r").read()
}]
})

json.dump(data, open(multimodal_data_file, "w"), indent=2)


### split data into train/test/val
def split_data(multimodal_data_file, parent_data_dir, parent_code_dir, parent_save_dir, train_ids, out_suffix=""):
global TRAIN_DATA, TEST_DATA
TRAIN_DATA = defaultdict(list)
TEST_DATA = defaultdict(list)

# First, loading the llava description
multimodal_data = json.load(open(multimodal_data_file, "r"))
multimodal_files = [item["milp"] for item in multimodal_data]
multimodal_files = [item["image"] for item in multimodal_data]
multimodal_desc_files = [item["text_path"] for item in multimodal_data]
# remove files that do not exist
x = zip(multimodal_files, multimodal_desc_files)
x = [item for item in x if os.path.exists(item[0]) and os.path.exists(item[1])]
multimodal_files, multimodal_desc_files = zip(*x)

for i, (mps_file, desc_file) in enumerate(zip(multimodal_files, multimodal_desc_files)):
add_data(mps_file, open(desc_file, "r").read())
add_data(mps_file, open(desc_file, "r").read(), train_ids=train_ids)

class_name = os.path.basename(os.path.dirname(desc_file))
code_filename = os.path.join(parent_code_dir, f"{class_name}.py")

code_filename = re.sub("desc_seed.*.txt", "milp.py", desc_file)
if code_filename.endswith(".py"):
if os.path.exists(code_filename):
for component in parse_code(code_filename):
add_data(mps_file, component)
add_data(mps_file, component, train_ids=train_ids)


if parent_data_dir:
for problem_dir in glob.glob(os.path.join(parent_data_dir, "*")):
_id = milp_id(problem_dir)
src_codename = glob.glob(os.path.join(parent_code_dir, f"milp_{_id}-*.py"))[0]
code_components = parse_code(src_codename)

for mps_filename in glob.glob(os.path.join(problem_dir, "*.pkl.gz")):
for component in code_components:
add_data(mps_filename, component, train_ids=train_ids)

# Finally, dump the data. Let's mainly use the pickle format because of its compression
json.dump(TRAIN_DATA, os.path.join(out_dir, open(f"train_{out_suffix}_data.json", "w")), indent=2)
json.dump(TEST_DATA, os.path.join(out_dir, open(f"test_{out_suffix}_data.json", "w")), indent=2)
json.dump(TRAIN_DATA, open(os.path.join(parent_save_dir, f"train_{out_suffix}data.json"), "w"), indent=2)
json.dump(TEST_DATA, open(os.path.join(parent_save_dir, f"test_{out_suffix}data.json"), "w"), indent=2)

pickle.dump(TRAIN_DATA, os.path.join(out_dir, open(f"train_{out_suffix}_data.pkl.gz", "wb")))
pickle.dump(TEST_DATA, os.path.join(out_dir, open(f"test_{out_suffix}_data.pkl.gz", "wb")))
pickle.dump(TRAIN_DATA, open(os.path.join(parent_save_dir, f"train_{out_suffix}data.pkl.gz"), "wb"))
pickle.dump(TEST_DATA, open(os.path.join(parent_save_dir, f"test_{out_suffix}data.pkl.gz"), "wb"))


if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--multimodal_data_file", type=str, default="save_dir/contrast/ours_multimodal.json", help="Multimodal data file")
parser.add_argument("--out_dir", type=str, default="save_dir/contrast", help="Output directory")
parser.add_argument("--out_suffix", type=str, default="ours_", help="Output suffix")
parser.add_argument("--parent_code_dir", type=str, default="milp_code_v1/code")
parser.add_argument("--parent_data_dir", type=str, default="save_dir/contrast/data")
parser.add_argument("--parent_desc_dir", type=str, default="save_dir/contrast/conv")
parser.add_argument("--parent_save_dir", type=str, default="save_dir/contrast")
parser.add_argument("--multimodal_data_file", type=str, default="save_dir/contrast/data.json")
parser.add_argument("--desc_suffix", type=str, default="")
parser.add_argument("--out_suffix", type=str, default="ours")

args = parser.parse_args()

aggregate_data(args.multimodal_data_file, args.out_dir, args.out_suffix)
build_dataset(args.parent_data_dir, args.parent_desc_dir, multimodal_data_file=args.multimodal_data_file,
desc_suffix=args.desc_suffix)


# First, randomly determine the IDs for training and testing
total_length = 10000 # assume we have 10000 classes, which is above the actual number
train_ids = np.random.choice(total_length, 8000, replace=False)
test_ids = [i for i in range(total_length) if i not in train_ids]

split_data(multimodal_data_file = args.multimodal_data_file,
parent_data_dir = args.parent_data_dir,
parent_code_dir=args.parent_code_dir,
parent_save_dir=args.parent_save_dir,
train_ids=train_ids,
out_suffix=args.out_suffix)