Detailed model training control and batch modelling #365

Draft
nikitakuklev wants to merge 29 commits into xopt-org:main from nikitakuklev:model_training

Conversation


@nikitakuklev (Collaborator) commented on Sep 29, 2025

This PR introduces batch GP models and finer control over model training. The former is useful for scalarized objectives, while the latter is necessary to speed up BO in operational contexts. As recently demonstrated in a NAPAC25 talk, fitting tolerances can be relaxed significantly to meet real-time requirements without impacting convergence, especially with scalarized objectives. There is also a physical motivation - we cannot set the physical devices precisely enough for exact fitting to matter.

Changes:

  • New model constructor parameters to control training (passed through to the optimizer), plus Pydantic classes wrapping the key Adam/LBFGS knobs (see the sketch after this list).
  • New option to use Adam/torch as the fitting optimizer, with appropriate defaults.
  • New batched GP model constructor. It is not used by default for now, but once tooling such as the visualizer supports it, we should probably make it the default for large datasets.
  • Complete rework of the benchmarking scripts, moving them into resources for easier import. Profiling and benchmarking of snippets can now be run with either a fixed run count or a fixed time budget.
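
To make the training-control changes concrete, here is a minimal sketch of relaxed, Adam-based GP fitting using plain BoTorch/GPyTorch (not the new Xopt constructor API added in this PR); the data shapes, learning rate, and step budget are illustrative placeholders:

```python
# Illustration only: plain BoTorch/GPyTorch, not the new Xopt constructor options.
# Shows the idea of "relaxed" fitting: a small, fixed Adam step budget instead of
# running LBFGS to tight convergence.
import torch
from botorch.models import SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood

# placeholder data; shapes and values are illustrative
train_X = torch.rand(200, 12, dtype=torch.double)
train_Y = torch.sin(train_X.sum(dim=-1, keepdim=True))

model = SingleTaskGP(train_X, train_Y)
mll = ExactMarginalLogLikelihood(model.likelihood, model)

model.train()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
for _ in range(100):  # relaxed budget: fixed number of Adam steps
    optimizer.zero_grad()
    output = model(train_X)
    # fit against the model's own (possibly transformed) training targets
    loss = -mll(output, model.train_targets)
    loss.backward()
    optimizer.step()
model.eval()
```

The new constructor parameters are meant to expose this kind of control (optimizer choice, iteration limits, tolerances) through Pydantic options rather than a hand-written loop.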

Caveats:

  • Batched model hyperparameters do not exactly match those of the list model, unless the data is identical on all outputs. This appears to be an issue somewhere in botorch, since bare gpytorch works as expected (see the sketch after this list for how the two layouts are built).
  • The batched model is slower on small problems on GPU, and sometimes on CPU. Several variables determine the crossover point, such as the switch from Cholesky to CG-based solvers; as a rough guideline, it is slower for n <= 100. For large problems there are significant gains on GPU, which makes batch modelling worthwhile.
  • The default behavior with cached hyperparameters has changed - the model is now trained in all cases, so with cached hyperparameters it is fine-tuned from the previous state. To disable this, set train_model=False.
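
For reference, a minimal sketch of the two model layouts being compared in the first caveat, written with plain BoTorch calls rather than the new constructor; the synthetic data below only mirrors the benchmark shape (n=500, 12 variables, 5 outputs):

```python
# Illustration only: batched multi-output GP vs. ModelListGP of independent GPs.
import torch
from botorch.fit import fit_gpytorch_mll
from botorch.models import ModelListGP, SingleTaskGP
from gpytorch.mlls import ExactMarginalLogLikelihood, SumMarginalLogLikelihood

# placeholder data mirroring the benchmark shape: 500 points, 12 variables, 5 outputs
train_X = torch.rand(500, 12, dtype=torch.double)
train_Y = torch.stack(
    [torch.sin((i + 1) * train_X.sum(dim=-1)) for i in range(5)], dim=-1
)

# batched layout: one GP object whose batch dimension covers all 5 outputs
batched = SingleTaskGP(train_X, train_Y)
fit_gpytorch_mll(ExactMarginalLogLikelihood(batched.likelihood, batched))

# list layout: 5 independent single-output GPs, fit one at a time
listed = ModelListGP(
    *[SingleTaskGP(train_X, train_Y[:, i : i + 1]) for i in range(5)]
)
fit_gpytorch_mll(SumMarginalLogLikelihood(listed.likelihood, listed))
```

After fitting, the per-output hyperparameters of the two layouts can be compared directly; per the caveat above, they only agree when all outputs share identical training data.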

Benchmarks for n_vars=12, n_obj=5, n_constr=2, n=500
CPU:

| f | n | t_avg | t_med | t_max | t_min | t_tot | std |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| bench_build_standard | 10 | 1.266 | 1.220 | 1.684 | 1.214 | 12.660 | 0.139 |
| bench_build_batched | 10 | 1.563 | 1.561 | 1.580 | 1.553 | 15.630 | 0.009 |
| bench_build_standard_adam | 10 | 2.689 | 2.607 | 3.407 | 2.572 | 26.892 | 0.241 |
| bench_build_batched_adam | 10 | 2.977 | 2.953 | 3.051 | 2.943 | 29.772 | 0.040 |
| bench_build_standard_gpytorch | 10 | 15.057 | 15.042 | 15.415 | 14.909 | 150.569 | 0.144 |
| bench_build_batched_gpytorch | 10 | 14.926 | 14.867 | 15.100 | 14.820 | 149.261 | 0.100 |

GPU (RTX 3070, H100 todo):

| f | n | t_avg | t_med | t_max | t_min | t_tot | std |
| --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| bench_build_standard | 10 | 0.924 | 0.858 | 1.493 | 0.844 | 9.238 | 0.190 |
| bench_build_batched | 10 | 0.726 | 0.716 | 0.826 | 0.666 | 7.258 | 0.041 |
| bench_build_standard_adam | 10 | 1.918 | 1.884 | 2.505 | 1.690 | 19.177 | 0.218 |
| bench_build_batched_adam | 10 | 1.148 | 1.136 | 1.314 | 1.063 | 11.475 | 0.087 |
| bench_build_standard_gpytorch | 10 | 7.773 | 7.762 | 7.890 | 7.706 | 77.725 | 0.061 |
| bench_build_batched_gpytorch | 10 | 5.884 | 5.861 | 6.171 | 5.746 | 58.838 | 0.116 |

To reproduce:
python bench_runner.py bench_build_standard bench_build_batched bench_build_standard_adam bench_build_batched_adam bench_build_standard_gpytorch bench_build_batched_gpytorch -n 10 -device cpu

@roussel-ryan (Collaborator)

Looking good! LMK when this is ready for review and we can have a short discussion to go over it

@nikitakuklev added the enhancement label on Oct 13, 2025