
SmallDatasetBenchmarks

Testing machine learning classifiers on small tabular datasets. The original blog post is here: https://www.data-cowboys.com/blog/which-models-are-best-for-small-datasets

Setup

uv sync

Experiments

Results are produced in figures.ipynb (all models including AutoML) and figures_no_automl.ipynb (non-AutoML models only). Each benchmark uses nested cross-validation (4-fold outer × 4-fold inner) with stratified random splits and fixed seeds. The evaluation metric is PR AUC (weighted average precision, OvR), which is less sensitive to class imbalance than ROC AUC.
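A minimal sketch of this scoring setup, with the inner tuning loop omitted and illustrative names throughout (not the repo's exact code):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import label_binarize

X, y = load_iris(return_X_y=True)  # stand-in multiclass dataset
classes = np.unique(y)

# Outer loop of the nested CV: 4 stratified folds, fixed seed.
outer_cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
scores = []
for train_idx, test_idx in outer_cv.split(X, y):
    model = RandomForestClassifier(random_state=0)  # tuned via inner CV in the real setup
    model.fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])
    # Weighted one-vs-rest average precision = the PR AUC metric above.
    y_bin = label_binarize(y[test_idx], classes=classes)
    scores.append(average_precision_score(y_bin, proba, average="weighted"))
print(f"mean PR AUC: {np.mean(scores):.3f}")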

Scripts

  • compare_baseline_models.py: SVC, Logistic Regression, Random Forest, tuned with GridSearchCV
  • optuna_models.py: SVC and LogReg tuned with GridSearch; RF, XGBoost, SGD, LightGBM, CatBoost tuned with Optuna (50 trials per fold)
  • benchmark_autogluon.py: AutoGluon with a 1000s wall-clock time budget per fold (best_quality preset, 8 CPUs)
  • benchmark_mljar.py: MLJAR Supervised with a 1000s wall-clock time budget per fold (Compete mode, n_jobs=8)
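For the Optuna-tuned entries, the per-fold search loosely follows this shape; the search space below is illustrative, not the one in optuna_models.py:

import optuna
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)  # stand-in binary dataset

def objective(trial):
    # Hypothetical search space; the repo's actual ranges may differ.
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 8),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
    }
    model = XGBClassifier(**params, random_state=0)
    # Inner 4-fold loop; "average_precision" is the binary PR AUC scorer
    # (the multiclass case uses weighted OvR average precision instead).
    return cross_val_score(model, X, y, cv=4, scoring="average_precision").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)  # 50 trials per outer fold
print(study.best_params)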

To reproduce all results sequentially:

uv run python run_all.py
# AutoGluon must use the venv Python directly (Ray incompatibility with uv run):
.venv/bin/python benchmark_autogluon.py

Categorical features

compare_baseline_models.py uses one-hot encoding. optuna_models.py handles categories properly:

  • CatBoost — native cat_features support
  • RF, XGBoost, LightGBM — ordinal encoding via category_encoders
  • SVC, LogReg, SGD — encoding strategy is an Optuna hyperparameter (ordinal, target, James–Stein, m-estimate, CatBoost encoder); see the sketch below

AutoGluon and MLJAR handle categorical features internally.
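For that last group, the encoder choice can be wired into the Optuna search roughly as follows. This is a sketch using category_encoders class names; build_pipeline is a hypothetical helper, and the actual pipeline in optuna_models.py may differ:

import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Encoders offered to Optuna as a categorical hyperparameter.
ENCODERS = {
    "ordinal": ce.OrdinalEncoder,
    "target": ce.TargetEncoder,
    "james_stein": ce.JamesSteinEncoder,
    "m_estimate": ce.MEstimateEncoder,
    "catboost": ce.CatBoostEncoder,
}

def build_pipeline(trial, cat_cols):
    # Optuna picks the encoder like any other hyperparameter.
    name = trial.suggest_categorical("encoder", list(ENCODERS))
    return make_pipeline(ENCODERS[name](cols=cat_cols), LogisticRegression(max_iter=1000))

An objective function would then call build_pipeline(trial, cat_cols) and cross-validate the resulting pipeline, exactly as with the numeric hyperparameters.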

Results

Model performance relative to Random Forest baseline

Mean PR AUC delta across 146 datasets (positive = better than RF).


Time–accuracy tradeoff

Wall-clock training time per dataset vs mean PR AUC gain over RF.


Rank distribution across datasets

How often each model achieves each rank (1 = best on a given dataset).

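A toy sketch of how the summary statistics behind these figures can be derived from a per-dataset score table (the numbers are made up, not benchmark results):

import pandas as pd

# Hypothetical PR AUC table: rows = datasets, columns = models.
scores = pd.DataFrame(
    {"RF": [0.81, 0.92, 0.75], "XGBoost": [0.84, 0.91, 0.79], "AutoGluon": [0.85, 0.93, 0.78]},
    index=["ds1", "ds2", "ds3"],
)

# Mean PR AUC delta vs the RF baseline (first figure).
print(scores.sub(scores["RF"], axis=0).mean())

# Per-dataset ranks, 1 = best; ties share the average rank (third figure).
ranks = scores.rank(axis=1, ascending=False)
print(ranks.apply(pd.Series.value_counts).fillna(0))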

Observations

  • Non-linear models outperform linear ones even on datasets with fewer than 100 samples.
  • Optuna-tuned XGBoost and CatBoost are the strongest individual models, competitive with AutoML frameworks.
  • AutoGluon and MLJAR show higher median PR AUC, but require a substantial wall-clock budget (1000s/fold used here).
  • Proper categorical feature handling gives a meaningful boost on datasets with string features (~30% of the benchmark).
  • LightGBM with linear trees (the linear_tree=True option) is a useful addition to the Optuna model set; a sketch follows this list.
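A minimal sketch of the linear-trees option via LightGBM's scikit-learn wrapper; the hyperparameters shown are illustrative:

from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # stand-in dataset

# linear_tree=True fits a linear model in each leaf instead of a
# constant prediction, which can help on small, partly linear data.
model = LGBMClassifier(linear_tree=True, n_estimators=200, random_state=0)
print(cross_val_score(model, X, y, cv=4).mean())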

Data

A subset of UCI++: "a huge collection of preprocessed datasets for supervised classification problems in ARFF format" (DOI).

146 datasets, up to 10 000 rows each (larger datasets are subsampled). Note that UCI++ reuses the same datasets in different configurations and some categorical features are not clearly labeled.
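The subsampling step implied by the 10 000-row cap could look like this (a sketch assuming stratified subsampling; the repo's exact procedure may differ):

from sklearn.model_selection import train_test_split

MAX_ROWS = 10_000

def subsample(X, y, seed=0):
    # Cap a dataset at MAX_ROWS while keeping class proportions.
    if len(y) <= MAX_ROWS:
        return X, y
    X_sub, _, y_sub, _ = train_test_split(
        X, y, train_size=MAX_ROWS, stratify=y, random_state=seed
    )
    return X_sub, y_sub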
