Appendices, Supplements, and Code for IJCAI 2025 submission "Identifying Drivers of Predictive Aleatoric Uncertainty".
This README details how to reproduce the results from our manuscript "Identifying Drivers of Predictive Aleatoric Uncertainty". It is divided into the sections shown in the Table of Contents.
- Additional examples and verifications
- Licenses
We initially separated noise-influencing and mean-influencing features in the synthetic data to enable unambiguous evaluations. In reality, however, we expect noise features to overlap with features influencing the mean. We therefore add five mixed features to the setup described in Section 2.4 of the manuscript. These mixed features are generated with the same polynomial model as described, but they contribute to both the mean and the heteroscedastic noise. The dataset thus comprises 80 features: 70 that exclusively influence the target's mean, five that exclusively affect the noise variance of the conditional target distribution, and five mixed features that affect both aspects.
Figure 0: Example of a synthetic dataset with one feature that influences mean and uncertainty.
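For orientation, the snippet below sketches how such a dataset with mean-only, noise-only, and mixed features can be generated. The polynomial `poly`, its coefficients, and the scaling are illustrative stand-ins; the exact generative model is the one described in Section 2.4 of the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_mean, n_noise, n_mixed = 40000, 70, 5, 5
X = rng.uniform(-1, 1, size=(n, n_mean + n_noise + n_mixed))
X_mean, X_noise, X_mixed = np.split(X, [n_mean, n_mean + n_noise], axis=1)

def poly(Z):
    # Stand-in for the polynomial model of the manuscript.
    return Z.sum(axis=1) + (Z ** 2).sum(axis=1)

alpha = 2.0  # corresponds to --noise_scaler
mean = poly(X_mean) + poly(X_mixed)                    # mixed features shift the mean ...
sigma = alpha * np.abs(poly(X_noise) + poly(X_mixed))  # ... and scale the heteroscedastic noise
y = mean + rng.normal(size=n) * sigma
```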
We provide a command-line interface for running the synthetic uncertainty explanation experiments. The script trains a heteroscedastic Gaussian neural network on the training data. We subsequently explain the variance estimates on a test set using variance feature attribution (VFA flavors), InfoSHAP, and Counterfactual Latent Uncertainty Explanations (CLUE). By default, the explainers run on 200 test instances with the highest predicted uncertainty ("highU"), the highest predicted confidence ("lowU"), and random instances ("randomU"). For CLUE, we adapted code from https://github.com/cambridge-mlg/CLUE. We use Python 3.11.5.
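For reference, the sketch below shows the core of such a heteroscedastic Gaussian network in PyTorch: one head predicts the conditional mean, a second head the conditional variance, and both are trained jointly with the Gaussian NLL. Layer sizes and the softplus variance activation are assumptions for illustration, not the exact model in the repository.

```python
import torch
import torch.nn as nn

class HeteroscedasticMLP(nn.Module):
    def __init__(self, d_in, d_hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                      nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.mean_head = nn.Linear(d_hidden, 1)
        self.var_head = nn.Linear(d_hidden, 1)

    def forward(self, x):
        h = self.backbone(x)
        # Softplus keeps the predicted variance strictly positive.
        return self.mean_head(h), nn.functional.softplus(self.var_head(h)) + 1e-6

model = HeteroscedasticMLP(d_in=80)
criterion = nn.GaussianNLLLoss()  # vanilla Gaussian NLL; --beta_gaussian swaps in a beta-weighted variant
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(32, 80), torch.randn(32, 1)  # dummy batch
mean, var = model(x)
criterion(mean, y, var).backward()
optimizer.step()
```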
To run the global_synthetic_benchmark uncertainty explanation experiment, follow these steps (conda is required to install the environment):

1. Open a terminal or command prompt.
2. Install the environment using `conda env create -f environment.yml`.
3. Navigate to the directory `global_synthetic_benchmark`.
4. Execute the script:

   python synthetic_uc_epl_experiment.py [options]
Options:

- `--n_instances_to_explain` (default: 512): The number of instances to explain in the experiment.
- `--noise_scaler` (default: 2.0): The noise scaler used in the experiment (alpha in the paper).
- `--n` (default: 40000): The number of training instances (20% of these are used for early stopping).
- `--n_test` (default: 1500): The number of test instances.
- `--remake_data`: An optional flag. If specified, the data is resampled; otherwise, the script looks for an existing dataset with the specified parameters.
- `--beta_gaussian`: An optional flag. If specified, the beta-Gaussian loss is used instead of the vanilla Gaussian NLL.
An example call:

python synthetic_uc_epl_experiment.py --n_instances_to_explain 256 --explainer_repeats 1 --noise_scaler 3.0 --n 30000 --n_test 2048 --remake_data

The script creates (directed and undirected) feature importances as results. The feature importances can be analyzed using
a) shap_summaries.ipynb for Figure 2 of the paper.
b) plotting/plotting.R for Figure 3 of the paper.
Go to the folder `metrics_benchmark`.

- Global perturbation metrics (will run for all methods and datasets):

  python run_perturbation_exp_global.py

- Local accuracy metrics (will run for all methods and the specified dataset); a schematic example of such a localization score follows after this list:

  python run_localization_exp.py --dataset="<dataset>"

  Dataset is one of synthetic, red_wine, ailerons, and lsat; synthetic is the default.

- Local Lipschitz continuity metrics (will run for all methods and the specified dataset):

  python run_robustness_lipschitz_exp.py --dataset="<dataset>"

  Dataset is one of synthetic, red_wine, ailerons, and lsat; synthetic is the default.
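As a rough illustration of what a localization score can look like on the synthetic data, where the ground-truth noise features are known, the sketch below measures how much of the top-ranked attribution mass lands on those features. This formulation is hypothetical; `run_localization_exp.py` implements the metrics actually reported.

```python
import numpy as np

def noise_feature_localization(attributions, noise_idx, k=5):
    """Hypothetical score: fraction of the k features with the largest
    absolute attribution that are ground-truth noise features."""
    top_k = np.argsort(-np.abs(attributions), axis=1)[:, :k]
    return np.isin(top_k, noise_idx).mean()

# Example: 200 explained instances, 80 features, noise features at indices 70..74.
attr = np.random.randn(200, 80)
print(noise_feature_localization(attr, noise_idx=np.arange(70, 75)))
```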
We extend our analysis of robustness to the synthetic dataset with mixed mean and noise features. Figure 2 shows that the results for this dataset reflect the observations for the other datasets. The Shapley-value-based methods VFA-SHAP and InfoSHAP, as well as VFA-IG, appear more robust than CLUE and VFA-LRP. The methods' individual rankings differ between datasets, suggesting that the choice of the most robust method depends on the dataset.
Figure 2: Local Lipschitz continuity estimates for 200 test set instances for all methods and datasets. Lower values indicate higher robustness. VFA-SHAP, InfoSHAP, and VFA-IG are the most robust.
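For reference, the local Lipschitz continuity of an explainer around an instance can be estimated by Monte-Carlo sampling in a small ball, as sketched below. The sampling scheme and names are illustrative; `run_robustness_lipschitz_exp.py` contains the implementation used for Figure 2.

```python
import numpy as np

def local_lipschitz(explain_fn, x, eps=0.1, n_samples=50, seed=0):
    """Estimate max ||e(x) - e(x')|| / ||x - x'|| over perturbed points x'
    in an eps-ball around x. Lower values indicate a more robust explainer."""
    rng = np.random.default_rng(seed)
    e_x = explain_fn(x)
    ratios = []
    for _ in range(n_samples):
        x_p = x + rng.uniform(-eps, eps, size=x.shape)
        ratios.append(np.linalg.norm(e_x - explain_fn(x_p)) / np.linalg.norm(x - x_p))
    return max(ratios)
```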
Go to the folder `mnist_plus_u`.

1. To create the dataset, follow the notebook `mnist_plus_create_data.ipynb`. Depending on the machine, this can take a moment. NOTE: This repository does not contain the full MNIST+U dataset. However, the generation of the dataset is seeded, and the same dataset should be created every time the notebook is run.
2. To train the model used in the paper, either execute `mnist_plus_train_model_two_paths.py` or use the notebook `mnist_plus_train_model_two_paths.ipynb`. We provide the checkpoint we used in `/checkpoints_two_models`. There is also a version of the network in which the convolutional layers are shared between the mean and variance outputs. This model can be trained with `mnist_plus_train_model.ipynb`, and a checkpoint is available in `/checkpoints`.
3. To execute the evaluation for any method, run the corresponding script following the naming scheme `mnist_plus_<method name>_double_path.py`, for example:

   python mnist_plus_lrp_zennit_double_path.py

Below, we show an example of what the images in the MNIST+U dataset look like. For each sample, we also save a mean and a variance mask used for evaluation.
Figure 3: MNIST+U samples and the ground truth masks we use to evaluate the localization of the explainers. We dilate the original MNIST digit pixels by two pixels to account for explanations that focus on the edges of the digits.
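A minimal sketch of this mask construction, assuming binary digit masks and using scipy's binary dilation with two iterations:

```python
import numpy as np
from scipy.ndimage import binary_dilation

digit = np.zeros((28, 28), dtype=bool)
digit[10:18, 12:16] = True                   # stand-in for an MNIST digit's pixels
mask = binary_dilation(digit, iterations=2)  # grow the mask by two pixels
```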
We provide a basic evaluation of the MNIST+U dataset in `mnist_plus_model_eval_double_path.ipynb`. We verify that the model's predicted uncertainty correlates with the ground truth uncertainty embedded in the images.
Figure 4: We evaluate the ability of the model to estimate the uncertainty. As expected, higher digits correspond to higher predicted variance.
High-quality uncertainty estimates are essential for analyzing explanations of uncertainties. To evaluate the quality of the uncertainty estimates, we first examine their calibration using the Uncertainty Toolbox. To assess calibration, $\alpha$-prediction intervals are constructed, which should cover the observed values with a predicted probability of $\alpha$.
Figure 5: Calibration plot for models trained on the synthetic and age detection datasets: predicted probability vs. observed proportions of instances covered by various probability intervals. The orange-colored angle bisector marks a perfectly calibrated hypothetical model.
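The manuscript relies on the Uncertainty Toolbox for this plot; the sketch below shows the underlying computation for Gaussian predictions, with variable names chosen for illustration.

```python
import numpy as np
from scipy.stats import norm

def observed_coverage(y, mu, sigma, alphas):
    """For each alpha, the fraction of observations falling inside the central
    alpha-prediction interval of N(mu, sigma^2). A calibrated model yields
    observed proportions close to alpha."""
    coverage = []
    for a in alphas:
        z = norm.ppf(0.5 + a / 2.0)  # interval half-width in std units
        coverage.append(np.mean(np.abs(y - mu) <= z * sigma))
    return np.array(coverage)

alphas = np.linspace(0.05, 0.95, 19)
# coverage = observed_coverage(y_test, pred_mean, pred_std, alphas)
```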
For meaningful uncertainty estimates, we also expect that we can reduce the prediction error by restricting the prediction to low-uncertainty instances. Therefore, as depicted in Figure 6, we determine the reduction in root mean squared error (RMSE) on the remaining test set when the instances with the highest uncertainty are iteratively removed. We compare this to removal based on the distance of the prediction to the mean prediction as a baseline. For the synthetic datasets, we further depict the reduction in RMSE based on the highest ground-truth noise standard deviation from the data-generating process as the best attainable reduction. We find for the models trained on the synthetic data (Figure 6, A), the mixed synthetic data (Figure 6, B), and the age detection task (Figure 6, C) that uncertainty-based filtering is effective. This indicates that the predicted uncertainty is a meaningful indicator of the expected prediction error.
Figure 6: Root mean squared error of the data for various uncertainty quantiles of the test set. From left to right, data points are removed based on predicted uncertainty (green), distance from the mean prediction (blue), or, for synthetic data, the ground truth noise standard deviation of the data-generating process (purple). For models trained on the synthetic and age detection datasets, lower uncertainty leads to reduced RMSE compared to the overall RMSE of the test set. This indicates that the uncertainty is a meaningful indicator of the expected prediction error.
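The sketch below illustrates the uncertainty-based filtering underlying Figure 6; the quantile grid and names are illustrative.

```python
import numpy as np

def rmse_retention_curve(y, mu, sigma, quantiles=np.linspace(0.5, 1.0, 11)):
    """RMSE on the retained fraction of the test set when the most uncertain
    instances are removed first. Meaningful uncertainty estimates yield an
    RMSE that decreases as the retained fraction shrinks."""
    order = np.argsort(sigma)  # most confident first
    sq_err = (y - mu)[order] ** 2
    return [np.sqrt(sq_err[: max(1, int(q * len(y)))].mean()) for q in quantiles]
```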
Age detection finds application in various areas, from security to retail. We apply MiVOLO (Kuprashevich and Tolstykh, 2023), a state-of-the-art vision transformer that achieves best-in-class performance on multiple benchmarks. It was designed to tackle age and gender detection simultaneously to leverage synergies. For simplicity, we use a version of the model that only uses face images as input and omits additional body images.
We use a pre-trained version of MiVOLO and, following the procedure introduced in Section 2.1 of the manuscript, extend the parameter matrices of the MiVOLO head and auxiliary head and their respective bias terms. We initialize the new weights using a Gaussian distribution following Glorot and Bengio [2010] and set the bias to zero. We fine-tune this model on the IMDB-clean dataset [Lin et al., 2022], using the annotations and pre-processing by Kuprashevich and Tolstykh [2023]. We use the GNLL and the Adam optimizer with a learning rate of $10^{-5}$ (see the training command below).
If not specified, all paths refer to files in the age_detection directory.
If you have not done so already, follow these steps to set up the conda environment:

1. Open a terminal or command prompt.
2. Install the environment using `conda env create -f environment.yml`.
Follow the instructions in the IMDB-clean GitHub repository to download the images. The downloaded images (in the numbered directories) need to be stored in a directory called `images`:

mivolo/data/dataset/images

Additionally, the MiVOLO-specific annotations need to be downloaded. The CSV files for the train, validation, and test sets need to be stored at:

mivolo/data/dataset/annotations

We provide a separate example set, `imdb_example_new.csv`, that contains the example images used in our manuscript.

Refer to the MiVOLO GitHub repository to download an IMDB-clean checkpoint of choice and the YOLO checkpoint also provided by the authors of MiVOLO, and store them at:

models/

Note: We provide the variance fine-tuned checkpoint used in the manuscript: `models/variance_feature_attribution_mivolo_checkpoint.pth.tar`.
Note: To use a downloaded checkpoint, you must extend it with a variance output!
The variance_output_injection.ipynb notebook shows how to inject an additional weight vector into the head and auxiliary head and extend the bias.
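A minimal sketch of such an injection, assuming the head is a plain nn.Linear layer (the notebook handles the actual MiVOLO head and auxiliary head):

```python
import torch
import torch.nn as nn

def add_variance_output(head: nn.Linear) -> nn.Linear:
    """Append one output unit for the variance: Glorot-initialized weights
    (Glorot and Bengio [2010]) and a zero bias, keeping the trained weights."""
    new_head = nn.Linear(head.in_features, head.out_features + 1)
    var_row = torch.empty(1, head.in_features)
    nn.init.xavier_normal_(var_row)
    with torch.no_grad():
        new_head.weight.copy_(torch.cat([head.weight, var_row], dim=0))
        new_head.bias.copy_(torch.cat([head.bias, torch.zeros(1)]))
    return new_head
```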
To fine-tune the extended model, use train_mivolo.py. To run the training with the parameters from the manuscript, execute:

python train_mivolo.py --name="example_run" --learning-rate=0.00001 --weight-decay=0.01 --epochs=50

To run the trained model on the test set, refer to test_mivolo.ipynb. The default checkpoint selected in the notebook is the one we trained and used to produce the results in the manuscript.
To reproduce Figure 8 from the Appendix, run the model on the test set and then use generate_data_for_uncertainty_evaluation.ipynb to generate the files used for the plot in plotting/plotting.R.
There are two options to create variance explanations:
To reproduce the explanations for the variance from our age detection example, refer to the example_explanations.ipynb notebook. Similar to the testing notebook, it has the checkpoint used to create the figures from the manuscript pre-selected.
If you want to create examples for the whole test set, use create_explanations.py. The configuration used in the manuscript is:
python create_explanations.py --checkpoint="variance_feature_attribution_mivolo_checkpoint" --method="hiresCAM"

The age_detection/calibration folder contains a notebook to recalibrate the uncertainties using std-scaling.
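As we understand std-scaling, a single scalar s is fitted on held-out data to minimize the Gaussian NLL of N(mu, (s * sigma)^2), which admits the closed form sketched below; the notebook contains the procedure actually used.

```python
import numpy as np

def std_scaling_factor(y, mu, sigma):
    """Scalar s minimizing the Gaussian NLL with rescaled std s * sigma;
    the recalibrated standard deviation is s * sigma."""
    return np.sqrt(np.mean(((y - mu) / sigma) ** 2))
```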
We showcase the application of VFA to age detection using MiVOLO and the IMDB-clean dataset. Applying VFA with HiResCAM reveals reasonable potential explanations for the predictive uncertainty (see Figure 7). The explanations mainly focus on areas around the eyes, mouth, nose, and forehead. These areas are highlighted especially strongly when the person in the image shows emotions that lead to distortions of these facial areas.
Figure 7: Input images and uncertainty explanations in an age detection experiment using VFA with HiResCAM. Images are annotated with the ground truth and predicted mean and standard deviation.
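For orientation, VFA with HiResCAM attributes the variance output to the feature maps of a chosen layer via an elementwise product of gradients and activations, summed over channels. The sketch below is schematic: `variance_from_acts` (the part of the network mapping the feature maps to the predicted variance) and the layer choice are assumptions; create_explanations.py contains the actual implementation.

```python
import torch

def hirescam_variance_map(activations, variance_from_acts):
    """Schematic VFA-HiResCAM: gradient of the predicted variance w.r.t. the
    feature maps (B, C, H, W), multiplied elementwise with the maps and
    summed over channels to obtain a (B, H, W) saliency map."""
    acts = activations.detach().requires_grad_(True)
    var = variance_from_acts(acts).sum()  # scalar for autograd
    grads = torch.autograd.grad(var, acts)[0]
    return torch.relu((grads * acts).sum(dim=1))
```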
We build on code from several other projects. In the following, we list the licenses that apply to the code we have used from other work.
We use and adapt various code from MiVOLO, which is under this license. This involves code in the following files:

- `mivolo/predictor.py` and `mivolo/predictor_orig.py`
- `mivolo/structures.py`
- All code in `mivolo/data/`
- `mivolo/model/create_timm_model.py`
- `mivolo/model/cross_bottleneck_attn.py`
- `mivolo/model/mi_volo.py` and `mivolo/model/mi_volo_orig.py`
- `mivolo/model/mivolo_model.py`
- `mivolo/model/yolo_detector.py`
We use and adapt the code in mivolo/model/volo.py from huggingface PyTorch Image Models, which is under the Apache 2.0 license.
We use and adapt the code in mivolo/model/explanation_generator.py from Transformer-Explanations, which is under this MIT Licence.
We use and adapt code in metrics_benchmark/lipschitz_metric.py from Robustness-of-Interpretability-Methods, which is under this MIT Licence.
We adapt code in synthetic_experiments/CLUE and the explain_clue() function in /synthetic_experiments/synthetic_experiment_utils.py from CLUE, which is under this MIT Licence.






