Abstract

Efficiently quantifying predictive uncertainty in medical images remains a challenge. While Bayesian neural networks (BNNs) offer reliable predictive uncertainty, they require substantial computational resources to train. Although Bayesian approximations such as ensembles have shown promise, they still suffer from high training costs. Existing approaches to reducing the computational burden primarily focus on lowering the cost of BNN inference, with limited effort devoted to improving training efficiency and minimizing parameter complexity. This study introduces a training procedure for a sparse (partial) Bayesian network. Our method selectively assigns a subset of parameters as Bayesian by assessing their deterministic saliency through gradient sensitivity analysis. The resulting network combines deterministic and Bayesian parameters, exploiting the advantages of both representations to achieve high task-specific performance and minimize predictive uncertainty. Demonstrated on multi-label ChestMNIST for classification and on ISIC and LIDC-IDRI for segmentation, our approach achieves competitive performance and predictive uncertainty estimation while reducing the number of Bayesian parameters by over 95%, significantly lowering computational expense compared to fully Bayesian and ensemble methods.
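
The selection step described above can be summarized in a few lines. The following is a minimal sketch, assuming a PyTorch-style workflow; the function name `saliency_topk_mask` and the saliency score |grad * weight| are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of saliency-guided selection of Bayesian parameters,
# assuming a PyTorch-style workflow. The score |grad * weight| is one
# common first-order choice and is an assumption here, not necessarily
# the exact score used in the paper.
import torch

def saliency_topk_mask(model, loss_fn, data_loader, keep_fraction=0.05):
    """Accumulate gradient saliency over the dataset and mark the
    top-k most salient parameters as Bayesian."""
    saliency = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                saliency[n] += (p.grad * p.detach()).abs()
    flat = torch.cat([s.flatten() for s in saliency.values()])
    k = max(1, int(keep_fraction * flat.numel()))
    threshold = flat.topk(k).values.min()
    # True -> model this parameter as Bayesian (learn mu, sigma);
    # False -> keep it deterministic at its pretrained value.
    return {n: s >= threshold for n, s in saliency.items()}
```

With `keep_fraction` in the range 0.01-0.05, this corresponds to the ">95% deterministic" regime the abstract refers to.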

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0608_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0608_supp.pdf

Link to the Code Repository

https://github.com/zabboud/SparseBayesianNetwork

Link to the Dataset(s)

https://challenge.isic-archive.com/data/
https://medmnist.com/
https://wiki.cancerimagingarchive.net/pages/viewpage.action?pageId=1966254



BibTex

@InProceedings{Abb_Sparse_MICCAI2024,
        author = { Abboud, Zeinab and Lombaert, Herve and Kadoury, Samuel},
        title = { { Sparse Bayesian Networks: Efficient Uncertainty Quantification in Medical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This research addresses the computational demands of training Bayesian neural networks by limiting the number of Bayesian parameters, in the context of medical imaging applications. The reduction is achieved by analysing the sparsity of Bayesian parameters through gradient sensitivity analysis and by introducing a hybrid training approach that blends deterministic and Bayesian parameters within any network structure. The efficacy of this training method is validated on medical image classification and segmentation, where it delivers competitive performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is clear and easy to follow; related work is well explained and introduced.
    2. Experiments are reasonably conducted and the results are promising. A range of metrics and datasets is evaluated.
    3. The training method effectively reduces the majority of Bayesian parameters while achieving comparable performance on the given datasets and experiments.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The masking and sensitivity-based selection steps are not well explained in the Initialization paragraph.
    2. Eq. (1) is missing a right bracket in the first term.
    3. Since the KL divergence has no upper bound, the coefficient beta may influence the overall loss more strongly than expected; beta is not well explained or examined (see the loss sketch after this list).
    4. The diversity of experiments and model types is rather limited, given the authors' claim that the method suits 'any network architecture'.
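
    For context on points 2-3, one standard form of the beta-weighted variational objective in mean-field VI, written here from the general formulation rather than copied from the paper's Eq. (1), is:

```latex
\mathcal{L}(\phi) = -\,\mathbb{E}_{q_\phi(\theta)}\big[\log p(\mathcal{D}\mid\theta)\big]
                  + \beta\,\mathrm{KL}\big(q_\phi(\theta)\,\|\,p(\theta)\big)
```

    Because the KL term has no upper bound and grows with the number of Bayesian parameters, the value of beta directly controls how strongly the prior dominates the data-fit term, which is what point 3 asks the authors to examine.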
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It might be good to examine a joint Gaussian distribution over the weights rather than a number of univariate Gaussians, one per weight, since NN weights are correlated.

    2. It would be good to explain more about beta in the loss function and to examine its influence.

    3. Almost no information is provided about the variational inference scheme adopted; this should significantly affect performance, so a bit more explanation would be welcome.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work should be of interest to a group of readers at this conference, and the results are promising. However, the details of the method are not very well explained, and some key limitations still raise concerns about the generalisability and effectiveness of the method across wider network architectures.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present a sparse Bayesian neural network, introducing 'pruning' to variational inference BNNs. The results show that a partially variational BNN can achieve good uncertainty estimation, while a fully variational one can harm accuracy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors propose a novel variational inference (VI) BNN framework with weight-randomness pruning, guided by weight saliency.
    • The (training) computational load is reduced compared to a full VI-BNN and a deep ensemble.
    • A deep investigation of the influence of 'Bayesian-ness', i.e., the percentage of random weights, on performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors motivated their work by savings in computational load. To me, full 'Bayesian inference' consists of two stages: 1) posterior estimation and sampling (training) and 2) marginalization over weights drawn from the posterior (testing on new data; see the sketch after this list). It seems that the proposed method can save computational load in the training stage only, while the test time remains almost the same as for other methods. The authors should discuss this explicitly.

    • Table 1: it is unclear how these FLOPs are computed. Are they training FLOPs, test FLOPs, or a combination of both? Does the cost of random sampling in weight space also contribute to the FLOPs?

    • MC-Dropout and other quick ensemble methods like (arXiv:2212.06278) take as much time as training a deterministic model, so they can save more time than the proposed method, at least in terms of training FLOPs. The authors should discuss these quick ensemble methods.

    • The distribution of Bayesian weights is not included. It would be interesting to see where the selected top-k weights are located: do they concentrate in shallow layers or deep layers, or are they rather uniformly distributed across layers?

    Minor issues:

    • I would not call Fig. 1(b) 'Bayesian' but rather 'fully variational Bayesian', because the paper focuses only on variational inference variants with a Gaussian approximation, which form only a subset of Bayesian NNs.
    • Eq. (1): the ELBO should contain log p(D | theta) instead of log p(x | theta).
    • Table 1, ChestMNIST accuracy: the ensemble is slightly better than partial 1%; the bold text is in the wrong place.
    • The authors stated, “Deep ensembles are especially impractical with models with large parameter counts, which is common in medical imaging.” Deep ensembles are only time-consuming in the training phase, which can be addressed with parallel computing; for example, nnU-Net (arXiv:1809.10486) incorporates a 5-fold deep ensemble. I would be careful with this statement.
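
    As a concrete illustration of the test-time marginalization mentioned in the first weakness, a minimal sketch follows; `mc_predict` is a hypothetical helper, and it assumes the model resamples its Bayesian weights on each forward pass.

```python
# Test-time marginalization: average predictions over weight samples.
# Cost scales with n_samples regardless of how few weights are stochastic,
# which is why sparsifying the posterior alone does not speed up inference.
import torch

@torch.no_grad()
def mc_predict(model, x, n_samples=10):
    """Monte Carlo predictive mean: p(y|x,D) ~ (1/S) sum_s p(y|x,theta_s)."""
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    return probs.mean(dim=0)
```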
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I would recommend reporting the computational cost separately for the training phase (posterior estimation and sampling) and the test phase (running a few forward passes given weight samples), incorporating a discussion of other quick ensemble methods. Additionally, a detailed discussion of the distribution of the selected top-k weights would make this paper stronger.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a partial VI BNN that achieves competitive performance with only 1% of the weights being random. The study of the level of randomness is interesting and shows that a full VI BNN can harm accuracy at a higher computational cost. However, the saving in computational load, which is the motivation of this work, should be discussed in more detail, decomposing training and testing and taking other quick ensemble methods into consideration.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The article presents a strategy for subnetwork selection in Bayesian Neural Networks and applies mean-field variational inference. The method is evaluated in image classification and segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well structured and clearly written, and the figures are expressive.
    • The problem of subnetwork selection in BNNs has not been studied under variational inference, nor in segmentation ([1] state in their Section 5 that they do not use variational inference).
    • [2] finds evidence that the strategy of selecting a subnetwork is promising, and this paper follows up on it with a new methodology.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Recent related work on subnetwork selection in BNNs is missing: consider including [1] and [2].
    • Theoretical arguments and details for the subnetwork selection procedure are missing (see questions).
    • There is no distinction between aleatoric and epistemic uncertainty in the evaluation of the method. Check [3] for an overview of how to evaluate when you have access to both the parameter distribution of the model and the data distribution.
    • The recent methods of subnetwork Laplace [1] and HMC/SWAG [2] are missing as benchmarks; including them would make the paper stronger.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Questions

    • Are the parameters independently normally distributed?
    • Is the prior an isotropic Gaussian?
    • Usually the loss is computed per batch, and the gradient is then an approximation of the gradient over the entire dataset. Which batch do you choose for the sensitivity analysis? Do you use all batches and average? What is the intuition behind allowing parameters to vary if they have a high gradient? It would be interesting to see the reasoning in the context of Section 5 of [1].
    • Evaluation of segmentation uncertainty: why is lower entropy better? The chosen datasets contain multiple annotations per image, and therefore there is a degree of aleatoric uncertainty that should not be reduced.
    • In [1,2,4] there is strong evidence that being Bayesian only in the last layer performs best. Do you find evidence for this, i.e. does your selection strategy also prefer parameters in the last layer?
    • Describe the connection between the VI with an isotropic Gaussian in your method and the diagonal Hessian Laplace approximations on the subnetwork.

    Minor

    • Equation (1) is missing a bracket.

    Important related work:

    [1] Daxberger, Erik, et al. “Bayesian deep learning via subnetwork inference.” International Conference on Machine Learning. PMLR, 2021.

    [2] Sharma, Mrinank, et al. “Do Bayesian Neural Networks Need To Be Fully Stochastic?.” International Conference on Artificial Intelligence and Statistics. PMLR, 2023.

    [3] Kahl, Kim-Celine, et al. “ValUES: A Framework for Systematic Validation of Uncertainty Estimation in Semantic Segmentation.” arXiv preprint arXiv:2401.08501 (2024).

    [4] Daxberger, Erik, et al. “Laplace redux: effortless Bayesian deep learning.” Advances in Neural Information Processing Systems 34 (2021): 20089-20103.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper fills the gap of investigating mean-field VI on the subnetwork of a BNN. However, the evaluation is missing important benchmarks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Response to Reviewer 2: We thank reviewer #2 for their time and recommendations. The reviewer points out the limited diversity of experiments (restricted to multi-label classification and segmentation). Due to limited space, we could only show these results; however, based on our experimentation, the method applies to models that use fully connected (linear) layers or convolutional layers. Extension to recurrent layers is of interest and would be a future avenue for exploration. Comments: 1) It would be interesting to have a comparative analysis with a joint Gaussian distribution. In our work, we approached the problem from a standard variational inference perspective, where each parameter is represented by its own distribution (see the sketch after this paragraph). 2) We conducted experiments on the impact of the beta hyperparameter on the loss function and on how to balance learning the weight distributions against refining the NLL loss in training step 2; we can include this additional analysis in the final paper.
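
As a concrete illustration of the per-parameter representation mentioned in comment 1, here is a minimal sketch of a single mean-field Gaussian weight using the reparameterization trick; the class name and the (mu, rho) parameterization are common conventions, not the authors' code.

```python
# One mean-field Gaussian parameter: each Bayesian weight keeps its own
# (mu, rho) and is resampled on every forward pass via reparameterization.
import torch

class GaussianWeight(torch.nn.Module):
    def __init__(self, shape):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(shape))
        self.rho = torch.nn.Parameter(torch.full(shape, -5.0))

    def sample(self):
        sigma = torch.nn.functional.softplus(self.rho)  # ensures sigma > 0
        return self.mu + sigma * torch.randn_like(sigma)  # w = mu + sigma*eps
```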

Response to Reviewer 3: We thank reviewer #3 for their time and recommendations.

1) Regarding computational savings, the current method only improves costs related to training; we can clarify this in the final paper to avoid confusion. Further savings can be made from an inference perspective, but that is work to be completed in the near future.

2) Due to space constraints, we could not include our relative-FLOPs computation; we will add it to the supplementary information in the final paper. The FLOPs relate to training, not inference; the number of samples contributes to the total FLOPs and is considered along with the number of trainable parameters in the relative FLOPs count.

3) MC-dropout exhibits overconfidence compared to ensembles, for example, when tested on OOD data, which we briefly mentioned in the introduction.

4) The distribution of weights selected by the top-k first-order gradient method generally concentrates at the edges of the network (towards the input and output), consistent with previously published work.

Response to Reviewer 4: We thank reviewer #4 for their time and recommendations. 1) The highlighted related work on Laplace/HMC/SWAG will be added to the related-work section of the final paper.

2) On the theoretical backing: there is solid theoretical backing for our subnetwork selection criterion, based on a Taylor-series approximation of the effect of introducing small perturbations into the network parameters, g(theta + delta theta). We can include more details in the final paper, space permitting.
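
In our own notation (not quoted from the paper), the first-order argument reads: perturbing the parameters theta by a small delta theta changes the loss approximately as

```latex
\mathcal{L}(\theta + \delta\theta) \approx \mathcal{L}(\theta)
  + \nabla_\theta \mathcal{L}(\theta)^{\top} \delta\theta
```

so parameters with large gradient magnitude are those to which the loss is most sensitive under perturbation, making them natural candidates for a stochastic (Bayesian) treatment.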

3) We were more interested in the total predictive uncertainty, but we agree that a decomposition into the two uncertainties would be valuable, especially with respect to the multi-rater dataset.

Questions: 1) The parameters are modeled as univariate Gaussians. 2) The prior is set to an isotropic standard normal. 3) We accumulate the gradients over all batches after the pretraining step, to make the selection sample-independent.
4) Lower entropy is desired for in-distribution and correctly classified data. Higher entropy is desired for uncertain or incorrectly classified data (see the qualitative examples on the ISIC dataset) and for out-of-distribution data. For the multi-rater dataset, we would like the model to exhibit low uncertainty where the raters agree and higher uncertainty where the labels diverge, which is consistent with our approach (higher entropy with higher disagreement and low entropy in easy-to-classify regions). 5) Our selection criterion is consistent with this finding; the selection converges to the first/last layers at low percentages. 6) We need to investigate further before we can accurately describe the connection between our subnetwork selection and the diagonal Hessian Laplace method of Daxberger et al. (2021).
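
To make answer 4 concrete, predictive entropy can be computed from the Monte Carlo predictive mean as in the following sketch (illustrative, not the authors' code):

```python
# Predictive entropy H[p] = -sum_c p_c * log(p_c), computed per sample or
# per pixel from the MC predictive mean: low for confident in-distribution
# predictions, high where raters disagree or the input is out-of-distribution.
import torch

def predictive_entropy(mean_probs, eps=1e-12):
    return -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
```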




Meta-Review

Meta-review not available; early accepted paper.


