Abstract
Uncertainty estimation is critical for reliable decision-making in medical imaging. State-of-the-art uncertainty methods require significant computational overhead and complex modeling. In this work, we present and explore a simple, effective approach to incorporating Bayesian uncertainty into deterministic networks by replacing the first and/or last layers (the visible layers) with their variational Bayesian counterparts. This lightweight modification enables uncertainty quantification through mean-field variational estimation, making it practical for real-world medical applications. We evaluate the method on ISIC and LIDC-IDRI for the segmentation task and on DermaMNIST and ChestMNIST for the classification task, using post-hoc and jointly trained visible layers. We demonstrate that variational visible layers enable uncertainty-based failure detection for both in-distribution and near-out-of-distribution samples, preserving task performance while reducing the number of variational parameters required for Bayesian estimation. We provide an easy-to-implement solution for integrating uncertainty estimation into existing pipelines.
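To make the idea above concrete, below is a minimal, illustrative sketch (not the authors' released implementation; see the code repository linked further down) of a mean-field variational linear layer that could stand in for a deterministic first or last layer, assuming a factorized Gaussian posterior and the reparameterization trick.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFieldVariationalLinear(nn.Module):
    """Illustrative mean-field Gaussian linear layer (drop-in for a deterministic nn.Linear)."""

    def __init__(self, in_features, out_features, prior_std=1.0):
        super().__init__()
        # Variational parameters: a mean and a log standard deviation per weight/bias entry.
        self.w_mu = nn.Parameter(torch.zeros(out_features, in_features))
        self.w_log_sigma = nn.Parameter(torch.full((out_features, in_features), -5.0))
        self.b_mu = nn.Parameter(torch.zeros(out_features))
        self.b_log_sigma = nn.Parameter(torch.full((out_features,), -5.0))
        self.prior_std = prior_std

    def forward(self, x):
        # Reparameterization trick: sample w = mu + sigma * eps with eps ~ N(0, I).
        w = self.w_mu + torch.exp(self.w_log_sigma) * torch.randn_like(self.w_mu)
        b = self.b_mu + torch.exp(self.b_log_sigma) * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

    def kl_divergence(self):
        # KL between the factorized Gaussian posterior and a zero-mean isotropic Gaussian prior.
        def kl(mu, log_sigma):
            sigma_sq = torch.exp(2.0 * log_sigma)
            return torch.sum(
                math.log(self.prior_std) - log_sigma
                + (sigma_sq + mu ** 2) / (2.0 * self.prior_std ** 2)
                - 0.5
            )
        return kl(self.w_mu, self.w_log_sigma) + kl(self.b_mu, self.b_log_sigma)
```

For the post-hoc variants discussed in the reviews below, the means would presumably be initialized from the pretrained deterministic weights with small initial standard deviations, and multiple stochastic forward passes would then yield predictive uncertainty.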
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0787_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/zabboud/Variational-Visible-Layers
Link to the Dataset(s)
N/A
BibTex
@InProceedings{AbbZei_Variational_MICCAI2025,
author = { Abboud, Zeinab and Lombaert, Herve and Kadoury, Samuel},
title = { { Variational Visible Layers: A Practical Framework for Uncertainty Estimation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {682 -- 691}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors build on earlier work to present a method for capturing uncertainty in deterministic (pre-trained) methods. For this, the first and/or last layers of the network are replaced by variational layers and different training strategies are subsequently explored. For me, the paper would make more impact as a rigorous benchmark on existing work, rather than the introduction of a novel approach (Variational Visible Layers), as the contributions seem limited in that regard.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Solid introduction of the problem and theoretical embedding
- Experiments on several public data sets and to-be-released code
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Experimental setup is questionable and unclear at some points.
- Novelty seems limited; papers [1], [11] and [26] present very similar work
- Evaluation approach is prone to bias and also seems to lack important uncertainty quantification metrics and visualizations. See 10 for a detailed description of these issues.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
%% SUMMARY %% This work focuses on capturing uncertainty in deterministic networks in a straightforward way. Building on previous work, the authors replace the first and/or last layers of the network with Bayesian layers and evaluate different training strategies. Four data sets are used in the experiments, two for segmentation and two for classification, where the authors frame the uncertainty estimation performance as a failure classification problem. Results demonstrate that condition-specific training strategies can lead to decent uncertainty estimation performance, posing it as an attractive alternative to fully Bayesian Neural Networks (BNNs).
%% OVERALL COMMENTS %% While the paper is well written and covers an interesting topic, there are some major points that could still be improved to safeguard the reliability of the results. I would encourage the authors to reframe the paper as a solid benchmark in this area, rather than posing it as the introduction of a novel method. In that light, the experimental setup should be scrutinized to eliminate potential bias and the evaluation approach should be reconsidered and (at least) extended to include additional results (including metrics/figures). Please find below some more detailed remarks.
%% MAJOR POINTS %%
- EXPERIMENTAL SETUP: For me, the experimental setup is unclear. As I understand it, how well the approach captures uncertainty is posed as a binary classification problem, with a pre-defined performance threshold to decide what counts as a failure. This threshold is mentioned on page 6, but not elaborated on, while it has major implications for the computed failure detection scores. Additionally, the authors introduce synthetic perturbations to assess robustness to covariate shift, but this analysis is not included in the paper. If the authors have just included the perturbed samples in their data sets for all experiments, this also creates a dependency on the level of corruption and potential bias in the results: does the method capture failure vs non-failure, or corrupted vs not corrupted? I am honestly a bit confused, and I would encourage the authors to clarify the experimental setup, as a couple of different variables seem to be mixed (OOD vs ID, failure vs non-failure, etc.).
- EVALUATION APPROACH: Moreover, I would also expect more explicit methods to demonstrate uncertainty quantification, in the form of e.g. reliability diagrams or the Generalized Energy Distance. Finally, in a paper about uncertainty, I would also expect uncertainty intervals on the produced numbers :) Especially since a lot of them are relatively close: a difference of a few percentage points on uAUROC or Dice, for example, could easily fall within the 95% CI, let alone whether such a difference would have any actual clinical implications.
- NOVELTY: While the work is definitely interesting, it hinges quite a bit on earlier work, and it does not seem to considerably extend or improve on this. Especially papers [1], [11] and [26] seem to already contain most of the proposed contributions.
%% MINOR REMARKS %%
a. Typo in abstract: missing period and space "applicationsWe"
b. Typo at the bottom of p3: "initialized from random."
c. Typo at the bottom of p4: "networks trained from random."
d. On p5 the authors mention that they are the first to target segmentation failure detection under covariate distributional shifts. This might be the case for those specific data sets (ISIC and LIDC-IDRI) but not for medical image analysis in general. For example, see https://doi.org/10.1016/j.media.2024.103157, which evaluates both classification and segmentation performance drops under such shifts.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is interesting and relevant but lacks novelty. This in itself is no reason to reject, but while it is framed as the introduction of a novel technique (Variational Visible Layers), in my humble opinion, it presents the results of a benchmark study. Judging it as such, it lacks rigor and clarity, hampering the reliability of the conclusions drawn from the results.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While I still have some reservations with the paper, the authors did adequately address most of my concerns. Given this additional information, I think the paper meets the standard for MICCAI and I thank the authors for their clear answers.
Review #2
- Please describe the contribution of the paper
The authors propose a variational Bayesian approach to uncertainty quantification in deep neural networks whereby only the first and last layers are made stochastic (with a mean field approximation). Several training strategies are explored: post-hoc reparametrization (frozen/fine-tuning) and joint training. All combinations of training strategies and {first/last/both} stochastic layers are compared. The authors compare the proposed approach to MF-VI, sparse VI, ensembles and deterministic NNs. The method is evaluated on two classification datasets: DermaMNIST and ChestMNIST, and two segmentation datasets: ISIC and LIDC-IDRI.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- I believe the proposed approach has novelty.
- The methodology is quite clear.
- The investigation, including ID and OOD performance, is interesting.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The results and most insights are presented by family of uncertainty quantification technique (for instance, FirstLast VI), but there is significant variability within a given family depending on the training strategy. For instance, on LIDC-IDRI, Joint-FirstLast is 10 Dice points below the post-hoc reparametrization strategies (and the uAUPR varies by more than 15 points within this family). This makes it difficult to draw general insights out of the experiments.
- Generally speaking, it is difficult to extract general messages and guidelines out of the experiments, because different methods come out on top in different experiments/datasets. This also raises general questions about the replicability of the results if we rerun the experiments with a different seed, different fold, etc.
- The OOD evaluation is limited to a failure detection metric, which is already interesting in its own right for an OOD scenario, but it is not clear whether the methods still perform well in terms of classification/segmentation performance in this OOD scenario. I am thinking of situations where an approach has many false negatives/positives but systematically detects them.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Why highlight uAUPR in Table 1? Other metrics are also interesting in their own right and sometimes give a different story about the best method.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The strengths mentioned above slightly outweigh the weaknesses at this stage in the review process.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
For me the paper is entirely borderline, taking into account the other reviews.
Weakness: weak/no methodological contribution, and the abstract is misleading in that respect (stated contributions in the intro are clearer).
Strengths: The experiments present interesting results. The investigation of VVLs for failure detection + applications to the medical domain were not done in the closest related work [26]. This makes the paper potentially interesting to the medical imaging community.
Having to choose between accept/reject, I recommend acceptance based on the sentiment that the paper can be interesting to the community despite its shortcomings.
Review #3
- Please describe the contribution of the paper
The largest contribution is a new heuristic the paper proposes to automatically set a weighting hyperparameter for KL divergence weighting.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well written and clear to follow, which is not trivial in such a complex domain. In addition, the paper does an evaluation of which layers to convert to variational layers. The method of only converting some layers is simple and efficient; however, it has been done in the literature, as the paper mentions. Nevertheless, it results in a workable solution that could be adopted by the community. The paper not only presents results but actually draws a proper conclusion about when each approach should be used.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Since the concept of only selecting a limited subset of layers to be transformed into variational layers is not new, the innovative aspect of this paper is rather limited. In the experiments it would be best to also evaluate how the method performs when the heuristic for beta in Eq. 1 is not used, i.e., if people had to manually set such a value. By evaluating various values, the reader could get a feeling for (a) whether the proposed heuristic actually achieves (near-)optimal results, and (b) how big a problem it is that the paper attempts to resolve. Even though two different use cases were evaluated, the heuristic is only evaluated for two different network architectures. Will the same conclusion (and benchmark) hold if deeper, wider, or transformer-like networks are used? Showing robustness to architecture choice would be essential to promote the heuristic hyperparameter setting.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
In Section 2.2, post-hoc reparametrization: which values for sigma are used? Just after Equation 1, "hyperparameter beta is based on": I assume it is set equal to this quantity, or is a more complex formula used?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The easy-to-follow explanation and simplicity of the method could make it useful for the community, although some additional experiments should prove its robustness.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their insightful comments and constructive feedback to improve the manuscript.
EVALUATION & EXPERIMENTS
Additional Experiments (R1, R2, R3): We evaluate Variational Visible Layers (VVL) for failure detection in both in- and out-of-distribution (ID/OOD) settings for segmentation and classification with 9 different training schemes. We demonstrate that our VVL yields better overall ID/OOD failure detection performance and provide recommendations based on use cases for failure detection and data size. Our evaluations are done on 2 different tasks (classification and segmentation) and across 4 different datasets. We will replace Table 1 with boxplots that include confidence intervals across different seeds for ease of representation in the final version. We agree that further experiments on different network architectures would add value. We defer this to future work and will acknowledge this in the final version.
- uAUROC Threshold (R1): We frame uncertainty estimation as a sample-level binary classification task as follows (Section 2.3):
- Classification: Use accuracy as an indicator of pass/fail.
- Segmentation: We require a single pass/fail label per image, reflecting the clinical workflow in which an entire sample is flagged for human review. To translate pixel-wise outputs into a sample-level decision, we compare each image's overall performance (IoU or Dice) against a threshold. We set the threshold to the model's mean ID score: any case scoring below the threshold is deemed a failure.
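As a minimal sketch of this sample-level framing (function and variable names are illustrative, not taken from the paper), the failure detection metrics could be computed as follows, assuming per-sample Dice scores and per-sample uncertainty scores are already available:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def failure_detection_metrics(per_sample_dice, per_sample_uncertainty, threshold):
    """Binary failure detection: a sample 'fails' if its Dice falls below the threshold."""
    failures = (np.asarray(per_sample_dice) < threshold).astype(int)   # 1 = failure
    uncertainty = np.asarray(per_sample_uncertainty)                   # higher = more uncertain
    return {
        "uAUROC": roc_auc_score(failures, uncertainty),
        "uAUPR": average_precision_score(failures, uncertainty),
    }

# Threshold set to the model's mean in-distribution score, as described above.
# `dice_id`, `dice_test`, `unc_test` are hypothetical per-sample arrays.
# threshold = float(np.mean(dice_id))
# print(failure_detection_metrics(dice_test, unc_test, threshold))
```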
- Performance on OOD (R1, R2): We confirm that failure detection on synthetically perturbed (OOD) samples is included in our analysis. The uAUPR results are reported in Table 1 & Figure 2, while qualitative OOD segmentations appear in Figure 3. Due to space constraints, we omitted absolute performance results on OOD data, as they do not add to the overall analysis/conclusions and show expected trends: a modest decline with increased corruption. Our focus is on failure detection, which is reported in the manuscript.
NOVELTY & CONTRIBUTION
We acknowledge that using a subset of variational parameters has been previously explored, as cited in references [1, 11, 26]:
[1] explores sparse variational parameters for medical image segmentation/classification but does not evaluate uncertainty under distributional shift. [11] applies variational parameters to the last layers for classification/regression. [26] applies variational parameters to the input OR output layers and compares them to full Variational Inference (VI) for classification/regression.
However, the key differences between our work and prior work are:
- Prior works [1,11,26] agree that full VI is unnecessary for uncertainty estimation. While prior work considered applying variational parameters to the last layer [1,26], the first layer [26], or a sparse subnetwork [1], we introduce stochasticity into both the first and last layers jointly (VVL) to capture uncertainty, which was not previously investigated.
- We investigate the ability of VVLs to perform sample-level failure detection under ID & OOD shifts in medical imaging, using both joint and post-hoc training, and compare to individual VVLs, full VI, and ensembles.
- Beta Setting (R3): We also address a key challenge in VI training: the sensitivity of the KL weight (beta) in the ELBO loss. While existing methods use grid search, annealing, or minibatch scaling under the assumption of fully variational networks, these do not extend to our VVL setting, where only a subset of parameters is stochastic. We propose a method to automatically set beta = (# stochastic parameters / # total parameters), which eliminates manual tuning and improves practicality and reproducibility (Section 2.2). This will be emphasized in the final version.
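A minimal sketch of this heuristic and of a beta-weighted ELBO-style loss, assuming the variational layers expose a `kl_divergence()` method (as in the earlier sketch) and that `model` is a `torch.nn.Module`; the function names are illustrative:

```python
def kl_weight(model):
    """beta = (# stochastic parameters) / (# total parameters), as described in the rebuttal."""
    n_total = sum(p.numel() for p in model.parameters())
    n_stochastic = sum(
        p.numel()
        for m in model.modules()
        if hasattr(m, "kl_divergence")            # only the variational (visible) layers
        for p in m.parameters(recurse=False)
    )
    return n_stochastic / n_total

def weighted_elbo_loss(task_nll, model):
    """Task negative log-likelihood plus the summed KL of the variational layers, scaled by beta."""
    kl = sum(m.kl_divergence() for m in model.modules() if hasattr(m, "kl_divergence"))
    return task_nll + kl_weight(model) * kl
```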
- Our study offers actionable guidance for method selection (Section 4).
We will further highlight these contributions and the differences to existing works in the final version.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
I recommend acceptance of this paper. While there are some methodological limitations and a somewhat overstated abstract, the authors have adequately addressed most reviewer concerns in the rebuttal. The investigation of variational visible layers (VVLs) for failure detection, especially in a medical context, presents novel insights not explored in closely related work [26]. The experimental results are meaningful and contribute to a growing area of research at the intersection of AI robustness and clinical application. Overall, despite the paper's limitations in methodological novelty, the combination of a clear application direction, thoughtful experimentation, and timely topic justifies its acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A