Abstract
Uncertainty quantification is necessary for developers, physicians, and regulatory agencies to build trust in machine learning predictors and improve patient care. Beyond measuring uncertainty, it is crucial to express it in clinically meaningful terms that provide actionable insights. This work introduces a conformal risk control (CRC) procedure for organ-dependent uncertainty estimation, ensuring high-probability coverage of the ground-truth image. We first present a high-dimensional CRC procedure that leverages recent ideas of length minimization. We make this procedure semantically adaptive to each patient’s anatomy and positioning of organs. Our method, semCRC, provides tighter uncertainty intervals with valid coverage on real-world computed tomography data while communicating uncertainty with clinically relevant features.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2001_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/Sulam-Group/semantic_uq
Link to the Dataset(s)
TotalSegmentator: https://github.com/wasserth/TotalSegmentator
FLARE: https://codalab.lisn.upsaclay.fr/competitions/12239
AbdomenAtlas-8K: https://github.com/MrGiovanni/AbdomenAtlas
BibTex
@InProceedings{TenJac_Conformal_MICCAI2025,
author = { Teneggi, Jacopo and Stayman, J. Webster and Sulam, Jeremias},
title = { { Conformal Risk Control for Semantic Uncertainty Quantification in Computed Tomography } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {45--55}
}
Reviews
Review #1
- Please describe the contribution of the paper
- Provides more useful uncertainty intervals when using CRC / RCPS
- Useful here means that the intervals are tighter
- Tighter intervals are obtained by calibrating not on the task over the whole image but on individual regions
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The method does indeed yield tighter intervals
- Demonstrated on two different tasks
- Clear results
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Please note that the reviewer is open to increasing or decreasing the score based on the rebuttal phase.
- Main concern: novelty; this has been done before in “Subgroup-Specific Risk-Controlled Dose Estimation in Radiotherapy” by Fischer et al.
- In that paper, calibration of individual image regions has already been performed (both known and unknown at test time)
- Therefore, there is no methodological contribution, only an application in a different domain
- The suggested method needs access to some sort of “ground truth”
- It is also trivial that one will get tighter bounds when calibrating on individual structures rather than on the whole image
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The idea itself is okay, but it has been done before (the related work was neither mentioned nor identified)
- The idea is also trivial, so the novelty is very limited
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The main weakness is the similarity to Fischer et al. The paper would need to be rewritten very differently so that a clear distinction from this related work becomes apparent, as there are too many methodological similarities to that paper. The contributions could indeed lie in optimizing the interval length and in an application to a different domain, but that would make it a very different paper. I therefore recommend rejection.
Review #2
- Please describe the contribution of the paper
The paper proposes a method for predicting uncertainty intervals at organ level, with application to image reconstruction. To my understanding, the key contribution here is the capability to provide control over the organ-specific risk levels.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel method for organ-specific prediction of uncertainty with guarantees on the risk control
- Evaluated across different tasks and datasets
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is unclear how this method contributes to quantifying uncertainty in a way that is meaningful to physicians and regulatory agencies for verifying safety and reliability (which was used as motivation)
- The results are difficult to interpret, and there is little guidance for readers less familiar with the topic
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I have major difficulties understanding the paper’s results. I do not know how to meaningfully interpret the metrics (length vs. risk), the comparisons to other methods, or the visual results. The paper could do a much better job of guiding readers through the results. Should we prefer risk over length, what are the practical implications of each, and how can they be interpreted by physicians or regulators?
I cannot see much difference in the maps shown in Fig. 2, and I am not sure how to interpret the different shades of gray in the bottom row. What do good and bad look like in this case? How can I assess whether the proposed method makes a meaningful improvement over previous methods?
As it stands, the paper might be of limited interest to the wider MICCAI community.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal has addressed most of my questions and clarified the practical downstream use. I now believe that this could make an interesting contribution to the conference.
Review #3
- Please describe the contribution of the paper
The manuscript describes methodological improvements to conformal risk control, an emerging general paradigm for uncertainty quantification, by tailoring the method to inverse problems in medical image analysis. This results in tighter uncertainty bounds for the same level of error control, as well as in separate error control for different semantic image regions (such as organs) instead of over the whole image, yielding more semantically meaningful uncertainty estimates. The method is evaluated on two synthetic (CT) case studies, one in image reconstruction and one in noise removal.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Accurate uncertainty quantification in inverse imaging problems (e.g. image reconstruction, denoising, synthesis) is an important and largely open problem. Conformal risk control is a successful recent general paradigm, and its tailored adaptation to the field of medical image analysis is timely and welcome.
The paper is generally well-written, mathematically well-presented and sound. The method is (to the best of my knowledge) novel and improves in clear ways upon previous methods for conformal risk control.
Despite being on the more theoretical side for a conference such as MICCAI, the paper includes an evaluation on two reasonably realistic case studies (using the TotalSegmentator dataset), yielding favorable results both quantitatively and qualitatively.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Somewhat unexpectedly, I found the more theoretical parts of the paper very clearly presented but the medical imaging experiments difficult to understand. These should be clarified in several regards; see some specific questions below. In general, since this is MICCAI and not a more theoretical conference, I would recommend highlighting very early on in the abstract and the introduction what specific (realistic) medical imaging applications this methodology would be useful for.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The authors write about “d-dimensional images” - I believe d here denotes the number of pixels, not d=2/3 for slices/volumes. This was a bit confusing to me, and I recommend clarifying it.
- In the background section on conformal risk control, it reads a bit as if the loss in Eq. (2) were the only thing being optimized. This would of course be nonsensical, since this loss is always minimized by setting lambda = inf. The later formulations then actually do not minimize this loss but rather minimize the mean interval length under the constraint that loss <= eps. This of course makes sense; I would recommend adding a remark in this regard to the background section (a sketch of this constrained formulation is given after these comments).
- In Fig. 3, the legend is (too) tiny. Additionally, what is currently positioned as a y axis label should instead be an x axis label.
- As hinted at above, various aspects of the experiments remained unclear to me. Let me try to rephrase what I believe to have understood:
- There are two separate synthetic (?) experiments, one on denoising (artificial noise) and one on CT image reconstruction from cone beam projections simulated from real recordings? What exactly are the model inputs and outputs in both cases (especially the reconstruction task)?
- Both tasks are evaluated separately on both the TotalSegmentator and FLARE23 datasets?
- For each task and target dataset, a MONAI 3D UNet is trained on the AbdomenAtlas-8K dataset? So the evaluation on TotalSegmentator and FLARE23 is fully o.o.d. and there is no i.i.d. evaluation at all…? How is this model trained, then - using the same synthetic noise addition / cone beam projection simulation?
- Why are different results obtained on the same dataset for the performance of the organ segmentation model between the two tasks? Shouldn’t these be identical?
- I believe many of these issues could be effectively addressed by adding a schematic figure illustrating the overall experimental setup.
- Why is the background still strongly overcovered by the proposed method? Or am I misinterpreting the results somehow?
- Finally (relatedly?), could it not be the case that the predicted quantile already overcovers the target risk, necessitating a negative lambda to obtain the desired level of risk control while minimizing interval length?
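To make the remark on Eq. (2) above concrete, the constrained problem described in that comment can be restated as follows. This is a sketch under assumed notation (lambda the calibration parameter, eps the risk tolerance, d the number of pixels, and [l_j(lambda), u_j(lambda)] the per-pixel uncertainty interval), not the paper’s exact formulation:

```latex
% Not the loss of Eq. (2) alone (trivially driven down by \lambda \to \infty),
% but mean interval length minimization subject to risk control:
\begin{equation*}
  \min_{\lambda} \; \frac{1}{d} \sum_{j=1}^{d} \bigl( u_j(\lambda) - l_j(\lambda) \bigr)
  \qquad \text{s.t.} \qquad \mathbb{E}\bigl[ \ell(\lambda) \bigr] \le \varepsilon .
\end{equation*}
```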
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The manuscript presents significant methodological developments with clear applicability and utility to inverse medical imaging problems. The theory and method are well presented; however, for acceptance, the experimental section needs to be clarified.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my initial concerns effectively in their rebuttal. I consider this a sound, interesting, and well-written technical contribution. While I agree with R2 that this contribution will probably be too technical for the more applied parts of the community, I personally consider it well within scope for MICCAI: a sound methodological development with clear applicability and case studies in medical image analysis.
Author Feedback
We sincerely thank all Reviewers for their thoughtful and constructive feedback. Here, we address the main points raised by the Reviewers, and we assure them that all outstanding suggestions about presentation and layout will be incorporated in the revised version of the paper.
Novelty. We thank Reviewer #3 for pointing out the connection with Fischer et al. (SG-RCPS), which we will include. While both works focus on risk control for clinically relevant structures, they significantly differ in methodology (and application domain). Briefly, we study mean interval length minimization, and our per-organ calibration procedure is different from SG-RCPS. In detail, we remark that we provide two distinct methods to construct organ-dependent uncertainty intervals, one that provides risk control over the entire image (Eq. 9), and one for per-organ risk control (Eq. 12). First, for overall risk control, we use principles of convex optimization to minimize the mean interval length. This is novel, as Fischer et al. do not study interval length minimization. Second, for per-organ risk control, our procedure differs from SG-RCPS, which stops when the maximum upper confidence bound across groups exceeds the tolerance level, while ours calibrates each organ independently. This is an important technical difference, as SG-RCPS returns a scalar value that controls multiple risks, while both our procedures return vectors whose entries correspond to different organs. At test time, SG-RCPS applies the same scalar to all intervals, whereas we assign intervals different values according to which organ they belong to. Lastly, we stress that per-organ risk control inflates the mean interval length, and it is the former method, for overall risk control, that provides the tightest intervals.
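To make the distinction above concrete, the following is a minimal, hypothetical sketch of per-organ calibration that returns one threshold per organ, in contrast to a single scalar applied to all intervals as in SG-RCPS. All names (calibrate_per_organ, loss_fn, etc.) are placeholders, and the upper-confidence-bound machinery the paper uses for high-probability guarantees is omitted:

```python
import numpy as np


def calibrate_per_organ(loss_fn, calib_samples, organ_ids, lambdas, eps):
    """Hypothetical sketch of per-organ calibration.

    loss_fn(sample, organ, lam) returns a loss in [0, 1] for one calibration
    sample, restricted to the pixels of `organ`, when the intervals are scaled
    by `lam`. For each organ we keep the smallest lambda whose mean calibration
    loss is at most eps (the paper additionally bounds the risk with an upper
    confidence bound; omitted here for brevity).
    """
    lam_hat = {}
    for organ in organ_ids:
        lam_hat[organ] = max(lambdas)          # conservative fallback
        for lam in sorted(lambdas):            # assume the loss is non-increasing in lambda
            risk = np.mean([loss_fn(s, organ, lam) for s in calib_samples])
            if risk <= eps:
                lam_hat[organ] = lam
                break
    return lam_hat  # one threshold per organ, unlike SG-RCPS's single scalar
```

At test time, each interval would then be scaled by the threshold of the organ its pixel belongs to, matching the vector-valued behavior described in the rebuttal.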
Practical implications. We thank Reviewer #2 for these excellent questions. The tradeoff between per-organ risk control and interval length should be informed by the downstream clinical task. In tumor resection planning or organ transplant evaluation, for example, targeted per-organ coverage is critical to avoid treatment errors. Here, intervals may be large, but they capture the uncertainty of the model in reconstructing the target organ. On the other hand, for whole-abdomen segmentation or total lesion volume measurement, minimizing the mean interval length over several organs is informative about the distribution of the model’s error. Our method improves on alternatives by conveying uncertainty through clinically meaningful semantic structures rather than individual pixels. For example, clinicians may use the minimal dose that guarantees risk control for a target organ with uncertainty intervals shorter than a task-driven tolerance, and regulatory agencies may issue risk/length standards for algorithms to be approved. We will highlight these aspects in the revised version of the paper.
Experimental details. We are grateful to Reviewer #1 and #2 for their questions on the experimental setup and results. For denoising, the input to the model is the simulated noisy image, and for reconstruction, it is the FBP estimate from the simulated noisy sinogram. For each task, we train a 3D U-Net (2 models in total) by applying the same noise addition, and projection simulation with FBP, to AbdomenAtlas. Evaluation is fully ood because the segmentation model (SuPrem) is also pretrained on AbdomenAtlas, which would bias an iid evaluation. We use SuPrem to segment the output of the 3D U-Net, which is task-specific; this is why performance varies on the same dataset across tasks (we will include schematics). Yes, the background is overcovered because the output of the 3D U-Net is already good enough there. Negative lambdas may break convexity of the upper bound (see [26, Appendix A.3]), and this is a great point for further improvements. Lastly, the different shades of gray in the bottom row of Fig. 2 indicate the level of correction needed in each organ to achieve risk control. Brighter colors represent worse performance.
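To help picture the two pipelines described above, the following sketch illustrates the two kinds of model inputs using a parallel-beam stand-in from scikit-image. The paper simulates cone-beam projections, so the geometry, noise model, and all names below are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from skimage.transform import radon, iradon


def make_task_inputs(ct_slice, noise_sigma=0.05, n_angles=360):
    """Illustrative simulation of the two model inputs (hypothetical sketch).

    Denoising task: the model input is the noisy image itself.
    Reconstruction task: the model input is the FBP estimate computed from a
    noisy sinogram of the same slice (parallel-beam here for simplicity).
    """
    rng = np.random.default_rng(0)

    # Task 1 (denoising): add synthetic noise directly in image space.
    noisy_image = ct_slice + noise_sigma * rng.standard_normal(ct_slice.shape)

    # Task 2 (reconstruction): simulate a noisy sinogram, then apply FBP.
    theta = np.linspace(0.0, 180.0, n_angles, endpoint=False)
    sinogram = radon(ct_slice, theta=theta, circle=False)
    noisy_sinogram = sinogram + noise_sigma * sinogram.max() * rng.standard_normal(sinogram.shape)
    fbp_estimate = iradon(noisy_sinogram, theta=theta, filter_name="ramp", circle=False)

    # These two arrays would be the inputs to the two task-specific U-Nets
    # (3D in the paper; a single 2D slice here for brevity).
    return noisy_image, fbp_estimate
```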
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work tailors conformal risk control to medical inverse problems, demonstrating two calibration modes: one convex-optimised scalar that achieves the tightest image-wide intervals, and one vector of organ-specific factors that delivers per-structure guarantees.
Initial reviewer scores averaged 3.0 (R1 4(WA), R2 3(WR), R3 2(R)). After the rebuttal clarified the experimental pipeline and added clinical context for the risk-versus-length trade-offs, R1 and R2 raised their recommendations to Accept; R3 maintained Reject, citing overlap in spirit rather than any flaw in theory or results.
The organ-aware extension of conformal risk control is methodologically solid and novel for medical imaging. After the rebuttal, two reviewers moved to Accept; only one remains opposed. The AC recommends acceptance.
For the camera-ready version please:
- cite Fischer et al. and state the differences,
- add the schematic of data generation/evaluation,
- offer concise guidance on when to choose global vs. organ-specific calibration.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Most of the reviewers voted for acceptance.