Abstract
Reliable out-of-distribution (OOD) detection is important for safe deployment of deep learning models in fetal ultrasound amidst heterogeneous image characteristics and clinical settings. OOD detection relies on estimating a classification model’s uncertainty, which should increase for OOD samples. While existing research has largely focused on uncertainty quantification methods, this work investigates the impact of the classification task itself. Through experiments with eight uncertainty quantification methods across four classification tasks on the same image dataset, we demonstrate that OOD detection performance significantly varies with the task, and that the best task depends on the defined ID-OOD criteria; specifically, whether the OOD sample is due to: i) an image characteristic shift or ii) an anatomical feature shift. Furthermore, we reveal that superior OOD detection does not guarantee optimal abstained prediction, underscoring the necessity to align task selection and uncertainty strategies with the specific downstream application in medical image analysis. Code: https://github.com/wong-ck/ood-fetal-us.
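To make the evaluation protocol concrete, below is a minimal illustrative sketch (not taken from the paper's repository) of uncertainty-based OOD detection: a classifier's predictive entropy serves as the OOD score, and detection is evaluated with AUROC over ID/OOD labels. All array values are toy assumptions.
```python
# Illustrative sketch, not the authors' code: score OOD samples with a
# classifier's predictive entropy and evaluate detection with AUROC.
import numpy as np
from sklearn.metrics import roc_auc_score

def entropy_score(probs: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Predictive entropy per sample; higher = more uncertain."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

# probs_id / probs_ood: softmax outputs of the same classifier on
# in-distribution and out-of-distribution images (shape: [n, n_classes]).
probs_id = np.array([[0.90, 0.05, 0.05], [0.80, 0.10, 0.10]])
probs_ood = np.array([[0.40, 0.35, 0.25], [0.34, 0.33, 0.33]])

scores = np.concatenate([entropy_score(probs_id), entropy_score(probs_ood)])
labels = np.concatenate([np.zeros(len(probs_id)), np.ones(len(probs_ood))])  # 1 = OOD

print("OOD-detection AUROC:", roc_auc_score(labels, scores))
```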
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3992_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/wong-ck/ood-fetal-us
Link to the Dataset(s)
BCNatal dataset: https://zenodo.org/records/3904280
African dataset: https://zenodo.org/records/7540448
BibTex
@InProceedings{WonChu_Influence_MICCAI2025,
author = { Wong, Chun Kit and Christensen, Anders N. and Bercea, Cosmin I. and Schnabel, Julia A. and Tolsgaard, Martin G. and Feragen, Aasa},
title = { { Influence of Classification Task and Distribution Shift Type on OOD Detection in Fetal Ultrasound } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {295 -- 305}
}
Reviews
Review #1
- Please describe the contribution of the paper
The manuscript studies eight common classifier-based uncertainty quantification (UQ) techniques for out-of-distribution (OOD) detection on three fetal ultrasound datasets and provides two major findings: (1) the OOD detection performance of a UQ method depends on both the classification task and the type of distribution shift; (2) strong OOD performance does not imply the best abstention performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper provides a relatively comprehensive study of classifier-based UQ techniques on fetal ultrasound datasets, which could be useful for future work on fetal ultrasound image analysis.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The organization and major findings of this work are very similar to the previous benchmark (Mucsányi et al., NeurIPS 2024). The key difference is that the experiments were conducted on fetal ultrasound datasets. However, the paper provides limited discussion of the clinical significance of the key findings, and the scope of the experiments (i.e., the number of studied methods and the size of the datasets) is narrower than in the previous work. Therefore, the novelty and contribution of this paper are limited.
Mucsányi, Bálint, Michael Kirchhof, and Seong Joon Oh. “Benchmarking uncertainty disentanglement: Specialized uncertainties for specialized tasks.” Advances in Neural Information Processing Systems 37 (2024): 50972-51038.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The novelty and contribution are limited, as explained in the major weakness comment.
- The writing is poor. For example, references are used inappropriately: in academic writing, reference numbers typically do not serve as grammatical subjects in sentences. In addition, there are many grammatical errors.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The paper shows, through a large‑scale benchmark on fetal‑ultrasound data, that the choice of base classification task dramatically alters the performance of eight uncertainty methods for OOD detection, and that the task producing the best OOD AUROC can differ from the one that maximises downstream accuracy under distribution shift.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper conducts extensive experiments to find that the choice of base classifier for the OOD detector has a large effect on OOD performance.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- No new algorithm or theoretical insight is proposed; the contribution is purely empirical, showing limited methodological novelty.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
No new algorithm or theoretical insight is proposed; the contribution is purely empirical, showing limited methodological novelty.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper applies OOD detection to ultrasound data using a range of established uncertainty quantification techniques, with a specific focus on classifier-based uncertainty quantification.
The paper’s key novelty lies in applying uncertainty quantification for OOD detection not only to the primary task classifier but also to a range of auxiliary classifiers trained to predict metadata such as scanner type or demographics. The authors find that auxiliary classifiers are often better suited for OOD detection, in particular when the distribution shifts can be attributed to different acquisition settings.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is clearly written, well structured, and well evaluated.
- The idea is simple, yet I’m not aware of any other work using auxiliary classifiers for confidence-based out-of-distribution detection. This makes sense and could be easily applicable in practice, as metadata only needs to be available for training data. I could see this being further developed in the future as well.
- The results are consistent with what one would expect: classifiers trained on factors that have some relationship with the nature of the shift are better at detecting OOD shifts than the primary task classifier, but the task classifier is more suitable for detecting degraded model performance (the “abstained prediction”, or deferral, setting).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- I was a bit surprised to see the simplest baseline missing: max confidence (https://arxiv.org/abs/1610.02136). However, while I would recommend including it, I do not think this would fundamentally change any conclusions.
- The results are less surprising to me than the authors imply in their presentation and discussion of the results. For example, I would absolutely expect that uncertainty quantification in the primary task classifier would be better at detecting task misclassification (deferral setting, Fig. 4) than an auxiliary task classifier.
- Relatively few details are provided on training the individual uncertainty quantification techniques, and no code is available or promised. I understand the space limitation, but would strongly recommend the authors to provide code at least.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is not a groundbreaking contribution, but a promising idea with a solid evaluation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal did not change my own assessment. Also, after reading the other reviewer comments as well as the rebuttal, I still believe this is a worthwhile contribution and could spark interest in the MICCAI community.
Author Feedback
We thank the reviewers for their time, effort and constructive feedback.
== Our contribution == This application paper does not aim to propose a new method, but to highlight factors – to the best of our knowledge, previously unexamined – that affect how we develop and validate uncertainty quantification methods:
(1) the dependency on the classification task used during training, and (2) how model selection based on traditional OOD (predict-the-dataset) performance yields suboptimal models for abstained prediction (i.e., not using predictions with high uncertainty) – which is usually the true use case in medical applications.
These points are not made in Mucsányi et al., and are not commonly known. In particular, OOD detection is standard for validating epistemic uncertainty performance, precisely because we seek to avoid making incorrect predictions on unseen data. We hope this clarification addresses the novelty concerns of Reviewers 2 and 3.
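For clarity, here is a minimal sketch of the abstained-prediction (deferral) setting described above, under the assumption that predictive entropy is the uncertainty score and a fixed coverage determines the abstention threshold; the names and toy values are illustrative assumptions, not the paper's implementation.
```python
# Illustrative sketch of abstained prediction: discard the most uncertain
# predictions and measure accuracy only on the retained ones.
import numpy as np

def abstained_accuracy(probs: np.ndarray, y_true, coverage: float = 0.8) -> float:
    """Keep the `coverage` fraction of lowest-entropy samples; return their accuracy."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    n_keep = int(np.ceil(coverage * len(probs)))
    keep = np.argsort(entropy)[:n_keep]          # most confident samples
    y_pred = probs[keep].argmax(axis=1)
    return float((y_pred == np.asarray(y_true)[keep]).mean())

# Toy softmax outputs and labels.
probs = np.array([[0.90, 0.10], [0.60, 0.40], [0.55, 0.45], [0.20, 0.80]])
y_true = [0, 1, 0, 1]
print(abstained_accuracy(probs, y_true, coverage=0.5))  # accuracy on retained half
```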
== Specific concerns == [R2] Similarity to Mucsányi et al.: We utilize the framework of Mucsányi et al. under different experimental settings to address an entirely new research question:
How does the choice of classification task impact OOD detection performance? To the best of our knowledge, this systematic investigation into task dependency is novel and fills a gap in the existing literature. This is highly relevant in medical image analysis, where AI software in the clinic often includes multiple predictors for the same image – e.g. identifying the anatomical ultrasound plane, assessing plane quality, and predicting an outcome – all on the same images. Intuitively, one might expect the definition of “OOD” to be stable across tasks, but that is not what we observe. Our additional demonstration that model selection should not be based on OOD performance alone adds further value to our paper.
[R2] Narrower scope of experiments: We agree that the scope of our experiments is narrower than large-scale natural image benchmarks such as Mucsányi et al. – we consider this a strength: as Mucsányi et al. also note, evaluations using the target-domain dataset are crucial, as findings may not transfer across domains. Our focus on fetal ultrasound, a specific medical domain with inherently smaller datasets than massive natural image collections, necessitated a more focused evaluation. Furthermore, the decision to primarily evaluate deterministic methods was motivated by the clinical relevance of providing real-time support for clinicians, where the computational cost and latency of probabilistic methods can be prohibitive.
[R2] Clinical significance of key findings: While this paper is not a clinical validation study, the finding itself has significant clinical implications. Demonstrating that OOD detection performance varies drastically depending on the classification task and distribution shift type means that a model deemed “reliable” for one type of OOD shift might be unreliable for another. This directly impacts the trustworthiness of deployed models in clinical settings where multiple types of distribution shifts can occur. The additional experiment on abstained prediction further underscores this, showing that the task optimal for OOD detection may not be optimal for deciding whether to use predictions, further highlighting the necessity of considering the specific downstream clinical application.
[R1] Missing max confidence baseline: Thank you for pointing this out. In our early experiments, both entropy (as presented) and maximum softmax probability were evaluated as baselines. As we observed very similar performance and trends for our OOD detection setup, we chose to report only logit entropy to sharpen focus and be concise. We agree this is a valid and simple baseline, but including it would indeed not alter the fundamental conclusions of our study.
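For illustration, a minimal sketch (toy values, not the paper's code) of how the maximum-softmax-probability and entropy baselines relate when computed from the same softmax outputs; in this toy case the two scores rank the samples identically, which is consistent with the similar trends reported above.
```python
# Illustrative only: the two baselines discussed above, computed from the
# same softmax outputs. Values are toy assumptions.
import numpy as np

probs = np.array([[0.90, 0.05, 0.05],
                  [0.50, 0.45, 0.05],
                  [0.34, 0.33, 0.33]])

msp_uncertainty = 1.0 - probs.max(axis=1)                 # max-softmax-probability baseline
entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # entropy baseline

# Both scores order these samples from most to least confident in the same way.
print(np.argsort(msp_uncertainty))  # [0 1 2]
print(np.argsort(entropy))          # [0 1 2]
```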
[R1,2,3] Reproducibility: We cannot release our dataset due to privacy regulations, but we will make the source code available upon acceptance.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
While this paper received two Weak Rejects, they are from the first round of reviews. R1 argued for Accept, and after the rebuttal R1 confirmed their initial assessment, whereas R2 and R3 did not come back.
I went through the paper, the reviews and the response and, to be honest, I like the idea and the hypotheses being tested here – that performing well at OoD detection for one task does not imply performing well for another, and that OoD performance does not translate directly into abstention-based performance. I think these are relevant and worth having at MICCAI. Even if I understand that the authors mostly used the work of Mucsányi et al., adapting it to a MICCAI context, it does appear that they followed a different research direction, so I am okay with that.
BTW, maybe I am wrong, but I am not sure that the (nice) idea that R1 mentions when describing their understanding of this paper (observing the confidence of auxiliary classifiers trained on metadata prediction for OoD purposes – you do not need the metadata at test time, brilliant) is what the authors are really doing here. If I am right, I would like to invite the authors to pursue that idea, as they have most of the code and data almost ready to try it out, and to acknowledge R1’s unintentional research-line suggestion, if it works well!
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The rebuttal addresses the reviewers’ concerns to some extent, but the answers to some of the questions raised are not convincing, e.g., regarding the limited methodological novelty.