Abstract
Deep learning models are vulnerable to performance degradation when encountering out-of-distribution (OOD) images, potentially leading to misdiagnoses and compromised patient care. These shortcomings have led to great interest in the field of OOD detection.
Existing unsupervised OOD (U-OOD) detection methods typically assume that OOD samples originate from an unconcentrated distribution complementary to the training distribution, neglecting the reality that deployed models passively accumulate task-specific OOD samples over time.
To better reflect this real-world scenario, we introduce Iterative Deployment Exposure (IDE), a novel and more realistic setting for U-OOD detection.
We propose CSO, a method for IDE that starts from a U-OOD detector that is agnostic to the OOD distribution and slowly refines it during deployment using observed unlabeled data.
CSO uses a new U-OOD scoring function that combines the Mahalanobis distance with a nearest-neighbor approach, along with a novel confidence-scaled few-shot OOD detector to effectively learn from limited OOD examples. We validate our approach on a dedicated benchmark, showing that our method greatly improves upon strong baselines on three medical imaging modalities.
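As a rough illustration of a score that combines the Mahalanobis distance with a nearest-neighbor term, the following Python sketch computes both components in a shared feature space and sums them. It is based only on the abstract; the authors' actual formulation (see the code repository linked below) may differ, and the additive combination and all names here are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def fit_gaussian(train_feats):
    """Estimate the mean and (regularized) inverse covariance of the
    in-distribution training features."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False) + 1e-6 * np.eye(train_feats.shape[1])
    return mu, np.linalg.inv(cov)

def uood_score(x_feats, train_feats, mu, cov_inv, k=5):
    """Higher score = more likely OOD. Global Mahalanobis distance plus the
    mean Euclidean distance to the k nearest training features (an assumed,
    simple additive combination)."""
    diff = x_feats - mu
    maha = np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
    knn = np.sort(cdist(x_feats, train_feats), axis=1)[:, :k].mean(axis=1)
    return maha + knn
```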
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2508_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/LarsDoorenbos/IDE/
Link to the Dataset(s)
https://www.nature.com/articles/s41746-020-0273-z
https://stanfordmlgroup.github.io/competitions/mura/
https://www.kaggle.com/competitions/diabetic-retinopathy-detection
https://www.cs.toronto.edu/~kriz/cifar.html
BibTex
@InProceedings{DooLar_Iterative_MICCAI2025,
author = { Doorenbos, Lars and Sznitman, Raphael and Márquez-Neila, Pablo},
title = { { Iterative Deployment Exposure for Unsupervised Out-of-Distribution Detection } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {369--379}
}
Reviews
Review #1
- Please describe the contribution of the paper
The research tackles the problem of Iterative Deployment Exposure (IDE) for OOD detection - a refreshing shift from conventional OOD detectors, which typically remain static after deployment. The authors propose CSO (Confidence-Scaled U-OOD detector), a method that adapts to newly encountered outliers over time. It works by using pseudo-labels from an initial unsupervised OOD detector to train a binary classifier, which is then refined iteratively until it fully replaces the original detector. The method is evaluated against existing outlier exposure techniques on newly introduced datasets, benchmarks, and metrics. It outperforms competing methods on 3 out of 4 benchmarks, indicating strong potential in this evolving deployment setting.
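As a loose illustration of the iterative loop described above (pseudo-labels from the current detector train a binary classifier that eventually takes over), here is a minimal Python sketch. It follows only this summary, not the paper's actual algorithm; the threshold, classifier choice, and hand-over rule are all assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def iterative_exposure(base_score, base_threshold, deployment_batches):
    """Sketch of the iterative loop summarized in the review: pseudo-labels
    from the current detector train a binary classifier on accumulated
    deployment data, and the classifier takes over once it can be fitted.
    base_score maps an (n, d) feature array to U-OOD scores (higher = more OOD)."""
    feats, labels = [], []
    current = lambda x: (base_score(x) > base_threshold).astype(int)
    for batch in deployment_batches:                   # unlabeled data over time
        feats.append(batch)
        labels.append(current(batch))                  # 1 = pseudo-OOD, 0 = pseudo-ID
        X, y = np.concatenate(feats), np.concatenate(labels)
        if len(np.unique(y)) == 2:                     # need both classes to fit
            clf = LogisticRegression(max_iter=1000).fit(X, y)
            current = lambda x, c=clf: (c.predict_proba(x)[:, 1] > 0.5).astype(int)
    return current                                     # final pseudo-labeling rule
```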
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The work explores an exciting direction: moving beyond static OOD models toward ones that adapt over time with real-world exposure. It’s one of the first studies I’ve seen applying this idea in the medical domain, which adds practical relevance. The introduction of AUF (Area Under FPR@95) and AUA (Area Under AUC curve) as evaluation metrics is a nice contribution, offering a tailored way to quantify model performance in the IDE setting.
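For readers unfamiliar with these metrics, one plausible way to compute an area under a metric curve over deployment iterations is sketched below (trapezoidal integration of FPR@95 for AUF, or of AUROC for AUA, against the fraction of deployment data seen). The exact definitions in the paper may differ; the normalization and the sample values here are assumptions.

```python
import numpy as np

def area_under_deployment_curve(values, fractions=None):
    """Trapezoidal area under a per-iteration metric curve (e.g., FPR@95 for
    AUF, AUROC for AUA), normalized so a constant curve returns its value.
    values[i] is the metric after deployment step i; fractions[i] is the
    fraction of deployment data seen (defaults to evenly spaced in [0, 1])."""
    values = np.asarray(values, dtype=float)
    if fractions is None:
        fractions = np.linspace(0.0, 1.0, len(values))
    area = 0.5 * np.sum((values[1:] + values[:-1]) * np.diff(fractions))
    return area / (fractions[-1] - fractions[0])

# Hypothetical example: AUROC and FPR@95 measured after each of 5 iterations.
aua = area_under_deployment_curve([0.78, 0.82, 0.86, 0.88, 0.90])  # higher is better
auf = area_under_deployment_curve([0.60, 0.48, 0.35, 0.30, 0.25])  # lower is better
```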
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors reference prior adaptive methods [4, 28] and claim they rely on the entire unlabeled test set for adaptation. This isn't entirely accurate: those methods often use an adaptive π-mixing strategy, which can operate on just a subset of the test set. While larger sets may be used in practice, there's nothing inherent in those approaches that prevents using only a small number of samples. In fact, this work includes such an implementation as a baseline, reinforcing that point. Additionally, it's worth noting that hyperparameters are tuned on CIFAR-10, a dataset quite different from the medical domains evaluated here (NIH, MURA, and DRD). Some discussion around this domain mismatch would help contextualize the results. Additional implementation details of the baseline methods should also be discussed to ensure the comparison is fair. Lastly, the abbreviation CSO isn't introduced until page 4; it would improve readability if it were defined earlier, ideally when the method is first mentioned.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper addresses the important problem of adapting OOD detectors post-deployment through iterative exposure to real-world outliers, proposing the CSO framework. While the novelty is somewhat limited - building on known components from prior work - the application to the medical domain is timely, and the strong empirical results across several benchmarks make it a relevant contribution. The introduction of AUF and AUA as task-specific metrics is also a useful addition. However, the paper somewhat overstates the limitations of prior adaptive methods and provides limited information on its implementation as baselines. Clarification on the mentioned issues could improve my understanding of the work and overall score.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper presents Iterative Deployment Exposure (IDE) as a realistic scenario for U-OOD detection. The authors propose to iteratively update the model using unlabeled deployment data encountered over time. The authors also propose Confidence-Scaled OOD detection (CSO), combining a novel Mahalanobis nearest-neighbor scoring function and a confidence-based few-shot learner. Additionally, the authors introduce new evaluation metrics, AUF and AUA, to assess performance improvements over iterative deployments. Extensive experiments conducted on three medical imaging benchmarks demonstrate that CSO outperforms strong existing methods across diverse modalities, highlighting its effectiveness and practical relevance in medical imaging applications.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel IDE Formulation: The paper introduces Iterative Deployment Exposure, where OOD detectors evolve iteratively, closely matching the reality of model deployment.
- Use of Unlabeled Data: The method progressively leverages unlabeled OOD samples encountered during deployment.
- The novel CSO method dynamically combines a robust Mahalanobis nearest-neighbor approach with a confidence-based few-shot detector for downstream medical tasks.
- Comprehensive Evaluation: Extensive experiments across three medical imaging modalities with multiple strong baselines and clearly defined metrics (AUF and AUA) validate the proposed methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited theoretical support: While CSO performs well empirically, the theoretical justification for combining the few-shot detector with the binary classifier based on confidence scaling (Eq. 2, Eq. 6, Eq. 8) is limited. Why does the chosen confidence measure in the few-shot learner work? It is only a scalar (lambda) controlling the balance between the two detectors (see the sketch after this list).
- Experimental setup: (1) Although a hyperparameter sensitivity analysis is provided, the method's effectiveness still depends heavily on selecting suitable parameters, and the grid search may take a very long time when scaling to larger datasets or models. Computation time should also be compared with the baselines if the two hyperparameters need to be tuned. (2) Experiments are only conducted on a ResNet-18 backbone; I would like to see comparisons or an ablation study on alternative architectures, such as RN101 and ViT. (3) Is the data augmentation random at every step? If so, does it impact the results?
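Regarding the confidence-scaling question above, the following sketch shows one possible reading of a scalar λ blending the two detectors, with λ taken as the binary classifier's mean confidence on the batch. This is an assumption based on the review text only; the paper's Eqs. 2, 6, and 8 may define λ and the combination differently, and the two scores are assumed to be on comparable scales.

```python
import numpy as np

def confidence_scaled_blend(base_scores, clf_probs):
    """Blend a distribution-agnostic U-OOD score with a binary classifier's
    OOD probability. lambda is taken here as the classifier's mean max-class
    probability over the batch (an assumption, not the paper's definition).
    Both inputs are assumed to be normalized to comparable [0, 1] ranges."""
    confidence = np.maximum(clf_probs, 1.0 - clf_probs)  # per-sample confidence
    lam = float(confidence.mean())                       # scalar weight in [0.5, 1]
    return (1.0 - lam) * base_scores + lam * clf_probs

# Hypothetical usage:
blended = confidence_scaled_blend(np.array([0.1, 0.7, 0.4]), np.array([0.05, 0.9, 0.6]))
```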
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
My recommendation of weak accept is primarily based on the practical relevance and the empirical results. If the authors can strengthen the theoretical support and show the method's effectiveness even with heavy hyperparameter tuning, it will strengthen the contribution of this work.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors define a novel problem setup of continual learning for unsupervised out-of-distribution detection (U-OOD). In this setting, the goal is to continuously improve the U-OOD performance of a deployed model over time as more data becomes available. The authors argue that this is a practically useful problem setup in the medical domain. They propose a method to improve U-OOD detection in this setting by adaptively balancing between a low-shot U-OOD method (based on outlier detection in the classifier's embedding space) and a fully supervised neural-network-based OOD detector. They show its superiority to U-OOD methods that do not use the continual approach.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Strong novelty: The paper presents a new problem setting of incremental U-OOD iterative deployment exposure. Looking at U-OOD in the continual learning setting makes sense and matches many real-world deployment settings. This is a general and clinically relevant problem.
- Each component of the method is well-explained and makes sense intuitively. The idea of adaptively balancing a low-shot learner with a stronger learner as more OOD samples become available is sound.
- Good empirical results on the NIH and MURA datasets
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The method uses previous models to generate pseudo-labels for OOD and ID samples at each deployment iteration. There seems to be a risk of confirmation bias here: if samples incorrectly predicted as OOD early on are used to train the next stage of the OOD detector, could errors accumulate over iterations, reversing the benefit of the continual learning approach? How can we ensure the pseudo-labels are reliable? Can the authors comment on this risk and, if considered, how it was assessed and/or mitigated? Fig. 2(a) shows that this does not seem to occur for the NIH dataset, but it may occur on the other datasets where the proposed method did not perform as well. If it is a risk, and OOD detection eventually worsens instead of improving, how do we know when to stop the iteration, practically speaking?
- Findings are limited to three medical datasets, and the method only outperforms prior art on two of the three. The main improvements come from the NIH dataset. Since the problem setting is general, further experiments on more datasets would lend more support to the method's efficacy.
- Reproducibility: will code be provided? The baselines were taken from other literature and are not natively applied in the setting of IDE U-OOD, so reproducing them is not straightforward without some extra discussion or a code reference.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Visual examples of OOD vs in-distribution samples or features space visualizations would help the reader gain intuition about the method
- Why use the term “Iterative deployment exposure” instead of the more well-established “continual learning” terminology?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper takes a fresh look at OOD detection as a continual learning problem and provides a simple but effective and novel approach to address it. The approach and problem setup seem highly relevant for solving practical clinical problems.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A