Abstract
Shifts in data distribution can substantially harm the performance of clinical AI models and lead to misdiagnosis. Hence, various methods have been developed to detect the presence of such shifts at deployment time. However, root causes of dataset shifts are varied, and the choice of shift mitigation strategies highly depends on the precise type of shift encountered at test time. As such, detecting test-time dataset shift is not sufficient: precisely identifying which type of shift has occurred is critical. In this work, we propose the first unsupervised dataset shift identification framework, effectively distinguishing between prevalence shift, covariate shift and mixed shifts. We show the effectiveness of the proposed shift identification framework across three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) on five types of real-world dataset shifts, using five large publicly available datasets.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4612_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/biomedia-mira/shift_identification
Link to the Dataset(s)
PadChest: https://bimcv.cipf.es/bimcv-projects/padchest/
EMBED: https://github.com/Emory-HITI/EMBED_Open_Data/tree/main
Messidor-v2: https://www.adcis.net/en/third-party/messidor2/
Kaggle Aptos: https://www.kaggle.com/competitions/aptos2019-blindness-detection/data
Kaggle Diabetic Retinopathy: https://www.kaggle.com/c/diabetic-retinopathy-detection/data
BibTex
@InProceedings{RosMél_Automatic_MICCAI2025,
author = { Roschewitz, Mélanie and Mehta, Raghav and Jones, Charles and Glocker, Ben},
title = { { Automatic dataset shift identification to support safe deployment of medical imaging AI } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper systematically applies BBSD and MMD, existing shift identification techniques, to identify and measure prevalence and covariate shifts. It addresses an important problem in clinical generalizability.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The methodology is presented in a clear and organized manner, and the accompanying figures effectively support the explanation of the approach and results.
- The work addresses a significant and timely problem in the field of clinical AI - model generalizability across different data distributions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Unclear motivation
- While detecting distributional shifts in data is indeed important, the motivation for the proposed multi-step approach remains unclear. For instance, in cases of prevalence shift, if label information is available, such shifts can often be directly identified from the labels themselves. On the other hand, when labels are unavailable and the shift is inferred from model outputs, it is difficult to disentangle whether the observed changes are due to actual label distribution shifts or simply the model’s poor generalization to unseen datasets.
- A similar concern applies to covariate shift. If metadata is available, subgroup distribution differences can often be detected from it. In cases where a covariate is missing from the metadata, it is not clear how one can ensure that SSL-based feature extraction methods accurately capture such covariate shifts. As shown in Figure 3, different SSL techniques result in varying magnitudes of detected shift. This raises the question: could some covariates simply not be encoded in the latent space by these SSL techniques?
- If the authors aim to go beyond traditional label- or metadata-based methods by characterizing shifts learned by the model itself, this should be clearly stated in the manuscript. However, in that case, the justification for using different SSL encoders to identify covariate shifts becomes even more crucial and should be clearly explained. Overall, clarifying the scope, goals, and advantages of the proposed approach over more direct methods would greatly strengthen the paper.
- Potentially incorrect or misleading claims
- The manuscript claims that the ‘Duo’ approach performs best overall. However, as shown in Figure 3, either BBSD or MMD outperforms ‘Duo’ in several cases. Moreover, for larger dataset sizes, the performance of ‘Duo’ appears comparable to that of other techniques rather than clearly superior. This claim should be revised or better supported by the results.
- In the background section, the authors state that P_ref(X) = P_test(X) in the case of prevalence shift. This appears to be inaccurate and should be corrected or clarified.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
See weakness above
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper tackles an important issue in clinical AI, the motivation and methodological justification are unclear. The proposed approach adds complexity without sufficiently explaining why existing label- or metadata-based methods are inadequate. It is also unclear how the method distinguishes between true distributional shifts and poor model generalization, especially when labels are unavailable.
The use of SSL-based feature extraction for detecting covariate shifts lacks justification, as different SSL techniques yield inconsistent results.
Due to these concerns, I do not recommend acceptance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal clarifies some points that were previously unclear to me. I think it meets the standards for acceptance at MICCAI.
Review #2
- Please describe the contribution of the paper
This paper proposes an unsupervised dataset shift identification framework to effectively distinguish between prevalence shift, covariate shift, and mixed shifts. Its effectiveness is validated on three different imaging modalities (chest radiography, digital mammography, and retinal fundus images) and five types of real-world dataset shifts, using five large publicly available datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A practical formulation for detecting dataset shift, which could be very useful for AI deployment. Domain shift is a persistent problem; if the type of shift can be accurately detected, an exact countermeasure can be designed to mitigate it. The description of the implementation is clear.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The methodology appears to be solid but may not have much novelty.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1) While detecting the exact type of shift is important for deploying AI models, can you discuss what might result if we simply apply domain adaptation techniques without knowing the exact type of domain shift? This may further highlight the importance of the work. 2) A minor suggestion: the presentation of the paper could be improved by incorporating more illustrative block diagrams/tables instead of long descriptions of the implementation process.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
A practical investigation and validation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper presents a method to detect and distinguish three types of test-time data shift (prevalence shift, covariate shift, and mixed prevalence/covariate shift).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The evaluation is very extensive and appears to assess realistic test cases well. The task of identifying test-time data shift seems critical for clinical translation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper basically assembles existing methods for data shift detection to distinguish different types of shift. The claimed value of identifying the type of data shift is not clearly demonstrated, as methods to automatically correct data shift were not tested. Moreover, the type of shift alone might not be enough to understand the details of the shift (which labels or which attributes are shifted).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Novelty is limited (assembles existing methods), and the value is not fully clear (how well does automatic correction work? Does knowing the type of shift really help identify the problem in the shifted data?).
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The method is well explained and a large set of experiments was performed. The authors promised to explain the motivation and value of the method in more detail.
Author Feedback
[R1,R2,R3] Why is it necessary to solve the shift identification problem? As highlighted by R2, automated shift identification ‘is critical for clinical translation’. It is key for (i) effective and targeted model auditing and (ii) for effectively mitigating the consequences of such shifts. Indeed, many domain adaptation techniques are shift-specific: applying the wrong mitigation method to the wrong type of shift may substantially worsen model performance. For example, image-harmonisation techniques like [30] or automatic correction methods like [19] effectively mitigate effects of acquisition shifts on model performance but will fail in the case of prevalence shift. Conversely, applying label shift adaptation methods when the shift is actually caused by covariate shift can drastically degrade model calibration and clinical metrics. We will strengthen the motivation of our work in the introduction.
[R2,R3] Novelty We propose the first framework able to identify the type of test-time shifts in an unsupervised manner for imaging data, beyond solely detecting shifts. This is the main novelty of our work. We acknowledge that individual components are based on the shift detection literature; however, the overall framework is novel and solves an important and currently open problem. Our results provide evidence that the proposed method generalises across a large variety of shifts, imaging modalities and tasks. Besides this main contribution, our analysis in 4.1 also provides novel insights into existing shift detection methods.
[R1] Why is label, output and metadata monitoring not enough to solve shift identification? Our framework is intended as a monitoring tool for deployed classifiers, hence we must assume that we do not have access to labels (especially as ground truth can sometimes take weeks to be established, e.g. cancer biopsy results). Moreover, solely monitoring model outputs is insufficient to identify the type of shift, as different types of shifts can have similar effects on the output distribution. That is why our identification framework uses both low-dimensional representations of images (from SSL encoders) and model outputs. As discussed in Sec 5, metadata monitoring can complement our method, but it is not sufficient on its own as (i) it cannot detect prevalence shift and (ii) it can only detect changes related to currently monitored metadata; if the variable causing the shift is not monitored, the shift will go undetected.
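The two-detector design described in this response can be sketched in a few lines: a kernel two-sample test on encoder features (here, an RBF-kernel MMD with a permutation test) plus a BBSD-style test on model outputs (here, a single Kolmogorov-Smirnov test). The decision rule below is a deliberately simplified illustration under these assumptions, not the authors' actual pipeline; the function names, kernel bandwidth, and thresholds are hypothetical.

```python
import numpy as np
from scipy.stats import ks_2samp

def rbf_mmd2(x, y, gamma=0.5):
    """Biased estimate of squared MMD between two samples under an RBF kernel."""
    def gram(a, b):
        sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-gamma * sq)
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

def mmd_pvalue(x, y, n_perm=200, seed=0):
    """Permutation p-value for H0: x and y come from the same distribution."""
    rng = np.random.default_rng(seed)
    observed = rbf_mmd2(x, y)
    pooled = np.vstack([x, y])
    n = len(x)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if rbf_mmd2(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def identify_shift(feats_ref, feats_test, out_ref, out_test, alpha=0.05):
    """Crude decision rule combining the two detectors (illustration only)."""
    covariate = mmd_pvalue(feats_ref, feats_test) < alpha       # feature-level test
    prevalence = ks_2samp(out_ref, out_test).pvalue < alpha     # output-level test
    if covariate and prevalence:
        return "mixed shift"
    if covariate:
        return "covariate shift"
    if prevalence:
        return "prevalence shift"
    return "no shift detected"
```

In this toy rule, a feature-level rejection alone is read as covariate shift and an output-level rejection alone as prevalence shift; the paper's actual framework is more careful, since prevalence shift also moves P(X) and the two tests interact.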
[R1] Why are SSL encoders better to detect covariate shifts? SSL encoders are not biased towards a specific task, but instead encode a general summary of the semantic information of each image. For example, while a pneumonia detection model focuses on pneumonia-like features, an SSL model will also encode other image characteristics (e.g. gender or scanner) in the learned representation, providing a better signal for detecting covariate shifts (e.g. acquisition shift). While there are indeed differences across SSL encoders, they all outperform their supervised counterparts for detection of covariate shifts across all scenarios, which is an important finding (Fig 3).
[R1] Validity of 'Duo performs best overall' claim With this claim, we mean that it yields the highest average detection rate across shifts. We do not claim that Duo outperforms BBSD/MMD in every case. For example, in PadChest, with 1000 test cases: MMD (ImageNet-SSL features) has a 65% detection rate for prevalence shift and 100% for gender and acquisition shift, i.e. an average of 88% detection across shifts. BBSD has an average of (100+5+75)/3 = 60% across shifts, and Duo has an average of ~100%. Hence overall, averaged across all shifts, Duo is best (with similar results for EMBED and other amounts of test data). This justifies the validity of our claim. We will clarify our point in the results.
[R1] Typo in definition of prevalence shift Thanks for pointing this out, we meant to write P_ref(X|Y) = P_test(X|Y), we will update this in the paper.
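For reference, the corrected statement matches the standard definitions of the two shift families (a reminder added here for clarity, not text from the paper):

```latex
% Prevalence (label) shift: P(Y) changes, appearance given the label is preserved
P_{\mathrm{ref}}(Y) \neq P_{\mathrm{test}}(Y),
\qquad P_{\mathrm{ref}}(X \mid Y) = P_{\mathrm{test}}(X \mid Y)

% Covariate shift: P(X) changes, the labelling function is preserved
P_{\mathrm{ref}}(X) \neq P_{\mathrm{test}}(X),
\qquad P_{\mathrm{ref}}(Y \mid X) = P_{\mathrm{test}}(Y \mid X)
```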
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors have addressed the concerns raised by all three reviewers, each of whom now leans toward acceptance. It is recommended that the authors further refine the current version in accordance with the reviewers' suggestions.