Abstract
Deep learning-based medical image classification techniques are rapidly advancing in medical image analysis, making it crucial to develop accurate and trustworthy models that can be efficiently deployed across diverse clinical scenarios. Concept Bottleneck Models (CBMs), which first predict a set of explainable concepts from images and then perform classification based on these concepts, are increasingly being adopted for explainable medical image classification. However, the inherent explainability of CBMs introduces new challenges when deploying trained models to new environments. Variations in imaging protocols and staining methods may induce concept-level shifts, such as alterations in color distribution and scale. Furthermore, since CBM training requires explicit concept annotations, fine-tuning models solely with image-level labels could compromise concept prediction accuracy and faithfulness - a critical limitation given the high cost of acquiring expert-annotated concept labels in medical domains. To address these challenges, we propose a training-free confusion concept identification strategy. By leveraging minimal new data (e.g., 4 images per class) with only image-level labels, our approach enhances out-of-domain performance without sacrificing source domain accuracy through two key operations: masking misactivated confounding concepts and amplifying under-activated discriminative concepts. The efficacy of our method is validated on both skin and white blood cell images. Our code is available at: \url{https://github.com/riverback/TF-TTI-XMed}.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1672_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/riverback/TF-TTI-XMed
Link to the Dataset(s)
Fitzpatrick17k dataset: https://github.com/mattgroh/fitzpatrick17k
DDI dataset: https://ddi-dataset.github.io/index.html#paper
PBC: https://data.mendeley.com/datasets/snkd93bnjr/1
RabbinWBC: https://raabindata.com/free-data/
Scirep: https://www.nature.com/articles/s41598-023-29331-3
BibTex
@InProceedings{HeHan_Trainingfree_MICCAI2025,
author = { He, Hangzhou and Tang, Jiachen and Zhu, Lei and Li, Kaiwen and Lu, Yanye},
title = { { Training-free Test-time Improvement for Explainable Medical Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {649 -- 659}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a training-free confusion concept identification strategy to improve the out-of-domain performance of Concept Bottleneck Models (CBMs) in medical image classification. By masking confounding concepts and amplifying under-activated discriminative concepts, the OOD performance of CBMs can be improved.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-written and organized. The idea to do test-time adaptation for CBMs is interesting. The method is simple but effective for certain datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The practical significance of the paper is limited. Although test-time adaptation improves OOD performance for CBM, it compromises model interpretability — the concept explanations are still unreliable due to the large domain shifts. And the intervention ability of CBM also degrades. Besides, some results are poor—for example, on RaabinWBC, fine-tuning achieves 75.83 F1-score while the proposed method only reaches 22.93—indicating limited practical value. Fine-tuning may reduce in-domain performance, but it enables practically usable performance in OOD settings (F1-score: 75.83) while maintaining interpretability and intervenability.
The paper only compares with fine-tuning and lacks comparisons with other test-time adaptation methods and approaches based on foundation models. Leveraging the strong generalization ability of foundation models to improve OOD concept prediction may be a more practical and feasible approach [1, 2].
[1] Language in a bottle: Language model guided concept bottlenecks for interpretable image classification, CVPR 2023. [2] Label-free concept bottleneck models, ICLR 2023.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
See section 7.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
After reading the authors’ rebuttal, I still have concerns about the practical value of the proposed method, so I maintain my rejection decision.
Review #2
- Please describe the contribution of the paper
The manuscript proposes a training-free strategy for improving the performance of concept bottleneck models. The results show that the proposed strategy helps the CBM achieve a balance between accuracy and faithfulness.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors' observation of the present challenge of CBMs is accurate. The proposed method addresses the challenge (no need for training).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The manuscript claims an improvement in the faithfulness of the CBM. However, according to existing findings [1,2,3], training the model jointly (i.e., the training scheme in the paper) can introduce information leakage in the concept bottleneck. This leakage can produce high concept-prediction performance that is a "false hope," since the model learns these concepts as surrogates for the classification label. This means that the model's faithfulness is not truly improved.
[1] Addressing Leakage in Concept Bottleneck Models, NeurIPS 2022 [2] Do concept bottleneck models learn as intended? [3] Promises and pitfalls of black-box concept learning models
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Just some additional comments: for the stated weakness, in the improved version of the submission, I would suggest the authors do an additional experiment to exclude this case. The authors should evaluate the improvement of the model performance, under human intervention. That is, by replacing more and more concept prediction with the ground truth concept values, the model performance should be gradually improved. Please check the original concept bottleneck model paper for this experiment.
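The intervention experiment suggested above can be sketched as follows. This is an illustrative implementation only (function names and the index-order intervention schedule are my own, not from the submission or the original CBM paper): predicted concepts are progressively replaced by ground-truth values, and classification accuracy is recorded after each step.

```python
import numpy as np

def intervention_curve(concept_preds, concept_gt, labels, label_predictor, max_steps=None):
    """Replace predicted concepts with ground-truth values, one concept
    at a time, and record classification accuracy after each step.
    A faithful CBM should show accuracy improving as interventions accumulate.

    concept_preds:   (n, k) predicted concept probabilities
    concept_gt:      (n, k) ground-truth concept values
    labels:          (n,) class labels
    label_predictor: callable mapping (n, k) concepts -> (n,) class predictions
    """
    n, k = concept_preds.shape
    steps = range(0, (max_steps or k) + 1)
    accs = []
    concepts = concept_preds.copy()
    for step in steps:
        if step > 0:
            # Intervene on one more concept. Here concepts are fixed in
            # index order for simplicity; the original CBM paper also
            # studies other intervention orderings.
            concepts[:, step - 1] = concept_gt[:, step - 1]
        preds = label_predictor(concepts)
        accs.append(float((preds == labels).mean()))
    return list(steps), accs
```

On a faithful model the returned accuracy curve should rise (roughly monotonically) toward the accuracy obtained with fully ground-truth concepts.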
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal addresses my concern well. I would suggest accepting this submission, as it is a qualified candidate for this year's MICCAI.
Review #3
- Please describe the contribution of the paper
The paper focuses on concept bottleneck models for interpretable image classification. The problem at hand is that at inference time, there might be distribution shift that hurt the performance of the intermediate concept predictions and overall classification performance. The proposed method detects and adapts the model to distribution shifts in the target distribution
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The idea of test-time adaptation for concept bottleneck models is compelling
- The proposed method is intriguing
- Experimental results are promising
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Notation could be clarified and simplified
- Eq. (6) could be grounded in statistical testing
I have a few clarifying questions and I am looking forward to discussing with the authors!
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Notation: Notation could be simplified and made more intuitive. For example, in Eq. (1) it is not stated where the predicted concepts live; are these binary or continuous? In Eqs. (3) and (5), sets are not functions, so the notation "{ }(k)" is confusing. In Eq. (6), the use of the brackets "[ ]" for concatenation is not obvious; also, if sigma is a function, this should be corrected to sigma([ ]). If "[ ]" means concatenation, it is overloaded with entrywise access in Eqs. (4) and (6).
Making sure notation is clear and concise will improve the accessibility of the manuscript.
- Statistical foundations of Eq. (6): It is my understanding that the goal of the algorithm is to detect concepts for which there has been a shift between the source and target domains. The intuition of doing so by comparing the distribution of the class-wise activations is neat and simple. However, Eq. (6) is heuristic, and it could be made more rigorous, for example, with common test statistics for two Gaussian populations with unknown mean and variance.
Could the authors expand on the intuition behind including the mean from the source domain to estimate the standard deviation of the target?
Could the authors expand on which kind of shifts might not be detected by Eq. (6)? For example, it is assumed shifts will increase the variance, but this may not always be the case.
Finally, it should be formalized how a drifting concept is grouped into under-/over-activated. So far, this is only described in passing in the text (Lines 143-145).
- Experiments: Is fine-tuning performed with concept-level annotations or image labels only?
- Minor comments
- the term “confusion concept” is unclear and could be rephrased, maybe “drifting concept”?
- certain terms are hand-wavy, e.g. “particularity”, “renowned”, “sustainable solution”, “emerging self-explainable”, “mixed blessing”
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The submission is interesting and would be valuable to the community. The text and claims need clarification to improve the presentation of the contribution. Better motivation behind Eq. (6) would also improve the quality of the submission.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The submission remains borderline after the rebuttal phase. Suggestions from other reviewers to compare more extensively with existing methods would strengthen the evidence in support of the submission. However, the contribution is valuable to the broader medical imaging sub-field. The clarification points included in the rebuttal should be incorporated in the final version of the manuscript.
Author Feedback
We thank all reviewers for their constructive feedback. Below, we address the key concerns.
R1 Q1 & R3 Q1, intervention ability A1: We adopted the joint CBM architecture primarily because the baseline (CBM+K) requires it to incorporate domain knowledge. To rule out concept leakage, we performed both ID and OOD interventions on the skin disease dataset (the only dataset providing concept labels for both domains).
Except for fine-tuning, all methods showed performance improvements with increasing intervention steps on both domains (our strategy skips handling intervened concepts), and ours even slightly outperformed the others at the initial step. In contrast, fine-tuned CBMs showed minimal or negative gains, supporting that our approach maintains interpretability and intervenability while fine-tuning compromises them.
R2 Q1 & R2 Q4, clarity A2: \hat c denotes the continuous concept prediction probabilities. We will replace {}(k) with the subscript {}_k to denote the k-th element of a set. In Eq. (6), the [] indicates concatenation; to avoid confusion with the entrywise-access brackets in Eqs. (4) and (6), we will use concat(\cdot) instead. We will review all notation carefully in the revision.
R2 Q2, statistical basis and intuition of Eq. 6 A3: Our intuition was to identify concepts whose prediction values exhibit significant shifts. Based on the following observations, we chose to compare the variance and leverage ID mean in Eq.6:
- If a concept has low variance in the source domain, the concept is likely discriminative for a particular class. Therefore, if the concept’s prediction fluctuates significantly in the test domain, it implies unreliability.
- If a concept already exhibits high variance in the source domain, it is likely not discriminative. Eq. 6 inherently downweights such noisy concepts because the comparison is against the ID variance. When such a concept is still flagged by Eq. 6, it is usually confusing across all classes and should be masked.
We appreciate the suggestion of using formal tests like Welch’s t-test. While more rigorous, we retain Eq. (6) for its simplicity and effectiveness in achieving our goals, leaving more advanced statistics for future work.
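As a rough illustration of the variance-comparison intuition described above, the heuristic could look something like the sketch below. This is not the paper's exact Eq. (6); the function name, the threshold, and the score definition are assumptions for illustration only. It measures the spread of target-domain activations around the source-domain mean, so that both mean shifts and variance inflation raise the score, then splits flagged concepts into under-/over-activated candidates.

```python
import numpy as np

def flag_drifting_concepts(src_acts, tgt_acts, threshold=2.0):
    """Flag concepts whose target-domain activations deviate strongly
    from source-domain statistics (illustrative heuristic, not Eq. (6)).

    src_acts, tgt_acts: arrays of shape (n_samples, n_concepts) holding
    predicted concept probabilities for one class.
    Returns boolean masks (under-activated, over-activated).
    """
    mu_src = src_acts.mean(axis=0)
    var_src = src_acts.var(axis=0) + 1e-8  # avoid division by zero
    # Spread of target activations measured around the *source* mean,
    # so a shifted mean inflates the score even if target variance is low.
    var_tgt = ((tgt_acts - mu_src) ** 2).mean(axis=0)
    score = var_tgt / var_src
    drifting = score > threshold
    # Split flagged concepts by their direction relative to the source mean.
    mu_tgt = tgt_acts.mean(axis=0)
    under = drifting & (mu_tgt < mu_src)   # candidates to amplify
    over = drifting & (mu_tgt >= mu_src)   # candidates to mask
    return under, over
```

Normalizing by the source variance is what downweights concepts that were already noisy in the ID data, matching the second bullet point above.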
R2 Q3, experiments A4: Fine-tuning is done with image-level labels only.
R3 Q1, compared to fine-tuning A5: Our method identifies concepts that become unreliable and negatively affect OOD performance, and keeps/enhances the reliable ones, thereby maintaining or even slightly improving ID performance and interpretability (Fig. 3). In contrast, fine-tuning degrades ID performance significantly (e.g., from 99.65 to 69.92) despite achieving 75.83 F1 on OOD. Moreover, as noted in R1 Q1, fine-tuning severely harms intervention ability, whereas ours preserves it.
We kindly ask you to reconsider the value of our approach in this context.
R3 Q2, compared to unsupervised CBMs built on foundation models A6: Unsupervised CBMs like LaBo and LF-CBM leverage foundation models mainly to reduce annotation cost. However, since their label predictors are still trained on ID data, there is limited evidence that they generalize better to OOD. Additionally, their large concept banks (e.g., 50 concepts/class in LaBo) often contain noisy activations, reduce interpretability, and often fail to support effective intervention. The problems of unfaithful CLIP performance and concepts in language-based CBMs are also discussed in existing work (e.g., Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts, ICCV 2023).
In contrast, our method improves OOD performance without retraining or added modules, while retaining the CBM’s original concept structure and user intervenability (as mentioned in A1). Though integrating CLIP at test-time might be promising, it requires further study and is beyond the scope of this work.
We hope the above responses address the main concerns of all reviewers and kindly ask you to consider raising your rating.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
My recommendation is based on:
- vox populi: a 2-1 vote among three engaged reviewers. The authors are fortunate to have drawn this group.
- A good author rebuttal.
- The concerns raised by R3 about OOD usefulness are perhaps not answerable in this one paper, so I somewhat downweighted them.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A