Abstract

Although explainability is essential in clinical diagnosis, most deep learning models still function as black boxes without elucidating their decision-making process. In this study, we investigate the development of explainable models that mimic the decision-making process of human experts by incorporating the domain knowledge of explicit diagnostic criteria. We introduce a simple yet effective framework, Explicd, towards Explainable language-informed criteria-based diagnosis. Explicd begins by querying domain knowledge from either large language models (LLMs) or human experts to establish diagnostic criteria across various concept axes (e.g., color, shape, texture, or specific disease patterns). By leveraging a pretrained vision-language model, Explicd injects these criteria into the embedding space as knowledge anchors, thereby facilitating the learning of corresponding visual concepts within medical images. The final diagnostic outcome is determined by the similarity scores between the encoded visual concepts and the textual criteria embeddings. Through extensive evaluation on five medical image classification benchmarks, Explicd demonstrates its inherent explainability while also improving classification performance over traditional black-box models.
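
A minimal sketch of the pipeline the abstract describes, as we read it (all names and shapes are our assumptions, and random tensors stand in for the CLIP image and text encoders): learnable visual concept tokens cross-attend to image patch features, and class logits come from the similarity of each concept to the textual criteria anchors on its axis.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ExplicdSketch(nn.Module):
        def __init__(self, dim=512, n_concepts=6, n_criteria=4, n_classes=7):
            super().__init__()
            # One learnable query token per concept axis (color, shape, ...).
            self.concept_tokens = nn.Parameter(torch.randn(n_concepts, dim))
            self.cross_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                    batch_first=True)
            # Knowledge anchors: in the paper these would be CLIP text
            # embeddings of the criteria; random values stand in here.
            anchors = torch.randn(n_concepts, n_criteria, dim)
            self.register_buffer("anchors", F.normalize(anchors, dim=-1))
            # Linear head over the flattened concept-criteria similarities.
            self.head = nn.Linear(n_concepts * n_criteria, n_classes)

        def forward(self, patch_feats):               # (B, N_patches, dim)
            q = self.concept_tokens.unsqueeze(0).expand(
                patch_feats.size(0), -1, -1)
            concepts, _ = self.cross_attn(q, patch_feats, patch_feats)
            concepts = F.normalize(concepts, dim=-1)  # (B, n_concepts, dim)
            sims = torch.einsum("bcd,ckd->bck", concepts, self.anchors)
            return self.head(sims.flatten(1))         # (B, n_classes)

    model = ExplicdSketch()
    logits = model(torch.randn(2, 196, 512))  # e.g. a ViT-B/16 patch grid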

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0117_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/yhygao/Explicd

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gao_Aligning_MICCAI2024,
        author = { Gao, Yunhe and Gu, Difei and Zhou, Mu and Metaxas, Dimitris},
        title = { { Aligning Human Knowledge with Visual Concepts Towards Explainable Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a novel interpretability framework to address the lack of interpretability in AI-based medical image diagnosis. The framework integrates a foundational vision-language model (VLM), a large language model (LLM), and human expert knowledge: key visual concepts are constructed from the LLM and human expert knowledge, and the vision-language model is then trained to align these key concepts with image samples, thereby providing explanations during classification. Experiments were carried out on medical image datasets covering five different diseases to verify the method's performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The framework, logic, and theme of the article are sound.
    • The proposed method combines foundation models with human knowledge, which is a timely topic, and the experiments show that its classification performance is excellent.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    As far as I know, there are many similar works that match text concepts with images using large vision-language models, for example:

    • Oikarinen, T., Das, S., Nguyen, L. M., et al. Label-free concept bottleneck models. arXiv preprint arXiv:2304.06129, 2023.
    • Yuksekgonul, M., Wang, M., Zou, J. Post-hoc concept bottleneck models. arXiv preprint arXiv:2205.15480, 2022.
    • Zarlenga, M. E., Barbiero, P., Ciravegna, G., et al. Concept embedding models. NeurIPS 2022.

    In the current version of the paper, LaBo is the only interpretability method the authors compare against. The authors may consider adding comparative experiments against the works above, which would be more convincing.

    As for the novelty of the method: in my opinion, the current method only adds a trainable visual concept token module on top of the LaBo framework and connects it to the vision-language model through an attention mechanism. This improvement enables the model to provide interpretation at the visual level and to obtain a heat map in the final interpreted image. However, there are still many points that could be improved on this basic framework. For example, the authors have not modified the text encoder; can the existing text encoder be guaranteed to fully recognize and understand specialized medical terminology? In addition, during text concept collection, LaBo introduces a text concept screening module to improve text quality, so how to detect and improve text quality for medical scenarios is also an innovation that could be added to this paper's method.

    The authors claim that the method makes progress in both classification and interpretability, but the interpretability experiments give only one explanatory example. To support this claim, I hope the authors can provide more explanatory examples or a quantitative evaluation of interpretability, and compare against other similar interpretability methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    none

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See main weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors are technical novelty and experimental integrity. I am willing to upgrade my score depending on the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes an explainable AI model that enables models to mimic human decision-making by utilizing diagnostic criteria from either large language models (LLMs) or human experts. The paper also introduces a visual concept learning module and criteria anchor contrastive loss to align visual features with diagnostic criteria.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors assess their proposed method across five publicly available medical datasets, covering various medical targets and modalities such as dermoscopic images, ultrasound images of breast masses, and chest X-ray images. In all cases, their method surpasses existing approaches, showcasing its ability to generalize. Additionally, the authors provide a comprehensive discussion of existing works, highlighting their challenges, and explain how the proposed method achieves superior performance.
    • The diagnostic criteria formulation can be gathered through LLMs, effectively bypassing the need for human attribute annotation.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Although the paper claims that domain knowledge can be queried from either LLMs or human experts, the experiments use only knowledge from LLMs. The results presented in Table 1 are obtained using GPT-4 to acquire domain knowledge, but what if human knowledge were utilized instead? Which approach would yield better results? Is GPT-4 comparable to (or even better than) human experts? Since the paper claims that domain knowledge can come from humans as well, this aspect should have been demonstrated as part of the experiments.
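
    For reference, eliciting such criteria from an LLM takes only a few lines; the prompt below is our own wording (only the six concept axes are taken from the paper's skin-lesion example), and it assumes the openai>=1.0 Python SDK with an OPENAI_API_KEY set in the environment:

        from openai import OpenAI

        client = OpenAI()
        axes = ["asymmetry", "border", "color", "diameter", "texture", "pattern"]
        prompt = (
            "List concise visual diagnostic criteria that distinguish melanoma "
            "from benign nevi in dermoscopic images, one short phrase per "
            f"concept axis: {', '.join(axes)}."
        )
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        print(resp.choices[0].message.content)

    Repeating the same experiment with a dermatologist-authored criteria list in place of the LLM output would directly answer the human-vs-GPT-4 question raised above.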

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    What if the method used GPT-4V to obtain interpretable diagnostic criteria directly from images? This would be a very interesting experiment.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors introduce a method for incorporating expert knowledge into model predictions for medical image classification tasks, eliminating the requirement for human annotations of image-based concepts or attributes. They validate their approach across five distinct benchmarks, demonstrating its broad applicability.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose Explicd, a new framework towards explainable language-informed criteria-based diagnosis. Explicd is essentially a natural extension of the concept bottleneck models proposed by Koh et al. (2020). Compared with other concept bottleneck studies, Explicd introduces extra (learnable) visual tokens that transform the visual representations yielded by the original CLIP visual encoder into different visual concepts. The authors validate the method on five medical datasets, outperforming the LaBo baseline (Yang et al., 2022) and end-to-end ViT models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Methodology: The idea of introducing additional visual tokens and further finetuning these tokens and visual encoder is interesting and novel.
    • Effectiveness: Explicd’s main strength is that it outperforms end-to-end models such as ViT-B and ResNet-50 on all five tasks.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The collection of diagnostic criteria is not clear: I believe the performance of the proposed Explicd depends highly on the expressiveness and completeness of the diagnostic criteria. However, the authors offer little insight into how they collect and prompt LLMs to generate useful diagnostic criteria. They only briefly report their prompting strategy in the case of skin lesions, i.e., prompting the LLM along six aspects: asymmetry, border, color, diameter, texture, and pattern. It is not clear to me: how many criteria were generated and used in the experiments? Are they all useful? How can you ensure those criteria are disentangled, as mentioned in Section 3.1?
    • Please provide more information on the diagnostic criteria for tasks like histopathology, fundus images, ultrasound, and chest X-rays.
    • The heatmap visualization in Fig. 2(b) needs a colorbar to present the similarity scores.
    • No statistical evaluation of results: paired tests would lend statistical weight to the claimed "superiority" of the proposed method (a sketch follows this list).
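
    A minimal sketch of the paired testing suggested above (the scipy calls are real; the per-fold accuracies are made up purely for illustration, since the paper reports no per-fold numbers):

        from scipy.stats import ttest_rel, wilcoxon

        # Hypothetical 5-fold accuracies for Explicd vs. a baseline.
        explicd  = [0.840, 0.860, 0.830, 0.850, 0.870]
        baseline = [0.815, 0.842, 0.801, 0.838, 0.853]

        t_stat, t_p = ttest_rel(explicd, baseline)   # paired t-test
        w_stat, w_p = wilcoxon(explicd, baseline)    # non-parametric analogue
        print(f"paired t-test p={t_p:.3f}, Wilcoxon p={w_p:.3f}")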
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Compared with previous concept bottleneck models, the authors propose to further divide the visual features V(x) into several subspaces {p_i}. This learnable prompting technique appears to improve concept bottleneck models greatly. However, the motivation for this technical modification is not clearly conveyed in the paper. It would be helpful and important for future work to demonstrate why training the learnable tokens {p_i} with contrastive learning leads to better performance; I think this is quite an important question.
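
    One plausible reading of the criteria-anchor contrastive training discussed here (the function name, shapes, and loss form are our assumptions, not the authors' exact formulation): each visual concept embedding produced via the learnable tokens {p_i} is pulled toward the criteria anchor matching the image's label and pushed away from the other anchors on the same concept axis, InfoNCE-style.

        import torch
        import torch.nn.functional as F

        def criteria_anchor_contrastive(concepts, anchors, target, tau=0.07):
            # concepts: (B, C, D) L2-normalized visual concept embeddings
            # anchors:  (C, K, D) L2-normalized criteria text embeddings
            # target:   (B, C)   index of the correct criterion per axis
            logits = torch.einsum("bcd,ckd->bck", concepts, anchors) / tau
            return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   target.reshape(-1))

        B, C, K, D = 2, 6, 4, 512
        loss = criteria_anchor_contrastive(
            F.normalize(torch.randn(B, C, D), dim=-1),
            F.normalize(torch.randn(C, K, D), dim=-1),
            torch.randint(0, K, (B, C)),
        )

    Training with such a loss would force each token to attend to image evidence that discriminates between criteria on its own axis, which is one hypothesis for why the subspace split helps.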

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical innovation and overall performance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available; this paper was early accepted.


