Abstract

Utilizing the potent representations of large vision-language models (VLMs) to accomplish various downstream tasks has attracted increasing attention. Within this research field, soft prompt learning has become a representative approach for efficiently adapting VLMs, such as CLIP, to tasks like image classification. However, most existing prompt learning methods learn text tokens that are unexplainable, which cannot satisfy the stringent interpretability requirements of Explainable Artificial Intelligence (XAI) in high-stakes scenarios such as healthcare. To address this issue, we propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts at multiple granularities. Moreover, our framework addresses the lack of valuable concept annotations by eliciting knowledge from large language models, and it offers both visual and textual explanations for the prompts. Extensive experiments and explainability analyses conducted on various datasets, with and without concept labels, demonstrate that our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI.
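For concreteness, here is a minimal, hedged sketch of the core idea: learnable soft-prompt tokens trained against a frozen CLIP text encoder while being pulled toward fixed clinical concept-driven prompts. All names, shapes, and loss forms are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConceptGuidedPrompts(nn.Module):
    """Illustrative sketch (not the authors' code): CoOp-style learnable
    context tokens, softly aligned with clinical concept-driven prompts."""

    def __init__(self, clip_text_encoder, n_classes=2, n_ctx=16, ctx_dim=512):
        super().__init__()
        self.text_encoder = clip_text_encoder  # frozen CLIP text encoder
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        # The only trainable parameters: n_ctx context vectors per class.
        self.ctx = nn.Parameter(0.02 * torch.randn(n_classes, n_ctx, ctx_dim))

    def forward(self, class_token_embs, clinical_prompt_feats, image_feats):
        # class_token_embs:      (n_classes, n_tok, ctx_dim) frozen class-name tokens
        # clinical_prompt_feats: (n_classes, d) encoded concept-driven hard prompts,
        #                        e.g. "a photo of melanoma, with irregular dots"
        # image_feats:           (batch, d) from the frozen image encoder
        soft = torch.cat([self.ctx, class_token_embs], dim=1)
        soft_feats = F.normalize(self.text_encoder(soft), dim=-1)
        clin_feats = F.normalize(clinical_prompt_feats, dim=-1)
        img_feats = F.normalize(image_feats, dim=-1)

        logits = 100.0 * img_feats @ soft_feats.t()  # image-prompt similarity
        # Alignment: keep the learned prompts close to clinical semantics,
        # which is what lends them a textual explanation.
        align_loss = (1.0 - (soft_feats * clin_feats).sum(-1)).mean()
        return logits, align_loss
```

In a full training loop, cross-entropy on `logits` would be combined with a weighted `align_loss`; the paper additionally aligns at finer granularities (token level and patch level) than this sketch shows.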

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0077_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0077_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Bie_XCoOp_MICCAI2024,
        author = { Bie, Yequan and Luo, Luyang and Chen, Zhixuan and Chen, Hao},
        title = { { XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed an explainable classification system based on prompt learning and vision-language models and tested it on different datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Making use of clinical reports along with images to train the AI model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Use of trivial datasets. Questionable reliability of GPT-4 in healthcare reporting.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The manuscript is interesting; however, I have the following comments to improve it.

    1. The difference between clinical prompts and soft prompts is not clear.
    2. More details about the datasets should be provided including the number of samples in each category.
    3. What is the performance of ViT-B/16 alone for classification without prompt learning?
    4. It is not clear how the explainable maps are generated.
    5. When medical images are distinguishable by the naked eye, a simple ML model can do the task, especially for the skin datasets. Why is your method required? It should be tested on more complex datasets.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Simple datasets to test the proposed methodology.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose an intriguing framework for soft prompt learning aimed at predicting the target class from a medical image. Although hard prompt learning and soft prompt learning each have distinct advantages and disadvantages, the authors seek to bridge this gap using clinical prompts generated by a commercial LLM (GPT-4). In their experiments, the authors not only present classification performance but also demonstrate explainable results using public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The most compelling aspect of this work is that the authors utilize a clinical prompt generated by a commercial LLM, which does not require any domain knowledge. This approach helps bridge the gap between hard and soft prompt learning, which are inherently complementary. Hard prompt learning necessitates manual craftsmanship and domain expertise, often incurring significant costs, whereas soft prompt learning employs randomly initialized tensor embeddings as input but lacks explainability. This work leverages the clinical prompt generated by the LLM as a hard prompt to enhance the explainability of the soft prompt (a minimal code sketch contrasting the two prompt types follows this list).
    • The authors employ four public datasets to demonstrate the validity of their proposed method, thereby enhancing its reproducibility. This study aims to perform classifications based on medical images. Therefore, they utilize four datasets that can be categorized into two groups: those with concept labels and those without. Additionally, the datasets encompass diverse modalities, including skin photographs and X-rays.
    • They present extensive experimental results, covering not only the model’s classification performance but also an ablation study and explainability. The authors compare the classification performance with similar previous works, including three CoOp series and LASP, reporting AUC and accuracy metrics. Additionally, they conduct an ablation study to validate the necessity of each component of the method. Moreover, they analyze explainability from both the perspective of prompts with domain knowledge and through visual explanations.
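    A minimal sketch of the two prompt types contrasted above (illustrative PyTorch code; the prompt text, dimensions, and optimizer settings are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

# Hard (clinical) prompt: fixed natural-language text crafted from domain
# knowledge or, as in this work, elicited from an LLM. The text is illustrative.
hard_prompt = "a dermoscopic photo of melanoma, with atypical pigment network"

# Soft prompt: randomly initialized, learnable context embeddings optimized by
# backpropagation while the VLM itself stays frozen (CoOp-style).
n_ctx, ctx_dim = 16, 512  # assumed CLIP text embedding width
soft_prompt = nn.Parameter(0.02 * torch.randn(n_ctx, ctx_dim))

# Only the soft prompt receives gradients; the readable hard prompt can then
# act as a semantic anchor for it, bridging the two paradigms.
optimizer = torch.optim.SGD([soft_prompt], lr=2e-3)
```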
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • This paper does not detail the techniques of the proposed method, making it challenging for readers unfamiliar with soft prompt learning to understand the approach. Specifically, it is unclear what the learnable parameters in the method are. Readers are expected to be familiar with the concept that soft prompt learning involves training learnable embedding vectors, which are introduced as “These methods fix the parameters of the models and train the learnable tokens that serve as the input to the text encoder” — a description that does not directly pertain to soft prompt learning. Furthermore, readers must infer that the freezing icon and flame icon in Figure 1 represent frozen and learnable parameters, respectively.
    • The paper inconsistently uses certain terms and fails to provide definitions for some abbreviations. For example, what does CCP stand for in Table 2? It likely refers to clinical concept-driven prompts, but this is not explicitly stated.
    • The structure of the Ablation Study section needs reconsideration. The authors simply compare performance across different visual backbones (ViT family and ResNet family) in the final two sentences. Labeling this comparison as an ablation study and asserting the proposed method’s robustness is problematic, especially when the results in Figure 2 demonstrate a significant performance gap between ViTs and ResNets. The analysis seems somewhat far-fetched.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    If the authors could provide more detailed information, specifically on soft prompt learning as a prerequisite, and clarify the trainable parameters of the proposed method, it would greatly enhance the paper’s clarity and reproducibility. I recommend that the implementation code be published.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the manuscript could be better written, the authors propose an interesting soft prompt learning framework and present extensive experimental results using four public datasets.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors address the main concerns well.



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors propose to enhance soft prompts by utilizing medical concepts present within a dataset or obtained by prompting a large language model (LLM). They propose to perform alignment at token and prompt levels between “context-enhanced” clinical prompts and trainable soft prompts. Additionally, they perform local (patch-level) and global (image-level) alignment of images with trainable soft prompts. They conduct experiments on various downstream tasks and show the effectiveness of their proposed approach.
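    The four alignments described above could look roughly as follows (a hedged sketch; shapes, pooling choices, and loss forms are assumptions for exposition, not the paper's exact formulation):

```python
import torch
import torch.nn.functional as F


def multi_granularity_alignment(soft_tok, clin_tok, patch_feats, img_feat, soft_feat):
    """Sketch of token-, prompt-, patch-, and image-level alignment.
    soft_tok:    (n_tok, d) token embeddings of the learnable prompt
    clin_tok:    (n_tok, d) token embeddings of the clinical prompt
    patch_feats: (n_patch, d) local visual features of one image
    img_feat:    (d,) global visual feature
    soft_feat:   (d,) encoded soft-prompt feature
    """
    soft_tok = F.normalize(soft_tok, dim=-1)
    clin_tok = F.normalize(clin_tok, dim=-1)
    patch_feats = F.normalize(patch_feats, dim=-1)
    img_feat = F.normalize(img_feat, dim=0)
    soft_feat = F.normalize(soft_feat, dim=0)

    # Token level: match each soft token to its most similar clinical token.
    l_token = (1 - (soft_tok @ clin_tok.t()).max(dim=-1).values).mean()
    # Prompt level: align the pooled prompt representations.
    l_prompt = 1 - F.cosine_similarity(soft_tok.mean(0), clin_tok.mean(0), dim=0)
    # Local image-prompt: patches vs. prompt (this drives the heatmaps).
    l_local = 1 - (patch_feats @ soft_feat).max()
    # Global image-prompt: the usual CLIP-style image-text similarity.
    l_global = 1 - img_feat @ soft_feat

    return l_token + l_prompt + l_local + l_global
```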

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The problem of aligning pre-trained models to clinically relevant concepts is an important area of research.
    • The authors conducted detailed experiments on various datasets, demonstrating the efficacy of the proposed approach while also conducting an explainability analysis.
    • To the best of my knowledge, the proposed approach to alignment seems novel.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The authors use CLIP instead of domain-specific models such as BiomedCLIP and CheXzero. This could pose problems if the tokenizer is unaware of medical concepts.
    • The usage of concepts from an LLM, although beneficial, could introduce unintended bias or misinformation.
    • Some details about the explainability analysis are missing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I would like to see the code made publicly available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • Do the authors have any thoughts about how their results might change if a domain-specific VLM is used instead of CLIP?
    • Can the authors clarify the visual explanation presented in Figure 4a? Specifically, how the maps were generated and what they signify for each subplot.
    • For the faithfulness explanation, how were the interventional prompts generated? How many samples were used for the AUC plot presented in Fig. 3b?
    • Is it necessary to do local image-prompt alignment? This is missing from the ablation study. It would be beneficial to see CCP + IPA with and without local alignment.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the work makes an important contribution to the MICCAI community by aligning pre-trained models to medical concepts. However, I am not totally convinced that the way concepts are obtained is ideal. Also, I am limiting myself to a lower score to get additional explanations for some of the points I asked in the comments above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I thank the authors for answering my questions during the rebuttal phase. After carefully reviewing the responses and the other reviews, I would like to maintain my score. I am still unclear how effective the global + local image alignment is, and the clinical prompts based on GPT-4 may introduce unintended bias. Verifying these prompts with clinicians would be beneficial.




Author Feedback

We thank the reviewers for their thoughtful comments. We appreciate that the reviewers found our paper well-organized (R1,3,4), our method interesting/novel (R1,3,4), and our experiments extensive/convincing (R3,4), and that the work can make an important contribution to the community (R4).

  1. Prompt types & trainable parameters [R1,3]. Clinical prompts are hand-crafted (hard) and fixed, combining the class name and concepts, e.g., “a photo of melanoma, with irregular dots”. Soft prompts are trainable tokens that serve as input to the text encoder. The parameters of CLIP are frozen and not trained. We will clarify the prerequisites of prompt learning and the icons in Fig. 1 in the final version.
  2. Data details [R1]. Details of the dataset categories and settings are in Table A2 of the supplementary file. We will include the number of samples (e.g., the Pneumonia dataset has 1583 normal and 4273 pneumonia images) in the final version.
  3. Baseline [R1]. Row 1 of Table 1 shows the CLIP baseline using ViT-B/16 without prompt learning. We also found experimentally that our method is comparable to a fully fine-tuned ViT alone while improving explainability with very few trainable parameters.
  4. Visualization [R1,4]. Row 1 of Fig. 4a is generated from the similarity between the local visual features of the image and the learned prompts of the class, where warm/cold colors indicate high/low similarity (a minimal sketch of this procedure follows this list).
  5. Validation of LLM [R1,4]. We have carefully checked the correctness of the LLM output, and the generated concepts showed high consistency with medical report descriptions and annotations.
  6. Used Datasets [R1]. To diagnose skin lesions, trained dermatologists first examine specific features that may indicate skin cancer (e.g., the 7-point checklist). Dermoscopy and biopsy are then used, with the sample examined by a pathologist to confirm a cancer diagnosis such as melanoma. Thus, skin lesion diagnosis is not an easy task that can be solved by the naked eye. Besides, our goal is to improve not only the performance but also the explainability of the model, which is essential for AI-aided clinical diagnosis. The dermoscopic datasets we used are representative datasets with concept labels, contributing to XAI in healthcare; recent works based on skin datasets, such as Kim et al., “Transparent medical image…” (Nature Medicine 2024), also highlight the importance of the area. Moreover, our method is evaluated on 4 datasets including skin and CXR images, comprehensively demonstrating its effectiveness.
  7. Terms & Abbrev. [R3]. We will make the terms consistent. CCP denotes clinical concept-driven prompts, as in Table 2 caption.
  8. Ablation [R3]. The quick convergence (Fig. 2) with different backbones shows the high efficiency of our method. The performance gap between ViTs and ResNets is exhibited in most prompt learning methods with CLIP (e.g., Table 2 of the CoOp paper also shows a ~10% performance gap). We will modify the claim of robustness in the final version.
  9. Reproducibility [R3]. We will clarify the details and release our code.
  10. Domain-specific model [R4]. Our work aims to improve both diagnostic performance and explainability. Though CLIP’s tokenizer may not fully handle clinical concepts, our method is effective based on CLIP for medical image diagnosis, showing that it can bridge the gap between natural and medical domains. This is supported by improved performance and explainability, under fair comparisons with SOTAs that were also adapted from CLIP. We believe that medical VLMs could also improve performance, but this will not change our core conclusion, and we regard it as a promising future direction.
  11. Intervention [R4]. As the clinical prompts are category-wise, we can easily conduct counterfactual intervention by modifying the attributes (Fig.3). The test set (395 images) of Derm7pt was used for the AUC plot.
  12. Local IPA [R4] is necessary: The AUC of CCP+IPA (with local IPA) is higher than without, and better image-prompt similarity visualization can be obtained with local IPA.
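For reference, a minimal sketch of the visualization procedure described in item 4 (cosine similarity between local patch features and the learned class prompt, upsampled into a heatmap). The function name, shapes, and normalization are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F
import matplotlib.pyplot as plt


def prompt_similarity_heatmap(patch_feats, prompt_feat, grid_hw, image_hw):
    """patch_feats: (n_patch, d) local visual tokens, e.g. from ViT-B/16
    prompt_feat: (d,) encoded learned prompt of the predicted class
    grid_hw:     patch grid, e.g. (14, 14) for a 224x224 ViT-B/16 input
    image_hw:    output heatmap size, e.g. (224, 224)
    """
    sim = F.normalize(patch_feats, dim=-1) @ F.normalize(prompt_feat, dim=0)
    heat = sim.reshape(1, 1, *grid_hw)  # (1, 1, h, w) for interpolation
    heat = F.interpolate(heat, size=image_hw, mode="bilinear",
                         align_corners=False)[0, 0]
    # Min-max normalize so warm colors mean high image-prompt similarity.
    return (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)


# Illustrative usage with random features in place of real CLIP outputs:
heat = prompt_similarity_heatmap(torch.randn(196, 512), torch.randn(512),
                                 (14, 14), (224, 224))
plt.imshow(heat.numpy(), cmap="jet"); plt.axis("off"); plt.show()
```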




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have tried to answer all the concerns from R3 and R4. The reviews from R1 in my opinion are poor. Based on reviews and rebuttal I recommend an accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Both R3 and R4 maintained their original positive scores after reading the rebuttal. R1 gave a negative initial score and did not respond to the rebuttal. Nevertheless, the rebuttal has appropriately answered R1’s main concern on the used datasets.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



