Abstract

Pre-trained large vision-language models (VLMs) like CLIP have revolutionized visual representation learning using natural language as supervision, and have demonstrated promising generalization ability. In this work, we propose ViP, a novel visual symptom-guided prompt learning framework for medical image analysis, which facilitates general knowledge transfer from CLIP. ViP consists of two key components: a visual symptom generator (VSG) and a dual-prompt network. Specifically, the VSG aims to extract explicable visual symptoms from pre-trained large language models, while the dual-prompt network uses these visual symptoms to guide the training of two learnable prompt modules, i.e., the context prompt and the merge prompt, to better adapt our framework to medical image analysis via large VLMs. Extensive experimental results demonstrate that ViP can achieve competitive performance compared to state-of-the-art methods on two challenging datasets. We provide the source code in the supplementary material.
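For orientation, here is a minimal, hypothetical sketch of the kind of visual-symptom-guided zero-shot classification the abstract describes, using OpenAI's public CLIP package. The learnable context (CoP) and merge (MeP) prompts of ViP are approximated by a fixed text template and a plain mean over symptom embeddings; the class names, symptom strings, and image path are illustrative only.

```python
# Hypothetical sketch, not the authors' implementation: CLIP zero-shot
# classification guided by per-class visual-symptom descriptions.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative visual symptoms per class, e.g. produced by querying an LLM.
symptoms = {
    "melanoma": ["an irregular, asymmetric border", "multiple shades of brown or black"],
    "benign nevus": ["a uniform color", "a round, well-defined border"],
}

image = preprocess(Image.open("lesion.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

    class_feats = []
    for cls, descs in symptoms.items():
        tokens = clip.tokenize([f"a photo showing {d}" for d in descs]).to(device)
        txt = model.encode_text(tokens)
        txt = txt / txt.norm(dim=-1, keepdim=True)
        class_feats.append(txt.mean(dim=0))  # crude stand-in for the merge prompt
    class_feats = torch.stack(class_feats)
    class_feats = class_feats / class_feats.norm(dim=-1, keepdim=True)

    probs = (100.0 * img_feat @ class_feats.T).softmax(dim=-1)  # cosine-similarity logits

for cls, p in zip(symptoms, probs[0].tolist()):
    print(f"{cls}: {p:.3f}")
```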

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1358_paper.pdf

SharedIt Link: https://rdcu.be/dV53K

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_6

Supplementary Material: N/A

Link to the Code Repository

https://github.com/xiaofang007/ViP

Link to the Dataset(s)

https://derm.cs.sfu.ca/Welcome.html
https://data.mendeley.com/datasets/rscbjbr9sj/2

BibTex

@InProceedings{Fan_Aligning_MICCAI2024,
        author = { Fang, Xiao and Lin, Yi and Zhang, Dong and Cheng, Kwang-Ting and Chen, Hao},
        title = { { Aligning Medical Images with General Knowledge from Large Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {57--67}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. This work explores the zero-shot learning task in medical image analysis and reveals the significant impact of LLMs on prompt engineering.
    2. The work proposes ViP that leverages LLMs to generate visual symptoms in a scalable manner and employs two learnable prompt modules to facilitate knowledge transfer from CLIP to the medical domain.
    3. The work conducts extensive experiments on two datasets and the result demonstrates the strong generalization ability of ViP to medical image analysis.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work proposes a novel visual symptom-guided prompt learning pipeline, ViP, which effectively transfers knowledge from VLMs to medical image analysis. The idea of leveraging pre-trained LLMs is effective and promising. Experimental results demonstrate the superior performance of the proposed method to the SOTA methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The proposed idea of transferring the knowledge of LLMs to the medical imaging domain is novel and promising. It would be helpful to address the limitations of this approach. For example, natural language is imprecise and ambiguous and lacks quantitative measurements; will these factors affect the performance of the proposed framework?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The work can be reproduced by a graduate student with experience in LLM and DL.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Natural language has ambiguities; does this affect the accuracy of the proposed method, and how could its precision be improved?
    2. It would be helpful to show some more challenging cases in the experiments and add some analysis.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed idea of transferring the knowledge of LLM to medical imaging category is novel and promising, the experimental results demonstrate the effectiveness of the proposed framework. It will be more convincing if the authors address the imprecision and ambiguity of natural language.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes ViP (Visual Symptom-Guided Prompt learning), a method to facilitate knowledge transfer from CLIP to the medical domain. The method exploits pre-trained LLMs in two ways: first, an LLM such as GPT-4 is used to generate a list of textual descriptions of the visual features of lesions. Then, a text encoder is trained to generate a prompt from this list of visual features. The textual features from the prompt are aligned with those of the image encoder, to which the model attends to produce the final diagnosis. The method was evaluated on chest X-ray and melanoma classification in terms of both performance and explainability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces an interesting and easy-to-reproduce method. Pretrained foundational models are gaining increasing traction and it is important to understand whether general-purpose models can contain useful medical information, in parallel to designing foundational models tailored to the medical domain.
    • Extracting visual symptoms, and computing the classification with respect to them, can yield greater explainability than a standard classification network.
    • The manuscript is generally well written, although a few points in the methodology should be further clarified.
    • The method was carefully evaluated. Both supervised baselines and comparable SOTA techniques were tested, with means and standard deviations reported.
    • Ablation studies are provided to test, among other things, the impact of medically relevant and irrelevant knowledge, showing that irrelevant (non-medical) knowledge is not detrimental to performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is not clear exactly how the model produces the final diagnosis, and how the diagnosis is attributed to symptoms (Fig. 3 shows the similarity between visual symptoms and images, but not how the visual symptoms contribute to the final diagnosis).
    • Few details are given on the visual symptom generator and its practical implementation
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Code is provided as part of the supplementary material.

    Implementation details of the visual symptom generator, such as the type of LLM used, its version, etc., are lacking.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Experiments were conducted on chest X-ray and melanoma datasets, both of which are relatively common among medical datasets available on the web. It is probable that both GPT-4 and CLIP were exposed to this type of medical data. For instance, Stable Diffusion demonstrated the capability to reconstruct chest X-ray images (https://arxiv.org/pdf/2210.04133.pdf), but we could not replicate these results with other, less prevalent modalities. While I lack data to substantiate this assertion, CLIP may suffer from the same issue, which could warrant investigation in the future.
    2. In Fig. 1, it remains unclear how the model generates melanoma classifications, given that melanoma itself is not a visual symptom. The figure appears to suggest that the visual symptom with the highest score is chosen for classification, but this interpretation contradicts the text, which mentions that information from multiple visual symptoms is aggregated using an attention mechanism.
    3. Eq. 2 is not necessary, as the self-attention equation is well known from the literature.
    4. To what degree do the visual symptoms depend on the textual prompts provided to the LLM? Are the results reproducible across different runs/GPT versions?
    5. Figs. 1 and 3 are very small and are not very readable when printed.
    6. In Table 1, the performance gap between ViP and supervised performance is much larger for Derm7pt than for the pneumonia dataset. In fact, the highest F1 score for pneumonia is achieved by supervised training (check the bold values in Table 1). The authors attribute this to the lower amount of training data, but it may be due to the domain, which is closer to RGB photographic images. An experiment in which the pneumonia dataset is downsampled could further clarify this aspect.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A well-crafted paper with interesting results on a timely topic.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents ViP, a symptom-guided VLM that transfers knowledge from a pre-trained CLIP model to the medical image domain by a learnable context prompt module, and a learnable merge prompt module. The model also contains a visual symptom generator, that generates comprehensive text descriptions of visual symptoms for each disease category, and their alignment with the visual features of the images is calculated and used for classification. Using two open medical image datasets of lung disease and skin cancer, the authors show state-of-the art performance in disease classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The proposed method shows above SOTA performance across two different datasets.

    • The CoP and MeP modules appear novel in this context. There is no need to train the VLM from scratch (oftentimes not feasible), only the smaller CoP and MeP modules, which helps generalize CLIP to the medical domain.

    • The proposed method leverages powerful pre-trained VLMs.

    • Comprehensive and clear ablation studies.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • There are some experimental details missing from the methods section, e.g., what were the values of k, d, and M? Are the presented values ensemble averages?

    • Comments are lacking on the usefulness of the proposed method for different kinds of medical images/diseases whose visual symptoms can be more or less easily summarized.

    • Legibility of figures is poor.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The datasets are open and the authors share their code which improves possible reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • The authors start the method section by defining a label set C and have mentioned the datasets they use, but have not specified what the task is yet, i.e. that there are multiple disease categories to classify. Please clarify.

    • Only define abbreviations once on first use. There are several cases where something is defined several times, e.g. VSG, CoP, MeP.

    • There are explanations missing in the methods. Explain n, k, and M, and specify their values.

    • It is unclear whether the train/test splits are the same as those used by the SOTA methods the authors compare to (Table 1). Furthermore, do the other methods also use an ensemble of vision backbones? Are the comparisons fair? Please clarify.

    • Please clarify how the 9:1 train/val splits are used for the pneumonia dataset. Ensemble? Use best?

    • Are the mean and standard deviation in Table 1 computed across the three vision backbones? Please clarify.

    • It would be useful to see some examples of the learnable tokens p of CoP. They seem very abstract currently.

    • Please clarify in the caption of Fig 3 II (or in the figure itself) that the examples are correctly classified by the proposed method, and incorrectly classified by CLIP.

    • The font sizes in Figures 1, 3, and 4 are unacceptably small, making them difficult to read. Please adjust.

    • Please comment on the usefulness of the proposed method for images that may not be as easily summarized into color, shape, and texture as the datasets used. For example, the accuracy and F1 gains are much larger for Derm7pt than for pneumonia, which seems expected given that dermoscopy images more closely resemble natural images, e.g., by being in color.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is clear and comprehensive, with relevant experiments and ablation studies to reach their conclusions. There seem to be no major experiments lacking. The topic of using combined vision language models in medicine is very interesting and on the rise in the community, and the presented results are impressive. The authors seem to have created a useful model that is feasible to train on a single GPU.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We appreciate all the reviewers for their valuable feedback. We have carefully addressed the issues and suggestions point by point, and will elaborate on them in the camera-ready version. Below we discuss some of the questions the reviewers raised. Abbreviations: Reviewer-R, Weakness-W, Constructive comments-C.

Visual Symptom Generator (VSG) (R1W1, R2W2). As written in the ‘Visual Symptom Generator’ subsection, VSG consists of a text-only prompt to query the general features of a disease and an image-text prompt to refine the visual symptoms based on the available data. This helps improve the accuracy of the generated visual symptoms. For example, the generated visual symptom “parts of the spot look glossy and reflective” in Fig. 1 is clearly not a visual feature of Fig. 3 (II)(d). Through image-text prompt refinement, we discard this symptom from the description set generated by the text-only prompt, resulting in more accurate descriptions.
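As an illustration only (not the paper's implementation), the two-stage querying described above could look roughly like the sketch below, using the OpenAI Python SDK; the model name, prompt wording, and the encode_image helper are assumptions.

```python
# Hypothetical sketch of a two-stage visual symptom generator:
# stage 1 asks for general symptoms, stage 2 refines them against an example image.
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> str:
    """Read an image file and return it as a base64 string (illustrative helper)."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Stage 1: text-only prompt querying general visual symptoms of a disease class.
stage1 = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{
        "role": "user",
        "content": "List concise visual symptoms of melanoma as seen in dermoscopy images, one per line.",
    }],
)
candidate_symptoms = stage1.choices[0].message.content

# Stage 2: image-text prompt that discards symptoms unsupported by the available data.
img_b64 = encode_image("example_lesion.jpg")
stage2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Keep only the symptoms below that are plausible for images like the attached example:\n"
                     + candidate_symptoms},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{img_b64}"}},
        ],
    }],
)
print(stage2.choices[0].message.content)  # refined visual symptom set
```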

Model diagnostic process (R2W1, R2C2). In Fig. 1, we first use an example of melanoma to demonstrate that VSG converts disease labels into descriptive visual symptoms. Next, the generated visual symptoms are combined with context prompts (CoP) and used as input to the text encoder. The encoded text features are then aggregated using MeP. The aggregated feature in Fig. 1 corresponds to one of the s^c, and we present the visual symptom encoding process of melanoma for clarity. The final prediction is calculated from the image feature and the aggregated features of all classes (s^c_1, s^c_2, …). We acknowledge that Fig. 1 might introduce confusion, and we will revise it accordingly.
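As a rough sketch of the mechanism described above (an assumption for illustration, not the authors' code), per-symptom text features could be aggregated into a single class feature with a learnable attention query and then scored against the image feature; the dimensions and random placeholder features below are illustrative.

```python
# Hypothetical sketch: attention-based aggregation of symptom features (MeP-like)
# followed by cosine-similarity classification against the image feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512                  # CLIP embedding dimension
num_classes, k = 2, 5    # number of classes and symptoms per class (illustrative)

query = nn.Parameter(torch.randn(1, d))  # learnable "merge" query

def aggregate(symptom_feats: torch.Tensor) -> torch.Tensor:
    """(k, d) symptom text features for one class -> (d,) aggregated class feature."""
    attn = F.softmax(query @ symptom_feats.T / d ** 0.5, dim=-1)  # (1, k) attention weights
    return (attn @ symptom_feats).squeeze(0)

image_feat = F.normalize(torch.randn(d), dim=-1)   # placeholder for a CLIP image feature
text_feats = torch.randn(num_classes, k, d)        # placeholders for CLIP text features

class_feats = torch.stack([aggregate(text_feats[c]) for c in range(num_classes)])
class_feats = F.normalize(class_feats, dim=-1)

logits = 100.0 * class_feats @ image_feat          # one cosine similarity per class
print(logits.softmax(dim=-1))                      # per-class probabilities
```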

Examples of learnable tokens (R3C7). CoP is designed to automatically learn the medical task context. We acknowledge that the learnable context tokens may appear abstract. Several papers explore explainable prompts [1, 2], but this is beyond the scope of this work.

Importance of color features for CLIP learning (R2C6, R3C10). We appreciate the idea that the impressive performance boost observed on the Derm7pt dataset might be attributed to its similarity to the color patterns of natural images. However, the impact of different features on performance can vary across tasks [3, 4]. In future work, we will explore more datasets of different modalities to gain a deeper understanding of these factors.

[1] Visual-language prompt tuning with knowledge-guided context optimization. CVPR, 2023.
[2] MICA: Towards Explainable Skin Lesion Diagnosis via Multi-Level Image-Concept Alignment. AAAI, 2024.
[3] Artificial intelligence structural imaging techniques in visual pattern analysis and medical data understanding. Pattern Recognition, 2003.
[4] Local binary patterns variants as texture descriptors for medical image analysis. Artificial Intelligence in Medicine, 2010.




Meta-Review

Meta-review not available, early accepted paper.


