Abstract

In medical image analysis, the scarcity of expertise and the high cost of data annotation limit the development of large artificial intelligence models. This paper investigates the potential of transfer learning with pre-trained vision-language models (VLMs) in this domain. Currently, VLMs still struggle to transfer to underrepresented diseases with minimal presence in the pre-training dataset and to new diseases entirely absent from it. We argue that effective adaptation of VLMs hinges on nuanced representation learning of disease concepts. By capitalizing on the joint visual-linguistic capabilities of VLMs, we introduce disease-informed contextual prompting in a novel disease prototype learning framework. This approach enables VLMs to grasp the concepts of new diseases effectively and efficiently, even with limited data. Extensive experiments across multiple image modalities showcase notable performance enhancements compared to existing techniques.
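
For readers who want a concrete picture, the following minimal sketch illustrates the general recipe the abstract describes: attribute-rich disease prompts are encoded by a frozen CLIP text encoder into per-disease prototypes, and the image encoder is fine-tuned to align image embeddings with those prototypes. It assumes OpenAI's CLIP package; the prompts, classes, loss, and hyper-parameters are illustrative placeholders, not the paper's exact formulation.

# Minimal sketch of disease-informed prompting plus prototype-guided
# fine-tuning. Assumes OpenAI's CLIP package (github.com/openai/CLIP);
# prompts, classes, loss, and hyper-parameters are illustrative only.
import torch
import torch.nn.functional as F
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Disease-informed contextual prompts: attribute-rich descriptions
# (texture, shape, location) rather than bare class names.
prompts = [
    "a histopathology image of benign tissue with uniform, regular texture",
    "a histopathology image of a malignant lesion with irregular borders",
]

# Encode the prompts once with the frozen text encoder to form
# fixed per-disease text prototypes.
with torch.no_grad():
    tokens = clip.tokenize(prompts).to(device)
    prototypes = F.normalize(model.encode_text(tokens).float(), dim=-1)  # (C, D)

# Fine-tune only the image encoder, pulling each image embedding
# toward its class prototype via similarity-based cross-entropy.
optimizer = torch.optim.AdamW(model.visual.parameters(), lr=1e-5)

def prototype_loss(images, labels, temperature=0.07):
    feats = F.normalize(model.encode_image(images).float(), dim=-1)  # (B, D)
    logits = feats @ prototypes.t() / temperature                    # (B, C)
    return F.cross_entropy(logits, labels)

# Training step: loss = prototype_loss(batch, labels); loss.backward();
# optimizer.step(); optimizer.zero_grad()

Keeping the text encoder frozen anchors the prototypes to the clinical descriptions while only the image encoder adapts, which matches the low-data motivation above.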

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0383_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/RPIDIAL/Disease-informed-VLM-Adaptation

Link to the Dataset(s)

https://warwick.ac.uk/fac/sci/dcs/research/tia/data/pannuke/

https://paperswithcode.com/dataset/covidx

BibTex

@InProceedings{Zha_Diseaseinformed_MICCAI2024,
        author = { Zhang, Jiajin and Wang, Ge and Kalra, Mannudeep K. and Yan, Pingkun},
        title = { { Disease-informed Adaptation of Vision-Language Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents an innovative approach to leverage pre-trained vision-language models (VLMs) for medical image analysis by introducing disease-informed contextual prompting within a disease prototype learning framework. This methodology aims to improve the adaptability of VLMs to underrepresented and novel diseases, addressing the challenge of expertise scarcity and expensive data annotation in the medical domain. The experiments demonstrate significant performance improvements across multiple image modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea of enriching vision models with disease-specific contextual information through prompting is very interesting. This technique has shown promise in other domains, and its application in medical imaging could significantly enhance diagnostic accuracy and model adaptability.
    • The evaluation is thorough, utilizing multiple public datasets and including ablation studies. This robust testing framework not only confirms the effectiveness of the proposed methods but also adds credibility to the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The introduction omits discussion of some relevant state-of-the-art works, such as [1], which could provide a richer context for understanding the contributions and limitations of the proposed approach.
    • The paper does not discuss the computational complexity or the hardware requirements for the training and deployment of the proposed models. This information is crucial for assessing the feasibility of the approach in a clinical setting.
    • There is no discussion on the future directions and the limitations of the proposed method. Including such a discussion could help in understanding the scalability of the approach and its potential impact.

    [1] Li, Chunyuan, et al. “Llava-med: Training a large language-and-vision assistant for biomedicine in one day.” Advances in Neural Information Processing Systems 36 (2023).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It would be beneficial to discuss and differentiate the proposed method from relevant existing works like Li et al. [1]. Such a comparison can highlight the unique aspects and advantages of your approach.
    • The selection of specific adapters like CLIP over others like GLIP should be justified. Discussing why certain adapters were chosen and others were not, especially those that are considered state-of-the-art, would provide clarity and enhance the technical depth of the paper.
    • The rationale behind the choice of attributes for the prompt template in the DiCoP framework should be clarified. It would be insightful to know whether these attributes were vetted or suggested by medical experts or derived from other empirical evidence.
    • Expanding on the process for generating prompts and fine-tuning the VLMs would be valuable. How does this process scale to larger or more complex datasets? Addressing this would greatly benefit readers and potential users of your methodology.

    Minor:
    • Attention to detail in terms of language and presentation is crucial. I recommend a thorough proofreading to correct typographical and grammatical errors, ensuring the paper meets the high standards of publication.
    • Please ensure that all references are current and accurately cited, e.g. [16] was actually published at ICLR 2023. This improves the reliability and traceability of the work.
    • I want to point the authors to a recent work published at CVPR [2], which has similarities to the presented work but only became available after the MICCAI submission deadline.
    • Finally, I want to thank the authors for actually providing statistical evidence for the “significant” performance improvement of the proposed work.

    [2] Huang, et al. “Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images.” CVPR (2024).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper makes interesting contributions to the field of medical image analysis through the innovative use of vision-language models adapted by disease-informed prompts. The strengths of the paper, particularly the application of disease information in prompts and comprehensive evaluation, are compelling. However, improvements are needed in discussing state-of-the-art comparisons, computational details, and future directions. These enhancements would provide a clearer understanding of the method’s novelty, practical implications, and potential areas for further research. Overall, the paper is a promising candidate for acceptance, pending revisions that address the aforementioned points.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a representation learning framework for vision-language models (VLMs) on new and rare diseases under a low-data regime for medical imaging classification tasks. It consists of a Disease-informed Contextual Prompting (DiCoP) component to bridge the concepts of new and rare diseases with established clinical knowledge, and Disease Prototype Learning (DPL) to fine-tune the image encoder by imposing prototypes of diseases (e.g., geometric structure, location, shape, texture, etc.). The experiments are conducted on two public medical imaging classification datasets, and the results show that the proposed method outperforms current adapter-based baselines and prompting-based state-of-the-art methods. Further ablation studies support the necessity of both proposed components.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Superior performance, especially in low-data regime.
    • Comprehensive experiments, ablation studies and hyper-parameter sensitivity analysis.
    • Interpretability with visualization of the learned latent space.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Concerns about hand-crafted prompts being fixed for all samples.
    • Unclear use of clinical prompts in the existing prompt-based methods.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is a good read. I appreciate Figure 3, which demonstrates the trade-off between the size of the labeled dataset and the performance. As expertise and high-quality annotated data are limited in hospitals, it sheds light for doctors on how many data points they need for the fine-tuned model to perform reasonably well. The proposed method performs well especially when the availability of data is limited.

    The experiments are comprehensive. There are many components and hyper-parameters involved in the proposed method. The thorough ablation studies support the necessity of all proposed components, and the sensitivity analysis demonstrates, to some extent, stability regardless of the choice of hyper-parameters.

    The paper adds to its interpretability with a t-SNE visualization of the latent space, explaining the superior performance of the proposed method as owing to better representation learning.

    One of the design choices is to use manually designed prompts, generated by GPT-4 and verified by a clinical professional, throughout the fine-tuning process for all samples of the same category. Could the authors discuss how this approach would adapt to outliers? For example, there is no one-size-fits-all texture, shape, and location for each rare and new disease.
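
    For illustration, a fixed class-level prompt of the kind discussed above might look like the following sketch (the template and every attribute value here are hypothetical, not taken from the paper):

    # Hypothetical attribute-based prompt, fixed for all samples of a class.
    PROMPT_TEMPLATE = (
        "A {modality} image of {disease}, presenting as a {shape} lesion "
        "with {texture} texture, located in {location}."
    )
    prompt = PROMPT_TEMPLATE.format(
        modality="chest X-ray",
        disease="COVID-19 pneumonia",
        shape="patchy, ill-defined",
        texture="ground-glass",
        location="the bilateral peripheral lung zones",
    )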

    Also, it was not clear to me whether, when the authors compare with the prompt-based methods, those methods also use the same clinical knowledge represented by the same hand-crafted prompts used in the proposed method. I believe the right prompts play a crucial role in VLMs.

    Minor Issues

    1. The disease-informed prompts in and above Equation 4 are not illustrated in Figure 1.
    2. In Section 3, PLIP is mentioned for the first time without any relevant context or explanation of what it is; the same applies to BioViL.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel with compelling evidence, though some further clarification would be better. I’m leaning toward acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes disease-informed contextual prompting within a novel disease prototype learning framework for pre-trained vision-language models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The development of a novel architecture for adapting a pre-trained vision-language model. The disease-informed contextual prompting is a way of incorporating clinical knowledge and image features into the VLM to improve its performance in the medical domain.
    2. The disease prototype learning can also help the VLM learn image representations, compared to the original CLIP-based VLM, which lacks geometric understanding.
    3. The results show that this new architecture can outperform CoOp and CLIP-Adapter on the PanNuke dataset and the COVIDx dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of diversity in the testing datasets. It might be more meaningful to show evaluation on other radiological imaging modalities.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, this paper introduces a novel framework for transfer learning with VLMs in the medical domain. Its current weakness is the lack of diverse evaluation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strength in model architecture leads to a weak accept.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


