Abstract

In the field of medical decision-making, precise anomaly detection in medical imaging plays a pivotal role in aiding clinicians. However, previous work relies on large-scale datasets for training anomaly detection models, which increases the development cost. This paper first focuses on the task of medical image anomaly detection in the few-shot setting, which is critically significant for the medical field, where data collection and annotation are both very expensive. We propose an innovative approach, MediCLIP, which adapts the CLIP model to few-shot medical image anomaly detection through self-supervised fine-tuning. Although CLIP, as a vision-language model, demonstrates outstanding zero-/few-shot performance on various downstream tasks, it still falls short in the anomaly detection of medical images. To address this, we design a series of medical image anomaly synthesis tasks to simulate common disease patterns in medical imaging, transferring the powerful generalization capabilities of CLIP to the task of medical image anomaly detection. When only few-shot normal medical images are provided, MediCLIP achieves state-of-the-art performance in anomaly detection and localization compared to other methods. Extensive experiments on three distinct medical anomaly detection tasks have demonstrated the superiority of our approach. The code is available at https://github.com/cnulab/MediCLIP.
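The paper's exact anomaly synthesis tasks are not reproduced here; as a rough illustration of the general idea (synthesizing anomalies from normal images together with pixel-level masks), a CutPaste-style patch corruption is sketched below. The function name, patch-fraction parameter, and the self-paste strategy are all illustrative assumptions, not the paper's actual tasks.

```python
import numpy as np

def synthesize_anomaly(image, rng=None, patch_frac=0.15):
    """Paste a randomly chosen patch of the image onto a different
    location, producing a synthetic anomaly plus a pixel-level mask.
    A rough CutPaste-style sketch; the paper's actual synthesis tasks
    (simulating common disease patterns) are more elaborate."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = image.shape[:2]
    ph, pw = max(1, int(h * patch_frac)), max(1, int(w * patch_frac))
    # source and destination corners for the pasted patch
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]
    mask = np.zeros((h, w), dtype=np.uint8)  # 1 where the anomaly was pasted
    mask[dy:dy + ph, dx:dx + pw] = 1
    return out, mask

img = np.random.rand(64, 64).astype(np.float32)
anom, mask = synthesize_anomaly(img)
```

The mask serves as a free pixel-level label, which is what allows localization to be supervised without any real annotated anomalies.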

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0333_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0333_supp.pdf

Link to the Code Repository

https://github.com/cnulab/MediCLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_MediCLIP_MICCAI2024,
        author = { Zhang, Ximiao and Xu, Min and Qiu, Dehui and Yan, Ruixin and Lang, Ning and Zhou, Xiuzhuang},
        title = { { MediCLIP: Adapting CLIP for Few-shot Medical Image Anomaly Detection } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes three types of image anomaly synthesis tasks to adapt CLIP for few-shot medical image anomaly detection. The proposed method is evaluated on three public benchmark datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed anomaly synthesis tasks are straightforward yet with clinically relevant explanations.
    • The proposed method demonstrates superior performance to compared ones in the few-shot settings, despite its simplicity.
    • The paper is written clearly.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • My primary concern is the motivation of this work. The initial idea of medical image anomaly detection originated from the fact that in clinical practice, the majority of data is normal. Thus, designing methodologies that can utilize the vast amount of normal data for anomaly detection is meaningful. The authors' choice to limit their use of the normal data to a very small portion therefore needs strong justification.

    • Following the previous comment, I would like to see the upper-bound performance of the strongest non-few-shot anomaly detection methods in Tables 1 and 2, using more normal data for training, for reference. If the few-shot performance of the proposed method could match the upper bound, it would validate the motivation of this work.

    • Missing training details: How many epochs/iterations are the model trained? How to ensure convergence?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The submission does not mention open access to source code; the method description is generally clear except for the last comment on the main weaknesses.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Provide upper-bound performance in Tables 1 and 2.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is generally clear. The proposed method seems simple but effective. The experiments are comprehensive.

    What needs to be addressed in rebuttal: provide upper-bound performance and justify the motivation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper focuses on medical image anomaly detection in the few-shot setting. The authors use only few-shot normal images, without any available anomaly images or pixel-level labels. In more detail, the paper uses learnable prompts for medical tasks and adapts CLIP from classification to detection and localization via a newly designed "adapter". Extensive experiments on three distinct medical anomaly detection tasks demonstrate the superiority of the approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. High novelty: the authors use only few-shot normal images, without any available anomaly images or pixel-level labels, to simulate common disease patterns and transfer CLIP to the task of medical image anomaly detection.
    2. This paper is well-written and well-organized.
    3. Three experiments have been conducted, and they provide sufficient support for the hypothesis.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lacks some computational details of the adapter and details of the network structure.
    2. Lacks a comparison with [1] and [2]. [1] Huang, C., Jiang, A., Feng, J., Zhang, Y., Wang, X., & Wang, Y. (2024). Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images. arXiv preprint arXiv:2403.12570. [2] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. MedCLIP: Contrastive learning from unpaired medical images and text. In EMNLP, 2022.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. “This paper first focuses on the task of medical image anomaly detection in the few-shot setting” — are you sure? See [1].
    2. “it still falls short in the anomaly detection of medical images.” — in what specific way does it fall short?
    3. If you can compare with the recent works [1] and [2], your work will be more distinguished.

    [1] Huang, C., Jiang, A., Feng, J., Zhang, Y., Wang, X., & Wang, Y. (2024). Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images. arXiv preprint arXiv:2403.12570. [2] Zifeng Wang, Zhenbang Wu, Dinesh Agarwal, and Jimeng Sun. Medclip: Contrastive learning from unpaired medical images and text. In EMNLP, 2022.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper discusses a significant issue in the medical field, which is the difficulty in obtaining a dataset that contains real images with anomalies and corresponding pixel-level labels. To overcome this challenge, the authors have proposed a new method that utilizes only a few-shot normal images, without any available anomaly images or pixel-level labels. The approach simulates common disease patterns and transfers CLIP to the task of detecting anomalies in medical images. Although it has some minor drawbacks, the 8-page conference paper is comprehensive and innovative. Therefore, I believe we can accept it.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper, the authors use CLIP for anomaly detection in medical images. Rather than using many thousands of images as in most anomaly detection works, only a few images, ranging from 4 to 32, are used. This work requires only normal images as inputs, as opposed to other anomaly detection works that require both normal and abnormal images, and it introduces an approach for synthesizing abnormal images. Rather than using the standard prompts used for CLIP, the paper uses learnable prompts to learn the textual descriptions of the prompts for the images.
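    The learnable prompts described above can be sketched in the CoOp style: learnable context vectors replace hand-written prompt words and are prepended to a class token embedding before being passed to the frozen CLIP text encoder. This is an illustrative assumption about the mechanism, not the paper's exact implementation; the module name, `n_ctx`, and `embed_dim` are hypothetical choices.

    ```python
    import torch
    import torch.nn as nn

    class LearnablePrompt(nn.Module):
        """Learnable context vectors in place of hand-written prompt words
        (CoOp-style sketch; n_ctx and embed_dim are illustrative)."""
        def __init__(self, n_ctx=8, embed_dim=512):
            super().__init__()
            # context vectors optimized during fine-tuning
            self.ctx = nn.Parameter(torch.randn(n_ctx, embed_dim) * 0.02)

        def forward(self, class_embed):
            # class_embed: (n_cls, embed_dim), e.g. embeddings of "normal"/"abnormal"
            n_cls = class_embed.shape[0]
            ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
            # prepend the shared learnable context to each class token; the
            # result would be fed to the frozen CLIP text encoder
            return torch.cat([ctx, class_embed.unsqueeze(1)], dim=1)

    prompt = LearnablePrompt()
    class_embed = torch.randn(2, 512)   # "normal" / "abnormal"
    seq = prompt(class_embed)           # shape (2, 9, 512)
    ```

    Because only the context vectors are trained while CLIP stays frozen, the number of learnable parameters remains very small, which suits the few-shot regime.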

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Generating multiple variations of abnormal images from normal images: Rather than limiting the abnormal image generation, three approaches for generating abnormal images are introduced, which can help improve model performance by generating diverse image features.
    • Using learnable prompts: With the field of prompt engineering still evolving, the use of learnable prompts helps avoid many of the issues faced when using manual prompts, especially for data that is generated synthetically.
    • Using adapters to combine the output of the vision encoder with the text encoder: Since the feature length and representation of the text and vision encoders can vary, using an adapter to ensure the sizes match is an interesting approach to handling features of varied length or properties.
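    The adapter idea noted above can be sketched as a projection from the vision encoder's patch-feature width into the text-embedding space, after which each patch is scored by its similarity to "normal"/"abnormal" text embeddings. This is a hypothetical minimal version; the class name, dimensions (768 for a ViT vision width, 512 for CLIP text embeddings), and the single linear layer are illustrative assumptions, not the paper's exact adapter design.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Adapter(nn.Module):
        """Project patch features into the text-embedding space so the two
        modalities can be compared; a hypothetical single-layer sketch."""
        def __init__(self, vision_dim=768, text_dim=512):
            super().__init__()
            self.proj = nn.Linear(vision_dim, text_dim)

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (n_patches, vision_dim); text_embeds: (2, text_dim)
            v = F.normalize(self.proj(patch_feats), dim=-1)
            t = F.normalize(text_embeds, dim=-1)
            logits = v @ t.t()                    # cosine similarity, (n_patches, 2)
            # probability of the "abnormal" text per patch -> anomaly map
            return logits.softmax(dim=-1)[:, 1]

    adapter = Adapter()
    scores = adapter(torch.randn(196, 768), torch.randn(2, 512))  # (196,)
    ```

    Reshaping the per-patch scores back to the patch grid (e.g. 14×14 for 196 patches) and upsampling would yield a pixel-level anomaly localization map.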
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Use of abnormal images: Even though the paper synthesizes abnormal images, it would be interesting to see the scenarios where real abnormal images are available and how those would compare with only using the synthesized images or how both can be combined. This is to avoid throwing away any data that might be available during training.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Since the code for this paper relies on CLIP, and uses large datasets like CheXPert, an idea of the cost incurred for running this code would be a great addition. This provides an idea of what it takes to run the program for any other person interested in deploying the program especially on the cloud.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors have focused on an area that has a significant impact in the medical space, which is in the area of limited data especially for anomaly detection. To further cement the impact of this work, it would be nice to see some other areas of medical image analysis, where this approach helps provide improved performance.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper adapts CLIP to work for the medical domain. CLIP and other related Deep Learning approaches usually work on natural images, and usually have challenges in being deployed for medical images. The authors have not only deployed CLIP for medical images, but also solved a lot of the challenges faced like limited medical data, and the issue of describing medical images, which is different from natural images.

    The paper also interestingly provides approaches of synthesizing medical images in a way that provides improved performance. Synthesizing medical images for improved performance does not usually provide as much improved results as in natural images.

    With these few reasons, I believe this paper deserves to be accepted.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Dear Area Chair and Reviewers,

Thank you for your valuable feedback. We have carefully considered your comments and addressed the key points below:

Comment 1: Lack of Discussion with [1,2] Response: We will include a discussion about [1,2] in the camera-ready revision. It is worth noting that the motivations behind MediCLIP and [1] are different. MediCLIP utilizes normal images and a set of carefully designed anomaly synthesis tasks to optimize CLIP for medical anomaly detection tasks. In contrast, [1] uses real anomaly images and corresponding masks.

Comment 2: Motivation and Upper Bound Performance Analysis Response: In practice, medical images of specific parts, such as the oral cavity, neck, lateral pelvis, and bending positions of the spine, are often difficult to obtain. Additionally, privacy concerns often restrict the collection of sufficient normal samples needed to train full-shot models. MediCLIP is suitable for few-shot scenarios with a lightweight design that is easy to train and deploy, making it ideal for a wide range of real-world applications. When the number of available normal images is 32, the anomaly detection performance of MediCLIP almost reaches the upper limit. This is attributed to MediCLIP’s very few learnable parameters, enabling it to generalize well even with a very limited number of available samples.

Comment 3: More Application Scenarios Response: In future work, we will explore scenarios where a small set of real abnormal samples is available to improve MediCLIP, enabling it to adapt to various real-world situations.

References: [1] Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images. [2] Medclip: Contrastive learning from unpaired medical images and text.




Meta-Review

Meta-review not available, early accepted paper.


