Abstract

Medical vision-language models (Med-VLMs) trained on large datasets of medical image-text pairs and later fine-tuned for specific tasks have emerged as a mainstream paradigm in medical image analysis. However, recent studies have highlighted the susceptibility of these Med-VLMs to adversarial attacks, raising concerns about their safety and robustness. Randomized smoothing is a well-known technique for turning any classifier into a model that is certifiably robust to adversarial perturbations. However, this approach requires retraining the Med-VLM-based classifier so that it classifies well under Gaussian noise, which is often infeasible in practice. In this paper, we propose a novel framework called PromptSmooth to achieve efficient certified robustness of Med-VLMs by leveraging the concept of prompt learning. Given any pre-trained Med-VLM, PromptSmooth adapts it to handle Gaussian noise by learning textual prompts in a zero-shot or few-shot manner, achieving a delicate balance between accuracy and robustness while minimizing the computational overhead. Moreover, PromptSmooth requires only a single model to handle multiple noise levels, which substantially reduces the computational cost compared to traditional methods that rely on training a separate model for each noise level. Comprehensive experiments with three Med-VLMs across six downstream datasets of various imaging modalities demonstrate the efficacy of PromptSmooth.
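
The certification machinery referenced above is randomized smoothing (Cohen et al., 2019). As background, here is a minimal Monte Carlo sketch of that certification procedure in Python; `f` stands in for any base classifier (e.g., a noise-adapted Med-VLM zero-shot classifier), and all names and default values are our illustrative choices, not the paper's implementation.

    import numpy as np
    import torch
    from scipy.stats import beta, norm

    def certify(f, x, sigma, n0=100, n=10000, alpha=0.001, num_classes=2):
        # Smoothed classifier: g(x) = argmax_c P[f(x + eps) = c],
        # with eps ~ N(0, sigma^2 I), estimated by sampling.
        def counts(num):
            c = np.zeros(num_classes, dtype=int)
            for _ in range(num):
                c[f(x + sigma * torch.randn_like(x))] += 1
            return c
        guess = int(counts(n0).argmax())        # cheap guess of the top class
        k = int(counts(n)[guess])               # hits for that class in n draws
        if k == 0:
            return None, 0.0                    # abstain: no certificate
        p_lo = beta.ppf(alpha, k, n - k + 1)    # Clopper-Pearson lower bound
        if p_lo <= 0.5:
            return None, 0.0                    # abstain: majority not certified
        return guess, sigma * norm.ppf(p_lo)    # class and certified L2 radius

The sketch makes the practical obstacle concrete: f must classify well on Gaussian-perturbed inputs, which is exactly what PromptSmooth achieves by learning prompts instead of retraining the backbone.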

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3451_paper.pdf

SharedIt Link: https://rdcu.be/dY6k3

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_65

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3451_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hus_PromptSmooth_MICCAI2024,
        author = { Hussein, Noor and Shamshad, Fahad and Naseer, Muzammal and Nandakumar, Karthik},
        title = { { PromptSmooth: Certifying Robustness of Medical Vision-Language Models via Prompt Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {698--708}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The text introduces a training method based on prompt learning to enhance the robustness of Medical Vision-Language Models in the presence of Gaussian noise. The article is well-organized, making it easily understandable and accessible to a wide audience.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The article is well-structured, written in a clear and easily understandable manner, and the experimental section demonstrates state-of-the-art results on two publicly available datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Firstly, I believe that the motivation of this article is unclear. The issue of robustness extends beyond just Gaussian noise. Additionally, the motivation behind using prompt learning to address this issue lacks a theoretical explanation, making it seem somewhat unreasonable. Secondly, I find the division of prompts into zero-shot and few-shot in the methods of this article to be unreasonable. It is challenging to determine in advance whether a given visual-language model has encountered a specific category. Thirdly, the experimental section seems to lack evidence of generalizability. Visual-language models are utilized across multiple tasks, yet the experiments were only validated on a single task.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Motivation: Firstly, the article lacks a clear definition of robustness and fails to provide intuitive examples illustrating the primary issues targeted by the text. Furthermore, the motivation behind using prompt-based learning to address robustness is unclear. Why is prompt-based learning necessary?

    2) Method: In the method section, the article mentions, “Our goal is to efficiently adapt Med-VLMs in data-limited scenarios to predict well under Gaussian noise.” However, there are various types of noise, and addressing Gaussian noise through prompt-based learning instead of image-based methods seems somewhat unreasonable. The authors need to clarify this point. Additionally, the zero-shot and few-shot settings are unclear to the user; that is, users are unaware whether the model has seen the category. It is essential to specify precisely when each of the two types of prompts mentioned in the article should be used.

    3) Experiments: The experimental validation dataset is limited in quantity, and its representativeness is not clearly stated. Although the article mentions that “Medical Vision-Language Models (Med-VLMs) have significantly advanced the state-of-the-art across a broad spectrum of medical imaging tasks such as classification, segmentation, and detection,” these tasks are not verified in the experiments, failing to demonstrate the generalizability of the proposed method.

    4) Writing Details: In the abstract, “Medical Vision-Language Models” is written in lowercase, while in the introduction, it is capitalized. This inconsistency should be rectified for uniformity. In the introduction, the phrase “these defenses have consistently shown vulnerabilities to newer and more powerful adversarial attacks” lacks clarity regarding what these defenses refer to and lacks concrete examples illustrating the attacks and vulnerabilities. Since robustness is a concept without a clear definition, it would be beneficial to provide a clear, concrete example to illustrate the specific issues addressed in the article. The excessive use of various colors, italics, and bold text for annotations in the text hinders readability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the “Weaknesses” and “Comments” sections above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper introduces a technique to certify the robustness of Medical Vision Language Models (VLMs) against adversarial attacks. It employs a prompt learning approach, leveraging zero-shot and few-shot training with Gaussian noise perturbations. Through experiments using three VLMs and six medical datasets, the authors demonstrate that their method effectively certifies robustness across various levels of perturbation noise.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method for adversarial robustness, which utilizes prompt tuning and Gaussian perturbation, is innovative in the medical VLM domain. Since the encoder is frozen and only the prompts are trained, training is fast and requires minimal computational resources, making the method highly practical. It has been tested on three VLMs, and the results indicate that the strategy consistently achieves high certified accuracy even at elevated noise levels. The ablation study effectively illustrates the impact of various hyperparameters, such as context token size, training steps, and learning shots, on certified accuracy. The method is evaluated on six datasets covering pathology and X-ray imaging. The paper is well-structured, clearly written, and demonstrates a robust experimental setup.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is well-written and makes a strong contribution; however, there are some concerns listed in comment no. 10, especially regarding the description of adversarial attack algorithms and the implementation details of the baseline methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I) The paper does not specify the types of adversarial attack algorithms used in the evaluations. Robustness against adversarial attacks depends on the types of attacks [1,2], and it remains unclear which attack algorithms the method is robust against.

    II) The implementation details for the baseline methods, Denoised Smoothing and Diffusion Smoothing, are missing. It is unclear how the results for these methods were obtained, as the authors have mentioned that these methods are not tailored to VLMs. If so, there is a concern regarding the fairness of comparing these methods against PromptSmooth, which is specifically designed for VLMs. Given that VLMs benefit from the rich context of large training pairs during pretraining, comparing adversarial robustness without using the same underlying backbone may be inappropriate.

    [1] Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., Jana, S.: Certified robustness to adversarial examples with differential privacy. In: 2019 IEEE Symposium on Security and Privacy (SP), pp. 656–672. IEEE (2019)

    [2] Carlini, N., Wagner, D.: Towards evaluating the robustness of neural networks. In: 2017 IEEE Symposium on Security and Privacy (SP). IEEE (2017)

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting paper that introduces a novel technique to certify robustness against adversarial perturbations for medical VLM. Apart from the concerns noted in comment 10, this paper is solid, featuring impressive results and a thorough evaluation. I will increase the rating if the concerns are addressed.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concern about the types of adversarial attacks.



Review #3

  • Please describe the contribution of the paper

    In this manuscript, a novel method termed PromptSmooth is proposed to certify the robustness of medical VLMs. The method is founded on randomized smoothing and aims to improve prediction accuracy on Gaussian-perturbed images by learning text prompts. Two ways of prompt learning are introduced: one for the few-shot scenario and the other for the zero-shot scenario.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The idea of learning prompts to certify robustness is simple to implement and computationally efficient.

    • The proposed method shows a considerable improvement in certifying robustness over existing methods.

    • The paper is well-written and very easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • To improve the accuracy of the classifier under noise, fine-tuning model parameters or training an additional denoiser has been proposed for CNN- or ViT-based models. For VLMs, fine-tuning is often realised by learning prompts. Therefore, the proposed PromptSmooth is a natural and trivial extension of the existing approaches to VLMs.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • In Figure 1, it would be clearer if the fire symbol were added to the “few-shot prompts” block, as the prompts are learnable in this case.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is simple to implement, computationally efficient, and shows a considerable improvement in certifying robustness over existing methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The proposed technique, namely certifying robustness via learning prompts, is novel, simple to implement, and computationally efficient. The effectiveness of the method is demonstrated in the experiments, and the manuscript is easy to follow.




Author Feedback

We thank the reviewers (R1, R3, R4) for their positive feedback: well-written and easy-to-follow paper (R1, R3, R4), computationally efficient approach (R1, R4), SOTA results (R1, R3, R4), and extensive experiments in a highly practical setup (R4). We will make our code publicly available.

[Global Comment] Certified Robustness as a Defense Method Against Adversarial Attacks (R3, R4)

The work of Cohen et al. [4] theoretically proves that “any classifier that performs well under Gaussian noise can be transformed into a new classifier that is certifiably robust to adversarial perturbations under the L2 norm”. This guarantee holds regardless of the adversarial attack type, provided the perturbation is within the certified radius. While various approaches exist to improve a classifier’s robustness to Gaussian noise, such as retraining with noise augmentation or prepending a denoiser, our work introduces a novel prompt-learning-based method to achieve this goal for large Med-VLM-based classifiers, providing robustness against any norm-based adversarial attack.
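
For reference, the certified radius behind this guarantee can be stated concretely (our transcription of the result in Cohen et al. [4], not text from the rebuttal): if the base classifier returns the top class under Gaussian noise N(0, sigma^2 I) with probability at least p_A, and the runner-up class with probability at most p_B, then the smoothed classifier's prediction is provably constant within the L2 radius

    R = \frac{\sigma}{2}\left(\Phi^{-1}(p_A) - \Phi^{-1}(p_B)\right),

where \Phi^{-1} is the inverse standard Gaussian CDF. Adapting the classifier to predict well under Gaussian noise raises p_A and therefore widens this radius, which is precisely what PromptSmooth's prompt learning targets.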

R1 On Trivial extension: While fine-tuning VLMs is often achieved by learning prompts, our application of prompting to achieve certified robustness for MedVLMs via randomized smoothing is novel and shows significant gains over existing baselines. On Fig. 1: We will add the fire symbol to the few-shot block in Fig. 1 for clarity.

R3 1-On Definition of Robustness and Why Gaussian Noise?: Following [4], we define robustness as resilience to norm-constrained adversarial attacks. Although there are various types of noise, adapting the classifier to Gaussian noise is an intermediate step in achieving certified robustness against any L2-norm adversarial attack, based on the randomized smoothing theory of [4] (see the Global Comment above).

2-Motivation of Prompt Learning and the Zero-shot/Few-shot Setting: We use prompt learning because it allows efficient adaptation of large MedVLMs against distribution shifts such as Gaussian noise [31]. By leveraging prompt learning, we can effectively adapt MedVLMs to Gaussian noise, thereby enhancing their certified robustness. In the context of VLMs, zero-shot generalization refers to the setting where a pretrained VLM is turned into a classifier without supervised learning [18], while few-shot learning involves fine-tuning with a few labeled samples. Importantly, zero/few-shot learning does not depend on how the VLM is pretrained; the VLM may have encountered samples from a specific category during pre-training, albeit without labels [18,31]. In this work, users can utilize Zero-Shot PromptSmooth if they have only a single test sample. If a few labeled samples are available, Few-Shot PromptSmooth adapts the classifier to Gaussian noise using these samples. To further improve test-time robustness, Zero-Shot PromptSmooth can be used in conjunction with Few-Shot PromptSmooth, as discussed in the Method section.

3-Focus on Classification and Experiments: We highlighted the widespread adoption of MedVLMs across various medical imaging tasks in the first line of the introduction. However, as clarified in the Method section, our focus is on classification tasks, and we will clarify this further. For classification, we conduct extensive experiments against baselines on 6 datasets from 2 medical domains using 3 MedVLMs.

4-Typo and On Robustness: We will fix the typo. Regarding robustness, see point (1) above.
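
To make the few-shot adaptation concrete, below is a hypothetical CoOp-style sketch in PyTorch of learning prompt tokens on Gaussian-noised shots while the Med-VLM stays frozen. `image_encoder`, `text_encoder`, `class_token_embs`, and `few_shot_loader` are placeholder names for the frozen encoders, the per-class sequences of class-name token embeddings, and the few labeled samples; this illustrates the idea and is not the authors' code.

    import torch
    import torch.nn.functional as F

    n_ctx, dim, sigma = 4, 512, 0.25
    ctx = torch.randn(n_ctx, dim, requires_grad=True)   # learnable context tokens
    opt = torch.optim.SGD([ctx], lr=2e-3)               # only the prompt is trained

    def class_text_features():
        # Prepend the shared learnable context to every class-name embedding
        prompts = [torch.cat([ctx, cls_emb]) for cls_emb in class_token_embs]
        return torch.stack([text_encoder(p) for p in prompts])

    for images, labels in few_shot_loader:
        noisy = images + sigma * torch.randn_like(images)  # match smoothing noise
        img_f = F.normalize(image_encoder(noisy), dim=-1)
        txt_f = F.normalize(class_text_features(), dim=-1)
        logits = 100.0 * img_f @ txt_f.t()                 # scaled cosine similarity
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad(); loss.backward(); opt.step()

A zero-shot variant would instead update the prompt at test time from a single unlabeled sample (e.g., by minimizing prediction entropy over noisy views, a common test-time prompt-tuning objective).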

R4 1-Types of Adversarial Attacks: Unlike empirical defenses, a certifiably robust adversarial defense is agnostic to norm-based adversarial attacks (see the Global Comment).

2-On Fair Comparison: The baselines Denoised Smoothing and Diffusion Smoothing are model-agnostic and can be prepended to any classifier, including VLMs. In contrast, PromptSmooth achieves certified robustness through learnable prompt tokens, an inherent characteristic of VLMs. We ensured a fair comparison by applying the baseline methods to the same MedVLM backbone as PromptSmooth in our experiments.
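
To illustrate the comparison setup described in point 2, a schematic of how the model-agnostic baselines attach to the same backbone; `denoiser`, `medvlm_zero_shot_classify`, and `medvlm_classify_with_learned_prompts` are placeholder names, and this is a sketch of the setup rather than the actual evaluation code.

    # Denoised/Diffusion Smoothing baseline: prepend a pretrained denoiser to
    # the *same* frozen Med-VLM zero-shot classifier, then certify the
    # composite with the usual randomized-smoothing procedure.
    def denoised_base_classifier(x):
        return medvlm_zero_shot_classify(denoiser(x))

    # PromptSmooth: the same frozen backbone, but with noise-adapted prompts,
    # so no extra module sits in front of the classifier.
    def promptsmooth_base_classifier(x):
        return medvlm_classify_with_learned_prompts(x)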




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I recommend rejection of this paper because I think the studied topic is not of major interest for medical image analysis in real-world clinical applications. The method may have some novel and technically appealing aspects. Some reviewers have challenged the flexibility of this method in practice, given that it is limited to Gaussian noise. The authors responded by referring to the work of Cohen et al., but that does not convince me, because that work shows that robustness to Gaussian noise implies robustness to other perturbations under the L2 norm. Why should adversarial attacks be constrained under the L2 norm?




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces PromptSmooth, a new technique designed to enhance the robustness of medical Vision-Language Models (VLMs). This method employs randomized smoothing to enhance prediction accuracy on images perturbed with Gaussian noise through the use of learned text prompts. The study outlines two distinct approaches to learning prompts: one tailored for few-shot learning scenarios and another for zero-shot contexts. The proposed technique of certifying robustness via learning prompts stands out for its novelty, simplicity in implementation, and computational efficiency. The effectiveness of this method is well demonstrated through experiments, and the manuscript is structured in a manner that is easy to follow.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I would recommend acceptance of this work, as it proposes a novel approach for efficiently adapting a zero-shot classifier based on a medical Vision-Language Model (Med-VLM) for adversarial robustness certification through prompt learning. The topic has novelty and will raise interesting discussion in the community.



