Abstract

Recent advancements in Medical Vision-Language Models (VLMs) have significantly improved performance on medical cross-modal tasks through large-scale contrastive pre-training. However, deploying these large models in clinical settings is hindered by their computational complexity and vulnerability to adversarial attacks. While knowledge distillation offers a solution by transferring knowledge to efficient student models, traditional methods usually ignore robustness, leaving the distilled models susceptible to adversarial attacks. To address these challenges, we propose a novel Dynamic Gradient and Hierarchical Feature Alignment framework (DGHFA) for robust knowledge distillation. Our approach introduces a dynamic gradient calibration mechanism for balanced knowledge transfer and a hierarchical adversarial feature alignment framework to enhance robustness under adversarial attacks. Extensive experiments on two medical VLMs and downstream pathology and X-ray datasets demonstrate that our method outperforms state-of-the-art approaches across multiple attack scenarios, achieving improvements of 2.3 and 1.7 percentage points in robust accuracy, respectively.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2324_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

CRC100K dataset: https://zenodo.org/records/1214456
RSNA dataset: https://www.kaggle.com/competitions/rsna-pneumonia-detection-challenge/data

BibTex

@InProceedings{XiaBoy_DGHFA_MICCAI2025,
        author = { Xiao, Boyi and Wu, Jianghao and Zhong, Lanfeng and Zou, Xiaoguang and Wu, Yuanquan and Wang, Guotai and Zhang, Shaoting},
        title = { { DGHFA: Dynamic Gradient and Hierarchical Feature Alignment for Robust Distillation of Medical VLMs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {141 -- 151}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript proposes a robust knowledge distillation framework tailored for medical foundation models. It introduces a dynamic gradient calibration mechanism and hierarchical adversarial feature alignment for balanced robust knowledge transfer.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a dynamically weighted gradient training strategy for robust feature alignment. The manuscript then presents hierarchical perturbation feature alignment for adversarial training.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed Hierarchical Perturbation Feature Alignment (HPFA) is not described clearly enough.
    2. The comparison methods are limited and date from CVPR 2021 and 2023. More recent methods should be compared.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript should be carefully revised before publication.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a method for robustly distilling knowledge from the original VLM image encoder (teacher network) to a smaller image encoder (student network), while retaining robustness against adversarial attacks. The method introduces a distillation loss that dynamically weights the training samples and a hierarchical feature alignment loss that trains to align the features of perturbed and clean samples. The results demonstrate improved performance compared to previous methods under four adversarial attacks.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is clearly written with a well-organized methods section. It provides a clear justification for introducing dynamic weighting and hierarchical feature alignment for robust distillation. The ablation study supports the benefits of the introduced components with evidence. The results are promising, achieving better performance than previous methods under different adversarial attacks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The paper jumps quickly to the problem without clearly justifying how, and to what types of, adversarial vulnerabilities VLMs are exposed, especially in clinical settings.

    2) Another concern is that the paper builds up the introduction by mentioning the requirement for efficient models and thus knowledge distillation from a larger teacher model to a smaller, more efficient model. However, the paper only focuses on the knowledge distillation of the image encoder. In a VLM, the text encoder has more parameters, yet the paper only distills the image encoder and does not provide a solid justification for not distilling the text encoder as well. Additionally, the method mentions distilling ViT-B16 and ViT Tiny to ResNet18. It would also make more sense to use a smaller student network like MobileNet.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) The paper mentions that previous methods require costly annotations to obtain perceptually aligned gradients, but it’s unclear why this is stated, as those methods also don’t collect annotations for perceptual priors.

    2) The paper mentions using ViT Tiny from CXR-CLIP. Does the author mean SwinTiny instead?

    3) The authors could improve the explanation of Equation 3, particularly how a higher risk results in assigning higher weights with τ, and why this is done with the teacher model. How does this translate to the robustness of the student model?

    4) Please report the values of γ and β used in the experimental section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper reads well, and the proposed method seems to work effectively in the different adversarial attack settings, outperforming previous results. One concern is regarding the practicality of this approach in a real-world setting. If only the image encoder is distilled, without considering the text encoder, and even if it’s only to ResNet18, how much does it justify the claim of making the VLM parameters efficient?

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns, and the paper can be further improved in the camera-ready version with additional clarity.



Review #3

  • Please describe the contribution of the paper

    The paper presents a Dynamic Gradient and Hierarchical Feature Alignment (DGHFA) framework that enhances the robustness and efficiency of knowledge distillation in medical Vision-Language Models by integrating dynamic gradient calibration, dual weighting strategies, and hierarchical adversarial feature alignment, demonstrating significant improvements in robust accuracy across two datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The content of the paper is fluent and easy to understand.
    • The motivation and scenario of this paper are sound, and transferring vision-language models trained on natural images to specialized medical scenarios is a timely and challenging problem.
    • In the experimental part, the performance of the proposed method is compared with SOTA methods on many metrics, which demonstrates the effectiveness of the designs.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • While the paper claims novelty in using teacher gradients for perceptual alignment, prior work (e.g., Perceptually Aligned Gradients [1]) has already established gradient-human perception relationships. The proposed dynamic gradient calibration appears to repurpose the teacher’s intrinsic gradients without fundamentally new mechanisms, differing mainly in avoiding annotations rather than introducing a novel alignment strategy.
    • The experiments use only pathology and X-ray datasets and lack validation on other medical imaging modalities such as MRI; the generalizability of the method across medical data is therefore not well established.
    • In the introduction, the authors do not introduce the shortcomings of all the comparison methods used in the experimental part.
    • In the experimental part, no specific value is given for γ in Eq. (3).
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The motivation of the work is clear. The design of the method provides an effective solution to the problem. The experimental results are better than the SOTA methods.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We are glad that the Reviewers appreciate the motivation (R3), novelty (R1&R2), writing (R2&R3), and experiments (R1&R2&R3). Our responses are as follows:

Description of HPFA (R1) Different from traditional adversarial training methods that use only a single degree of perturbation to obtain adversarial samples, we use multiple degrees of perturbation (sampled at different points on the adversarial generation path), leading to hierarchical perturbation feature alignment. A training-time-varying weighting scheme gradually increases the weights of stronger perturbations.
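A minimal sketch of how such hierarchical alignment could be implemented, assuming a PGD-style generation path and a linear weight schedule; the function names (pgd_path, hpfa_loss), the MSE alignment objective, and the schedule are illustrative assumptions, not the authors' implementation:

    import torch
    import torch.nn.functional as F

    def pgd_path(student, x, y, eps=8/255, alpha=2/255, steps=4):
        """Collect perturbed samples at every step of the adversarial
        generation path, from weak (early) to strong (final) perturbations."""
        x_adv = x.clone().detach()
        path = []
        for _ in range(steps):
            x_adv.requires_grad_(True)
            loss = F.cross_entropy(student(x_adv), y)
            grad = torch.autograd.grad(loss, x_adv)[0]
            x_adv = x + (x_adv.detach() + alpha * grad.sign() - x).clamp(-eps, eps)
            x_adv = x_adv.clamp(0, 1).detach()
            path.append(x_adv)
        return path

    def hpfa_loss(student, student_encoder, teacher_encoder, x, y, epoch_frac):
        """Align student features of hierarchically perturbed samples with the
        teacher's clean features; weights shift toward stronger perturbations
        as training progresses (epoch_frac goes from 0 to 1)."""
        with torch.no_grad():
            f_clean = teacher_encoder(x)
        path = pgd_path(student, x, y)
        k = len(path)
        weak = torch.linspace(1.0, 0.2, k)    # early training favours weak perturbations
        strong = torch.linspace(0.2, 1.0, k)  # late training favours strong ones
        w = (1.0 - epoch_frac) * weak + epoch_frac * strong
        w = (w / w.sum()).tolist()
        loss = x.new_zeros(())
        for wi, x_adv in zip(w, path):
            loss = loss + wi * F.mse_loss(student_encoder(x_adv), f_clean)
        return loss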

Compared methods (R1) Our current evaluation includes widely recognized adversarial distillation baselines (TRADES, RSLAD, AdaAD). While recent methods such as PeerAiD (CVPR 2024, DOI: 10.1109/CVPR52733.2024.02311) and SmaraAD (CVPR 2024, DOI: 10.1109/CVPR52733.2024.02323) offer valuable insights, they present limitations for medical VLM distillation: PeerAiD requires modifying the teacher model parameters, which does not directly align with the specific challenges of medical foundation model distillation; SmaraAD achieves 85.5% robust accuracy on CRC100K, still underperforming our method (86.3%). We will report these results in Table 1.

How and what type of adversarial vulnerabilities for VLMs (R2) We have revised the corresponding sentence in the Introduction to clarify that VLMs are vulnerable to adversarial perturbations that disrupt cross-modal alignment, particularly in clinical contexts where semantic precision is critical. As recent studies show, even small image perturbations can induce large semantic shifts in VLM outputs, posing safety risks in medical applications.

Distillation of the image encoder and student model (R2) We do not distill the text encoder because medical VLMs use fixed categories (10-100 classes), which allows text embeddings to be pre-computed once and reused, avoiding recomputation for every test image. For the student model, we used ResNet18 for fair comparison, and our method is also applicable to other lightweight models such as MobileNet.
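To make the pre-computation argument concrete, here is a small illustrative sketch (names such as cache_text_embeddings, zero_shot_predict, and the prompt strings are placeholders, not the authors' code): the large text encoder runs once offline, and inference touches only the distilled image encoder.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def cache_text_embeddings(text_encoder, tokenizer, class_prompts):
        """One-off pass through the (large, undistilled) text encoder; the
        result is cached, so the text encoder never runs at test time."""
        tokens = tokenizer(class_prompts)                 # e.g. ["an H&E image of tumor tissue", ...]
        return F.normalize(text_encoder(tokens), dim=-1)  # (num_classes, d)

    @torch.no_grad()
    def zero_shot_predict(student_image_encoder, images, cached_text_emb, scale=100.0):
        """Inference runs only the lightweight distilled image encoder; class
        scores are cosine similarities against the cached text embeddings."""
        v = F.normalize(student_image_encoder(images), dim=-1)  # (B, d)
        logits = scale * v @ cached_text_emb.t()                # (B, num_classes)
        return logits.argmax(dim=-1)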

Difference from Previous Perceptual Alignment Methods (R2&R3) Our method differs from previous works such as Perceptually Aligned Gradients (PAG) in both purpose and design. First, existing methods require auxiliary models (GANs/diffusion models) pretrained on annotated data to extract perceptual gradients, leading to a large annotation cost. Our approach is free of manual annotation, which is of more practical value. Second, PAG aligns a model's gradients to human perception, while we are the first to adapt gradient alignment to robust knowledge distillation from a large model (particularly a VLM) to a lightweight model. Third, unlike existing works that use a static weight, we introduce class-aware weighting and sample-specific adaptation for the gradient alignment.
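A minimal sketch of gradient alignment in a distillation setting, as we read the description above: the student's input gradient is pushed toward the teacher's with a per-sample weight. The function names (input_gradient, grad_align_loss), the cosine objective, and the weighting interface are assumptions for illustration, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def input_gradient(model, x, y, create_graph=False):
        """Gradient of the classification loss w.r.t. the input pixels."""
        x = x.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(x), y)
        return torch.autograd.grad(loss, x, create_graph=create_graph)[0]

    def grad_align_loss(student, teacher, x, y, sample_weight):
        """Push the student's input gradient toward the teacher's; sample_weight
        has shape (B,) and realizes the sample-specific adaptation."""
        g_s = input_gradient(student, x, y, create_graph=True)  # double backprop so the loss can train the student
        g_t = input_gradient(teacher, x, y).detach()             # teacher gradient serves as the target
        cos = F.cosine_similarity(g_s.flatten(1), g_t.flatten(1), dim=1)
        return (sample_weight * (1.0 - cos)).mean()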

Hyperparameters (R2&R3) γ in Eq. (3) is set to 0.1 and β in Eq. (6) is set to 2. These values were fixed across all experiments.

Clarification on ViT Tiny and Eq. 3 (R2) The ViT Tiny is indeed Swin Tiny. In Eq. 3, categories with higher adversarial risk (measured by the teacher's error on perturbed samples) automatically receive a larger τ_c, making the student focus on the most vulnerable classes. The teacher thus provides stable risk estimation when the student is not yet well trained. By aligning the student's training focus with the teacher's risk assessment, we improve the student's robustness, especially on the vulnerable classes.
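An illustrative sketch of this class-aware weighting, assuming the per-class risk is the teacher's error rate on perturbed samples and that γ = 0.1 acts as a softmax temperature; the names and the exact normalization are our assumptions, not a reproduction of Eq. (3).

    import torch

    @torch.no_grad()
    def class_risk_weights(teacher, x_adv, y, num_classes, gamma=0.1):
        """Per-class weight tau_c grows with the teacher's error rate on
        perturbed samples of class c; gamma sharpens the weighting."""
        err = (teacher(x_adv).argmax(dim=-1) != y).float()
        risk = torch.zeros(num_classes, device=y.device).scatter_add_(0, y, err)
        count = torch.zeros(num_classes, device=y.device).scatter_add_(0, y, torch.ones_like(err))
        r = risk / count.clamp(min=1)                        # adversarial risk per class
        tau = torch.softmax(r / gamma, dim=0) * num_classes  # higher risk -> larger tau_c
        return tau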

Dataset(R3) Although MRI/CT experiments are not included due to space constraints, we emphasize that the modality-agnostic design is a central aspect of our method, and we discuss this limitation transparently in the paper.

Introduce the defects of all the comparison methods (R3) We agree with the concern and will revise the sentence to explicitly highlight the limitations of TRADES, RSLAD, and AdaAD in terms of gradient-level alignment and dynamic robustness in VLM settings.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers lean toward accepting this paper. However, the authors are encouraged to polish the manuscript with clearer explanations of the motivations behind the proposed method, the detailed implementations of the modules in the methodology, and the definitions of the involved hyper-parameters, and to include more comparable SOTA methods and ablations on text-side distillation (if possible).


