Abstract
Large Vision-Language Models (VLMs) capture rich multimodal knowledge through pretraining and demonstrate remarkable performance across various tasks. However, adapting these foundation models to medical image analysis through fine-tuning faces significant challenges, including constrained computing resources, data privacy concerns, and data heterogeneity. Federated Parameter-Efficient Fine-Tuning (PEFT) emerges as a promising solution, enabling multiple clinical institutions to collaboratively fine-tune VLMs with a small number of parameters. However, it still suffers from data heterogeneity across clients and high training memory requirements. In this work, we propose a personalized Federated Side-Tuning (pFedST) method. Specifically, we equip each client with a frozen pre-trained CLIP model and a lightweight, learnable, personalized side network for fine-tuning. Only a portion of the side network parameters participates in model aggregation, while the personalized LoRA modules within the side network address data heterogeneity with minimal additional parameters. Extensive experiments demonstrate that pFedST consistently outperforms 12 state-of-the-art methods across two real multi-center medical image classification tasks.
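Since the code repository is not available, here is a minimal, illustrative sketch of the general LoRA idea the abstract relies on: a frozen pretrained weight augmented with a trainable low-rank update. All sizes, the rank, and the scaling are placeholders, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def lora_linear(x, W, A, B, alpha=1.0):
    """Frozen base weight W plus a trainable low-rank update B @ A.

    x: (batch, d_in); W: (d_out, d_in), frozen; A: (r, d_in); B: (d_out, r).
    alpha / r scales the low-rank branch, following common LoRA practice.
    """
    r = A.shape[0]
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 8, 6, 2
W = rng.standard_normal((d_out, d_in))   # pretrained weight, never updated
A = rng.standard_normal((r, d_in))       # LoRA down-projection
B = np.zeros((d_out, r))                 # LoRA up-projection, zero-initialized

x = rng.standard_normal((4, d_in))
# With B = 0 the LoRA branch is inactive, so the output equals the frozen layer's.
assert np.allclose(lora_linear(x, W, A, B), x @ W.T)
```

Because B starts at zero, training begins exactly at the pretrained model and only the small A and B matrices need to be learned or communicated.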
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0548_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
FedISIC dataset: https://github.com/owkin/FLamby
FedDRG dataset: https://github.com/chehx/DGDR
BibTex
@InProceedings{CheJia_Personalized_MICCAI2025,
author = { Chen, Jiayi and Ma, Benteng and Pan, Yongsheng and Pu, Bin and Cui, Hengfei and Xia, Yong},
title = { { Personalized Federated Side-Tuning for Medical Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {464 -- 474}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a personalized Federated Side-Tuning method (pFedST), utilizing a frozen CLIP backbone network and a lightweight side network with personalized LoRA modules. This method aims to address data heterogeneity and high memory usage in federated environments by combining a globally shared model with client-specific adaptive techniques.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Introducing personalized LoRA modules in the side-tuning structure is innovative within the context of federated vision-language models.
- Achieving gradient isolation from the main CLIP backbone network through side-tuning helps reduce memory usage.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The Personalized Side Network is primarily implemented through multi-head self-attention and weighted aggregation, leading me to suspect that this is merely an effect of module stacking, without theoretical discussion of its purpose.
- Is the so-called personalization merely local parameter isolation? This seems inconsistent with personalized federated learning methods, and I need a reasonable explanation.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper proposes pFedST for personalized federated fine-tuning in medical image classification, but the method lacks overall novelty and has limitations in addressing the challenges. The paper’s logic is not strong enough, and the language expression needs further improvement.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
- The paper proposes a novel personalized federated PEFT model to address data heterogeneity and reduce training memory requirements.
- The paper compares the proposed method against a wide range of SOTA approaches in terms of memory efficiency and performance on two real-world multi-center medical image classification datasets, achieving new SOTA results with lower memory usage.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-written and generally easy to follow.
The proposed problem—fine-tuning vision-language models (VLMs) in federated settings for medical image classification—is novel. To the best of my knowledge, the architectural design, which introduces LoRA layers in the down-projection and output layers of transformers to address data heterogeneity across clients, is novel. The training loss formulated in the paper is sound and well-suited to the personalized federated learning setting. The effectiveness of the proposed approach is thoroughly validated through the experiments and ablation studies. The method achieves new state-of-the-art performance on benchmark datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper is overall well-written; however, some parts of the presentation could still be improved for clarity. For example, in Fig. 1(c), it is unclear whether the down-projection in the middle layers requires backpropagation. Additionally, in both Fig. 1 and Fig. 2, the pMHSA block contains both personalized and global components, yet it is colored as the “Personalized Side-Tuning”. This might cause confusion for readers. Improving the clarity and consistency of Figures 1 and 2—especially in alignment with the equations in Section 2.1—would help readers better understand the proposed architecture. Furthermore, since the paper lacks a related work or preliminary section, a brief overview or summary of the overall federated learning algorithm would be beneficial to help readers quickly grasp the overall context and motivation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend acceptance of this paper due to its novelty, sound methodological contributions, and strong empirical validation. The paper addresses an important and timely problem—fine-tuning vision-language models in federated settings for medical image classification—which, to the best of my knowledge, has not been thoroughly explored in the literature. The architectural design, including the use of LoRA layers to mitigate data heterogeneity, is original and well-motivated. Furthermore, the proposed training objective is appropriate for the personalized federated learning context. The authors provide comprehensive experimental results and ablation studies across real-world multi-center datasets, demonstrating clear improvements over existing state-of-the-art methods. Overall, the paper makes a meaningful contribution to both federated learning and medical imaging communities.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes a federated parameter-efficient fine-tuning method, pFedST, for medical image analysis. Instead of using prompt-based or adapter-based methods, pFedST incorporates several multi-head self-attention modules as a side network to enhance the VLM's performance on medical image classification tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper proposes a novel federated learning method to leverage pre-trained VLMs. It uses multi-head self-attention (MHSA) as a learnable module for domain knowledge learning and, additionally, personalizes partial parameters in the MHSA to improve personalized performance. The design is interesting and efficient.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Regarding the rationale behind the proposed contrastive loss (L_cont): It is unclear why the global image features and personalized image features are treated as negative pairs. These features should capture complementary aspects of the data rather than being pushed apart. Intuitively, personalized image features should also be drawn closer to the corresponding text features, as they provide additional, individualized context that can enhance alignment with the ground truth. It would be nice if the paper provided additional theoretical analysis.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The architecture design and the experimental results justify my recommendation.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers and ACs for their time, effort, and recognition of our work. Below are our responses to their insightful suggestions and remaining concerns.
R2Q1: Clarification on Fig. 1 & 2 (1) We will update Fig. 1(c) to clarify that the down-projection in the middle layers requires backpropagation. (2) We will revise the legend in Fig. 1 and add one to Fig. 2 to avoid confusion. Specifically, yellow blocks are labeled as “global side-tuning” in Fig. 1, denoting global components across clients, while orange blocks indicate personalized ones. The combination of global and personalized parts forms the personalized side block.
R2Q2: Overview of FL methods We will include a brief overview of FL in the introduction, as follows: Federated learning enables clinical institutions to collaboratively train a global model without exchanging patient data. It typically involves three stages: local model training on private data, global model aggregation, and distribution of the updated global model to all clients.
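The three stages described above can be sketched as one communication round of a FedAvg-style loop. This is a toy illustration with 1-D "models" as lists of floats; the `local_update` rule is a hypothetical stand-in for each client's local training.

```python
# One federated round: local training -> size-weighted aggregation -> redistribution.

def local_update(global_model, client_data):
    # Hypothetical local step: nudge each weight toward the client's data mean.
    mean = sum(client_data) / len(client_data)
    return [w + 0.1 * (mean - w) for w in global_model]

def aggregate(client_models, client_sizes):
    # Weight each client's model by its local dataset size (FedAvg-style).
    total = sum(client_sizes)
    dim = len(client_models[0])
    return [sum(m[i] * n for m, n in zip(client_models, client_sizes)) / total
            for i in range(dim)]

global_model = [0.0, 0.0]
clients = {"A": [1.0, 1.0, 1.0], "B": [3.0]}

# Stage 1: each client trains locally on private data (raw data never leaves).
local_models = [local_update(global_model, d) for d in clients.values()]
# Stage 2: the server aggregates only the model parameters.
global_model = aggregate(local_models, [len(d) for d in clients.values()])
# Stage 3: the updated global model is sent back to all clients for the next round.
```

Only parameters cross the network in stage 2, which is what preserves patient privacy relative to pooling the raw data centrally.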
R3Q1: Rationale behind contrastive loss (1) To improve disease diagnosis, both personalized and global image features should be aligned with text features of ground truth. In our approach, the alignment between personalized image features and their corresponding text features is enforced via the personalized cross-entropy loss defined in Eq. (7). Additionally, global image features are aligned with text features through contrastive learning, where they are treated as positive pairs and pulled closer in the feature space. (2) To enable personalized LoRA to better capture client-specific knowledge, we increase the dissimilarity between personalized and global image features by treating them as negative pairs and pushing them apart.
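The rationale in (1) and (2) can be made concrete with a small surrogate objective: pull global image features toward the ground-truth text features (positive pair) and push personalized image features away from the global ones (negative pair). This is an illustrative sketch in cosine-similarity form, not the paper's exact L_cont.

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two 1-D feature vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_objective(g_img, txt, p_img):
    """Surrogate for the stated rationale: minimize this to (a) align global
    image features with text features and (b) decorrelate personalized image
    features from global ones, so the LoRA branch captures client-specific
    knowledge not already present in the shared representation.
    """
    return -cos(g_img, txt) + cos(g_img, p_img)

g = np.array([1.0, 0.0])   # global image feature
t = np.array([1.0, 0.0])   # ground-truth text feature
p = np.array([0.0, 1.0])   # personalized image feature, orthogonal to g
```

Note that alignment of the personalized features with the text is handled separately (by the personalized cross-entropy loss in Eq. (7)), so pushing them away from the global features here does not sacrifice their diagnostic usefulness.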
R4Q2: Explanation of personalization Personalized federated learning methods address data heterogeneity by customizing the global model for each local client. Conventional approaches typically partition the network into global and personalized modules [R1]. The global modules, such as feature extractors [R1], are aggregated across clients, while personalized components, including batch normalization layers [R2], high-frequency components [R3], and classification heads [R4], remain client-specific. In our approach, the global MHSA modules are aggregated to preserve common knowledge among them, while the personalized LoRA modules remain local to handle data heterogeneity. This design aligns with the principles of personalized federated learning and enables each client to benefit from both shared knowledge and tailored adjustments. [R1] Personalizing Federated Medical Image Segmentation via Local Calibration. ECCV 2022. [R2] FedBN: Federated Learning on Non-IID Features via Local Batch Normalization. ICLR 2021. [R3] Personalized Retrogress-Resilient Framework for Real-World Medical Federated Learning. MICCAI 2021. [R4] Exploiting shared representations for personalized federated learning. ICML 2021.
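The global/personalized partition described above can be sketched as partial aggregation over named parameters: entries marked as global (a hypothetical "mhsa" prefix here) are averaged across clients, while "lora" entries never leave their client. The state-dict layout and names are assumptions for illustration.

```python
# Sketch of partial aggregation for personalized FL: average only the shared
# parameters; keep each client's personalized parameters local.

def aggregate_globals(client_states):
    n = len(client_states)
    # Average only the parameters designated as global (here: "mhsa" prefix).
    shared = {k: sum(s[k] for s in client_states) / n
              for k in client_states[0] if k.startswith("mhsa")}
    # Each client keeps its own personalized ("lora") parameters and
    # overwrites the shared ones with the aggregated values.
    return [dict(s, **shared) for s in client_states]

clients = [
    {"mhsa.w": 1.0, "lora.a": 10.0},
    {"mhsa.w": 3.0, "lora.a": 20.0},
]
updated = aggregate_globals(clients)
```

After the round, every client holds the same averaged "mhsa" weights (shared knowledge) but a distinct "lora" weight (client-specific adaptation), which is exactly the parameter-partitioning pattern of [R1]-[R4].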
R4Q1: Motivation of pFedST Although foundation models have demonstrated remarkable performance across diverse tasks, fully fine-tuning them across clinical institutions remains impractical due to data privacy constraints, limited computational resources, and high communication costs. To address these challenges, we propose a personalized side network with the following advantages: (1) The side network reduces the number of trainable parameters to 2.69% and decreases GPU memory usage to 53.69% compared to full fine-tuning, which enables feasible and efficient fine-tuning across clinical institutions. (2) The personalized side network comprises global multi-head self-attention blocks to capture shared knowledge across clients, and lightweight personalized LoRA modules to encode client-specific features. (3) During model aggregation, only the global MHSA modules are aggregated to preserve common knowledge, while the personalized LoRA modules remain private to facilitate effective customization.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A