Abstract

Adapting pretrained vision-language models (VLMs) such as CLIP for medical image analysis in federated learning (FL) offers cross-modal insights while preserving privacy. However, effective cross-domain federated adaptation requires intensive fine-tuning and knowledge sharing, which is challenging in low-resource medical practice due to the divergence between the pretrained natural-image domain and medical imagery. Moreover, the significant statistical heterogeneity (non-IID) of medical data exacerbates these challenges. To address these issues, this paper introduces FedTCA, a framework that tames CLIP for non-IID federated medical image classification. It develops client-specific personalized models by reinforcing and constraining local cross-modal alignment, enabling the models to integrate client-specific and globally common knowledge. This approach not only addresses non-IID challenges but also optimizes the trade-off between performance and efficiency. Extensive experiments on real-world medical image datasets confirm the effectiveness and superiority of FedTCA.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0005_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{CheShe_Restyled_MICCAI2025,
        author = { Chen, Shengchao and Shu, Ting},
        title = { { Restyled, Tuning, and Alignment: Taming VLMs for Federated Non-IID Medical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {532 -- 542}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. This paper aims to adapt the CLIP model to medical diagnosis tasks via federated learning to preserve data privacy.
    2. A Prompt Restyling strategy is presented to improve the quality of prompts by adding task, image, and domain-specific information.
    3. Based on Optimal Transport, Twin Cross-domain Alignment is proposed to achieve better alignment between image patches and global/personalized text embeddings.
    4. Experiments on four public datasets are conducted to evaluate the effectiveness of the proposed method.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Experiment results on four datasets show the excellent performance of the proposed method.
    2. The clear framework diagram enhances the readability of the article.
    3. The writing of the main ideas is clear.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The writing lacks rigor. The article presents many arbitrary and unsubstantiated claims and does not provide references to support them. For example, the authors claim that skewed data distributions can cause these models to overemphasize simple patterns and neglect complex minority data, increasing local bias and diminishing generalizability. The authors claim that FACMIC still grapples with trade-offs between global knowledge and local personalization, potentially introducing decision biases.
    2. The rationality of Prompt Restyling. (1) The authors introduce domain-specific knowledge: [Dataset Description], [Task Description], [Input Sample Statistics], and [Task Difficulty]. Except for Input Sample Statistics, this knowledge is shared by all samples and is therefore not useful for classification. The relationship between Input Sample Statistics (mentioned in the article) and the target classes is unknown. (2) The design details of the prompt learners are missing. (3) According to FACMIC [24], domain shift usually occurs in the image features; however, the authors seem to assume it happens in the text embeddings.
    3. The motivation of Twin Cross-modal Alignment is not clear. The authors think that non-IID data across clients can lead to a learning bias in personalized models where simpler majority representations are favored over more complex minority ones. Hence, the authors propose Twin Cross-modal Alignment to achieve a balance between global and personalized knowledge. Why does achieving a balance between global and personalized knowledge enable the model to reduce learning bias?
    4. Ablation experiments cannot prove the effectiveness of the proposed submodules.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Motivations and details of the submodules, as well as ablation experiments, need to be enhanced.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Motivations of ideas are unclear.
    2. The experiments cannot highlight the effectiveness of the proposed submodules.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have clarified my concerns.



Review #2

  • Please describe the contribution of the paper

    This paper mainly explores how to reduce the bias between global and personalized knowledge when adapting the pretrained CLIP model to the medical domain. Specifically, it proposes an effective cross-domain adaptation strategy called Prompt Restyling, and introduces a Twin Cross-domain Alignment that effectively integrates both global and personalized insights. Extensive experiments on multiple datasets and comparisons with other methods demonstrate the effectiveness of this work.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper proposes an effective cross-domain adaptation strategy called Prompt Restyling, and introduces a Twin Cross-domain Alignment that effectively integrates both global and personalized insights.
    2. Extensive experiments on multiple datasets and comparisons with other methods demonstrate the effectiveness of this work.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Could the author explain in detail how the Prompt Generator and the corresponding Prompt Learner are implemented in the paper? Do they require manual intervention, and what is the optimization objective?

    2. If the goal is to ensure that each client can incorporate both global and personalized insights, is uploading only the Prompt Learner sufficient? Considering that the parameter count of the LoRA layers is also small, why not upload the LoRA parameters of the image or text encoder for aggregation?

    3. Since this is personalized federated learning and each client has a different model, how is performance compared in the end? Is it based on the best-performing client model, or the average performance across all client models? Does each client have its own corresponding test set?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Could the author explain in detail how the Prompt Generator and the corresponding Prompt Learner are implemented in the paper? Do they require manual intervention, and what is the optimization objective?

    2. If the goal is to ensure that each client can incorporate both global and personalized insights, is uploading only the Prompt Learner sufficient? Considering that the parameter count of the LoRA layers is also small, why not upload the LoRA parameters of the image or text encoder for aggregation?

    3. Since this is personalized federated learning and each client has a different model, how is performance compared in the end? Is it based on the best-performing client model, or the average performance across all client models? Does each client have its own corresponding test set?

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns are well addressed.



Review #3

  • Please describe the contribution of the paper

    The main contributions can be summarized into three parts: 1) they introduced a prompt restyling module to refine prompts with informative, context-aware, and domain-specific alternatives for better cross-domain adaptation. 2) They proposed Twin Cross-domain Alignment, an approach that achieves forward-looking visual-text alignment by effectively integrating personalized and global insights. 3) They evaluated FedTCA on four medical datasets (2D and 3D), comparing against 11 FL approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1.The paper is easy to follow, with clear pipelines, experimental results, etc.

    2.The idea is interesting.

    3.The experimental results show that FedTCA outperforms standard federated learning (FL) approaches such as FedAVG and recent parameter-efficient fine-tuning (PEFT) approaches such as FACMIC.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1.Please revise the paper carefully. I found some typos. For example, "Table 3" should be "Figure 3", and "Fig. 1" should be "Fig. 1 shows…".

    2.Clarification on prompt restyling. The paper states that the input sample has a minimum of 17.21, a maximum of 124.1, and a median of 59. Please clarify what 17.21, 124.1, and 59 represent. Also, the definition of task difficulty is straightforward. I suggest the authors develop a dynamic or adaptive strategy for assigning difficulty based on criteria such as adaptation effects or zero-shot test accuracy.

    3.The results shown in each Table are global test accuracy or averaged local client accuracy? Please clarify.

    4.Please provide the total loss function used in this study. Or is the $\mathcal{L}_{TCA}$ the only loss you used for training? (I found cross-entropy loss in the paper, but it is unclear.) Please clarify.

    5.(Not mandatory). Since the paper itself claims low computational/communication overhead, I would like to see the total computational and communication cost if possible.

    6.Suggestions for Fig.1. I suggest the authors use more distinct colors to represent each step (e.g. upload and download).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Please see weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend weak acceptance based on technical merit, comprehensive experiments and analysis.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author’s rebuttal addressed most of my concerns. After carefully reading the comments and rebuttals of the other reviewers, I prefer to maintain my score of “weak accept.”




Author Feedback

(R: Reviewer; W: Weakness; Res: Response)

R1@W1[Prompt design, training objective] Res: Prompt Generator creates structured prompts using a fixed template (descriptions, difficulty, input stats), with only minimal manual setup. Prompt Learner tokenizes these prompts and assigns a learnable vector to each token. It is jointly optimized via TCA to align visual and textual features, enabling efficient adaptation to medical domains.
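The two-stage design described above can be sketched in a few lines of plain Python. The template wording, field names, and the toy per-token vector table below are illustrative assumptions only; in the paper, the learnable vectors are optimized jointly via the TCA loss rather than sampled randomly.

```python
import random

def build_prompt(dataset_desc, task_desc, stats, difficulty, cls_name):
    """Prompt Generator sketch: fill a fixed structured template with the
    fields named in the rebuttal (wording here is hypothetical)."""
    return (f"[{dataset_desc}] [{task_desc}] "
            f"[stats: {stats}] [difficulty: {difficulty}] "
            f"a medical image of {cls_name}")

class PromptLearner:
    """Toy stand-in for the Prompt Learner: one vector per token of the
    restyled prompt. Random initialization here is a placeholder; the
    actual vectors would be trained to align text with visual features."""
    def __init__(self, prompt, dim=8, seed=0):
        rng = random.Random(seed)
        self.tokens = prompt.split()
        self.vectors = {t: [rng.gauss(0, 1) for _ in range(dim)]
                        for t in self.tokens}

    def embed(self):
        # One vector per token position (duplicate tokens share a vector).
        return [self.vectors[t] for t in self.tokens]
```

This only needs minimal manual setup (the per-dataset description strings); everything else is filled in automatically per class and per batch.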

R1@W2[Why only upload Prompt Learner] Res: Though lightweight, LoRA encodes client-specific semantics; uploading it may dilute personalization. The smaller Prompt Learner suffices for global alignment while minimizing communication and preserving privacy.

R1@W3[Evaluation] Res: All clients share the same model structure but maintain individualized weights. Performance is reported as the average Acc. across all clients, each evaluated on its own local test set, ensuring a comprehensive assessment under non-IID data.

R2@W1[Lacks rigorous claims] Res: We respectfully disagree. The tendency of models to overfit to dominant patterns under skewed data is a well-known issue in personalized FL. For example, when most clients contain common diseases (e.g., pneumonia) and only a few include rare ones (e.g., leukemia), local models often overfit to dominant patterns while neglecting underrepresented features, an issue noted in multiple works [Refs 3, 8, 22]. As for FACMIC, it adapts a unified global CLIP model without explicit personalization; thus, our remark about its trade-off between global generalization and local adaptability reflects its inherent design rather than an unsupported claim.

R2@W2[Prompt Restyling effectiveness and assumptions] Res: (1) CLIP’s classification relies on image-text similarity, so prompt semantics directly affect performance. While some components in our restyled prompts are shared, they encode essential medical-domain priors that improve text embedding quality and guide alignment. Input stats further introduce instance-level variability. (2) The Prompt Learner tokenizes the prompt and assigns a learnable vector to each token, enabling context-aware representation learning. (3) We do NOT assume domain shift in the text space; rather, prompts act as stable anchors to guide image-side adaptation, supported by LoRA in both encoders and by empirical performance.

R2@W3[Motivation of TCA] Res: TCA mitigates client-specific overfitting and improves generalization by aligning visual features with both global and personalized prompts via optimal transport—preserving local relevance while sharing semantics. This reduces non-IID bias (see Response to R2@W1), with consistent gains in Tabs. 1, 2 supporting its effectiveness.
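To illustrate the optimal-transport machinery behind such an alignment, here is a minimal entropy-regularized Sinkhorn solver in plain Python. The paper's exact TCA formulation is not reproduced on this page, so the uniform marginals, regularization value, and cost construction below are assumptions, not the authors' implementation.

```python
import math

def sinkhorn(cost, reg=0.1, n_iter=50):
    """Entropy-regularized optimal transport between uniform marginals.
    Returns a soft transport plan P over the given cost matrix."""
    n, m = len(cost), len(cost[0])
    # Gibbs kernel K = exp(-cost / reg)
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    a = [1.0 / n] * n  # uniform source marginal (e.g. image patches)
    b = [1.0 / m] * m  # uniform target marginal (e.g. prompt embeddings)
    v = [1.0 / m] * m
    for _ in range(n_iter):
        # Alternate scaling so rows match a and columns match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P_ij = u_i * K_ij * v_j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
```

In a TCA-like setting, `cost[i][j]` could be, say, one minus the cosine similarity between image-patch feature i and prompt embedding j; solving against both the global and the personalized prompt embeddings yields two plans whose combination balances shared and client-specific semantics.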

R2@W4[Ablation insufficient] Res: Each core component of FedTCA is supported by targeted experiments: LoRA and TCA via ablations (Tabs. 4, 5), and Prompt Restyling via superior performance over prompt-based FL methods (Tab. 2), together providing empirical support for our design.

R3@W1[Typos] Res: We’ll fix them.

R3@W2[Input stats] Res: The values 17.21, 124.1, and 59 denote the min, max, and median pixel intensities of the mini-batch, used to enrich prompt context. Task difficulty is currently based on non-IID severity, but we agree dynamic criteria are promising and worth future exploration.
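These batch statistics are straightforward to compute; a small sketch follows (the flat pixel-list representation of images is an assumption for illustration, and the example values simply reuse the numbers quoted in the rebuttal):

```python
from statistics import median

def batch_intensity_stats(batch):
    """Min / max / median pixel intensity over a mini-batch, as could be
    used to fill the [Input Sample Statistics] slot of a restyled prompt."""
    flat = [px for image in batch for px in image]
    return min(flat), max(flat), median(flat)

# A toy one-image "batch" using the intensities cited in the rebuttal.
lo, hi, med = batch_intensity_stats([[17.21, 59.0, 124.1]])
```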

R3@W3[Results type] Res: Results are reported as the average Acc. across all clients, each evaluated on its own local test set.

R3@W4[Clarify loss function] Res: Yes, the loss function is Eq.4, which reformulates CLIP contrastive loss via optimal transport, jointly optimizing alignment and classification without extra cross-entropy—consistent with CLIP-style training.

R3@W5[Computation/communication cost] Res: FedTCA updates only ~1.7% of the model parameters (1.46M vs. 86M) during local updating (LoRA + Prompt Learner), and transmits only the lightweight Global Prompt Learner (~0.08M, <0.1%) to the server per round.
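The quoted budget is easy to sanity-check; the figures below are taken from the rebuttal, and the exact rounding is mine:

```python
# Parameter counts reported in the rebuttal.
total_params = 86_000_000   # full CLIP backbone (~86M)
trainable = 1_460_000       # LoRA + Prompt Learner (~1.46M)
communicated = 80_000       # Global Prompt Learner per round (~0.08M)

trainable_frac = trainable / total_params    # ~0.017, i.e. ~1.7%
comm_frac = communicated / total_params      # ~0.0009, i.e. <0.1%
```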

R3@W6[Improve figure] Res: We’ll revise Fig. 1 with clearer color.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


