Abstract

Data imbalance presents a significant challenge for the application of federated learning in medical image analysis. To address this challenge, we propose FedSDC, an innovative federated approach designed to effectively tackle the issue of data imbalance, as well as heterogeneity in distributed federated learning environments. The proposed FedSDC framework comprises a shared body network and multiple task-specific head networks. By incorporating a shuffle-diversity collaborative strategy, FedSDC effectively addresses data imbalance and heterogeneity challenges while improving cross-client generalization. Furthermore, training multiple heads under this strategy enables ensemble predictions, which enhances decision stability and accuracy. To balance efficiency and performance, FedSDC employs a sparse-head scheme during the inference phase. Extensive experiments on medical image classification tasks validate that FedSDC achieves state-of-the-art results under imbalanced and heterogeneous data conditions. The source code will be available at https://github.com/wpnine/FedSDC.
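As a rough illustration of the inference scheme the abstract describes (a shared body feeding an ensemble of sparsely selected heads), the architecture might be sketched as follows. The class name, the logit-averaging rule, and the PyTorch framing are assumptions for illustration, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class FedSDCEnsemble(nn.Module):
    """Illustrative sketch: shared body + ensemble over a sparse head subset."""

    def __init__(self, body: nn.Module, heads: list):
        super().__init__()
        self.body = body                   # shared feature extractor
        self.heads = nn.ModuleList(heads)  # one classifier head per client

    def forward(self, x, selected=None):
        # Run the shared body once, then average logits over the chosen heads.
        feats = self.body(x)
        idx = selected if selected is not None else range(len(self.heads))
        logits = torch.stack([self.heads[i](feats) for i in idx])
        return logits.mean(dim=0)

# Usage sketch: a toy body and five heads, ensembling only heads 0 and 2.
body = nn.Linear(4, 8)
heads = [nn.Linear(8, 3) for _ in range(5)]
model = FedSDCEnsemble(body, heads)
out = model(torch.randn(2, 4), selected=[0, 2])  # shape (2, 3)
```

Averaging logits from a subset of heads is one common way to trade ensemble accuracy for inference cost; which subset the actual method selects is determined by validation performance, per the paper.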

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3920_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/wpnine/FedSDC

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{GaoWen_ShuffleDiversity_MICCAI2025,
        author = { Gao, Wenpeng and Lan, Liantao and Liu, Yumeng and Wang, Ruxin and Fan, Xiaomao},
        title = { { Shuffle-Diversity Collaborative Federated Learning for Imbalanced Medical Image Analysis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {584--594}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a strategy to address class imbalance in federated learning. Specifically, the proposed method, Shuffle-Diversity, decomposes the backbone model (ResNet) into a feature extractor and a classifier head. The feature extractors from all clients are aggregated by the server using FedAvg, while the classifier heads are randomly shuffled and sent to other clients. In the end, the final model consists of the aggregated feature extractor and an ensemble of all classifier heads (one from each client).
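The round structure summarized above (FedAvg aggregation of the feature extractors, random reassignment of the classifier heads) can be sketched as follows. The function names and the flat parameter-dict representation of a model are illustrative assumptions, not the authors' code:

```python
import random

def fedavg(states):
    """Element-wise average of client parameter dicts (FedAvg)."""
    keys = states[0].keys()
    return {k: sum(s[k] for s in states) / len(states) for k in keys}

def shuffle_heads(heads, rng=random):
    """Randomly reassign classifier heads to clients via a permutation."""
    order = list(range(len(heads)))
    rng.shuffle(order)
    return [heads[i] for i in order]

def communication_round(client_bodies, client_heads):
    """One round: bodies are aggregated, heads are shuffled across clients."""
    global_body = fedavg(client_bodies)
    new_heads = shuffle_heads(client_heads)
    # Every client starts the next round from the shared global body
    # plus whichever head the shuffle assigned to it.
    return [dict(global_body) for _ in client_bodies], new_heads
```

Note that each head visits many clients over the course of training, which is what exposes it to diverse local decision boundaries; the final model then ensembles the accumulated heads on top of the aggregated body.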

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The idea, although simple, has the potential to evolve into a well-researched direction.
    • The paper is clearly written and easy to follow, and the code has been shared anonymously.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The main idea of randomly sending classifier heads is quite simple and lacks theoretical justification. Simply sharing the heads can lead to catastrophic forgetting—for example, when Head A is fine-tuned on Node B, it may completely forget its previous distribution. Furthermore, the proposed strategy does not achieve state-of-the-art results.

    • The results are rather incremental or on par with existing methods (Table 1) and do not strongly support the paper’s motivation. The method appears to perform well only in the NIID-1 setting, while improvements in other scenarios are negligible.

    • Why is FedRep, a method closely related to the proposed work, not included in the comparison? Moreover, the comparison with FedRep in the introduction seems unfair—FedRep is a personalized federated learning method that produces N models (one per client), whereas the proposed method outputs a single aggregated model. How were the results for FedRep computed in Figure 1? Was an ensemble used?

    • The ablation study explores the effect of activating the Diversity and Shuffle components of the proposed approach. While it is clear what Shuffle entails (randomly sending classifier heads to other clients), the meaning and implementation of Diversity remain unclear.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Dear authors,
    I appreciated the idea, but the paper in its current form is not yet ready for publication in a top-tier conference. I would recommend increasing the efforts on the results side, with more insights and deeper analysis to better support the proposed method.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I appreciated the efforts done in the rebuttal but I will keep my rating as I am still not convinced by the paper



Review #2

  • Please describe the contribution of the paper

    This paper proposes a federated learning framework that consists of a shared body network and multiple task-specific head networks. It introduces a shuffle-diversity collaborative strategy to address data imbalance and improve cross-client generalization. During inference, ensemble predictions from multiple heads are used to further enhance accuracy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-structured and clearly written. The motivation is logical, the methodology is well-explained, and the figures effectively support understanding.
    2. The experimental design is reasonable, and the authors conduct comprehensive experiments to validate their approach.
    3. The authors have released their code, which supports reproducibility and facilitates future research.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The proposed method feels incremental. Compared to classical FL personalization settings—where a shared body network is combined with personalized head networks—the main contribution here appears to be the shuffling of head networks during training and the ensemble of heads during inference. Both are commonly used techniques for performance improvement, and their combination does not bring significant methodological novelty.
    2. Regarding the shuffling strategy, which is highlighted as a main contribution, I have two concerns:
      • First, why does random head reassignment promote exposure to diverse decision boundaries rather than introduce out-of-distribution noise that could be harmful to the local client? Intuitively, if the randomly assigned head is too dissimilar to the client’s data distribution, it might degrade performance. How does the method mitigate this risk?
      • Second, I am concerned about the convergence behavior of the proposed approach. Could the randomness introduced by shuffling lead to instability or convergence failure during training?
    3. As discussed in the introduction, there is an inherent trade-off between generalization and personalization in federated learning. A common approach is to let the body network capture global, low-level features, while head networks handle client-specific personalization. In this paper, it is unclear why ensembling multiple personalized heads at inference improves performance. Given that the global body network is already in place, why would head models from other clients, which are trained on different data distributions, benefit a specific client during inference?
    4. In Table 2 (ablation study), the text mentions experiments on Diversity, Shuffle, and Ensemble modules, but the table itself appears to only report results for Diversity and Shuffle. It would be helpful to clarify this and explicitly define the experimental setups for both “Diversity” and “Shuffle” in a single sentence each.

    Minor issues:
      • Citation of the Matek-19 dataset is missing when it is first introduced in the introduction section.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodological contributions are relatively modest, and several aspects of the proposed modules remain insufficiently explained. In particular, I am concerned that the shuffle-head strategy may lead to instability in practical applications and potentially undermine personalization performance. While the paper is well-executed, the level of novelty is limited, placing it around the borderline of acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I am generally satisfied with the authors’ response. As I mentioned previously, while the paper may lack some degree of novelty or excitement, it is solid in terms of writing and experimental validation. Given that the authors have released their anonymized code, I am willing to raise my score to borderline accept, as the availability of code facilitates potential follow-up work.



Review #3

  • Please describe the contribution of the paper

    The paper proposes FedSDC, an innovative federated approach designed to effectively tackle the issue of data imbalance as well as heterogeneity in distributed federated learning environments. By incorporating a shuffle-diversity collaborative strategy, FedSDC effectively addresses data imbalance and heterogeneity challenges while improving cross-client generalization. Furthermore, training multiple heads in FedSDC enables ensemble predictions, which enhances decision stability and accuracy. The proposed FedSDC effectively blends the strengths of generalized and personalized federated learning.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper presents an original and well-designed shuffle-diversity collaborative strategy to mitigate the category imbalance issue in federated learning, which is both meaningful and highly relevant to practical applications. The theoretical derivations are rigorous, and the intuitive visualizations help to clearly illustrate the core concepts. Experimental results show that the proposed method delivers remarkable performance improvements over existing methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    This paper is both novel and practical. However, there are several limitations that should be addressed:

    1. The paper proposes a novel FedSDC framework that involves randomly allocating model heads. It would be valuable to discuss whether this random allocation might lead to significant performance variations due to different random seed settings.
    2. In Section 2.2, the authors mention that body networks are aggregated using FedAvgM. Providing a clear explanation of why FedAvgM was selected over the more commonly used FedAvg would help readers better understand the design choices behind the method.
    3. Also in Section 2.2, the description of how the method dynamically selects a sparse subset of heads for testing based on validation-phase micro-F1 scores could be made clearer with additional details.
    4. In Table 1, the performance of the recent federated learning method ISFL appears to be relatively poor. A brief explanation of these experimental results would improve the article’s data analysis and provide helpful context for readers.
    5. In Table 2, it seems that the ablation study results are based on the NIID-2 dataset. However, the performance of FedSDC in this table appears inconsistent with the results presented in Table 1. Clarifying this inconsistency would enhance the credibility and clarity of the experimental section.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    My overall score for this paper is based on its originality, practical relevance, and strong empirical results. The paper proposes a novel shuffle-diversity collaborative strategy that meaningfully addresses the category imbalance problem in federated learning. The theoretical analysis is thorough, the visual explanations are clear, and the proposed method is shown to outperform existing methods through comprehensive experiments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The paper proposes FedSDC, an innovative federated approach designed to effectively tackle the issue of data imbalance as well as heterogeneity in distributed federated learning environments. By incorporating a shuffle-diversity collaborative strategy, FedSDC effectively addresses data imbalance and heterogeneity challenges while improving cross-client generalization. Furthermore, training multiple heads in FedSDC enables ensemble predictions, which enhances decision stability and accuracy. The proposed FedSDC effectively blends the strengths of generalized and personalized federated learning. The methodology is innovative, the experiments are thorough, and the results are promising. However, the method requires a direct and detailed explanation of the impact of random number seeds on the method.




Author Feedback

We thank the reviewers for their valuable feedback. Our Responses (R) are as follows.

Reviewer#1(Q1) & Reviewer#2(Q1) & Reviewer#3(Q2-3): Questioned our method’s stability, theory, and performance, citing seed-dependent variance, forgetting from head sharing, distribution mismatch from shuffling, and unclear gains from ensembling. R: We clarify that: (1) Medical image data at each client naturally exhibits inter-center similarity, preventing severe forgetting. (2) Our FedSDC’s one-body-multi-head architecture supports stable convergence, as established by Theorem 1 in Reference 6. (3) We adopt momentum in local updates and server aggregation to improve stability, supported by Subsection 3.1 [1] and Theorem 1 [2]. (4) Head shuffling enables exposure to diverse decision boundaries, while ensemble inference with sparsely selected heads leverages their complementary knowledge to enhance robustness and generalization beyond local distributions, as shown in Fig. 1.
[1] Momentum Benefits Non-IID Federated Learning Simply and Provably. ICLR 2024.
[2] On the Linear Speedup Analysis of Communication Efficient Momentum SGD for Distributed Non-Convex Optimization. ICML 2019.

Reviewer#3(Q1): Main contributions (shuffling and ensemble) are incremental and lack novelty. R: We are the first to integrate both Shuffle and Diversity mechanisms into a one-body-multi-head architecture, resulting in a novel GFL model. Unlike traditional GFL frameworks, FedSDC leverages multi-head ensembles to deliver more robust predictions in non-IID medical image classification while fully exploiting the benefits of distributed training. Moreover, FedSDC’s ensemble strategy can be adapted to specific scenarios, offering greater flexibility and enhanced practical value.

Reviewer#2(Q2): Performance improvements are marginal and mainly limited to NIID-1. R: FedSDC performs strongly in both the NIID-1 and NIID-2 settings. Despite the saturated performance (nearly 90% micro-F1) of baselines in NIID-2, FedSDC’s additional 2% gain highlights its effectiveness even under class-missing and imbalanced conditions.

Reviewer#2(Q3): FedRep comparison is unclear and potentially unfair; ensemble details in Figure 1 are missing. R: FedRep was used in the Introduction as a motivational example to show the potential of Shuffle-Diversity Collaboration for improving generalization. In early tests, ensemble inference on FedRep yielded low micro-F1 scores (55.71% on NIID-1, 77.32% on NIID-2). Since FedRep is a PFL method producing per-client models, it was excluded from the main performance table.

Reviewer#1(Q2): Missing justification for using FedAvgM instead of standard FedAvg. Reviewer#1(Q3): Sparse head selection strategy needs clearer explanation. Reviewer#2(Q4) & Reviewer#3(Q4): Definitions and setup of Diversity and Shuffle modules need clarification. R: We will clarify sparse head selection, the Diversity and Shuffle modules, and our use of FedAvgM in the camera-ready version. We chose FedAvgM for its proven robustness and faster convergence under non-IID conditions (Reference 9). The Diversity module applies two linear layers separated by dropout, assigning each head a distinct compression ratio. The Shuffle module randomly assigns heads to different clients during training; in the Shuffle-only ablation, each head is a single linear layer.
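The Diversity module as described in this response (two linear layers separated by dropout, with a distinct compression ratio per head) could be sketched roughly as follows. The function name, dimension choices, the specific ratio schedule, and the dropout rate are assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

def make_diversity_head(feat_dim, num_classes, ratio, p=0.5):
    """Sketch of a Diversity head: Linear -> Dropout -> Linear, where the
    hidden width is compressed by a per-head ratio."""
    hidden = max(1, int(feat_dim * ratio))  # distinct compression per head
    return nn.Sequential(
        nn.Linear(feat_dim, hidden),
        nn.Dropout(p),
        nn.Linear(hidden, num_classes),
    )

# e.g. four heads with compression ratios 1/2, 1/4, 1/8, 1/16 (hypothetical
# schedule) on 512-d features from a ResNet-style body, for 10 classes.
heads = [make_diversity_head(512, 10, 1 / 2 ** (i + 1)) for i in range(4)]
```

Giving each head a different bottleneck width is one plausible way to realize the "distinct compression ratio" described above: structurally dissimilar heads are less likely to collapse to the same decision boundary, which is what makes their ensemble informative.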

Reviewer#1(Q4): Poor performance of ISFL in Table 1 lacks explanation and context. R: The ISFL’s weak performance may stem from its reliance on partial validation data for importance sampling and its use of FedAvg, which underperforms in non-IID settings compared to FedAvgM.

Reviewer#1(Q5): Inconsistency in dataset labels between Table 1 and Table 2 requires clarification. Reviewer#3(Q5): Missing citation for Matek-19 dataset in the introduction. R: We have addressed your feedback by correcting the NIID-2 label inconsistencies and adding the missing citation for the Matek-19 dataset.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have clearly clarified the issues raised by the reviewers. The explanations in the rebuttal look reasonable and correct to me.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviews exhibit a diversity of opinions regarding the novelty and experimental evaluation, though there is a consensus on the quality of the writing. The idea may appear simple, but this does not necessarily detract from its novelty. The authors addressed concerns regarding performance and comparisons in their rebuttal, and some of the reviewers’ major concerns have been resolved. However, reviewers remain somewhat unconvinced due to the empty shared code folder. Therefore, I recommend this paper for borderline acceptance.


