Abstract
Federated Learning (FL) is a distributed machine learning paradigm that enables collaborative model training across decentralized clients while preserving data privacy. In this paper, we revisit the stability of the vanilla FedAvg method under diverse conditions. Despite its conceptual simplicity, FedAvg exhibits remarkably stable performance compared to more advanced FL techniques. Our experiments assess the performance of various FL methods on blood cell and skin lesion classification tasks using a Vision Transformer (ViT). Additionally, we evaluate the impact of different representative classification models and analyze sensitivity to hyperparameter variations. The results consistently demonstrate that, regardless of the dataset, classification model, or hyperparameter settings, FedAvg maintains robust performance. Given its stability and robust performance without the need for extensive hyperparameter tuning, FedAvg is a safe and efficient choice for FL deployments in resource-constrained hospitals handling medical data. These findings highlight the value of vanilla FedAvg as a reliable baseline for clinical practice.
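For reference, the aggregation rule at the core of vanilla FedAvg can be sketched as follows. This is a minimal PyTorch illustration with placeholder names, not the authors' released implementation (see the code repository linked below for that); it assumes each client returns a model state_dict and its local training-set size.

import copy
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Vanilla FedAvg: sample-count-weighted average of client state_dicts.

    client_states: list of state_dicts returned by clients after local training.
    client_sizes:  number of training samples held by each client (the weights).
    """
    total = float(sum(client_sizes))
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            state[key].float() * (n / total)
            for state, n in zip(client_states, client_sizes)
        )
    return global_state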
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3882_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/yjlee22/vanillafl
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LeeYou_Revisit_MICCAI2025,
author = { Lee, Youngjoon and Gong, Jinu and Choi, Sun and Kang, Joonhyuk},
title = { { Revisit the Stability of Vanilla Federated Learning Under Diverse Conditions } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15973},
month = {September},
pages = {541--550}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper systematically evaluates the stability of the classic federated learning method FedAvg across various models, data heterogeneity, and hyperparameter settings, finding that it remains robust without requiring complex tuning, highlighting its practical value as an efficient and reliable baseline in medical scenarios.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper explores a highly practical yet insufficiently investigated question—whether simple vanilla FedAvg is sufficient to support federated learning in real-world medical applications, particularly in resource-constrained environments.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The work lacks methodological innovation, relying on benchmarking with FedAvg. Although the results are valuable, it lacks innovative contributions.
- The experimental work is overly simplistic, and I don’t believe it qualifies for submission to this conference.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(1) Strong Reject — must be rejected due to major flaws
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
FedAvg is a foundational algorithm in federated learning, and the empirical study reiterating its stability and robustness is useful and commendable. However, such work lacks innovation; could further reflection on FedAvg, combined with practical medical scenarios, lead to more insightful ideas?
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
As stated by the authors, benchmarking FL methods, model variants, and hyperparameter configurations can provide comprehensive guidance for medical FL, and we believe the open-sourced code will offer value to other researchers.
Review #2
- Please describe the contribution of the paper
The paper challenges the belief that complex methods with multiple hyperparameters are needed in federated learning to ensure fast and robust convergence. In particular, the paper addresses the question of evaluating the convergence speed and final model performance of vanilla FedAVG versus more modern and complex optimization techniques.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Major strength 1 The paper analyzes a wide range of federated optimization techniques. Moreover, the paper analyzes the impact of the underlying predictive model, an aspect which is often overlooked when evaluating different FL algorithms. Furthermore, the authors consider multiple values for the hyperparameters of different techniques, thus enabling a fair comparison (best versus best) while at the same time providing insight into the effect of these hyperparameters.
Major strength 2 This is a very personal consideration, but the paper really speaks to my own personal experience in applying federated learning. My empirical experience aligns with the authors’ conclusion, i.e. that the added complexity and hyperparameters of advanced federated optimization techniques are somewhat rarely compensated by any gains in performance and/or convergence improvements.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major weakness 1 While the authors have conducted extensive benchmarking in terms of the underlying model architecture, and over a large number of federated optimization strategies, the paper would have greatly benefitted from a larger number of benchmark datasets, especially considering that several federated-specific benchmarks are available. For example, part of the Flamby suite could have been used, or the FeTS challenge dataset. In my opinion, the lack of diversity in tasks (both experiments presented in the paper are classification tasks) is a major weakness. Along these lines, but of minor importance to me, is the choice of optimization strategies. I was surprised to find neither SCAFFOLD nor FedNOVA, which can be considered state of the art in the non-iid scenarios (at least, until a couple of years ago). I wouldn’t necessarily suggest including these in the paper, but I would have liked to see a short (one sentence) discussion on why they were not included.
Major weakness 2 The presented results lack any measure of variability. I understand that such measures may be less relevant when trying to prove a negative (it’s slightly more complicated to design a statistical test proving non-inferiority), but they’re still important and meaningful. The authors make a couple of statements such as “marginally better performance”, but without a measure of variability this language is difficult to interpret. Some cross-validation or repeated measures are sorely needed in this kind of paper to properly support the conclusions. Furthermore, the paper could do with a bit more detail on how the test sets were constructed and evaluated: is there a test set for each client? Or a single, centralized, held-out test set? How was it sampled? I would have loved to see a separate test set, from a different data source, but I don’t think that it is possible in this case.
Major weakness 3 The authors make an attempt to simulate heterogeneity (label skew only) by using a standard approach based on the Dirichlet distribution. However, with an alpha value of 0 and a much larger number of clients than the number of classes, I am uncertain how “representative” this heterogeneous distribution will actually turn out to be. Since the major weakness of FedAVG is in dealing with heterogeneous data, I would like to see this section greatly expanded, both with a study using different values for alpha as well as simulating different kinds of heterogeneity. In this direction, using a benchmark that was natively designed for federated learning, as I suggested above, could really help. This leads me to another point, which is that I have the impression that the test sets were randomly sampled in an i.i.d. way from the training sets. This means that the degree of heterogeneity during training is not reflected in the testing phase, which ultimately means that this paper is not fully testing the generalization capabilities of the different approaches.
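For context, the Dirichlet-based label-skew partition the reviewer refers to is typically implemented along the lines below. This is a generic sketch, not the authors' exact splitting code; the number of clients, the concentration parameter alpha (which must be strictly positive, with smaller values producing stronger skew), and the seed are placeholders.

import numpy as np

def dirichlet_label_skew(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with Dirichlet label skew.

    For each class, a Dirichlet(alpha) vector decides what fraction of that
    class each client receives; smaller alpha -> more heterogeneous clients.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions give the split points into the shuffled class indices.
        splits = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, splits)):
            client_indices[client_id].extend(part.tolist())
    return client_indices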
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Papers that evaluate the state of the art with extensive and meaningful benchmarking are sorely needed in federated learning. I really appreciated the sobering tone and message of this study. However, the purpose of such work should not be to find a silver-bullet solution for all scenarios, but rather to identify the nuances of each use case and guide researchers in making the correct choice of algorithm, best suited to their specific needs. This paper fails to do that by benchmarking on a small number of datasets/tasks (two, neither of which was split according to data sources, which is the most natural split when simulating FL) and by failing to properly account for all the different types of heterogeneity in FL scenarios.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
- The rebuttal, and 2 out of 3 reviewers, confirmed the impact of this work. All reviewers also agreed that the work is relevant and significant for the medical imaging community.
- The decision to exclude older but well-established aggregation methods (e.g., SCAFFOLD and FedNOVA) has been properly motivated. Even though I would like to see this comparison in future research with a broader scope (potentially a journal publication), I think the authors' motivation for not including them in this manuscript is acceptable.
- The limited number of benchmark datasets/tasks cannot be addressed in the rebuttal, as per MICCAI guidelines. As the authors also acknowledged, it would be interesting to see this in future research. Even though this remains a weakness of this study, I do not think it warrants rejection.
- The comment about lack of variability (and the related comment about statistical testing from reviewer 3) has been partially addressed by claiming that the scope of this work is more qualitative than quantitative. I accept this justification, but would recommend that the authors stress this point in their discussion or conclusions.
- The authors have improved some vague wording.
Review #3
- Please describe the contribution of the paper
The authors explore the performance of various FL methods across blood cell and skin lesion classification tasks. Their results show that while more advanced techniques outperform FedAvg in most scenarios, the relative stability of FedAvg across various hyperparameter settings makes it a robust choice for FL with medical images.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors conduct extensive empirical analysis across a wide spectrum of FL methods, from simpler methods (like FedAvg) to more advanced distributed optimization methods (like FedProx, etc.)
- The comprehensive analysis across various classification tasks, hyperparameter settings, model architectures, and IID vs non-IID settings strengthens the paper’s findings.
- The inclusion of time cost (in addition to accuracy) makes the motivation clear, as FedAvg is consistently faster due to its simplicity.
- The findings are highly relevant for medical imaging, where data privacy regulations necessitate FL for collaboratively training cross-institutional models.
- The paper reads very well and the figures/tables are well organized and clear.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not clear why the experiments were only performed on ViTs. A wide selection of CNNs and ViTs would provide a representative sample of commonly used architectures and help strengthen the paper’s conclusions.
- The description of experiment settings (section 3.1) is limited. Details such as hyperparameter settings, number of local epochs, and others should be included for reproducibility.
- Inclusion of statistical comparisons (e.g., t-tests) would help strengthen the findings.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Highlighting best top-1 test accuracy in Table 3 would improve clarity.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The extensive multifactorial analysis demonstrating FedAvg's stability compared to more advanced methods is a significant finding for applying FL principles to facilitate cross-institutional model training.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed my concerns, providing clarity behind their choice of only ViT (vs CNNs, etc.) and hyperparameter settings. I recommend acceptance, based upon my previously mentioned strengths and overall contribution to the community.
Author Feedback
First of all, we sincerely thank the reviewers for their overall positive evaluation and insightful feedback. We emphasize that, according to the rebuttal guidelines, "new/additional experimental results in the rebuttal are not allowed, and breaking this rule is grounds for automatic desk rejection. It is, however, allowed to amend the presentation of existing results."

Response to Reviewer 1: Thank you for your valuable comments. We selected the blood-cell and skin-lesion datasets, both commonly used in FL, and confined the study to classification tasks to ensure a consistent evaluation metric and highlight algorithmic gaps. However, we fully agree that incorporating additional datasets such as those provided by Flamby and FeTS could significantly enhance the quality and scope of our work. Although conducting new experiments during the rebuttal period is restricted, we will extend our work to include these datasets in future research. Regarding algorithm selection, while SCAFFOLD (ICML '20) and FedNOVA (NeurIPS '20) are indeed excellent methods, we specifically focused our analysis on more recent FL techniques such as FedSAM (ICML '22), FedSpeed (ICLR '23), FedSMOO (ICML '23), and FedGAMMA (IEEE TNNLS '24) to represent the latest advancements. Thank you for clearly pointing out the lack of variability measures. To address ambiguity in our descriptions, we revised vague statements such as "marginally better performance" to precise numerical metrics (e.g., "0.29% better accuracy and converged 1 round faster"). Additionally, we ensured reproducibility by using a fixed seed across all comparisons. Detailed settings will be fully available upon acceptance via code release. We sincerely appreciate your insightful suggestion regarding separate test sets from different data sources, as it indeed provides new research insights. We fully agree with your comments on simulating heterogeneity. While our current work primarily utilized label skew with a specific Dirichlet distribution, future research will explore a broader range of heterogeneity scenarios with FL-specific benchmarks, such as the FeTS dataset. To clarify, our training and test sets are distinct; training datasets were distributed among clients following a label skew approach, while predefined test sets remained centralized and untouched.

Response to Reviewer 3: We appreciate your thoughtful comments regarding the choice of model architectures. Our decision to focus on ViTs was primarily driven by their established effectiveness in medical AI classification tasks. However, as suggested, we agree that incorporating CNNs would significantly enhance the comprehensiveness of our findings and will consider these architectures in our future research. In addition, the number of local epochs used was 5, and we will publicly release detailed hyperparameter settings through our code repository upon acceptance. Lastly, we fully agree with the valuable recommendation on including statistical comparisons to strengthen our findings. While our current scope was qualitative, we plan to integrate statistical tests (e.g., t-tests) in future studies. Additionally, as per the optional comment provided, we have highlighted the best top-1 test accuracy results in Table 3.

Response to Reviewer 4: Thank you for your comment. In response, we highlight that our comprehensive benchmarking of FL methods, model variants, and hyperparameter configurations delivers practical guidance for clinical FL deployments, which the other reviewers recognized as a major strength.
By focusing on relevant, resource-constrained hospital scenarios, we demonstrated FedAvg’s practical benefits without complex tuning. While recent FL research focuses on performance and speed, our experiments show hyperparameter sensitivity can undermine these benefits in real-world medical applications. Through our work, we hope to raise awareness of this trade-off and advocate for simpler, more practical FL methods in healthcare.
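For readers reconstructing the setup described in the rebuttal (5 local epochs per round, a fixed random seed, and a single centralized held-out test set that is never partitioned across clients), one FedAvg round could look roughly like the sketch below. The optimizer, learning rate, and loss are illustrative assumptions rather than settings reported in the paper; the weighted-average step repeats the aggregation rule sketched after the abstract.

import copy
import torch
import torch.nn.functional as F

def run_fedavg_round(global_model, client_loaders, test_loader,
                     local_epochs=5, lr=1e-3, device="cpu"):
    """One round of vanilla FedAvg: local training on each client, sample-count-
    weighted averaging on the server, then accuracy on a centralized test set."""
    global_model.to(device)
    states, sizes = [], []
    for loader in client_loaders:          # each client holds its own (label-skewed) train loader
        local = copy.deepcopy(global_model)
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        local.train()
        for _ in range(local_epochs):      # 5 local epochs, per the rebuttal
            for x, y in loader:
                opt.zero_grad()
                F.cross_entropy(local(x.to(device)), y.to(device)).backward()
                opt.step()
        states.append(local.state_dict())
        sizes.append(len(loader.dataset))

    # FedAvg aggregation (same weighted average as in the earlier sketch).
    total = float(sum(sizes))
    new_state = copy.deepcopy(states[0])
    for key in new_state:
        new_state[key] = sum(s[key].float() * (n / total) for s, n in zip(states, sizes))
    global_model.load_state_dict(new_state)

    # Evaluate on the single centralized held-out test set.
    global_model.eval()
    correct = total_samples = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (global_model(x.to(device)).argmax(1) == y.to(device)).sum().item()
            total_samples += y.size(0)
    return correct / total_samples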
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers rated this submission positively, while another reviewer leaned towards rejection. The paper falls into the evaluation track, so I underweight the requirements for novelty. I believe it is beneficial to open source the code. This is a borderline acceptance paper in my batch. I raised another concern in confidential comments to the PC; please incorporate that into the PC’s final decision.