Abstract

Improving the fairness of federated learning (FL) benefits healthy and sustainable collaboration, especially for medical applications. However, existing fair FL methods ignore a specific characteristic of medical FL applications: domain shift among the datasets from different hospitals. In this work, we propose Fed-LWR to improve performance fairness from the perspective of feature shift, a key issue caused by domain shift that degrades the performance of medical FL systems. Specifically, we dynamically perceive the bias of the global model across all hospitals by estimating the layer-wise differences in feature representations between the local and global models. To minimize global divergence, we assign higher weights to hospitals with larger differences. The estimated client weights are then used to re-aggregate the local models per layer, yielding a fairer global model. We evaluate our method on two widely used federated medical image segmentation benchmarks. The results demonstrate that our method achieves better and fairer performance than several state-of-the-art fair FL methods.
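As a rough illustration of the idea described above (not the paper's exact formulation), the layer-wise re-weighting can be sketched as follows: compute linear CKA similarity between each client's and the global model's feature representations at every layer, then turn the dissimilarities into per-layer client weights. The function names and the normalization scheme here are our own assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA similarity between two feature matrices of shape
    # (n_samples, n_features); returns a value in [0, 1].
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    hsic = np.linalg.norm(Y.T @ X, "fro") ** 2
    norm_x = np.linalg.norm(X.T @ X, "fro")
    norm_y = np.linalg.norm(Y.T @ Y, "fro")
    return hsic / (norm_x * norm_y)

def layer_weights(local_feats, global_feats):
    # local_feats: per-client list of per-layer feature matrices;
    # global_feats: per-layer feature matrices of the global model.
    # Larger feature difference (lower CKA similarity) -> larger weight,
    # normalized over clients so each layer's weights sum to 1.
    diffs = np.array(
        [[1.0 - linear_cka(lf, gf) for lf, gf in zip(client, global_feats)]
         for client in local_feats]
    )  # shape: (n_clients, n_layers)
    return diffs / diffs.sum(axis=0, keepdims=True)
```

The resulting `(n_clients, n_layers)` matrix would then drive a per-layer weighted average of the clients' parameters, rather than the single per-client weight used by FedAvg-style aggregation.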

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1953_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/IAMJackYan/Fed-LWR

Link to the Dataset(s)

https://liuquande.github.io/SAML/

https://github.com/emma-sjwang/Dofe

BibTex

@InProceedings{Yan_ANew_MICCAI2024,
        author = { Yan, Yunlu and Zhu, Lei and Li, Yuexiang and Xu, Xinxing and Goh, Rick Siow Mong and Liu, Yong and Khan, Salman and Feng, Chun-Mei},
        title = { { A New Perspective to Boost Performance Fairness For Medical Federated Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a layer-wise re-weighting method, namely Fed-LWR, to improve the performance fairness of medical federated learning (FL) methods (i.e., pursuing high average performance while keeping the performance deviation across the clients in the federation low). Compared with vanilla FL methods, the main contribution of this work is the layer-wise re-weighting strategy, which aggregates the locally trained models layer by layer with dynamic weights computed from layer-wise centered kernel alignment (CKA) similarity; this is claimed to improve the FL model's performance fairness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and clearly presented, surpassing the majority of the papers in my review stack.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Difference (or relationship) between the fairness issue and the non-IID data (domain shift) issue in FL. It is unclear how the studied fairness issue differs from (or relates to) the well-known non-IID data (domain shift) issue in the context of FL. What is the exact definition of “fairness” in this context? Since the authors attribute the performance fairness issue to the feature shift or domain shift of the client datasets, many prior works targeting the domain shift (or non-IID data) issue (such as FedProx, FedDyn, and FedBN) could also work for the performance fairness issue. What is the advantage of the proposed method compared with the aforementioned methods?

    • Significance of FL performance fairness. I have a major concern regarding the significance of the fairness issue studied in this paper. Given the previous weakness, I deeply doubt whether the FL fairness issue has a cause distinct from the domain shift (or non-IID data) issue; in other words, they may actually be the same problem under different names. If so, the significance of this paper is seriously in question, since a large number of prior works already focus on the domain shift (or non-IID data) issue in FL.

    • Incomplete comparison of the proposed method with baselines. Given the similarity between the aforementioned domain shift (or non-IID data) issue and the fairness issue studied in this paper, the proposed method should be compared with baseline methods designed for dealing with non-IID data in FL. However, the current experimental setting only includes competing methods specially designed for handling the FL fairness issue.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Page 5, Section 3.1: It is unclear how many images are there in the two datasets used for evaluation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is generally presented in good quality. However, given the concerns regarding the foundation of study (such as the significance of the study of fairness in FL and its difference with the prior works), I tend to render a weak reject first on this paper and expect to see a solid rebuttal from the authors to change my recommendation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ efforts in addressing my concerns and comments. I still maintain my opinion that fair FL and non-IID FL are fundamentally solving the same problem: one chooses to train a single model (the global model) to fit all client data, while the other chooses to personalize the global model into multiple local models for better local performance. The reason I have concerns about the problem definition is that, nowadays, many papers try to create “new concepts” or “new problems” out of long-standing problems rather than solving them. This may complicate the existing problem while contributing little to its solution. As a consequence, I keep my score of 3 (Weak Reject) unchanged.



Review #2

  • Please describe the contribution of the paper
    1. The paper contributes to the field of federated learning in healthcare by proposing a Layer-Wise Re-weighting method (Fed-LWR) that addresses performance fairness across different hospitals. It dynamically adjusts the aggregation of local models based on layer-wise feature representation differences to mitigate the domain shift problems inherent in medical federated learning.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novelty in Addressing Feature Shifts: The paper introduces the Fed-LWR method that enhances fairness by dynamically weighing hospitals in the federated learning model based on the layer-wise feature representation differences. This approach specifically targets feature shifts, a common but under-addressed problem in medical FL applications.
    2. Layer-Wise Re-Aggregation: Unlike previous methods that assign a single weight to each client, this paper proposes a layer-wise re-aggregation strategy, which is a more granular and potentially more effective approach than whole-model weighting.
    3. Clinical Relevance: The research has a strong clinical application perspective, as it aims to make the federated learning models in healthcare more equitable across different institutions.
    4. Potential for Broad Impact: By focusing on a common challenge of domain shift in medical federated learning, the methodology has the potential to be extended to other domains facing similar challenges.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited Task Evaluation: The paper’s evaluation is primarily focused on image segmentation tasks, which may not fully demonstrate the method’s applicability across a wider range of medical tasks.
    2. High Variance in Results: Some experimental results exhibit high variance, which, when considering the mean plus variance, might indicate performance below that of baseline methods. Also, when the model has high variance, its performance might be inconsistent across different datasets or federated settings.
    3. Complexity in Implementation: The proposed method adds complexity to the federated learning process, which might be a barrier to practical implementation in real-world healthcare settings.
    4. Clinical Integration Unproven: While the method shows promise, the paper does not provide evidence of clinical integration or feasibility in actual healthcare workflows.
    5. Comparison with State-of-the-Art: There may be a lack of comparison with the very latest federated learning methods that have emerged since the referenced works, potentially leaving out important benchmarks, as only one baseline is newer than 2021.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Complexity Analysis: While innovative, the introduction of Fed-LWR adds algorithmic complexity. A formal complexity analysis would be valuable to understand the trade-offs between performance improvement and added computational load.

    Statistical Significance: To solidify the claim of superiority over existing methods, conducting statistical tests to verify the significance of the observed performance gains would be beneficial. This would provide a more robust framework for comparing the approach against baselines and help address the high-variance problem mentioned before.

    Potential Limits: A section discussing potential limitations and future research directions could provide a balanced view and guide subsequent efforts in this area.

    Clinical Integration: A detailed discussion of the method’s integration into clinical practice, perhaps through a pilot study or collaboration with medical practitioners, could highlight the work’s clinical translatability.

    In summary, key areas necessitate further clarification or improvement:
    1. High Variance in Results: The experimental results exhibit high variance, suggesting potential overfitting or instability in model performance across different datasets. This raises questions about the generalizability of the proposed method.
    2. Lack of Complexity Analysis: For practical deployment, it is essential to understand the trade-offs between the method’s benefits and the computational resources it requires; the paper does not currently address this.
    3. Need for Statistical Testing: The claims of superiority over existing methods would be strengthened by statistical tests confirming the significance of the performance improvements.
    4. Clinical Integration: The paper needs to detail how the Fed-LWR method could be integrated into clinical workflows, a crucial step for clinical translation and adoption.

    With these factors, my rating acknowledges the paper’s contributions. With appropriate responses to the concerns raised, particularly around variance, complexity, and statistical significance, the paper could be strongly positioned within the MICCAI community.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Innovation: The paper tackles the under-explored issue of performance fairness in FL by introducing a novel layer-wise re-weighting method (Fed-LWR). This represents a significant advancement in the field, addressing the domain shift problem in medical FL with a fine-grained approach.

    Methodological Rigor: The proposed method is empirically validated on two medical image segmentation benchmarks, showing improvements not only in fairness but also in accuracy. This dual improvement is particularly noteworthy and contributes to the paper’s strength.

    Relevance to Clinical Practice: The paper seeks to make FL applications more equitable across institutions, which has direct implications for health equity and the practical deployment of AI in healthcare.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper addresses reducing unfairness in federated learning by weighting the aggregation according to feature difference from the global model. This is motivated by the fact that the greater the feature difference between the local and global model, the less good a fit it is likely to be for the local data. The difference is calculated using CKA layerwise, to account for different degrees of feature difference across model depth. The approach is considered for segmentation, and applied to two segmentation datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The question of fairness in federated learning is of importance to the community and the use of feature difference is well motivated
    • The paper is well written and interesting to read
    • The approach is compared to a sensible range of comparison methods for two segmentation tasks
    • mean and std performance is reported
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The datasets only exhibit heterogeneity in terms of population distribution, not scanner differences, as would be common for real multi-site imaging data. Thus, it is unclear how the approach would perform in that setting.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    implementation details for replication are clearly stated

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Future work should consider scanner differences as discussed above
    • While admittedly quite different, it would be interesting to consider how the work relates to fairprune: https://arxiv.org/abs/2203.02110
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good paper, clearly presented results sufficient for a MICCAI publication

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Good paper, clearly presented results sufficient for a MICCAI publication, addressed my concern in the rebuttal




Author Feedback

We sincerely thank R1 and R5 for appreciating our clear writing and R3 for agreeing with the contribution of our work. We address the reviewers’ concerns below.

Q1: Scanner Difference [R1]: Thanks for your suggestion. The datasets we used actually involve this setting: the four clients of RIF use 4 different scanners.

Q2: Fairprune [R1]: Thanks for your suggestion. Although this work indeed differs significantly from ours, the insight of pruning by saliency is very interesting; we will discuss it in the revised manuscript and explore its feasibility for FL in future work.

Q3: High Variance in Results [R3]: There may be some misunderstanding. Avg. and Std. are the average and standard deviation of the model performance across all clients; therefore, we cannot simply add them together to evaluate a method. The goal of performance fairness in FL (see Definition 1) is to maintain a high Avg. while achieving a low Std. Thus, the results on the two datasets demonstrate that Fed-LWR achieves better and fairer performance than existing methods.

Q4: Lack of Complexity Analysis [R3]: We quantify the complexity of different methods by training time per round [3]; the results on ProstateMRI are as follows: FedAvg (239s) < Ditto (305s) < qFedAvg (322s) < CFFL (341s) < Fed-LWR (383s) < FedCE (394s) < CGSV (487s). This indicates that the training overhead introduced by Fed-LWR’s fairness mechanism is acceptable, considering its superior performance.

Q5: Statistical Testing [R3]: Following your suggestion, we conducted two additional independent experiments on ProstateMRI using different random seeds for all methods. We then ran paired t-tests between the three-trial results of all baselines and our Fed-LWR. The p-values for the Avg. and Std. of all methods are below 0.05, demonstrating the statistical significance of the improvements yielded by Fed-LWR over previous methods in terms of both accuracy and fairness.

Q6: Clinical Integration [R3]: Thanks for the comments. We plan to deploy Fed-LWR across multiple cooperating hospitals to collaboratively train a breast ultrasound lesion segmentation model. The fairness mechanism of Fed-LWR will ensure that the model performs fairly at all hospitals, promoting long-term, healthy collaboration.

Q7: Relationship and Difference [R5]: In FL, fairness issues are associated with the non-IID data problem, as different data distributions can lead to unfair model performance [16]. However, fair FL and non-IID FL are two different research directions serving diverse user needs, i.e., better fairness vs. better accuracy. Non-IID FL focuses on addressing the performance degradation caused by non-IID data; such methods may overfit local data to improve performance (high Avg.), which can decrease the fairness of the model (high Std.). This conflicts with the requirement of fair FL, i.e., high Avg. and low Std. Finally, we must clarify that the goal of Fed-LWR is not to address the domain shift issue; instead, it perceives feature differences to adjust client weights and achieve better global consideration, thereby improving the fairness of the global model.

Q8: Advantages [R5]: Compared with the aforementioned methods, Fed-LWR is specifically designed to address fairness issues. This ensures that the model performs fairly across different clients and better meets users’ requirements for performance fairness.

Q9: Significance of Fair FL [R5]: Fairness is a crucial evaluation criterion for FL systems, and previous studies [14,16] have confirmed its importance. An unfair FL system can diminish users’ enthusiasm, thereby hindering sustainable and healthy cooperation.

Q10: Comparison [R5]: We compared Fed-LWR with FedProx and FedBN on the ProstateMRI dataset; their results (Avg., Std.) are as follows: FedProx (87.27, 3.78) and FedBN (91.62, 5.41). Our Fed-LWR outperforms them in terms of both accuracy and fairness, demonstrating its superiority.
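The paired t-test procedure described in Q5 can be sketched in a few lines. The numbers below are made-up illustrative values over three seeded runs, not the paper's results, and `paired_t` is our own helper name.

```python
import math
import statistics

def paired_t(a, b):
    # Paired t statistic for two matched samples of equal length:
    # mean of the pairwise differences divided by its standard error.
    d = [x - y for x, y in zip(a, b)]
    mean_d = statistics.mean(d)
    sd_d = statistics.stdev(d)  # sample standard deviation of differences
    return mean_d / (sd_d / math.sqrt(len(d)))

# Hypothetical Avg. Dice over three seeded runs (illustrative values only).
fed_lwr = [92.0, 91.8, 92.3]
baseline = [87.1, 87.4, 87.3]
t = paired_t(fed_lwr, baseline)
# With n = 3 runs (df = 2), the two-sided 5% critical value of the
# t-distribution is about 4.303, so |t| above that implies p < 0.05.
```

In practice one would use a library routine (e.g., a paired t-test from a statistics package) that also reports the p-value directly.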




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper addresses reducing unfairness in federated learning by weighting the aggregation according to feature difference from the global model. The reviewers are generally in favor of the paper, especially with major concerns addressed by the rebuttal. The authors shall carefully address the remaining concerns in their final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


