Abstract

Federated Learning (FL) is an emerging approach to collaborative and privacy-preserving machine learning in which large-scale medical datasets remain localized at each client. However, data heterogeneity among clients often causes local models to diverge, leading to suboptimal global models. To mitigate the impact of data heterogeneity on FL performance, we start by analyzing how FL training influences FL performance, decomposing the global loss into three terms: local loss, distribution shift loss, and aggregation loss. Remarkably, our loss decomposition reveals that existing local training-based FL methods attempt to further reduce the distribution shift loss, while global aggregation-based FL methods propose better aggregation strategies to reduce the aggregation loss. Nevertheless, a comprehensive joint effort to minimize all three terms is currently limited in the literature, leading to subpar performance when dealing with data heterogeneity challenges. To fill this gap, we propose a novel FL method based on global loss decomposition, called FedLD, that jointly reduces these three loss terms. FedLD introduces a margin control regularization in local training to reduce the distribution shift loss, and a principal gradient-based server aggregation strategy to reduce the aggregation loss. Notably, under different levels of data heterogeneity, our strategies achieve better and more robust performance on retinal and chest X-ray classification compared with other FL algorithms. Our code is available at https://github.com/Zeng-Shuang/FedLD.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1348_paper.pdf

SharedIt Link: https://rdcu.be/dV55w

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_66

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1348_supp.pdf

Link to the Code Repository

https://github.com/Zeng-Shuang/FedLD

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zen_Tackling_MICCAI2024,
        author = { Zeng, Shuang and Guo, Pengxin and Wang, Shuai and Wang, Jianbo and Zhou, Yuyin and Qu, Liangqiong},
        title = { { Tackling Data Heterogeneity in Federated Learning via Loss Decomposition } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {707--717}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper addresses the challenging issue of non-i.i.d. data across clients in Federated Learning (FL) for medical imaging. The authors decompose the global loss function in FL into three components: (1) Local Loss, (2) Distribution Shift Loss, and (3) Aggregation Loss. The analysis explores how current FL training strategies impact these components, finding that local-training-based methods primarily reduce the Distribution Shift Loss, while global aggregation-based FL methods focus on minimizing the Aggregation Loss. The paper highlights a significant gap: there are no existing methods that simultaneously minimize all three loss components. To address this, the authors propose a novel approach, FedLD, which incorporates two additional strategies: (1) L2 regularization on the logits to mitigate Distribution Shift Loss and (2) Principal Gradient-based server aggregation to minimize Aggregation Loss. FedLD is evaluated on two medical datasets.
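
    For readers who want a concrete picture of the first strategy, a minimal sketch of how an L2 penalty on the output logits could be folded into a local training step is given below; the function name, margin_lambda, and the exact loss composition are our illustrative assumptions, not the authors' released implementation.

        import torch
        import torch.nn.functional as F

        def local_training_step(model, images, labels, optimizer, margin_lambda=0.01):
            # One local update with a logit-norm (margin control) penalty added
            # to the usual cross-entropy objective.
            optimizer.zero_grad()
            logits = model(images)                     # (batch, num_classes)
            ce_loss = F.cross_entropy(logits, labels)  # standard local loss
            # Penalizing large logit norms discourages overly large margins that
            # may be driven by client-specific shortcut features.
            margin_penalty = logits.pow(2).sum(dim=1).mean()
            loss = ce_loss + margin_lambda * margin_penalty
            loss.backward()
            optimizer.step()
            return loss.item()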

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces a novel idea to decompose the global loss function into three distinct components and provides an analysis framework of the challenges in current Federated Learning methods when it comes to minimizing the global loss function.
    • The authors propose a novel method to address gradient conflicts in Federated Learning by performing Singular Value Decomposition on the gradient matrix. They utilize the derived eigenvalues and eigenvectors to construct principal gradients, which form a principal coordinate system onto which the local gradients are then projected (an illustrative sketch of this idea follows this list). The method is theoretically well described.
    • The proposed method outperforms existing state-of-the-art approaches
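
    The sketch below illustrates the principal-gradient idea described in the second point: stack the flattened client gradients, take an SVD, keep the leading right singular vectors as a principal coordinate system, and project the local gradients onto it before averaging. Names such as num_principal and the simple uniform averaging are our assumptions; the paper's exact weighting may differ.

        import numpy as np

        def principal_gradient_aggregate(local_grads, num_principal=3):
            # local_grads: array of shape (num_clients, num_params), one
            # flattened gradient per client.
            G = np.asarray(local_grads)
            mean_grad = G.mean(axis=0)
            # Right singular vectors of the client-gradient matrix span the
            # principal coordinate system of the local updates.
            _, _, Vt = np.linalg.svd(G, full_matrices=False)
            V = Vt[:num_principal]                     # top-k principal directions
            # Singular vectors are sign-ambiguous; align each with the mean local
            # gradient so retained directions point toward decreasing loss.
            signs = np.sign(V @ mean_grad)
            signs[signs == 0] = 1.0
            V = V * signs[:, None]
            # Project each client gradient onto the principal directions, average,
            # and map back to parameter space, discarding conflicting components.
            coords = G @ V.T                           # (num_clients, num_principal)
            return coords.mean(axis=0) @ V             # aggregated global gradient

    As a usage example, with five clients the server could call principal_gradient_aggregate(np.stack(client_grads), num_principal=3) each round and apply the returned vector as its update.
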
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The use of L2 regularization on output logits to reduce spurious correlations in federated learning, a central idea in the paper, was previously proposed in [1]. [1] was not cited.
    • The paper frequently references Figure 2 to illustrate how their method reduces local training loss, distribution shift loss, and aggregation loss. However, this figure only depicts losses from a single round of Federated Learning. It lacks clarity on which round (beginning or end) is represented, and it does not show how these losses evolve throughout the entire training process. Moreover, it is unclear whether the depicted losses are from training, validation, or testing datasets.
    • The paper claims that “Remarkably, our loss decomposition reveals that existing local training-based FL methods attempt to reduce the distribution shift loss, while the global aggregation-based FL methods propose better aggregation strategies to reduce the aggregation loss.”. However, this claim is not supported by empirical evidence or theoretical analysis in the paper. Figure 2 only provides a comparison between “novel” local training/server aggregation and the authors’ method, without examining how advanced state-of-the-art methods specifically influence distribution shift loss or aggregation loss.
    • The decomposition of the loss function and its derivation process are not clearly explained. 
    • Despite categorising and introducing several state-of-the-art Federated Learning methods in the introduction, the experimental section compares the proposed method with different, previously unmentioned methods (FedBN, FedPAC, FedGH). A comparison with an effective “global aggregation” strategy is missing in the experiments. This difference makes it difficult to assess the effectiveness of the proposed method.
    • The paper does not mention the concept of personalized FL, which focuses on developing individualized models per client. Yet, it compares its results with personalized FL methods like FedPAC and FedBN (FedBN could be considered a form of personalized FL since it maintains client-specific batch normalization parameters).
    • The paper does not distinguish between distribution shifts in the label space and the feature space. This distinction is important for effectively addressing data heterogeneity in Federated Learning environments, especially when the key components of the experiments depend on the “heterogeneity” level of the data.

    [1] Nguyen et al. - FedSR: A Simple and Effective Domain Generalization Method for Federated Learning – NeurIPS 2022

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have uploaded an anonymized version of their code. They also provide download links for the datasets used in their experiments. Additionally, the authors detail the hyperparameters used in their study within the text.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The general idea presented in the paper, particularly the decomposition of the loss function into specific components, is interesting, especially the way the paper handles conflicting gradients in Federated Learning. However, the paper lacks empirical or theoretical evidence to support its claim that existing ‘local training’ methods primarily minimize the distribution shift loss and that efficient ‘global aggregation’ strategies mainly reduce the aggregation loss. It would strengthen the paper if experiments were conducted to verify these claims, especially since it categorizes these two main approaches.

    Additionally, more detailed information about the datasets used—such as the type of heterogeneity they exhibit (e.g., distribution shifts in feature space due to images being captured with different devices, or shifts in label space), the number of clinics from which the data originates, and label imbalances across clients—would enhance the understanding of the experiments. This information is important, since it is the focus of the paper, and it would be more useful than citing another paper that describes the datasets and how the level of heterogeneity is constructed.

    Including a figure that illustrates the evolution of the specific losses throughout an entire training run would strengthen the claim that the proposed method effectively minimizes all three components of the loss.

    A minor point: In the introduction, the citation of the paper ‘Representation Learning on Graphs: Methods and Applications’ appears unrelated. The connection between this citation and the topics of medical imaging and federated learning is unclear and confusing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, the paper tackles an important issue in Federated Learning and introduces an interesting framework to analyze this challenge. However, the experiments conducted to explore the effects of each loss component are limited—only one round of Federated Learning is shown for a single dataset. Furthermore, the paper lacks comparative analysis between the proposed method and state-of-the-art methods regarding the impact on each loss component, which is a main claim of the paper. Although the method reportedly improves upon existing state-of-the-art approaches, the paper does not convincingly demonstrate how the two introduced components—L2 regularization and principal gradient-based server aggregation—effectively minimize the respective loss components.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper focuses on federated training and parameter aggregation. The global loss is analyzed by dividing it into three distinct components: local loss, distribution shift loss, and aggregation loss. The proposed FedLD algorithm employs margin control (L2) regularization and a principal gradient-based server aggregation strategy: stable features are selected during local training to reduce the distribution shift loss, and during aggregation an eigen-decomposition is performed to weight the contributions of the collaborating clients.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The global loss decomposition is significant for addressing the data heterogeneity challenge, and stable feature selection makes the global model robust. Weighted averaging of the collaborators is an established approach, but here it complements the loss decomposition and stable feature selection. The performance comparison on two different datasets shows that the algorithm is robust.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    A comparison with the aggregation algorithm called similarity weighted aggregation (SimAgg) would strengthen the evaluation.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The code is made public, which is really good.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is well written with good technical details, and it is really good that the code is public. A performance comparison with the aggregation algorithm called similarity weighted aggregation (SimAgg) would make the evaluation stronger.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Solid technical details and validation on different datasets.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. Differing from existing methods, this paper analyzes reasons for performance degradation from the perspective of loss decomposition in heterogeneous federated learning.
    2. Based on the analysis, the authors introduce a margin control regularization term and a gradient deconflicting aggregation strategy to tackle data heterogeneity.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written, easy to understand and follow.
    2. Intuitively, the analysis of the performance degradation from the perspective of loss decomposition is reasonable.
    3. Mitigating shortcut learning in local training to reduce the distribution shift loss is a reasonable solution since it can improve the generalization ability of local models.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The reason that deconflicting gradients can reduce aggregation loss is not well-explained.
    2. Directly flipping the direction of some eigenvectors may have a significant impact on the gradients.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses of the paper

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good writing, comprehensive experiments and reasonable motivation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank reviewers for constructive comments and respond to individual comments as follows.

R1 -Comparison with aggregation algorithm SimAgg: Though we have added FedGH [31], which uses gradient harmonization as its aggregation method, we agree that comparisons with more aggregation methods would be better.

R3 -How deconflicting gradients reduce aggregation loss: As detailed in Remark 1, the λ_z in our SVD indicates the curvature of the loss in the gradient direction v_z. Our SVD allows us to discard insignificant directions and focus only on those with large curvatures (principal gradients). Thus the deconflicted gradient updates the model in a better direction that reduces the global loss.

-Impact of flipping the direction of eigenvectors on gradients: The eigenvectors obtained by SVD have no inherent direction, as both v and -v are eigenvectors. Thus, to ensure loss reduction, we calibrate the eigenvector directions using the mean of the local gradients, which typically represents a loss-decreasing direction. Our experimental results also show that this calibration improves convergence.

R4 -L2 regularization was used in FedSR: Thanks for sharing the paper. Our margin control does resemble FedSR’s L2 regularization but with a different motivation. While FedSR’s L2 regularization aims to align representation distributions from different clients with a common reference, our approach is inspired by shortcut learning, where large margins lead to reliance on shortcut features rather than stable ones. Beyond L2 regularization, we can employ various regularization techniques, such as evaluating the log-loss on a margin multiplied by a decreasing function or setting thresholds on the output logits to penalize large margins. We’ll cite FedSR and provide a comparison in our paper.
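
To make the alternatives mentioned above concrete, here is a hedged sketch of one possible thresholded-logit penalty; the function name, threshold value, and weighting are illustrative assumptions rather than anything from the paper or the released code.

    import torch

    def thresholded_margin_penalty(logits, threshold=5.0):
        # Penalize only the portion of each logit whose magnitude exceeds the
        # threshold, leaving moderate margins untouched.
        excess = torch.clamp(logits.abs() - threshold, min=0.0)
        return excess.pow(2).sum(dim=1).mean()

Such a penalty would be added to the cross-entropy loss in the same way as the L2 variant, scaled by a small coefficient.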

-Lacks a thorough examination of global loss decomposition: We are grateful for the insightful recommendations, which align with our ongoing and future work. In this paper, we employ global loss decomposition to inspire two methods that jointly minimize the loss terms. We test their effectiveness through a straightforward, single-round experiment on the training dataset. Our primary aim for global loss decomposition is to offer a tool for analyzing FL training processes. As the study of loss decomposition is in its early stages, a more comprehensive evaluation and comparison of the loss terms, as well as methods that enhance the reliability of the decomposition, is left as future work.

-Derivation process of loss decomposition: The loss decomposition comes from the gap between the global model’s loss on the global dataset (L(w), which we aim to minimize) and the averaged loss of the local models on their local datasets (the local loss, minimized by local training). This gap, represented as L(w) - local loss, stems from two factors: 1) the difference in loss surfaces across local datasets, leading to the distribution shift loss; and 2) the variation in loss between the local and global models on the global dataset, resulting in the aggregation loss. Then we get L(w) - local loss = distribution shift loss + aggregation loss, which is exactly Eq. (2).
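
Written out in one plausible notation (our rendering; the paper’s Eq. (2) may weight clients differently), with K clients, w_k the local models, w the global model, L_k the loss on client k’s data, and L the loss on the pooled global data:

    \mathcal{L}(w) \;=\;
      \underbrace{\frac{1}{K}\sum_{k=1}^{K} \mathcal{L}_k(w_k)}_{\text{local loss}}
      \;+\; \underbrace{\frac{1}{K}\sum_{k=1}^{K}\bigl(\mathcal{L}(w_k) - \mathcal{L}_k(w_k)\bigr)}_{\text{distribution shift loss}}
      \;+\; \underbrace{\mathcal{L}(w) - \frac{1}{K}\sum_{k=1}^{K}\mathcal{L}(w_k)}_{\text{aggregation loss}}

The three underbraced terms telescope back to \mathcal{L}(w), so the identity L(w) - local loss = distribution shift loss + aggregation loss holds by construction.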

-Why compare with personalized FL? We compared with personalized FL methods because 1) these methods are also used to tackle data heterogeneity; 2) their performance is evaluated on common test datasets, similar to general FL methods; and 3) FedPAC achieves SOTA performance on non-IID data.

-Distribution shifts in the label or feature space? We report results with label shifts constructed via a Dirichlet distribution on the Retina dataset, with α = 100, 0.1, and 0.5 for splits 1, 2, and 3 respectively, each of which consists of 5 clients. COVID-FL is a real-world federated dataset that exhibits shifts in both label and feature distributions. We will add more details to the paper.
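
For reference, a common way such a Dirichlet label split is constructed (our sketch with hypothetical names, not necessarily the exact script used for the Retina splits):

    import numpy as np

    def dirichlet_label_split(labels, num_clients=5, alpha=0.5, seed=0):
        # Partition sample indices across clients with Dirichlet-distributed
        # label proportions; smaller alpha gives a more skewed (heterogeneous) split.
        rng = np.random.default_rng(seed)
        labels = np.asarray(labels)
        client_indices = [[] for _ in range(num_clients)]
        for c in np.unique(labels):
            idx = rng.permutation(np.where(labels == c)[0])
            proportions = rng.dirichlet(alpha * np.ones(num_clients))
            cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
            for client_id, shard in enumerate(np.split(idx, cuts)):
                client_indices[client_id].extend(shard.tolist())
        return client_indices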

-Unrelated citation: [6] is an example of when gradient differences fail to capture model bias in representation learning on graphs.




Meta-Review

Meta-review not available, early accepted paper.


