Abstract

Federated learning is a popular paradigm for training a joint model in a distributed, privacy-preserving environment. However, partial annotations pose an obstacle: the categories of labels are heterogeneous across clients. We propose to learn a joint backbone in a federated manner, while each site receives its own multi-label segmentation head. Using Bayesian techniques, we observe that the different segmentation heads, although trained only on the individual client’s labels, also learn information about the labels not present at the respective site. This information is encoded in their predictive uncertainty. To obtain a final prediction, we leverage this uncertainty and perform a weighted averaging over the ensemble of distributed segmentation heads, which allows us to segment “locally unknown” structures. With our method, which we refer to as FUNAvg, we are on average even on par with models trained and tested on the same dataset. The code is publicly available at https://github.com/Cardio-AI/FUNAvg.
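To make the fusion step concrete, here is a minimal, hypothetical PyTorch sketch of uncertainty-weighted averaging over per-site heads. All names and shapes are illustrative assumptions, not the released FUNAvg implementation (which additionally reweights the background channel, as explained in the author feedback below).

```python
import torch

def uncertainty_weighted_average(head_probs, head_uncertainty, eps=1e-8):
    """Fuse predictions from K per-site segmentation heads (illustrative sketch).

    head_probs:       (K, C, H, W) softmax maps, one per head.
    head_uncertainty: (K, H, W) per-pixel uncertainty, one per head.

    Heads that are uncertain at a pixel contribute less to the consensus,
    so a structure that is "locally unknown" to one head can still be
    recovered from the others.
    """
    confidence = 1.0 - head_uncertainty                         # (K, H, W)
    weights = confidence / (confidence.sum(0, keepdim=True) + eps)
    fused = (weights.unsqueeze(1) * head_probs).sum(0)          # (C, H, W)
    return fused / (fused.sum(0, keepdim=True) + eps)           # renormalize
```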

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1396_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1396_supp.pdf

Link to the Code Repository

https://github.com/Cardio-AI/FUNAvg

Link to the Dataset(s)

http://medicaldecathlon.com/
https://www.synapse.org/Synapse:syn3193805/wiki/89480
https://chaos.grand-challenge.org/Data/
https://github.com/JunMa11/AbdomenCT-1K
https://learn2reg.grand-challenge.org/Learn2Reg2021/
https://zenodo.org/records/10047292
https://amos22.grand-challenge.org/

BibTex

@InProceedings{Töl_FUNAvg_MICCAI2024,
        author = { Tölle, Malte and Navarro, Fernando and Eble, Sebastian and Wolf, Ivo and Menze, Bjoern and Engelhardt, Sandy},
        title = { { FUNAvg: Federated Uncertainty Weighted Averaging for Datasets with Diverse Labels } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes to learn a joint backbone in a federated manner while each site receives its own multi-label segmentation head. A Bayesian segmentation network is utilized to estimate the segmentation uncertainty and an ensemble method is used to combine the multi-site knowledge.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Federated learning with heterogeneous annotations is worth studying.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The organization of this paper is lacking: the introduction fails to adequately introduce related works and motivation, the method section primarily focuses on uncertainty usage without providing an overview of the federated learning pipeline, and the results and discussion sections should be separated.
    (2) The writing quality detracts from comprehension. It is unclear how utilizing a Bayesian network addresses the issue of heterogeneous annotations, whereas personalized segmentation heads are already prevalent in the community.
    (3) The novelty of the proposed method is limited, as uncertainty estimation, personalized segmentation heads, and Bayesian networks are commonly employed in medical image segmentation models.
    (4) Formula errors are prevalent, with italicized text requiring a colon before the main text.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) Revise the introduction section to clearly articulate the motivation and proposed solution.
    (2) Conduct experimental comparisons with relevant methods.
    (3) Provide an analysis of the computational and transmission costs of the proposed method in comparison to related approaches.
    (4) Clarify how the method handles datasets with non-overlapping classes.
    (5) Clearly explain the color representation in Fig. 5.
    (6) Detail how pixels with identical class probabilities are classified after averaging multiple predictions.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the weaknesses above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I have read the rebuttal, and the comments of the other reviewers. The authors answered some comments I raised in my review.



Review #2

  • Please describe the contribution of the paper

    The paper introduces FUNAvg, a federated learning approach for medical image (radiology) segmentation that trains a joint backbone across distributed sites, while each hospital/dataset is equipped with its own multi-label segmentation head. By employing Bayesian techniques, the model becomes more robust, and its predictive uncertainty is utilized via a weighted averaging of predictions from all heads, allowing accurate segmentation of structures not locally labeled and achieving superior performance compared to traditional models trained on homogeneous datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This approach presents a remarkably straightforward yet effective method to enhance performance in heterogeneous settings.

    • The evaluation is solid, complemented by compelling qualitative figures.

    • Additionally, the introduction is exceptionally well-crafted.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The intuition behind this should be much more prominent and clearly stated. What is the reasoning for subtracting the uncertainty from the background (BG), rather than anything else, like a multiplication?

    • While MC dropout has shown great calibration in segmentation tasks, it is hardly interpretable as it results in noisy and grid-aligned artifacts in uncertainty maps; see, e.g., the Probabilistic U-Net by Kohl et al.

    • ECE is a weak/bad metric for measuring calibration in segmentation tasks, as much depends on the bucket count M (which is missing), and plain accuracy is not the main interest for the downstream task since it overly weights the background class (see the generic ECE sketch after this list). Balanced accuracy, Dice (macro), or mIoU would be better metrics to use within the calibration error. I suggest looking up Adaptive Calibration Error, Expected Segmentation Calibration Error, or Generalised Energy Distance.

    • Weak related work: only 24 papers are cited, and not a single paper on practical uncertainty estimation for medical segmentation tasks or more recent federated learning algorithms. Even though most work on uncertainty estimation has been done on centralized systems, topics as close as VI with multiple heads have been published for centralized medical image segmentation. Further, the field of uncertainty-aware federated learning has definitely received more relevant papers. Some papers may be missing information needed for correct citation; please check: 4, 5, 9, 11, 20, 22, 24.
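To make the criticism about the bucket count concrete, below is a generic binned ECE sketch (an illustrative assumption, not the paper's metric code). The reported error can shift with M because predictions are regrouped into different buckets.

```python
import numpy as np

def binned_ece(probs, labels, M=10):
    """Plain binned Expected Calibration Error over flattened pixels.

    probs:  (N, C) softmax outputs.
    labels: (N,) integer ground-truth classes.
    M:      number of equal-width confidence buckets; the result can shift
            noticeably with M, which is the criticism above.
    """
    conf = probs.max(axis=1)                       # predicted-class confidence
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, M + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            # |B_m|/N * |accuracy(B_m) - mean confidence(B_m)|
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece
```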

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Description of the method is definitely good and with the promised code (hopefully high quality) it should be excellent in reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Not all components of the equations are readily defined, e.g. for Equation (1): theta was defined two pages earlier. Further, what is T? What is f^{theta}? Also, what does the operator symbol mean? It is unclear what operation it denotes.

    • “The segmentation heads seem to have learned that there might be a structure” - I would argue this could also be that the representations shared by the other clients push these structures into consideration, even though the respective client does not learn them. → Thanks for the intuition, but further investigation would be appreciated in the future, e.g., evaluating the observation under different depths of the heads. Until then, this is an empty claim.

    • “To obtain valid probabilities me must only adjust…” sounds off. Please rephrase.
    • “We evaluate this model on the same and then on all other datasets (row 1 and 2 of Tab. 1).” It is unclear what “on all other datasets” means: is the model evaluated on each of the other test sets individually, or is the average of that model over all other datasets reported?

    • Fig 1 is cluttered with crossing lines, pinching/almost overlapping boxes, unreadable text, and distracting parts… I love the idea of the figure, but it could receive some love.
    • Fig.4 could need a “readability/prettiness” update.
    • Fig. 5: FedUWAvg = FUNAvg? This point might be of limited use for the rebuttal at MICCAI, but in the 8 years since the FedAvg publication, many other adaptations have been proposed. I would recommend comparing against something from the last 2–3 years when going for a resubmission elsewhere.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Besides the weaknesses, I think this paper has the potential to be accepted given some minor tweaks, e.g. to the related work. The evaluation could profit from better comparisons, but I understand the difficulty of running and evaluating on so many datasets, which is why I would put that aside.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Thank you for your clarifications.



Review #3

  • Please describe the contribution of the paper

    The paper introduces Uncertainty Weighted Averaging (UNAvg) to address distributed datasets with diverse labels. The key contribution is UNAvg, which reweights the predictive probability of a segmentation head based on its uncertainty, with this uncertainty encoding un-annotated structures learned during training. Although applicable in both centralized and federated learning (FL) settings, the method in FL outperforms centralized training, as supported by the paper’s observations and discussions. However, the paper could benefit from improved figure and writing organization, as the current presentation makes it challenging to grasp its contributions.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper’s methodology is simple yet effectively leverages uncertainty in the model heads to find ‘locally unknown’ structures in a novel way.
    2. The discussion on why Federated Learning (FL) outperforms centralized training is particularly insightful. It is substantiated by observations that predictions in a centralized setting tend to be overconfident, and FL can help mitigate this issue.
    3. The results and analysis are comprehensive.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper lacks a discussion and comparison with the baseline method, ConDistFL [1], which also addresses scenarios with partially annotated data.
    2. The presentation of the material is lacking. The organization of figures disrupts the flow of reading; for example, Figure 5 is mentioned first but appears last. Figure 3 is found on page 5 but is referenced on page 7.
    3. Figure 4(b) has a confusing color scheme in the legend, where both DD and UB are represented by the same color, orange, making it unclear which is which. The claim that “the improvement of our proposed FUNAvg is larger for underrepresented labels” would benefit from concrete examples and supporting Dice values, rather than requiring readers to toggle between Figures 3 and 4(b) to verify this claim.

    [1] Wang, P. et al. (2023). ConDistFL: Conditional Distillation for Federated Learning from Partially Annotated Data. In: Celebi, M.E., et al. Medical Image Computing and Computer Assisted Intervention – MICCAI 2023 Workshops. MICCAI 2023.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper’s methodology is straightforward and can be easily reproduced; however, I urge the authors to provide the promised code along with the specific data splits used in their experiments.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) Please correct the legend in Figure 4(b), where ‘DD’ and ‘UB’ are currently represented by the same color. Differentiating these is needed for clarity and interpretation.
    2) The uncertainty quantification equations require definitions that stand alone, rather than depending solely on the referenced literature. This would make the paper self-contained, reducing the inconvenience of having to refer to external sources for basic explanations.
    3) If possible, provide metrics for individual classes within each dataset in a supplementary.
    4) There is an incomplete sentence in the second line of the abstract: “But partial annotations pose an …”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s key insights, discussion, and results offer valuable content for the MICCAI community; however, presentation issues and incomplete explanations prevent a higher score.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have adequately addressed my concerns.




Author Feedback

We thank the reviewers (R) for their critical assessment and insightful suggestions. We want to clarify some major critiques as follows:

(R1,R3,R4) Missing comparison to existing methods: We previously experimented with the Marginal Loss, one key component of ConDistFL. However, we observed poor performance, which we did not report so far: the Marginal Loss led the network to learn only structures present at many clients (liver & spleen), highlighting a drawback of earlier partially annotated FL works, which involved datasets with a single easy-to-segment label. In contrast, we have many labels that are only available at certain clients.

(R1) Highlighting contribution: We show that for more diverse label distributions the problem can be elegantly mitigated by using Bayesian techniques. Properly utilizing uncertainty improves performance, especially for under-represented structures, with the same computational and communication load as FedAvg. The introduction is adjusted accordingly.

(R1,R3,R4) Missing related work: We provide a more in-depth explanation of previous work on uncertainty-aware FL (Linser et al., 2021; Boughorbel et al., 2019) and uncertainty quantification with a multi-head UNet in the centralized setting (Fuchs et al., 2022), and added standard works on uncertainty estimation in medical image segmentation (Kohl et al., 2019; Kendall et al., 2017).

(R3) Why are untrained structures visible in the uncertainty? We rewrote our paragraph in the discussion to explain why we think the different segmentation heads learn distinct structures in their uncertainty. In the federated case, two heads with non-overlapping labels might use the same feature maps, while in the central case, the heads tend to use different ones.

(R3) Intuition behind uncertainty reweighting unclear: We apologize for any confusion about subtracting the uncertainty, which likely stemmed from Fig. 2a, which we have adjusted for clarity. We use the multiplication $(1-u)\,p_{bg}$, where $u$ is the uncertainty and $p_{bg}$ the background logits. The idea is that $u$ represents the probability of “something” being present, making $1-u$ the probability of “nothing”. The background channel encodes the probability of nothing and is reweighted by $1-u$, which acts as a form of quality measure (see the sketch after this feedback).

(R3) Evaluation strategy: We performed an 80/20% train-test split at each client. As a baseline, we trained on the single clients (no FL); in the intra-client scenario, we evaluate on the same client. In the inter-client scenario, we use the same model and test it for a specific label on the test split of all other clients where this target label is available. We adjusted the text to improve understanding.

(R3) Calibration metrics: Although not explicitly stated, we did use the ESCE, which we will clarify. We calculated a per-class-averaged ESCE to address the unbalanced sizes of labels (e.g. many background pixels) and will add this to the results: CenAvg 22.0±26.1, CUNAvg 16.5±22.1, FedAvg 21.3±22.0, FUNAvg 12.4±10.4.

(R3) MC Dropout might suffer from grid-aligned artifacts: We use the sum of the uncertainties from all clients, which resembles an ensemble-like uncertainty prediction, improving the uncertainty estimates.

(R1,R3) Uncertainty formula: We added a brief description to Eq. 1: $T$ is the number of MC sampling steps, $\otimes$ denotes the outer product, and $f$ denotes the network with parameters $\hat{\theta}_t$ in step $t$ for input $x^*$.

(R1,R3,R4) Clarity of presentation: We added a table in the appendix with the performance of each segmentation head on each structure across all datasets and give examples in the text (R4). We separated the results and discussion sections (R1). We removed intersecting arrows and reorganized the text for better readability in Fig. 1. In Figs. 4b and 5, we unified the color coding so that each organ is now represented by a distinct color. We added information to the bibliography (R3). We rewrote ambiguous phrases and repositioned the figures (R3,R4).
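For concreteness, the following is a minimal sketch of the two mechanisms explained in the feedback above, under stated assumptions (probabilities of shape (B, C, H, W), uncertainty of shape (B, H, W); function names are hypothetical, not the released code): MC-dropout sampling for Eq. 1, reporting the per-pixel variance (the diagonal of the epistemic covariance rather than the full outer-product covariance), and the $(1-u)\,p_{bg}$ background reweighting.

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=10):
    """T stochastic forward passes with dropout kept active.

    Returns the mean softmax prediction and a per-pixel variance, i.e. the
    diagonal of the epistemic covariance
        (1/T) * sum_t p_t p_t^T - p_bar p_bar^T
    from Eq. 1; the full C x C outer products are omitted for brevity.
    """
    model.train()  # keeps dropout active; in practice, freeze batch-norm stats
    samples = torch.stack([model(x).softmax(1) for _ in range(T)])  # (T, B, C, H, W)
    mean = samples.mean(0)
    var = (samples ** 2).mean(0) - mean ** 2
    return mean, var

def reweight_background(probs, u, bg=0, eps=1e-8):
    """u is the per-pixel probability that 'something' is present, so the
    background channel is multiplied by (1 - u) and the result is
    renormalized to obtain valid probabilities.

    probs: (B, C, H, W) softmax maps; u: (B, H, W).
    """
    probs = probs.clone()
    probs[:, bg] = (1.0 - u) * probs[:, bg]
    return probs / (probs.sum(1, keepdim=True) + eps)
```

In this reading, these two pieces would feed the uncertainty-weighted ensembling over heads sketched after the abstract above.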




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received two positive scores and one negative score from the reviewers. While there are some limitations in this work, such as the experimental design, the overall insights presented would likely be of interest to the research community.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


