Abstract

Federated learning has emerged as a compelling paradigm for medical image segmentation, particularly in light of increasing privacy concerns. However, most existing research relies on relatively stringent assumptions regarding the uniformity and completeness of annotations across clients. In contrast, this paper highlights a prevalent challenge in medical practice: incomplete annotations. Such annotations can introduce incorrectly labeled pixels, potentially undermining the performance of neural networks in supervised learning. To tackle this issue, we introduce a novel solution, named FedIA. Our insight is to conceptualize incomplete annotations as noisy data (i.e., low-quality data), with a focus on mitigating their adverse effects. We begin by evaluating the completeness of annotations at the client level using a designed indicator. Subsequently, we enhance the influence of clients with more comprehensive annotations and implement corrections for incomplete ones, thereby ensuring that models are trained on accurate data. Our method's effectiveness is validated through its superior performance on two extensively used medical image segmentation datasets, outperforming existing solutions. The code is available at https://github.com/HUSTxyy/FedIA.
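
To make the aggregation idea concrete, here is a minimal, illustrative sketch of completeness-weighted federated averaging in Python. The combination rule (completeness divided by loss) and all names are assumptions for exposition, not FedIA's exact equations.

    # Illustrative sketch: weight client updates by estimated annotation
    # completeness and observed loss, then average model parameters.
    from typing import Dict, List
    import torch

    def aggregate(client_states: List[Dict[str, torch.Tensor]],
                  completeness: List[float],
                  losses: List[float]) -> Dict[str, torch.Tensor]:
        # Favor clients with higher estimated completeness and lower loss
        # (hypothetical combination rule).
        scores = [c / (l + 1e-8) for c, l in zip(completeness, losses)]
        total = sum(scores)
        weights = [s / total for s in scores]
        # Weighted average of parameters across clients.
        agg = {k: torch.zeros_like(v, dtype=torch.float32)
               for k, v in client_states[0].items()}
        for w, state in zip(weights, client_states):
            for k, v in state.items():
                agg[k] += w * v.float()
        return agg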

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1182_paper.pdf

SharedIt Link: https://rdcu.be/dV54v

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_35

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1182_supp.pdf

Link to the Code Repository

https://github.com/HUSTxyy/FedIA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xia_FedIA_MICCAI2024,
        author = {Xiang, Yangyang and Wu, Nannan and Yu, Li and Yang, Xin and Cheng, Kwang-Ting and Yan, Zengqiang},
        title = {{FedIA: Federated Medical Image Segmentation with Heterogeneous Annotation Completeness}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {373--382}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present an approach to enable federated segmentation in the presence of incomplete segmentation annotations across clients. The approach reweights the aggregation based on the level of segmentation completeness predicted from the early trained model, and adaptively corrects the labels to maximise the amount of data involved in model training. The approach is demonstrated on two datasets: an MS lesion dataset and a COVID lung lesion dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The explored question is important for maximizing the application of federated learning in practice, and the paper provides a well-motivated discussion on this topic.
    • The decision to aggregate based on estimated completeness and current loss is well motivated and intuitive.
    • The approach is tested on two different datasets and demonstrates very strong performance in the low supervision setting.
    • Extensive experiments are conducted across various levels of supervision completeness, along with ablation studies.
    • The method is compared to a strong selection of baseline methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The definition of completeness depends on the presence of multiple regions to be segmented, such as lesions. Thus, I'm unsure of its applicability to the segmentation of structures like the hippocampus, where some sites could segment only certain slices or subregions. How the method could be applied in such cases needs to be discussed, or this needs to be acknowledged as a limitation of the approach.
    • Similarly, random lesions are removed, but this is likely not always realistic. It is possible that small lesions would be ignored, or that the segmentation would focus on lesions in certain regions of the brain depending on the condition of interest, so the missing labels may be more systematic. What impact would this have on the approach?
    • Although an ablation study explores the effect of T for the COVID dataset, it is unclear how a user would choose this value for a new dataset. I think it would be helpful to consider performance/segmentation completeness versus loss.
    • Only Dice is presented, but completeness is motivated by missing lesion segmentations, so recall is needed to assess whether this is truly overcome.
    • Some details that would help understanding of the method are missing (see below)
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Most information required for replication has been provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The use of the number of regions, as opposed to the area of regions, needs to be motivated, especially as area would be more generally applicable across segmentation tasks.
    • The data is split into 2D slices; are all slices considered, or only those containing lesions?
    • It would be interesting to see the effect of removing the loss and annotation completeness rates from the reweighting as an additional ablation study.
    • What measure of confidence is used and what is the impact of the choice of threshold?
    • What is the impact of lambda, how was it chosen?
    • Where did the equation for completeness rate come from? Why are they different across the datasets? How sensitive is the result to this choice?
    • Table 1: it would be good to report recall in addition to Dice. I'd recommend removing the 'From' column and replacing it with references to make space in the table.
    • Fig. 3: add the setting that led to the result to the caption, and indicate which row of the table this corresponds to. Dice scores for each image would also help.
    • Was the same impact of changing T seen for MS?
    • In places, the manuscript needs rewording for clarity, for instance (but not limited to): the first sentence of 2.1 is not a sentence, and the first sentence of "Incomplete Annotation Generation" is not a sentence.
    • The manuscript claims to use two datasets, but the MS dataset is itself formed of two datasets; rephrase for clarity.
    • It would be helpful to define "early model" for those unfamiliar with the annotation completeness literature.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an interesting idea, but a few issues need clarifying particularly around general applicability across segmentation tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a novel method that tackles missing annotations in a single-label scenario across clients and achieves improvements in segmentation accuracy. The proposed method includes an annotation completeness estimation based on the segmentation results of the global model, and a client-wise adaptive correction of incomplete annotations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is clear and easy to follow.
    2. The idea of annotation completeness is less studied and deserves more attention.
    3. The use of a trend line to estimate the confidence of having missing annotations is interesting.
    4. A strong evaluation, with datasets covering different organs (lungs and brain) and both CT and MRI.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although the authors indicated that the assumption of FedA3I is not suitable for the problem setting in this paper, it would be beneficial to include experiments that compare the proposed method against FedA3I.
    2. The authors should explain why the trend line is fit to a linear equation.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please include FedA3I in your experiments.
    2. Please also explain the choice of linear function as the trend line.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and raises an interesting question in federated learning that has not been explored. Also, the proposed method, which utilizes a trend line on the training history to evaluate uncertainty, is interesting. However, the lack of explanation for using a linear equation for the trend line should be addressed in the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The proposed work focuses on developing a federated medical image segmentation pipeline (FedIA), with the goal of minimizing the impact of annotation incompleteness at the client level. Incomplete annotations are treated as noisy data, and a dynamic correction approach for the clients affected by the highest incompleteness is developed. The authors validate their approach on two different medical image segmentation datasets, MS and LUNG, and compare its performance against seven other federated learning strategies.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well structured and well written. The most relevant strength is that the paper assesses a very relevant issue in federated learning, namely data heterogeneity at the client level, specifically in terms of annotation incompleteness. The proposed solution to tackle this issue is interesting and supported by the literature. The figure explaining the methodology is clear and well integrated into the text. The experimental set-up is thorough and well designed. I particularly appreciated the use of the parameter m to create experiments with increasing annotation incompleteness. The methodology is tested against a representative number of federated aggregation methods, including methods specifically designed to mitigate data heterogeneity. The ablation study is well conducted, showing the impact of each novel module proposed by the authors.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are some aspects of the methodology that could be explained further.

    1) In paragraph 2.3, it is stated that “the observed loss is lower when annotation completeness is elevated. Consequently, the server prioritizes clients exhibiting lower losses, effectively reducing the negative effects of imprecise estimation of a_k on the weighting process, potentially arising from inappropriate selection of T”. Is this only assessed empirically?
    2) In paragraph 2.4, it is not entirely clear how the authors derive Equation 6. It is indeed a first-order polynomial function, but I would like more details on the intuition behind using this as the criterion to decide which clients need corrections.
    3) In paragraph 2.4, it is stated that a client only corrects annotations for which its model output has confidence above a threshold of 0.8. Maybe I missed it, but could the authors detail how the confidence is computed?
    4) What motivated the decision to only consider false-negative lesions?

    The paper would also have benefited from an additional experiment: annotation incompleteness can affect the performance of models trained in a centralised setting as well. It would have been interesting to see this additional configuration, which would also provide an upper bound for the other results.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Thank you for your submission. The writing can benefit from some minor corrections:

    1) Deep learning models are “robust”, not “resilient”. This term is becoming more and more common, since ChatGPT uses it a lot, but it is not really an accepted synonym in technical language.
    2) In subsection 2.1, the following sentence is missing its main verb: “The completeness ratio …, indicating the proportion of marked lesions to the total actual lesions within D_k, which remains identical among samples in D_k but differs across clients.”
    3) In subsection 3.1, the following sentence is missing its main verb: “Two real-world multiple sclerosis datasets focusing on the segmentation of white matter lesions (WML) in 3D magnetic resonance (MR) brain images, denoted as MS, including MSSEG-1 [1] and PubMRI [5].”
    4) In the paragraph “Incomplete Annotation Generation”, subsection 3.1, the following sentence is missing its main verb: “Given that when doctors or other professionals label multi-lesion data, they tend to label one lesion at the 3D volume level before annotating another.”

    Additionally, in Figure 3 it is not explained what the colour green represents (it is explained in the supplementary material).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant challenge for the federated learning community at MICCAI and proposes an interesting solution. It addresses a complex but realistic scenario that must be taken into consideration when developing and designing federated learning segmentation pipelines.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for their insightful feedback. Regarding writing mistakes, we have revised the manuscript according to the reviewers' suggestions. Due to the space limit, the main concerns are addressed as follows:

Reviewer #3
Q1: Can the method be applied to annotations covering only certain slices or subregions?
A1: Our method is tailored for incomplete annotations concerning subregions, but not specifically for slices. Recognizing the relevance of this issue in medical practice, we plan to explore it further in future research.
Q2: Randomly removing lesions may not be realistic.
A2: In this paper, we make the first effort to address incomplete annotations in the scope of FL, even under specific assumptions about unlabeled lesions. In the future, we will delve into more realistic settings.
Q3: How to choose T.
A3: In the experiments, we show that our method is somewhat robust and does not depend heavily on the tuning of T. However, your suggestion is well taken; an approach with an adaptive T would be better, and our method should be improved accordingly.
Q4: Only Dice is presented.
A4: Qualitative results (Fig. 3) demonstrate that FedIA recalls all lesions with fewer false positives, leading to the best segmentation performance, whereas other methods suffer from extensive false negatives or fail to segment any lesions. Following your suggestion, we will consider adding recall as a new metric.
Q5: The use of the number of regions.
A5: An annotation may include a large lesion but miss many small ones, leading to a low completeness rate.
Q6: All slices?
A6: Only slices with lesions are considered.
Q7: The effect of the two elements in the reweighting function.
A7: We will consider adding this ablation study in future work.
Q8: Confidence and threshold.
A8: Confidence means the predicted probability; the feature map goes through the linear layer and then softmax to obtain the probability. We conducted an ablation study on the threshold in the supplementary material.
Q9: The impact of lambda.
A9: We have not conducted experiments to evaluate the impact of lambda; we chose its value by observing changes in IoU in the early phase.
Q10: The equation for the completeness rate.
A10: The equation is an estimation designed by us. It varies across datasets due to different annotation levels. The computed results of the equation are not exactly equal to the actual annotation completeness, but they indicate which client's annotations are more complete.
Q11: The impact of changing T for MS.
A11: MS is more sensitive to changes in T, but it can also maintain stability to a certain extent.
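
Based on A8 (confidence is the softmax probability) and the paper's 0.8 threshold, a minimal sketch of confidence-thresholded label correction follows; the function name, tensor shapes, and binary-segmentation setup are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def correct_labels(logits: torch.Tensor,   # (B, 2, H, W) raw model outputs
                       labels: torch.Tensor,   # (B, H, W) possibly incomplete masks
                       threshold: float = 0.8) -> torch.Tensor:
        probs = F.softmax(logits, dim=1)   # per-pixel class probabilities
        fg_conf = probs[:, 1]              # foreground (lesion) confidence
        # Recover likely false negatives only: flip background pixels the model
        # predicts as lesion with high confidence; annotated pixels are kept.
        corrected = labels.clone()
        corrected[(labels == 0) & (fg_conf >= threshold)] = 1
        return corrected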

Reviewer #4
Q1: Loss is only assessed empirically.
A1: The assessment is not solely empirical. Our setting is IID, so low completeness means a high noise level. In many early studies, loss was used to measure the noise level, confirming the relationship between the two.
Q2: First-order polynomial function.
A2: We derived the fitting with inspiration from the CVPR paper ADELE, which uses an exponential function. Considering that the IoU increases more slowly in FL compared to centralized learning, and that the first-order function is the simplest, we chose a first-order polynomial, which performs better than an exponential function and fits the early values well in practice.
Q3: Output confidence.
A3: Please see Q8 for Reviewer #3.
Q4: Only considering false-negative lesions.
A4: A critical problem of multi-lesion annotation is the labor-intensive process. We aim to develop a well-performing model with incomplete data and focus on handling the incompleteness issue. In our setting, only false-negative lesions exist, without false-positive lesions, highlighting the incompleteness problem.
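
For A2, a first-order (linear) fit to a client's early IoU history can be written in a few lines; the flattening criterion and threshold below are illustrative assumptions, not the paper's exact rule.

    import numpy as np

    def trend_slope(iou_history):
        """Slope of a first-order (linear) fit to per-round IoU values."""
        rounds = np.arange(len(iou_history))
        slope, _intercept = np.polyfit(rounds, iou_history, deg=1)
        return float(slope)

    # Hypothetical usage: trigger annotation correction once the fitted IoU
    # trend flattens, suggesting the model has learned most of what the
    # clean portion of the labels can teach it.
    history = [0.31, 0.38, 0.42, 0.45, 0.46, 0.46]
    if trend_slope(history[-4:]) < 0.02:   # illustrative threshold
        print("IoU trend has flattened; start correcting annotations.")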

Reviewer #5
Q1: FedA3I.
A1: In our initial consideration, FedA3I focuses on noise near the boundary, where the annotation is complete. Thus, we did not include it in our experiments. We will consider including FedA3I in our experiments.
Q2: The choice of a linear function.
A2: Please see Q2 for Reviewer #4.




Meta-Review

Meta-review not available; the paper was early-accepted.
