Abstract

Federated learning enables collaborative knowledge acquisition among clinical institutions while preserving data privacy. However, feature heterogeneity across institutions can compromise the global model’s performance and generalization capability. Existing methods often adjust aggregation weights dynamically to improve the global model’s generalization but rely heavily on the local models’ performance or reliability, excluding an explicit measure of the generalization gap arising from deploying the global model across varied local datasets. To address this issue, we propose FedEvi, a method that adjusts the aggregation weights based on the generalization gap between the global model and each local dataset and the reliability of local models. We utilize a Dirichlet-based evidential model to disentangle the uncertainty representation of each local model and the global model into epistemic uncertainty and aleatoric uncertainty. Then, we quantify the global generalization gap using the epistemic uncertainty of the global model and assess the reliability of each local model using its aleatoric uncertainty. Afterward, we design aggregation weights using the global generalization gap and local reliability. Comprehensive experimentation reveals that FedEvi consistently surpasses 12 state-of-the-art methods across three real-world multi-center medical image segmentation tasks, demonstrating the effectiveness of FedEvi in bolstering the generalization capacity of the global model in heterogeneous federated scenarios. The code will be available at https://github.com/JiayiChen815/FedEvi.
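The uncertainty disentanglement described in the abstract can be made concrete with a small sketch. This is not the authors' implementation: it uses common subjective-logic proxies (the vacuity K/S for epistemic uncertainty, the entropy of the expected class probabilities for aleatoric uncertainty) and a hypothetical rule for combining the global generalization gap with local reliability, purely for illustration.

```python
import numpy as np

def dirichlet_uncertainties(alpha):
    """Decompose the uncertainty of a Dirichlet(alpha) prediction.

    alpha: array of shape (..., K) with concentration parameters (> 0).
    Returns (aleatoric, epistemic) proxies per prediction:
      - epistemic: vacuity u = K / S, low when total evidence S is large
      - aleatoric: entropy of the expected class probabilities
    """
    K = alpha.shape[-1]
    S = alpha.sum(axis=-1)                      # Dirichlet strength
    p = alpha / S[..., None]                    # expected probabilities
    epistemic = K / S
    aleatoric = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return aleatoric, epistemic

def aggregation_weights(global_epistemic, local_aleatoric):
    """Hypothetical weighting rule (not the paper's exact formula):
    upweight clients where the global model shows a large generalization
    gap (high epistemic uncertainty on their data) and the local model
    is reliable (low aleatoric uncertainty)."""
    gap = np.asarray(global_epistemic, dtype=float)
    rel = np.exp(-np.asarray(local_aleatoric, dtype=float))
    score = gap * rel
    return score / score.sum()                  # normalize to sum to 1
```

A confident prediction such as `alpha = [10, 1, 1]` yields lower epistemic uncertainty than the uniform `alpha = [1, 1, 1]`, and the resulting per-client weights always sum to one.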

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2717_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2717_supp.pdf

Link to the Code Repository

https://github.com/JiayiChen815/FedEvi

Link to the Dataset(s)

https://drive.google.com/file/d/1sf0W4QmQn-rY7P-OJMVZn7Hf50jD-w/view?usp=drive_link

https://liuquande.github.io/SAML/

https://drive.google.com/file/d/1p33nsWQaiZMAgsruDoJLyatoq5XAH-TH/view

https://zenodo.org/records/6325549

BibTex

@InProceedings{Che_FedEvi_MICCAI2024,
        author = { Chen, Jiayi and Ma, Benteng and Cui, Hengfei and Xia, Yong},
        title = { { FedEvi: Improving Federated Medical Image Segmentation via Evidential Weight Aggregation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a medical image segmentation method based on evidential weight aggregation. The approach uses a federated learning strategy to train a model on multi-center medical data. The authors introduce FedEvi, which builds on a Dirichlet-based evidential model, and show that it achieves competitive performance over existing state-of-the-art methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is easy to understand. FedEvi with the Dirichlet method seems a novel approach. Experimental results show that FedEvi achieves competitive performance over existing methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The biggest concern with this paper is how the performance of the compared methods was obtained. The authors need to explain how the experimental results of the other methods were produced; most appear to come from the authors' own implementations, but an explicit statement is needed. In particular, several of the compared methods are general-purpose federated learning approaches, and it is unclear how they were adapted to the segmentation task. Figure 1 could also be drawn more intuitively: there is no visual difference between the surrogate global model and the global model, and the local training arrow is not visible.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I would like to see a more precise explanation of the experimental comparison. Also, the figures could be more intuitive.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel approach with good results. A paper presentation is also acceptable. However, there is critical concern regarding the experimental results of previous papers.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces a weight aggregation method called FedEvi. Uncertainty is divided into epistemic (related to model knowledge) and aleatoric (inherent data uncertainty) components based on a Dirichlet model. Weights are dynamically assigned to collaborators based on the global generalization gap and local reliability. FedEvi increases the aggregation weights for clients whose data exposes a significant generalization gap in the global model and whose local models are highly reliable.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The performance comparison on 3 different datasets against 12 methods is good. Dynamic weight adjustment of collaborators in federated settings has been used before, and this paper follows that strategy. The decomposition of uncertainty into aleatoric and epistemic components is explained well.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Comparison with Similarity Weight Aggregation (SimAgg) method would have been better. https://link.springer.com/chapter/10.1007/978-3-031-09002-8_40

    Discussion on model performance on iid and non-iid data would be a nice addition.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    If the code were available in a public repository, that would be better.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Comparison with Similarity Weight Aggregation (SimAgg) method would have been better. https://link.springer.com/chapter/10.1007/978-3-031-09002-8_40

    Discussion on model performance on iid and non-iid data would be a nice addition.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written, with a thorough model evaluation and clear technical explanations.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper
    1. Employ the Dirichlet-based evidential model to disentangle overall uncertainty into epistemic and aleatoric components, thereby providing a detailed uncertainty representation.
    2. Propose FedEvi, a novel aggregation-based FL method, based on the global generalization gap and local reliability. The global generalization gap is measured using the epistemic uncertainty within the surrogate global model, while the local reliability is evaluated through the aleatoric uncertainty within local models.
    3. FedEvi conspicuously outperforms 12 state-of-the-art FL methods on three real multi-center medical image segmentation datasets.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Propose a novel framework to conduct federated learning via dynamic aggregation weight assignment controlled by the global generalization gap and local reliability, which are represented by epistemic and aleatoric uncertainty, respectively.

    2. Comprehensive experiments on three datasets, with ablation studies showing component effectiveness and hyper-parameter impact.

    3. Some visualizations are provided in the appendix.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Some errors in Equation 10: should the second term in the KL loss be "given 0" rather than "given 1," based on the description of reducing the evidence of incorrect class predictions?

    2. The fundus dataset is collected from 5 centers, so why are there 6 clients in both the table results and the visualizations?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I think it is reproducible given the comprehensive and reasonable reported experiment results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I wonder why only epistemic uncertainty is used to measure the generalization gap of the global model. A recommendation is to combine the global model's segmentation accuracy (e.g., Dice or HD95) with the epistemic uncertainty; using uncertainty alone is known to be fragile and is not fully convincing.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. Novel and logical proposed FL framework
    2. Clear presentation
    3. Comprehensive experiment results
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers and ACs for their recognition of the novelty, performance, and presentation of this work. Here are responses to their invaluable suggestions and remaining concerns.

R1Q1: Comparison with SimAgg As suggested, we will compare FedEvi with SimAgg on the three federated segmentation tasks and summarize the comparison results in the camera-ready version.

R1Q2: Performance on IID and Non-IID data The datasets used in our study are non-IID. We will further partition the Kvasir dataset (client 1 of the polyp dataset) into 4 equal-sized subsets to simulate IID data from 4 centers. The comparison results will be provided in the journal version.

R2Q1: Experimental details FedDG and FedCE are designed for federated segmentation tasks, while the others are task-agnostic. The implementation details are as follows.

- FedProx: constrains the distance between local and global model parameters during local training.
- FedProto: we followed FedSeg (CVPR 2023) to calculate local categorical prototypes, thereby aligning global and local prototypes. For a fair comparison, clients and the server also communicate model parameters alongside prototypes.
- FedSAM: adds a small perturbation to local models during local training to enhance the generalization capability of the global model.
- FedBR: we introduced a projection layer after the encoder of the 2D U-Net for both global and local models, enabling feature projection and contrastive learning.
- FedLAW: constructs a proxy dataset on which the aggregation weights are optimized.
- FedGA: we used the change in Dice loss on the validation set to measure the generalization gap and adjust aggregation weights.
- L-DAWA: adjusts layer-wise aggregation weights based on the parameter divergence between the global and local models.
- FedUAA: we first calculated pixel-wise uncertainty and the Youden index, then determined aggregation weights from each client's averaged Youden index.
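As an illustration of the first baseline above, the FedProx local objective is the well-known form L = task_loss + (μ/2)·‖w − w_global‖². The sketch below is a simplified NumPy version, not the paper's training code; the default `mu` value is a hypothetical choice.

```python
import numpy as np

def fedprox_local_loss(task_loss, local_params, global_params, mu=0.01):
    """FedProx-style local objective: the task loss plus a proximal term
    that penalizes drift of local parameters from the global model."""
    prox = sum(float(np.sum((w - wg) ** 2))
               for w, wg in zip(local_params, global_params))
    return task_loss + 0.5 * mu * prox
```

When the local parameters equal the global ones, the proximal term vanishes and the objective reduces to the plain task loss.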

R2Q2: Revision of Fig.1 Thanks for your constructive suggestion! We have revised Fig.1 in the camera-ready version to clarify the distinction between the surrogate global model and the global model, along with a clear depiction of local training.

R3Q1: Error in Eq.10 The Dirichlet parameter is linked to evidence by α = e + 1 (EDL, NIPS18). By reducing the evidence of incorrect predictions, the corresponding Dirichlet parameter decreases to its minimum of 1. Therefore, it should be KL[Dir(ρ|α)||Dir(ρ|1)] rather than KL[Dir(ρ|α)||Dir(ρ|0)].
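For reference, the standard EDL regularizer (Sensoy et al., NeurIPS 2018) that this reply appeals to first removes the evidence of the ground-truth class and then pulls the remaining Dirichlet toward the uniform one; the paper's Eq. 10 presumably follows this form, though its exact notation may differ:

```latex
\tilde{\alpha} = y + (1 - y) \odot \alpha,
\qquad
\mathcal{L}_{\mathrm{KL}}
  = \mathrm{KL}\!\left[\,\mathrm{Dir}(\rho \mid \tilde{\alpha})
    \,\middle\|\, \mathrm{Dir}(\rho \mid \mathbf{1})\,\right].
```

Since Dirichlet concentration parameters must be strictly positive, Dir(ρ | 0) is not a valid distribution, which is consistent with the authors' reply.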

R3Q2: #Client in fundus dataset Thanks for pointing out the writing error! To ensure clarity and avoid potential confusion, we have revised the number of clients to 6 and modified the data source names of client3 and client4 to REFUGE(Zeiss) and REFUGE(Canon) in the camera-ready version.

R3Q3: Measurement of global generalization gap As recommended, we will analyze the impact of the global performance on the global generalization gap in the journal version. Specifically, we will conduct an additional experiment by multiplying the global epistemic uncertainty with the Dice loss on the validation set to evaluate the global generalization gap.




Meta-Review

Meta-review not available, early accepted paper.


