Abstract

Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy, and it has made significant progress in medical image classification. One common assumption is task homogeneity, where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, each institution may diagnose only a subset of categories, constrained by its level of medical knowledge and the local prevalence of diseases, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate this realistic label-missing setting in the multi-label FL domain and propose a two-stage method, FedMLP, to combat missing classes from two aspects: pseudo-label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting of missing-class knowledge. Experiments on two publicly available medical datasets validate the superiority of FedMLP against state-of-the-art federated semi-supervised and noisy-label learning approaches under task heterogeneity. Code is available at https://github.com/szbonaldo/FedMLP.
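
A minimal PyTorch-style sketch of the global-teacher consistency idea described above, assuming a multi-label classifier that maps a batch of images to per-class logits; the names (local_model, global_model, missing_mask) are illustrative and not taken from the authors' implementation:

    import torch
    import torch.nn.functional as F

    def consistency_loss(local_model, global_model, images, missing_mask):
        """Regularize local predictions toward the global teacher on the
        classes whose labels are missing at this client.

        missing_mask: (num_classes,) bool tensor, True for locally
        unlabeled classes.
        """
        with torch.no_grad():  # the global teacher is not updated locally
            teacher_probs = torch.sigmoid(global_model(images))
        student_probs = torch.sigmoid(local_model(images))
        # Match the teacher only on missing classes to avoid forgetting.
        return F.mse_loss(student_probs[:, missing_mask],
                          teacher_probs[:, missing_mask])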

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1176_paper.pdf

SharedIt Link: https://rdcu.be/dV54x

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_37

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1176_supp.pdf

Link to the Code Repository

https://github.com/szbonaldo/FedMLP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Sun_FedMLP_MICCAI2024,
        author = { Sun, Zhaobin and Wu, Nannan and Shi, Junjie and Yu, Li and Cheng, Kwang-Ting and Yan, Zengqiang},
        title = { { FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {394--404}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript investigates an approach to tackle the task heterogeneity problem. Task heterogeneity is described as a federated learning setup in which not all institutions have all label categories; each may be missing a few labels. The manuscript further constructs an experiment for multi-label classification using federated learning with such missing label categories. The missing labels are learned with a two-stage approach. First, pseudo-label tagging: a warmed-up model is used to generate class prototypes and select confident samples. Second, global knowledge learning: a global teacher model is used to regularize training and avoid forgetting of missing labels. The model was evaluated on publicly available medical datasets and compared against federated semi-supervised learning and noisy-label learning methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    There are some notable strengths of the manuscript:

    1. The authors have clearly described the problem with the task homogeneity assumption in the multi-label classification scenario in a federated learning system.

    2. The two-stage approach, Federated Multi-Label learning with Partial annotation (FedMLP), is used to tackle the resulting task heterogeneity problem.

    3. The first stage uses model warm-up to create class prototypes and a weighted partial-class loss function to train the model. A self-adaptive threshold is then used to select pseudo labels for further training. Lastly, consistency regularization is used to minimize the information loss between the global and local models (see the sketch after this list).

    4. The authors have evaluated this approach using public datasets. The proposed FedMLP approach was compared with SOTA FSSL and FNLL methods and compares favorably against the other approaches.
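
    Point 3 above outlines the first-stage mechanics; the following is a minimal sketch, under assumed names and a quantile-based threshold, of how class prototypes and a self-adaptive threshold could select confident pseudo labels (an illustration, not the authors' released code):

        import torch
        import torch.nn.functional as F

        def class_prototypes(features, labels, num_classes):
            """features: (N, D); labels: (N, C) multi-hot. Returns (C, D)."""
            protos = []
            for c in range(num_classes):
                pos = features[labels[:, c] == 1]
                protos.append(pos.mean(dim=0) if len(pos) > 0
                              else torch.zeros(features.size(1)))
            return torch.stack(protos)

        def select_confident(features, prototypes, quantile=0.9):
            """Pseudo-label samples whose prototype similarity clears a
            per-class quantile threshold (the self-adaptive part)."""
            sims = F.cosine_similarity(features.unsqueeze(1),    # (N, 1, D)
                                       prototypes.unsqueeze(0),  # (1, C, D)
                                       dim=-1)                   # -> (N, C)
            thresh = sims.quantile(quantile, dim=0)              # (C,)
            return sims >= thresh                                # (N, C) bool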

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are a few questions/comments regarding the manuscript:

    1. Figure 2 was a little difficult to understand and follow. The caption describes the red color as the missing class, whereas the legend labels it as the negative class.
    2. Also in Figure 2, the active class and negative class are shown using two empty boxes colored green and red respectively, but none of the boxes in the figure use these legend colors.
    3. Section 2.3, in the line ‘we use prototypes to detect and choose confidential samples …’, do you mean confident samples?
    4. During federated training, the local prototypes of each class from each client are sent to the server. Since these prototypes carry class-specific information for each client, it would be good to hear from the authors about potential data leakage through the class-specific information sent to the server, which is also indirectly shared with the other clients.
    5. Section 3.1, Experimental Setup: the training dataset was equally distributed among clients. It would be good to know from the authors why, since in a federated learning scenario the datasets available for training will not be equally distributed.
    6. Section 3.1, Partial Label Generation: an equal number of categories was randomly removed from each client. It would be good to know from the authors why an equal number of categories was removed from each client.
    7. Section 3.3, Ablation Study: Table 4 is referenced for the ablation study, but there is no Table 4 in the manuscript; this should be edited to Table 3.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. The data distribution used for training and testing on the two public datasets was not provided, specifically for removing an equal number of labels from each client. The absence of this information would limit the reproducibility of the results described in the manuscript.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The manuscript explains the problem of task heterogeneity well: the scenario where some clients in the federated learning setup have only a few of the labels.
    2. A few things that could help improve the manuscript: randomly distributing the dataset among clients and randomly removing labels from a few clients.
    3. Figure 2 was difficult to interpret; it would be a good idea to make it clearer.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The manuscript explains the problem of task heterogeneity well: the scenario where some clients in the federated learning setup have only a few of the labels. This problem is addressed by a two-stage approach.
    2. The approaches used, such as model warm-up to create class prototypes, a self-adaptive threshold to select pseudo labels, and consistency regularization between the global and local models, are highlights of the manuscript.
    3. A few things could be better: the data was distributed equally among clients, and the per-client data distribution was not reported.
    4. Clarification about data leakage from sharing local prototypes during training would bring more clarity to the approach.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper
    1. This paper is the first to formulate a realistic label-missing setting in the multi-label FL domain.
    2. This paper proposes a two-stage method, called FedMLP, to address missing labels in multi-label FL.
    3. The proposed method outperforms prior works on two public datasets.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and easy to understand and follow.
    2. Transforming the missing-label problem into semi-supervised learning makes sense.
    3. Experiments are conducted on two large datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Related work is not covered extensively. Many works focusing on the label distribution skew problem are neglected: [1] Federated Learning with Label Distribution Skew via Logits Calibration; [2] CalFAT: Calibrated Federated Adversarial Training with Label Skewness; [3] Towards Addressing Label Skews in One-Shot Federated Learning.
    2. Logit adjustment techniques have already been applied to FL (see the generic sketch after this list): [1] Federated Learning with Label Distribution Skew via Logits Calibration; [2] CalFAT: Calibrated Federated Adversarial Training with Label Skewness.
    3. In the MLD module, there are too many (four) hyper-parameters. How did the authors select these hyper-parameters?
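
    As flagged in point 2 above, logit adjustment is an established technique. The following is a generic sketch of the family those works build on (shifting logits by the log of the empirical class prior, as in long-tail logit adjustment), not this paper's exact formulation; all names are illustrative:

        import torch
        import torch.nn.functional as F

        def logit_adjusted_ce(logits, targets, class_counts, tau=1.0):
            """logits: (B, C); targets: (B,) class indices;
            class_counts: (C,) per-class sample counts on this client."""
            prior = class_counts.float() / class_counts.sum()
            # Shift logits by the log prior so locally rare classes are
            # not systematically under-predicted.
            return F.cross_entropy(logits + tau * torch.log(prior + 1e-8),
                                   targets)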
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please see the weaknesses of the paper.
    2. The neglected related works on label skewness should be added for comparison.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the techniques used in this paper are not very novel, transforming the missing label problem into semi-supervised learning is reasonable.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a new setting for federated learning that captures real-world scenarios where some clients preferentially label only certain classes of interest and leave others unlabelled.

    They pose it as a federated multi-label learning (FMLL) problem and propose an intuitive methodology building upon techniques such as prototype learning, pseudo-labeling, and consistency regularization.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The formulation is novel and models the real-world scenario where labeling is very costly and doctors/hospitals label only the diseases of interest to them.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experimental settings are not very clearly explained or justified. The assumptions regarding the test set are not clear.

    The datasets used are synthetically generated: the authors remove some classes and class labels artificially. It is not fully clear how well the techniques will work on real-world federated datasets.

    The idea of “hot” and “cool” classes is intuitive, but it is not clear how it is modeled in the experiments.

    I think that the test set at each hospital should also be biased towards the classes they consider important and have labeled.

    There are real-world federated datasets available now where the hospital from which each medical image originated is known. For example: https://github.com/owkin/FLamby

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    There are a lot of complex ideas mentioned in the paper, and the dataset construction is also very complex; it would be difficult to reproduce. The variance of the results has not been reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. It is not clear why the authors call it federated multi-label learning and not federated multi-task learning. Federated multi-task learning is an established field.

    The remaining comments restate the weaknesses listed above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed setting is novel and realistic, but the experiments are not described clearly enough for reproducibility.

    It would have been stronger if the authors had used real-world federated benchmarks.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for their insightful feedback. Due to the space limit, the main concerns/misunderstandings are addressed as follows:

Reviewer#3
Q1&2: Fig. 2 is confusing. A: In our paper, missing classes and non-missing classes mean negative classes and active classes, respectively. The ambiguity might arise because “negative” can also denote the negative samples of a class, which are indicated by 0 in the diagram. The negative class and active class in the legend are denoted only by color and are not related to the number of boxes. To avoid misunderstanding, we will correct this in the camera-ready version.
Q3&7: Text description issues. A: We will correct these errors. Thank you.
Q4: Data leakage problem. A: Previous studies [1] have shown that averaging features multiple times can better protect privacy. In our method, the features of each class are first averaged locally, and the class prototypes are then averaged again after being uploaded to the server (a sketch of this double averaging appears after this response). This process makes it nearly impossible to reconstruct the original samples. [1] Rethinking federated learning with domain shift: A prototype view.
Q5&6: Regarding experimental details. A: The paper focuses on FL scenarios under task heterogeneity. To simplify the problem, we make an assumption on the data distribution and explore performance improvements under different task missing rates.
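
A minimal sketch of the double averaging described in the answer to Q4, with illustrative names (local_prototype, server_aggregate) rather than the authors' actual code: per-class features are averaged once on the client, and the uploaded client prototypes are averaged again on the server, so no individual sample feature leaves a client.

    import torch

    def local_prototype(features, labels, c):
        """First averaging (client side): mean feature of class-c positives.
        features: (N, D); labels: (N, C) multi-hot."""
        return features[labels[:, c] == 1].mean(dim=0)

    def server_aggregate(client_protos, client_counts):
        """Second averaging (server side): count-weighted mean of the
        per-client prototypes for one class."""
        weights = torch.tensor(client_counts, dtype=torch.float)
        weights = weights / weights.sum()
        return (weights.unsqueeze(1) * torch.stack(client_protos)).sum(dim=0)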

Reviewer#6
Q1: Related work on label distribution skew is not covered extensively. A: We will consider citing the related works. However, we should note that this paper explores task heterogeneity rather than general data heterogeneity.
Q2: Logit adjustment techniques have been applied to FL. A: Although logit adjustment techniques have been utilized previously, we employ this technique because it effectively addresses the new challenge we present.
Q3: In the MLD module, there are too many (four) hyper-parameters. How did the authors select them? A: In our experiments, we find that fine-tuning these parameters has little impact on overall performance due to the adaptation to class difficulty. Due to page limitations, we opt for moderate values and do not present ablation experiments on the hyper-parameters.

Reviewer#7
Q1: The experimental settings are not very clear. A: To simplify the problem, the dataset is randomly partitioned into IID subsets. The missing rate of classes for each client is consistent, and the results are demonstrated on an unbiased test set.
Q2: It is not clear how well the techniques will work on real-world federated datasets. There are real-world federated datasets available now where the hospital from which each medical image originated is known, for example https://github.com/owkin/FLamby. A: This is a problem worth exploring, and we will investigate it in future work.
Q3: The idea of “hot” and “cool” is intuitive, but it is not clear how it is modeled in the experiments. A: We indeed propose “hot” and “cool” categories based on intuition, broadly referring to categories that are more and less recognizable by clients, respectively. Details can be found in Fig. 3 in the supplementary material: the second and fifth categories have the lowest annotation rates and can be considered “cool” categories, while the first and third categories have the highest annotation rates and can be considered “hot” categories.
Q4: I think that the test set at each hospital should also be biased towards the classes they consider important and have labeled. A: Actually, the global test set is the aggregation of all local test sets; thus, both settings are equivalent.
Q5: It is not clear why the authors call it federated multi-label learning and not federated multi-task learning. A: Indeed, these two concepts are different. In multi-task learning, a loss function has multiple terms (i.e., objectives), which differs from our focus.




Meta-Review

Meta-review not available (early accepted paper).


