Abstract

Deep learning has been extensively used in various medical scenarios. However, the data-hungry nature of deep learning poses significant challenges in the medical domain, where data is often private, scarce, and imbalanced. Federated learning emerges as a solution to this paradox. Federated learning aims to enable multiple data owners (i.e., clients) to collaborate on training a unified model without requiring clients to share their private data with others. In this study, we propose an innovative framework called SiFT (Serial Framework with Textual guidance) for federated learning. In our framework, the model is trained in a cyclic sequential manner inspired by the study of continual learning. In particular, with a continual learning strategy that employs a long-term model and a short-term model to emulate humans’ long-term and short-term memory, class knowledge across clients can be effectively accumulated through the serial learning process. In addition, a pre-trained biomedical language model is utilized to guide the training of the short-term model by embedding textual prior knowledge of each image class into the classifier head. Experimental evaluations on three public medical image datasets demonstrate that the proposed SiFT achieves superior performance with lower communication cost compared to traditional federated learning methods. The source code is available at https://openi.pcl.ac.cn/OpenMedIA/SiFT.git.
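The cyclic training scheme described in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's actual implementation: the function names, the plain-dict weight representation, the toy `train_fn` interface, and the momentum value are all assumptions for exposition.

```python
def ema_update(long_term, short_term, momentum=0.99):
    # The long-term model drifts slowly toward the freshly trained
    # short-term weights, emulating long-term memory consolidation.
    return {k: momentum * long_term[k] + (1 - momentum) * short_term[k]
            for k in long_term}

def serial_round(client_data, long_term, train_fn, momentum=0.99):
    # One cyclic round: the short-term model is initialized from the
    # long-term model, trained locally on each client in turn, and the
    # long-term model is refreshed after every client visit.
    for data in client_data:
        short_term = train_fn(dict(long_term), data)
        long_term = ema_update(long_term, short_term, momentum)
    return long_term
```

With a toy `train_fn` and two clients, the long-term weights move steadily toward the clients' local optima without ever being overwritten outright, which is the intended hedge against catastrophic forgetting across the serial visits.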

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2156_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2156_supp.pdf

Link to the Code Repository

https://openi.pcl.ac.cn/OpenMedIA/SiFT.git

Link to the Dataset(s)

https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000?resource=download

https://zenodo.org/records/10519652/files/organcmnist_224.npz?download=1

https://zenodo.org/records/10519652/files/organsmnist_224.npz?download=1

BibTex

@InProceedings{Li_SiFT_MICCAI2024,
        author = { Li, Xuyang and Zhang, Weizhuo and Yu, Yue and Zheng, Wei-Shi and Zhang, Tong and Wang, Ruixuan},
        title = { { SiFT: A Serial Framework with Textual Guidance for Federated Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a Serial Framework with Textual guidance (SiFT) for federated learning. The proposed method trains the model in a cyclic sequential manner using two complementary models. The short-term model is trained with the guidance of a pretrained biomedical language model, while the long-term model is updated using EMA (Exponential Moving Average). The authors conducted experiments on three medical datasets with simulated data distributions and reported the results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well-written and easy to follow. Good experimental results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The authors claim to propose a new serial framework for federated learning (FL). However, serial FL frameworks already exist and their challenges have been well studied in previous works (Ref1, Ref2, Ref3), yet none of these similar works are referenced or mentioned in the paper.

    (2) Given the existence of serial FL frameworks, the paper differentiates itself by introducing a dual-model approach, where a short-term model is trained and a global model is updated with EMA (Exponential Moving Average). This approach is similar to the student-teacher model; however, the EMA concept has been well explored in Ref3. The authors should acknowledge and discuss these existing works.

    (3) Given the existence of serial frameworks and the use of EMA for serial FL, training the short-term model in combination with a biomedical language model might not be sufficient for a MICCAI paper. In fact, the integration of language models with FL has been well explored in parallel-based FL strategies, such as Ref5 and Ref6.

    (4) The comparison in the paper appears unfair, as all the compared methods use pure ResNet18, while SiFT relies on both the larger BioLinkBERT-large model and ResNet18. It is unclear whether the performance improvement comes from the proposed model itself or from the use of the language model. The authors should compare their method with related language-model-based FL methods. Additionally, the performance of centrally hosted training with pure ResNet and ResNet+LM should be reported to clarify the source of the improvement.

    (5) It is unclear why the proposed method significantly outperforms the compared methods in the IID setting (HAM10000) but shows limited improvement in the heterogeneous setting (Dir 0.3). The authors should provide insights or explanations for this observation and report the IID performance on the OrganCMNIST and OrganSMNIST datasets. As in comment 4, it would also be better to add the centrally hosted performance here.

    [Ref1] Distributed deep learning networks among institutions for medical imaging
    [Ref2] Rethinking architecture design for tackling data heterogeneity in federated learning
    [Ref3] Addressing catastrophic forgetting for medical domain expansion
    [Ref4] Accounting for data variability in multi-institutional distributed deep learning for medical imaging
    [Ref5] FedPrompt: Communication-Efficient and Privacy-Preserving Prompt Tuning in Federated Learning
    [Ref6] FedCLIP: Fast Generalization and Personalization for CLIP in Federated Learning

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As shown in the weakness section, the authors should acknowledge and discuss the existing related works on serial FL and LLM-based FL methods. Additionally, fairer comparisons should be reported in the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper exhibits similarities with existing works, but it lacks references to and discussions about these methods. Additionally, the comparisons presented in the paper are not entirely fair, which raises concerns regarding the evaluation of the proposed approach.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The reviews have addressed most of my concerns.



Review #2

  • Please describe the contribution of the paper

    The authors propose a cyclic sequential federated learning (FL) framework called SiFT for image classification tasks. Inspired by continual learning, they employ a long-term and a short-term model to accumulate new knowledge sequentially while preserving the old knowledge. Furthermore, they utilize a pre-trained biomedical language model to encode the textual prior so that the short-term model can learn better features. They empirically evaluate SiFT on three public 2D medical imaging classification datasets, and the results outperform existing baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting analogy from complementary learning system (CLS) theory to inspire the “short-term” and “long-term” models.
    • Novel use of a frozen weight matrix computed from a pretrained biomedical language model as the classification head to enforce the learning of the feature extractor.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Incomplete evidence of efficiency due to the sequential nature of the proposed framework.
    • Incomplete experiments in the ablation studies.
    • Concerns about handling distribution shifts with sequential training.
    • Questionable usefulness if only label drift is considered.
    • Missing related work on “ring”-like FL setups, which have been proposed multiple times.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is a good read. The inspiration of CLS theory fits the motivation of using short-term model to learn from the data at each round on a single client, and distill the knowledge to the long-term model that doesn’t completely forget knowledge learned from previous rounds. In addition, the ablation study confirms the effectiveness of replacing a trainable classification head with a frozen weight matrix computed from a pretrained biomedical language model.
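    For concreteness, the frozen text-anchored head discussed above can be approximated as a cosine-similarity classifier. This is a hedged sketch under the assumption that language-model embeddings of the class names are L2-normalized and used as fixed class prototypes; the function name and normalization details are illustrative, not taken from the paper.

    ```python
    import numpy as np

    def text_anchored_logits(features, class_text_embeddings):
        # Frozen classifier head: each row of the weight matrix is the
        # normalized language-model embedding of one class name, so the
        # logits are cosine similarities between image features and
        # textual class prototypes. Nothing here is trainable.
        W = class_text_embeddings / np.linalg.norm(
            class_text_embeddings, axis=1, keepdims=True)
        f = features / np.linalg.norm(features, axis=1, keepdims=True)
        return f @ W.T  # shape: (batch, num_classes)
    ```

    Because the head is fixed, gradient updates can only improve the feature extractor, which is the mechanism the ablation study appears to validate.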

    However, the sequential nature of the framework gives up the key property FL inherits from distributed training: running in parallel. The evaluation in Table 2 shows the number of FL rounds required, but does not consider the elapsed real time in practice.

    Besides Figures 4(b) and 4(c), I believe another setup is required to demonstrate the true value of the textual guidance: initialize the weight matrix as in SiFT but allow it to be trainable.

    If two clients have vast distribution shifts, loading the weights from one client onto the other would make it difficult for the models of both clients to converge. The phenomenon is not so obvious here because the authors used datasets with only label distribution skewness. However, in real federated learning scenarios in hospitals, the real concern is distribution shift in the data space caused by different imaging modalities, vendors, and configurations. By freezing the classification head, the authors essentially put the burden of handling the data distribution on the feature extractor. I would imagine SiFT’s performance would degrade considerably. One way to find out is to use OrganAMNIST, OrganCMNIST, and OrganSMNIST as three clients instead of splitting OrganCMNIST into 10 clients.

    The authors fail to discuss the related work on ring-like and decentralized FL setups in the literature and how SiFT differs from them. Many such works can be found with a quick search.

    Minor Issues:

    1. In Figure 2 (b), BioLinkBERT is an LM, not an LLM.
    2. In Figure 3 (middle and right), it would be better to draw the standard deviation as a shaded area so that readers could see whether there are overlaps.
    3. The argument for Figure 4(d), that a random projector is better than a trainable projector because more trainable parameters are prone to overfitting, doesn’t make sense to me.
    4. It would be better to clarify whether the number of rounds needed to reach the desired balanced accuracy in Table 2 and Figure 3 (left) is measured on a single client or across all clients.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper certainly has its strengths and is interesting to read, but there are still many major weaknesses that need to be resolved.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for putting together a response. After careful consideration, I maintain my score.

    I understand that in a cross-silo setting (e.g., a few hospitals, as the authors indicated) SFL may converge faster in terms of the number of FL rounds, but an SFL round in theory takes longer than a classic (parallel) FL round, where a round is defined by the authors in the main text as “one federated learning round refers to the completion of a learning iteration where all clients have participated”. There is a plus and there is a minus, so a comparison of real time would help here. “Dir 0.3” added to the distribution shift and caused class imbalance, but I am referring to distribution shift in the image space. For example, two datasets can have a similar class distribution but vastly different data distributions in image space (e.g., T1, T2, FLAIR). My concern is the performance of any sequential or ring-like FL in such cases. The authors’ contributions to the field are valuable, and I believe that with revisions, their work has the potential to be a strong submission.



Review #3

  • Please describe the contribution of the paper

    This paper introduces an innovative approach to federated learning that draws inspiration from continual learning techniques. The proposed method eliminates the need for a central agent responsible for updating the model by instead training short-term and long-term models cyclically in series across different centers. The long-term model captures information common to all centers, while the short-term model is trained on each center’s specific data and subsequently used to update the long-term model.

    A key aspect of the proposed approach is the incorporation of a pre-trained language model to embed prior knowledge about the image classes. This embedded information is then utilized to assist in training the short-term model, enhancing its ability to learn from the center-specific data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model exhibits two notable methodological strengths. First, it elegantly extends the concept of a complementary learning system from continual learning to federated learning. Second, it effectively utilizes the guidance provided by the pretrained language model as a train-free classifier head. These two aspects represent significant advancements from existing research and, as demonstrated in the paper, yield impressive performance compared to current strategies, particularly when class distributions vary between centers.

    In terms of results, the proposed framework offers substantial improvements in communication efficiency and convergence speed while significantly outperforming existing methodologies in the studied settings.

    The authors also conducted a comprehensive ablation study, further reinforcing the validity of the results and providing insights into the interactions between the model’s various components.

    Moreover, the paper is well-structured, easy to understand, and highly relevant to the medical community, making it a valuable contribution to the field.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper introduces a novel framework for federated learning and proposes a new model, it would have been beneficial to explore the framework’s performance with different model backbones beyond ResNet18. By investigating the framework’s compatibility and effectiveness with various architectures, the authors could have provided a more comprehensive understanding of its generalizability and robustness.

    Evaluating the proposed framework with a diverse range of model backbones would have strengthened the paper’s findings and offered valuable insights into its potential for widespread adoption across different neural network architectures commonly used in medical imaging applications.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It would be beneficial to expand on how specifically t_c can be chosen and how it was selected in the paper setting. This part seems crucial, but little attention is given to how t_c is constructed and what alternatives exist. Even simple ideas, such as “asking ChatGPT to give a long description about the pathology” and then using that embedding as input, might greatly affect the performance. It would have been interesting to see a more in-depth analysis of this aspect, exploring different strategies for generating t_c and their impact on the model’s learning capabilities and generalization.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend giving this paper an “accept” score. While some aspects of the analysis could have been more thorough, particularly regarding the selection and impact of t_c, the text embedding representing prior knowledge, the paper is overall excellent. The authors propose a novel and innovative approach to federated learning, demonstrating significant improvements in efficiency and performance. The paper is well-structured, easy to follow, and highly relevant to the medical community. Despite minor shortcomings, the strengths of this work make it a valuable contribution to the field.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Q1(R#1): Missing related work. A1: Thanks for pointing out the issue. CWT (Cyclical Weight Transfer) in Ref1 is a serial framework (SF), but it only works in (near-)IID scenarios. Ref2 does not propose a new SF but focuses on the effect of model backbones on federated learning (FL). While Ref4 provides strategies for CWT to handle variability in sample size & label distribution, they may not handle non-IID scenarios well (evidence obtained, but not reported due to rebuttal policy). In contrast, our SF can better handle non-IID scenarios and achieves SOTA performance. Ref3 focuses on domain expansion and is irrelevant to SF. Compared to Ref1-4, our SF is novel in (1) the whole framework design (i.e., memory module & classifier head), (2) applying a continual learning strategy to FL, and (3) utilizing textual guidance to help visual learning in an SF. Such discussions will be added.

Q2(R#1): EMA is well explored in Ref3 and used for serial FL (SFL). A2: Ref3 is for domain expansion rather than FL and only considers two domains (clients). Also, EMA in Ref3 is only for BN statistics update, different from the purpose of the long-term model update in our work. So EMA has not been used for serial FL. Importantly, one novelty here lies in applying a dual-model memory to SFL (not EMA).

Q3(R#1): About language models (LMs) with FL (Ref5-6). A3: The purpose of the LM is different. Ref5 is parallel-based FL (PFL) of soft prompts built on a pretrained LM and only for NLP tasks. Ref6 is PFL of visual adapters built on CLIP. Our framework uses an LM to provide a textual prior for guiding vision model learning in serial FL. Also, using an LM for model training is not our only novelty (listed in A1).

Q4(R#1): Comparison appears unfair. A4: We clarify that our model during inference is ResNet18 (without the LM). BioLinkBERT is only used to help guide ResNet18 training, which is one of our novelties. The ablation study in Fig. 4 shows that both the usage of the LM and the other components help. Our LM usage can be extended to parallel FL (as future work), supporting the generality of our method. Other LM+FL baselines will be compared.

Q5(R#1): More comparisons & discussions on tests. A5: HAM10000 is highly class-imbalanced while the other datasets are not. Such factors & dataset bias may cause the performance difference between the IID & non-IID settings. The improvement with Dir 0.3 is limited probably due to severely non-IID distributions (i.e., some classes missing at each client) with Dir 0.3. Similar positive findings are expected on OrganSMNIST and OrganCMNIST in the IID setting (not reported here due to policy). Centrally hosted performance will be added.

Q6(R#3): Incomplete evidence for efficiency. A6: Our serial FL (SFL) is for scenarios where only a few clients (e.g., hospitals) exist and each client has rich computational resources. In this case, significantly fewer rounds in SFL means less real time compared to parallel FL with many more rounds. Real-time information will be added. Also, the reduction in SFL rounds can largely reduce communication cost.

Q7(R#3): Incomplete ablation study. A7: Thanks! The suggested ablation is expected to degrade performance (evidence obtained but not reported due to rebuttal policy).

Q8(R#3): About distribution shifts (DS) or label drift (LD). A8: The Dir 0.3 setting already causes vast DS (heavily class-imbalanced, with some classes missing at each client, so not just LD), and ours still converges much faster than others (Table 2, last col). The suggested tests will be done (future work), with positive results expected (since the frozen classifier head embeds a textual prior that is independent of modalities, etc.).

Q9(R#3): Missing related work. A9: Please see A1 above. Ours is different from the related work, so our contributions still hold.

Q10(R#4): Try other backbones & discuss t_c. A10: ResNet18 was adopted following prior work. We expect to obtain positive results with other backbones. t_c is simply the class name in our tests, & better performance is expected with a rich description as suggested (future work).

Other comments will be adopted to refine the paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The AC prefers to reject this borderline paper because: 1. it likely requires significant revision, as the author promised in the rebuttal; 2. the justification for efficiency does not address the concerns.

    1a The rebuttal promises to add information and discussion, raising concerns about a major revision from the original submission.

    1b As reviewers noted, related works are not well discussed in the original submission. Although the rebuttal tried to justify the differences, adding this discussion to the manuscript would be beneficial.

    2 Additionally, the efficiency concern is valid in practice. The authors introduced a new scenario: “Our serial FL (SFL) is for scenarios where only a few clients (e.g., hospitals) exist and each client has rich computation resource.” This was not clearly stated in the original submission and also limits its practical impact.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper has answered and addressed the comments raised by the reviewers and will address other comments relating to the paper’s refinement.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    As compared with common parallel federated learning, cyclic training, although not new, is a far less discussed area. This work is done following this serial collaboration track, and proposed some interesting ideas. Even though as reviewers raised, there are some insufficient discussions and information, I do want to recognize the value of the base idea and method design. Thus I would recommend acceptance - and I hope authors would do a thorough revision to fully address the concerns from AC and reviewers.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



