Abstract

For medical imaging AI models to be clinically impactful, they must generalize. However, this goal is hindered by (i) diverse types of distribution shifts, such as temporal, demographic, and label shifts, and (ii) limited diversity in datasets that are siloed within single medical institutions. While these limitations have spurred interest in federated learning, current evaluation benchmarks fail to evaluate different shifts simultaneously. However, in real healthcare settings, multiple types of shifts co-exist, yet their impact on medical imaging performance remains unstudied. In response, we introduce FedMedICL, a unified framework and benchmark to holistically evaluate federated medical imaging challenges, simultaneously capturing label, demographic, and temporal distribution shifts. We comprehensively evaluate several popular methods on six diverse medical imaging datasets (totaling 550 GPU hours). Furthermore, we use FedMedICL to simulate COVID-19 propagation across hospitals and evaluate whether methods can adapt to pandemic changes in disease prevalence. We find that a simple batch balancing technique surpasses advanced methods in average performance across FedMedICL experiments. This finding questions the applicability of results from previous, narrow benchmarks in real-world medical settings. Code is available at: https://github.com/m1k2zoo/FedMedICL.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2266_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2266_supp.pdf

Link to the Code Repository

https://github.com/m1k2zoo/FedMedICL

Link to the Dataset(s)

https://stanfordmlgroup.github.io/competitions/chexpert/

https://github.com/ieee8023/covid-chestxray-dataset

https://github.com/mattgroh/fitzpatrick17k

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/DBW86T

https://figshare.com/articles/dataset/PAPILA/14798004/1

https://stanfordaimi.azurewebsites.net/datasets/3263e34a-252e-460f-8f63-d585a9bfecfc

BibTex

@InProceedings{Alh_FedMedICL_MICCAI2024,
        author = { Alhamoud, Kumail and Ghunaim, Yasir and Alfarra, Motasem and Hartvigsen, Thomas and Torr, Philip and Ghanem, Bernard and Bibi, Adel and Ghassemi, Marzyeh},
        title = { { FedMedICL: Towards Holistic Evaluation of Distribution Shifts in Federated Medical Imaging } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the limitations of current evaluation protocols in federated learning and medical imaging, which typically focus on a single type of distribution shift. It provides a unified approach to evaluate and address label, demographic, and temporal distribution shifts simultaneously, creating a more realistic evaluation protocol for dynamic healthcare settings.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper lies in its comprehensive approach to modeling and evaluating distribution shifts in medical imaging. By considering multiple types of shifts (label, demographic, and temporal) and incorporating demographic metadata, the FedMedICL benchmark provides a realistic representation of the challenges faced in real-world healthcare settings. This comprehensive evaluation protocol, unlike existing benchmarks that focus on a single type of shift, enables a more accurate assessment of model performance across diverse patient populations and changing healthcare conditions. Additionally, the paper showcases the practical application of FedMedICL by simulating COVID-19 propagation across hospitals and evaluating model adaptation to pandemic-induced changes in disease prevalence.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors do not clearly describe how they address imbalanced data and temporal shifts. The paper lacks a thorough explanation of how these issues are resolved using its framework. Secondly, the results presented in the paper are not sufficiently compared and analyzed. The paper does not provide precise numerical results, whereas numerous works in the literature provide quantitative evaluation results. Additionally, the methodology for splitting the datasets to achieve better performance on temporal problems is not well-defined.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed framework models distributed hospitals experiencing independent demographic shifts over time. Demographic groups are defined based on skin type for the PAPILA dataset and age brackets for other datasets. The presented framework also permits alternative attributes such as sexuality. The localized split strategy assigns sequential training tasks to each client, simulating temporal distribution shifts. For the evaluation setup, the authors have utilized 10 clients per dataset, except for the largest dataset, CheXpert, where 50 clients are used. This choice ensures similar dataset sizes across clients, enabling a fair comparison across them. Each training task includes multiple communication rounds, with M = 5 iterations and a batch size of 10, followed by federated averaging. At the end of every training task, the authors assess each client’s model on all previously encountered tasks, reporting the mean LTR (Long-Term Retention) across clients. While LTR captures the retention of knowledge, it does not directly evaluate a model’s ability to adapt to new tasks or distributions that were not encountered during training. Therefore, it is essential to complement LTR with other metrics that capture different aspects of model performance.
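    As a rough illustration of the federated averaging step mentioned in this comment, the sketch below averages client parameters weighted by local dataset size. It is not taken from the FedMedICL code: the function name `federated_average`, the flat-list parameter representation, and the size-based weighting are assumptions made for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size.

    client_weights: list of equal-length lists of floats (one per client).
    client_sizes:   list of local sample counts (one per client).
    Both the representation and the weighting are illustrative assumptions.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n / total for w, n in zip(client_weights, client_sizes))
        for j in range(n_params)
    ]


# Two hypothetical clients holding 10 and 30 samples respectively.
clients = [[1.0, 2.0], [3.0, 4.0]]
sizes = [10, 30]
global_weights = federated_average(clients, sizes)
print(global_weights)
```

    When all clients hold similar amounts of data, as the evaluation setup in this comment describes, the weighted average is close to a plain mean of the client parameters.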

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Federated learning and its applications in medical imaging are a hot topic, but a method proposed in this area needs to provide strong debiasing techniques, cover several federated scenarios, test a number of state-of-the-art works, compare different metrics, and ensure that data sharing in these scenarios is secure. Moreover, verifying the performance by providing precise results on frequently used metrics is very important.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present FedMedICL, a framework to accommodate federated medical imaging evaluation of AI models across labels, demographics (mainly age), and temporal shifts. They suggest a client (clinical site) splitting that accounts for demographic and label shifts on the data (balanced and skewed). On top of this, a temporal task split per client is applied accounting for both demographic and label (disease evolution) temporal shifts. They use five diverse publicly available datasets and apply the proposed federated scheme/splits, along with the federated average, in several regularization methods, showing that class balancing is the most effective. Finally, they simulate FedMedICL in the clinical scenario of COVID-19 (chest x-rays) showing that most methods exhibit severe “forgetting” as the disease evolves.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is clear, well-written, and referenced.
    2. It tackles a real clinical problem.
    3. Diverse datasets are involved in the evaluation process in an attempt to draw general conclusions.
    4. The application of simultaneous client and temporal task splitting in federated learning.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are only simulated scenarios and splits. It is, indeed, difficult to conduct in reality such a study with several participating clinical sites; however, it is possible to include public datasets that provide data from several centers, e.g., breast radiology or digital pathology data, or brain radiology data (and apply FedMedICL in each one of them, and collectively if the task/data allows).
    2. Complementary to the above, the accounted demographic factors are poor (essentially only age), while a variety of more social and even clinical data (available in public datasets) could provide another dimension to the splits. A variety of social/demographic and/or clinical factors could act as split determinants, too (instead of having a fixed one) since different factors have different impacts on the imaging results, as the PAPILA results confirm.
    3. Results are provided for Long-Tailed Recognition (LTR) accuracy only. Assuming the downstream task is classification in all experiments AUC (and per class) would be informative, too.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Dear authors, thank you for your work. A few points to consider in addition to the above:

    1. Assuming that classification is the downstream task in all experiments, it is not clear which model is utilized; please give details. The low Non-COVID performance of the model in Fig. 4 (also for PAPILA and OL3I in Fig. 3) is especially surprising. It is clear that the scope of the paper is not to present SOTA classifiers and performance; however, the impact of the proposed method should be reviewed on top of acceptable (for the task) performance. Please comment.
    2. The table describing the data should include data size, and specifics on labels and all possible metadata available.
    3. The types of distribution shifts vary beyond the demographic, label, and temporal ones. Especially in medical imaging, there are considerable shifts inherent to the acquisition process. Factors like the capturing device’s/scanner’s manufacturer and model, software protocol, possible presence of agents, etc., play a significant role in the in-task-domain data diversity and can serve as additional splitting factors.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The weaknesses and missing components weigh more than the merits currently. It is an important topic, with clinical essence and feasibility, however, further evidence - including more extensive splitting factors, perhaps customizable to the data/task - and simulations closer to the clinical reality are needed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the review comments. Adding the AUC results (which require no further experiments) to assess them in conjunction with LTR, together with the remaining clarifications, qualifies the paper for a weak accept.



Review #3

  • Please describe the contribution of the paper

    The paper describes issues regarding distribution shifts in medical data which seriously hinder the performance of machine learning models and limit model generalization. The paper introduces a novel framework called FedMedICL to simulate various out-of-distribution shifts such as label, demographic and temporal shifts. Evaluating existing state-of-the-art approaches on the novel framework, the authors find that batch balancing techniques demonstrate superior performance across the new framework as compared to advanced methods of learning distribution shifts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper's main strengths are:

    1. The paper introduces a comprehensive framework called FedMedICL, which takes into account various label-based, demographic and temporal shifts in medical data. Specifically, variations among individual hospitals are taken into account to enable the integration of highly specific datasets.
    2. The paper follows a comprehensive evaluation protocol to evaluate the robustness of existing methods to the novel framework.
    3. Additionally, the paper also examines the effect of new diseases which can come up periodically, such as COVID-19. Such an inclusion is a strategic decision to enable the modelling of unforeseen events and allow machine learning algorithms to generalize in shorter time.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses of the paper are:

    1. The rationale behind the usage of image based datasets is not clear. Many existing datasets consist of questionnaire responses that are analysed with NLP based approaches.

    https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6986921/

    2. It is unclear how the performance depends on the number of attributes utilized. For example, all the chosen datasets appear to consist of a single attribute.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please explain the benefits of using image-based data as compared to video-based, audio-based, or questionnaire-based data.
    2. Please provide an explanation for the choice of T=4 for localized split.
    3. Additionally, an analysis of the effect of the imbalance factor on the performance of each method would further strengthen the work.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors influencing my decision are:

    1. The paper is well organized and clearly written. It contains sufficient mathematical background to demonstrate the proposed approach.

    2. The paper consists of a robust evaluation protocol that is comprehensive and takes into account not only the most common scenarios that can exist, but also accounts for less common scenarios, which is crucial in the healthcare industry.

    3. The paper performs a suitable modification of existing approaches to ensure consistency with the new framework. Such a modification is crucial to ensure a fair comparison.

    4. The paper’s main contribution is the extension to unforeseen situations such as the prevalence of a new disease, such as COVID-19. The paper clearly demonstrates the failure of existing models to predict such situations and highlights the importance of developing new models that can perform in such unforeseen scenarios.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed my concerns comprehensively.




Author Feedback

We thank the reviewers for their constructive comments and suggestions. References follow our submitted paper.

Lack of Numerical Results (R1): We disagree with R1’s comment. It seems that R1 overlooked our detailed numerical results in Figures 2, 3, and 4, covering over 50 experiments across six medical imaging datasets of various sizes. Notably, conducting a single experiment on our splits of the CheXpert dataset requires a full day of computation on a high-end V100 GPU. In total, our experiments consumed 546 GPU hours.

Evaluation Metrics and Low Performance (R1, R3): We clarify a misunderstanding of the “LTR” metric by R1. In our paper (section 3.3 - Reported Metrics), “LTR” stands for “Long-Tailed Recognition,” not “Long-Term Retention,” which is not mentioned in the paper. Regarding R1’s concern about adaptability to new distributions, our adapted LTR measures performance across all seen tasks, and we presented continual learning curves in Figure 2, which demonstrate the model’s adaptability to new tasks. We used LTR to highlight the impact of imbalanced datasets, as shown in Figure 2, but we accept R3’s suggestion to include AUC for a broader perspective on performance. We will add AUC for the tested methods, which can be derived from existing data without additional experiments. For instance, the AUC scores for the FedAvg method are 73.4 on PAPILA and 66.9 on OL3I, comparable to those reported in MEDFAIR Table A8 [34]. These AUC results emphasize the importance of dealing with imbalanced datasets. Specifically, a model trained on a class-imbalanced dataset tends to over-predict the majority class, resulting in a high AUC but a low LTR. Examining the AUC vs LTR values across datasets resolves R3’s concerns about the seemingly low performance on Non-COVID, PAPILA, and OL3I datasets. In response to R3, we used a ResNet-18 backbone for all experiments, following MEDFAIR [34]. We will include this discussion in our final version.
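The AUC versus LTR point above can be illustrated with a small toy example. The sketch assumes that LTR accuracy is computed as the mean of per-class accuracies, a common convention in long-tailed recognition; the paper's exact definition may differ. The helper names, labels, and scores are hypothetical and chosen only to show how a classifier that ranks cases well but always predicts the majority class can combine a high AUC with a low class-averaged accuracy.

```python
def auc(labels, scores):
    """Pairwise (Mann-Whitney) AUC for binary labels in {0, 1}."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def per_class_mean_accuracy(labels, preds):
    """Mean of per-class accuracies (assumed stand-in for LTR accuracy)."""
    recalls = []
    for c in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == c]
        recalls.append(sum(preds[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


# Imbalanced toy set: 8 majority-class (0) and 2 minority-class (1) cases.
labels = [0] * 8 + [1] * 2
# Scores rank every positive above every negative, but all stay below 0.5.
scores = [0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.45]
preds = [int(s >= 0.5) for s in scores]  # every case is predicted as class 0

print("AUC:", auc(labels, scores))
print("Per-class mean accuracy:", per_class_mean_accuracy(labels, preds))
```

In this toy case the ranking is perfect, so the AUC is 1.0, while the minority class is never predicted, so the class-averaged accuracy drops to 0.5.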

Demographic Factors and Simulated Splits (R3, R4): We acknowledge R3’s suggestion to use multicenter splits. However, public multicenter datasets are limited in scale (usually 2-5 clients), which does not fully represent the complexity of large healthcare networks (e.g., all hospitals in a country like the UK, involving more than 2000 clients). Our methodology enables the simulation of such large-scale scenarios using any publicly available dataset that includes attribute metadata, such as CheXpert. This scalable approach is useful for researching various medical tasks. Previous works, such as MEDFAIR, have demonstrated significant disparities in performance when datasets are split by age. This is why we used age as the primary attribute in most datasets. Note that we used other attributes such as skin type in Fitzpatrick17k. Nonetheless, our benchmark is easily adaptable and can incorporate other single (e.g., device manufacturer) or multiple attributes (e.g., sex and age) with minimal additions to the code. We appreciate the reviewers’ suggestions to perform a comprehensive analysis of these demographic factors, and we aim to explore this in future work.

Why focus on images and not include NLP tasks? (R4): Recent studies showed that the most common type of FDA-regulated ML systems involved image processing [11], making the safe development of such systems an urgent matter. Due to the limited space in conference submissions, we focus on images in this work. Yet, our FedMedICL framework can work with any modality, and extending it to NLP tasks is a promising direction for future work.

Why T=4 in the localized split? (R4): Our choice of four tasks is motivated by seasonal variations in disease patterns and hospital admission rates, which affect demographic variations in hospitals. For example, the prevalence of seasonal flu can alter the demographic composition of hospital patients. Our choice captures such temporal and demographic shifts to enhance the realism of our simulations.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


