Abstract

Traditional federated learning relies on fully labeled datasets at each medical institution, which is impractical in real-world clinical scenarios. Federated Active Learning (FAL) addresses this by selecting a small number of informative samples for labeling, but it faces challenges such as domain shift across institutions. Moreover, existing FAL methods rely on single-round model knowledge to estimate prediction-level uncertainty, ignoring uncertainty arising from features and from model evolution during training. In this work, we propose TM-FAL, a novel Temporal Model-based framework for federated active medical image classification under domain shift. TM-FAL introduces a new uncertainty measure that integrates feature differences and prediction confidence from temporal local and global models to capture both local-global differences and the inherent complexity of images. Additionally, we use the predictions of the global model as pseudo labels to group images, mitigating the class imbalance caused by uncertainty-based selection. Experiments on two medical image classification datasets demonstrate that TM-FAL outperforms various state-of-the-art methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1626_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/IAMJackYan/TM-FAL

Link to the Dataset(s)

N/A

BibTeX

@InProceedings{YanYun_Temporal_MICCAI2025,
        author = { Yan, Yunlu and Feng, Chun-Mei and Li, Yuexiang and Xie, Jinheng and Chen, Jun and Elhoseiny, Mohamed and Hu, Ming and Wu, Kaishun and Zhu, Lei},
        title = { { Temporal Model-Based Federated Active Medical Image Classification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {616--626}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a method for federated active learning under domain shift. They leverage an uncertainty measure that compares feature differences and prediction confidence from local and global models. Additionally, grouping images by the global model's pseudo labels helps to maintain class diversity. Their method is evaluated on two medical image classification datasets (Fed-ISIC and Fed-Camelyon) and compared with several baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and well structured.
    • The experiments are well organized.
    • One aspect of their method is novel: it leverages knowledge from models at different rounds (in addition to local and global models).
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The evaluation of the results is limited to the classification, without comparison to the context of the clinical applications (skin lesions and histopathology classification).
    • While the authors claim that “the selector pool effectively reduces computational overhead and enhances temporal differences”, the paper lacks a quantitative comparison of computational overhead, training time, or similar metrics against the baselines.
    • There is a concern regarding the novelty of the proposed approach, as the authors do not adequately position their work in relation to existing methods, including those that explore sampling from easy to difficult, curriculum learning [1], curriculum learning for active learning in medical image analysis [2], and curriculum learning for federated learning in medical image classification [3].
    • Missing justification for the choice of datasets, network architectures, and evaluation metrics employed.
    • The paper does not include a discussion of the limitations of the proposed method or suggestions for future work.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • There is a concern regarding the novelty of the proposed approach, as the authors do not adequately position their work in relation to existing methods, including those that explore sampling from easy to difficult, curriculum learning [1], curriculum learning for active learning in medical image analysis [2], and curriculum learning for federated learning in medical image classification [3].
    • The authors state in the introduction that “models typically classify simpler samples in the early stages and progressively handle more difficult samples in later stages” but they do not cite the foundational work by Bengio et al. [1], which introduced this concept.
    • The authors state ““we propose a novel local-global temporal uncertainty-based sampling strategy, inspired by the cognitive principle of the model”, similar approaches in medical imaging have already leveraged model uncertainty and curriculum learning in federated settings to penalize forgotten samples and enhance local model consistency [3]. It would strengthen the paper if the authors more clearly positioned their work in the context of the current state of the art in medical image classification.

    • The uncertainty-based sampling introduces class-imbalance, which must later be addressed through pseudo-grouping. However, the advantage of this two-step approach is unclear. Would it be more effective to explore an uncertainty metric that accounts for class distribution?

    • Missing justification for the choice of datasets, network architectures, and evaluation metrics employed.
    • Could the authors clarify why they selected FED-ISIC and FED-Camelyon from Flamby? The Flamby dataset suite is composed of Fed-Camelyon16, Fed-LIDC-IDRI, Fed-IXI, Fed-KITS2019, Fed-ISIC2019, Fed-Heart-Disease.
    • The justification for the choice of the respective networks seem to be missing. Could the authors explain why different networks were used for FED-ISIC and Fed-Camelyon?
    • The paper lacks justification for the choice of evaluation metrics. Why were balanced accuracy or accuracy selected, rather than AUC? Additionally, is one of the datasets balanced while the other is not? Please clarify.

    • Could the authors point to their statistical analysis of the results (e.g. p-values as indicated in reviewer guidelines) to justify their claims in performance?
    • I am concerned about the small performance improvements.
    • Ablation studies: “Compared to TM-FAL, both M1 and M2 show significant performance degradation across different FAL epochs”
    • How is the result of #E5 (69.63 ± 0.30) significantly different from 70.46 ± 1.20?
    • “Additionally, M2 outperforms M1, indicating that the LGTUS module plays a more crucial role in improving the method’s performance, as it effectively measures the importance of the data”: I’m confused here, as M1 appears to be TM-FAL without LGTUS.

    • The experimental setup is presented as though the paper is a follow-up work of [5].

    • While I understand that additional experiments cannot be requested for this paper, I suggest that the authors consider performing an analysis that distinguishes between uncertain and certain images. This would provide valuable clinical insights into the Fed-ISIC and Fed-Camelyon datasets, highlighting the significance of their work for both the MICCAI community and the broader medical imaging field.

    Minor comments:

    • TM-FAL → TM is not introduced in the abstract
    • Figure 2 does not provide a clear explanation of the roles of the unlabeled and labeled datasets, nor the relationship between the local and global models
    • There may be a typo: “This strongly demonstrates TM-FAL’s effectiveness in addressing FAL challenges under domain shift”. Should it be “FEAL” instead of “FAL”?
    • Figure 3 mentions “results”, but balanced-accuracy would be more informative.

    References:

    [1] Bengio, Y., Louradour, J., Collobert, R., & Weston, J. (2009). Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 41-48).

    [2] Ma, S., Du, H., Curran, K. M., Lawlor, A., & Dong, R. (2024). Adaptive curriculum query strategy for active learning in medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 48-57). Cham: Springer Nature Switzerland.

    [3] Jiménez-Sánchez, A., Tardy, M., Ballester, M. A. G., Mateus, D., & Piella, G. (2023). Memory-aware curriculum federated learning for breast cancer classification. Computer Methods and Programs in Biomedicine, 229, 107318.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method demonstrates only minor performance improvements, and the justification for its significant impact is not clear.

    The authors do not position their work in relation to the relevant context of medical image classification.

    The authors only focus on classification performance, overlooking other important factors such as computational resources, training time, carbon footprint, etc.

    There is no discussion of the limitations or clinical implications of the work, nor are future research directions addressed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    It is not correct that curriculum learning (CL) estimates “easy to hard” solely based on a predefined metric. I recommend that the authors consult the relevant literature on CL and self-paced learning (SPL), as well as the previously mentioned references. For instance, Ref [2] introduces an adaptive curriculum query strategy, while Ref [3] dynamically prioritizes local training samples to enhance model consistency by penalizing inconsistent predictions, i.e., forgotten samples.

    My concerns regarding the novelty of the work remain, particularly due to the lack of contextualization concerning curriculum learning and self-paced learning strategies. The justification for the two-step approach is insufficient, as defining uncertainty at the sample level appears to be a design choice made by the authors. Furthermore, using [5] to justify all the experimental settings further suggests that the contribution is incremental over previous work. It also remains unclear why only two datasets from the Flamby suite were selected, given that the suite includes Fed-Camelyon16, Fed-LIDC-IDRI, Fed-IXI, Fed-KITS2019, Fed-ISIC2019, and Fed-Heart-Disease.

    As I mentioned in my original review, a paper that does not address the limitations of its approach may lack a clearly defined scope for its claims. I appreciate the authors’ response in the rebuttal and encourage future work investigating uncertain and certain images. This would provide valuable clinical insights into the Fed-ISIC and Fed-Camelyon datasets, highlighting the significance of their work for both the MICCAI community and the broader medical imaging field.



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel method for tackling Federated Active Learning (FAL), named TM-FAL. Their approach is inspired by the observation that, during the learning process, models tend to classify simpler samples in the early stages and progressively handle more complex ones in later stages. Based on this assumption, they introduce a new way of computing uncertainty that accounts for the temporal progression of the model—specifically, uncertainty is calculated using both global and local models saved at different time steps. To address the potential issue of this strategy skewing sample selection toward only the most difficult classes, the authors also incorporate a pseudolabeling mechanism. They use the global model to generate pseudolabels, thereby creating a more balanced dataset for annotation during the active learning phase. Experiments conducted on standard medical benchmarks demonstrate that the proposed method outperforms state-of-the-art approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The core idea of the paper is genuinely interesting. Incorporating the temporal progression of the model to compute uncertainty is novel and deserves recognition.
    • Results are computed with respect to most current state-of-the-art (SOTA) models and are reported with standard deviations.
    • The paper is well written—everything is clearly explained, and results are presented in an accessible and structured manner.
    • The ablation studies are particularly informative, especially the hyperparameter analysis. It shows that even using a large window between selected models still achieves SOTA performance, while also reducing computational and storage costs.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The ablation study (Table 2) clearly demonstrates that LGTUS is essential for achieving strong performance. However, the second model, M2 (the one without PLG), contributes only marginally to the overall performance improvement compared to the full model. The authors could provide a more detailed discussion or justification for this behavior in the paper.
    • I suggest including a link to the anonymized code, or at the very least, mentioning that it will be released upon acceptance.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I really appreciated the paper; I think it is novel enough for this conference. Moreover, the work is clearly presented, and the results are thoroughly detailed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel federated active learning (FAL) framework for medical image classification under domain shift. The authors use uncertainty-based FAL consisting of two components: (1) local-global feature differences and prediction confidence from temporal local and global models, and (2) pseudo-labels generated by the global model. They utilize both the feature differences and prediction confidence of temporal local and global models to capture the local-global differences as well as the inherent complexity of the data. Since relying solely on this uncertainty-based strategy can lead to a bias in data selection towards more difficult-to-classify categories, they use a pseudo-labeling-based grouping strategy that maintains class diversity. They perform experiments on two datasets and their results demonstrate superior performance over state-of-the-art methods. They also conduct two ablation analyses: one assessing the individual contributions of the Local-Global Temporal Uncertainty-based Sampling and Pseudo-labeling components, and another on the hyperparameters used to sample the temporal local-global models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The problem statement is well motivated, emphasizing the importance of selecting data samples that contribute not only to local model improvement but also to the generalization capability of the global model across diverse institutions. Figure 1 clearly illustrates the motivation, providing clear justification for both components of the proposed method: Local-Global Temporal Uncertainty-based Sampling and Pseudo-labeling. The results demonstrate superior performance of the proposed methodology over state-of-the-art methods on both the Fed-ISIC and Fed-Camelyon datasets. The ablation analysis effectively demonstrates the significance of each component, particularly the impact of Local-Global Temporal Uncertainty-based Sampling. The hyperparameter ablation results demonstrate the stability of model performance, consistently outperforming other methods across all reported settings.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The methodology is well depicted by the figure and the explanation in the methodology section. However, Algorithm 1 seems redundant, offering little additional information beyond what is already described. While Figure 1 clearly illustrates the motivation for this work, the paper would benefit from more detailed descriptions of the experimental setup and the specific procedure used to identify uncertain samples.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Computational times for calculating uncertainty based on a temporal local-global model can provide clearer understanding of the method’s practicality and scalability.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The results are promising and the proposed methodology is well-justified. The paper is well written and easy to follow. The ablation analysis clearly demonstrates the necessity of each component. Additionally, the method relies on a small number of hyperparameters, contributing to its overall simplicity.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have adequately addressed my main concerns, including clarifying the experimental setup for the motivation figure, and computational costs, which improves the paper’s clarity and practical relevance. The redundancy of Algorithm 1 remains unaddressed, but this minor issue does not affect the overall quality. I continue to recommend acceptance.




Author Feedback

We sincerely thank all reviewers for their valuable comments, and R2 and R3 for acknowledging our method as novel.

[R1] Q1. Relation to clinical applications: In Fed-ISIC, our method improves balanced accuracy by 2.62% (#E2) over the best baseline—a clinically meaningful gain given the presence of melanoma cases, where errors may delay treatment. In Fed-Camelyon, the 0.84% gain (#E5) helps reduce misdiagnosis in cancer detection. These results highlight the method's practical reliability in clinical settings.

[R1] Q2. Complexity analysis: TM-FAL's data-selection time drops from 413s to 217s with the selector pool—a nearly 50% reduction in overhead. While TM-FAL is slightly slower than FEAL (184s), we believe this small additional cost is acceptable given the consistent performance gains shown in Table 1. We will clarify this trade-off in the final version.

[R1] Q3. Comparison with curriculum learning (CL): Although both CL and our method involve an 'easy-to-hard' notion, they are different concepts. CL methods [Ref 1-3] aim to learn samples in an easy-to-hard order based on a predefined difficulty metric. In contrast, our method interprets 'easy-to-hard' as a dynamic learning pattern during the model's training. Guided by this implicit cognitive principle, we propose a novel uncertainty estimation approach based on the feature differences and prediction confidence between temporal local and global models. We select the samples with the highest uncertainty rather than following an easy-to-hard order. In the final version, we will cite [Ref 1-3] and discuss the differences.

[R1] Q4. Justification for experimental setup: We followed [5] to ensure a fair comparison, and selected Fed-ISIC and Fed-Camelyon because they are widely adopted FAL benchmarks that exhibit significant domain shift—the key challenge our method aims to address. Following [5], we used EfficientNet-B0 for Fed-ISIC and DenseNet-121 for Fed-Camelyon due to their established effectiveness in skin lesion and histopathology tasks, respectively. We also followed [5] in using balanced accuracy for Fed-ISIC (multi-class, imbalanced) and accuracy for Fed-Camelyon (binary). We agree AUC is valuable and will report it in the final version.

[R1] Q5. Discussion of limitations: One limitation of our work is that it mainly targets classification. The key insight of our method is general and can be applied to more tasks, e.g., segmentation; we plan to explore this in future work.

[R1] Q6. Statistical analysis: We conducted paired t-tests between TM-FAL and two baselines (M2 and FEAL) across #E2–#E5. The resulting p-values are M2 (0.019, 0.021, 0.020, 0.042) and FEAL (0.003, 0.017, 0.026, 0.010), all below 0.05, confirming statistical significance.

[R1] Q7. M1 vs. M2: M1 denotes TM-FAL without LGTUS, and M2 denotes TM-FAL without PLG. Since M2 outperforms M1, this suggests that LGTUS contributes more critically to performance.

[R1] Q8. Advantage of the two-step method: The suggested uncertainty metric may conflict with our design goal, as uncertainty is sample-level while class distribution is dataset-level. Forcing a class-distribution prior into uncertainty estimation risks distorting the informativeness signal. In contrast, our two-step method decouples informativeness and diversity, offering better interpretability and flexibility, as supported by Table 2.

[R1] Q9. Analysis of uncertain vs. certain images: We include a brief analysis in Fig. 1(a), showing that samples with high uncertainty are associated with lower accuracy and exhibit greater variance over training. In future work, we plan a deeper analysis to better characterize certain vs. uncertain cases.

[R2] Q1. Ablation study: This arises because the key to FAL lies in how to measure the informativeness of samples. We conducted paired t-tests between TM-FAL and M2; the p-values across #E2–#E5 are 0.019, 0.021, 0.020, 0.042, all below 0.05, confirming statistical significance.

[R3] Q1. Suggestion: We will provide a more detailed description of the experimental setup and the procedure for identifying uncertain samples.
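The paired t-tests reported in the rebuttal (four paired scores per comparison, e.g. across FAL epochs #E2–#E5) can be sketched as below. This is a minimal, dependency-free illustration, not the authors' actual analysis: the score values are hypothetical placeholders, and instead of computing an exact p-value the t statistic is compared against the two-tailed critical value t(α=0.05, df=3) ≈ 3.182.

```python
import math

def paired_t_statistic(a, b):
    """Paired t statistic for two matched score lists
    (e.g. per-epoch balanced accuracy of two methods)."""
    assert len(a) == len(b) and len(a) > 1
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    # Sample variance of the paired differences (Bessel's correction).
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Hypothetical per-epoch scores for the proposed method and one baseline.
tm_fal   = [70.1, 70.8, 70.4, 70.5]
baseline = [69.2, 69.9, 69.6, 70.0]

t = paired_t_statistic(tm_fal, baseline)
T_CRIT = 3.182  # two-tailed critical t, alpha = 0.05, df = n - 1 = 3
print(f"t = {t:.2f}, significant at 0.05: {t > T_CRIT}")
```

With only four paired observations, significance hinges on the differences being consistently in one direction; `scipy.stats.ttest_rel` would give the exact p-values the rebuttal cites.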




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a federated active learning method leveraging temporal uncertainty across local and global models. While the idea is interesting, the novelty is limited and insufficiently distinguished from existing curriculum learning approaches. The choice of datasets and architectures lacks justification, and the experimental design closely follows prior work, raising concerns about incremental contribution. Statistical analyses and performance gains are marginal, and key claims are not well substantiated. Important aspects such as computational cost, clinical applicability, and limitations are either underexplored or omitted. Despite some strengths, the paper does not meet the standards required for acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


