Abstract
Continual Test-Time Adaptation (CTA) aims to improve model generalization under distribution shifts by adapting to incoming test data. However, conventional CTA methods, such as pseudo-label refinement and entropy minimization, face challenges in fundus image classification: the limited number of training samples and class categories leads to overconfident yet miscalibrated predictions, making traditional adaptation methods ineffective. To address these issues, we propose a novel diffusion-based CTA framework, DiffCTA, which leverages the generative capabilities of diffusion models to refine test samples and align them with the source domain distribution without modifying the source model. DiffCTA enhances test-time adaptation using diffusion guidance while preserving diagnostic features. Specifically, we integrate content guidance to retain anatomical structures, consistency guidance to stabilize predictions via entropy minimization, style guidance for CLIP-based domain alignment, and a sampling optimization module that dynamically adjusts guidance strength across diffusion timesteps. We conducted experiments on glaucoma classification and diabetic retinopathy grading tasks. In the glaucoma classification task, our method outperformed the best existing approach by 2.6%, demonstrating its effectiveness in handling domain shifts without modifying the source model. The code is available at: https://github.com/mingsiliu557/DiffCTA.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1304_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mingsiliu557/DiffCTA
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LiuMin_Leveraging_MICCAI2025,
author = { Liu, Mingsi and Li, Xiang and Guo, Mengxiang and Duan, Lixin and Fang, Huihui and Xu, Yanwu},
title = { { Leveraging Diffusion Models for Continual Test-Time Adaptation in Fundus Image Classification } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {338 -- 348}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors use a classifier-guided diffusion model for test-time adaptation in fundus image classification. The authors propose three conditions as guidance: content, consistency, and style. The authors show that the proposed adaptation method achieves the best performance on the downstream classification task, and demonstrate the effectiveness of each condition through an ablation study.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors use a classifier-guided diffusion model for test-time adaptation in fundus image classification.
- The authors propose three conditions as guidance: content using an L2 loss, consistency using entropy regularization, and style using text and image embeddings.
- The authors show that the proposed adaptation method achieves the best performance on the downstream classification task, and demonstrate the effectiveness of each condition through an ablation study.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Section 2.1, “a target input x_0, the model generates samples \hat{x}_0 that progressively align with the source domain”: it is not clear what is preserved between x_0 and \hat{x}_0.
- Eq. 1 defines q(x_t|x_0), which is not by itself a Markov chain; a Markov chain requires the transition probability q(x_t|x_{t-1}) (see the DDPM sketch after this list).
- Eq. 2: how is sigma_t defined? Is it related to alpha or beta, or neither?
- Algorithm 1, line 3: the notation is confusing. Why is x_T^g used both as a probability distribution and as a sample from a standard Gaussian?
- Algorithm 1, line 8: what is the rationale behind this condition? If x_0 = x_{0,t}^g, should it still be updated?
- Algorithm 1, line 12: where does x_{t-1}^g come from?
- Eq. 5: y is not defined. Is it continuous or discrete? Is it a class label or something else? What label is it? There are not enough details on how the A_k’s are designed.
- Eq. 7: there are not enough details about the embedding networks E_i and E_t. Are they trained from scratch or fine-tuned from pretrained models? Are there relevant references?
- What do the numbers in Tables 1 and 2 represent? What do bold font and underlining mean? Are the differences significant according to statistical tests? A more detailed description should be given.
- Fig. 2 is confusing: why does “Ori Img” look almost the same as “Adapted Img” if they represent test-time adaptation? Are they supposed to represent images from different domains? The caption mentions glaucoma classification, but no classification results are shown in Fig. 2.
- The methods are compared only on the downstream classification task. What about image generation quality, e.g., the Fréchet inception distance between the generated images and real images?
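For context on the points about Eq. 1 and sigma_t: the author feedback below states that the paper follows the standard DDPM setting [Ho et al., 2020], so the relevant standard formulation is sketched here as an assumption rather than copied from the paper:

\[
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big),
\qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big),
\quad \bar{\alpha}_t = \prod_{s=1}^{t}(1-\beta_s),
\]

so the closed-form q(x_t|x_0) is the marginal of a Markov chain whose one-step transition is q(x_t|x_{t-1}). The corresponding ancestral sampling step is

\[
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\Big(x_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(x_t, t)\Big) + \sigma_t z,
\qquad z \sim \mathcal{N}(0, \mathbf{I}),
\]

where sigma_t is typically set to \sqrt{\beta_t} or \sqrt{\tilde{\beta}_t} with \tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\beta_t, i.e., it is determined by beta_t rather than being an independent parameter.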
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The authors applied a classifier-guided diffusion model for test-time adaptation, which is a standard practice. The novelty is not high.
- The presentation of results lacks clarity. The evaluation of the methods relies only on the downstream classification task, which ignores the quality of the generative model.
- Too many details are missing from the paper. Mathematical rigor is not sufficient.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have clarified many confusions and addressed most comments from reviewers.
Review #2
- Please describe the contribution of the paper
The paper proposes DiffCTA, a diffusion-based framework for continual test-time adaptation (CTA) in fundus image classification. It is the first work to address domain shifts in medical imaging using diffusion models without modifying the source model. Key contributions include: 1. Diffusion-driven adaptation: Aligns target domain images with the source distribution via reverse diffusion while preserving anatomical structures. 2. Guidance mechanisms: Introduces content, consistency, and style guidance to stabilize predictions and harmonize domain shifts. 3. Anatomy-aware sampling optimization: Dynamically adjusts guidance strength based on structural integrity, preventing noisy updates.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. First to integrate diffusion models into CTA for medical imaging, addressing limitations of entropy-based methods. 2. Preserves anatomical structures (e.g., optic disc) critical for diagnosis, enhancing trustworthiness in real-world deployment. 3. Extensive experiments on 5 glaucoma and 4 DR datasets validate robustness, with a 2.6% accuracy gain over prior methods. 4. The combination of CLIP-based style guidance and anatomy-aware sampling offers a balanced approach for domain alignment while preserving diagnostic features.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Insufficient comparison with generative baselines: there is no performance comparison with GAN-based domain adaptation methods commonly used in medical imaging (e.g., CycleGAN).
- The evaluation metrics are limited. Using accuracy alone as the evaluation criterion may not fully reflect the model’s ability to recognize different categories, especially when the classes are imbalanced.
- The figures need further refinement and verification. Elements whose meaning is unclear or unexplained should either be removed or given the necessary explanation, to improve the readability of the figures and the accuracy of the information conveyed, and to better support the paper’s core claims.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Dependency on proprietary datasets (e.g., SUSTech-SYSU) and lack of pre-trained diffusion model weights may hinder full replication.
- Provide additional experimental evidence by adding comparisons with generative domain adaptation methods (e.g., CycleGAN, StyleGAN-ADA) alongside the existing baselines (e.g., TENT, CoTTA), to more comprehensively validate the advantages of DiffCTA for medical image alignment.
- Refine the figures, provide further explanation, and improve their presentation. Consider adding more target-domain alignment cases to Figure 2, highlighting the preservation of key anatomical structures such as the optic disc and blood vessels, and comparing the results with those generated by conventional methods such as TENT.
- Further analyze the experimental results to quantify the performance improvement, including the percentage increase and other relevant metrics.
- On the basis of Table 3, please provide an analysis of the independent impact of each module on model robustness.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is logical in its ideas and organization, and the mathematics and explanation of the mathematical formulas are clear, but the results and evaluation need further improvement. More evaluation metrics, such as precision, recall, and F1-score, should be reported to assess the classification model more comprehensively and objectively, strengthening the scientific rigor and persuasiveness of the study. Comparisons with GAN-based domain adaptation methods commonly used in medical imaging should also be added.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose DiffCTA, a novel approach that leverages diffusion models to adapt fundus images from a source domain to better match the characteristics of a target domain, thereby improving model generalization during test time. This approach falls under the umbrella of continual test-time adaptation (CTA). DiffCTA integrates three guidance mechanisms: (1) Content-preservation guidance, which minimizes the L1 loss between the clean estimates generated by the diffusion model and the ground truth; (2) Prediction-consistency guidance, which minimizes the entropy of the prediction uncertainty across slight variations of the clean estimate; and (3) Cross-modal style guidance, which minimizes the cosine similarity between CLIP embeddings of the image and a textual prompt (“fundus”) in the source domain.
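To make the three guidance terms concrete, the following is a minimal PyTorch-style sketch of a guided reverse-diffusion step in this general style. It illustrates the classifier-guidance pattern the review describes and is not the authors' implementation (which is in the linked repository); all names, weights, and helper functions (predict_x0, p_sample, the augmentation list, etc.) are hypothetical, and the style term is written here as maximizing similarity to a source-style prompt, i.e., minimizing a cosine distance.

# Illustrative PyTorch-style sketch of a guided reverse-diffusion step combining content,
# consistency, and style guidance as described above. All function and variable names
# (predict_x0, p_sample, augs, weights, ...) are hypothetical and NOT taken from the
# DiffCTA repository; this sketches the general pattern only.
import torch
import torch.nn.functional as F

def guided_step(x_t, t, x_input, diffusion, classifier, clip_image_enc, src_text_emb,
                augs, weights):
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = diffusion.predict_x0(x_t, t)          # clean estimate from the noisy sample

    # (1) Content guidance: keep the clean estimate close to the input (anatomy preservation).
    l_content = F.l1_loss(x0_hat, x_input)

    # (2) Consistency guidance: entropy of the mean prediction over augmented views.
    probs = torch.stack([classifier(a(x0_hat)).softmax(dim=-1) for a in augs]).mean(dim=0)
    l_consistency = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()

    # (3) Style guidance: pull the CLIP image embedding toward a source-style text embedding.
    img_emb = F.normalize(clip_image_enc(x0_hat), dim=-1)
    l_style = 1.0 - (img_emb * F.normalize(src_text_emb, dim=-1)).sum(dim=-1).mean()

    loss = (weights["content"] * l_content
            + weights["consistency"] * l_consistency
            + weights["style"] * l_style)
    grad = torch.autograd.grad(loss, x_t)[0]

    # Shift the reverse-diffusion mean against the guidance gradient (classifier-guidance style).
    return diffusion.p_sample(x_t.detach(), t, guidance_grad=-grad)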
Additionally, the authors introduce an anatomy-aware sampling strategy that activates guidance only when the L1 distance between the input image and the model’s estimate is smaller than the distance to completely black or white images—thereby avoiding guidance in cases dominated by noise.
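A minimal sketch of that gating rule, assuming images scaled to [0, 1] and using hypothetical names (this is not the authors' code):

import torch
import torch.nn.functional as F

def guidance_active(x0_hat: torch.Tensor, x_input: torch.Tensor) -> bool:
    # Apply guidance only when the clean estimate x0_hat is closer (in L1 distance) to the
    # input image than to an all-black or all-white image, i.e., no longer noise-dominated.
    d_input = F.l1_loss(x0_hat, x_input)
    d_black = F.l1_loss(x0_hat, torch.zeros_like(x_input))
    d_white = F.l1_loss(x0_hat, torch.ones_like(x_input))
    return bool((d_input < d_black) & (d_input < d_white))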
Experiments are conducted on public datasets across two classification tasks, comparing DiffCTA with five state-of-the-art methods. Results show consistent improvements of DiffCTA over all baselines across the evaluated datasets and tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors present a novel CTA strategy that integrates multiple types of guidance into a diffusion model, regulated by an “anatomy-aware” sampling method. This represents a new and interesting approach, particularly in the context of fundus image classification, where such techniques have not been widely explored.
The proposed guidance mechanisms are well-motivated and draw inspiration from prior work, and their individual contributions are partially validated through an ablation study.
The method is evaluated on public datasets across different classification tasks, which enhances reproducibility and enables direct comparison with future approaches. Additionally, the results demonstrate a clear and consistent improvement in classification accuracy compared to several state-of-the-art baselines.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The ablation study does not evaluate each guidance component in isolation, but rather focuses on an incremental inclusion of them. This makes it difficult to assess the individual contribution of each guidance mechanism. Additionally, the ablation is limited to a single dataset (REFUGE), which restricts the generalizability of the findings.
Execution time comparisons across methods are missing. Reporting the computational cost—particularly for diffusion-based approaches—would be highly valuable for practitioners considering the method.
Reproducibility and evaluation practices could also be improved. Specifically, the paper lacks details on the hyperparameter search space for each model (including baselines), the number of training and evaluation runs, and the validation strategy. I encourage the authors to follow best practices for experimental reporting, such as those outlined by Dodge et al. in “Show Your Work: Improved Reporting of Experimental Results” (2019).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
A few minor comments:
- In Section 3.2, the claim that DiffCTA “excels in challenging domains where other methods struggle (e.g., Domain C)” seems somewhat overstated, given that the DDA method achieves 67.20% accuracy while DiffCTA reaches 68.46%. The difference, while in favor of the proposed method, is modest.
- The captions of Tables 1 and 2 could be improved for clarity. Specifically, it would be helpful to explicitly mention that the reported performance metric is accuracy, and to replace the abbreviation “DR” with the full term diabetic retinopathy for better readability.
- Table 2 would benefit from including per-class accuracy to give a more detailed view of the model’s performance across different categories.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a novel and interesting contribution to the field of test-time adaptation, particularly within the context of fundus image classification. The integration of multiple guidance strategies within a diffusion framework, along with an anatomy-aware sampling mechanism, is a creative and promising approach.
However, the evaluation could be strengthened. In particular, a more comprehensive ablation study isolating the impact of each component would better support the design choices. Additionally, including training and inference time comparisons would provide a more complete picture of the method’s practical utility and computational demands.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Considering MICCAI’s policy on not allowing additional experiments after submission, I have evaluated the authors’ rebuttal in light of the original content, the reviewers’ comments, and the proposed textual clarifications. While the novelty of the method may be moderate, the paper presents a clear and practical application of diffusion models for continual test-time adaptation in the context of fundus image classification—a relevant and challenging task in medical imaging.
The authors have adequately addressed the reviewers’ concerns and clarified several aspects of their method. Although some methodological details could be more thoroughly discussed, and the experimental validation remains limited in scope, the contribution is nonetheless meaningful and of interest to the community.
I therefore consider this a modest but valuable contribution to the field and support its acceptance.
Author Feedback
We thank all reviewers. We especially appreciate R1 and R2’s positive feedback, including R1’s comment that “the mathematics and explanation… are clear.” Reviewer 3 raised technical concerns, which we address in detail below. We begin with general issues, followed by reviewer-specific comments.

G1: Comparison with generative baselines (R1 & R2). We used CycleGAN for target-to-source translation; DiffCTA achieved a 7.80 average accuracy gain on glaucoma. StyleGAN-ADA is designed for training-time augmentation and is incompatible with TTA. TENT/CoTTA adapt weights, not images, so they cannot be combined with image-level translation methods such as GANs.

G2: Evaluation metrics (R1 & R2). DiffCTA outperforms the SOTA DDA in Precision, Recall, and F1 by 2.12/1.53/1.32 (glaucoma) and 1.11/1.02/0.78 (DR).

G3: Computational cost and execution time (R2). DiffCTA uses 8.8 GB / 7 s vs. DDA’s 5.9 GB / 9 s; the higher memory comes from multi-guidance and the faster sampling from our strategy.

G4: Component-wise ablation averaged across source domains (R1 & R2). We report average accuracy over all glaucoma datasets as source domains: content-only yields 53.38; adding Style, Sampling, or Consistency improves it to 54.36, 53.83, and 53.95, respectively, while Style + Consistency reaches 54.60.

G5: Figure refinement and table clarity (R1, R2 & R3). Tables now report accuracy, with bold/underline marking the best/second-best results, and “DR” is expanded. We also clarified the table layouts to highlight performance gains, including Accuracy/Precision/Recall/F1. Fig. 2 adds more cases and structure cues; TENT is excluded as it does not generate images. None of the compared baselines report significance testing, so we do not perform it either.

R1:
R1Q1: Dependency on proprietary datasets and pre-trained diffusion weights. SUSTech-SYSU is public; the diffusion weights are in the anonymous GitHub repository.

R2:
R2Q1: Clarity of implementation details. As noted in Sec. 3.1, we use a single-source setting (batch size 1, ResNet-50, T=50, η=5, λ=8) and will follow “Show Your Work” for improved reporting.
R2Q2: Claim about DiffCTA in Domain C. We acknowledge the modest Domain C gain and revised the text to reflect robustness.
R2Q3: Inclusion of per-class accuracy in Table 2. Per-class accuracy (DDA vs. DiffCTA): No DR (74.1→75.6), Mild (58.3→60.4), Moderate (62.8→64.1), Severe (47.6→51.2), Proliferative DR (30.4→35.0).

R3: All points clarified below.
R3Q1: What is preserved between x_0 and \hat{x}_0. We preserve structures (e.g., the optic disc) via the Content, Style, and Consistency guidance (Sec. 2.2–2.4); Sec. 2.1 only introduces the diffusion formulation.
R3Q2: Whether q(x_t|x_0) defines a Markov chain. q(x_t|x_0) is derived from the Markov transition q(x_t|x_{t-1}), as used in DDA [6]; the forward process is indeed Markov.
R3Q3: Definition of sigma_t. sigma_t is derived from beta_t under the standard DDPM setting [10]; there is no extra parameterization.
R3Q4: Whether x_T^g denotes a distribution or a sample from a Gaussian. x_T^g is a sample from the Gaussian q(x_T|x_0); the symbol reuse follows DDPM convention.
R3Q5: Update condition in Algorithm 1, line 8. The condition (Sec. 2.5) avoids applying guidance to noisy samples in early steps. Even if x_0 = x_{0,t}^g, updates continue for alignment.
R3Q6: Origin of x_{t-1}^g in Algorithm 1, line 12. x_{t-1}^g comes from line 6 via Eq. (2), using x_t^g and the predicted noise.
R3Q7: Definition of y and design of A_k in Eq. (5). y is the class label, and A_k is AugMix-based (Sec. 3.1). p_theta(y|A_k(x_{0,t}^g)) is the predicted class distribution; the entropy is computed over this distribution, not on y directly (a generic form is sketched below).
R3Q8: Embedding networks E_i and E_t in Eq. (7). E_i and E_t are from BioMedCLIP and are not fine-tuned; we will clarify this in Sec. 3.1.
R3Q9: Visual difference between “Ori Img” and “Adapted Img”. “Ori Img” is the input; “Adapted Img” is its source-aligned version. The difference is small due to structure preservation; we conducted confidence tests showing improved prediction certainty.
R3Q10: Evaluating image generation quality (e.g., FID). We aim to shift the domain style while preserving content, so we focus on accuracy; FID does not reflect domain alignment.
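Regarding R3Q7 above, one generic instantiation of such an entropy-based consistency objective over K augmented views is shown below; the exact form used in the paper may differ (e.g., the entropy of the averaged prediction rather than the average of per-view entropies), so this is an assumption for illustration only:

\[
\mathcal{L}_{\text{cons}} = \frac{1}{K}\sum_{k=1}^{K} H\!\big(p_\theta(\cdot \mid A_k(\hat{x}^g_{0,t}))\big),
\qquad
H(p) = -\sum_{c} p_c \log p_c .
\]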
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This manuscript introduces a new diffusion model-based continual test-time adaptation method for fundus disease classification. It is able to align target data with the source domain distribution while preserving diagnostic features, and this is achieved by using content, consistency and style guidance during the reverse diffusion process. The method produces better classification performance than multiple state-of-the-art continual test-time adaptation approaches on two different fundus image datasets. The rebuttal has well addressed the (major) concerns raised by the reviewers, including clarification of technical details. Thus, the manuscript is recommended for acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work presents enough technical contributions and meets the bar of MICCAI.