Abstract

Distribution shifts of medical images seriously limit the performance of segmentation models when applied in real-world scenarios. Test-Time Adaptation (TTA) has emerged as a promising solution for ensuring robustness on images from different institutions by tuning the parameters at test time without additional labeled training data. However, existing TTA methods are limited by unreliable supervision, as there is no effective way to monitor the adaptation performance without ground truth, which makes it hard to adaptively adjust model parameters in the stream of testing samples. To address these limitations, we propose a novel Test-Time Evaluation-Guided Dynamic Adaptation (TEGDA) framework for TTA of segmentation models. In the absence of ground truth, we propose a novel prediction quality evaluation metric based on Agreement with Dropout Inferences calibrated by Confidence (ADIC). ADIC is then used to guide adaptive fusion of the current features with high-ADIC features stored in a feature bank, yielding refined predictions for supervision, which is combined with an ADIC-adaptive teacher model and loss weighting for robust adaptation. Experimental results on multi-domain cardiac structure and brain tumor segmentation demonstrate that our ADIC can accurately estimate segmentation quality on the fly, and that our TEGDA achieves the highest average Dice and lowest average HD95, significantly outperforming several state-of-the-art TTA methods. The code is available at https://github.com/HiLab-git/TEGDA.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2263_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/HiLab-git/TEGDA

Link to the Dataset(s)

M&Ms dataset: https://www.ub.edu/mnms/

BraTS2023 dataset: https://www.synapse.org/#!Synapse:syn51156910/wiki/

BibTex

@InProceedings{ZhoYub_TEGDA_MICCAI2025,
        author = { Zhou, Yubo and Wu, Jianghao and Liao, Wenjun and Zhang, Shichuan and Zhang, Shaoting and Wang, Guotai},
        title = { { TEGDA: Test-time Evaluation-Guided Dynamic Adaptation for Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {639 -- 649}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work studied test-time adaptation of segmentation models to tackle domain shift. To generate high-confidence pseudo-labels, the authors propose an uncertainty-like evaluation metric to select high-confidence samples, which are saved in a bank and used for test-time feature-fusion-based refinement.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of this work are mainly two-fold. The first is that an uncertainty-like evaluation metric is developed for sample selection, which is used for pseudo-label generation. The second is that the proposed method is evaluated on two typical datasets and compared to SOTA methods, showing better results.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The weaknesses are listed below.

    1. In the method section for ADIC, the reason for using the overall confidence to calibrate the ADI is not clear enough. What happens if it is not used?
    2. Although feature-fusion-based refinement can help boost robustness when the prediction is poor, it will inevitably introduce bias, as the knowledge differs among these samples, leaving the test image at a high risk of bias that is hard to remove. I think this technique is more robust for classification tasks, which target global features; for segmentation tasks I am skeptical, although the ablation study does show its effectiveness.
    3. Overall, the proposed method lacks novelty. The basic idea is still to generate pseudo-labels first; in this work an uncertainty-like technique is used for sample selection, and a mean-teacher framework is used for robust updating.
    4. The bank configurations lack discussion, e.g., the dropout sampling number M, and so on.
    5. In the experiments, the settings of the other methods are not described. For a fair comparison, the other methods should also have their hyperparameters tuned.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As listed in the weaknesses, the proposed method is simply a combination of several commonly used techniques (an uncertainty-like metric for sample selection and the mean-teacher framework for robust learning), and thus lacks novelty overall. Moreover, the comparison settings are not clear enough to convince readers.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have tried to clarify some of the confusions and issues I raised, but to be honest the responses are not convincing enough to me. For the motivation or grounding of the ADIC metric, Fig. 3(a) only shows the correlation with the final Dice value, and it cannot be taken as evidence of the advantage of ADIC over others, so an ablation study for this is still needed. Second, regarding the novelty, I think the techniques used here are a good combination for application in this TTA scenario, but the novelty itself is limited. Last but not least, it would be great if the authors could also provide the hyperparameters that were tuned for the other compared methods in the final version, or release the code publicly as they claimed in the rebuttal; that would increase transparency.



Review #2

  • Please describe the contribution of the paper

    The paper introduces a continual test-time adaptation method that adds Monte-Carlo (MC) dropout-based guidance and a memory-bank formulation to the student-teacher framework. The method is designed as follows:

    (1) Compute a quality score q by measuring the agreement between predictions with and without dropout. Calibrate this score using confidence derived from the average of dropout predictions.

    (2) Maintain a memory bank of past features, where each feature is updated as a q-weighted linear combination of the current feature and memory bank features. The memory bank features are weighted based on their cosine similarity to the current feature.

    (3) Introduce a q-guided update for the teacher network, ensuring updates are based on the prediction quality.

    The method is tested on cardiac and brain MRI datasets and shows improvement over baseline and state-of-the-art methods.
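
    For illustration, the quality score in step (1) could be computed roughly as in the following sketch, assuming a PyTorch segmentation model with dropout layers and softmax outputs; the soft-Dice agreement, the mean-confidence calibration, and all names here are illustrative assumptions, with the exact formulation given in Eqs. 1-3 of the paper.

```python
import torch

@torch.no_grad()
def quality_score(model, image, num_dropout=10):
    """Agreement between dropout and no-dropout predictions, calibrated by confidence."""
    model.eval()
    p_clean = torch.softmax(model(image), dim=1)  # prediction without dropout

    # Enable only the dropout layers, keeping normalization layers in eval mode.
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()
    p_drop = torch.stack(
        [torch.softmax(model(image), dim=1) for _ in range(num_dropout)]
    )
    model.eval()

    p_mean = p_drop.mean(dim=0)
    # Agreement with Dropout Inferences: soft Dice between the clean and mean dropout predictions.
    adi = 2.0 * (p_clean * p_mean).sum() / (p_clean.sum() + p_mean.sum() + 1e-6)
    # Calibration factor b in (0, 1): average confidence of the mean dropout prediction.
    b = p_mean.max(dim=1).values.mean()
    return (adi * b).item()  # ADIC-style quality score q
```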

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper demonstrates the following strengths:

    (1) Presents a continual test-time domain adaptation method which does not require source data or modifications to the initial training procedure.

    (2) Provides a prediction quality-informed technique to update the teacher weights in a student-teacher framework.

    (3) Provides a memory-bank to update the test feature on-the-go based on the quality of the predicted pseudolabel.

    (4) Shows improvements on two MRI datasets (cardiac, brain) and domain adaptation settings (between MRI scanners, adult to pediatric brain imaging).

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I would suggest the following areas for improvement or clarification:

    1. The primary drawback seems to be the need for extensive computation per back-propagation step:
      • the student-teacher framework, unless modified, requires predictions from multiple augmentations of the input image.
      • the MC-dropout-enabled agreement score further requires M (number not specified in the paper) predictions with random dropout (rate not specified in the paper) plus one prediction without dropout.
      • This is followed by a dot product per dropout prediction as well as feature-similarity dot products depending on the length of the memory bank.
      • This likely increases the inference time by a considerable margin compared to existing methods like TENT, which requires one prediction, trains only the BN affine parameters and still achieves improvement over the baseline. Similarly, CoTTA performs marginally worse but also saves computation, as it uses only the student-teacher formulation. What do you think about this limitation?
    2. SAR [23] shows limitations of batch normalization in their paper and only runs their method with layer/group normalization. Have you taken that into account while performing the comparison?

    3. For reproducibility:
      • The cited paper for the M&Ms dataset provides 375 cases while the manuscript uses only 345 cases. Are there any specific exclusion criteria?
      • Please specify the number/rate of dropout predictions used for computing ADIC.
      • Could you further clarify how the continual learning setup is designed? For instance, in M&Ms, are domains B, C and D randomly included in the same batch, fed sequentially, or tested independently? In the 2D UNet, does each batch use slices from the same case (10-13 slices per case, batch size of 10 as mentioned in the paper) or a random selection?
    4. Quoting from the text: “dropout inference often yields high agreement between P and Pm in interior regions but minor discrepancy at borders in the target [25], leading ADI score to overestimate Dice. Therefore, we further introduce a factor b ∈ (0, 1) to calibrate the ADI score”. It is unclear how calibrating with the confidence score of the average dropout prediction solves the interior/border overestimation of Dice.

    5. Do you also use random update to weights as in CoTTA [22] as an anti-forgetting measure?

    6. Finally, this is more to start a discussion than a weakness: the authors of the M&Ms paper found that in their organized challenge, strong augmentations in the source network were enough for the network to perform well on unseen scanners and did not require complex domain adaptation techniques. However, in Table 1, we see substantial differences between the performances in Domains B/C/D. How important do you think robust source-network training is?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) MC Dropout has been used for segmentation in the past to refine pseudo-labels through different measures of confidence (entropy, standard deviation, etc.), sometimes as a post-processing method or for end-to-end training. Even in the test-time literature, MC Dropout has been used by Continual-MAE* [work not cited in the paper, reference below] to compute token-wise uncertainty and adapt tokens showing significant domain shifts. However, I have not seen works in the test-time literature where MC Dropout informs the memory bank as well as the mode of teacher-network update. So, in that context, it is a novel application of an existing method in a different domain adaptation setting. If the authors can respond in the rebuttal, I would still support its acceptance, as it fills a gap in the literature and the efficacy of the method is well demonstrated in the paper.

    2) If any extension is planned, the method could be compared to DLTTA [26], which also uses a memory bank to guide updating (in their case, of the learning rate) for medical image segmentation.

    3) I have selected ‘The submission does not provide sufficient information for reproducibility’ because of some key missing details as mentioned in weaknesses. If the authors can update the necessary information, then I am happy to change it to ‘The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.’

    *Liu, Jiaming, et al. “Continual-mae: Adaptive distribution masked autoencoders for continual test-time adaptation.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In my opinion, this is a method that seems to work well and is a novel application of ideas. However, I am wary of the tradeoff between computation and performance, especially in the test time adaptation setting as well as other concerns detailed above. Therefore, I prefer to wait for authors’ response before recommending accept.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Thanks to the authors for their response. I maintain the same feedback as in the review. In fact, I have additional questions after reading the rebuttal, for instance: how does the network handle catastrophic forgetting, which is a fundamental problem in continual test-time adaptation? I would suggest the authors incorporate clear explanations of the pipeline for reproducibility. I recommend acceptance, as the method proposes a nice methodology and performs robust validation.



Review #3

  • Please describe the contribution of the paper

    The authors proposed a framework called TEGDA for test-time adaptation in medical image segmentation under domain shifts. The framework involves a segmentation quality evaluation metric, ADIC, and a feature fusion module to improve robustness.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed framework is novel in its methodology. The segmentation quality estimation metric ADIC is novel, and its effectiveness is demonstrated through the high correlation between ADIC and the real Dice. ADIC is not used to guide the update of the weights directly, but rather for adaptive feature fusion and the model EMA rate.
    • Improved test-time performance. The proposed method shows improved segmentation results with statistical significance compared to other methods.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • In Table 3, the increase caused by the proposed AFFA and SMU is small (< 1% in Dice). The significant improvement is attributed to Lmt and Lre.

    • The loss functions Lmt and Lre need elaboration. They contribute most to the improved performance, but are not clearly presented. The marginal effect of adding AFFA and SMU is not comparable, yet they occupy the most space in the methodological contribution.

    • Error in visualization: in Figure 2, M&MS-B, the segmentation masks do not match the image; see the ground truth and the proposed TEGDA. The rest do not match either. The mask appears to be for another cardiac phase or another subject.

    • In Fig. 2, M&MS-C, the source model fails on a relatively easy case, I would say, given the high contrast and that the image is free of severe artefacts. This raises concerns about the quality of the trained baseline model. A well-trained baseline like nnUNet would probably not fail on such cases even when trained on a single domain. It would be more interesting to validate the method on a stronger baseline.

    • It would make the paper stronger if the authors had discussed the possibility of using ADIC as a loss to guide the adaptation. Perhaps it is not differentiable or could cause mode collapse. But if ADIC matches Dice so well, a natural move would be to use it as guidance for model adaptation.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Test-time-adaptation has practical impact in generalizing AI models in medical image segmentation with domain shift. The proposed method is novel and the ADIC provides a good test-time performance evaluation. However, the methodology needs further clarification and the proposed method leads to limited improvement shown in the ablation study. The results section regarding the ablation study and the visualization in Fig. 2 also needs further elaboration and discussion.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We are glad that the reviewers appreciate the novelty (R2, R3) and performance (R1, R2, R3) of our method. Here we address the main concerns:

  • Novelty (R1) Though our method is built on the general ideas of pseudo-labeling, uncertainty, and mean-teacher, we propose three novel components that are the core of our method: 1) a new metric, ADIC, to accurately estimate the Dice scores of testing images, which performs better than existing uncertainty metrics like entropy [16] or variance [25] (Fig. 3(a)); 2) traditional methods generate pseudo-labels by the teacher directly or by averaging multiple predictions [22][24], while our method uses adaptive feature fusion based on ADIC to obtain better pseudo-labels; 3) using ADIC, we propose three adaptive mechanisms for better performance: adaptive feature fusion, an ADIC-aware mean teacher, and an adaptive loss. Each of the proposed components has been validated in the ablation study.
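
    As a rough illustration, an ADIC-aware mean-teacher update could look like the following sketch, where the EMA momentum is assumed to be modulated linearly by the quality score q; the exact schedule used in the paper may differ, and all names here are illustrative.

```python
import torch

@torch.no_grad()
def update_teacher(teacher, student, q, alpha_min=0.99, alpha_max=0.999):
    """EMA update whose momentum depends on the estimated prediction quality q in [0, 1]."""
    # Assumption: higher-quality predictions allow a larger contribution from the student.
    alpha = alpha_max - q * (alpha_max - alpha_min)  # assumed linear modulation
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(alpha).add_(s_param, alpha=1.0 - alpha)
```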

  • Preventing bias of feature fusion (R1) In Eq. 4, the fusion is adaptive based on ADIC: preservation of the original feature is encouraged for well-predicted samples, and the degree of feature fusion is larger only for poorly predicted ones. As a result, it improves robustness for poor samples while being less likely to introduce bias for good samples. Besides, the similarity-based refinement acts as denoising, not knowledge transfer, and it is applied at the intermediate layer of the U-Net, so the skip connections can preserve details, reducing the risk of bias. Its effectiveness was validated in the ablation study.
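
    A minimal sketch of such ADIC-adaptive feature fusion is given below, assuming the bank stores flattened bottleneck features of previously seen high-ADIC samples; the similarity-based weighting and the linear interpolation by q follow the description above, but variable names are illustrative and Eq. 4 in the paper defines the exact fusion.

```python
import torch
import torch.nn.functional as F

def fuse_with_bank(feat, bank, q):
    """feat: (C,) current feature; bank: (N, C) high-ADIC features; q: ADIC of the current sample."""
    sim = F.cosine_similarity(feat.unsqueeze(0), bank, dim=1)  # similarity to each bank entry
    weights = torch.softmax(sim, dim=0)                        # similarity-based weighting
    bank_feat = (weights.unsqueeze(1) * bank).sum(dim=0)       # feature retrieved from the bank
    # Well-predicted samples (large q) mostly keep their own feature;
    # poorly predicted ones (small q) rely more on the bank.
    return q * feat + (1.0 - q) * bank_feat
```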

  • Calibrating ADI (R1&R2) As explained before Eq. 2, ADI tends to overestimate the Dice score due to high agreement in interior regions when using MC dropout. To address this, we introduce a confidence factor b ∈ (0, 1) to calibrate it, which measures the confidence based on the discrepancy at borders, leading to ADIC (Eq. 3), which has a higher correlation with the real Dice, as shown in Fig. 3(a).

  • Comparison and reproducibility (R1&R2) 1) Comparison: all methods shared the same optimizer, backbone and learning rate, but we tuned their respective additional hyperparameters for optimal results. 2) Dataset: on the official M&Ms download page, the 30 cases from the Canadian clinical center cannot be downloaded, leaving only 345 publicly available cases. 3) Dropout number: we used M = 10 dropout inferences following [12] and a dropout rate of 0.5 [17]. 4) Evaluation: each domain was tested independently with the source model. As the images arrive at the volume level, each batch uses slices from the same volume. Note that we will release the code for reproducibility.

  • Baseline (R2&R3) While a stronger baseline may have better performance in the target domain, the inherent complexity and variability of real-world data continue to pose challenges. We consider challenging scenarios where the source model has poor performance on target domains, which has greater scientific merit for TTA.

  • Ablation Study (R3) Table 3 shows that the comprehensive combination of Lmt, Lre, AFFR and SMU leads to the best performance. Note that Lre is based on the proposed feature fusion-based refinement, and our metric ADIC, which occupies most of the method section (pages 2&3), contributes to both Lmt and Lre, enabling the proposed method to outperform others. Therefore, the contribution to performance matches the space in the method section.

  • Other comments from R2 Computation cost: in clinical practice, accuracy is often prioritized as long as the runtime is acceptable. Our networks use the same input without augmentations, making our method more efficient than CoTTA [22] (2.58 s vs. 14.77 s on BraTS-PED), which requires 32 augmented inputs per back-propagation. We did not use the Random Update in CoTTA.

  • Other comments from R3 Not using ADIC as a loss: computing ADIC involves multiple independent forward passes, which disrupt the computation graph and prevent gradient backpropagation. Visualization: thanks for pointing out the issues in Fig. 2; we will modify it.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes a principled and effective solution for segmentation under domain shift, focusing on continual, online test-time adaptation.

    While built on existing techniques, it introduces novel combinations and dynamic mechanisms to address known limitations in pseudo-labeling and teacher updates. Three reviewers ultimately supported acceptance, noting practical value and solid experimental validation.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work presents enough technical contributions and meets the bar of MICCAI.


