Abstract

Test-time adaptation enables a trained model to adjust to a new domain during inference, making it particularly valuable in clinical settings where such on-the-fly adaptation is required. However, existing techniques depend on large target-domain datasets, which are often unavailable in medical scenarios that demand per-patient, real-time inference. Moreover, current methods commonly focus on two-dimensional images, failing to leverage the volumetric richness of medical imaging data. Bridging this gap, we propose a Patch-Based Multi-View Co-Training method for Single Image Test-Time adaptation. Our method enforces feature and prediction consistency through uncertainty-guided self-training, enabling effective volumetric segmentation in the target domain with only a single test-time image. Validated on three publicly available breast magnetic resonance imaging datasets for tumor segmentation, our method achieves performance close to the upper-bound supervised benchmark while outperforming all existing state-of-the-art methods by an average Dice Similarity Coefficient margin of 3.75%. We will publicly share our codebase, readily integrable with the popular nnUNet framework, at https://github.com/smriti-joshi/muvi.git.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2927_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/smriti-joshi/muvi.git

Link to the Dataset(s)

Duke Breast Cancer MRI: https://www.cancerimagingarchive.net/collection/duke-breast-cancer-mri/

TCGA-BRCA: https://www.cancerimagingarchive.net/collection/tcga-brca/

ISPY1: https://www.cancerimagingarchive.net/collection/ispy1/

BibTex

@InProceedings{JosSmr_Single_MICCAI2025,
        author = { Joshi, Smriti and Osuala, Richard and Garrucho, Lidia and Kushibar, Kaisar and Kessler, Dimitri and Diaz, Oliver and Lekadir, Karim},
        title = { { Single Image Test-Time Adaptation via Multi-View Co-Training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {628 -- 638}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a patch-based multi-view co-training method for single-image test-time adaptation (TTA) in medical image segmentation. The approach leverages uncertainty-guided self-training to enhance feature and prediction consistency, enabling effective volumetric segmentation using only a single test-time image. It focuses on exploiting the volumetric richness of medical imaging data, unlike existing 2D-focused TTA methods. Validated on two public breast MRI tumor segmentation datasets, the method achieves performance close to supervised benchmarks, with an average Dice similarity coefficient improvement of 3.75% over state-of-the-art methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper has the following strengths:

    1. Extensive Experimental Evaluation: The authors conduct experiments across three breast cancer-related datasets, providing a robust assessment of the proposed method’s performance, which is a valuable effort in validating its effectiveness.
    2. Focus on an Emerging Area: The paper addresses test-time adaptation for medical image segmentation, an underexplored domain, contributing to a niche but clinically relevant field where real-time adaptation is critical.
    3. Code Accessibility: The commitment to publicly share an nnUNet-compatible codebase enhances reproducibility and practical applicability, which is a significant asset for the research community.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Poor Readability and Lack of Clarity: The introduction, particularly the second paragraph, lists existing test-time adaptation methods in a descriptive, almost chronological manner without summarizing their common limitations, resembling a literature dump. The actual challenges—large data requirements and lack of 3D data usage—are only clarified in the third paragraph. Moreover, the cited issues are not unique to TTA. Single-image TTA methods, such as Khurana et al. (2021) and Dong et al. (2024), already address small-data scenarios, undermining the claimed challenge of large data needs. Similarly, the lack of 3D data usage in prior work reflects design choices rather than a fundamental limitation, weakening the justification for 3D modeling as a contribution.
    2. Limited Novelty: The proposed multi-view co-training is presented as a key innovation, but this concept has been explored previously, notably in Xia et al. (2020), which introduced uncertainty-aware multi-view co-training for 3D semi-supervised learning. The use of cross-entropy to refine predictions during testing is also not novel, as it is common in semi-supervised learning and has been applied in TTA, such as in Dong et al. (2024). Additionally, the emphasis on 3D modeling does not constitute a methodological innovation, as it is primarily an extension of existing frameworks to volumetric data rather than a new approach.
    3. Narrow Evaluation Scope: The experiments are limited to breast cancer MRI datasets, which restricts the assessment of the method’s generalizability. To convincingly validate a general medical segmentation algorithm, the evaluation should include diverse organ structures and imaging modalities, such as CT or ultrasound, to demonstrate robustness across different medical contexts.

    References:

    1. Khurana, A., Paul, S., Rai, P., Biswas, S., & Aggarwal, G. (2021). SITA: Single image test-time adaptation. arXiv preprint arXiv:2112.02355.
    2. Dong, H., Konz, N., Gu, H., & Mazurowski, M. A. (2024). Medical image segmentation with InTEnt: Integrated entropy weighting for single image test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5046–5055).
    3. Xia, Y., Liu, F., Yang, D., Cai, J., Yu, L., Zhu, Z., Xu, D., Yuille, A., & Roth, H. (2020). 3D semi-supervised learning with uncertainty-aware multi-view co-training. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3646–3655).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend rejecting the paper due to significant shortcomings that outweigh its strengths. The primary factors include the lack of methodological novelty, as the multi-view co-training and cross-entropy refinement build on established techniques (e.g., Xia et al., 2020; Dong et al., 2024) without introducing substantial innovation. The paper’s poor readability, particularly in the introduction, hinders comprehension and fails to clearly articulate the unique challenges addressed, with overstated issues like large data needs and 3D data usage that do not hold up against prior single-image TTA work (e.g., Khurana et al., 2021). Additionally, the evaluation’s focus on breast cancer datasets limits claims of generalizability, missing opportunities to test across diverse medical scenarios. While the extensive experiments and commitment to code sharing are positive, these do not compensate for the lack of originality, unclear writing, and narrow evaluation, leading to my rejection recommendation.

    1. Khurana, A., Paul, S., Rai, P., Biswas, S., & Aggarwal, G. (2021). SITA: Single image test-time adaptation. arXiv preprint arXiv:2112.02355.
    2. Dong, H., Konz, N., Gu, H., & Mazurowski, M. A. (2024). Medical image segmentation with intent: Integrated entropy weighting for single image test-time adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 5046–5055).
    3. Xia, Y., Liu, F., Yang, D., Cai, J., Yu, L., Zhu, Z., Xu, D., Yuille, A., & Roth, H. (2020). 3D semi-supervised learning with uncertainty-aware multi-view co-training. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3646–3655).
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors claim that the breast cancer dataset is a challenging scenario, but this does not address the concern that the experiments conducted in the article are overly one-sided, leaving the algorithm and conclusions unconvincing. Meanwhile, the authors' other responses did not clarify the differences from other similar methods. In conclusion, I was not convinced by the authors' rebuttal.



Review #2

  • Please describe the contribution of the paper

    This paper introduces MuVi, a novel source-free test-time adaptation method that operates on a single 3D medical image. The method combines entropy-guided pseudolabeling and patch-based multi-view co-training to adapt segmentation models without access to source data or target labels. It is particularly well-suited for real-world clinical settings where per-patient inference and volumetric consistency are essential.
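    The entropy-guided pseudo-labeling summarized above can be illustrated with a generic sketch (this is not the authors' implementation; the function name and the exp(-entropy) weighting scheme are assumptions): each view's class-probability map is weighted by how confident, i.e. low-entropy, it is before the views are averaged into a fused pseudo-label.

    ```python
    import numpy as np

    def entropy_weighted_fusion(prob_maps, eps=1e-8):
        """Fuse per-view class-probability maps into one pseudo-label map.

        prob_maps: array-like of shape (V, C, *spatial) -- V views, C classes.
        Each view is weighted by exp(-mean predictive entropy), so confident
        (low-entropy) views contribute more to the fused prediction.
        """
        prob_maps = np.asarray(prob_maps, dtype=np.float64)
        # Voxel-wise predictive entropy per view: H = -sum_c p_c log p_c
        voxel_entropy = -(prob_maps * np.log(prob_maps + eps)).sum(axis=1)
        # Average entropy over all spatial locations -> one scalar per view
        mean_entropy = voxel_entropy.reshape(voxel_entropy.shape[0], -1).mean(axis=1)
        # Normalized per-view weights: lower entropy -> larger weight
        weights = np.exp(-mean_entropy)
        weights /= weights.sum()
        # Weighted average over the view axis -> fused map of shape (C, *spatial)
        return np.tensordot(weights, prob_maps, axes=(0, 0))

    # Toy example: one confident view and one maximally uncertain view of a 2x2 slice.
    confident = np.stack([np.full((2, 2), 0.9), np.full((2, 2), 0.1)])  # (C=2, 2, 2)
    uncertain = np.stack([np.full((2, 2), 0.5), np.full((2, 2), 0.5)])
    fused = entropy_weighted_fusion([confident, uncertain])
    print(fused[0, 0, 0])  # foreground probability pulled toward the confident view
    ```

    In a 3D setting the same fusion applies unchanged, with `spatial = (D, H, W)` and one probability map per transformed patch view.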

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Realistic problem formulation: The focus on single-image, source-free test-time adaptation directly addresses practical constraints in medical imaging deployment.

    Strong methodological design: The combination of multi-view patch co-training, entropy-based pseudolabel fusion, and feature-space consistency offers a robust and elegant solution to test-time adaptation without requiring batch statistics.

    Thorough evaluation: The method is validated on two challenging multi-center breast MRI datasets, with strong comparisons against both normalization-based and self-training baselines. Ablation studies clearly demonstrate the contribution of each component.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Unusually high standard deviations in results: The reported standard deviations in the evaluation tables are unexpectedly large, raising concerns about the stability and reliability of the method. It would be helpful to clarify whether this variance is due to data heterogeneity, small test sets, or instability in the adaptation process.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a strong and well-motivated solution—through realistic problem formulation, robust methodological design, and thorough evaluation—I find that the unusually high standard deviations in the results make it difficult to confidently assess the effectiveness of the method. If the authors can address this concern regarding statistical reliability, I would be happy to recommend acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The only concern regarding std has been addressed. I moved my decision to accept.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel approach for self-supervised test-time adaptation of medical images under domain shift. The method specifically addresses the more challenging scenario of single-image adaptation, where no batch accumulation is available at test time. This is achieved through a source-free strategy that processes one image at a time, leveraging multi-view co-training to generate pseudo-labels from a pre-trained network. Consistency is enforced during adaptation in both image and feature spaces through the use of transformed views.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is clear, well-written, and well-organized.
    • The proposed model addresses an extremely challenging scenario by relaxing the common assumption of having large batches of test data available for simultaneous processing, an aspect particularly relevant in the medical imaging domain.
    • The evaluation is thorough, and the results demonstrate the superiority of the proposed approach.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The evaluation explores only the A→B and A→C adaptation configurations.
    • Ablation studies could be more insightful
    • Ablation was performed only for one of the two datasets
    • It would be interesting to see the change in performance with different test-batch sizes

    More details in the next section.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Given access to three different datasets, it would be ideal to evaluate all possible adaptation configurations (A→B, A→C, B→A, B→C, C→A, C→B). This would help demonstrate that the proposed method remains robust even when the source domain is less varied. Is there a specific reason why the ablation studies were conducted only for a single target domain? Similarly, it is unclear why the experiments involving Instance Normalization were performed only on the second dataset. Clarifying these choices would strengthen the evaluation.

    Additionally, Tables 2 and 3 report average values and possibly standard deviations, but it is never specified what these statistics refer to. Do they represent multiple adaptation runs on the same pre-trained network? This information should be explicitly stated to ensure the reproducibility of results.

    Lastly, as noted by the authors, some baseline methods require batches larger than one to approximate the target distribution. It would be valuable to compare their performance when using multiple images at test time, alongside the proposed method under the same conditions. This comparison could provide further insight into the strengths and limitations of the approach.

    Minor:

    • In abstract, only 2 datasets are mentioned
    • Figure 1 is slightly unclear, i suggest to revisit it, especially a) and b)
    • In formulation, z and v definition is missing
    • Figure 2 (e,e) duplicated entry, also the row 1 d, e and e2 segmentations seem oddly similar, please double check
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well structured and the obtained results are strong. The evaluation is thorough even if ablations could use some more results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After the rebuttal my evaluation remains unchanged. While not all the initial comments were fully addressed, I believe that this work tackles a challenging and relevant problem for the MICCAI community. The proposed method shows promising results and therefore I recommend acceptance of the paper.




Author Feedback

We thank the reviewers for their valuable feedback and for acknowledging the strengths of our work, particularly our focus on a “challenging setup” (R1, R2, R3), “novelty and methodological strength” (R1, R2), “robust evaluation” (R1, R2, R3), and the “clarity” of the paper (R1). We address the reviewers’ concerns below:

High Standard Deviation (std) (R2, R1): To clarify the points raised in R2-Q7 and R1-Q10, the reported mean and std refer to statistics computed on the test set. The std captures its heterogeneity, which includes tumors with varying characteristics such as non-mass enhancements, necrosis, and cysts. These are particularly hard to segment accurately, as they introduce data shift and inter-observer variability in the ground truth. As shown in Tables 2 & 3, we compare to the supervised upper bounds, which also have a high std for this reason. Our method always maintains or reduces std over the baselines, indicating higher stability. We will clarify this in the revised version and, as future work, introduce application-specific modifications to our approach (e.g., explicitly addressing tumor variability).

Evaluation scope (R3): We thank R3 for appreciating our extensive evaluation with 3 breast cancer MRI datasets (R3-Q6). While we agree that evaluation on additional modalities would further validate our method’s generalizability, we demonstrated its usefulness in multiple challenging domain shift scenarios, including tumor and anatomical variability. We appreciate the suggestion and will extend on this in future work.

Single-image test-time adaptation (SITTA) (R3): R3-Q7 rightly notes that SITA (Khurana et al., 2021) and InTent (Dong et al., 2024) address the SITTA setting. To clarify, we do not claim to be the first to address SITTA in medical tasks or otherwise. On the contrary, additional SITTA methods are covered in the introduction. Next, we clarify how our method overcomes key limitations of the mentioned SITTA approaches:

1) Both SITA and InTent estimate batch-normalization (BN) statistics, which are known to be unstable. This is not only supported by prior work but also by our reported findings. Unlike our approach, which is based on test-time training, they rely only on ensembling.

2) InTent is consistently outperformed by our method (see Tables 2 & 3). SITA is developed for image classification on 2D natural images (CIFAR, ImageNet). Referencing it in R3-Q12 to discount 3D TTA work may understate the added complexity of 3D medical segmentation tasks.

Novelty of 3D and Multi-View Setups (R3): R3-Q7 also notes that certain methods opt for 2D processing as a design choice. Our results show that these methods perform poorly when extended to 3D applications. This also points to a broader and more concerning trend: the routine simplification of inherently 3D volumetric data into 2D slices, despite the availability of robust 3D baseline networks, risking the loss of relevant contextual information. We believe this is due to: (1) reliance on unstable BN in single-image setups, and (2) the complexity of implementing TTA in 3D.

Addressing this, our method offers a clear path forward: it integrates seamlessly into the 3D nnUNet segmentation framework, providing practical and scalable test-time adaptation without relying on 2D simplifications. Additionally, to our knowledge, this is the first TTA method to leverage 3D medical data through a patch-based multi-view strategy. We would also like to stress that concepts designed for relaxed settings (e.g., Xia et al., 2020, as noted by R3) need substantial changes to work under stricter constraints like SITTA. We point the reviewer towards the value of tackling this challenging setup.

Further in-depth analysis (R1): We thank R1 for recommending acceptance and especially for the constructive suggestions in Q10, which we consider valuable for future work. We will incorporate the remarks by R1, R2, and R3 in the revised version of the paper.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper tackles an important and challenging test-time adaptation task. Most reviewers recognized that the proposed method is novel. Although, as mentioned by R3, multi-view co-training has been used in semi-supervised segmentation, its application in the context of test-time adaptation and its integration with other components represent a meaningful contribution. Furthermore, the empirical results are strong enough to demonstrate the effectiveness of the method. The authors are encouraged to revise the paper by addressing the reviewers' comments and incorporating the clarifications provided during the rebuttal to further enhance the quality of the final version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This work leverages multi-view co-training for single-test-image adaptation. While this work focuses on an emerging topic, there are some major issues raised by the reviewers: 1) the novelty is limited, as most parts of the method have been used in existing works on semi-supervised segmentation; 2) the writing needs to be improved, and the challenge of single-image TTA and the differences from existing works are not clarified; 3) only one dataset was used for the experiments, which limits generalizability to other tasks.


