Abstract

Continual Test-Time Adaptation (CTTA) adapts a model pretrained on the source domain to sequentially arriving unlabeled target domains. However, existing approaches predominantly assume that the model completes adaptation to all samples within one target domain before transitioning to the next, deviating from realistic clinical scenarios where samples from diverse domains arrive stochastically. Such gradual adaptation strategies suffer from performance drops under rapid domain shifts, which limits their clinical applicability. To address this issue, we propose Mixture of Incremental Experts (MoIE), a lightweight network structure that maps new patterns to established knowledge. Specifically, MoIE incorporates two key innovations: 1) Progressive Expert Expansion (PEE), which dynamically adds experts when existing ones fail to effectively process the current sample, enabling stable and swift adaptation to target domains; 2) Knowledge-Transfer Initialization (KTI), which initializes new experts by combining existing ones through domain-similarity based weights, enabling fast adaptation to unseen domains while preserving learned knowledge to prevent immediate forgetting. Experiments on two CTTA tasks (prostate and fundus segmentation) demonstrate its superiority, achieving SOTA performance with minimal performance gaps across diverse inference sequences. (Code available at https://github.com/dyxu-cuhkcse/MoIE)
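The following minimal sketch (our own illustration, not the authors' released code) shows one way the KTI idea from the abstract could look in PyTorch: a newly added expert is initialized as a domain-similarity-weighted blend of the parameters of existing experts, so adaptation to an unseen domain starts from previously learned knowledge rather than random weights. The function and argument names (`kti_init`, `sims`) are hypothetical, and all experts are assumed to share the same architecture.

```python
import torch
import torch.nn as nn

def kti_init(new_expert: nn.Module, experts: list, sims: torch.Tensor) -> None:
    """Hypothetical sketch of Knowledge-Transfer Initialization (KTI).

    The new expert's parameters are set to a similarity-weighted average of
    the corresponding parameters of the existing experts, so it inherits
    learned knowledge instead of starting from a random initialization.
    """
    weights = torch.softmax(sims, dim=0)  # domain-similarity based mixing weights
    with torch.no_grad():
        for name, param in new_expert.named_parameters():
            blended = sum(
                w * dict(e.named_parameters())[name].detach()
                for w, e in zip(weights, experts)
            )
            param.copy_(blended)
```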

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0652_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/dyxu-cuhkcse/MoIE

Link to the Dataset(s)

N/A

BibTex

@InProceedings{XuDun_SequenceIndependent_MICCAI2025,
        author = { Xu, Dunyuan and Yuan, Yuchen and Zhou, Donghao and Yang, Xikai and Zhang, Jingyang and Li, Jinpeng and Heng, Pheng-Ann},
        title = { { Sequence-Independent Continual Test-Time Adaptation with Mixture of Incremental Experts for Cross-Domain Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {507 -- 517}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work presents a new test-time cross-domain adaptation framework for medical image segmentation, in which the MoIE and KTI methods are adopted to improve segmentation performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As one of the newer settings in medical image segmentation, test-time adaptation not only brings more potential to practical applications but is also a relatively hard task. To enhance feature alignment performance, this work introduces a progressive expert expansion method with knowledge-transfer initialization. Numerous experiments on popular datasets have verified the efficiency of this work.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although the work shows a good cross-domain segmentation performance, the reproducibility is limited due to the complex construction. Besides, the Mixture of Incremental Experts network is also complex. A clear MoE network with a selection function might be better. Finally, most innovations, including incremental learning and knowledge initialization, seem to improve the processing speed. However, the work claims these methods could improve segmentation performance only, without mentioning processing speed.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The reproducibility is limited due to the complex construction.
    2. A clear MoE network with a selection function might be better. Authors should clarify the reason for adopting MoIE in their work.
    3. Most innovations, including incremental learning and knowledge initialization, seem to improve the processing speed. However, the work claims these methods could improve segmentation performance only, without mentioning processing speed.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a novel technique for continual test-time domain adaptation with a focus on stochastic sequences, meaning that the samples in the test sequence are not ordered by domain. The method uses a variant of mixture of experts and applies it to incrementally add experts when a domain shift is detected. Experiments show that the approach reaches state-of-the-art performance in both regular and stochastic sequences.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The setting of stochastic sequences is interesting, more realistic for clinical settings and understudied.
    • The overhead for test-time adaptation seems relatively low.
    • The experiments are well designed and show some good baselines.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • In the method section there are some points that need clarification:
      – The feature ensemble F_l^ens is introduced on page 5 in “weighted feature ensemble” after it is used on page 4 in “expert expansion”. The feature ensemble should be introduced before it is used, to give an easier-to-follow method description.
      – In Eq. 2, why is the “2 -” needed? To simplify, one possibility would be to write that a new expert is created when d_l^ens < alpha * d_l^tar and omit the “2 -”; I might be missing something.
      – Why do the BCE and Dice losses in Eq. 4 take Theta as an input? I assume Theta denotes the weights of the experts; shouldn’t they take y and ŷ?
      – What range does the sum in Eq. 5 cover? Does it run over different samples y, or over layers l? How is the l for gap_l chosen?
      – In the “Adaptation stage” subsection the description goes back to the expert expansion, which is confusing. I would recommend moving the part from “Moreover, to prevent…” into the expert expansion section.

    • In the implementation settings it is mentioned that every expert consists of two fully connected layers. Is the convolutional feature map flattened for that? Especially for MoIE layers at the beginning of the network, this seems to add quite a few parameters. How many parameters are added in total by those experts?

    • For Figure 3(c) a bit more guidance on interpretation is needed. For example, if I understand correctly, the MoIE at block 1 is only extended at the end, but I have no immediate explanation why. Can the authors please provide some insights or theory about which layers are expanded at which point in training?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • The phrase “enabling stable and swift/rapid adaptation” is used five times in the paper; consider rephrasing it in some places.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall I think the paper provides a relevant contribution to MICCAI, therefore I chose a “weak accept”. However, I have a bit of concerns about the structure of the method section that can be resolved in a rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper is the first to identify the Stochastic Sequence (SS) scenario in continual test-time adaptation and introduces the Mixture of Incremental Experts (MoIE) framework, which dynamically measures the similarity between new samples and existing experts to expand the expert pool for efficient and robust adaptation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Identification of a New Scenario: The paper insightfully identifies the Stochastic Sequence (SS) scenario in continual test-time adaptation, differentiating it from the previously assumed Regular Sequence (RS) setting. This is significant because the SS scenario better reflects real-world clinical workflows where samples arrive unpredictably, thus providing the community with a novel research direction.

    2. Novel Mixture of Incremental Experts (MoIE) Framework: The proposed MoIE framework is innovative in addressing the SS scenario. Unlike previous approaches that rely on fixed expert configurations or language-guided gating (as seen in Low-Rank Mixture-of-Experts for Continual Medical Image Segmentation), this work dynamically measures the statistical similarity between new data and existing experts to decide when to incrementally expand the expert pool. This adaptive strategy enhances the model’s ability to robustly handle rapid domain shifts.

    3. Innovative Sub-Modules within MoIE: (1) Progressive Expert Expansion (PEE): The challenge of determining when to add new experts is tackled by employing a statistical domain-similarity metric. The paper illustrates this with clear visual representations (e.g., Fig. 1(b)) that help explain the underlying rationale, making the approach both transparent and innovative. (2) Knowledge-Transfer Initialization (KTI): KTI addresses the critical issue of initializing new experts. By performing a weighted initialization based on the contributions of existing experts, the method not only facilitates rapid adaptation to new data but also mitigates catastrophic forgetting. This double benefit—enhanced adaptability coupled with reduced forgetting—is particularly compelling.

    4. Strong Empirical Evaluation: The experimental results underscore the value of the proposed method. For instance, Table 2 demonstrates performance improvements, while Fig. 3(b) and Fig. 3(c) provide dynamic visual insights into the evolution of experts within the model. Furthermore, the ablation study in Fig. 3(a) validates the effectiveness of each component, offering strong evidence of the methodological contributions.

    5. Clear and Informative Visualizations: The paper features well-designed figures that effectively communicate its ideas. (1) Fig. 1(a) illustrates the difference between the new CTTA scenario (SS) and the traditional RS setting. (2) Fig. 1(b) dispels initial concerns about whether RS alone could boost model generalization by clearly comparing the two settings. (3) Fig. 2(a) presents a coherent and readable framework of the entire method, showing how the various modules interconnect and work together.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    A key concern is whether the comparison with state-of-the-art methods is entirely fair. In this paper, the proposed approach enhances a standard backbone with a Mixture of Incremental Experts (MoIE) structure. However, if the competing methods—such as Tent, CoTTA, ECoTTA, or ViDA—do not employ a similarly enhanced architecture (e.g., an additional MoE structure), then the observed performance gains might simply reflect the benefits of increased model capacity rather than the effectiveness of the dynamic expert expansion itself.

    To address this issue and ensure a fair comparison, the authors should consider performing additional evaluations where they augment a baseline model with a fixed MoE structure (using a fixed number of experts) without the dynamic expansion and knowledge-transfer mechanisms. This would help quantify how much of the performance gain is attributable to the architectural upgrade versus the novel dynamic mechanisms (PEE and KTI). In essence, showing that the dynamic aspects of MoIE—namely, adaptive expert addition and weighted initialization—yield further improvements over a static MoE model would reinforce the claim that the method’s innovations bring genuine advantages beyond a simple architectural increase in parameters.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper introduces a novel continual test-time adaptation scenario and addresses it with the innovative Mixture of Incremental Experts (MoIE) framework—including adaptive expert expansion and effective expert initialization—which, together with rigorous experimental validation and clear presentation, demonstrates significant advantages and clinical feasibility, meeting the publication standards.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their positive and constructive comments. They acknowledged our method as achieving “a good cross-domain segmentation” (R1), offering “significant advantages and clinical feasibility” (R2), and being “a relevant contribution to MICCAI” (R3). Here, we respond to the main comments:

To R1: Q1: reproducibility. A1: We will release our code on GitHub. Q2: reason for adopting MoIE. A2: Traditional MoE models struggle to adapt to sequences with rapid and stochastic domain transitions due to their learnable routing mechanisms: frequent updates to the gating function during inference can cause instability and degrade performance. In contrast, our dynamic MoIE framework addresses this challenge with a non-learnable, domain-similarity-based routing strategy. It adaptively reuses existing experts when applicable and introduces new ones on-the-fly when needed, enabling stable and efficient adaptation to evolving domains. Q3: processing speed. A3: Thank you for recognizing the efficiency of our method. The CTTA setting requires rapid adaptation to diverse domains with minimal updates. By integrating historical knowledge, our method effectively aligns to the source domain within just a few iterations, significantly reducing the frequency of updates needed during adaptation.
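To make the non-learnable routing described in A2 concrete, here is a small sketch (our own assumption, not taken from the paper) of how a domain-similarity-based decision between reusing an existing expert and expanding could be implemented; the per-expert statistics, the threshold alpha, and the function name `route_or_expand` are hypothetical.

```python
import torch

def route_or_expand(feat: torch.Tensor, expert_stats: list, alpha: float = 1.0):
    """Sketch of a non-learnable, domain-similarity-based routing rule.

    feat         : [B, C, H, W] feature map of the current test sample.
    expert_stats : list of (mean, var) channel statistics stored per expert.
    Returns the index of the expert to reuse, or None to signal that a new
    expert should be created for a (presumably unseen) domain.
    """
    mu = feat.mean(dim=(0, 2, 3))   # channel-wise mean of the current sample
    var = feat.var(dim=(0, 2, 3))   # channel-wise variance
    dists = torch.stack([
        torch.norm(mu - e_mu) + torch.norm(var - e_var)
        for e_mu, e_var in expert_stats
    ])
    best = int(torch.argmin(dists))
    if dists[best] < alpha:         # close enough: reuse the best-matching expert
        return best
    return None                     # too far from all experts: trigger expansion
```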

To R2: Q1: fair comparison. A1: The effectiveness of each component in our framework is validated through the ablation study in Fig. 3(a), which shows that our dynamic approach significantly improves both performance and stability compared to a static MoE setup. Specifically, in the experiment without PEE, the number of experts is initialized to the maximum and remains unchanged. This highlights the effectiveness of our dynamic expert expansion mechanism. To ensure a fair comparison, we use the same backbone network for our method and all other approaches.

To R3: Q1: logical ordering of the method section. A1: Thank you for highlighting the structural ordering issues. We will add an explanatory formula for F_l^ens on page 4 and reorganize some content in the expert expansion section. Q2: explanation of some equations. A2: In Eq. (2), the domain similarity consists of two terms based on the mean and variance from Eq. (1). We calculate the distance as (1 - mean term) + (1 - variance term); to enhance clarity, we retain the ‘2 -’ notation in our formula, and your observation is correct. In Eq. (4), the symbol Theta denotes the parameters of the model. Both the BCE and Dice losses are calculated on these parameters, while the actual inputs align with your observation. In Eq. (5), our experimental results show that this loss ranges from 0.8 to 1.2 during the adaptation process. The term gap_l denotes the summation of gaps across all layers; we appreciate your suggestion and will include an additional summation symbol in the camera-ready version. Q3: additional parameters. A3: Before each FC layer, the feature map is flattened from [B, H, W, C] to [B, H*W, C]. This ensures that the additional parameters are applied efficiently, as the FC layers operate on the channel axis, similarly to convolutional layers with 1×1 kernels. In total, only 0.075M parameters are added by these incremental experts during the adaptation process, which is extremely small compared to the 7.77M parameters of the original backbone model. Q4: expert expansion. A4: The number of experts in blocks 2 and 3 increases earlier than in blocks 1 and 4, according to Fig. 3(c). We speculate that the shallow layers of the network primarily extract texture or boundary features that are shared across domains, while the deeper layers focus on processing semantics that have already been refined by the previous layers. The middle layers, however, are responsible for integrating low-level features into high-level representations, making their role the most critical and the hardest to satisfy, which explains why the number of experts grows fastest in these middle layers.
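As a reading aid for A2 and A3 above, the sketch below (our interpretation, with hypothetical names and hidden sizes) works through the ‘2 -’ distance as (1 - mean term) + (1 - variance term), assuming each similarity term lies in [0, 1], and shows a lightweight two-FC expert applied on the channel axis of the flattened [B, H*W, C] feature map, which behaves like a 1×1 convolution and keeps the added parameter count small.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def domain_distance(mu, var, mu_ref, var_ref):
    """Assumed form of the '2 -' distance: each similarity term is in [0, 1],
    so the distance is (1 - mean_sim) + (1 - var_sim) = 2 - mean_sim - var_sim."""
    mean_sim = F.cosine_similarity(mu, mu_ref, dim=0).clamp(0, 1)
    var_sim = F.cosine_similarity(var, var_ref, dim=0).clamp(0, 1)
    return 2.0 - mean_sim - var_sim

class LightweightExpert(nn.Module):
    """Two FC layers on the channel axis of a flattened feature map,
    equivalent to 1x1 convolutions, keeping the added parameters small."""

    def __init__(self, channels: int, hidden: int = 16):
        super().__init__()
        self.fc1 = nn.Linear(channels, hidden)
        self.fc2 = nn.Linear(hidden, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: [B, C, H, W]
        b, c, h, w = x.shape
        tokens = x.permute(0, 2, 3, 1).reshape(b, h * w, c)  # flatten to [B, H*W, C]
        tokens = self.fc2(torch.relu(self.fc1(tokens)))
        return tokens.reshape(b, h, w, c).permute(0, 3, 1, 2)
```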




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


