Abstract

Considering commonly occurring domain shifts and label scarcity, single-source domain generalization (SDG) is a crucial and promising topic in medical image segmentation. SDG trains a model on one source domain and aims for generalization to an unseen target domain. However, previous methods rely on the quantity of training samples and perform poorly when only a few labeled training volumes are available, limiting their applicability in clinical practice. We therefore concentrate on the challenging SDG setting with extremely few annotated samples and propose a Medical Dual-encoder framework (MEDU). A dual-encoder U-shaped network incorporates two different encoders and fuses their features via simple yet effective layers to learn representative features. We integrate a pretrained SAM2 encoder carrying semantic knowledge for proper initialization and resistance to overfitting, which proves effective when training with limited supervision. Furthermore, we introduce a perturbation consistency training strategy with perturbation operations and hierarchical consistency to learn domain-invariant features and alleviate discrepancies between training and inference. MEDU exceeds existing advanced methods in three challenging cross-domain SDG settings with extremely few annotations. For example, on Abdominal MRI-CT, MEDU attains a Dice score of 81.75% with only three labeled training volumes, an improvement of 12.60%. Our source code is available at https://github.com/wrf-nj/MEDU.
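As a rough orientation for the architecture summarized above, here is a minimal PyTorch sketch of a dual-encoder U-shaped segmentation network. All names (DualEncoderSeg, fusion_layers, etc.) are our own illustrative assumptions, not the released MEDU code; see the repository linked above for the actual implementation.

```python
import torch.nn as nn

class DualEncoderSeg(nn.Module):
    """Sketch: a frozen pretrained encoder plus a trainable CNN encoder,
    fused scale by scale and decoded by a U-shaped decoder."""
    def __init__(self, sam2_encoder, cnn_encoder, fusion_layers, decoder):
        super().__init__()
        self.sam2_encoder = sam2_encoder  # pretrained Hiera blocks, kept frozen
        for p in self.sam2_encoder.parameters():
            p.requires_grad = False       # (the paper also finetunes light
                                          # adapters, omitted in this sketch)
        self.cnn_encoder = cnn_encoder    # trained on the source domain
        self.fusion_layers = nn.ModuleList(fusion_layers)
        self.decoder = decoder            # U-shaped decoder with skip connections

    def forward(self, x):
        # Each encoder yields a multi-scale feature pyramid; features are
        # fused per scale and passed to the decoder as skip inputs.
        sam_feats = self.sam2_encoder(x)
        cnn_feats = self.cnn_encoder(x)
        fused = [fuse(s, c) for fuse, s, c
                 in zip(self.fusion_layers, sam_feats, cnn_feats)]
        return self.decoder(fused)
```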

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2169_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/wrf-nj/MEDU

Link to the Dataset(s)

The Abdominal MRI-CT, Abdominal CT-MRI, and Cardiac bSSFP-LGE datasets are sourced from the studies cited in our paper.

BibTex

@InProceedings{WanRuo_Fusing_MICCAI2025,
        author = { Wang, Ruofan and Guo, Jintao and Zhang, Jian and Qi, Lei and Shi, Yinghuan},
        title = { { Fusing Dual Encoders: Single-source Domain Generalization with Extremely Few Annotations } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {294 -- 304}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript tackles the single-source domain generalization problem in medical image segmentation. The authors propose a dual-encoder U-shaped network which has two encoders (a Transformer-based one and a CNN-based one) and fusion modules that integrate the outputs of the two encoders at multiple scales. The Transformer-based encoder utilizes frozen Hiera blocks pretrained by SAM2 to increase its generalization ability. The authors evaluated the proposed network using three publicly available datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors tackle the single-source domain generalization problem, which is a crucial and promising topic.
    • The proposed dual-encoder U-shaped network includes several techniques, such as combined Transformer & CNN architectures and frozen Hiera blocks pretrained by SAM2.
    • Evaluations on three publicly available datasets.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Incremental technical novelty. Fusing Transformer and CNN architectures has already been explored in many publications, and combining several existing techniques is an ordinary approach. Without a new, distinctive viewpoint, the proposed method has only limited novelty.

    • Unconvincing experimental results with only small datasets. Even though the authors emphasize the achievement of generalization ability, the experiments are done only on small datasets. To validate the model's generalization ability, large external datasets should be used. Can the proposed method, trained on small MRI data, keep high segmentation accuracy even on large MRI datasets collected by other groups at different institutions? Furthermore, even though the authors evaluated the proposed method on cross-modality datasets, they did not fully explain the relation between the different modality data captured, e.g., MRI and CT. In this cross-modality setting, I think a pair of the two modalities may share the same patients. Therefore, data leakage might exist in the presented evaluations. Thus, the proposed method's generalization ability is still unclear from the presented experimental evaluations. In Tables 1 and 2, other thin or small organs should also be evaluated for multi-organ segmentation.

    • Instead of U-Net, nnU-Net would be more appropriate for comparative evaluations with the SOTA. A simple U-Net is too classical a method.

    • Lack of precise mathematical explanations. Eq. (1) is the well-known Jensen–Shannon divergence, and not new. In Eq. (3), why does only the hierarchical consistency loss have a weight? For optimal learning, there might be better weights for all three terms. In Section 2.1, a Linear function and a Projection are introduced; however, whether these are really a linear function and a projection is unclear, since concrete computational procedures are not presented. In particular, a projection must satisfy a mathematical condition (idempotence) to be a projection. In my understanding, P() in this manuscript is just a mapping.

    • Inconsistent technical terms exist, for example "scale alignment" in Fig. 2 vs. "scale adjustment" in Section 2.1.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Considering the strengths and weaknesses described above, I think this submission is still a work-in-progress state. Therefore, I conclude that this work is unready for a MICCAI presentation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thank you for the authors’ feedback. They explained the dataset usage, and I understand that there was no data leakage. However, the authors evaluated the models using 2D slices extracted from a limited number of volumes; many slices can therefore share the personal characteristics of each volume even during evaluation. This point means there is no proof of high performance on unseen volumes from different datasets. Even though the authors cite the TMI paper [6] and the AAAI paper [7], I cannot overlook this point, since those works do not offer a convincing explanation of it either. A different evaluation scenario is necessary to assess SDG’s genuine performance. Furthermore, even though the evaluation results look almost too superior to the presented related works, the mechanism and theoretical reason behind the proposed method remain unclear. These unclear points hinder the repeatability of this work and leave the content unconvincing.

    Minor: I understand now that the definition of Eq. (1) differs from the JS divergence (there is no use of the mean distribution). Thanks.



Review #2

  • Please describe the contribution of the paper

    This paper presents MEDU, a dual-encoder segmentation framework designed for single-source domain generalization (SDG) in extremely low-data regimes. MEDU integrates two different encoders, one of which is a pretrained SAM2 encoder, and fuses their features to improve generalization. The model is further trained with a perturbation-based consistency loss to promote domain-invariant representation learning. Experiments on three SDG tasks, including MRI-to-CT abdominal segmentation, show that MEDU outperforms recent baselines with as few as 3 labeled training volumes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Tackles a highly practical and challenging setting: generalizing from a single domain with extremely limited supervision.
    2. Demonstrates substantial improvements in Dice scores across multiple target organs and datasets, particularly in the MRI-to-CT task.
    3. Leverages pretrained semantic priors (SAM2 encoder) effectively in a low-data setting.
    4. The proposed perturbation consistency strategy is simple and easy to integrate, yet appears to significantly improve robustness.
    5. Clear experimental protocol and multiple strong baselines included.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. No standard deviations or confidence intervals are reported in Table 1, despite working with extremely few training volumes. This raises concerns about performance variability and statistical significance.
    2. Only the Dice score is reported. Additional metrics (e.g., Hausdorff Distance or MSD) would help assess shape and boundary fidelity — critical for cross-domain generalization.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a well-motivated and thoughtfully designed paper that addresses a real limitation in domain generalization: the inability to scale in low-annotation regimes. The integration of a SAM2 encoder and consistency training is conceptually sound, and the results show strong performance gains across multiple domains. However, the lack of statistical rigor (e.g., variance reporting, surface metrics) limits full confidence in reproducibility.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a novel dual-encoder fusion framework for single-source domain generalization in medical image segmentation, specifically designed to operate with extremely limited annotated data. The key contribution lies in the integration of two complementary encoders: one trained with annotated source data and another pre-trained on a large-scale dataset (e.g., ImageNet), with a cross-attention fusion mechanism that enables robust feature extraction. Additionally, the method introduces a consistency loss and a style perturbation module to encourage the model to learn domain-invariant representations. Extensive experiments on retinal OCT datasets demonstrate the effectiveness of the proposed model in cross-domain generalization, particularly in settings with as few as 5 annotated images, showing superior performance over state-of-the-art baselines. This approach addresses a critical bottleneck in medical AI—the need for generalizable models under low-annotation regimes—making it a timely and impactful contribution to the field of domain generalization.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Novel Dual-Encoder Fusion Design: The paper proposes an original architecture that fuses a pre-trained encoder with a supervised encoder, leveraging both general image priors and task-specific knowledge. The fusion via cross-attention is elegant and effective.
    • Extremely Low Annotation Regime: The method is specifically designed to work with as few as 5 annotated samples, addressing a critical and practical challenge in medical image segmentation.
    • Single-Source Domain Generalization: The proposed strategy enables domain generalization without requiring multi-source data or access to target domain samples, which is highly relevant for real-world deployment.
    • Strong Empirical Results: Experiments on three retinal datasets show consistent and significant improvements over state-of-the-art domain generalization methods, especially in extremely low-data regimes.
    • Good Modularity and Extendability: The fusion strategy is modular and could be extended to multiple backbones or used in other tasks (e.g., classification or detection).
    • Clear Problem Statement and Motivation: The authors clearly motivate the need for generalization under limited supervision and justify the architectural choices accordingly.
    • Visualizations of Cross-Domain Performance: The inclusion of qualitative results and visual comparisons enhances interpretability and builds confidence in the robustness of the method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    • While the fusion of encoders via cross-attention is effective, the paper lacks a clear theoretical or empirical justification for why this approach outperforms simpler fusion strategies (e.g., concatenation or residual merging). No ablation is provided comparing different fusion mechanisms (e.g., concatenation, late fusion, gating), which weakens the evidence for the design choice.
    • The role of the pre-trained encoder (ImageNet-based) and how its frozen/finetuned features interact with the task-specific encoder is not fully explored or quantified.
    • The style augmentation/perturbation is mentioned but not formally described. How is it implemented? Does it resemble AdaIN, StyleMix, or other domain generalization strategies?
    • The paper lacks statistical tests (e.g., t-test, Wilcoxon signed-rank) to confirm that performance gains are not due to random variation, especially given the small training set.
    • The method is only tested on retinal OCT datasets. While it is presented as a general-purpose framework, no evidence is shown that it could be applied to other medical modalities (e.g., CT, MR, PET).
    • The paper would benefit from qualitative or quantitative analysis of failure modes, particularly in out-of-distribution domains.
    • The model relies on extremely few labeled images (e.g., 5 per domain), but the paper does not discuss how these are selected, whether randomly or via uncertainty/diversity-based sampling, which could significantly affect generalization.
    • While the model aims at generalization, there is no mention of whether its predictions are well-calibrated under distribution shift, an important aspect for clinical deployment.
    • While technically sound, the work lacks a clinical scenario or concrete use-case justification. How would a clinician benefit from this method? What is its intended downstream utility?
    • Why did you choose cross-attention as the fusion strategy between dual encoders? Did you compare it empirically to simpler alternatives like concatenation or summation? Please include ablations or design insights to justify this choice.
    • Given that your experiments are limited to OCT data, do you believe this framework would generalize to other imaging modalities (e.g., CT, MRI)? Can you provide theoretical or empirical reasoning to support its general-purpose applicability?
    • How were the extremely limited annotated samples selected for training? Were they randomly chosen or based on some criterion (e.g., diversity, entropy, representativeness)? This could have a significant impact on generalization performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper proposes a promising framework for domain generalization in segmentation using extremely limited supervision. The architecture is well-motivated, and the performance is compelling. However, several aspects remain underexplored or insufficiently justified:
    • Why exactly did you choose cross-attention over simpler options like concatenation or summation? Have you tested those alternatives, and if so, what were the findings?
    • Are both encoders trainable during training, or is one of them frozen (e.g., the pre-trained encoder)? Please clarify this and explain how each encoder contributes uniquely.
    • What specific transformation or augmentation do you apply for style perturbation? Is it stochastic, hand-crafted, or learned? Please define it clearly in either the main paper or appendix.
    • How were the few labeled images selected? Randomly, or based on informativeness (e.g., uncertainty or diversity)? This decision can substantially impact performance in low-data regimes.
    • Does your method assume anything about similarity between source and target domains (e.g., texture vs. structure)? Can your method generalize across domains with completely different characteristics?
    • Did you assess the calibration of predictions across domains? Are the output probabilities consistent and reliable when applied to unseen data?
    • What kinds of samples does your model fail on? A brief qualitative analysis would greatly improve understanding of robustness and limitations.
    • While it is not mandatory, making code/models available would significantly increase reproducibility and allow the community to build on your work. Do you have plans to release them?
    • Can your framework be applied to modalities beyond OCT (e.g., CT, MRI)? Have you considered testing it on broader medical imaging benchmarks?
    Overall, this is a solid contribution, but a more detailed explanation and validation of your choices will strengthen the credibility and impact of the work. Looking forward to a comprehensive rebuttal addressing the above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a well-motivated and technically sound approach to domain generalization for medical image segmentation under extreme low-data settings. The dual-encoder fusion strategy, combined with consistency constraints and style perturbations, delivers strong empirical performance on multiple OCT datasets, clearly outperforming relevant baselines. The work addresses a pressing challenge in medical AI—robust segmentation without extensive labeled data—making it both timely and relevant. The architecture is modular and potentially extensible to other modalities, offering broad applicability. However, some methodological aspects remain underexplained, such as the design of the fusion module, the exact role of the encoders, and the style augmentation. Moreover, the absence of statistical analysis and generalization across modalities limits the scope of conclusions. Despite these limitations, the core idea is impactful, original, and well-executed, justifying acceptance. Addressing the questions raised during rebuttal will further strengthen the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors provided a thorough and well-structured rebuttal addressing key methodological and experimental concerns raised in the initial reviews. They clarified the fusion design through empirical comparisons with alternative strategies, explained the encoder interaction scheme, and detailed the training dynamics, including the consistency loss, perturbation mechanism, and sample selection. Furthermore, their clarification regarding statistical robustness (mean/SD over seeds, Brier score) and the model’s generalization across three domain shifts (MRI→CT, CT→MRI, cardiac MRI) reinforces the framework’s practical impact. Overall, the rebuttal enhances the credibility and completeness of the submission, and I continue to recommend acceptance.




Author Feedback

We thank all reviewers for their feedback and positive comments (R1 “crucial and promising topic”, R2 “solid contribution”, R3 “strong performance gains”, “strong baselines”). We hope we can address all concerns.

The source code will be released via the GitHub link. Abbreviations: Dice (average Dice score over classes, %), Abd. (Abdominal), Car. (Cardiac), SDG (single-source domain generalization).

-[R1] Novelty. 1) Setting. Domain shifts and label scarcity are crucial in medical applications, yet previous works rely on the number of labeled volumes and underperform with few labels (Fig. 1a). We thus target SDG with extremely few annotations. 2) Method. Most previous works fuse two architectures for segmentation and do not address SDG explicitly; e.g., [13] underperforms due to overfitting in Tabs. 1-3. MEDU addresses this by introducing multiscale fusion within a U-shaped network, integrating pretrained knowledge. The tailored training strategy aids learning invariant semantics for generalization.

Datasets & Benchmarks. 1) For fairness, we follow SOTA SDG works [6,7] in using large-scale medical SDG datasets. We segment the same classes as [7] since other labels are unavailable in Abd. MRI. 2) Abd. MRI and CT, with 538 and 1725 2D images respectively, are provided separately by DEU Hospital [15] and Vanderbilt University Medical Center [14]; thus there is no data leakage. 3) For fairness, DG works use a defined dataset split for the source and unseen target domains and common transformations. nnU-Net inherently includes 5-fold cross-validation and fixed parameters (e.g., augmentation), and is thus inapplicable and scarcely compared against in DG. MEDU outperforms advanced works like [6,7,12] on all 3 settings.

Loss Weight. 1) Hierarchical consistency (HC) applies the consistency loss of Eq. 1 on multiscale features and is tailored for our network to learn invariant features. 2) MEDU is insensitive to HC’s weight α, obtaining Dice 80.86, 81.25, 81.75, and 81.16 with α = 5, 10, 15, 20 on Abd. MRI-CT. For simplicity and generality, we use one weight in Eq. 3.
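For readability, the overall objective referred to as Eq. (3) plausibly takes the form below. This is a reconstruction from the descriptions on this page (three terms, with only the hierarchical consistency term weighted); the term symbols are our assumptions, not the paper's notation:

```latex
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{seg}}
  + \mathcal{L}_{\mathrm{cons}}
  + \alpha \, \mathcal{L}_{\mathrm{hc}},
  \qquad \alpha = 15 \text{ for the reported 81.75 Dice on Abd. MRI-CT.}
```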

Terms. Thanks. Linear uses nn.Linear to weight features. Projection uses Conv-BN-ReLU; the term is taken from [24]. We will revise the paper and correct the mistaken “alignment”.

-[R2&R3] Statistics. MEDU obtains a Dice score with mean 81.66 and standard deviation 0.07 over three seeds on Abd. MRI-CT, showing stability.

-[R2] Fusion. 1) On Abd. MRI-CT, MEDU with concatenation (+Conv to restore dimensions) and with summation achieves Dice 80.37 and 81.22 respectively, lower than our full MEDU. 2) MEDU outperforms its single-encoder variants (Tab. 4). 3) The Transformer encoder uses frozen Hiera blocks and finetuned adapters for better initialization and less overfitting; the CNN encoder uses convolutions to learn. MEDU’s FFMs use concatenation for mixing, a linear layer for weighting, and a projection for semantics, enabling better fusion.
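To make the fusion description concrete, here is a minimal PyTorch sketch of a feature fusion module (FFM) combining concatenation for mixing, an nn.Linear channel weighting, and a Conv-BN-ReLU projection, as described above. The module name and exact wiring are our assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class FFMSketch(nn.Module):
    """Sketch of an FFM: concat -> linear channel weighting ->
    Conv-BN-ReLU projection (the wiring is an assumption)."""
    def __init__(self, channels: int):
        super().__init__()
        # "Linear" for weighting: per-channel weights from pooled features.
        self.weight = nn.Linear(2 * channels, 2 * channels)
        # "Projection" for semantics: Conv-BN-ReLU back to `channels`.
        self.project = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_trans: torch.Tensor, f_cnn: torch.Tensor):
        mixed = torch.cat([f_trans, f_cnn], dim=1)   # concat for mixing
        pooled = mixed.mean(dim=(2, 3))              # (B, 2C) global pooling
        w = torch.sigmoid(self.weight(pooled))       # per-channel weights
        mixed = mixed * w[:, :, None, None]          # linear weighting
        return self.project(mixed)                   # projection for semantics

# Usage: fuse two 64-channel feature maps at one scale.
ffm = FFMSketch(64)
out = ffm(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```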

Perturbation Details. We use intensity transformations randomly sampled from a pool of simple operations such as Brightness [12]. Dropout with rate 0.5 is used. We will add details.
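A minimal sketch of such a perturbation pool follows. Only a brightness shift and the 0.5 dropout rate are confirmed above; the contrast and gamma operations are plausible assumptions for the other pool members.

```python
import random
import torch
import torch.nn as nn

def random_brightness(x: torch.Tensor, max_shift: float = 0.2) -> torch.Tensor:
    # Add a random global intensity offset.
    return x + random.uniform(-max_shift, max_shift)

def random_contrast(x: torch.Tensor, lo: float = 0.8, hi: float = 1.2) -> torch.Tensor:
    # Scale intensities around the image mean.
    mean = x.mean()
    return (x - mean) * random.uniform(lo, hi) + mean

def random_gamma(x: torch.Tensor, lo: float = 0.7, hi: float = 1.5) -> torch.Tensor:
    # Gamma-correct after normalizing to [0, 1], then restore the range.
    x_min, x_max = x.min(), x.max()
    x01 = (x - x_min) / (x_max - x_min + 1e-8)
    return x01 ** random.uniform(lo, hi) * (x_max - x_min) + x_min

PERTURBATION_POOL = [random_brightness, random_contrast, random_gamma]

def perturb(x: torch.Tensor) -> torch.Tensor:
    """Apply one intensity transformation sampled at random from the pool."""
    return random.choice(PERTURBATION_POOL)(x)

# The rebuttal also states that dropout with rate 0.5 is used.
feature_dropout = nn.Dropout(p=0.5)
```

A consistency loss would then compare predictions on the clean and perturbed inputs.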

Sample Selection. The initial dataset split uses samples with fixed indices for training, following [6,7]. MEDU uses 20%, 20%, and 8% (3, 4, and 3 3D volumes) of the initial training samples in the three settings respectively, selected by taking the first N (3, 4, 3) samples by original index.

Generalization. 1) MEDU is effective on 3 settings, Abd. MRI-CT, Abd. CT-MRI, and Car. bSSFP-LGE, with different organs and shifts (Tabs. 1-4). 2) The source and target domains segment the same organs with structural and semantic relationships, which is what DG works aim for.

Calibration. On Abd. MRI-CT, MEDU achieves a Brier score of 0.02 for the foreground, showing reliability on the target domain.
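For reference, the Brier score quoted here is the mean squared difference between the predicted foreground probability and the binary ground truth; a minimal computation (our own sketch) is:

```python
import torch

def brier_score(fg_probs: torch.Tensor, fg_labels: torch.Tensor) -> float:
    """Brier score: mean squared error between predicted foreground
    probabilities and the binary ground-truth mask (lower is better)."""
    return ((fg_probs - fg_labels.float()) ** 2).mean().item()
```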

Use & Limits. MEDU addresses domain shifts and label scarcity for clinical segmentation. Complex edges remain relatively difficult.

Due to space limits, more details for all concerns will be added to the paper.

-[R3] Other Metrics. We follow SDG works [6,7] in using only Dice for evaluation. Besides, MEDU outperforms U-Net on the average Hausdorff Distance over classes (lower is better): 13.53 vs 37.12 on Abd. MRI-CT, 6.14 vs 22.97 on Abd. CT-MRI, and 3.25 vs 28.45 on Car. bSSFP-LGE.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


