Abstract

Large foundation models, known for their strong zero-shot generalization capabilities, can be applied to a wide range of downstream tasks. However, developing foundation models for medical image segmentation poses a significant challenge due to the domain gap between natural and medical images. While fine-tuning techniques based on the Segment Anything Model (SAM) have been explored, they primarily focus on scaling up data or refining inference strategies without incorporating domain-specific architectural designs, limiting their zero-shot performance. To optimize segmentation performance under standard inference settings and provide a strong baseline for future research, we introduce SyncSAM, which employs a synchronized dual-branch encoder that integrates convolution and Transformer features in a synchronized manner to enhance medical image encoding, and a multi-scale dual-branch decoder to preserve image details. SyncSAM is trained on two of the largest medical image segmentation datasets, SA-Med2D-20M and IMed-361M, resulting in a series of pre-trained models for universal medical image segmentation. Experimental results demonstrate that SyncSAM not only achieves state-of-the-art performance on test sets but also exhibits strong zero-shot capabilities on unseen datasets. Code and checkpoints are available at \url{https://github.com/Hhankyangg/SyncSAM}.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1708_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Hhankyangg/SyncSAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YanSih_Improved_MICCAI2025,
        author = { Yang, Sihan and Feng, Jiadong and Mi, Xuande and Bi, Haixia and Zhang, Hai and Sun, Jian},
        title = { { Improved Baselines with Synchronized Encoding for Universal Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {258 -- 268}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a new variant of the Segment Anything Model (SAM) called SyncSAM, which incorporates a synchronised dual-branch encoder combining convolutional and transformer features to enhance segmentation performance. The method is evaluated on large-scale medical segmentation datasets and demonstrates strong performance, particularly in zero-shot scenarios. The authors will also release a series of pre-trained models for universal medical image segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Architectural novelty - The dual-branch synchronisation between convolutional and transformer features is a well-motivated idea that could improve feature diversity and model robustness.

    Strong empirical performance - The method shows good results in both standard and zero-shot evaluations.

    Ablation study - The ablation experiments are thorough and clearly show the incremental contribution of key components.

    Potential impact - The release of pre-trained models for the community is valuable for future research and deployment in medical segmentation tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of clarity in the motivation - The introduction claims a gap in foundation models with expert-designed modules, but does not sufficiently differentiate SyncSAM from existing models like MedSAM, MedSAM2, or MED-SA. A clear articulation of the novelty and differences is missing.

    2. Ambiguous phrasing - Statements like “these modifications may limit broader applicability to downstream tasks” (referring to interactive methods or support-set-based inference) are vague and lack concrete examples or justification.

    3. Missing mathematical detail - The method section lacks a formal mathematical description of how the input propagates through the dual branches and into the decoder. Including this would improve clarity and reproducibility.

    4. Limited evaluation metrics - Only Dice scores are reported. Justification is needed for not including standard metrics such as IoU and Hausdorff Distance, especially in a medical context where boundary accuracy matters.

    5. Incomplete comparison justification - The selection of baselines (e.g., MedSAM, SAM-Med2D, FT-SAM and the ones for zero-shot performance) lacks justification. Are these the strongest or most appropriate comparisons?

    6. No analysis of model complexity - It is unclear whether the performance improvements stem from architectural innovations or simply from increased parameter count via the CNN branch. A comparison of parameters, FLOPs, and inference time is necessary to support claims of efficiency or architectural effectiveness.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the method is promising and the results are strong, the paper currently lacks sufficient clarity and justification in its motivation, comparison strategy, and performance analysis. The architectural innovation could be impactful, but without clear differentiation from existing work, deeper explanation of how the architecture operates, and analysis of computational complexity, the contribution is underspecified.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed all major concerns raised in the initial review. They clarified the novelty of SyncSAM relative to prior SAM-based methods, committed to improved mathematical exposition, and provided new insights into model complexity and runtime. Although the original submission lacked certain technical details, the rebuttal demonstrates that these are revision-level issues. Hence my final decision is accept.



Review #2

  • Please describe the contribution of the paper

    In the paper, the authors present a model architecture for promptable zero-shot segmentation of medical images. The method is based on SAM, and expands the SAM architecture by adding a second, parallel path to both the encoder and decoder part of the model. The second path utilizes convolutional-style building blocks to better encode local features, arguing that global relationships are captured and represented better than local ones by the ViT-style encoder/decoder pair of SAM. In combination with a synchronized (layer-wise) fusion mechanism and redesigned MED token, this architecture is capable of improving zero-shot performance on a number of datasets.
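    The core idea summarized above (fusing the CNN branch into the ViT branch after every stage, rather than merging once at the end) can be illustrated with a toy sketch. Everything here is an illustrative assumption, not the paper's actual modules: the stage functions are stand-ins, and element-wise addition stands in for the fusion mechanism.

    ```python
    # Toy sketch of synchronized (stage-wise) fusion vs. a single-step merge.
    # The stage functions and the additive fusion rule are illustrative
    # assumptions; SyncSAM's real convolutional/Transformer blocks differ.

    def cnn_stage(x, k):
        # stand-in for a convolutional stage: a purely local transform
        return [v * k for v in x]

    def vit_stage(x, b):
        # stand-in for a Transformer stage: mixes in a global statistic
        m = sum(x) / len(x)
        return [v + b * m for v in x]

    def synchronized_encoder(x, stages=3):
        cnn, vit = list(x), list(x)
        for _ in range(stages):
            cnn = cnn_stage(cnn, 2)
            vit = vit_stage(vit, 0.5)
            # synchronized fusion: inject the CNN (local) features into
            # the ViT branch after EVERY stage
            vit = [a + b for a, b in zip(vit, cnn)]
        return vit

    def late_fusion_encoder(x, stages=3):
        cnn, vit = list(x), list(x)
        for _ in range(stages):
            cnn = cnn_stage(cnn, 2)
            vit = vit_stage(vit, 0.5)
        # single-step merge only at the end
        return [a + b for a, b in zip(vit, cnn)]

    feats_sync = synchronized_encoder([1.0, 2.0, 3.0])
    feats_late = late_fusion_encoder([1.0, 2.0, 3.0])
    ```

    Even in this toy setting the two schemes yield different features, since stage-wise injection lets local information shape every subsequent global transform rather than being appended once at the end.
    
    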

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is the addition of the second branch in the encoder and decoder of SAM, as evidenced by the ablation study performed in the paper. While the idea of combining convolution operations and transformers is not novel, the specific architecture implementation is, and the evaluation shows performance improvement over comparable models.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    It is difficult to disentangle the influence of the novel method on the performance from the influence of the datasets that were used, which are typically larger than the competition against which the paper compares itself. The ablation study shows that for FT-SAM and the SyncSAM SAMed version there is still a decently sized performance gap, so the presented method likely does lead to a performance gain, just perhaps not as big as the numbers reported in the paper.

    During training, how much compute overhead does the additional branch cause, if any (since the SAM ViT is frozen)?

    “A multi-scale fusion branch is incorporated into the mask decoder, as early-stage features preserve more fine-grained edge details.” -Please cite a source here, as the statement intuitively makes sense, but is probably not trivial.

    The paper does not report uncertainties. Please quantify some measure of uncertainty on your experimental results.

    “ […] replacing ResNet-34 with ResNet-50 increases DSC by 1.2%, demonstrating the model’s scalability with parameter size.” - This assumption feels intuitively correct, but should be written with less confidence when not providing uncertainties. 1.2% DSC might still be a statistical fluke. Consider writing “suggesting” instead of “demonstrating” here.

    If FT-SAM only fine-tuned the decoder of SAM on SA-Med2D-20M, and already improved performance by 8% DSC, it stands to reason that fully fine-tuning SAM on that dataset might lead to an even bigger performance gain. However, this means that a significant part of the performance gain reported in the paper (at least 8% of the reported 21.6%) should be attributed to the datasets and not the method. This should be reflected in the discussion of the ablation study. Please also provide the fully fine-tuned FT-SAM in the ablation study for the same reason.

    Please add a thorough discussion of limitations.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The presented paper is nicely written and the reported performance of the presented method is impressive. Nonetheless, the paper could still be strengthened. The works would benefit from some clarifications, rephrasing, and additional experiments as stated above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well structured, well written, and easy to follow. Its contribution is clearly motivated and explained. The contribution itself seems novel and the results demonstrate that it is quite powerful. A little more convincing is needed to disentangle the influence of the chosen datasets on the performance of the model.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Comments have been addressed sufficiently.



Review #3

  • Please describe the contribution of the paper

    As the main contribution, the authors introduce a synchronized fusion strategy that performs stage-wise integration of ViT and CNN features, rather than a single-step merge. They augment SAM’s image encoder with a lightweight CNN branch to capture fine-grained details, achieving exceptional zero-shot performance on six unseen datasets. This design establishes a strong baseline for next-generation segmentation foundation models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is clearly written.

    • The method integrates a CNN branch into the widely known SAM image encoder, injecting medical-specific domain bias into the ViT backbone. Rather than merging through expert-designed modules, it employs a synchronized stage-wise fusion strategy.

    • Zero-shot generalization is robustly evaluated with both qualitative and quantitative results across six unseen datasets from different modalities: MR, CT, X-ray, Microscopy, and Pathology, the last two being entirely new modalities. The results highlight strong zero-shot capabilities, ranking first in four out of six datasets and second in the remaining ones across various modalities and model variants.

    • The model consistently demonstrates strong performance under a fair comparison design. Limitations regarding zero-shot generalization performance are also clearly acknowledged.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Section 3.5 presents the ablation study; however, the analysis is not straightforward to follow. Table 5 lacks row numbering, despite frequent references to specific rows, e.g., “rows 3 vs. 8 vs. 11,” “rows 4 vs. 6 vs. 7,” and “rows 2 vs. 5”, which makes it difficult to trace the comparisons.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a novel foundation model with strong potential for impact, serving as a robust baseline that demonstrates strong zero-shot capabilities on unseen datasets. It holds significant promise for advancing medical image segmentation tasks.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This approach provides a solid foundation for next-generation segmentation foundation models. Its zero-shot generalization is robustly evaluated across unseen datasets, including a few entirely new modalities. The qualitative results intuitively and effectively demonstrate the boundary segmentation performance. Many suggestions were addressed in the rebuttal phase. Overall, it shows great promise for advancing medical image segmentation tasks.




Author Feedback

We sincerely appreciate all the positive and insightful comments. Below, we clarify the issues raised by reviewers.

@Reviewer#1 Q1, Q6 (Dataset and Ablation Study):

  • Tab.2 & 3 show that our method consistently outperforms the baselines under identical training datasets.
  • We will revise the ablation section to clarify the connection with FT-SAM.
  • As shown in Tab.2, MedSAM (fully fine-tuned SAM) reaches a DSC of 80.8, while our synchronized encoding variant achieves 85.5 (Tab.5, Row 3), demonstrating the strength of our design. This comparison will be made explicit in the revised ablation section.

Q2 (Compute Overhead):

  • SyncSAM-50 and SyncSAM-34 contain approximately 48M and 26M trainable parameters, respectively—substantially fewer than SAM (~90M). We will include this in the final version.
  • Regarding inference time: SyncSAM-50 runs at 37ms, SAM-Med2D at 29ms, and SAM at 20ms. Despite this, SyncSAM-50 outperforms SAM-Med2D and SAM by 8.7% and 22.0% on average, respectively.
  • We will add FLOPs and latency comparisons in Tab.2, 3 & 4.

Q3, Q5 (Citations and Wording):

  • We appreciate the suggestions. We will include the missing citations and revise the text for clarity and precision.

Q4 (Uncertainty Quantification):

  • We conducted three independent runs per experiment with different random seeds and will report standard deviations accordingly.
  • We also plan to investigate uncertainty estimation methods such as Monte Carlo Dropout and Deep Ensembles in future work.

@Reviewer#3 Q1 (Table Clarity):

  • Thank you for the suggestion. We will add explicit row numbers to Tab.5 in the final version.

@Reviewer#4 Q1 (Motivation and Novelty Clarity):

  • Due to space limitations, the second paragraph of the Introduction provides a high-level categorization of existing paradigms, which includes MedSAM (fully fine-tuned on large datasets), MedSAM2 (optimized for inference strategies), and Med-SA (PEFT with small-scale data). We will explicitly mention these models in the revised version to improve clarity.
  • To our knowledge, our work is the first to explore incorporating expert-designed convolutional branches into a foundation model trained on large-scale medical data, achieving SOTA performance in interactive segmentation with significantly fewer trainable parameters.

Q2 (Ambiguous Phrasing):

  • We referenced grounded versions of SAM (e.g., Grounded SAM) in Sec.1 to motivate the need for a strong, simple backbone. We will add concrete medical examples, such as MedCLIP-SAM.

Q3 (Mathematical Clarity):

  • We appreciate your suggestion. In the final version, we will include formal mathematical expressions in the Method section to clearly describe the data flow.

Q4 (Evaluation Metrics):

  • Due to space constraints, we reported only Dice scores, following the conventions of prior work.
  • We also agree that boundary precision is important. Qualitative results in Fig.2 (first row) demonstrate accurate boundary segmentation.

Q5 (Baseline Selection Justification):

  • For Tab.2 & 3, we selected three representative fine-tuning strategies: full (MedSAM), adapter-based (SAM-Med2D), and partial (FT-SAM).
  • For Tab.4, which evaluates zero-shot interactive segmentation, we included both the latest models (e.g., IMIS-Net, released in Nov 2024) and established methods from top-tier conferences (e.g., ScribblePrompt, ECCV 2024) to ensure comprehensive and timely comparisons.

Q6 (Model Complexity Analysis):

  • SyncSAM-50 and SyncSAM-34 have ~48M and ~26M trainable parameters, which is significantly lower than the ~90M parameters of SAM. We will include this in the final version.
  • In terms of inference latency, SyncSAM-50 requires 37ms per image, compared to 29ms for SAM-Med2D and 20ms for SAM. Despite the slightly higher latency, SyncSAM-50 achieves superior performance, with average improvements of 8.7% over SAM-Med2D and 22.0% over SAM.
  • We will include FLOPs and inference time metrics in Tab.2, 3 & 4 of the final version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal phase, all three reviewers reached a consensus in recognizing the contribution of this paper and supporting its acceptance. The authors are encouraged to revise the paper by addressing the reviewers’ comments and incorporating clarifications provided during the rebuttal to further improve its quality.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


