Abstract

Recent adaptations of the powerful and promptable Segment Anything Model (SAM), pretrained on a large-scale dataset, have shown promising results in medical image segmentation. However, existing methods fail to fully leverage the intermediate features from SAM’s image encoder, limiting its adaptability. To address this, we introduce MoE-SAM, a novel approach that enhances SAM by incorporating a Mixture-of-Experts (MoE) during adaptation. Central to MoE-SAM is an MoE-driven feature-enhancing block, which uses learnable gating functions and expert networks to select, refine, and fuse latent features from multiple layers of SAM’s image encoder. By combining these features, the model creates a more robust image embedding that captures both low-level local and high-level global information. This comprehensive embedding facilitates prompt embedding generation and mask decoding, thereby enabling more effective self-prompting segmentation. Extensive evaluations across four benchmark medical image segmentation tasks show that MoE-SAM outperforms both task-specialized models and other SAM-based approaches, achieving state-of-the-art segmentation accuracy. The code is available at: https://anonymous.4open.science/r/E-SAM-4418.
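To illustrate the gate-and-fuse idea sketched in the abstract, here is a minimal PyTorch sketch of how a learnable gate might weight expert refinements of features pooled from several encoder layers. Everything here (the class name MoEFeatureFusion, mean pooling over layers, four experts, ViT-B shapes) is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MoEFeatureFusion(nn.Module):
    """Illustrative gate-and-fuse block: a learned gate scores a small set of
    expert networks, whose refinements of the pooled multi-layer features are
    mixed by those scores. Names and shapes are assumptions, not the paper's code."""

    def __init__(self, dim: int, num_layers: int, num_experts: int = 4):
        super().__init__()
        self.num_layers = num_layers
        # One lightweight MLP expert per refinement path (hypothetical design).
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(num_experts)
        )
        # Gating network: scores each expert from the pooled layer features.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, layer_feats: list) -> torch.Tensor:
        # layer_feats: one (B, N, dim) token tensor per encoder layer.
        assert len(layer_feats) == self.num_layers
        pooled = torch.stack(layer_feats, dim=0).mean(dim=0)          # (B, N, dim)
        weights = torch.softmax(self.gate(pooled), dim=-1)            # (B, N, E)
        refined = torch.stack([e(pooled) for e in self.experts], -1)  # (B, N, dim, E)
        return (refined * weights.unsqueeze(2)).sum(dim=-1)           # weighted expert mix

# Usage sketch with assumed ViT-B shapes: 12 layers, 196 tokens, width 768.
# feats = [torch.randn(2, 196, 768) for _ in range(12)]
# fused = MoEFeatureFusion(dim=768, num_layers=12)(feats)  # (2, 196, 768)
```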

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1526_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Asphyxiate-Rye/E-SAM

Link to the Dataset(s)

MMWHS dataset: https://zmiclab.github.io/zxh/0/mmwhs/
Synapse CT dataset: https://www.synapse.org/Synapse:syn3193805/wiki/217789
BTCV dataset: https://www.synapse.org/Synapse:syn3193805/wiki/217789
ACDC dataset: https://www.creatis.insa-lyon.fr/Challenge/acdc/index.html

BibTex

@InProceedings{LiRuo_MoESAM_MICCAI2025,
        author = { Li, Ruocheng and Wu, Lei and Gu, Jingjun and Xu, Qi and Chen, Wanyi and Cai, Xiaoxu and Bu, Jiajun},
        title = { { MoE-SAM: Enhancing SAM for Medical Image Segmentation with Mixture-of-Experts } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {365--375}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces MoE-SAM, an approach that enhances the Segment Anything Model (SAM) for medical image segmentation. MoE-SAM leverages a Mixture-of-Experts (MoE) module to selectively combine features from multiple layers of SAM’s image encoder, generating an enhanced image embedding for segmentation. The method is evaluated on four benchmark medical image segmentation datasets and compared against task-specific state-of-the-art models, promptable SAM variants, and prompt-free SAM variants.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is generally clear and well-written.
    • The experimental evaluation is thorough and comprehensive.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The paper primarily explores the use of hierarchical or intermediate features within the ViT encoder of SAM for segmentation, which the authors claim has been overlooked in previous literature (paragraph 2, last 2 sentences). However, this approach has already been demonstrated as effective and has become a common practice in medical image segmentation, where U-shape hierarchical decoding [A1-A2] or fusion [A3] of multi-layer ViT features is widely employed.
    • The motivation for using MoE to fuse image features from different ViT layers is not sufficiently justified. The necessity of the MoE architecture is unclear, and its advantages over more standard fusion methods, such as simple addition or concatenation, are not demonstrated. Moreover, the paper would benefit from further analysis regarding the rules learned for selecting features from different layers, an assessment of which layers contribute most significantly to segmentation, or a clearer explanation of how the MoE mechanism dispatches features.
    • The description of the method in Fig. 1 and equations (2)-(4) seems inconsistent. The equations suggest the use of a sparse MoE strategy with a single gating function, whereas Fig. 1(b) depicts multiple routers.
    • It’s unclear how to decide the number of experts in the MoE. Is there a relationship between the number of experts and the number of layers in the ViT encoder?

    References:
    [A1] Tang, Yucheng, et al. “Self-supervised pre-training of Swin transformers for 3D medical image analysis.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
    [A2] Gong, Shizhan, et al. “3DSAM-adapter: Holistic adaptation of SAM from 2D to 3D for promptable medical image segmentation.” Medical Image Analysis. 2024.
    [A3] Huang, Chaoqin, et al. “Adapting visual-language models for generalizable anomaly detection in medical images.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some statements in this paper are incorrect or incomplete. Although the method introduces the popular MoE architecture, it lacks motivation and interpretability. Some narratives of the methodological details are ambiguous.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper addresses image segmentation by augmenting the Segment Anything Model (SAM) with a mixture-of-experts block to exploit latent features from SAM layers. Evaluation on four public datasets covering a range of anatomy showed improvement over SAM, SAM variants, and other baseline methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel formulation taking advantage of multiple SAM encoder layers via mixture of experts and self-prompting
    • Thorough evaluation with multiple baselines and datasets
    • Promising results
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The improvement beyond the baselines is relatively small.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Incorporating MoE in SAM is novel and the results are promising.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I like the use of MoE with SAM at multiple layers. I think it’s novel and well evaluated. I found the rebuttal useful and I think it reasonably addressed all of our issues.



Review #3

  • Please describe the contribution of the paper

    The paper introduces MoE-SAM, a novel adaptation of the Segment Anything Model (SAM) for medical image segmentation, leveraging Mixture-of-Experts (MoE) to enhance feature integration from multiple layers of SAM’s image encoder. Ablation studies validate the contributions of each component, with visualizations showing improved feature integration.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Novel use of MoE to selectively combine features from multiple SAM encoder layers, capturing both local and global context for robust segmentation.
    • A lightweight prompt embedding generator eliminates manual prompts, enabling efficient adaptation to diverse medical images.
    • Adapter-based training preserves SAM’s pretrained knowledge while reducing computational costs compared to full fine-tuning or LoRA.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • While the method claims efficiency, it lacks a comparison of training/inference times against LoRA-based approaches or parameter counts.
    • All datasets focus on anatomically contiguous structures (e.g., hearts, organs). The method’s performance on dispersed or multi-instance targets (e.g., tumors, lesions) is untested, despite SAM’s potential for such tasks.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Minor comments:

    • Figure 2: The feature visualizations would benefit from side-by-side comparisons with baseline SAM features to highlight MoE-SAM’s improvements more clearly.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper could benefit from additional analysis of computational efficiency and validation on multi-instance targets, its rigorous evaluation across four benchmarks, clear ablation studies, and state-of-the-art performance justify acceptance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their insightful feedback. R1Q1: Limited improvement over baselines. A: Our method outperforms all other prompt-based and prompt-free SAM variants. Compared to task-specific models, it achieves competitive accuracy with significantly fewer trainable parameters and lower computational cost. Furthermore, it shows complementary strengths across DSC, HD, and the additionally computed NSD, achieving consistently superior NSD performance across all datasets.

R2Q1: Lack of runtime or parameter comparison. A: Our approach, comprising MoE-FEB and LPEG, is compatible with various fine-tuning strategies (see Table 2). We adopt Adapter in the final design due to its superior performance. Compared to standard fine-tuned SAM (Adapter: 12.5M/99.1M trainable/total parameters, LoRA: 5.5M/92.2M), our full model (32.7M/120.5M) introduces a moderate increase in model size and computation, but yields significant accuracy gains.
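For readers checking such numbers, trainable/total parameter counts of the kind quoted above can be reproduced with a standard PyTorch idiom (generic, not specific to this paper's code):

```python
import torch.nn as nn

def param_counts(model: nn.Module):
    """Return (trainable, total) parameter counts of a model."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total

# usage: t, n = param_counts(model); print(f"{t / 1e6:.1f}M / {n / 1e6:.1f}M")
```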

R2Q2: Unverified performance on dispersed targets. A: Our current experiments focus on continuous targets with clear structures and rich hierarchical features, ideal for evaluating MoE-FEB. We believe our approach will generalize well to datasets with dispersed targets in future experiments, owing to the Adapter and MoE-based adaptations that preserve the robustness of the vanilla SAM.

R3Q1: Limited novelty in hierarchical feature usage. A: While we agree that using intermediate features is common in medical image analysis, only a few SAM-based methods leverage them. These prior works typically aggregate outputs from only four ViT stages and neglect fine-grained semantics within individual layers. In contrast, our method selectively fuses features from all 12 layers of the ViT via the MoE-FEB module, capturing richer hierarchical structural semantics.

R3Q2.1: Unclear MoE motivation. A: Our motivation stems from the observation that different layers of SAM’s image encoder capture distinct information (Dosovitskiy et al., 2020). Moreover, MoE frameworks are well-suited for handling such hierarchical representations (Han et al., 2024, FuseMoE). Accordingly, we propose MoE-FEB to adaptively select and fuse these features.

R3Q2.2: Lack of MoE necessity justification. A: We compare the proposed MoE-FEB with simple feature addition across three SAM variants in Table 3. The results demonstrate that incorporating MoE-FEB consistently improves segmentation accuracy over the addition methods. As for concatenation, fusing all 12 layers in this way would greatly increase feature dimensionality and computational cost. Additionally, our design relies on element-wise addition with SAM’s encoder output, making concatenation incompatible.
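To make the dimensionality point concrete (an illustration with assumed ViT-B shapes, not the authors' code): summing features from 12 layers preserves the embedding width needed for element-wise fusion with SAM's encoder output, while concatenation multiplies it twelvefold.

```python
import torch

# Assumed shapes: batch 2, 196 patch tokens, ViT-B embedding dim 768.
feats = [torch.randn(2, 196, 768) for _ in range(12)]

added = torch.stack(feats).sum(dim=0)  # (2, 196, 768): shape-preserving, addable to SAM's output
concat = torch.cat(feats, dim=-1)      # (2, 196, 9216): 12x wider, incompatible with element-wise addition
```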

R3Q2.3: Unclear how MoE prioritizes layer-wise features. A: We employ an Expert Choice Routing strategy in our MoE-FEB (Zhou et al., 2022), where experts select top-k tokens rather than traditional token-to-expert routing. In our preliminary exploration, we calculated how frequently each layer was selected by the experts. The results suggest that shallower layers are selected more often than deeper layers. However, due to space constraints, we did not include these findings in the paper. We plan to explore this topic further in future work.
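A minimal sketch of expert-choice routing in the sense of Zhou et al. (2022), where each expert picks its own top-k tokens from the token-to-expert affinity matrix rather than each token picking an expert; the helper and shapes below are illustrative assumptions, not the paper's code. Counting how often tokens drawn from each encoder layer appear among the selected indices yields layer-selection frequencies like those described above.

```python
import torch

def expert_choice(gate_logits: torch.Tensor, k: int):
    """gate_logits: (N, E) affinities of N tokens to E experts.
    Each expert selects its k highest-affinity tokens (expert-to-token routing)."""
    scores = torch.softmax(gate_logits, dim=-1)  # normalize each token's affinity over experts
    weights, token_idx = scores.topk(k, dim=0)   # per expert: top-k tokens, both (k, E)
    return token_idx, weights                    # which tokens each expert processes, and their weights
```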

R3Q3: Inconsistency between Figure and Equations. A: Thank you for pointing out the inconsistency. We agree that Fig. 1(b) may be misleading: Router1–Router4 all refer to the same router. We will correct this in the revised version.

R3Q4: Unclear selection of the number of MoE experts and its relation to ViT layers. A: We empirically choose four experts to balance model capacity and efficiency, achieving sufficient feature representation without redundancy. The relationship between the number of experts and the number of ViT layers remains unclear. While more layers may require more experts to capture richer semantics, validating this hypothesis is non-trivial, since modifying the number of layers also affects the performance of the base model.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors addressed most concerns. I tend to accept the paper.


