Abstract

Volumetric medical image segmentation is pivotal in enhancing disease diagnosis, treatment planning, and advancing medical research. While existing volumetric foundation models for medical image segmentation, such as SAM-Med3D and SegVol, have shown remarkable performance on general organs and tumors, their ability to segment certain categories in clinical downstream tasks remains limited. Supervised finetuning (SFT) is an effective way to adapt such foundation models to specific downstream tasks and achieve remarkable performance on them. However, it inadvertently degrades the general knowledge previously stored in the original foundation model. In this paper, we propose SAM-Med3D-MoE, a novel framework that seamlessly integrates task-specific finetuned models with the foundation model, creating a unified model at the minimal additional training expense of an extra gating network. This gating network, in conjunction with a selection strategy, enables the unified model to achieve performance comparable to that of the original models on their respective tasks, both general and specialized, without updating any of their parameters. Our comprehensive experiments demonstrate the efficacy of SAM-Med3D-MoE, with an average Dice increase from 53.2% to 56.4% on 15 specific classes; in particular, it achieves remarkable gains of 29.6%, 8.5%, and 11.2% on the spinal cord, esophagus, and right hip, respectively. It also achieves 48.9% Dice on the challenging SPPIN2023 Challenge, significantly surpassing the general expert's performance of 32.3%. We anticipate that SAM-Med3D-MoE can serve as a new framework for adapting foundation models to specific areas of medical image analysis. Code and datasets will be publicly available.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3979_paper.pdf

SharedIt Link: https://rdcu.be/dV51K

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72114-4_53

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wan_SAMMed3DMoE_MICCAI2024,
        author = { Wang, Guoan and Ye, Jin and Cheng, Junlong and Li, Tianbin and Chen, Zhaolin and Cai, Jianfei and He, Junjun and Zhuang, Bohan},
        title = { { SAM-Med3D-MoE: Towards a Non-Forgetting Segment Anything Model via Mixture of Experts for 3D Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {552 -- 561}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a framework that combines foundation and specialized models for specific downstream medical imaging tasks. The model uses SAM-Med3D and follows a Mixture of Experts approach to learn how to combine the knowledge of the foundation model with the best expert model. The results show that the baseline SAM-Med3D considerably drops in performance on the original classes when finetuned, while the proposed method is capable of partially preserving that performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    RELEVANCE OF THE TASK: medical image segmentation is a highly relevant task due to its complexity and importance in diagnosis. Also, foundation models are of great interest to the machine learning community.

    PERFORMANCE SUPERIOR TO THE BASELINE: the model shows superior general performance compared to SAM-Med3D finetuned on specific tasks.

    WRITING AND PRESENTATION: the paper is organized and easy to follow and understand. The figures are clear and self-contained.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    UNCLEAR MOTIVATION: the paper aims to use foundation models for specific medical image analysis tasks. However, it is unclear why general knowledge needs to be preserved if the goal is to focus on a specific downstream task, especially since the specialized models obtain better performance. Conversely, if the aim is to maintain generality, using specialized models that harm the performance of some classes while improving others might hinder this goal.

    TECHNICAL NOVELTY: the proposed method builds upon SAM-Med3D [21], Mixture of Experts [12], and cross-attention layers [5] to create a larger model that preserves general knowledge and leverages specialized knowledge.

    SCALABILITY: to handle tasks with numerous labels (e.g., segmentation of abdominal organs), the model needs to fine-tune one expert mask decoder per category, which could become impractical. In addition, Fig. 4 (b) shows that the 15 un-finetuned classes shown in the graph, and probably all the “other” classes, are affected by the fine-tuning. Although the proposed method reduces this impact, it is still evident, suggesting the need for an expert in every class for a fully non-forgetting model.

    TRAINING SPEED COMPARISON: the training speeds of this model and SAM-Med3D are not directly comparable, since in this method the image and prompt encoders are frozen and only one of the mask decoders and the gating network are updated on each pass. Also, the similar training speed raises questions about the resource demands of the gating network, since the mask decoders are designed to be very lightweight compared to the encoder.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The title of the paper should be adjusted. The “non-forgetting segment anything model” is misleading because it suggests that the foundation model could be improved for some downstream tasks without affecting the generalized knowledge. However, the proposed method ensembles the new specialized model (which has already forgotten the original knowledge) with the foundation one. Even if this solution mitigates the effects of forgetting, it does not really address it. Also, the motivation of the paper should be clarified, since at the moment it is confusing (please refer to the weaknesses and justification-of-the-rating sections for more information).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s main weakness lies in its unclear motivation. Foundation models are designed to generalize to new data or tasks, either through zero-shot learning or adaptation to downstream tasks. When adapting a foundation model for a specific task, it indicates the need for specialized knowledge, prioritizing task-specific performance over generalizability. Alternatively, if a different task arises, a derived model could be created from the original foundation model. Thus, the rationale behind combining foundation and expert models to create a new model that is both general and specific remains unclear. Moreover, if the expert model performs better on its own compared to when combined with the foundation model, as was the case in the paper, it raises the question of why not solely use the expert model. I am leaning towards rejection, but would like to hear the authors’ explanation of the motivation and goal of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the concerns raised by all reviewers. The two main ones for me are about the aim of the paper and the model’s scalability. With respect to the aim of the paper, the model tries to tackle the catastrophic forgetting problem by ensembling a general model and expert decoders. This solution results in a middle point where the new model is better than the general one in the specific categories, but worse than the experts. In the rest of the categories it is the other way around: it is better than the experts but worse than the general model. The greatest concern with this solution, however, is the fact that a new expert is required for every task (category) that one would like to improve. In other words, even if there is no need to train a super-model from scratch, the resulting model keeps growing and becomes impossible to scale beyond a certain number of categories. Therefore, the explanation about scalability did not answer the concern. Given that most concerns were addressed and in general the paper could be of interest to the community, I am changing my score to weak accept. However, I strongly encourage the authors to rephrase the parts about the efficiency of the final model (Section 3), since, even if finetuning the experts and gating mechanism is not as expensive as training the whole model from scratch, it does not mean the resulting model is efficient.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a simple strategy to fine-tune the foundation model based on a Mixture of Experts (MoE). The main idea is combining the general knowledge from the foundation model with task-specific knowledge from the finetuned expert decoders via the gating module.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well-written and easy to follow.
    2. The proposed method is simple, theoretically sound and aligned with the motivation.
    3. This idea has a high level of generalization.
    4. The proposed method is evaluated through several experiments.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The only weakness of this paper is the performance: the proposed method surpasses neither the baseline on the original task nor the fine-tuned expert on SPPIN.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) With both Table 1 and Table 2, the authors should reiterate the metric being used, instead of solely mentioning it in the introduction.
    2) Please bold the highest scores in Table 1 and Table 2 for easier reading.
    3) If possible, please explain the motivation for choosing the weighted sum of the first-ranked expert and the general output. The authors should include an ablation study on choosing only the first-ranked expert when the switch is activated.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This method is simple, aligns with the motivation, and can be applied to other computer vision problems. However, it is not a groundbreaking work, and the results of the method are not particularly impressive.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors provided a strong rebuttal, addressing the concerns about motivation, novelty, ablation study, and performance. In general, the proposed method is technically novel and clearly presented. I recommend acceptance for this work.



Review #3

  • Please describe the contribution of the paper

    This paper proposes SAM-Med3D-MoE to integrate task-specific finetuned models with the SAM foundation model. It proposes to use a gating network to avoid fine-tuning complete foundation models and to avoid the catastrophic forgetting problem.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper addresses an interesting problem, catastrophic forgetting, when fine-tuning existing models on downstream tasks. It proposes an effective strategy for combining specific models using a gating network.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is a slight inconsistency between Figure 3 and Section 2.3. Figure 3 suggests that when the score S_top equals the threshold, the output is the general mask. However, Section 2.3 suggests that when the score equals or is above the threshold, the output is a weighted sum.

    Since the authors propose to use multiple mask decoders, it would be beneficial if the authors could include inference costs for SAM-Med3D-MoE and compare them with the regular SAM-Med3D.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should clarify what the original tasks are in Table 1 and comment on why the differences between original tasks and downstream tasks could lead to a large performance gap.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is an interesting study on SAM’s transferability to downstream tasks and addresses an interesting problem during fine-tuning. This paper also conducts a comprehensive evaluation on various segmentation tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed all my concerns. This paper deserves to be accepted.




Author Feedback

To Reviewer 1

  1. Unclear motivation: We clarify our motivation here. Our motivation is to propose a method that broadly extends the capability of a foundation model using an ensemble approach with several well-trained expert models for 3D medical image segmentation (MIS). Consequently, the ensembled model can handle both general and task-specific tasks simultaneously. The motivation stems from the limitations of existing foundation models in 3D MIS, and we appreciate that the other two reviewers acknowledged our motivation. While foundation models need the capability to handle most tasks in their area, current foundation models in 3D MIS are limited due to the restricted availability of public data and the time-consuming nature of the manual labeling process (discussed on page 3). Therefore, our method proposes a new approach tailored to this scenario, continuously expanding the capabilities of the foundation model via ensembling expert knowledge.
  2. Technical novelty: To clarify, we propose a new framework that can quickly ensemble a foundation model with off-the-shelf well-trained decoders derived from that foundation model by training a simple gating network, without expensive training such as training a new foundation model from scratch or finetuning a foundation model with large-scale data (combining the original and task-specific data). From this perspective, our method is novel, scalable, and efficient in inheriting the rich knowledge of both the pretrained general and specific models.
  3. Scalability: There may be a misunderstanding of the application of our method. Given off-the-shelf well-trained decoders with the same architecture in a model zoo, we can use our method to cheaply integrate new knowledge into an original foundation model in a progressive way, without needing to train a super network from scratch using large-scale data. Therefore, our framework is scalable and efficient.
  4. Training speed: We agree that comparing the training speed with the original SAM-Med3D is not appropriate, since SAM-Med3D aims to train a foundation model from scratch while we aim to extend the capability of SAM-Med3D by efficiently and quickly ensembling the task-specific decoders. In practice, our gating network is lightweight and thus training is very fast: training the gating network on newly added categories from well-trained decoders takes only about 16 hours on 1 V100 (32G) or 2 hours on 8 V100s. For inference cost, please refer to A1 of R4.

To Reviewer 3

  1. Performance issue: As the proposed gating module is responsible for combining the capabilities of the foundation model and the finetuned expert, the theoretically optimal outcome is for the performance on the original tasks to be the same as the baseline SAM-Med3D, while the performance on SPPIN matches the fine-tuned expert's results.
  2. Output selection: The motivation for the weighted sum is to prevent a bad output if the first-ranked expert is wrong. We admit that missing this ablation is an oversight; in our experiments, directly using the first-ranked expert performs worse than our strategy. We will add these results later if possible. Thanks for the reminder.
  3. Writing: Metric reiteration and bolding the best scores: We will introduce the metrics used in the paper and improve the presentation of Tables 1 and 2 by bolding the highest scores for easier reading.

To Reviewer 4

  1. Inference cost: For memory cost, SAM-Med3D has 100M parameters. Our method adds only an additional 7M fixed parameters for the gating network and 7M per newly added expert (see the arithmetic sketch after this list). For inference speed, the comparison between SAM-Med3D and our model (with 15 newly added experts) in FPS is 1.33/1.14 vs. 1.03/0.91 for box/6-point prompts.
  2. Writing: Content inconsistency: The formula in Section 2.3 was written incorrectly with respect to the boundary condition. When the confidence score equals the threshold, we use the general mask decoder (see the sketch below). We will correct this later.
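
To make the memory figures above concrete, the totals they imply work out as follows (a back-of-the-envelope calculation only, using the 15-expert configuration from the reported FPS comparison):

    # Parameter totals implied by the reported sizes (in millions).
    base = 100        # frozen SAM-Med3D foundation model
    gating = 7        # fixed gating network
    per_expert = 7    # each newly added expert mask decoder
    n_experts = 15    # experts used in the reported FPS comparison

    total = base + gating + n_experts * per_expert
    print(total)      # 212, i.e. roughly 212M parameters for the 15-expert model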
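
The corrected selection rule can be summarized in a short sketch (a minimal illustration based on the description above, not the released implementation; the function name, the use of the top gating score S_top as the mixing weight, and the default threshold are assumptions):

    import torch

    def select_mask(scores, general_mask, expert_masks, tau=0.5):
        # scores: gating-network confidence for each expert, shape (num_experts,)
        s_top, top_idx = scores.max(dim=0)
        if s_top <= tau:
            # At or below the threshold, keep the general mask decoder's output.
            return general_mask
        # Above the threshold, return a weighted sum of the first-ranked
        # expert's output and the general output.
        expert_mask = expert_masks[int(top_idx)]
        return s_top * expert_mask + (1.0 - s_top) * general_mask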




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers found the work to be well-motivated and novel, proposing an effective strategy for addressing an interesting problem during fine-tuning of SAM. They also commend the comprehensive evaluation and theoretical soundness of the approach. The rebuttal successfully addresses most of the concerns raised by the reviewers.

    Considering the clear motivation, demonstrated novelty, and thorough evaluation, I recommend accepting this paper. This will give the MICCAI community the opportunity to further discuss and explore the potential of this work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


