Abstract

Deep learning-based medical image segmentation models often suffer from domain shift, where models trained on a source domain do not generalize well to other, unseen domains. As a prompt-driven foundation model with powerful generalization capabilities, the Segment Anything Model (SAM) shows potential for improving the cross-domain robustness of medical image segmentation. However, SAM performs significantly worse in automatic segmentation scenarios than when manually prompted, hindering its direct application to domain generalization. Upon further investigation, we discovered that this performance degradation is related to the coupling effect of inevitable poor prompts and mask generation. To address the coupling effect, we propose the Decoupled SAM (DeSAM). DeSAM modifies SAM’s mask decoder by introducing two new modules: a prompt-relevant IoU module (PRIM) and a prompt-decoupled mask module (PDMM). PRIM predicts the IoU score and generates mask embeddings, while PDMM extracts multi-scale features from the intermediate layers of the image encoder and fuses them with the mask embeddings from PRIM to generate the final segmentation mask. This decoupled design allows DeSAM to leverage the pre-trained weights while minimizing the performance degradation caused by poor prompts. We conducted experiments on publicly available cross-site prostate and cross-modality abdominal image segmentation datasets. The results show that our DeSAM leads to a substantial performance improvement over previous state-of-the-art domain generalization methods. The code is publicly available at https://github.com/yifangao112/DeSAM.
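At a structural level, the decoupled design described in the abstract can be sketched as follows; the function and argument names are illustrative placeholders, not the actual repository API:

```python
def desam_forward(image_emb, multiscale_feats, prompt_emb, prim, pdmm):
    """Structural sketch of DeSAM's decoupled decoder (illustrative names).

    PRIM (prompt-relevant path) consumes prompt and image embeddings to
    produce an IoU score plus mask embeddings; PDMM (prompt-decoupled path)
    fuses multi-scale encoder features with those mask embeddings to produce
    the final mask, so a poor prompt cannot directly corrupt mask generation.
    """
    iou_score, mask_emb = prim(image_emb, prompt_emb)   # prompt-relevant path
    final_mask = pdmm(multiscale_feats, mask_emb)       # prompt-decoupled path
    return final_mask, iou_score
```

The point of the sketch is the data flow: the prompt only reaches the mask head indirectly, through the mask embeddings, rather than steering mask generation directly as in SAM's original decoder.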

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1496_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/yifangao112/DeSAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gao_DeSAM_MICCAI2024,
        author = { Gao, Yifan and Xia, Wei and Hu, Dingdu and Wang, Wenkui and Gao, Xin},
        title = { { DeSAM: Decoupled Segment Anything Model for Generalizable Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a fully automated segmentation method that adapts SAM by decoupling prompts from mask generation. The SAM prompt and image encoders are inherited, but the mask decoder is replaced with the PRIM and PDMM modules. PRIM generates the mask embeddings and carries the IoU head. The mask embeddings are merged with the bottleneck embeddings in PDMM, which consists of SE residual blocks and upsampling layers that fuse the image embeddings hierarchically and eventually predict the segmentation mask. Two different losses are used depending on the prompt type (grid points vs. whole-image box). Qualitative and quantitative (Dice) validation is done on prostate and abdominal segmentation datasets from various sites and modalities. Two variants, DeSAM-B and DeSAM-P, are evaluated against competing methods and shown to perform better.
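The SE residual blocks the review mentions follow the standard Squeeze-and-Excitation pattern: global average pooling per channel, a bottleneck MLP with a sigmoid gate, channel reweighting, and a residual connection. A minimal NumPy sketch of the channel-attention part, with random placeholder weights (purely illustrative; the real PDMM blocks also include convolutions and upsampling):

```python
import numpy as np

def se_residual_block(x, reduction=4, rng=None):
    """Channel-attention part of an SE residual block on a (C, H, W) map.
    Weights are random placeholders, not trained parameters."""
    rng = rng or np.random.default_rng(0)
    c = x.shape[0]
    # Squeeze: global average pool per channel -> (C,)
    s = x.mean(axis=(1, 2))
    # Excitation: bottleneck MLP (C -> C/r -> C) with a sigmoid gate in (0, 1)
    w1 = 0.1 * rng.standard_normal((c // reduction, c))
    w2 = 0.1 * rng.standard_normal((c, c // reduction))
    z = np.maximum(w1 @ s, 0.0)                # ReLU
    gate = 1.0 / (1.0 + np.exp(-(w2 @ z)))     # sigmoid
    # Scale each channel by its gate, then add the residual connection
    return x + x * gate[:, None, None]
```

Because the gate lies strictly in (0, 1), the block can only attenuate or pass each channel's contribution before the residual add, which is what makes SE recalibration cheap and stable to stack hierarchically.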

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    - Validation done on publicly available and cross-site datasets
    - Quantitative validation shows improvements in the Dice metric for all individual datasets/sites and in the overall score in Table 2
    - Ablations are provided in Table 1
    - The upper bound / oracle is also listed for reference

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Minor typos:
    - “Deep learning models achieves remarkable performance in medical image…”
    - “Specifically, we design two new modules and added to the fully automated SAM.”
    - “which fused the image embeddings from the image encoder with the mask embeddings from PRIM”
    - “cross-modality abdominal abdominal multi-organ segmentation”

    -What does “c” mean in figure 1? Concatenation?

    -Only Dice is used for quantitative analysis

    -Table 2 should list min/max/variation for the various experiments to get more insights on performance

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors make simple architectural changes to SAM and provide quantitative analysis by comparing against SOTA methods on public datasets. A weakness of this paper is the use of only DSC for quantitative analysis. It would be good to show strong performance on other metrics (e.g., volumes/areas) to demonstrate the usefulness of the method in a clinical setting. The min/max for each of the methods would also be revealing. Stronger validation would be needed before deploying such a method in clinical settings.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes architectural modifications to SAM to get a fully automated segmentation method for medical images. Quantitative validations demonstrate good results, although, they could have been made stronger. The writing is good in most places except for some typos listed above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes DeSAM, a decoupled framework for SAM. It observes the coupling effect of inevitable poor prompts and mask generation, and proposes a Prompt-Relevant IoU Module (PRIM) and a Prompt-Decoupled Mask Module (PDMM). Experiments are conducted on the domain generalization (DG) task, where DeSAM outperforms other methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • PRIM adds an IoU prediction module that predicts the IoU for the prompt, which is a novel mechanism for SAM.
    • The paper is well written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • More details of PRIM should be provided. What is the IoU prediction for, and what supervises it? Is it the IoU between the predicted mask and the ground-truth mask? The description is very unclear.
    • What is the difference between PRIM and Mask Scoring R-CNN? The methods seem very similar, but the paper neither mentions nor cites it. Please respond to this carefully, or I may consider that there are ethical issues here. [1] Huang, Zhaojin, et al. “Mask Scoring R-CNN.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
    • The whole framework is built on “We argue that the poor performance of fully automated SAM in medical image segmentation can be attributed to a mechanism, namely the coupling effect,” but this seems to be only a conjecture, with no theoretical or empirical evidence. The authors should provide strong evidence for this claim, which is the motivation of the method.
    • What is the extra computational cost of PDMM? Compared to other adapter methods such as MedSAM, PDMM seems to introduce many more trainable parameters, which may make the comparison unfair.
    • How is inference performed? Is PRIM used at the inference stage?
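On the IoU-supervision question: in SAM, the IoU head's scalar output is regressed (with an MSE loss) against the actual IoU of the predicted mask and the ground truth, and the author feedback states PRIM follows the same mechanism. A minimal NumPy sketch of that supervision signal (function names are illustrative, not from the DeSAM repository):

```python
import numpy as np

def mask_iou(pred, gt):
    """IoU between two binary masks (arrays of 0/1)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def iou_head_loss(predicted_iou, pred_mask, gt_mask):
    """MSE regression target for an IoU head, as in SAM's mask decoder:
    the head's scalar output is trained toward the actual IoU of the
    predicted mask against the ground truth."""
    target = mask_iou(pred_mask, gt_mask)
    return (predicted_iou - target) ** 2
```

At inference there is no ground truth, so the trained head's output serves as a self-estimated quality score for filtering or ranking candidate masks.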
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, although I recognize the novelty of the method and framework, many concerns have been raised that the authors have not properly discussed. I will give a Weak Accept for now, but I may lower my score if my concerns are not well addressed.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see weaknesses

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The prompt-driven SAM faces difficulties in automatically segmenting images using generalized bounding boxes and grid points as prompts. This challenge stems from what the authors call the “coupling effect,” where the interactions between the input embedding and the prompts in SAM’s cross-attention transformer layers heavily influence the final segmentation output. Despite attempts to fine-tune the model, it remains sensitive to generalized prompts. To address this issue, the authors propose DeSAM, which introduces two modules, PRIM and PDMM, aimed at decoupling the dependency between prompts and the input embedding while still leveraging SAM’s pretrained weights. PRIM, the Prompt-Relevant IoU Module, calculates IoU scores and generates mask embeddings, while PDMM, the Prompt-Decoupled Mask Module, extracts multi-scale features from SAM’s intermediate layers and combines them with the mask embeddings from PRIM.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors conducted ablation studies to evaluate the significance of the two introduced modules, PRIM and PDMM, in providing interpretability.

    2. The paper is well-organized and easy to comprehend. The concept, while straightforward, presents a novel approach to leveraging foundational models like SAM for efficient automatic segmentation of medical images.

    3. In a single-source domain generalization scenario, DeSAM exhibits robustness against unseen distribution changes by integrating image embeddings at various scales, eliminating the need for additional augmentations.

    4. Instead of discarding the pretrained SAM decoder, DeSAM efficiently utilizes its pretrained weights with its innovative two-module solution, effectively decoupling prompts and merging mask embeddings with input image embeddings.

    5. Quantitative results demonstrate that DeSAM exhibits no false positives in the background, with segmentation boundaries closely aligning with ground truth compared to existing state-of-the-art methods.

    6. The authors investigate two DeSAM variants: DeSAM-B, employing generalized bounding boxes as prompts, and DeSAM-P, using generalized grid points. Furthermore, an ablation study varying the number of grid points reveals that increasing their count does not degrade performance, underscoring the effectiveness of decoupling prompts from input embeddings as an effective solution.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. In Section 4.1, “Dataset and Implementation Details,” there is a repetition of the word “abdominal” in the phrase “1) cross-modality abdominal abdominal multiorgan segmentation.”

    2. The paper provides a concise discussion of related works, but there is room to further explore comparisons between DeSAM and previous approaches utilizing SAM. Emphasizing the unique approach of DeSAM in contrast to prior SAM-based methods would enhance the discussion.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The anonymized link to the code makes the paper reproducible. Furthermore, the methodology section offers comprehensive and detailed explanations, while the implementation details are meticulously described.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In the related works section, expanding on comparisons between DeSAM and previous SAM-based approaches could enrich the discussion. Highlighting the distinctiveness of DeSAM’s approach compared to earlier methods would add depth to the analysis.

    2. It would be valuable to investigate the number of parameters, inference time, and memory usage relative to current state-of-the-art techniques. This analysis would provide insights into the efficiency and computational requirements of the proposed approach compared to existing methods.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The simplicity and novelty of the idea are notable.
    2. The approach demonstrates high reproducibility.
    3. Both qualitative and quantitative results indicate significantly improved segmentation performance with DeSAM compared to the current state-of-the-art SAM-based methods.
    4. The effective utilization of a foundational model like SAM for medical imaging is of considerable interest to the medical community.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely appreciate your insightful comments and positive feedback on our paper. We are glad that you recognize the novelty and effectiveness of our method. We address the main concerns as follows:

Quantitative analysis (R1): We will include more metrics, such as Hausdorff distance, to comprehensively evaluate the segmentation performance. Min/max values and ranges will be added to provide a more detailed analysis.

PRIM vs. Mask Scoring R-CNN (R3): Thank you for pointing out the similarity between PRIM and Mask Scoring R-CNN in terms of decoupling mask generation and IoU prediction. We apologize for not making the connection in our original manuscript. However, it is important to note that our motivation differs significantly: the primary goal of our method is to decouple SAM’s mask generation process from the influence of prompts, thus improving its robustness to low-quality prompts in automatic segmentation scenarios. In contrast, Mask Scoring R-CNN aims to improve the post-processing step by adding an IoU prediction branch. Moreover, the IoU prediction in PRIM follows the same mechanism as in SAM, serving as a quality filter for generated masks. Thank you for the insightful comment.

Evidence of the “coupling effect” (R3): We will provide both theoretical and empirical evidence to support our claim. Due to the space limit, we did not show the quantitative results of prompt quality on output deviation, but will include them in a future journal version.

Computational cost (R3, R4): We will compare the parameter count, inference time, and memory usage of DeSAM with other methods. Results show that the overhead introduced by PDMM is minor compared to the performance gain. DeSAM can be trained on personal devices with an entry-level GPU, since our approach does not rely on tuning the heavyweight image encoder; during training, video memory usage was approximately 7.8 GB.

Discussion of related works (R4): We will expand the comparison between DeSAM and other SAM-based methods, emphasizing the novelty of our decoupling strategy in enhancing robustness and efficiency.

Typos and writing (R1): We appreciate your meticulous reading and apologize for the typos in our manuscript. We have carefully proofread the revised version and corrected the errors as follows:

  1. “Deep learning models achieves …” -> “Deep learning models achieve …”
  2. “Specifically, we design two new modules and added to the …” -> “Specifically, we design two new modules and add them to the …”
  3. “which fused the image embeddings…” -> “which fuses the image embeddings from the image encoder with the mask embeddings from PRIM…”
  4. The duplicated “abdominal” in “cross-modality abdominal abdominal multi-organ segmentation” has been removed.

We once again thank all reviewers for the opportunity to strengthen our work. Your recognition and valuable suggestions are highly appreciated!




Meta-Review

Meta-review not available, early accepted paper.


