Abstract

End-to-end medical image segmentation is of great value for computer-aided diagnosis, but it is dominated by task-specific models that usually suffer from poor generalization. With recent breakthroughs brought by the segment anything model (SAM) for universal image segmentation, extensive efforts have been made to adapt SAM for medical imaging, but these still encounter two major issues: 1) severe performance degradation and limited generalization without proper adaptation, and 2) semi-automatic segmentation relying on accurate manual prompts for interaction. In this work, we propose SAMUS as a universal model tailored for ultrasound image segmentation and further enable it to work in an end-to-end manner, denoted as AutoSAMUS. Specifically, in SAMUS, a parallel CNN branch is introduced to supplement local information through cross-branch attention, and a feature adapter and a position adapter are jointly used to adapt SAM from natural to ultrasound domains while reducing training complexity. AutoSAMUS is realized by introducing an auto prompt generator (APG) to replace the manual prompt encoder of SAMUS to automatically generate prompt embeddings. A comprehensive ultrasound dataset, comprising about 30k images and 69k masks and covering six object categories, is collected for verification. Extensive comparison experiments demonstrate the superiority of SAMUS and AutoSAMUS against the state-of-the-art task-specific and SAM-based foundation models. We believe the auto-prompted SAM-based model has the potential to become a new paradigm for end-to-end medical image segmentation and deserves more exploration. Code and data are available at https://github.com/xianlin7/SAMUS.
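The abstract states that the CNN branch supplements local information "through cross-branch attention" but gives no formula. As a rough illustration only, the sketch below uses generic scaled dot-product cross-attention in which ViT tokens act as queries and CNN feature positions act as keys and values, followed by a residual addition. The function name `cross_branch_attention`, the projection matrices `w_q`, `w_k`, `w_v`, the residual fusion, and the toy sizes are all assumptions made for illustration; they are not taken from the paper or the SAMUS repository.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_branch_attention(vit_tokens, cnn_tokens, w_q, w_k, w_v):
    """Hypothetical fusion: ViT tokens query CNN tokens, result is added back residually."""
    q = matmul(vit_tokens, w_q)          # queries come from the ViT branch
    k = matmul(cnn_tokens, w_k)          # keys come from the CNN branch
    v = matmul(cnn_tokens, w_v)          # values come from the CNN branch
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s * scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]   # one distribution over CNN positions per ViT token
    attended = matmul(weights, v)
    fused = [[t + a for t, a in zip(tok, att)] for tok, att in zip(vit_tokens, attended)]
    return fused, weights

def random_matrix(rows, cols, rng):
    """Small random matrix used as stand-in features and projection weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
dim = 4                                   # toy embedding width (assumption)
vit_tokens = random_matrix(3, dim, rng)   # 3 ViT patch tokens
cnn_tokens = random_matrix(5, dim, rng)   # 5 CNN feature-map positions
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

fused, weights = cross_branch_attention(vit_tokens, cnn_tokens, w_q, w_k, w_v)
print(len(fused), "fused tokens of width", len(fused[0]))
```

Each ViT token ends up with one attention distribution over all CNN positions, which is one plausible reading of "each patch in the ViT-branch assimilates local information from the CNN-branch"; the actual module may differ in normalization, heads, and where it is inserted.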

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1336_paper.pdf

SharedIt Link: https://rdcu.be/dZxc6

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_3

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1336_supp.pdf

Link to the Code Repository

https://github.com/xianlin7/SAMUS

Link to the Dataset(s)

https://github.com/xianlin7/SAMUS

BibTex

@InProceedings{Lin_Beyond_MICCAI2024,
        author = { Lin, Xian and Xiang, Yangyang and Yu, Li and Yan, Zengqiang},
        title = { { Beyond Adapting SAM: Towards End-to-End Ultrasound Image Segmentation via Auto Prompting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {24--34}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a foundation-model-based paradigm that first transfers the strong feature representation ability of SAM to the domain of medical image segmentation (SAMUS), and then extends the trained SAMUS into an automatic version (i.e., AutoSAMUS) to flexibly handle various downstream segmentation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A feature adapter and a position adapter are developed to fine-tune the ViT image encoder from natural to medical domains.
    2. A parallel CNN-branch image encoder is proposed to run alongside the ViT-branch and a corresponding cross-branch attention module is designed to enable each patch in the ViT-branch to assimilate local information from the CNN-branch.
    3. An auto prompt generator with learnable task tokens is proposed to replace the manual prompt encoder of SAMUS for generating task-related prompt embeddings.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. A new method is proposed for ultrasound image segmentation, yet there is no description of what makes it specific to ultrasound.
    2. Some details of the model are confusing.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Since the title indicates that the new method is designed for ultrasound image segmentation, the paper should clarify which module of the model targets which characteristic of ultrasound images. Otherwise, this is a common model rather than a task-specific model.
    2. Some details of the model should be described clearly. For example, in the position adapter, is the positional embedding the same as that in SAM? Additionally, how is “T0” split into the task token and the output token in Fig. 2? Also in Fig. 2, a point prompt is included; is it used in the training process?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method is not related to ultrasound images, which is inconsistent with the title.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The authors did not convince me that the model was specifically designed for ultrasound images.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a universal model for ultrasound medical image segmentation. By introducing an auto prompt generator, the proposed model can automatically generate prompt embeddings and achieve end-to-end segmentation. Promising results are reported using both in-distribution and out-of-distribution testing data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. SAM for medical image segmentation is a hot topic now, and this paper can contribute to the advancement in this field.
    2. Introducing an auto prompt generator to make the segmentation procedure automatic is important for clinical applications.
    3. Promising results were achieved.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The model is quite complex.
    2. There is no specific design for ultrasound image segmentation. Why do the authors focus on ultrasound imaging?
    3. The experimental design needs further elaboration. Right now, it is quite confusing. For example, how to guarantee that the compared foundation methods are properly trained?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The model is quite complex. I am not sure that the design is optimized.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. First of all, the supplementary file cannot be opened.
    2. More details regarding the dataset (sample size, etc.) and the re-implementation of comparison methods should be provided either in the main text or the supplementary file.
    3. Since the authors state that the image encoder of the proposed model is carefully modified to address the challenges of both inadequate local features and excessive computational memory consumption, the computational memory consumption of the different methods should be provided for comparison.
    4. What is k in the task tokens? What is meant by the number of task tokens? What is its value, and how is it determined?
    5. Discuss why the proposed model is suitable for ultrasound image segmentation instead of other modalities.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper has merits by contributing to the SAM in medical image segmentation field. The results are promising. However, the method is complex. Some details also need further elaboration before acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have tried their best to address my comments. Although I am still not sure about the specific suitability of the model for ultrasound images or about the soundness of the re-implementation of the comparison methods, I suggest accepting the paper.



Review #3

  • Please describe the contribution of the paper

    The authors proposed to adapt SAM for end-to-end ultrasound image segmentation. They have several innovations: 1) reduced the input size to 256x256 and used a feature adapter and a position adapter to fine-tune the SAM encoder with the reduced input size; 2) added a parallel CNN branch alongside the ViT branch to supplement low-level information; 3) introduced an auto prompt generator to automatically generate prompt embeddings. They compared their method with SAM-based methods and with 2D CNN-based and 2D ViT-based models on several public ultrasound datasets. Their method demonstrated the best performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors proposed an end-to-end ultrasound segmentation framework with automatic prompt generation based on SAM, demonstrating better segmentation performance than other state-of-the-art SAM-based methods. Details are below:

    1. As medical images have a smaller input size than SAM's input, they proposed a novel way to use adapters for the smaller ultrasound images and to fine-tune the SAM encoder.
    2. The parallel CNN branch has demonstrated effectiveness through ablation study.
    3. They used a learnable task token for automatic prompt generation. Overall, the proposed method is sensible for the discussed problems, the experiments were extensively performed, and the paper is easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. This method is restricted to certain predefined segmentation tasks, which limits its application.
    2. It is unclear how the testing was performed for the compared SOTA SAM-based methods. Was the input size 1024x1024 or 256x256? The authors should make a clear distinction between results from 1024x1024 and results from 256x256.
    3. The AutoSAMUS- result in Table 3 is better than the best result in Table 4 on the DDTI and UDIAT datasets, indicating that the APG module by itself is better than all the components in SAMUS. But APG does not provide fine-tuning of the encoder. Detailed discussion is needed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. How much GPU memory is saved with the proposed method compared to original SAM?
    2. There is no result showing whether overlapped patches improve segmentation performance.
    3. More discussion on how the parallel CNN branch improves the SAM encoder should be included.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors proposed an end-to-end segmentation framework based on SAM for ultrasound image segmentation. They used a smaller input size and fine-tuned the ViT encoder with feature and position adapters. They added a task token for automatic prompt generation. Overall, the methods are innovative and effective. However, the authors need to clarify how they conducted comparisons with other SAM-based methods, since those methods use the original input size, and a more detailed discussion of automatic prompt generation should be included.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision
    1. The authors claim that SAMUS is a universal segmentation model that is not restricted to any predefined segmentation task, and that a new category not included in SAMUS is addressable by fine-tuning SAMUS via zero-/few-shot learning. The model is designed for ultrasound and I don’t think it’s possible for the model to generalize well to a new segmentation category through few-shot learning, let alone zero-shot.
    2. There is no result showing whether overlapped patches improve segmentation performance.




Author Feedback

We thank the reviewers for their valuable comments and recognition of SAM tuning (R1, R3, R4), auto prompting (R1, R3, R4), promising results (R3, R4), and extensive experiments (R4). Major concerns are addressed as follows:

Q: Why and how is SAMUS designed for ultrasound? (R1, R3)
A: The primary goal of foundation models is to provide a powerful backbone for various downstream tasks. Given severe modality gaps in medical imaging such as WSI, CT, and US, we focus on one of the most frequently used modalities, namely ultrasound. To build a foundation ultrasound image segmentation model, we design/construct: (1) A large ultrasound dataset (US30K) for adapting SAM into SAMUS for performance improvement across various ultrasound tasks as shown in Fig. 3. (2) A CNN-branch image encoder and a cross-branch attention to supplement local fine features to address blurred boundaries and complex object shapes in ultrasound imaging. This is why SAMUS outperforms other foundation models on ultrasound image segmentation as stated in Tabs. 1&2. (3) Pioneering exploration of auto-prompting for end-to-end segmentation. It is motivated by the fact that ultrasound is frequently used by clinicians, while vanilla SAM's reliance on manual prompts imposes a heavy burden. Following SAMCT, SAMedOCT, and other SAM-related models named according to modalities or tasks, we name it SAMUS.

Q: The supplementary file cannot be opened. (R3)
A: It contains more details of datasets, visualization, performance comparison, GPU and FLOPs comparison, and ablation studies. To avoid possible issues, it will be published online for reference.

Q: More model details. (R1, R3, R4)
A: [For R1&R4] In SAMUS, as shown in Fig. 1, we apply a position adapter to the positional embeddings of SAM to generate new position embeddings. As convolution excels at capturing local and fine features, the CNN branch is used to transfer its detailed information to the ViT branch of vanilla SAM. In the auto prompt generator, as stated in the first paragraph of page 6, T0 is the output tokens rather than the combination of task tokens and output tokens. Specifically, the task tokens Tt are newly introduced and learnable, while T0 denotes the output token embeddings of the frozen mask decoder. It should be clarified that no point prompt is included in APG. When using APG instead of SAM’s prompt encoder to extend SAMUS to AutoSAMUS, no manual prompts were used during training/inference. [For R3&R4] For each downstream task, the number/length of corresponding task tokens is k. As stated in Tab. 3 of the supplementary file, the quality of generated auto prompt embeddings first increases with k and then tends to saturate. Thus, the value of k is set to 10 in our experiments. [For R4] It should be clarified that SAMUS is a universal segmentation model and not restricted to any predefined segmentation task. Given a new category not included in SAMUS, it is addressable by fine-tuning SAMUS via zero-/few-shot learning.

Q: More experimental details. (R3, R4)
A: For a fair comparison, SOTA foundation models are re-implemented and trained for 400 epochs on US30K under the same settings using the same single-point prompts. For foundation models requiring resolutions of 1024×1024 (i.e., SAM, MedSAM, MSA) and 512×512 pixels (i.e., SAMed), images were resized accordingly. Note that although their input sizes are different, the output sizes are consistent (i.e., 256×256 pixels) with SAMUS for comparison. The GPU memory costs of SAM and SAMUS are 15.34G vs. 4.30G, showing SAMUS’s stronger deployability.

Q: AutoSAMUS- in Tab. 3 outperforms Tab. 4. (R4)
A: Methods in Tab. 3 are trained and tested on DDTI, UDIAT, and HMC-QU. Methods in Tab. 4 are trained only on TN3K and BUSI and tested on DDTI and UDIAT for generalization evaluation. They are not comparable. In Tab. 3, without fine-tuning the encoder, AutoSAMUS- still outperforms task-specific models on DDTI and UDIAT, showing the value of SAMUS and APG.
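To put the quoted figures in proportion, the short calculation below derives the memory ratio and relative saving from the 15.34G and 4.30G values given in the rebuttal, and the pixel-count ratios of the 1024×1024 and 512×512 inputs relative to the common 256×256 output. The variable names are illustrative, and the measurement conditions behind the memory figures (batch size, precision, training vs. inference) are not stated in the rebuttal, so the ratio should be read as indicative only.

```python
# Figures quoted in the author feedback (GPU memory, in GB).
sam_memory_gb = 15.34
samus_memory_gb = 4.30

# How many times more memory SAM needs, and the fraction SAMUS saves.
memory_ratio = sam_memory_gb / samus_memory_gb
memory_saving = 1.0 - samus_memory_gb / sam_memory_gb

# Side lengths mentioned in the feedback; 256 is the common output size.
input_sides = {
    "SAM / MedSAM / MSA": 1024,
    "SAMed": 512,
    "output (all models)": 256,
}
reference_pixels = 256 * 256
pixel_ratio = {name: (side * side) // reference_pixels for name, side in input_sides.items()}

print(f"SAM uses {memory_ratio:.2f}x the GPU memory of SAMUS")
print(f"relative saving: {memory_saving:.1%}")
for name, ratio in pixel_ratio.items():
    print(f"{name}: {ratio}x the pixels of a 256x256 image")
```

The pixel ratios (16x and 4x) show how much more input the resized baselines process per image, which is relevant to the reviewers' question about whether results at different input sizes are directly comparable.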




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper receives mixed reviews: 2 positive (one of the reviewers is not familiar with this topic) and 1 negative. The paper develops a SAM-based model for ultrasound medical image segmentation using a large-scale collection of 7 ultrasound datasets, and demonstrates improved results. Although it is overall a reasonable contribution, several concerns are still valid. 1) All reviewers mentioned that the model was proposed for ultrasound image segmentation; however, the specifically designed modules have no link to the characteristics of ultrasound images. 2) The proposed method shows only minor improvement compared to an adapter-based SAM model, MSA [8] (e.g., on UDIAT, TN3K, and CAMUS). 3) The cross-branch (CNN and transformer) module may not be generally effective, as shown in the last two columns of Table 4 (row 3 vs. row 2).

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers found that the method is innovative and effective, contributing to SAM in medical image segmentation, with promising results. The rebuttal successfully addresses most of the comments. However, the rebuttal response regarding the specific design of the method for the US image is not convincing, as already pointed out by R1 and R2. Considering the certain level of novelty in the methodology and the clear experimentation and ablation study, I am inclined to accept this paper and give the MICCAI community the opportunity to discuss it further.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



