Abstract

3D medical image segmentation is critical for clinical diagnosis and treatment planning. Recently, the foundational segmentation model SAM has been widely applied to medical images owing to its powerful generalization ability. However, existing SAM variants still have notable limitations, including a lack of 3D awareness and of automatic prompts. To address these limitations, we present a novel SAM-based segmentation framework for 3D medical images, namely 3D-SAutoMed. We propose an Inter- and Intra-slice Attention module and a Historical slice Information Sharing strategy to share local and global information, respectively, enabling SAM to be 3D-aware. Meanwhile, we propose a Box Prompt Generator that automatically produces prompt embeddings, making SAM fully automatic. Our results demonstrate that 3D-SAutoMed outperforms advanced universal methods and SAM variants on both metrics across the BTCV, CHAOS and SegTHOR datasets. In particular, a large improvement in HD score is achieved, e.g., 44% and 20.7% over the best of the other SAM variants on the BTCV and SegTHOR datasets, respectively.
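The Box Prompt Generator mentioned above is, per Review #3 and the author feedback, a DETR-style component whose learnable "default" queries predict box prompts when no previous-slice result is available. The snippet below is a minimal, hedged PyTorch sketch of that idea only; the class name, the number of classes, the 2-layer decoder and all sizes are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch (PyTorch) of DETR-style learnable queries predicting one box
# per organ class, in the spirit of the Box Prompt Generator described above.
# All names and hyperparameters are illustrative placeholders.
import torch
import torch.nn as nn

class BoxPromptGenerator(nn.Module):
    def __init__(self, num_classes: int = 8, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One learnable "default" query per class, used when no previous-slice
        # query exists (e.g. at the starting central slice, t = 0).
        self.default_queries = nn.Embedding(num_classes, dim)
        layer = nn.TransformerDecoderLayer(dim, num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.box_head = nn.Linear(dim, 4)   # (cx, cy, w, h), normalized to [0, 1]

    def forward(self, image_feats: torch.Tensor, prev_queries=None):
        # image_feats: (B, H*W, C) flattened slice features from the image encoder
        B = image_feats.shape[0]
        q = prev_queries if prev_queries is not None \
            else self.default_queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, image_feats)     # queries attend to the slice features
        boxes = self.box_head(q).sigmoid()   # (B, num_classes, 4)
        return boxes, q                      # boxes would then serve as SAM box prompts

# usage sketch
# feats = torch.randn(1, 64 * 64, 256)
# boxes, queries = BoxPromptGenerator()(feats)
```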

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2090_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Lia_3DSAutoMed_MICCAI2024,
        author = { Liang, Junjie and Cao, Peng and Yang, Wenju and Yang, Jinzhu and Zaiane, Osmar R.},
        title = { { 3D-SAutoMed: Automatic Segment Anything Model for 3D Medical Image Segmentation from Local-Global Perspective } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed 3D-SAutoMed for iterative 3D medical image segmentation with automatic prompt generation based on SAM. They used inter- and intra-slice attention and a historical slice information sharing strategy to share information from the local and global perspectives for 3D awareness, and they proposed a Box Prompt Generator to automatically generate prompt embeddings, enabling fully automatic segmentation. They compared their method with several SAM-based 3D segmentation methods and demonstrated improvements on the BTCV, CHAOS, and SegTHOR datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Adapting SAM for 3D medical image segmentation is an important area of research. The proposed methods are motivated by clearly-defined needs which makes the paper easy to follow and sensible. The authors conducted extensive experiments to compare with state-of-the-art 2D models and ablation studies to demonstrate the effectiveness of each component.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors did not compare with 3D CNN or ViT models.
    2. There is a lack of detail on how the model is trained, which results in poor reproducibility. Is the model trained iteratively? How do you deal with 3D volumes with different numbers of slices?
    3. The global 3D-aware module is not effective, as the information stored in the historical information token is too limited to induce global awareness. The authors should discuss possible reasons further.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. There is a lack of detail on how the model is trained, which results in poor reproducibility. Is the model trained iteratively? How do you deal with 3D volumes with different numbers of slices?
    2. Several items are not clearly defined in the paper, e.g., semantic embedding and information filter.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. What is the default query for slice t=0?
    2. How is the bounding box prompt initialized at the central slice? Does it require manual input or is it automatically generated?
    3. Please provide details for inter-slice attention. Do you need to transpose the feature map to learn attention across slices?
    4. The ablation study is not complete: the performance of global 3D-aware alone and of global + prompt generation is missing.
    5. Please clearly define the added semantic embedding for box prompt generation.
    6. “we add an information filter at the end to filter out irrelevant information.” Please provide details about the filter.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed methods are motivated by clearly-defined needs which makes the paper easy to follow and sensible. The authors conducted extensive experiments to compare with state-of-the-art 2D models and ablation studies to demonstrate the effectiveness of each component. However, there is a lack of details about the model which would lead to poor reproducibility.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the questions about the model and experiments and claim they will release the code after acceptance. They also claim they will add results comparing with 3D deep learning models.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a 3D-aware Segment Anything Model (SAM) for 3D medical image segmentation. Specifically, the proposed method incorporates both local and global aspects of input slices through inter-slice and intra-slice attention, along with historical information sharing. Experimental results indicate that the proposed method performs comparably to previous works.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Although the network components are designed using various existing deep network layers (e.g., slice attention based on multi-head self-attention), their roles are clearly described and well justified.

    The experimental results offer comprehensive descriptions and analyses of the quantitative and qualitative results.

    Overall, the paper is easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are some unclear technical descriptions that must be revised thoroughly. Ablation studies should provide more analyses about network components such as intra- and inter-slice attentions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is always recommended to publicly release training and evaluation code for the research community. Without official implementations, it can be challenging to validate the feasibility of the ideas proposed in this paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should provide more detailed explanations on the following points:

    • LoRA [8]: At the beginning of Sec 2, LoRA is mentioned too briefly. It is recommended to give a more detailed explanation for readers unfamiliar with this concept.
    • Multi-head self-attention (MHSA) and inter-SA: While the use of MHSA in Intra-SA is straightforward to understand, the implementation of Inter-SA with MHSA requires further clarification. Specifically, does the module process [f^{t-1}, f^{t}] and [f^{t}, f^{t+1}] separately? Please elaborate on the structure and workflow of both Intra-SA and Inter-SA to aid understanding.
    • Default query at the beginning: It is unclear how the initial queries are generated in Eqn. (4) without using the tokens from the previous slice. How is the default query obtained at the beginning? This requires further explanation.
    • Inconsistent Notation: There is inconsistency in the use of notations token_{his} and tok_{his} in the manuscript. Please revise these to maintain consistency throughout the paper.
    • Term Revision: The term “3D-aware” leads to awkward phrasing in some sentences. Consider revising it to “3D-awareness” where appropriate to improve readability.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript, in its current form, is not yet suitable for publication in MICCAI. However, the reviewer is willing to reconsider the final score if the authors address the weaknesses mentioned above in their rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Based on the provided rebuttal, the reviewer found that several important details of the proposed method are missing from the current manuscript; they should have been included in the original submission. Since the paper would need a more thorough revision, the reviewer concluded that it would be difficult to complete it within the time required for publication this time.



Review #3

  • Please describe the contribution of the paper

    The manuscript introduces a new method for 3D image segmentation based on SAM. The method is able to use 3D context using inter- and intra-slice attention and, thanks to a box prompting system, is fully automated. The authors show that 3D-SAutoMed is competitive in a number of public benchmarks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The method described overcomes most of SAM’s limitations when applied to biomedical data and obtains very good results on various benchmarks.
    • In particular, using the HIS strategy to improve the 3D context is a very interesting idea that targets a common limitation of generalizing 2D models to 3D tasks (although ablation studies show that the improvement is mainly due to the Local 3D-aware).
    • Using a DETR-based prompt generation is also interesting since it allows the model to not rely too heavily on the segmentation quality in the previous slices.
    • The manuscript is clear and well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • DETR-based models usually require long training (due to slow convergence); this, together with the fine-tuning of SAM and all the other blocks, makes it unclear how computationally expensive the model is to train. On the same note, the authors also do not comment on the inference efficiency of such a complex architecture. This might be a limiting factor when scaling the method up to the segmentation of very large volumes. A short comparison of runtime/memory footprint would have made the manuscript stronger.
    • Section 3 of the manuscript, “Experiments and Results,” is short and lacks critical details. For example, it is unclear to the reviewer what training routine and hyperparameters were used for each baseline method. These details are fundamental to understanding how much effort the authors put into comparing all methods fairly. Moreover, the lack of source code makes it difficult to reproduce the results presented and exacerbates the issue.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • As mentioned in section 6, a short discussion of the model’s computational complexity (at train and test) would strengthen the manuscript.
    • The method results could have been much stronger if they were attained in a public challenge with standardized evaluation on a hidden test set.
    • The authors should comment on their stance on submitting their source code. The method is well described in the manuscript, but full access to the models and experiments would be precious for the MICCAI community.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    3D-SAutoMed introduces an interesting combination of ideas. The method contributions address some major limitations of using SAM-based models for biomedical applications. Although the reviewer maintains that the “experiments” sections are incomplete, they would still like the manuscript to be accepted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Regarding the author’s response: Q5: Although the test labels are not public, the evaluation site for some challenges still accepts submissions. Q6: I still don’t see why it is “unfair”. Some other methods in the baseline are fully automated, and in my review, I mentioned that any discussion of computational complexity and limitations would have been welcome.

    Although the response did not fully address my concerns, they were minor issues overall. As a reviewer, I still believe the manuscript should be accepted.




Author Feedback

Thanks to the reviewers for their time and insightful comments. They found our work novel (R4), well-organized (R1, R4, R5) and effective (R1, R4, R5), but also pointed out some issues. We clarify the main points below.

1. Details of methods. We will release the code after acceptance to provide more details.
Q1: Training and inference (R1). During training, we conduct iterative training. Specifically, when training on slice t, we first select the previous n (randomly chosen from 1 to 3) slices for iterative inference to obtain the token_his of slice t. Since our model is slice-based and token_his is shared and fixed in length, the model can flexibly handle 3D volumes with different numbers of slices.
Q2: Default query at t=0 (R1, R5). Since there is no previous-slice result to initialize the query at t=0, we additionally define a learnable default query to make the initial prediction. The default query is in fact the same as the object query in DETR; both are learnable embedding vectors optimized during training. We also tried to initialize the query at t=0 directly with a centered box, but the results show that the learnable default query works better.
Q3: Bbox prompt in the beginning slice (R1, R5). Our entire segmentation framework is fully automated, without any human interaction. As mentioned in Q2, we use the default query to predict the bbox prompt in the beginning central slice (t=0).
Q4: Inter-slice attention (R1, R5). For inter-slice attention, self-attention is performed on the features of slices t-1, t and t+1. Leveraging the continuity of targets in 3D medical images, we compute attention only between features at the same spatial position across slices, which effectively reduces computational complexity, as illustrated by the yellow patches in Figure 1.
Q5: Semantic embedding and information filter (R1). The semantic embedding is intended to let each query perceive the category for which it is responsible. Specifically, we add a corresponding learnable semantic token for each query; these semantic tokens are randomly initialized and optimized during training. The information filter consists of an MLP and normalization, aiming to retain the inherent global information and filter out redundancy.
Q6: Global 3D-aware (R1). We found in our experiments that longer token sequences did not significantly improve performance. Because of the continuity of 3D medical images, the global feature information the model learns is relatively stable; therefore, an overly large historical information token leads to redundancy.
Q7: LoRA (R5). Thanks for the suggestion. Indeed, we should briefly introduce the LoRA technique so that readers can follow.

2. Experiment.
Q1: 3D model comparison (R1). In fact, with our proposed 3D-aware strategy, our method remains competitive compared with advanced 3D models. We will compare against these methods in future work.
Q2: Details of comparison methods (R4). We follow the basic configuration parameters and training pipeline provided by each comparison method.
Q3: Ablation study on global 3D-aware + prompt generation (R1). Due to space constraints, we did not show the complete results of the ablation study; we will refine them in future work.
Q4: Ablation study on slice attention (R5). For the ablation study on the inter- and intra-slice attention component, we use Local 3D-aware to represent this component in Table 4. Due to space constraints, we do not show the ablation results for each individual component; we will refine them in future work.
Q5: Comparison on the test set (R4). Because the test sets of these datasets do not provide labels, it is difficult to compare directly on the test sets.
Q6: Runtime and memory (R4). It would be unfair to directly compare runtime and memory in this paper: our method's advantage is being fully automatic, while other SAM variants need manual prompts.

3. Writing.
Q1: Writing (R5). We will check and revise the full text for writing problems.
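Q4 of the rebuttal above states that inter-slice attention is computed only between features at the same spatial position across slices t-1, t and t+1. Below is a minimal, hedged PyTorch sketch of one way such an operation could be realized; the class name, feature shapes and the residual/LayerNorm arrangement are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch (PyTorch), NOT the authors' released code: self-attention is
# restricted to the three neighbouring slices (t-1, t, t+1) at the same (h, w)
# spatial position, as described in rebuttal Q4.
import torch
import torch.nn as nn

class InterSliceAttention(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (S, H, W, C) features of the S = 3 neighbouring slices
        S, H, W, C = feats.shape
        # Treat each spatial position as an independent batch element and the
        # slice axis as the sequence, so attention mixes information only
        # across slices at the same (h, w) location.
        x = feats.permute(1, 2, 0, 3).reshape(H * W, S, C)   # (H*W, S, C)
        out, _ = self.attn(x, x, x)                          # attend over S slices
        x = self.norm(x + out)                               # residual + norm
        return x.reshape(H, W, S, C).permute(2, 0, 1, 3)     # back to (S, H, W, C)

# usage sketch: fuse image-encoder features of three adjacent slices
# feats = torch.randn(3, 64, 64, 256)
# fused = InterSliceAttention()(feats)
```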




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper receives an initial review of 2WR (R1, R5) and 1WA (R4). After rebuttal, R1 changes to A (R1 has limited expertise in this topic), based on the argument that “Authors claimed to add results for comparison with 3D deep learning models.”

    After going through the paper, the reviews and the rebuttal carefully, I agree with R1's and R5's initial reviews that there are several critical issues with the experimental comparison and results. 1) One of the big flaws is that the proposed method is a 3D SAM-based segmentation model, yet there are no 3D comparison methods. Although the authors claim they will add comparisons to 3D methods in the future, under the MICCAI review guidelines the paper is judged on its current form. 2) Both R1 and R5 point out that the proposed global 3D-aware module is not effective, as shown in the ablation results (row 2 vs. row 3 in Table 4). 3) The paper also lacks important method and training details in its current form. 4) The authors use 3 public datasets for evaluation; however, they did not report results on the test sets (some of which still accept submissions, as pointed out by R4). Therefore, this paper falls short of acceptance in its current form.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors propose 3D-SAutoMed for 3D medical image segmentation using SAM with automatic prompt generation. It employs inter- and intra-slice attention and a historical slice information sharing strategy for 3D context. Validated on BTCV, CHAOS, and SegTHOR datasets, it shows improvements over existing methods. Key strengths include addressing SAM’s 3D segmentation limitations and comprehensive experimental validation. However, it lacks comparisons with 3D CNN or ViT models, detailed training methodologies, and an analysis of computational efficiency. The technical details need to be enhanced.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The two meta-reviewers have different opinions, and I think both make sense. From my perspective, 2.5D can be regarded as a kind of 3D. The lack of 3D baselines is a limitation, but overall the slice attention is an interesting idea to be discussed at the conference.



