Abstract

Segment anything models (SAMs) are gaining attention for their zero-shot generalization capability in segmenting objects of unseen classes and in unseen domains when properly prompted. Interactivity is a key strength of SAMs, allowing users to iteratively provide prompts that specify objects of interest to refine outputs. However, to realize the interactive use of SAMs for 3D medical imaging tasks, rapid inference times are necessary. High memory requirements and long processing delays remain constraints that hinder the adoption of SAMs for this purpose. Specifically, while 2D SAMs applied to 3D volumes contend with repetitive computation to process all slices independently, 3D SAMs suffer from an exponential increase in model parameters and FLOPS. To address these challenges, we present FastSAM3D, which accelerates SAM inference to 8 milliseconds per 128×128×128 3D volumetric image on an NVIDIA A100 GPU. This speedup is accomplished through 1) a novel layer-wise progressive distillation scheme that enables knowledge transfer from a complex 12-layer ViT-B to a lightweight 6-layer ViT-Tiny variant encoder without training from scratch; and 2) a novel 3D sparse flash attention to replace vanilla attention operators, substantially reducing memory needs and improving parallelization. Experiments on three diverse datasets reveal that FastSAM3D achieves a remarkable speedup of 527.38× compared to 2D SAMs and 8.75× compared to 3D SAMs on the same volumes without significant performance decline. Thus, FastSAM3D opens the door for low-cost, truly interactive SAM-based 3D medical imaging segmentation with commonly used GPU hardware. Code is available at https://anonymous.4open.science/r/FastSAM3D-v1

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2456_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2456_supp.pdf

Link to the Code Repository

https://github.com/arcadelab/FastSAM3D

Link to the Dataset(s)

N/A

BibTex

@InProceedings{She_FastSAM3D_MICCAI2024,
        author = { Shen, Yiqing and Li, Jingxing and Shao, Xinyuan and Inigo Romillo, Blanca and Jindal, Ankush and Dreizin, David and Unberath, Mathias},
        title = { { FastSAM3D: An Efficient Segment Anything Model for 3D Volumetric Medical Images } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an efficient, lightweight medical SAM for interactive 3D volume segmentation. The authors propose an efficient 3D image encoder with sparse flash attention, trained by distillation, to alleviate the high computational cost of the original SAM-Med3D encoder.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper introduces a 6-layer ViT-Tiny medical image encoder with 2 FFN layers and 4 Transformer layers. In addition, a 3D sparse flash attention is proposed to accelerate inference.

    2. This paper proposes a layer-wise progressive distillation approach to transfer representational knowledge from a 12-layer ViT to a 6-layer lightweight 3D medical encoder.

    3. Zero-shot inference on CT and MRI datasets to verify the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The inference performance shown in Table 2 on the AMOS and BraTS datasets is too low, making it difficult to be convinced of the effectiveness of the method.

    2. The description of the 3D sparse flash attention is not clear, and the method pipeline is not clearly described; it can only be understood by checking the paper's code. In addition, the three types of sparse attention shown in Fig. 1 (lower right) are not mentioned in the paper.

    3. The paper lacks descriptive details of experiments and data preparation and preprocessing.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is recommended to add a detailed description of the method pipeline and experimental details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please check the experimental evaluation method. The performance of the proposed method and of the other compared methods is too low (e.g., the Dice score of nnUNet [1] on AMOS can reach more than 88% [2], but that of FastSAM3D in this article is lower than 50%), so the experimental results are doubtful. In addition, the authors could refer to Slide-SAM [3] and SegVol [4], which compare with fully supervised methods (e.g., nnUNet), to verify the effectiveness of this efficient interactive segmentation method.

    2. The MobileSAM [5] distillation method is mentioned in this paper; it is recommended to add a comparative discussion with this distillation method to verify the superiority of the proposed layer-wise progressive distillation method. In addition, what are the advantages of the proposed distillation method compared with the SAMI [6] method in EfficientSAM? It is recommended to add a discussion comparing with the SAMI method.

    3. The authors should give a proof or an ablation study verifying the motivation behind deleting the attention modules of the first 2 layers.

    4. In practice, the network needs to be deployed on a real medical platform, and the ops of the network may not be hardware-friendly. Unfriendly ops can seriously impact the runtime of the network. Thus, the authors should give a runtime comparison with other methods on a real platform.

    5. It is recommended that the authors revise the description of the whole paper and add the training strategy to increase the readability of the paper.

    [1] Isensee F, Jaeger P F, Kohl S A A, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation[J]. Nature methods, 2021, 18(2): 203-211.
    [2] Roy S, Koehler G, Ulrich C, et al. Mednext: transformer-driven scaling of convnets for medical image segmentation[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023: 405-415.
    [3] Quan Q, Tang F, Xu Z, et al. Slide-SAM: Medical SAM Meets Sliding Window[C]//Medical Imaging with Deep Learning. 2024.
    [4] Du Y, Bai F, Huang T, et al. Segvol: Universal and interactive volumetric medical image segmentation[J]. arXiv preprint arXiv:2311.13385, 2023.
    [5] Zhang C, Han D, Qiao Y, et al. Faster segment anything: Towards lightweight sam for mobile applications[J]. arXiv preprint arXiv:2306.14289, 2023.
    [6] Xiong Y, Varadarajan B, Wu L, et al. Efficientsam: Leveraged masked image pretraining for efficient segment anything[J]. arXiv preprint arXiv:2312.00863, 2023.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Unconvincing experimental results and lack of discussion of other SAM-based distillation methods.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Although some of my questions were answered by the authors, the Dice score of all methods provided in the paper is too low (less than 55%), which is unconvincing. Especially given that fully supervised methods such as nnUNet, CLIP-driven models, TotalSegmentator, etc. can achieve better segmentation results (70-90% Dice), this may raise concerns about the use and promotion of this method in the community.

    Based on my comments and the above concerns, I maintain my original score.
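
    For readers weighing the Dice figures debated in this review, a minimal sketch of the standard Dice coefficient on binary masks follows. It is a generic illustration, not the paper's evaluation code: the function name `dice_score`, the flat-list mask representation, and the convention of returning 1.0 when both masks are empty are all assumptions made here.

```python
from typing import Sequence


def dice_score(pred: Sequence[int], target: Sequence[int]) -> float:
    """Dice = 2 * |pred AND target| / (|pred| + |target|) for flat binary masks.

    A 3D volume would be flattened to a 1D sequence before calling this.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of voxels")

    # Voxels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)

    if total == 0:
        # Both masks are empty; treating this as perfect agreement is an
        # assumed convention, not something stated in the paper.
        return 1.0
    return 2.0 * intersection / total


# Toy example: one overlapping voxel, two foreground voxels in each mask.
prediction = [1, 1, 0, 0]
ground_truth = [1, 0, 1, 0]
print(dice_score(prediction, ground_truth))
```

    How scores are averaged over classes, cases, and prompt counts is not specified on this page, so absolute values from different papers may not be directly comparable.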



Review #2

  • Please describe the contribution of the paper

    This paper presents FastSAM3D to accelerate SAM inference to 8 milliseconds per 128×128×128 3D volumetric image on an NVIDIA A100 GPU. The FastSAM3D introduces a layer-wise progressive distillation scheme to transfer knowledge from a complex ViT-B to a lightweight ViT-Tiny. It also proposes a novel 3D sparse flash attention to replace vanilla attention operators to reduce memory consumption. The FastSAM3D achieves much faster speed without significant performance decline.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written, well-organized, and easy to follow.
    • 3D Sparse Flash Attention is interesting. The vanilla attention is computationally expensive, so it is reasonable to diminish the overall number of tokens subjected to the attention process to reduce the computation costs. It is also appreciated that the Parallel Processing with Flash Attention is further used to enhance the efficiency.
    • The proposed FastSAM3D achieves comparable performance to some state-of-the-art methods and considerably reduces the computation costs.
    • The ablation studies are provided to verify the effectiveness of 3D Sparse Flash Attention.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The layer-wise progressive distillation scheme has been adopted to transfer knowledge from a teacher to a student in computer vision; applying it to compress SAM for medical image segmentation has some merits. But please consider comparing with layer-wise distillation schemes such as FITNETS (FITNETS: Hints for Thin Deep Nets. ICLR2014).

    2. Please also verify the effectiveness of the layer-wise progressive distillation scheme.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall, the proposed method is novel and verified to be efficient without significant performance decline. This is beneficial for clinical use. Though there are some minor issues regarding comparing with the layer-wise distillation methods and the experimental evaluation of the distillation, the overall quality of this paper is good.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed FastSAM3D is interesting and has proven to be efficient and effective in experiments. The ablation and experimental analysis further provide a deeper understanding of how the proposed method works. This is important for clinical use and is the major factor for a weak acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    My concerns on the layer-wise progressive distillation have been addressed in the rebuttal. After reading the other reviews, I still support that this paper has some merits because improving the efficiency of SAM-based methods for medical image analysis can be beneficial for clinical practice when there are unseen segmentation targets. Although the DICE performance of the proposed method may not be comparable to some supervised methods like nnUNet, this drawback is not unique to the proposed method, but widely exists among many general-purpose medical segmentation methods. I agree that improving the segmentation performance of these general-purpose models is crucial, but it is not the focus of this paper and may deserve another work to specifically tackle this issue. Instead, the main research topic of this paper is the efficiency of general-purpose models in 3D medical segmentation which has been addressed. As such, I would like to keep my initial rating of “Weak Accept”.



Review #3

  • Please describe the contribution of the paper

    A distillation approach to transfer knowledge from a teacher architecture to a tiny encoder; Exploiting flash attention instead of the self-attention in SAM

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • Using the idea of knowledge distillation for faster inference;
    • Redesigning the typical image encoder form of SAM models, which leads to on-par performance but much better time efficiency;
    • Changing the typical logit-level knowledge distillation to a layer-wise procedure;
    • Using a 3D sparse attention instead of typical self-attention, which contributes most to the inefficiency of transformers in terms of time/computation complexity;
    • Good performance improvements compared to SOTA methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Lack of sufficient ablation study: As per my understanding, the authors use the first 6 layers of the teacher/student network for distillation. What about using other layers of the teacher network for that? How did you come up with the idea of using the first 6 layers?
    • Lack of detailed explanation of the attention mechanism: It would be beneficial if the authors had some preliminaries explaining the flash-attention mechanism and how it contributes to mitigating the computational burden. Some mathematical formulations showcasing the paper's procedure would be a plus.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see section 6.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to section 5.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed several of my concerns, which led me to lean towards acceptance of the manuscript.




Author Feedback

  1. Effectiveness of the layer-wise progressive distillation (All): In our ablation study, we compared our layer-wise progressive distillation scheme with (i) layer-wise distillation without progression (i.e., ‘FITNETS: Hints for Thin Deep Nets’, suggested by R1); and (ii) logits-level distillation (as in MobileSAM [ref.32], suggested by R3). Compared to our layer-wise progressive distillation, (i) converges ~3 times slower and underperforms, with a significant 3% decline in Dice score (p<0.05); (ii) MobileSAM-like logits-level distillation failed to converge in all our experiments. This might be because of the more challenging nature of training on 3D volumes, resulting in quite aggressive distillation factors, combined with the limited data available for distillation. We have modified the manuscript to clarify those points.

  2. Ablation on the layers of the teacher & student models (R3&4): Rather than only using the first 6 of the 12 teacher layers, we use all 12 layers as the teacher model. We eventually configure the student model to 6 layers, because our progressive distillation scheme enables distilling multiple separate layers (m=1,2,3) in the teacher to a single layer in the student. This results in the student model having 12/m layers. When increasing m from 1 to 2, there is no significant performance decline, but the model complexity reduces by ~50%; when m=3, performance declines by more than 70%. Hence we set m=2, resulting in 6 layers in the student model. We have modified the manuscript accordingly to include these details.

  3. Details on 3D sparse flash attention (R3&4): Due to space limitations, we provided detailed descriptions and configurations for our 3D sparse flash attention in the code provided via an anonymized link. We use all three configurations depicted in Fig.1 (lower right); each configuration is assigned to two attention heads. Other details, including the hyperparameters and model checkpoints, are also available in the readme file available via the anonymized link.

  4. Low performance (R4): 3D volumetric medical image segmentation is considerably more challenging than 2D segmentation, especially for the segment anything models. This observation, along with a general insight that SAM-like models do not yet perform at the same level in instance segmentation tasks, whether 2D or 3D, is well supported by previous works, such as [ref.27] and [1]. While we agree that both our method and SAMs, in general, will benefit from future research on enhancing performance, the primary contribution of this work is the design of an efficient and lightweight SAM for volumetric segmentation that does not compromise performance. Tab.2 provides quantitative evidence in support of this contribution. [1] “Segment anything model for medical image analysis: an experimental study.” Medical Image Analysis 89 (2023): 102918.

  5. Details about data preparation and preprocessing (R4): We follow the data split and preprocessing pipeline in SAM-Med3D [ref.27] for a fair comparison.
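
  The layer counts in item 2 and the head assignment in item 3 can be sketched as follows. This is an illustrative sketch only, not the authors' code: the helper names `group_teacher_layers` and `assign_head_patterns`, the grouping of consecutive teacher layers, the head ordering, and the placeholder pattern names are all assumptions; the rebuttal states only that m teacher layers map to one student layer (giving 12/m student layers) and that each of the three sparse configurations is assigned to two attention heads.

```python
from typing import Dict, List


def group_teacher_layers(num_teacher_layers: int, m: int) -> List[List[int]]:
    """Split teacher layer indices into groups of m, one group per student layer.

    Grouping consecutive layers is an assumption made for illustration.
    """
    if m <= 0 or num_teacher_layers % m != 0:
        raise ValueError("m must be a positive divisor of the teacher depth")
    return [
        list(range(start, start + m))
        for start in range(0, num_teacher_layers, m)
    ]


def assign_head_patterns(patterns: List[str], heads_per_pattern: int) -> Dict[int, str]:
    """Map each attention head index to one sparse-attention configuration."""
    return {
        pattern_index * heads_per_pattern + offset: pattern
        for pattern_index, pattern in enumerate(patterns)
        for offset in range(heads_per_pattern)
    }


# Student depth for the three values of m discussed in the rebuttal.
for m in (1, 2, 3):
    groups = group_teacher_layers(12, m)
    print(f"m={m}: {len(groups)} student layers, first group {groups[0]}")

# Three configurations, two heads each; the names are placeholders.
head_map = assign_head_patterns(["config_a", "config_b", "config_c"], 2)
print(head_map)
```

  With m=2 the sketch yields the 6 student layers reported in the rebuttal, and the head map covers 6 heads in total.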




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper “FastSAM3D: An Efficient Segment Anything Model for 3D Volumetric Medical Images” introduces a novel and efficient method for accelerating SAM inference in 3D volumetric medical images. The key contributions include the introduction of a layer-wise progressive distillation scheme to transfer knowledge from a complex ViT-B to a lightweight ViT-Tiny, and the development of 3D sparse flash attention to replace vanilla attention operators, significantly reducing memory consumption. The proposed FastSAM3D demonstrates impressive speed improvements, achieving 8 milliseconds per 128×128×128 3D volumetric image on an NVIDIA A100 GPU, without significant performance decline. The paper is well-written, organized, and provides extensive ablation studies and experimental analysis, verifying the effectiveness and efficiency of the proposed method.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The proposed method and code release could well benefit the community. The rebuttal has addressed the major concerns of the reviewers.



