Abstract

Recently, the Segment Anything Model (SAM) has demonstrated promising segmentation capabilities in a variety of downstream segmentation tasks. However, in the context of universal medical image segmentation, there exists a notable performance discrepancy when directly applying SAM, due to the domain gap between natural and 2D/3D medical data. In this work, we propose a dual-branch adapted SAM framework, named DB-SAM, that strives to effectively bridge this domain gap. Our dual-branch adapted SAM contains two parallel branches: a ViT branch and a convolution branch. The ViT branch incorporates a learnable channel attention block after each frozen attention block, which captures domain-specific local features. On the other hand, the convolution branch employs a light-weight convolutional block to extract domain-specific shallow features from the input medical image. To perform cross-branch feature fusion, we design a bilateral cross-attention block and a ViT convolution fusion block, which dynamically combine the diverse information of the two branches for the mask decoder. Extensive experiments on a large-scale medical image dataset comprising various 3D and 2D medical segmentation tasks reveal the merits of our proposed contributions. On 21 3D medical image segmentation tasks, our proposed DB-SAM achieves an absolute gain of 8.8%, compared to a recent medical SAM adapter in the literature. Our code and models will be publicly released.
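As a rough, hypothetical illustration of the kind of learnable channel attention block described above, here is a minimal squeeze-and-excitation-style sketch in NumPy. All names, shapes, and the gating design are our own assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feats, w1, w2):
    """Squeeze-and-excitation-style channel attention over (C, H, W) features.

    Global-average-pool each channel, pass the descriptor through a two-layer
    bottleneck (w1, w2), and rescale the channels by the resulting sigmoid gates.
    """
    squeezed = feats.mean(axis=(1, 2))        # (C,) per-channel descriptor
    hidden = np.maximum(0.0, w1 @ squeezed)   # ReLU bottleneck
    gates = sigmoid(w2 @ hidden)              # (C,) attention weights in (0, 1)
    return feats * gates[:, None, None]       # re-weight channels

rng = np.random.default_rng(0)
c, h, w, r = 8, 4, 4, 2                       # channels, height, width, reduction
feats = rng.standard_normal((c, h, w))
w1 = rng.standard_normal((c // r, c))         # hypothetical learned weights
w2 = rng.standard_normal((c, c // r))
out = channel_attention(feats, w1, w2)
print(out.shape)                              # (8, 4, 4)
```

The gating leaves the spatial layout untouched and only scales channels, which is why such a block can be slotted after a frozen attention block without disturbing its output shape.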

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1489_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/AlfredQin/DB-SAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Qin_DBSAM_MICCAI2024,
        author = { Qin, Chao and Cao, Jiale and Fu, Huazhu and Shahbaz Khan, Fahad and Anwer, Rao Muhammad},
        title = { { DB-SAM: Delving into High Quality Universal Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript presents the next extension of MedSAM for higher-accuracy automatic segmentation of medical images. The authors incorporate a second branch into the “original” MedSAM, involving a channel attention block and a bilateral cross-attention block, followed by a fusion block. This fusion of information between MedSAM and the second branch enables higher-accuracy segmentation on 2D and 3D datasets from different imaging modalities and organs of interest.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript is well-structured, concise and rounded. Apart from the well-described implementation of a second branch, the algorithm was tested on plenty of medically relevant datasets. The manuscript also includes ablation studies, which are of great interest to further develop and potentially improve the segmentation of medical data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While DSC and NSD are useful metrics for comparing methods, it would be important to include metrics used in medicine, for example, organ volume. While the overall improvement in segmentation quality is visible in all datasets, it is not homogeneous across organs of interest. It would be interesting to investigate why the segmentation accuracy is still poor on some datasets (say, where DSC is lower than 70%). Also, I find the ablation studies quite interesting. If possible, it would be important to perform them individually. For example, if bilateral cross-attention is integrated (without the other two components), would it improve segmentation accuracy more than channel attention? This type of comparison would be more telling in terms of which specific component contributes to the success of this pipeline.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I found the work extremely interesting and of great interest to the biomedical community. To fully profit from it, however, access to the code or ideally its easy-to-plug-and-play implementation would be critical.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript is well-written, includes a highly relevant next step in medical image segmentation and is tested on a variety of organs of interest and imaging modalities.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a dual-branch adapted SAM framework for medical image segmentation called DB-SAM. It aims to effectively bridge the domain gap between natural images and 2D/3D medical images when applying SAM (Segment Anything Model). Specifically, the ViT branch captures domain-specific local features, while the convolution branch extracts domain-specific shallow features from the input medical image. The information from the two branches is then dynamically combined through a bilateral cross-attention block and a ViT convolution fusion block. The framework exhibits strong generalization capabilities, performs well on multiple datasets, and significantly improves segmentation performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    In the ViT branch, to keep the strong feature-representation ability of the pre-trained ViT, the authors freeze the weights of the pre-trained ViT and introduce a local adapter module consisting of a channel attention block, which is beneficial for extracting high-level domain-specific features from the different levels of the ViT encoder.

    In the convolution branch, the authors utilize a light-weight convolution block to extract shallow features from the resized image Iconv. A bilateral cross-attention block then helps fuse deep features from the ViT branch with shallow features from the convolution branch.

    Overall, the combination of the ViT branch and the convolution branch makes the proposed method, DB-SAM, a novel and promising framework for medical image segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The model input is a 2D image; how can the authors avoid information loss in the z dimension (depth) when segmenting 3D data?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As shown in “weaknesses of the paper”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Adequacy of the experiment, novelty of the method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The manuscript describes DB-SAM, an innovative dual-branch framework that enhances the Segment Anything Model for medical image segmentation. By running a ViT branch and a convolution branch in parallel, the method effectively bridges the domain gap between natural and medical images. The proposed solution demonstrates a significant 8.8% improvement on 3D segmentation tasks over existing methods, indicating a promising direction for future research in the field. The authors’ commitment to releasing the code will be beneficial for reproducibility and further exploration.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The article is well-organized, well-written, clear, and easy to understand. 2) The authors propose a dual-branch framework, named DB-SAM, that adapts SAM for high-quality universal medical image segmentation, with the dual branch comprising a ViT branch and a convolution branch. 3) The paper introduces a bilateral cross-attention block to effectively fuse features between the ViT and convolution branches, followed by an automatic selective mechanism for final feature fusion. 4) The experimental results confirm the efficacy of the proposed DB-SAM, showing a consistent enhancement in performance across various 2D and 3D medical image segmentation tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Why is the input size of the convolution branch set to 256×256, and how would the performance be affected if the dimensions were altered? 2) The purpose of the convolution branch is to extract visual features. Can the lightweight convolution block be replaced with ResNet-18?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors have proposed a new SAM framework that significantly outperforms MedSAM across multiple tasks, which will have a positive impact on the segmentation community.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have proposed a new SAM framework that significantly outperforms MedSAM across multiple tasks, which will have a positive impact on the segmentation community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for the positive feedback. Our code and models will be publicly released.

To Reviewer #1

  1.1 Code Availability: We will make the source code and pre-trained models publicly available.

  1.2 Inclusion of Additional Medical Metrics: We acknowledge the importance of including clinically relevant metrics such as organ volume. We will extend our evaluation to include these metrics, which will provide a more comprehensive understanding of the clinical applicability of our method.

  1.3 Segmentation Performance across Organs: We thank R1 and will further conduct an analysis with respect to segmentation performance across organs.

  1.4 Further Ablation Studies: We thank R1 for the suggestion and will address it in the revised version.

To Reviewer #2

  1.1 Handling z-dimension Information: The reviewer raised an important point regarding the potential loss of information when segmenting 3D data, as our model currently accepts only 2D image inputs. Our current approach involves slicing 3D images into 2D slices, which does indeed omit the z-dimension (depth) information. This could potentially affect the model’s ability to fully capture the spatial complexities of 3D structures. To address this, we are exploring methods to adapt our model to directly handle 3D volumetric data without slicing. This would allow our model to utilize depth information, which we hypothesize could enhance segmentation accuracy, especially for complex anatomical structures. We plan to incorporate this enhancement in future iterations of our model and will rigorously evaluate the impact of using depth information on segmentation performance.
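The slice-by-slice strategy described in this response can be sketched as follows; the segment_slice stand-in (a simple threshold) is purely illustrative and not the actual model:

```python
import numpy as np

def segment_slice(slice_2d):
    """Stand-in for a 2D segmentation model (here: a simple threshold)."""
    return (slice_2d > 0.5).astype(np.uint8)

def segment_volume_slicewise(volume):
    """Segment a (D, H, W) volume by running a 2D model on each axial slice.

    Each slice is predicted independently, so inter-slice (z-axis) context
    is not used; the 2D masks are then stacked back into a 3D mask.
    """
    return np.stack([segment_slice(volume[z]) for z in range(volume.shape[0])])

volume = np.linspace(0.0, 1.0, 2 * 3 * 3).reshape(2, 3, 3)
mask = segment_volume_slicewise(volume)
print(mask.shape)  # (2, 3, 3)
```

The independence of each slice prediction is exactly why depth information is lost, which is the limitation the rebuttal proposes to address with native 3D inputs.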

To Reviewer #3

  1.1 Choice of Convolution Branch Input Size: The image size of 256×256 was chosen for the convolution branch because it matches the resolution of the raw images in our datasets, which allows for direct processing without additional resizing. In contrast, the raw image must be resized to 1024×1024 before being fed into the ViT branch, since the ViT module of SAM is specifically designed to process images at this resolution.

  1.2 Potential Replacement of the Lightweight Convolution Block with ResNet-18: The suggestion to replace the lightweight convolution block with a more complex architecture such as ResNet-18 is intriguing. ResNet-18 could potentially enhance the model’s ability to capture richer feature representations due to its depth and architectural advantages. We plan to conduct experiments comparing the performance of our current model with a version incorporating ResNet-18 to determine whether this change offers a significant improvement. This exploration will help us refine our model design and optimize it for even better performance across various segmentation tasks.
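A minimal sketch of the two input resolutions described in this response, using a hypothetical nearest-neighbour resize (not the authors' actual preprocessing code):

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Minimal nearest-neighbour resize for a 2D (H, W) image."""
    h, w = img.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return img[rows[:, None], cols]

raw = np.arange(256 * 256, dtype=np.float32).reshape(256, 256)
conv_in = raw                              # 256x256: fed to the conv branch as-is
vit_in = resize_nearest(raw, 1024, 1024)   # 1024x1024: the resolution SAM's ViT expects
print(conv_in.shape, vit_in.shape)         # (256, 256) (1024, 1024)
```

Keeping the conv branch at the native 256×256 resolution avoids an extra interpolation step, while the upsampled copy satisfies the fixed input size of SAM's ViT encoder.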




Meta-Review

Meta-review not available, early accepted paper.


