Abstract

Tumor lesion segmentation on CT or MRI images plays a critical role in cancer diagnosis and treatment planning. Considering the inherent differences in tumor lesion segmentation data across various medical imaging modalities and equipment, integrating medical knowledge into the Segment Anything Model (SAM) presents promising capability due to its versatility and generalization potential. Recent studies have attempted to enhance SAM with medical expertise by pre-training on large-scale medical segmentation datasets. However, challenges still exist in 3D tumor lesion segmentation owing to tumor complexity and the imbalance in foreground and background regions. Therefore, we introduce Mask-Enhanced SAM (M-SAM), an innovative architecture tailored for 3D tumor lesion segmentation. We propose a novel Mask-Enhanced Adapter (MEA) within M-SAM that enriches the semantic information of medical images with positional data from coarse segmentation masks, facilitating the generation of more precise segmentation masks. Furthermore, an iterative refinement scheme is implemented in M-SAM to refine the segmentation masks progressively, leading to improved performance. Extensive experiments on seven tumor lesion segmentation datasets indicate that our M-SAM not only achieves high segmentation accuracy but also exhibits robust generalization. The code is available at https://github.com/nanase1025/M-SAM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0762_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0762_supp.pdf

Link to the Code Repository

https://github.com/nanase1025/M-SAM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Shi_MaskEnhanced_MICCAI2024,
        author = { Shi, Hairong and Han, Songhao and Huang, Shaofei and Liao, Yue and Li, Guanbin and Kong, Xiangxing and Zhu, Hua and Wang, Xiaomu and Liu, Si},
        title = { { Mask-Enhanced Segment Anything Model for Tumor Lesion Semantic Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an inner block of mask enhancement for SAM based segmentation. This new network is called M-SAM. The idea of this new block is to better align information from point prompt (positional) with image embeddings to improve mask decoder results. The network is tested on 7 segmentation tasks and compared to 9 well chosen state-of-the-art methods. Dice scores show that M-SAM performs better than other tested methods, and that M-SAM better generalizes than SAM-Med3D.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting comparisons with SOTA approaches and segmentation tasks
    • Achieves good results on DB transfer (better than SOTA, on the same modality and segmentation task)
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The standard deviations (std) are missing in all tables.
    • The update of E_P is unclear to me. Are only the embeddings updated?
    • Some points are unclear: before/after transfer and DSC values that do not match; the links between the vocabulary of Table 3 and §3.3; iterative refinement versus training.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Adding std for dice scores in tables can be of interest.
    • It would help if the authors could clarify what exactly is meant by before and after transfer (and why the results of Tables 1 and 2 mismatch)… I may have missed some information.
    • I could not find information about the number of iterations for the refinement (validation)… if any. From my understanding, the distinction in the training phase between updating the network parameters and the iterative refinement is not clear.
    • It would help the reader if the authors clarified the terms used for the ablation study and used the same terms in Table 3 and §3.3.
    • The proposed resizing scheme for volumes is not symmetrical and is not exactly a “crop-and-pad” strategy. My concern is about the shape deformation that can be generated by the trilinear interpolation. With such an approach, can the method effectively segment organs?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some unclear points and statistics are missing. These points can be clarified by authors.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors effectively addressed my main concerns.



Review #2

  • Please describe the contribution of the paper

    This paper adapts the Segment Anything Model (SAM) to tumor segmentation in volumetric images such as CT and MRI. The main contribution comes from the novel Mask-Enhanced Adapter (MEA) module, which is designed to iteratively refine the segmentation mask and, in turn, to use the refined segmentation mask to facilitate image feature extraction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The model is developed in a 3D manner. This could be a highlight of this paper, since most prior works inherited the 2D network architecture from the original SAM, which substantially limited their ability to analyze 3D images.

    • Multiple public datasets were used to evaluate the proposed method. Six public and one in-house CT and MRI datasets were used to evaluate the proposed method, which makes the experiment settings solid.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The design of the key component, Mask-Enhanced Adapter (MEA), is not convincing. The major technical contribution of this work comes from the MEA module. However, the design is not convincing. According to my understanding, the purpose of this design is to realize feature fusion or cross attention between the image features and mask features. A more straightforward way to achieve that purpose is to utilize the cross-attention module. Specifically, we can use the image feature as query and mask feature as key and value to conduct the attention process, and also in an inverse way. The author should compare their MEA module with this baseline to demonstrate the necessity of this design.

    • Many technical details are missing or not well explained. This substantially affects the understanding of the proposed method. Please see my itemized comments below for more information.

    • The evaluation part of this paper needs substantial improvement. The experimental evaluation of the proposed method is somewhat weak because much information is missing. For example, there is no data split information. No statistical analysis was conducted on the results to demonstrate the significance of the improvement margins. Only region-based metrics (DSC and IoU) were used for evaluation, and no distance-based metrics such as ASD and HD. Please see the following itemized comments for more detailed information. All of these issues affect the quality of the experimental results, making the conclusion less convincing.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Section 2.1: “As shown in Fig. 1, … and a randomly-initialized point as the initial point prompt.” This sentence is incomplete. Please double check and rewrite it.

    • Section 2.1: “… a randomly-initialized point as the initial point prompt.” I am wondering how can a randomly initialized point lead to a correct and stable segmentation of the tumor. The point prompt is an important input of the SAM. Given the same image input, if we have different points as prompt, the segmentation results could be highly different. So, I doubt whether and why this kind of “random initialization” can lead to the final correct and stable segmentation of the tumors.

    • Section 2.1: “N_I and N_P denote the dimensions of the image and point embeddings, respectively.” It could be misleading to say “dimensions” here since it sounds like the dimension of the feature vectors (or the channel number). Please use “number” instead of “dimension” here. BTW, what’s the specific value of N_I and N_P in the experiments?

    • Section 2.2: “we modify the residual connections in the original Transformer block to mutual residual connections, which facilitates their fusion.” I am wondering why not just use cross-attention instead of self-attention to realize the feature fusion function? In my opinion, the mutual residual connection design is less exhaustive when compared with cross-attention mechanism.

    • Section 2.3: “the new prompt embedding E^1_P is also generated based on the last segmentation mask” It is unclear how to generate the new prompt embedding based on the last segmentation mask.

    • Section 3.1: The author said that they “employ a crop-or-pad strategy to standardize all images,” but also said “applying trilinear interpolation to resize images that exceed the specified dimensions,” which is actually resampling, not cropping.

    • Section 3.1: “ZNormalization” should be “z-score normalization.”

    • Section 3.1: “… trained on one NVIDIA Tesla V100 GPU.” The authors are encouraged to state explicitly which parts of the network are tunable and how much memory training requires.

    • Section 3.1: What’s the data split? How many samples were used for training, validation, and testing?

    • Section 3.1: Distance based metrics such as ASD or HD are suggested for the evaluation of segmentation results. Currently, only region based metrics like DSC and IoU are used.

    • Section 3.1: There is no statistical analysis of the results to demonstrate the significance of the improvement margins.

    • Section 3.2: “To further validate our method’s generalizability, we performed transfer experiments from source to target datasets without training on the latter.” I am confused here. If there is no training on the target dataset, why Table 2 shows “before” and “after” transfer results? What’s the exact meaning of “before” and “after”?

    • Section 3.2: Please keep consistent decimal when reporting the numeric results throughout the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No major flaw is found in this paper. However, given the weaknesses in methodology design, presentation of the method, and experimental evaluation, I tend to render a weak reject first on this paper and expect to see a solid rebuttal from the authors to change my recommendation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks for the authors’ efforts in rebuttal, which generally addressed my concerns and comments. I would like to raise my score from 3 (Weak reject) to 4 (Weak accept).



Review #3

  • Please describe the contribution of the paper

    This paper provides an effective Tumor Lesion Semantic Segmentation method based on the vision foundation Segment Anything Model and an attempted work for its medical image counterpart (SAM-Med3D). The authors propose an elegant fusion scheme between the image information and the mask information obtained from the foundation models to iteratively improve the quality of segmentation. The authors have evaluated the method comprehensively in several datasets against strong baselines and have shown superior performance. Additionally, the method boasts improved performances with updates to only 20% parameters.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The fusion scheme is very popular in medical image analysis between different modalities of data/information and most definitely not the newest of ideas, but the elegance with which the authors fused the information seems somewhat novel and effective. The method is comprehensively evaluated across several datasets and the results show that the proposed method is superior against the state-of-the-art methods in the topic of study with updates to only 20% parameters.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper is mostly clear, readers might have problems understanding some methodological details. How does the model handle the samples in inference time? Specifically, how does the prompt encoder function in the absence of ground truths for inference samples? Similarly, the number of points used for the prompt is not mentioned. Perhaps the same as “SAM-Med3D” (10)? Experiments showing the effects of different choices of the number of points would be helpful. Additionally in the main results table are the numbers for baselines generated by the authors or taken from published sources? If the authors implemented the baselines, what hyper-parameters were set for each baseline? Shouldn’t the transformer and CNN-based models be tuned for these hyperparameters to yield the best results for each model? Does the reported number show the average across multiple runs? What are the Standard deviations across the runs?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No code was provided, but the implementation details seem sufficient, and because the architecture is based on publicly available methods, the modifications the authors suggested look reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please consider the following concerns:

    How does the model handle the samples in inference time? Specifically, how does the prompt encoder function in the absence of ground truths for inference samples?

    Wrong Citation in the introduction: “Tumor lesion segmentation [17]”. The reference points to a paper that doesn’t seem to be related to tumor lesion segmentation

    In Table 1, are the numbers for baselines generated by the authors or taken from published sources? If the authors implemented the baselines, what hyper-parameters were set for each baseline? Shouldn’t the transformer and CNN-based models be tuned for these hyperparameters to yield the best results for each model? Do the reported numbers show the average across multiple runs? What are the standard deviations across the runs?

    The number of points used for the prompt is not mentioned. Perhaps the same as “SAM-Med3D” (10)? Experiments showing the effects of different choices of the number of points would be helpful.

    Experiments showing the zero-shot transfer across datasets with this method would provide more insights into how well this model generalizes. Because this is based on a generalist Medical Image model “SAM-Med3D” and the foundation model “Segment Anything”, how feasible is this method for a foundational Tumor Lesion Semantic Segmentation model? Something to think about perhaps.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an elegant fusion scheme between image and mask by leveraging the strengths of the foundational model with very few added complexities to facilitate Tumor Lesion segmentation. The results look convincing against the widely used baseline. Additionally, the writing is mostly good and if the authors address some of the concerns as specified, this may be a good read for the people in the community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors clarified the concerns I had and I am satisfied with their responses. I had accepted the paper for its merits and I stand true to my initial judgement.




Author Feedback

We sincerely thank ACs and all reviewers for their efforts. We are grateful that reviewers appreciate the novelty and effectiveness (R#4), good results (R#1), and solid experiments (R#6) of our work. We will revise our paper accordingly and release the code upon acceptance.

All reviewers 1) Point Prompt: Following SAM-Med3D, to simulate the clinical scenario of interactive segmentation, one point per iteration is randomly sampled: from the foreground in the 1st iteration, and from the error region between the coarse mask and GT in the subsequent 9 iterations, totaling 10 points in 10 iterations. The updated prompt embedding E_P is thus generated by feeding the newly sampled point into the prompt encoder at each iteration. 2) Training/Inference: The above point sampling strategy is used for both training and inference in our experiments, while in real clinical use, it operates interactively with physicians. During training, we calculate the segmentation loss and update the model parameters only after the last (10th) iterative refinement. 3) Writing issues: Thanks for pointing these issues out; we will fix them all in the revised version.
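The point sampling strategy described above can be illustrated with a minimal NumPy sketch. This is not the released M-SAM code: the function names `sample_point` and `simulate_prompts`, the `predict` callable standing in for the model, the fallback to the foreground when the error region is empty, and the labeling of each point by its ground-truth value are all assumptions made for illustration. The ground truth is assumed to contain at least one foreground voxel.

```python
import numpy as np


def sample_point(gt, pred=None, rng=None):
    """Sample one (z, y, x) voxel and its label (1 = foreground, 0 = background).

    First iteration (pred is None): sample from the ground-truth foreground.
    Later iterations: sample from the error region, i.e. voxels where the
    coarse prediction and the ground truth disagree.
    """
    if rng is None:
        rng = np.random.default_rng()
    gt = gt.astype(bool)
    if pred is None:
        candidates = gt
    else:
        candidates = gt ^ pred.astype(bool)  # XOR marks disagreeing voxels
        if not candidates.any():
            candidates = gt  # assumption: fall back to foreground if no error
    coords = np.argwhere(candidates)
    point = tuple(int(c) for c in coords[rng.integers(len(coords))])
    return point, int(gt[point])


def simulate_prompts(gt, predict, n_iters=10, seed=0):
    """Collect one prompt per iteration; `predict` is a hypothetical stand-in
    that maps the prompts gathered so far to a coarse binary mask."""
    rng = np.random.default_rng(seed)
    prompts, pred = [], None
    for _ in range(n_iters):
        prompts.append(sample_point(gt, pred, rng))
        pred = predict(prompts)
    return prompts
```

With `n_iters=10` this yields 10 points in 10 iterations, matching the count stated in the rebuttal; in a real pipeline each new point would be passed through the prompt encoder to produce the updated E_P.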

R#1, R#4 STD values: Our results in Tab1 are averaged across 5 runs, with Dice STD less than 0.3% on all datasets. Due to the rebuttal policy, the complete std values will be included in the revised version.

R#1, R#6 1) Before/After in Tab.2: ‘before’ means both training and testing on target dataset, while ‘after’ means training on source and testing on target dataset. A smaller difference between ‘before’ and ‘after’ indicates less performance drop, i.e., better transferability. The inconsistent results for LiTS in Tab.1 (liver tumor + organ) and Tab.2 (liver tumor only) are because Tab.2 focuses on transfer performance between LiTS and MSD datasets, where only tumor data is available in MSD. 2) crop-or-pad: We apologize for the mistake in the original description. The correct one is ‘…and applying cropping for dimensions exceeding 128’ and we’ll fix it in the revised version.
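The corrected crop-or-pad description can be sketched as follows. The rebuttal only states that dimensions exceeding 128 are cropped, so the centered crop, the centered zero padding for smaller dimensions, and the function name `crop_or_pad` are assumptions for illustration rather than the authors' implementation.

```python
import numpy as np


def crop_or_pad(volume, target=(128, 128, 128), pad_value=0):
    """Center-crop axes larger than the target and pad axes smaller than it."""
    out = volume
    for axis, size in enumerate(target):
        n = out.shape[axis]
        if n > size:
            # Crop: keep the central `size` slices along this axis.
            start = (n - size) // 2
            out = np.take(out, np.arange(start, start + size), axis=axis)
        elif n < size:
            # Pad: split the missing slices between both sides.
            before = (size - n) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, size - n - before)
            out = np.pad(out, pad, mode="constant", constant_values=pad_value)
    return out
```

Unlike trilinear resampling, this keeps voxel spacing and object shape unchanged, at the cost of discarding content outside the central window.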

R#4 1) Baseline in Tab1: Most baseline results on the 5 datasets are from published sources. For their missing results on specific datasets (e.g., KiTS19 and LiTS results for UNETR), we reproduce them using the same hyperparameters reported in the original papers for consistency. 2) Transferability between different tumors: We attempted transfer learning between lung and brain tumors but observed unsatisfactory results (over 19% drop), likely due to the entirely different expertise required for different tumor types. However, this does not affect our core contributions. We have designed a feasible solution to construct a foundational tumor lesion segmentation model by extending our method to a Mixture of Experts (MoE) framework to handle different tumors with different experts. We will explore this in our future work.

R#6 1) MEA vs. CA: We have previously tried cross-attention for interaction but found limited improvement over raw SAM-Med3D (4% less than MEA). This may be because MEA realizes precise point-to-point fusion, while CA introduces unnecessary or noisy interaction. 2) Metrics: Since distance-based metrics (like HD) are sensitive to outliers and may skew clinical evaluations, we follow SOTA methods to report the commonly used Dice and IoU. Moreover, the average HD metric values on the five datasets are 11.45 for nnUnet and 9.23 for our method (lower is better). 3) Statistical analysis: We use a t-test to compare our model’s average Dice across five datasets with the current SOTA nnFormer and obtain a p-value of 7.64e-10, which rejects the null hypothesis and indicates a significant improvement of our model over nnFormer. 4) More details: a) N_I and N_P are 384. b) As we have marked in Fig.1, except for the SAM image encoder (which accounts for 79% of the total parameters), the remaining parts are tunable. GPU memory is 26G with batch size 8. c) Training:validation:test is 6:2:2 following SAM-Med3D.
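For readers unfamiliar with the two region-based metrics discussed above, the standard definitions of Dice and IoU on binary masks can be sketched as below. The function name `dice_and_iou` and the convention of returning 1.0 when both masks are empty are assumptions; the paper's exact evaluation code may differ.

```python
import numpy as np


def dice_and_iou(pred, gt):
    """Return (Dice, IoU) for two binary masks of identical shape."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    union = total - inter
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0, 1.0
    dice = 2.0 * inter / total  # 2|A∩B| / (|A| + |B|)
    iou = inter / union         # |A∩B| / |A∪B|
    return float(dice), float(iou)
```

The two scores are monotonically related (Dice = 2·IoU / (1 + IoU)), so they rank methods identically on a single case, which is part of why reviewers asked for a complementary distance-based metric.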




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While the technical novelty and performance of the proposed model were commended, the paper faced criticism for methodological clarity, experimental design, and reproducibility details. The rebuttal made by the authors seemed to have addressed the concerns to shift some initial rejections to acceptances.

    I vote for acceptance considering the novelty of the proposed method to align the positional information of the prompt with the semantic information in the input image and the rebuttal quality.

    Please include the clarified details in the final manuscript as much as you can.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After reviewing the rebuttal, most of the unclear aspects and concerns have been adequately addressed. Given the interesting ideas presented and the consistently positive reviews, I recommend accepting this paper.



