Abstract

Accurate segmentation of polyps is crucial for efficient colorectal cancer detection during colonoscopy screening. State Space Models, exemplified by Mamba, have recently emerged as a promising approach, excelling in long-range interaction modeling with linear computational complexity. However, previous methods do not consider the cross-scale dependencies of different pixels or the consistency between feature representations and semantic embeddings, both of which are crucial for polyp segmentation. Therefore, we introduce Polyp-Mamba, a novel unified framework that aims to overcome these limitations by integrating multi-scale feature learning with semantic structure analysis. Specifically, our framework includes a Scale-Aware Semantic module that embeds multi-scale features from the encoder to model semantic information both within and across scales, rather than at a single scale as in prior studies. Furthermore, the Global Semantic Injection module injects scale-aware semantics into the corresponding decoder features, aiming to fuse global and local information and enhance the pyramid feature representation. Experimental results across five challenging datasets and six metrics demonstrate that our proposed method not only surpasses state-of-the-art methods but also sets a new benchmark in the field, underscoring the Polyp-Mamba framework’s exceptional proficiency in polyp segmentation tasks.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0697_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0697_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

https://drive.usercontent.google.com/download?id=1pFxb9NbM8mj_rlSawTlcXG1OdVGAbRQC&export=download&authuser=0

BibTex

@InProceedings{Xu_PolypMamba_MICCAI2024,
        author = { Xu, Zhongxing and Tang, Feilong and Chen, Zhe and Zhou, Zheng and Wu, Weishan and Yang, Yuyao and Liang, Yu and Jiang, Jiyu and Cai, Xuyue and Su, Jionglong},
        title = { { Polyp-Mamba: Polyp Segmentation with Visual Mamba } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a Mamba-based model for polyp segmentation and achieve good experimental results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strengths: 1) The abstract and introduction are well written. 2) The paper introduces a Mamba model to polyp segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    To be honest, the authors are very good at writing abstracts and introductions. I was looking forward to seeing those claims verified in the method section, but they were not. Major: 1) For the proposed Scale-Aware Semantics Module and Global Semantics Injection Module, I cannot understand why they address the cross-scale dependencies. Can you explain what the cross-scale dependencies are? Besides, why can these two modules capture semantic information, and what do they have to do with semantics?

    2) As stated by the authors, the proposed SAS module consists of several VSS blocks, and the VSS block is derived from VMamba. So what is the innovation of the SAS module?

    3) In the introduction, the authors claim that the GSI was designed to address the consistency of feature representations across scales. However, in Sec. 2.4, the authors claim that the GSI is introduced to bridge the semantic gap before merging features through a cross-attention mechanism. Is it possible that the authors do not know which problem they are solving?

    Minor: (1) The flowchart (Fig. 1) is too confusing; it is impossible to understand the details of the authors’ proposed modules. (2) Section 2.4 states that “there exists a significant semantic gap between features {T1, . . . , TN } and scale-aware semantics.” I do not understand why this semantic gap exists. Feature visualization diagrams here would make the claim more persuasive. (3) The readability of Section 2.3 is poor. It is recommended to revise the text to emphasize the main structure of the module. (4) Because the authors’ innovation primarily concerns cross-scale feature fusion, the experimental section should show more of the resulting improvements, along with visualizations of hidden-layer features.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper’s reproducibility is low because the network’s finer details are hard to understand from the text.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    More detailed presentation of the network details and experimental results regarding cross-scale feature fusion should be provided. Please refer to the “Weaknesses” section for specific improvement measures.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weaknesses section.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents Polyp-Mamba, an innovative framework designed for precise polyp segmentation in colonoscopy images. The authors introduce two key models aimed at enhancing polyp segmentation: the Mamba model, which efficiently captures long-range dependencies while maintaining linear computational complexity, and the Scale-Aware Semantics (SAS) module, facilitating multi-scale feature learning and semantic information modeling across various scales. Additionally, the Global Semantic Injection (GSI) module is introduced to inject global semantics into local features, thereby integrating global and local information to enhance pyramid feature representation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Introducing Mamba to the polyp segmentation problem both improves accuracy and maintains a linear computational cost.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited discussion on computational complexity and inference time: While the paper mentions that the Mamba model and VSS blocks enable linear computational complexity, it does not provide detailed analysis or comparisons of the computational complexity and inference time of Polyp-Mamba with other state-of-the-art methods. Such an analysis could be valuable for assessing the practical applicability of the proposed method, especially in real-time or resource-constrained scenarios.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    How does the model handle variations in imaging conditions during colonoscopy, such as differences in image quality or blurriness?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Provide a detailed analysis or comparison of the computational complexity and inference time of Polyp-Mamba against other state-of-the-art methods.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors’ claim of improved speed lacks supporting results.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a framework named Polyp-Mamba for segmenting colonic polyps. The method integrates pixel-level context modeling and semantic relationship mining to address the challenges of polyp appearance variability, via the Scale-Aware Semantic (SAS) Module and the Global Semantic Injection (GSI) Module. Semantic information is discerned across scales and merged with local features to create a hierarchical representation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Mamba has been used in vision tasks demonstrating efficiency in time and memory complexity. Vision Mamba was first applied to polyp segmentation in ProMamba recently (https://arxiv.org/abs/2403.13660). The use of Vision Mamba is considered novel in polyp segmentation in this paper. Experimental results on five polyp segmentation datasets show improvement over previous methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The number of parameters (or model size) and run time are not stated. For example, does the method achieve real-time segmentation efficiency (∼50fps) as PraNet does? In the method comparison (Tables 1 and 2), different methods use different input sizes in their original papers. The authors use 224x224 for input, whereas PraNet and PolypPVT, for example, use 352x352. Unless the authors reimplement those methods with the same input size, it seems inappropriate to compare against metric results obtained with different input sizes.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The weighted IoU loss is not clearly defined. What is the weighting factor? Time cost or memory footprint was not reported.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In eqn (1), it would be good to provide equations defining the binary cross-entropy loss and the weighted IoU loss for clarity. Furthermore, what is the rationale for choosing these two loss functions rather than the other loss functions used by the comparison methods? The performance depends not only on the architecture but also on the loss function and the training parameters. What effects do they have on the ablation studies? For quantitative evaluation, all methods should use the same input size for a fair comparison. Some metrics reported in Tables 1 and 2 are for different input sizes. Since linear computational complexity is a highlight of Mamba, it would be good to state the run time and parameter count.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper is well-structured and validated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

  1. Parameters and Run Time (r1,r4): Polyp-Mamba, with 16.4M parameters, achieves approximately 90 FPS real-time segmentation efficiency, outperforming PraNet (32M parameters, 50 FPS) and PolypPVT (25.08M parameters, 48 FPS) on a single RTX 4090. Additionally, Polyp-Mamba has a lower computational complexity, with 7.99G FLOPs at an input size of 352×352, compared to PraNet’s 13.01G and PolypPVT’s 10.0G. In terms of training time cost, Polyp-Mamba takes 20 minutes, while PraNet requires 2 hours, and PolypPVT takes 45 minutes. These results indicate that Polyp-Mamba maintains high segmentation performance while being more efficient for real-time applications.

  2. Fair Comparison (r1): To ensure a fair comparison, the combined loss functions are consistent with those used in PolypPVT and PraNet. With an input size of 352×352, Polyp-Mamba achieved a Dice score of 0.941 on the Kvasir dataset, 0.946 on the ClinicDB dataset, 0.833 on the ColonDB dataset, 0.820 on the ETIS dataset, and 0.923 on the EndoScene dataset. These results demonstrate that Polyp-Mamba’s performance remains stable across different input sizes and consistently outperforms other models in all cases.

  3. Loss Functions (r1): The combined loss functions used in our experiments are consistent with those employed in PolypPVT and PraNet. The Binary Cross-Entropy Loss, widely used for binary segmentation tasks, effectively measures the pixel-wise error between predictions and ground truths. Additionally, the Weighted IoU Loss addresses class imbalance issues, ensuring that regions with fewer pixels, such as polyps, receive appropriate attention during training. This combination allows for accurate and balanced segmentation performance.

  4. Major (r3): (1) The semantic inconsistency between different layers is primarily due to differences in feature abstraction levels, receptive field sizes, information processing, and fusion methods. Cross-scale dependency refers to a model’s ability to process features at different scales. Addressing cross-scale dependency means that the model can effectively combine and utilize information from various scales to capture features of an object at different resolutions. (2) The SAS module processes input features through multiple VSS blocks (from VMamba), which are designed to handle a wide range of contextual information, thereby capturing semantic information at different scales. By exchanging information and integrating semantics across multiple scales, the SAS module enhances the model’s understanding of multi-scale semantic information. (3) The GSI module uses a cross-attention mechanism to inject global semantic information into local features, allowing the model to combine global and local features effectively. This process resolves semantic gaps before feature fusion, helping to maintain consistent feature representation across different scales.

  5. Minor (r3): Thank you for your suggestions. We will: (1) Simplify the module structure and add annotations and explanations; (2) Rewrite Section 2.3 to emphasize the main structure and provide a clearer description; (3) Add feature visualization diagrams to illustrate semantic gaps and their solutions; (4) Include visualizations and analyses of hidden layer features in the experimental section to demonstrate the improvements from cross-scale feature fusion.

  6. Methods for Handling Variations in Imaging Conditions During Colonoscopy (r4): The approach includes multi-scale feature fusion and data augmentation. Polyp-Mamba uses the SAS and GSI modules for multi-scale feature fusion, effectively handling variations in image quality or blurriness. The SAS module captures semantic information at different scales, enhancing the model’s robustness to features under various imaging conditions. Additionally, we use data augmentation techniques such as rotation, translation, and blurring during training to simulate different imaging conditions, improving the model’s adaptability.
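  As a rough illustration of the loss and metric discussed in items 2 and 3, the sketch below writes a weighted binary cross-entropy, a weighted IoU loss, and the Dice score in plain Python over flattened pixel lists. The function names, the smoothing constant, and the per-pixel weights are assumptions made for this example. The feedback does not state the weighting factor, so the weights are passed in as an argument rather than derived from the mask, and the code should not be read as the authors' implementation.

```python
import math


def weighted_bce(probs, targets, weights, eps=1e-7):
    """Weighted binary cross-entropy, normalized by the sum of the weights."""
    total = 0.0
    for p, t, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp so log() never sees 0
        total += -w * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / sum(weights)


def weighted_iou_loss(probs, targets, weights, smooth=1.0):
    """1 - weighted soft IoU; `smooth` (assumed value) avoids division by zero."""
    inter = sum(w * p * t for p, t, w in zip(probs, targets, weights))
    union = sum(w * (p + t) for p, t, w in zip(probs, targets, weights)) - inter
    return 1.0 - (inter + smooth) / (union + smooth)


def combined_loss(probs, targets, weights=None):
    """Sum of the two terms; uniform weights are used when none are given."""
    if weights is None:
        weights = [1.0] * len(probs)
    return (weighted_bce(probs, targets, weights)
            + weighted_iou_loss(probs, targets, weights))


def dice_score(pred_mask, true_mask, eps=1e-7):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred_mask, true_mask))
    total = sum(pred_mask) + sum(true_mask)
    return (2.0 * inter + eps) / (total + eps)
```

  Calling combined_loss(probs, targets) with no weights reduces to an unweighted BCE plus soft IoU loss. Passing larger weights for selected pixels (for example, those belonging to small polyp regions) increases their share of both terms, which is the behaviour item 3 attributes to the weighted IoU loss.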




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The three reviewers gave detailed comments on the paper, covering the input size used when comparing models, the efficiency of the method, and the design of the modules. The authors replied to all of the reviewers’ comments in detail, but none of the three reviewers commented on the final rebuttal. The Area Chair carefully reviewed the authors’ rebuttal and concluded that the authors had addressed every reviewer’s concern, and that this work is the first to introduce Mamba to polyp segmentation and achieves new SOTA performance. The novelty is sufficient.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The novelty of the method and evaluation results in the experiments are good enough.



