Abstract

Polyp segmentation in colonoscopy images is essential for preventing colorectal cancer (CRC). Existing polyp segmentation models often struggle with costly pixel-wise annotations. Conversely, datasets can be annotated quickly and affordably using weak labels like points. However, utilizing sparse annotations for model training remains challenging due to the limited information. In this study, we propose a TextPolyp approach to address this issue by leveraging only point annotations and text cues for effective weakly-supervised polyp segmentation. Specifically, we utilize the Grounding DINO algorithm and Segment Anything Model (SAM) to generate initial pseudo-labels, which are then refined with point annotations. Subsequently, we employ a SAM-based mutual learning strategy to effectively enhance segmentation results from SAM. Additionally, we propose a Discrepancy-aware Weight Scheme (DWS) to adaptively reduce the impact of unreliable predictions from SAM. Our TextPolyp model is versatile and can seamlessly integrate with various backbones and segmentation methods. More importantly, the proposed strategies are used exclusively during training, incurring no additional computational cost during inference. Extensive experiments confirm the effectiveness of our TextPolyp approach.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1293_paper.pdf

SharedIt Link: https://rdcu.be/dV59k

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_66

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1293_supp.pdf

Link to the Code Repository

https://github.com/taozh2017/TextPolyp

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_TextPolyp_MICCAI2024,
        author = { Zhao, Yiming and Zhou, Yi and Zhang, Yizhe and Wu, Ye and Zhou, Tao},
        title = { { TextPolyp: Point-supervised Polyp Segmentation with Text Cues } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {711--722}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper achieves point-supervised polyp segmentation via text cues. The proposed TextPolyp utilizes large models to generate pseudo-labels, and then employs a mutual learning strategy to boost segmentation capabilities. In addition, a Discrepancy-aware Weight Scheme (DWS) is introduced to adjust weights. TextPolyp obtains satisfactory performance on several datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It is a good idea to utilize large models for generating initial pseudo-labels.
    2. The proposed method seems to have enough generalization ability.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There are numerous grammatical errors in this paper, such as in the caption of Fig. 4 and on Page 4, Line 1. This weakens the rigor of the paper.
    2. I find that the correspondence between the illustrations and the text descriptions is poor, and some details of the proposed method are missing. As a result, it is hard to understand the proposed method comprehensively.
    3. The baseline and compared methods shown in Table 1 and Table 2 are outdated; the authors should include more recent approaches.
    4. More ablation studies are needed for an in-depth analysis of the proposed method, such as varying the hyper-parameter \alpha.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The presentation and experiments are not sufficient, and I believe this paper needs to be further polished.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The presentation and experiments are not sufficient.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Some concerns still remain, so I keep the original rating.



Review #2

  • Please describe the contribution of the paper

    The paper presents an approach for polyp segmentation that utilizes a weakly supervised methodology combining SAM (Segment Anything Model) and Grounded DINO with a base segmentation network. The proposed method leverages foreground/background point supervision alongside descriptive text cues about polyps. Training is performed using pseudo labels, with the aim to reduce the reliance on fully supervised, pixel-level annotations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper tackles the challenging issue of reducing dependency on detailed, pixel-level annotations for polyp segmentation.

    • The study demonstrates superiority over several state-of-the-art methods that use point and scribble supervision. These models are retrained using point supervision on polyp segmentation datasets, showcasing clear performance enhancements.

    • The paper reports results on multiple segmentation backbones, thus highlighting the impact of different base segmentation networks. This includes specific models tailored for polyp segmentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper lacks clarity on whether a standard text prompt describes all images or if prompts are image-specific. The impact of text cues on images without polyps is also not evaluated.

    • The interaction between SAM in pseudo label generation and mutual learning is not well explained. There is a confusion about the roles and integration of the base segmentation network and SAM, especially in their contribution to the final model outputs.

    • Certain choices in the model training process, such as only fine-tuning last layers of the encoder and not the decoder, or the specific use of gamma correction, lack sufficient justification.

    • The paper does not provide enough comparative analysis with both fully supervised and other weakly supervised polyp segmentation models, which limits understanding of the proposed method’s relative performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses

    Minor:
    • page 4: “refined boxes boxes”: remove the second “boxes”
    • page 4: “the with and height” -> “the width and height”
    • page 5: “prompt oo ensure” -> “prompt to ensure”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The rating of “Weak Accept” is given due to the paper’s approach to a significant problem and its potential impact on reducing the need for labor-intensive annotations in polyp segmentation, benefiting the early identification and eradication of polyps in colorectal cancer patients. However, the weaknesses, particularly regarding the clarity and justification of the methodology, prevent a stronger recommendation. Enhancing the exposition of the methods and providing more thorough comparative analyses could potentially elevate the value of this work within the MICCAI community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Authors addressed my questions. I keep my original rating.



Review #3

  • Please describe the contribution of the paper

    The paper proposes point-supervised polyp segmentation with text cues, utilizing Grounding DINO and SAM with semi-supervised techniques. An initial mask is generated using foundation models that output a segmentation mask from a text prompt. Consistency learning then improves segmentation performance, using the generated mask as a cue. In addition, discrepancy-based weighting helps training.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • In addition to using SAM, Grounding DINO helps to produce a better initial pseudo-mask. Combining two prompt-guided foundation models is interesting.
    • Experimental results indicate SOTA performance in comparisons with methods that use point-level supervision.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Is Grounding DINO accurate? Since Grounding DINO was developed for general images, its performance here relies on point supervision. The reviewer is interested in a discussion of Grounding DINO's performance on polyp datasets, and of the relation between the quality of the point annotations and the quality of the predicted bounding boxes.
    • How were the point annotations obtained? Did the authors add new point annotations? Since there seems to be no detailed description of the point annotation, I could not understand the process of bounding box elimination.
    • The performance of the proposed method may be further improved by using SAM-Med2. Will the proposed pseudo-label generation still work effectively when using such an improved SAM? The reviewer is interested in how this framework works outside of the proposed setting.
    • The fairness of comparisons. The authors retrain weakly-supervised methods, originally designed for scribble labels, using point-level annotations. Since these methods are designed for scribble labels, their performance may drop. The reviewer recommends that the authors also discuss the performance when learning with scribble labels.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The framework is complex, and it seems hard to reproduce. Therefore, the reviewer recommends uploading the source code for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Grounding DINO is a method for open-set domains, but the question remains as to how well it will work in the medical field. I think it will be easier to understand the contribution of the proposed method if the authors include examples of Grounding DINO’s performance and examples of the main trends.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Combining a text-driven foundation model and SAM is interesting, and the method improves performance compared with previous weakly supervised methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I was convinced by the authors’ comments.




Author Feedback

Thanks to the reviewers & ACs. Code & point-annotated data will be released.

To R1: [Text & impact of image w/o polyps] To simplify, a generic text is used to represent all polyp images. Our focus is on polyp segmentation rather than classification tasks, thus all used images contain polyps. In future work, if images without polyps are included, we will offer appropriate text cues for evaluation purposes.

[SAM-based mutual learning] We utilize a segmentation model to produce the base predicted maps, which serve as prompts for SAM to generate SAM-based maps. SAM-based maps provide supervision for the segmentation model. In this framework, the base segmentation model and SAM establish a mutual learning process, with SAM offering supplementary guidance beyond points. Also, the final results are derived from the base segmentation model.

[Certain choice] a) The setting to fine-tune the last two layers of the encoder is based on a comprehensive consideration of the model’s learning capacity and stability. We tested fine-tuning different layers and found that fine-tuning the last two layers of the encoder obtains better results. Also, the SAM decoder is relatively simple compared to the encoder, thus we directly fine-tune the whole decoder. b) Gamma correction is not mandatory; alternative image transformation methods can be used. When employing methods like Logarithmic Transformation, we still achieve promising performance.
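
The two image transformations named in point b) can be illustrated with a minimal sketch. It assumes intensities already scaled to [0, 1]; the value gamma=0.5, the log normalization, and the function names are illustrative assumptions and are not taken from the paper or the rebuttal.

```python
import math

def gamma_correct(pixels, gamma=0.5):
    """Apply out = in ** gamma to intensities in [0, 1].
    gamma=0.5 is an assumed value, not the authors' setting."""
    return [[p ** gamma for p in row] for row in pixels]

def log_transform(pixels):
    """Apply out = log(1 + in) / log(2), which maps [0, 1] onto [0, 1].
    The normalization by log(2) is an assumption made for this sketch."""
    return [[math.log1p(p) / math.log(2.0) for p in row] for row in pixels]

# A tiny 2x2 grayscale "image" with intensities in [0, 1].
image = [[0.0, 0.25], [0.5, 1.0]]
gamma_view = gamma_correct(image)
log_view = log_transform(image)
```

Both functions brighten mid-range intensities while leaving 0 and 1 fixed, which is why either could plausibly supply a second view of the same image.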

[Comparative analysis] Due to space limit, we are unable to provide results from fully-supervised methods (refer to [3,6,30]). We aim to propose a plug-and-play point-supervised polyp segmentation framework that has been validated for compatibility with various backbones and models.

To R3: [Errors] We will revise them in the final version.

[Illustrations] In Fig. 2, the caption is as follows: an image is initially input to the base segmentation model to generate S_base, which serves as prompts for SAM along with the original and gamma-corrected images as inputs, resulting in SAM-based maps (S_ori and S_gra). As a result, the base model and SAM form a collaborative learning framework, with SAM offering supplementary supervision signals beyond points. More details will be included in the illustrations in the final version.
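
One way to picture how disagreement between the two SAM-based maps could down-weight unreliable supervision is sketched below. This is only an assumption-laden illustration: the weight formula 1 - |S_ori - S_gra|, the averaged soft target, and the function names are hypothetical, and the paper's actual Discrepancy-aware Weight Scheme may differ.

```python
import math

def discrepancy_weights(s_ori, s_gra):
    """Per-pixel weight in [0, 1]: 1 where the two SAM maps agree,
    lower where they differ (hypothetical formula for illustration)."""
    return [1.0 - abs(a - b) for a, b in zip(s_ori, s_gra)]

def weighted_bce(s_base, target, weights, eps=1e-7):
    """Weighted binary cross-entropy of the base prediction against a soft target,
    normalized by the sum of the weights."""
    total = 0.0
    for p, t, w in zip(s_base, target, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / max(sum(weights), eps)

# Flattened probability maps for four pixels (made-up values).
s_ori = [0.9, 0.8, 0.1, 0.6]   # SAM output on the original image
s_gra = [0.9, 0.7, 0.1, 0.1]   # SAM output on the gamma-corrected image
s_base = [0.8, 0.6, 0.2, 0.5]  # base segmentation model output

weights = discrepancy_weights(s_ori, s_gra)
target = [(a + b) / 2.0 for a, b in zip(s_ori, s_gra)]  # assumed soft pseudo-label
loss = weighted_bce(s_base, target, weights)
```

In this sketch the fourth pixel, where the two maps disagree most, contributes least to the loss.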

[Compared models] In fact, we have included some recent methods (SCOD 2023 & PSOD 2022). Note that our TextPolyp is a plug-and-play module, and its effectiveness in seamlessly applying to various commonly used backbones and classic polyp segmentation methods has been validated.

[Effect of \alpha] Following [10] and tuning \alpha over various values, we set it to 0.85 for optimal results.

To R4: [Grounding DINO] We use Grounding DINO to produce the coarse bounding boxes, then we utilize point annotations to refine these boxes, reducing the effect of inaccurate boxes. Due to page limit, we will study its effect in future work.

[Point annotation & box elimination] Point annotation is based on the approximate center position of each polyp. The labeled dataset will be made publicly available. We utilize the foreground point p_a within the point label to identify the box containing polyps and exclude misidentified boxes, and use p_b to eliminate extraneous boxes.
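
The box elimination described above can be sketched as a simple containment test. The rebuttal does not spell out the exact rule, so this sketch assumes that a box is kept only if it contains the foreground point p_a and does not contain the background point p_b; the box format (x_min, y_min, x_max, y_max) and the function names are hypothetical.

```python
def contains(box, point):
    """Return True if point (x, y) lies inside box (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    x, y = point
    return x_min <= x <= x_max and y_min <= y <= y_max

def filter_boxes(boxes, p_a, p_b):
    """Keep boxes that contain the foreground point p_a and exclude
    boxes that contain the background point p_b (assumed rule)."""
    return [b for b in boxes if contains(b, p_a) and not contains(b, p_b)]

# Made-up candidate boxes, as a text-prompted detector might return them.
boxes = [(10, 10, 60, 60), (0, 0, 200, 200), (120, 120, 180, 180)]
p_a = (35, 40)    # foreground point near the polyp center
p_b = (150, 150)  # background point

kept = filter_boxes(boxes, p_a, p_b)
```

Here the second box is dropped because it is large enough to cover the background point, and the third because it misses the foreground point.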

[SAM-Med2] We focus on validating the effectiveness of our TextPolyp, rather than delving into the effects of different SAMs. In future work, we will study the effect of integrating SAM-Med2 into our method.

[Fairness] There are few available methods with point supervision, especially in the medical field. Thus, we expanded the scope of comparison to include some weakly-supervised methods originally designed for scribble annotations. Point annotations provide less supervision compared to scribbles, leading to a decrease in performance. Overall, we use the same point annotations for all compared methods in a relatively fair manner. In future work, we will also extend our model to scribble labels and compare it with scribble-based methods.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The decision was split: two Weak Accept (WA), and one Weak Reject (WR). All reviewers highlighted the interesting idea of combining two prompt-guided foundation models as a strength. The reviewers express concerns about the clarity of the paper. The rebuttal addressed many concerns, while Reviewer 3 feels some concerns still remain. The meta-reviewer recommends accepting this paper if there is some space. In the final version, the authors should clarify the points requested by the reviewers.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    It is a study on tumor segmentation using SAM. While it has shown improved performance compared to existing point annotation-based methods, the comparison techniques are somewhat outdated, and there is no comparison with supervised learning methods, making its actual effectiveness uncertain. Additionally, the explanation of the technique needs to be further elaborated.



