Abstract

Recently, deep learning-based methods have demonstrated effectiveness in the diagnosis of polyps, which holds clinical significance in the prevention of colorectal cancer. These methods can be broadly categorized into two tasks: Polyp Segmentation (PS) and Polyp Detection (PD). The advantage of PS lies in precise localization, but it is constrained by the contrast of the polyp area. On the other hand, PD provides the advantage of a global perspective but is susceptible to issues such as false positives or missed detections. Despite substantial progress in both tasks, there has been limited exploration of integrating them. To address this problem, we introduce QueryNet, a unified framework for accurate polyp segmentation and detection. Specifically, QueryNet is built on top of Mask2Former, a query-based segmentation model. It conceptualizes object queries as cluster centers and constructs a detection branch to handle both tasks. Extensive quantitative and qualitative experiments on five public benchmarks verify that this unified framework effectively mitigates the task-specific limitations, thereby enhancing the overall performance. Furthermore, QueryNet achieves performance comparable to state-of-the-art PS and PD methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1037_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/JiaxingChai/Query_Net

Link to the Dataset(s)

https://github.com/DengPingFan/PraNet

BibTex

@InProceedings{Cha_QueryNet_MICCAI2024,
        author = { Chai, Jiaxing and Luo, Zhiming and Gao, Jianzhe and Dai, Licun and Lai, Yingxin and Li, Shaozi},
        title = { { QueryNet: A Unified Framework for Accurate Polyp Segmentation and Detection } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper focused on automated polyp segmentation and detection. The authors proposed a new CNN model called QueryNet.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The early segmentation and detection of polyps is an interesting idea for multiple medical fields.
    • The paper introduces a Mask-refinement Transformer Decoder to improve the quality of feature maps.
    • The proposed method was trained and tested on different datasets.
    • The results are better than those of conventional methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors didn’t explain the reason for using multiple loss functions for segmentation and detection.
    • The results shown in Fig. 3 are unclear.
    • The authors didn’t describe how the differently sized feature maps F2, F3 and F4 fit the mask-refinement transformer decoder.
    • The use of F4 and squeeze-and-excitation in the mask-refinement transformer decoder is unclear.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The roles of the different stages of the model (F1, F2, F3 and F4) are unclear. Please clarify the reason for using low- and high-level features in the mask-refinement transformer decoder.
    • The use of squeeze-and-excitation is unclear. If the model didn’t use SE, how would the results be affected?
    • Please clarify the need for multiple loss functions.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper omits too many details of the proposed method for the results to be trusted. In addition, the code will not be provided.

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors answered my questions. However, given the lack of novelty and the absence of additional experimental results, my rating has not changed.



Review #2

  • Please describe the contribution of the paper

    The manuscript introduces a unified model that seamlessly integrates polyp segmentation with detection. Drawing inspiration from the Mask2Former architecture, the authors have cleverly incorporated an additional polyp detection branch. To enhance the accuracy of the attention mask within the Mask2Former framework, the MrTD module has been developed. This innovative model has demonstrated state-of-the-art performance in its field.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript is well-structured and accessible, with a clear narrative that guides the reader through the research objectives and findings. The experimental section is robust, providing ample evidence to support the claims made by the authors. The quantitative analysis presented is particularly compelling, as it clearly illustrates the superior performance of the proposed method over other competitors, both in seen and unseen datasets. Furthermore, the ablation study adeptly identifies the significance of each component within the proposed network, shedding light on their individual contributions to the overall model performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper presents a modification to the existing Mask2Former model by incorporating a small detection layer and an attention map. While the performance improvements are commendable, the contribution of the proposed method itself, as distinct from the foundational Mask2Former architecture, is not fully elucidated. Moreover, the clarity of the experimental settings is lacking. Specifically, the MrTD module is mentioned to be repeated L times, yet there is neither an explanation of the chosen value for L nor an analysis of the effects of varying this repetition on the model’s performance.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The substantial performance gains reported in the paper are impressive. However, it is not clear whether these improvements are primarily due to the strength of the backbone architecture (Mask2Former) or the novel modifications introduced by the authors. To address this, a comparative analysis with the baseline Mask2Former is necessary to isolate the contributions of the proposed enhancements. Additionally, the clarity and presentation of the figures need improvement. For instance, in the depiction of the MrTD module, the thin orange block should be clearly labeled, just as the larger block is, to avoid confusion and ensure that readers can easily understand the components being represented. In the qualitative results figure, it would be beneficial to annotate the case characteristics, such as complex texture or tiny size, to provide context and highlight the challenges addressed by the method.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The substantial performance gains reported in the paper are impressive. However, it is not clear whether these improvements are primarily due to the strength of the backbone architecture (Mask2Former) or the novel modifications introduced by the authors. To address this, a comparative analysis with the baseline Mask2Former is necessary to isolate the contributions of the proposed enhancements. Additionally, the clarity and presentation of the figures need improvement. For instance, in the depiction of the MrTD module, the thin orange block should be clearly labeled, just as the larger block is, to avoid confusion and ensure that readers can easily understand the components being represented. In the qualitative results figure, it would be beneficial to annotate the case characteristics, such as complex texture or tiny size, to provide context and highlight the challenges addressed by the method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the weakness part

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Based on Mask2Former, the authors proposed QueryNet which couples both polyp segmentation and detection tasks. The authors utilized instance-related data within object queries to design the detection branch, enabling segmentation to leverage intricate contextual relationships. Furthermore, they enhanced segmentation’s support for detection by introducing the Mask-refinement Transformer Decoder, which refines feature representation derived from the segmentation-focused transformer decoder. Experiments show that the proposed method achieves state-of-the-art performance on five public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The analysis of the current method is outstanding and the motivation is solid.
    2. The writing is good and the performance is promising.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The beginning of Sec. 2.2 is unclear especially when it comes to the description of the correlation between the original transformer decoder and Fig. 2. Also, a brief explanation of the object query is necessary.
    2. The experiment scale is too small to be impactful. The authors should conduct the experiments on larger datasets, such as [1*], to prove its robustness and generalization ability.
    3. Sec. 3.4 is too long and repeats too much information that is already conveyed by Tab. 3.
    [1*] Video Polyp Segmentation: A Deep Learning Perspective
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Since the authors mention the proposed MrTD could solve the discontinuities of the receptive field, the authors should present the corresponding experiments or visualization to prove its efficacy.
    2. It would be better to use different symbols for two Qs (before/after MrTD ) in Fig. 2.
    3. Could the authors explain why the performance is less satisfactory on CVC-ClinicDB and Kvasir-SEG compared to DETR?
    4. The reviewers would like to know why the model performs worse (Re.) on row 3, Tab. 3.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall pipeline is novel and the proposed module is effective. However, the experiments can be refined as mentioned in the above weakness.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    Submission 1037 builds upon Mask2Former and proposes a unified method to segment and detect polyps, claiming to thereby mitigate some drawbacks of separate approaches. The claim is supported by the experiments; the results demonstrate state-of-the-art performance in comparison to recent benchmarks across five datasets. Beyond this, a Mask-refinement Transformer Decoder is presented to leverage segmentation features for detection.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • state-of-the-art performance in comparison to a current selection of application benchmarks and on five datasets
    • good selection of benchmarks and five datasets employed
    • the concept of leveraging segmentation for detection is interesting
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The main research motivation, namely that individual detection and segmentation have drawbacks, is merely shown anecdotally; more evidence is required for this point.
    • While the selected benchmarks are suitable for the application, DETR is only employed for detection, although this method could also be leveraged for segmentation. DETR remains the second best (sometimes best) detection method, so I wonder how it would perform on segmentation.
    • The performance improvements are incremental. What is their relevance for the application?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • More discussion of the potential, and in particular the limitations, of this unified approach and of why it works would be interesting.
    • The Figure captions could contain a bit more information, they are fairly short.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Leveraging segmentation features for detection is interesting; however, it is not clear how the incremental performance improvements are relevant to the application. While the benchmarks are generally well chosen, DETR can be employed for both detection and segmentation, yet it was only employed for detection in this study, and using it for both would undermine the novelty of the unified approach. The individual drawbacks of segmentation and detection are shown anecdotally; the claims around this could use more evidence to motivate the study.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My concerns have been sufficiently addressed.




Author Feedback

We thank area chairs and reviewers for their valuable time and constructive comments. We address key concerns as follows:

Common concern: Reproducibility and additional experiments: We will release the GitHub link in the final paper. More results with new methods and results on new datasets will also be posted on GitHub.

To Reviewer#1: 1. Experimental settings: For fairness, we made no parameter changes relative to the baseline. L is the number of times Q learns from the features; if L=6, Q interacts with F1, F2, F3, F1, F2, F3 in turn. We apologize for the missing parameter analysis, which was omitted due to paper length limits. More results and analysis will be released on GitHub. 2. Baseline analysis: In our ablation experiments, No. 1 shows Mask2Former’s performance. The baseline is powerful, but the other experiments demonstrate the effectiveness of our modifications.
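The layer schedule described in point 1 can be sketched as a simple round-robin over the feature maps. This is an illustrative sketch, not the authors' code; the function name `feature_schedule` and the string labels are hypothetical.

```python
def feature_schedule(num_layers, feature_names=("F1", "F2", "F3")):
    """Return which feature map each of the L decoder layers attends to.

    Assumption: layers cycle through the feature maps in order, as the
    rebuttal describes for L=6.
    """
    return [feature_names[i % len(feature_names)] for i in range(num_layers)]


schedule = feature_schedule(6)
print(schedule)
```

With `num_layers=6` the schedule visits each of the three feature maps twice, in the order stated in the rebuttal.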

To Reviewer#3: 1. Multiple loss functions: Dice, BCE, GIoU, and L1 losses are commonly used in previous PS and PD works. Dice loss produces smooth boundaries, BCE loss improves binary classification accuracy, GIoU loss emphasizes shape matching, and L1 loss is robust to outliers and aids box estimation. All of these are needed in PS and PD, so we use a combined loss function. 2. Low- and high-level features: Low-level features capture fine details, while high-level features capture broader context. By continuously looping between low- and high-level features, queries can learn a more balanced representation, which aids accurate and contextually relevant generation. 3. SE layer: G aims to fuse multi-scale receptive fields to generate the refined mask. However, successive interpolations in the backbone may shift some key feature corners, leading to poor fusion. Hence, we use SE to align features channel-wise and spatially. Without SE, the results decrease slightly. 4. Details of CA: In CA, q=Q (shape bnq) and k=v=F_i (shape bqhw). sim=qk (bnq,bqhw->bnhw), att=sim·v^T (bnhw,bqhw->bnq). So different feature sizes don’t affect the process or the parameters.
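The einsum shapes given in point 4 can be checked with a short NumPy sketch. It is a sketch under stated assumptions rather than the authors' implementation: the function name `cross_attention` is hypothetical, the channel axis is written `c` instead of `q` to avoid clashing with the query tensor, and the softmax over spatial positions with `1/sqrt(c)` scaling is an assumption that the rebuttal does not state.

```python
import numpy as np


def cross_attention(Q, F):
    """Q: (b, n, c) object queries; F: (b, c, h, w) feature map."""
    # Similarity between every query and every spatial location.
    sim = np.einsum("bnc,bchw->bnhw", Q, F)
    b, n, h, w = sim.shape
    c = Q.shape[-1]
    # ASSUMPTION: scaled softmax over the h*w positions (not in the rebuttal).
    flat = sim.reshape(b, n, h * w) / np.sqrt(c)
    flat = flat - flat.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(flat)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    weights = weights.reshape(b, n, h, w)
    # Aggregate the feature vectors back into query space.
    return np.einsum("bnhw,bchw->bnc", weights, F)


rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 5, 8))
out_small = cross_attention(Q, rng.standard_normal((2, 8, 4, 4)))
out_large = cross_attention(Q, rng.standard_normal((2, 8, 16, 16)))
print(out_small.shape, out_large.shape)
```

Both calls return a tensor with the same shape as `Q`, and the function has no weights that depend on `h` or `w`, which is consistent with the claim that the spatial size of F_i affects neither the process nor the parameters.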

To Reviewer#4: 1. Transformer decoder: It consists of L basic operations, each of which consists of CA, SA and FFN. Q and the feature F_i interact in CA. Attention masks M are also used in CA to accelerate convergence. M is obtained by multiplying Q and F4. However, this causes discontinuities in M during convergence. That’s why we proposed G to generate the refinement masks. 2. Query: We can treat queries as instance samples. Each contains multiple pieces of information about one instance. This is why we can apply seg and det supervision to it simultaneously. 3. Quantitative analysis: Each query represents a single polyp, so all bbox labels are 1. The number of queries, i.e. the number of bboxes, is fixed, which means TP+FN is constant. So when our TP is low, our recall is significantly lower. The result in Tab. 3, row 3 is interesting. As explained above, TP has a significant impact on recall. Seg may cause the model to focus on particular texture features, which may shift the bboxes and result in a lower TP rate.
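The recall argument in point 3 follows from the standard definition recall = TP / (TP + FN). A small arithmetic sketch makes the dependence explicit; the total of 100 boxes and the TP counts below are hypothetical numbers chosen for illustration, not results from the paper.

```python
def recall(tp, fn):
    """Standard recall: true positives over all positives (TP + FN)."""
    return tp / (tp + fn)


# ASSUMPTION: TP + FN is held fixed at 100, as the rebuttal argues.
total_positives = 100
recalls = {tp: recall(tp, total_positives - tp) for tp in (90, 80, 70)}
for tp, value in recalls.items():
    print(f"TP={tp:3d}  recall={value:.2f}")
```

Because the denominator is fixed, every lost true positive lowers recall by the same amount, so a small shift in predicted boxes that turns matches into misses shows up directly in the recall column.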

To Reviewer#5: 1. Potential and limitation: Unified frameworks are currently mainstream, and our work shows that a unified framework is also feasible in the polyp field. Furthermore, we will attempt to use bboxes to identify potential lesions, not just polyps, which is more clinically meaningful. As shown in the paper, our model failed on small, concealed polyps. This is also an urgent problem in the polyp field, and we will conduct further research on it. 2. Fairness: We used MMDetection to reproduce the det methods. It’s a pity that there is no seg version of DETR in MMDet. We will reproduce segDETR’s code and release the full results on GitHub.

The above considerations will be addressed in the manuscript. In addition, we will add as many details as possible to the experiments given the space constraints. We thank the reviewers for the constructive feedback and the opportunity to clarify some of the misunderstandings and improve our paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    accepts

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    accepts



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


