Abstract

Reliable polyp segmentation in colonoscopy videos is crucial for early detection and prevention of colorectal cancer. While deep learning-based segmentation models show promise, their performance can be inconsistent, and robust methods for assessing segmentation quality without ground-truth annotations are lacking. This paper presents a novel quality control framework for polyp segmentation that leverages the temporal consistency inherent in colonoscopy videos. Our framework utilizes the Segment Anything Model 2 (SAM2), a powerful video segmentation foundation model, to propagate segmentation predictions between adjacent frames. By evaluating the consistency between these propagated segmentations and the original model predictions, we obtain an unsupervised Segmentation Quality Assessment (SQA) score for each frame. Furthermore, we introduce a re-segmentation module that refines low-quality segmentations by leveraging information from high-quality frames, identified based on their SQA scores. Experiments on the SUN-SEG and PolypGen datasets demonstrate a moderate to strong correlation between the SQA scores produced by our framework and the ground-truth segmentation quality. The re-segmentation module also improves overall segmentation performance without requiring model retraining or fine-tuning. This work represents a step towards building more reliable and trustworthy AI-assisted colonoscopy systems. The code is available at https://github.com/LYJ-NJUST/Seg-Quality-Control.
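
To make the consistency-based scoring concrete, below is a minimal sketch of how such a per-frame SQA score could be computed. The helper names (dice, sqa_scores), the propagate_fn callable standing in for SAM2's mask propagation, and the default window size are illustrative assumptions, not the paper's exact implementation.

import numpy as np

def dice(a: np.ndarray, b: np.ndarray, eps: float = 1e-6) -> float:
    # Dice overlap between two binary masks.
    inter = np.logical_and(a, b).sum()
    return (2.0 * inter + eps) / (a.sum() + b.sum() + eps)

def sqa_scores(pred_masks, propagate_fn, window=2):
    # Per-frame SQA: mean Dice between the model's prediction at frame i and
    # the masks propagated to frame i from its temporal neighbors.
    # propagate_fn(mask_j, j, i) is assumed to return mask_j carried from
    # frame j to frame i (e.g., via SAM2 video propagation).
    n = len(pred_masks)
    scores = []
    for i in range(n):
        neighbors = [j for j in range(max(0, i - window), min(n, i + window + 1)) if j != i]
        dices = [dice(propagate_fn(pred_masks[j], j, i), pred_masks[i]) for j in neighbors]
        scores.append(float(np.mean(dices)) if dices else 0.0)
    return scores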

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2132_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/LYJ-NJUST/Seg-Quality-Control

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiYuj_Unsupervised_MICCAI2025,
        author = { Li, Yujia and Zhou, Tao and Wang, Ruixuan and Wang, Shuo and Zhang, Yizhe},
        title = { { Unsupervised Quality Control and Enhancement of Polyp Segmentation in Colonoscopy Videos using Spatiotemporal Consistency } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {605--615}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes a method to extract a segmentation quality assessment (SQA) score for polyps in colonoscopy. This score would provide a metric to establish confidence in the accuracy of segmentation models in clinical practice and is used in the proposed work to identify and refine low-quality segmentations.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Proposed SQA score is well validated through thoughtful experiments
    • Presents downstream application of further segmentation refinement, highlighting the usefulness of SQA in the clinical workflow
    • Unsupervised assessment, so no need for ground-truth annotations
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Unclear clinical relevance to polyp detection
    • Limited/vague discussion of limitations of prior work
    • Insufficient evidence for generalizability, other than the use of a general-purpose foundation model (SAM2)
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clinical motivation is centered around the high polyp miss rate, but it’s unclear how the proposed segmentation metric (SQA) addresses this issue. Considering the algorithm relies on a specialized polyp segmentation model (HSNet) for further prompting, it seems that detection is dependent on this model rather than the proposed work. A clearer explanation of how segmentation quality correlates with improved detection or decision-making is necessary to understand the clinical relevance of this work.

    Despite the unclear value in the clinical setting, the presented results do effectively demonstrate that the proposed SQA score correlates with the accuracy of segmentation outputs (Table 1 and Figure 2). Given that the SQA score guides the re-segmentation process, the reported improvements support the claim that this score helps identify and refine low-quality predictions. This highlights the technical soundness of the approach, even if its clinical implications require further clarification. While this does not translate directly to improved detection, identifying and refining low-confidence predictions still has value in translating such models to the clinical setting.

    While the proposed SQA score appears to be a valid representation of prediction quality, the broader claim of generalizability by re-segmenting with a general-purpose model (SAM2) is also not sufficiently supported. There is only vague discussion of the limitations/failure cases of related work (“unreliable segmentations”) without specifying what errors cause unreliable results. As a result, if the intended clinical contribution is reframed as improving segmentation quality rather than detection, it remains unclear how the proposed approach meaningfully advances beyond existing methods. The main evidence provided for this is improved Dice scores after re-segmentation, which reflect overlap accuracy but do not directly demonstrate improved reliability or applicability across diverse datasets. This makes it difficult to fully assess the value of the proposed SQA metric.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I appreciate the clarifications provided in the rebuttal. However, my primary concerns about the clinical relevance of quantifying polyp segmentation quality remain mostly unaddressed. There is limited quantitative evidence demonstrating how the segmentation quality score and re-segmentation address the key problem of missed polyps during detection (false negatives). The only quantitative results related to re-segmentation are reported using Dice scores, which reflect pixel-wise overlap but do not directly indicate whether detection performance improves. Considering additional experiments/results are prohibited, I do not find sufficient grounds to change my original recommendation.



Review #2

  • Please describe the contribution of the paper
    • Segmentation without additional fine-tuning: The paper adopts two models, a specialized polyp segmentation model (loaded with pre-trained polyp segmentation weights) and a general-purpose SAM model, without requiring further fine-tuning.
    • Video segmentation quality assessment without relying on ground truth annotations: The method performs segmentation quality assessment across all video frames without the need for ground truth annotations.
    • Refinement using Dice score-based feedback: A refinement module enhances segmentation quality by identifying high-quality frames based on Dice scores and using them to guide further segmentation improvements.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Integrates polyp segmentation model and a foundation model (Novel approach to segment complete video): The paper combines a specialised polyp segmentation model with a general-purpose foundation model to segment entire videos effectively.
    • Segmentation quality assessment and refinement: Two modules are introduced to evaluate segmentation quality and refine the results using Dice scores and a subset of high-quality frames.
    • Does not require additional annotations: The proposed quality assessment and refinement process does not rely on any additional annotations.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Some parts of the paper are not clear and need a more detailed description:

    • Temporal window processing ambiguity: It appears that individual frames may be used multiple times across different temporal windows. However, the explanation around how m(j→i) is computed remains unclear. Including a diagram or a more detailed description would greatly enhance clarity.
    • Dependency on the initial segmentation model: If the quality assessment and Dice score computation rely on yi, which is produced by the initial polyp segmentation model in Step 1, then the final performance becomes inherently dependent on this initial model. The effectiveness of the proposed method under this constraint should be justified.
    • Issues with Table 1: The current caption of Table 1 does not align well with the content presented. It should be revised to better reflect the table’s purpose. Additionally, using full names for the SQA methods is recommended to avoid confusion with similarly named concepts. Highlighting the best-performing scores would also improve readability.
    • Minor: In section 3.2, “we examin” should be “we examine”.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach by combining a specialized polyp segmentation model with a general-purpose foundation model for video segmentation without requiring additional annotations. The integration of segmentation quality assessment and refinement is interesting and practical. However, the manuscript lacks clarity in some methodological details, particularly temporal window processing and the dependency on the initial segmentation model for quality assessment.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    While the temporal window processing details require clearer explanation, particularly regarding the reuse of individual frames across multiple temporal windows, the authors have acknowledged this and expressed willingness to revise. Despite this remaining ambiguity, the core idea is solid and innovative, and the results are convincing. Therefore, I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    The paper introduces an unsupervised framework for assessing and enhancing polyp segmentation quality in colonoscopy videos without the need for ground-truth annotations. It leverages temporal consistency by using SAM2 to propagate segmentation masks bidirectionally between frames, thereby computing a novel Segmentation Quality Assessment (SQA) score that reflects per-frame reliability. Frames with low SQA scores are then refined through a lightweight Re-Segmentation module that propagates high-quality segmentations from adjacent frames. This model-agnostic approach requires no retraining and is demonstrated to improve segmentation performance across multiple datasets and baseline models.
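
    As a rough illustration of the re-segmentation step summarized above, the sketch below thresholds the SQA scores and replaces each low-quality mask with one propagated from the nearest high-quality frame. The threshold value, the nearest-neighbor selection rule, and the propagate_fn callable (standing in for SAM2 propagation) are illustrative assumptions, not the paper's exact procedure.

    def resegment(pred_masks, scores, propagate_fn, threshold=0.8):
        # Frames whose SQA score falls below the threshold are refined by
        # propagating the mask of the nearest high-SQA frame.
        high = [i for i, s in enumerate(scores) if s >= threshold]
        if not high:
            return list(pred_masks)  # no reliable frame to propagate from
        refined = list(pred_masks)
        for i, s in enumerate(scores):
            if s < threshold:
                j = min(high, key=lambda k: abs(k - i))  # nearest high-SQA frame
                refined[i] = propagate_fn(pred_masks[j], j, i)
        return refined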

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Unsupervised Quality Assessment Without Ground Truth: The paper proposes a novel Segmentation Quality Assessment (SQA) mechanism that leverages temporal consistency to gauge segmentation reliability without relying on manual annotations. This is particularly valuable for medical video data, where extensive, frame-by-frame annotation is impractical and requires considerable clinical expertise.
    • Innovative Use of Foundation Models (SAM2): Rather than using SAM2 for direct segmentation, the authors employ it as a temporal mask propagation engine. This creative integration allows the framework to assess and refine the outputs of specialized segmentation models, exemplifying a synergistic approach between a general-purpose foundation model and task-specific models like HSNet.
    • Re-Segmentation via Temporal Refinement: The framework includes a lightweight Re-Segmentation module that improves low-quality segmentations by propagating reliable masks from adjacent high-quality frames. This temporal smoothing strategy is simple, effective, and requires no additional training data or manual labels.
    • Comprehensive Evaluation: The approach is evaluated on two diverse and challenging video datasets (PolypGen and SUN-SEG), across multiple baseline models. It shows consistent improvements in segmentation quality and strong correlation between the SQA (Segmentation Quality Assessment) score and actual segmentation performance.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Major Concerns:

    • Dependence on High-Quality Neighbors for Re-Segmentation: The efficacy of the Re-Segmentation module is heavily reliant on the availability of high-quality neighboring frames. In scenarios where a polyp appears only briefly or is sparsely represented in a video sequence, the framework may struggle to recover accurate masks if suitable reference frames are lacking. This limitation is not adequately discussed or quantified.

    2) Moderate Concerns:
    • Lack of Evaluation in Negative Frames (non-polyp): While the method is evaluated on full-length colonoscopy videos, the paper does not explicitly assess performance in frames without visible polyps. Since robust polyp segmentation systems should also reliably avoid false positives in normal mucosa, a detailed analysis of the method’s behavior in such cases would enhance its clinical relevance.
    • Potential Semantic Drift During Propagation: The use of SAM2 for mask propagation could be susceptible to semantic drift, especially in scenarios with rapid changes in illumination or texture typical of colonoscopy videos. Without an explicit mechanism to suppress spurious predictions, there is a risk of introducing false positives or over-segmentation when propagating high-quality masks to frames with significantly different content.

    3) Minor Concerns:
    • Core Components: Although the framework is well-structured, many of its core components rely on existing techniques. For example, temporal mask propagation using mechanisms similar to those in SAM2 has been explored in prior work. A more in-depth discussion of how this formulation meaningfully differs from earlier efforts would strengthen the paper.
    • Reference Renumbering: The references are not arranged sequentially according to their order of appearance (for example, the introduction starts with “[16]” instead of “[1]”). They need to be renumbered to follow proper order (see LNCS style).
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a novel and practical unsupervised framework for assessing and improving polyp segmentation in colonoscopy videos. Its key strength lies in the use of temporal consistency (via SQA) and SAM2-based propagation to refine predictions without requiring ground truth labels or retraining. The method is model-agnostic, scalable, and shows consistent performance gains across multiple datasets and segmentation models. However, the paper lacks clarity on how non-polyp frames are handled, which is important in clinical settings, and does not commit to releasing code or pretrained models, limiting reproducibility. Addressing these points in the rebuttal would significantly strengthen the paper’s contribution.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors provide a well-structured and thorough rebuttal that addresses the reviewers’ major concerns. They offer clarifications, outline future directions, and express a strong commitment to improving the final version of the manuscript. Their commitment to open sourcing the code further enhances the reproducibility and potential impact of this work. Based on the quality of the rebuttal and the method’s relevance for post-procedural analysis and quality control, I now recommend acceptance. Importantly, the authors mention updates to Tables 1 and 2 to reflect false positive and false negative corrections. However, they do not clarify whether additional quantitative metrics (e.g., F1-score, precision, recall) will be included. It would also be important to elaborate more clearly on how their proposed approach compares to previous methods. Including these elements in the camera-ready version would be a valuable improvement.




Author Feedback

We sincerely thank all reviewers for their constructive feedback and recognition of our contributions. Below, we provide detailed responses to the reviewers’ comments and concerns.

  1. Methodological Clarity, Dependencies, and Limitations (R1, R2): We will revise the paper with a clearer explanation of the temporal window processing. Additionally, we will address the noted issues in Table 1, correct typographical errors, and ensure proper reference numbering. Our method, functioning as a post-processing quality assessment and enhancement tool, operates on segmentation outputs from initial segmentation models. We evaluated our method on multiple state-of-the-art polyp segmentation models (including HSNet, PolypPVT, and CFANet) and demonstrated consistent performance improvements. Our approach leverages the spatiotemporal consistency inherently present in endoscopic video sequences, enabling effective segmentation quality assessment and refinement for video polyp segmentation. Although the method’s effectiveness may diminish if the initial segmentation model performs very poorly, all currently tested models demonstrate reasonably good performance empirically, making this limitation negligible in practice. To mitigate potential semantic drift, we constrain propagation within a fixed temporal window of appropriate size, set to 10 in all experiments. Adapting the temporal window dynamically based on the characteristics of individual videos and scenarios is a promising direction for future work.

  2. Clinical Relevance, False Positive/Negative Handling (R2, R3): Our goal is to enhance the reliability and quality of segmentation outputs. Clinically, unreliable segmentations can negatively impact decision-making; thus, an automatic measure of segmentation quality is highly valuable. The proposed SQA score acts as an actionable alert system, highlighting potentially inaccurate segmentations for further clinical review.
    • False Positives (FPs): In practice, false-positive segmentations frequently occur due to transient artifacts, reflections, or other inconsistent imaging anomalies. Such inconsistent segmentations typically result in lower temporal consistency, which can be effectively captured by our SQA scores.
    • False Negatives (FNs): False negatives, where polyps are missed, pose significant clinical risks. If these missed polyps are successfully identified in nearby frames within the propagation window, our method is capable of recovering some of the missed segmentations during the re-segmentation step.
    • We will update Table 1 and Table 2 to report how many FPs and FNs are identified and fixed.
  3. Generalizability and Advancement over Prior Work (R3): Our proposed framework uniquely advances beyond previous image-level segmentation quality assessments by introducing an unsupervised, video-based segmentation quality assessment and refinement mechanism suitable for endoscopic videos. We have demonstrated robust generalizability across multiple segmentation models (HSNet, PolypPVT, CFANet) in combination with various foundation models (SAM2, SAMURAI), as presented in Tables 2 and 3. In revision, we will improve our discussion of how our explicit utilization of video-specific spatiotemporal consistency distinguishes our approach from prior methods for segmentation quality assessment and refinement.

  4. Real-time Applicability (Meta): This is an excellent point. The current bidirectional approach is designed for retrospective or near-offline use cases, such as post-procedure review, quality assurance, and training purposes, as it utilizes future frames for enhanced accuracy. For real-time usage, our method could still offer value with further adjustments, such as using unidirectional propagation and/or a smaller temporal window to achieve near-instantaneous feedback. In revision, we will clarify current limitations regarding real-time applicability and potential adaptations.
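
    For intuition only, the sketch below contrasts the bidirectional neighbor selection used in the offline setting (a fixed window, set to 10 in the rebuttal) with a unidirectional, past-only variant for near-real-time use. The function name and parameters are illustrative assumptions rather than the paper's implementation.

    def neighbor_frames(i, n, window=10, bidirectional=True):
        # Indices of frames that masks are propagated from for frame i.
        lo = max(0, i - window)
        hi = min(n, i + window + 1) if bidirectional else i  # past-only if unidirectional
        return [j for j in range(lo, hi) if j != i]

    # Offline review: past and future frames within the window, e.g.
    # neighbor_frames(50, n=1000) -> frames 40..60 (excluding 50).
    # Real-time variant: only past frames are available, e.g.
    # neighbor_frames(50, n=1000, bidirectional=False) -> frames 40..49.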

Code for all experiments will be made publicly available.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    Reviewers appreciated the original idea of combining a general-purpose large segmentation model (SAM) with a specific polyp segmentation model to detect and refine defective segmentations. Although there is slightly more support for acceptance than for rejection, I do think that this paper would benefit from having a second discussion stage. Specifically, I agree with reviewers on the concern about the clinical significance of the proposed technique. The authors might want to focus their rebuttal on how their method handles false positives (what happens if there is no polyp?) and on the clinical interest of this approach, given that the method needs to look bidirectionally at neighboring frames in order to proceed, whereas we would ideally want a real-time video segmentation tool with quality feedback, in which case no future information would be available, correct?

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a novel framework for post-procedural polyp segmentation quality assessment and re-segmentation, addressing an under-explored yet important problem in endoscopic image analysis. The method is well-motivated, technically sound, and supported by convincing experimental results on segmentation quality metrics. The authors have provided a clear and thorough rebuttal that addresses most reviewers’ concerns and demonstrates a strong commitment to improving the final version of the manuscript. Overall, the paper makes a valuable contribution to quality assessment in medical image segmentation, with potential broader impact in post-procedural analysis. I recommend acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


