Abstract

Automated polyp counting in colonoscopy is a crucial step toward automated procedure reporting and quality control, aiming to enhance the cost-effectiveness of colonoscopy screening. Counting polyps in a procedure involves detecting and tracking polyps, and then clustering tracklets that belong to the same polyp entity. Existing methods for polyp counting rely on self-supervised learning and primarily leverage visual appearance, neglecting temporal relationships in both tracklet feature learning and clustering stages. In this work, we introduce a paradigm shift by proposing a supervised contrastive loss that incorporates temporally-aware soft targets. Our approach captures intra-polyp variability while preserving inter-polyp discriminability, leading to more robust clustering. Additionally, we improve tracklet clustering by integrating a temporal adjacency constraint, reducing false positive re-associations between visually similar but temporally distant tracklets. We train and validate our method on publicly available datasets and evaluate its performance with a leave-one-out cross-validation strategy. Results demonstrate a 2.2x reduction in fragmentation rate compared to prior approaches. Our results highlight the importance of temporal awareness in polyp counting, establishing a new state-of-the-art. Code is available at https://github.com/lparolari/temporally-aware-polyp-counting.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0667_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/lparolari/temporally-aware-polyp-counting

Link to the Dataset(s)

REAL-colon: https://plus.figshare.com/articles/media/REAL-colon_dataset/22202866

SUN database: http://amed8k.sundatabase.org/

LDPolypVideo: https://github.com/dashishi/LDPolypVideo-Benchmark

PolypsSet: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FCBUOR

BibTex

@InProceedings{ParLuc_TemporallyAware_MICCAI2025,
        author = { Parolari, Luca and Cherubini, Andrea and Ballan, Lamberto and Biffi, Carlo },
        title = { { Temporally-Aware Supervised Contrastive Learning for Polyp Counting in Colonoscopy } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {542 -- 552}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors claim the following contributions in this paper:

    1. Supervised Contrastive Learning Framework: Proposes a supervised contrastive loss framework leveraging polyp entity labels to replace existing self-supervised methods. This explicitly models intra-class relationships to enhance appearance invariance within the same polyp while preserving discriminability across distinct polyps.

    2. Temporally-Aware Soft Target Mechanism: Introduces temporal weighting into the contrastive loss to enforce consistency in the embedding space for temporally adjacent but visually dissimilar tracklets, addressing appearance variations caused by motion blur, occlusions, and other colonoscopy-specific challenges.

    3. Temporal-Constrained Tracklet Clustering: Integrates temporal adjacency constraints during clustering via a joint visual-temporal similarity matrix, suppressing false associations between visually similar but temporally disconnected tracklets.

    4. Cross-Dataset Training and Robust Validation: Combines multi-source public datasets (384 polyps) to improve generalization and adopts leave-one-out cross-validation (LOOCV) to demonstrate significant reductions in fragmentation rate (FR), outperforming state-of-the-art methods in real-world colonoscopy videos.
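    The soft-target mechanism in contributions 1 and 2 can be sketched as follows. This is an illustrative reconstruction, not the paper's exact Eqs. 1-2: the function names, the exponential decay over temporal distance, and the `tau` and `temperature` values are all assumptions made for the sketch.

    ```python
    import numpy as np

    def temporally_aware_targets(entity_ids, timestamps, tau=30.0):
        """Soft targets for a batch of tracklet embeddings.

        Tracklets from the same polyp entity are positives, weighted by an
        exponential decay over their temporal distance (the decay form and
        tau are illustrative assumptions).
        """
        entity_ids = np.asarray(entity_ids)
        timestamps = np.asarray(timestamps, dtype=float)
        positive = entity_ids[:, None] == entity_ids[None, :]
        dt = np.abs(timestamps[:, None] - timestamps[None, :])
        w = positive * np.exp(-dt / tau)   # temporally weighted positives
        np.fill_diagonal(w, 0.0)           # exclude self-pairs
        row = w.sum(axis=1, keepdims=True)
        # normalize so each anchor's soft targets sum to 1 (rows with no
        # positives stay all-zero)
        return np.divide(w, row, out=np.zeros_like(w), where=row > 0)

    def soft_contrastive_loss(embeddings, targets, temperature=0.1):
        """Cross-entropy between the softmax over cosine similarities and
        the soft targets (a SupCon-style objective with soft labels)."""
        z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        logits = z @ z.T / temperature
        np.fill_diagonal(logits, -np.inf)  # anchor never contrasts with itself
        logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        safe_logp = np.where(np.isfinite(logp), logp, 0.0)  # diag has target 0
        per_anchor = -(targets * safe_logp).sum(axis=1)
        has_pos = targets.sum(axis=1) > 0  # anchors without positives are skipped
        return per_anchor[has_pos].mean()
    ```

    With a hard positive mask instead of the decayed weights, this reduces to a standard supervised contrastive loss; the temporal decay is what pulls temporally adjacent tracklets of the same polyp closer than distant ones.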

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Major Strengths of the Paper:

    1. Temporally-Aware Supervised Contrastive Learning. Unlike prior self-supervised approaches, the paper pioneers a supervised contrastive learning framework that incorporates temporally weighted soft targets. By leveraging polyp entity information and weighting positive pairs based on their temporal proximity, the method explicitly addresses severe intra-polyp appearance variations (e.g., motion blur, lighting changes) while enhancing inter-polyp discriminability. This shift from self-supervision to label-guided learning ensures robustness against transient visual changes, a critical limitation in colonoscopy video analysis.

    2. Temporal Adjacency Constraints in Clustering. The authors integrate temporal penalties into tracklet clustering, a novel strategy that mitigates false associations between visually similar but temporally distant tracklets. Existing methods rely solely on visual similarity for clustering, which often fails when polyps reappear under differing conditions. By combining visual similarity with a temporal decay term, the method reduces fragmentation rates significantly, demonstrating the necessity of modeling temporal coherence in clinical video analysis.

    3. Expanded Dataset Integration and Robust LOOCV Evaluation. The work aggregates four public datasets (384 polyps), substantially increasing data diversity and scale compared to prior studies. This enhances model generalization to real-world colonoscopy variability. Additionally, the adoption of leave-one-out cross-validation (LOOCV) addresses data scarcity challenges, ensuring reliable performance estimation. The rigorous evaluation framework underscores clinical applicability, as it mimics real-world deployment scenarios where models must generalize to unseen patient data.

    4. This paper is well organized and easy to follow.
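    The clustering strategy described in strength 2 (visual similarity combined with a temporal decay term) can be sketched as below. This is a simplified stand-in for the paper's clustering algorithm: the alpha/tau values, the exponential decay, the threshold, and the connected-components grouping are assumptions for illustration.

    ```python
    import numpy as np

    def joint_similarity(visual_sim, timestamps, alpha=0.5, tau=60.0):
        """Blend visual similarity with a temporal-decay term.

        alpha weighs visual vs. temporal similarity (the paper selects it
        by grid search; the exponential form here is an assumption)."""
        timestamps = np.asarray(timestamps, dtype=float)
        dt = np.abs(timestamps[:, None] - timestamps[None, :])
        temporal_sim = np.exp(-dt / tau)  # temporally distant pairs decay to 0
        return alpha * np.asarray(visual_sim) + (1.0 - alpha) * temporal_sim

    def cluster_tracklets(visual_sim, timestamps, alpha=0.5, tau=60.0, threshold=0.6):
        """Group tracklets as connected components over the thresholded
        joint similarity (a simple proxy for the paper's clustering)."""
        sim = joint_similarity(visual_sim, timestamps, alpha, tau)
        n = sim.shape[0]
        parent = list(range(n))
        def find(i):  # union-find with path compression
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i in range(n):
            for j in range(i + 1, n):
                if sim[i, j] >= threshold:
                    parent[find(i)] = find(j)
        roots = [find(i) for i in range(n)]
        remap = {r: k for k, r in enumerate(dict.fromkeys(roots))}
        return [remap[r] for r in roots]
    ```

    In this sketch, two visually similar tracklets that are temporally far apart see their joint similarity pulled below the merge threshold, which is exactly the false re-association the temporal penalty is meant to suppress; with alpha = 1 (visual only), they would be merged.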

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Lack of Related Work Discussion: While the authors draw comparisons between their proposed method and both self-supervised and supervised learning paradigms, the manuscript lacks a comprehensive discussion of existing supervised contrastive learning methods. A detailed review of prior work in this specific area would better contextualize the contribution and highlight the novelty of the approach.

    2. Inadequate Method Illustration: The overview diagram in Figure 1, particularly the lower section, does not adequately present the underlying algorithms or mathematical formulations. Furthermore, the accompanying description of the figure lacks academic rigor and clarity, which diminishes the utility of the illustration in conveying the core methodology.

    3. Incomplete Experimental Evaluation: The experimental setup involves the use of multiple publicly available datasets for training; however, the results are reported on only a single test dataset. This limits the generalizability and robustness of the evaluation, and a more comprehensive assessment across all utilized datasets is needed to substantiate the effectiveness of the proposed method.

    4. Missing Hyperparameter Analysis: The manuscript lacks any discussion or ablation studies related to hyperparameter selection. Since hyperparameters can significantly influence model performance, an exploration of their impact is essential to validate the stability and reproducibility of the method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The methodology presented in the paper is interesting and inspiring; however, its experiments and discussions are deficient.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The work is novel and inspiring. The feedback from the authors also partially alleviated some of my concerns about the paper. I therefore recommend acceptance of the article.



Review #2

  • Please describe the contribution of the paper

    The authors introduce a supervised contrastive loss that incorporates temporally-aware soft targets and temporal penalty to improve intra-polyp invariance and inter-polyp separability for polyp counting. The two technical contributions focus on weighting the targets based on temporal distance. The experiments show that the proposed method outperforms previous methods on REAL-Colon dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed temporally-aware learning is novel.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The reviewer would like to know why the performance of supervised learning is worse than that of the proposed temporally-aware learning. Theoretically, the ground truth should achieve the best result even if it neglects temporal relationships. The authors should provide an in-depth analysis of this counterintuitive result and convince the reviewers how the proposed soft targets improve intra-polyp invariance compared to supervised learning. The hypothesis of a better learned embedding space is not illustrated. Also, since the performance gap is small, it is crucial to convince the reviewers that the improvement is reliable rather than incidental to grid-search outcomes.
    2. The visualized temporal adjacency matrix in Fig. 1 is confusing. Since the adjacent parts of polyps 2 and 3 are temporally close, there should not be a clear boundary between them; however, the visualization seems to perfectly disentangle the two polyps.
    3. The organization of the manuscript is not clear, rendering the contributions confusing. The two technical contributions, temporally-aware soft targets and the temporal penalty, are well organized in the method section but hard to understand from the introduction and abstract. Framing the work as a paradigm shift to supervised learning in the introduction distracts readers from the technical contributions of this manuscript.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel and the motivation is intuitive. However, the results are not promising and need further explanation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper addresses polyp counting by incorporating temporal information into both representation learning and clustering, going beyond prior methods that rely solely on visual features. The authors propose temporally-aware supervised contrastive learning and clustering with temporal penalties, enabling more robust association of tracklets. Compared to previous work, the paper also introduces an expanded dataset, a leave-one-out cross-validation protocol, and reproduces prior baselines under the same setting. Together, these contributions lead to notable performance improvements and establish a new state-of-the-art in polyp counting.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The quality of the figures is excellent and effectively illustrates the contributions of the paper.

    • The introduction of temporally-aware supervised contrastive learning and clustering leads to improved performance.

    • This work introduces more training data and a new evaluation protocol. It also enables a fair comparison between different methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The definition of fragmentation rate is unclear, and false positive rate is not formally defined.

    • It is not specified how polyp detections are obtained—are they derived from ground truth annotations or generated by a detection model?

    • The paper lacks an explanation of the “No clustering” baseline method.

    • While the hyperparameter range for combining visual and temporal similarity (α) is given, the actual value used in the experiments is not reported. If α is set to 0 or close to 0, it would imply that temporal information alone is sufficient, thereby questioning the necessity of visual features and undermining the contribution of the temporally-aware supervised contrastive learning.

    • There is no comparison with rule-based methods that rely solely on detections or tracklets. Intuitively, combining tracklets with simple heuristics could achieve basic polyp counting. These methods are referred to as “tracking only solution” in [15]. Although [15] reports inferior performance for such solutions, this paper adopts a new LOOCV evaluation protocol, which may lead to different performance outcomes. Therefore, it is worthwhile to include the performance of such methods to further highlight the significance of this study and better demonstrate the advantages of the proposed approach to readers.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • In Table 3, it would be clearer to separate the loss functions and clustering methods into distinct columns, explicitly indicating which combination is used for each ablation setting.

    • It would be helpful to clarify the relationship between polyp counting and polyp re-identification. One of the compared methods explicitly uses the term “Polyp Re-Identification” in its title, and there exists a broader body of work on polyp Re-ID [R1, R2, R3]. This appears to contradict the claim made in the Introduction: “In contrast to polyp detection and tracking [23,1,4,24], the task of polyp counting has received limited attention, with only two studies published to date [15,26].”

    [R1] Xiang, Suncheng, et al. “VT-ReID: Learning Discriminative Visual-Text Representation for Polyp Re-Identification.” ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024.

    [R2] Xiang, Suncheng, et al. “Learning Discriminative Visual-Text Representation for Polyp Re-Identification.” arXiv preprint arXiv:2307.10625 (2023).

    [R3] Xiang, Suncheng, et al. “Deep Multimodal Collaborative Learning for Polyp Re-Identification.” arXiv preprint arXiv:2408.05914 (2024).

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a methodologically meaningful approach with clear writing and strong experimental results. While there are some concerns regarding the completeness of comparisons, I believe the contribution is solid overall. I recommend a weak accept and consider the paper worthy of passing the first-round review and proceeding to the rebuttal stage. I’m open to raising my score if the authors provide reasonable clarifications in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I recommend to accept this paper for its good novelty and quality.




Author Feedback

We thank the reviewers for their constructive feedback and for acknowledging the novelty of our method (R1,R2,R3), its state-of-the-art performance (R3), and the clarity of the presentation (R1,R3). We have carefully considered each comment and would like to address the concerns as follows.

Performance of supervised vs temporally-aware loss and paradigm shift (R2).

We would like to clarify that the proposed temporally-aware approach is supervised since relationships between positive samples are obtained from ground truths (Eq. 1) and temporal information is incorporated into the supervised loss (Eq. 2). Hence, the shift is from self-supervised to supervised learning, with temporal awareness enhancing the supervised formulation. Therefore, our temporally-aware supervised contrastive loss is expected to outperform the supervised loss. Regarding the performance gap: In Tab. 3 (top), we report results keeping the clustering algorithm fixed, showing that temporal-awareness yields a substantial and statistically significant (Wilcoxon p<5e-4) 41% improvement over the standard supervised loss, which could be attributed to a more informative embedding space.

Polyp counting vs re-identification and related work discussion (R3,R1).

Polyp counting is the task of determining the number of distinct polyps observed during an entire procedure and is intended for automated reporting. In contrast, polyp re-identification, as formulated in [A] and the follow-up works cited by R3, aims to match a specific polyp against a large gallery acquired with different cameras and at different locations, and serves video retrieval. [15] starts from a re-identification formulation, but ultimately groups tracklets for counting or characterization. Thus, [15] and [26] are the only existing works addressing polyp counting directly, both relying on self-supervised learning. We will discuss R3’s references and [A] in the related work.

[A] Chen et al. “Colo-SCRL: Self-Supervised Contrastive Repr. Learning for Colonoscopic Video Retrieval.” ICME. 2023.

Experimental evaluation (R1).

We share the concern regarding the limited availability of public data for evaluating polyp counting. To mitigate this constraint, we moved from the simple validation/test split used in [26] to a more robust leave-one-out cross validation (LOOCV) protocol, accepting the additional computational cost. We also note that REAL-Colon is multi-centric and remains the only public dataset offering full-procedure videos with annotations that link each polyp instance to its corresponding entity.

Tracklet construction + “No clustering” and “Tracking only” baselines (R3).

Following [26], tracklets are built from ground-truth annotations to avoid tracker noise and enable fair comparison with prior work. As such, a direct comparison with the “tracking only” method in [15] is not feasible. The “No clustering” baseline in Tab. 3 reflects the initial fragmentation rate before any clustering. We will better specify this, thank you.

Hyperparameter selection (R1) and alpha value (R3).

Clustering hyperparameters were selected via grid search on the validation set; their values are reported in Sec. 3.2, along with the encoder parameters, which are identical to [26]. The most impactful hyperparameters are the number of views (Tab. 2) and the alpha value, whose average is 0.4833 ± 0.1999 across LOOCV folds. This indicates that both visual and temporal similarities contribute significantly to the final score and are needed for optimal performance. We will report this in the paper.

Formal definition of evaluation metric (R3).

The fragmentation rate is defined as the average number of tracklets each polyp is split into [15, 26]. We will include the full definition in Sec. 3.1, as well as the mathematical formulation of the false positive rate, following [26].
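As a sketch of the metric just defined, a plausible implementation counts, for each ground-truth polyp entity, how many predicted clusters its tracklets end up in, and averages over entities. This is one reading of the definition; the exact formulation is in [15, 26].

```python
from collections import defaultdict

def fragmentation_rate(entity_per_tracklet, cluster_per_tracklet):
    """Average number of predicted clusters each ground-truth polyp
    entity is split across. The ideal value is 1.0 (every polyp's
    tracklets land in exactly one cluster)."""
    clusters_of = defaultdict(set)
    for entity, cluster in zip(entity_per_tracklet, cluster_per_tracklet):
        clusters_of[entity].add(cluster)
    return sum(len(c) for c in clusters_of.values()) / len(clusters_of)
```

For example, if polyp A's two tracklets fall in two different clusters while polyp B's two tracklets share one cluster, the fragmentation rate is (2 + 1) / 2 = 1.5.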

Clarity and presentation.

  • Fig. 1 (R1,R2). We will add mathematical formalism and address visual inconsistencies.
  • Tab. 3 (R3). We will split the information into separate columns for clarity, thank you.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, all three reviewers agree that this paper can be accepted, but they raised important questions, as the results are not convincing … the authors should take this feedback carefully into account to improve the final version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


