Abstract
Early detection, accurate segmentation, classification, and tracking of polyps during colonoscopy are critical for preventing colorectal cancer. Many existing deep-learning-based methods for analyzing colonoscopic videos either require task-specific fine-tuning, lack tracking capabilities, or rely on domain-specific pre-training. In this paper, we introduce PolypSegTrack, a novel foundation model that jointly addresses polyp detection, segmentation, classification, and unsupervised tracking in colonoscopic videos. Our approach leverages a novel conditional mask loss, enabling flexible training across datasets with either pixel-level segmentation masks or bounding box annotations, allowing us to bypass task-specific fine-tuning. Our unsupervised tracking module reliably associates polyp instances across frames using object queries, without relying on any heuristics. We leverage a robust vision foundation model backbone that is pre-trained in an unsupervised manner on natural images, thereby removing the need for domain-specific pre-training. Extensive experiments on multiple polyp benchmarks demonstrate that our method significantly outperforms existing state-of-the-art approaches in detection, segmentation, classification, and tracking.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5424_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/5424_supp.zip
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ChoAnw_PolypSegTrack_MICCAI2025,
author = { Choudhuri, Anwesa and Gao, Zhongpai and Zheng, Meng and Planche, Benjamin and Chen, Terrence and Wu, Ziyan},
title = { { PolypSegTrack: Unified Foundation Model for Colonoscopy Video Analysis } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper constructs a foundation model for a series of downstream colonoscopy analysis tasks based on the feature extraction of DINOv2 and the encoder-decoder architecture of MaskDINO.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Not so much.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There are some obvious weaknesses in this submission: (1) The novelty is weak; the important feature-extraction and decoding modules rely on existing methods. (2) The proposed loss actually seems similar to MaskDINO's. (3) The feature extraction of DINOv2 is only an inference process, and these features should be visualized to show their effectiveness. (4) In Fig. 3, the detections and segmentations of the three images seem similar, with regular shape, size, and location. A better visualization should contain various polyp conditions.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The framework is totally based on DINOv2 and MaskDINO, which lacks novelty, and the important visualizations of inference features and polyp segmentation are not shown.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Based on the foundation models DINOv2 and MaskDINO, I agree with Reviewer 3 that it is a good application. In Fig. 3, the detections and segmentations of the three images seem similar, with regular shape, size, and location. The authors asked me to see the supplementary material for the variety of polyps. Why can this variety not be shown in the 8-page paper, but only in the supplementary material?
Review #2
- Please describe the contribution of the paper
The paper proposes a joint multi-task model for polyp detection, segmentation, and tracking in colonoscopy videos. It uses a query-based transformer architecture to perform detection, mask prediction, and classification in a unified framework. A conditional mask loss allows flexible training on heterogeneous annotations. For tracking, a query-matching algorithm based on cosine similarity and the Hungarian algorithm is applied at inference time to associate object instances across frames without requiring additional supervision. The model is evaluated on five public datasets and shows competitive performance across all tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The model jointly handles detection, segmentation, and classification in colonoscopy videos, which is well-motivated for clinical use where these tasks are often interrelated.
The conditional mask loss enables the model to be trained on datasets with mixed supervision (e.g., bounding boxes only or full masks), improving real-world applicability.
The tracking mechanism avoids traditional IoU-based post-processing by using cosine similarity between learned object queries, making it more robust to camera motion and occlusion.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Despite being described as a foundation model, the proposed approach is more accurately a multi-task learning framework. It is trained on only 1,450 images from two datasets (Kvasir-SEG and CVC-ClinicDB), which is far too limited in scale and diversity to support the generalization expected of a foundation model. For comparison, Endo-FM (MICCAI ‘23) was trained on over 33,000 video clips from multiple datasets. Additionally, the architecture tightly couples representation with task-specific heads, rather than offering a modular, general-purpose backbone.
The description of the model’s prediction heads—especially the mask head—is vague. Key terms like “intermediate queries” are undefined, and operations such as “multiplied with image features” and “thresholded” are not clearly described. More details are needed on the structure of these heads (e.g., whether they are MLPs, convolutional layers, etc.) and how object queries are mapped to outputs.
The query-based tracking method is only applied at inference, and the model itself is trained on isolated frames. Given that endoscopy is inherently a video-based task, it’s unclear why no temporal modelling was incorporated during training. While the authors mention the lack of densely annotated colonoscopic videos, temporal modelling does not necessarily require dense supervision; weakly supervised or self-supervised approaches, as demonstrated in Endo-FM, can still be effective.
The paper claims to avoid heuristics by using cosine similarity and the Hungarian algorithm for query matching. However, aren’t these still heuristic choices—cosine similarity is a manually selected metric, and Hungarian matching relies on fixed, rule-based assignment? Could the authors clarify why this approach is considered non-heuristic?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
“Note that there are currently no openly available datasets, to the best of our knowledge, to evaluate joint detection, segmentation, classification and tracking together, hence we evaluate on the aforementioned tasks to cover all the tasks.” I think it should be “aforementioned datasets” and not tasks.
In Table 1, the original DINO paper is cited as ICCV ‘21, but the table lists its venue as ICLR ‘23. Could the authors clarify whether this is a citation error or if the result corresponds to a follow-up application of DINO published at ICLR ‘23?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a well-motivated multi-task model with strong empirical results and clinical relevance. However, the foundation model claim is overstated, and the tracking method is described as non-heuristic despite relying on fixed similarity metrics and matching logic. These points need to be clarified.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Authors have addressed my concerns.
Review #3
- Please describe the contribution of the paper
This paper presents a foundation model for detecting, segmenting, and unsupervised tracking in colonoscopic videos. Notably, this paper introduces a novel method of fine-tuning foundation models on training data containing different ground-truth modalities, and of polyp tracking at inference. The paper provides clear quantitative and qualitative evidence that the proposed method outperforms other state-of-the-art models for detecting, segmenting, and tracking polyps.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed conditional mask loss function allows the foundation model to adapt to different ground truth annotation types (e.g., bounding boxes vs. pixel-wise masks) on-the-fly during training. This enables the foundation model to flexibly train across several datasets and jointly optimize for polyp detection and segmentation.
- This paper introduces an effective, unsupervised approach to tracking polyps across video frames using cosine similarity. This approach is less affected by occlusions/obstructions and enables polyp tracking without needing large datasets with dense annotations.
- Foundation model was thoroughly evaluated on polyp detection, segmentation, and tracking tasks against several current models and across multiple holdout validation datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is unclear how the prediction heads function. Also, it would be beneficial to include a second figure/schematic of the prediction heads
- In the conditional mask loss, it is unclear if the model would over-optimize for one ground truth modality (e.g., bounding boxes) if it occurs more often in the training dataset than another (e.g., pixel-wise masks).
- Statistical testing is missing
- Did not explain heuristic-based IoU method for polyp tracking comparison
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper proposes a potentially novel foundation model for joint polyp detection, segmentation, and tracking in colonoscopic videos. The weaknesses noted did not significantly affect the paper's impact. The paper was well-organized and easy to follow.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank @R1, @R2, @R3 for their constructive feedback. We are excited that they find our work novel & well-evaluated (R2), and clinically motivated (R3).
Concerns & questions-
@R1-Weak novelty; loss similar to MaskDINO:
- Novel & Useful Loss Function- As @R2 noted, our proposed loss allows “to adapt to different ground truth annotation types (e.g., bounding boxes vs. pixel-wise masks) on-the-fly during training”, unlike MaskDINO, enabling flexible training across several datasets.
- Novel Application- We are first to apply a MaskDINO-inspired framework to polyp detection, segmentation, and tracking, achieving SOTA results. It is effective to leverage open-source models instead of training from scratch like Endo-FM. Inspired by works like MedSAM’s use of open-source SAM, our work also targets a new medical context, aligning with MICCAI’s mission. @R3 finds it “well-motivated” with “strong empirical results and clinical relevance.”
- Tracking- We introduce polyp tracking, critical for clinical applications but challenging due to occlusion and viewpoint changes in colonoscopy videos. Existing foundation models have not addressed tracking.
@R3-Described as foundation model, but is rather a multi-task learning (MTL) framework: We see little difference. [Xu et al., ICLR 2024] finds that MTL leads to reduced error in foundation models. MTL prevents over-optimization and improves generalization by exploiting task commonalities, key for robust foundation models.
@R3-Trained on less data unlike Endo-FM: We leverage DINOv2, pre-trained on 140M natural images, which understands “objects” well. Retraining from scratch, like EndoFM, is less optimal—our model outperforms EndoFM by 7 points (Tab.3). In Tabs.1&2, we fine-tune only on 1,450 images for consistency with prior work. But in Tab.4, we use 36k images (28k-KUMC, 8k-CVC300, ColonDB, ClinicDB, KvasirSEG & ETIS). We will clarify this.
@R3-No weakly or self-supervised temporal modeling like EndoFM: MinVIS shows object queries from image-based models are consistent across frames, enabling lightweight tracking without temporal modeling. Our image-based training is versatile and can be extended with a temporal module in future work.
@R3-Cosine similarity, Hungarian algorithm are heuristic choices: By heuristic, we mean suboptimal or non-learning-based methods (e.g., IoU or spatial distance-based costs). The Hungarian algorithm is an optimal bipartite matching solution, not considered a heuristic. Our trained object queries are temporally consistent (as in MinVIS), enabling robust tracking without heuristics like IoU or spatial distances. Cosine similarity is manually chosen, but is a simple metric on learned queries, minimizing manual design.
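The matching step the authors describe can be sketched as follows. This is a minimal illustration, not the authors' implementation: query embeddings are represented as plain NumPy arrays, and the function name `match_queries` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_queries(prev_queries, curr_queries):
    """Associate object-query instances across frames via cosine
    similarity and optimal bipartite (Hungarian) matching.

    prev_queries: (N, D) array of per-instance query embeddings
    curr_queries: (M, D) array for the next frame
    Returns a list of (prev_index, curr_index) pairs.
    """
    # Row-normalize so dot products become cosine similarities.
    a = prev_queries / np.linalg.norm(prev_queries, axis=1, keepdims=True)
    b = curr_queries / np.linalg.norm(curr_queries, axis=1, keepdims=True)
    sim = a @ b.T  # (N, M) cosine similarity matrix

    # linear_sum_assignment minimizes total cost, so negate similarity
    # to obtain the maximum-similarity assignment.
    row_idx, col_idx = linear_sum_assignment(-sim)
    return list(zip(row_idx.tolist(), col_idx.tolist()))
```

Note the only manual design here is the choice of similarity metric; the assignment itself is the optimal bipartite solution rather than a greedy rule.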
@R1-Feature extraction of DINOv2 only an inference process. Visualize features: DINOv2 is fine-tuned, not just used for inference. We will add visualizations.
@R1-More variable conditions in Fig 3: More examples in the supp. video.
@R2,@R3-Prediction heads: The mask, box and classification heads are 3, 3, and 1 layered fully connected networks respectively. We will add more details.
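A random-weight NumPy sketch of such fully connected heads, for illustration only: the hidden width (256) and output sizes below are assumptions, not values from the paper, and real heads would use trained weights.

```python
import numpy as np

def mlp_head(x, dims, rng):
    """Forward pass of an n-layer fully connected head applied to
    object queries, with ReLU between layers and none after the last.
    Weights are random placeholders; `dims` lists layer output sizes."""
    layer_dims = list(zip([x.shape[-1]] + list(dims[:-1]), dims))
    for k, (d_in, d_out) in enumerate(layer_dims):
        w = rng.standard_normal((d_in, d_out)) * 0.02
        x = x @ w  # linear layer (bias omitted for brevity)
        if k < len(dims) - 1:
            x = np.maximum(x, 0.0)  # ReLU between layers
    return x

rng = np.random.default_rng(0)
query = rng.standard_normal((1, 256))        # one 256-d object query (assumed dim)
mask_emb = mlp_head(query, [256, 256, 256], rng)  # 3-layer mask head
cls_logits = mlp_head(query, [2], rng)            # 1-layer classification head
```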
@R3-Explain intermediate queries, etc: We will add.
@R2-Statistical testing: For Tab.3, our accuracies (91.5, 90.8, 91.0, 90.4, 90.8; mean 90.9, SD 0.4) are significantly higher than EndoFM’s (84.1), p<0.0001 (two-tailed, α=0.05). For Tab.2 (ETIS), our dice scores (91.4, 91.8, 90.5) outperform QueryNet’s (81.9), p<0.0001. We will add tests for all datasets.
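One plausible reading of the test described above (repeated-run scores against a single reported baseline) is a one-sample two-tailed t-test; the sketch below uses the Tab.3 accuracies quoted in the rebuttal and is an illustration, not the authors' exact procedure.

```python
from scipy.stats import ttest_1samp

def significance_vs_baseline(scores, baseline):
    """Two-tailed one-sample t-test of repeated-run scores against a
    single reported baseline value. Returns (t statistic, p-value)."""
    result = ttest_1samp(scores, popmean=baseline)
    return result.statistic, result.pvalue

# Accuracies quoted in the rebuttal (mean 90.9, SD 0.4) vs. EndoFM's 84.1.
t_stat, p_value = significance_vs_baseline([91.5, 90.8, 91.0, 90.4, 90.8], 84.1)
```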
@R2-Explain heuristic-based IoU: IoU-based tracking calculates overlap between segmentation masks or bounding boxes across frames, assigning identities by maximizing IoU scores. High-overlap polyps in consecutive frames share the same identity.
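The heuristic baseline described above can be sketched as follows; this is a generic illustration of IoU-based identity assignment over bounding boxes, not the exact comparison implementation, and the overlap threshold is an assumed value.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def iou_track(prev_boxes, curr_boxes, thresh=0.3):
    """Greedy IoU-based tracking: each current box inherits the
    identity of the unclaimed previous box with the highest overlap
    above `thresh` (threshold value is illustrative).
    Returns {curr_index: prev_index} for matched boxes."""
    assigned, taken = {}, set()
    for j, cb in enumerate(curr_boxes):
        best_i, best_iou = None, thresh
        for i, pb in enumerate(prev_boxes):
            iou = box_iou(pb, cb)
            if i not in taken and iou > best_iou:
                best_i, best_iou = i, iou
        if best_i is not None:
            assigned[j] = best_i
            taken.add(best_i)
    return assigned
```

This style of rule-based overlap matching is what the rebuttal contrasts with query-embedding matching: it breaks down when boxes move far between frames (camera motion) or briefly disappear (occlusion).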
@R2-Over-optimize for 1 GT modality: Our datasets have more bounding-box-only images (28k-KUMC) than images with both masks and boxes (8k-CVC300, ColonDB, ClinicDB, Kvasir-SEG & ETIS) in Tab.4. To balance this, we scale the mask loss to 10x the box loss. We will clarify this.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
I agree with R2 that overusing the term Foundation Model is a not-so-elegant citation-chasing trend, too common these days, but I don't think this should block a robust work from appearing in MICCAI. I also found the only reviewer recommending rejection overly harsh, and I cannot see how the rebuttal is too emotional and illogical, as R1 claims.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work has two positive reviewers and one negative reviewer. After checking the comments and rebuttals, I agree with the positive reviewers to accept this work. The authors are encouraged to leverage the reviewer comments and incorporate the rebuttal text when revising the paper for the final version; in particular, the variety of polyps shown in the supplementary materials should be added to the manuscript.