Abstract

Pre-training on image-text colonoscopy records offers substantial potential for improving endoscopic image analysis, but faces challenges including non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions. We introduce Endo-CLIP, a novel self-supervised framework that enhances Contrastive Language-Image Pre-training (CLIP) for this domain. Endo-CLIP’s three-stage framework (cleansing, attunement, and unification) addresses these challenges by: (1) removing background frames, (2) leveraging large language models (LLMs) to extract clinical attributes for fine-grained contrastive learning, and (3) employing patient-level cross-attention to resolve multi-polyp ambiguities. Extensive experiments demonstrate that Endo-CLIP significantly outperforms state-of-the-art pre-training methods in zero-shot and few-shot polyp detection and classification, paving the way for more accurate and clinically relevant endoscopic analysis. Code will be made publicly available at https://github.com/chrlott/EndoCLIP.
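
The abstract builds on CLIP-style contrastive learning but does not spell out the objective. As a rough illustration only, the sketch below shows the standard symmetric contrastive loss used by CLIP-style models, in which the i-th image and the i-th text in a batch form the positive pair. It is not Endo-CLIP's implementation: the function names, the temperature value of 0.07, and the toy embeddings are all assumptions made for this example.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    Hypothetical helper: the temperature is an assumed default,
    not a value taken from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    logits_transposed = [list(column) for column in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_transposed))


# Toy batch: two images and two texts with 2-D embeddings (made-up values).
matched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each image lines up with its own text and large when the pairs are swapped, which is the behaviour contrastive pre-training relies on.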

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4385_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HeYil_EndoCLIP_MICCAI2025,
        author = { He, Yili and Zhu, Yan and Fu, Peiyao and Yang, Ruijie and Chen, Tianyi and Wang, Zhihua and Li, Quanlin and Zhou, Pinghong and Yang, Xian and Wang, Shuo},
        title = { { Endo-CLIP: Progressive Self-Supervised Pre-training on Raw Colonoscopy Records } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of the paper is the introduction of Endo-CLIP, a novel self-supervised framework designed to improve Contrastive Language-Image Pre-training (CLIP) for endoscopic image analysis, specifically in the domain of colonoscopy records. Endo-CLIP addresses challenges such as non-informative background images, complex medical terminology, and ambiguous multi-lesion descriptions through a three-stage framework:
    1. Cleansing: Removes background frames from colonoscopy records.
    2. Attunement: Leverages large language models (LLMs) to extract clinical attributes for fine-grained contrastive learning.
    3. Unification: Employs patient-level cross-attention to resolve multi-polyp ambiguities.

    Endo-CLIP significantly outperforms state-of-the-art methods in zero-shot and few-shot polyp detection and classification, indicating its potential for more accurate and clinically relevant endoscopic analysis. Additionally, the authors plan to make the code and datasets publicly available.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s major strength lies in its introduction of Endo-CLIP, a novel and well-reasoned three-stage framework tailored for colonoscopy image analysis. Addressing unique challenges like irrelevant images, morphological variability, and multi-polyp scenarios, Endo-CLIP leverages LLMs and cross-attention mechanisms to enhance contrastive language-image pre-training. Strong empirical results, supported by ablation studies, demonstrate its superior performance in polyp detection and malignancy classification, particularly in zero-shot settings. The commitment to releasing code and the manually annotated EndoReport50 dataset further amplifies the paper’s contribution, providing valuable resources for future research.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Here are some potential weaknesses of the paper, as I see them:
    1. The novelty of individual components is limited. While the overall Endo-CLIP framework is novel, the individual components (e.g., using LLMs for text extraction, cross-attention) are not entirely new. The novelty lies in how these components are integrated and adapted for the specific task of colonoscopy image analysis. However, the extent of the novelty could be questioned if the specific implementations of these components are straightforward adaptations of existing techniques.
    2. External validation is lacking. The experiments are conducted on a single pre-training dataset and a single downstream task dataset (EndoReport50). While the results are promising, it is unclear how well Endo-CLIP would generalize to other datasets or clinical settings. External validation on publicly available colonoscopy datasets would strengthen the claims of robustness and generalizability.
    3. Computational cost is not reported. The paper does not provide details on the computational cost of training and inference with Endo-CLIP. Given the use of large language models and attention mechanisms, the model is likely to be computationally intensive. This could limit its practical applicability in resource-constrained clinical environments.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel and promising framework for colonoscopy image analysis. The results are compelling, and the approach is well-motivated. However, the limitations regarding the novelty of individual components, the lack of external validation, and the potential computational cost need to be addressed. A more detailed comparison against other similar approaches as well as a deeper analysis of the runtimes with varying parameters could improve the paper. Therefore, the paper could be accepted, but only if the authors address these concerns in the rebuttal and are willing to make revisions based on feedback.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a CLIP-based self-supervised framework for colonoscopy records. The proposed framework contains three stages: cleansing, attunement, and unification. The first stage filters background frames, using an LLM to preprocess each patient’s report. The second stage then employs an LLM to extract morphological attributes of polyps for better polyp representation in single-polyp cases. Finally, the third stage refines representation learning for multi-polyp cases using polyp-level cross-attention. The experiments show that the proposed method outperforms baseline methods in the downstream tasks of polyp detection and classification.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The analysis of the challenges in colonoscopy is in-depth and comprehensive.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The previously mentioned works [27, 26, 25, 5] are missing from the experiments. The authors should include these early attempts for a more comprehensive comparison, or otherwise explain why they are omitted.
    2. The reviewer would like to know, quantitatively, how well the LLM performs at filtering and extraction, and how the authors handle the error cases.
    3. Even though the downstream task performance is promising, the clustering in Endo-CLIP’s t-SNE visualization is not distinct and the boundary is not clear, which is inconsistent with the description in Sec. 3.3. Could the authors explain how the results demonstrate distinct clustering and a clear boundary?
    4. The authors should also include the polyp detection task in the ablation study to demonstrate the effectiveness of each design choice.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is reasonable, and the experimental results are convincing. Some complementary experiments still need to be done.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes a progressive self-supervised pre-training framework, Endo-CLIP, that systematically tackles noise filtering, morphological feature extraction, and multi-polyp matching challenges in colonoscopy records. It leverages large language models to extract and integrate clinical morphological attributes into contrastive image-text learning, thereby achieving a more precise semantic alignment. Moreover, the framework introduces a cross-attention mechanism to aggregate global contextual information from multi-polyp cases, which leads to significant improvements in downstream polyp detection and malignancy classification, even under zero-shot and few-shot conditions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work proposes a progressive three-stage framework (Endo-CLIP) that systematically cleanses data, aligns fine-grained morphological features, and integrates global context from multi-polyp cases.

    This work proposes an innovative use of large language models to extract clinical morphological attributes from unstructured diagnostic reports, which enhances image–text semantic alignment.

    This work proposes a cross-attention mechanism to resolve ambiguous multi-polyp matching, leading to strong performance in downstream polyp detection and malignancy classification under zero-shot and few-shot settings.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. This work relies heavily on large language models for extracting morphological attributes but does not investigate how variations in these models or domain-adaptation issues might impact performance.

    2. This work employs a cross-attention mechanism for integrating multi-polyp information, yet similar approaches exist (e.g., in [23, 24]) and no thorough comparison is provided to highlight its distinct advantages.

    3. This work demonstrates improvements on a single dataset without extensive validation on external datasets or runtime efficiency analysis, which limits understanding of its clinical scalability and generalizability.

    4. Evaluate the impact of different large language models or extraction methods on the overall performance to assess robustness against domain-specific language challenges.

    5. Validate the framework on an independent, external dataset from another institution to better demonstrate its generalization ability and clinical applicability.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, I appreciate the paper’s clear motivation and thorough evaluation. The work proposes a progressive three-stage approach that effectively tackles the challenges associated with noisy colonoscopy records and provides strong experimental results on both polyp detection and malignancy classification. Additionally, the innovative use of large language models for extracting morphological attributes and the integration of a cross-attention mechanism to resolve multi-polyp ambiguities contribute valuable insights to the field.

    At the same time, there are areas that could be strengthened. The heavy reliance on a specific language model and the limited comparison with other cross-attention strategies leave some open questions regarding its overall robustness and novelty. Moreover, validating the approach on external datasets would further enhance the demonstration of its clinical feasibility.

    These factors, coupled with the sound experimental design and potential impact on clinical applications, led to my overall recommendation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

N/A




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


