Abstract

Accurate surgical phase recognition is crucial for computer-assisted interventions and surgical video analysis. Annotating long surgical videos is labor-intensive, driving research toward leveraging unlabeled data for strong performance with minimal annotations. Although self-supervised learning has gained popularity by enabling large-scale pretraining followed by fine-tuning on small labeled subsets, semi-supervised approaches remain largely underexplored in the surgical domain. In this work, we propose a video transformer-based model with a robust pseudo-labeling framework. Our method incorporates temporal consistency regularization for unlabeled data and contrastive learning with class prototypes, which leverages both labeled data and pseudo-labels to refine the feature space. Through extensive experiments on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public Cholec80 dataset, we demonstrate the effectiveness of our approach. By incorporating unlabeled data, we achieve state-of-the-art performance on RAMIE with a 4.9% accuracy increase and obtain comparable results to full supervision while using only 1/4 of the labeled data on Cholec80. Our findings establish a strong benchmark for semi-supervised surgical phase recognition, paving the way for future research in this domain.
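The teacher-student pseudo-labeling idea summarized in the abstract can be illustrated with a minimal NumPy mock-up. This is an illustrative sketch, not the authors' implementation: the function names, the EMA momentum of 0.99, and the 0.95 confidence threshold are assumptions chosen for the example.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over class logits.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def ema_update(teacher_w, student_w, momentum=0.99):
    # Teacher weights track an exponential moving average of the student's.
    return {k: momentum * teacher_w[k] + (1 - momentum) * student_w[k]
            for k in teacher_w}

def pseudo_labels(teacher_logits, threshold=0.95):
    # Keep only clips where the teacher is confident; the rest are masked out
    # and contribute nothing to the unlabeled loss.
    probs = softmax(teacher_logits)
    conf = probs.max(axis=-1)
    labels = probs.argmax(axis=-1)
    mask = conf >= threshold
    return labels, mask
```

In a typical setup of this kind, the masked pseudo-labels from the teacher (fed a weakly augmented view) would supervise the student's predictions on a strongly augmented view of the same clip.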

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1879_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/IntraSurge/SemiVT-Surge

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiYip_SemiVTSurge_MICCAI2025,
        author = { Li, Yiping and de Jong, Ronald and Nasirihaghighi, Sahar and Jaspers, Tim and van Jaarsveld, Romy and Kuiper, Gino and van Hillegersberg, Richard and van der Sommen, Fons and Ruurda, Jelle and Breeuwer, Marcel and Al Khalil, Yasmina},
        title = { { SemiVT-Surge: Semi-Supervised Video Transformer for Surgical Phase Recognition } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {477 -- 487}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper focuses on the clinically critical task of surgical phase recognition and introduces advanced semi-supervised techniques to reduce annotation costs, validated through extensive experiments on both private and public datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Addressing an Interesting but Underexplored Area: The paper focuses on applying semi-supervised techniques to surgical phase recognition—a relatively underexplored domain—to mitigate the high cost of video annotation. This presents a compelling motivation by targeting a critical challenge in surgical informatics: balancing annotation efficiency with model performance.

    2. Rigorous Experimental Validation: The method is validated across both private and public datasets, with comparative evaluations against fully supervised and self-supervised baselines. These experiments demonstrate the efficacy of the proposed semi-supervised approach in reducing annotation dependency while maintaining competitive accuracy.

    3. High-Quality Presentation: The manuscript features visually appealing figures, clear writing, and logical organization, ensuring readability and accessibility. The well-structured methodology and results sections effectively communicate technical contributions to a broad audience.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Limited Scope of State-of-the-Art Comparisons: The study references methods like SurgFormer and FedCy in the literature review but does not include them in the comparative analysis. Expanding the comparison to encompass these relevant approaches—especially those explicitly mentioned in the introduction—would provide a more comprehensive validation of the proposed method’s superiority.

    2. Opportunities for Enhanced Data Visualization: While Table 3 reports label reduction trends, presenting these results as a line graph would more intuitively illustrate the relationship between annotation effort and performance gains.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper demonstrates strong potential with a well-justified motivation, solid novelty, rigorous experimental design, and clear writing. However, it requires targeted improvements to achieve full impact, particularly in expanding state-of-the-art (SOTA) comparisons. Based on these considerations, I recommend a weak accept in the first-round review.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the combination of existing semi-supervised learning techniques to build a semi-supervised framework for surgical phase recognition. The proposed method integrates temporal consistency regularization between differently augmented views of unlabeled data using a teacher-student architecture and contrastive learning with class prototypes to refine feature representations using both labeled and pseudo-labeled data. This approach demonstrates strong performance on the RAMIE dataset and competitive performance on Cholec80 with reduced labeled data.
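    Contrastive learning with class prototypes, as described above, can be sketched as a cross-entropy over cosine similarities between clip features and per-class prototype vectors. This is an illustrative sketch only; the exact loss form, the temperature value, and all names are assumptions, not the paper's formulation.

```python
import numpy as np

def l2norm(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def prototype_contrastive_loss(features, labels, prototypes, temperature=0.1):
    # Cross-entropy over cosine similarities to each class prototype:
    # features are pulled toward their own class prototype and pushed
    # away from the prototypes of other classes.
    f = l2norm(features)              # (N, D) clip features
    p = l2norm(prototypes)            # (C, D) one prototype per phase
    logits = f @ p.T / temperature    # (N, C) similarity logits
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

    With pseudo-labeled clips included, `labels` would mix ground-truth phase labels and confident teacher pseudo-labels, so the prototypes are refined by both labeled and unlabeled data.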

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Relevance of the task:

    • The task addressed in this paper is relevant, as annotated surgical workflow data is often scarce, whereas large volumes of unlabeled surgical videos are usually available. There is a gap in current research regarding the application of semi-supervised learning techniques to surgical workflow analysis, which makes the direction of this paper interesting and useful for the community.

    Novelty:

    • While the components of the proposed methodology—such as semi-supervised learning for video modeling using temporal consistency regularization within a Teacher-Student framework and contrastive learning with prototypes—are not novel individually, their integration for surgical workflow analysis represents a valuable contribution.

    Technical correctness:

    • Mathematical formulations regarding the three loss functions that are used during training along with the presentation of the algorithm that covers the entire training workflow are accurate and consistent throughout the methodology section. This contributes to the overall clarity and facilitates a good understanding of the different stages and components of the semi-supervised learning strategy proposed.

    Related work:

    • The paper covers relevant literature by including works pertinent to its scope. It mentions transformer-based supervised methods for surgical workflow analysis, discusses self-supervised and semi-supervised approaches within the surgical domain, brings up semi-supervised approaches for image and video modeling outside the surgical domain (in the natural image domain), and also references semi-supervised approaches that have been useful within the medical image domain (not strictly surgical).

    Ablation studies:

    • Ablation studies are well-designed and conducted across two datasets rather than just one, effectively demonstrating how each component of the proposed method contributes to improving performance across different contexts.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Related work:

    • The related work section would benefit from better integration and analysis of the various approaches highlighted in the paper. While the section adequately covers relevant literature, it would be better to discuss the strengths and weaknesses of previous works rather than just presenting them with brief explanations. Such a discussion would significantly enhance the paper’s positioning within the existing research.

    Experimental Validation: State-of-the-art comparison

    • The supervision type for self-supervised methods (EndoFM, SurgeNetXL) with TeCNO temporal module is ambiguous. It is unclear whether these models are fine-tuned on the labeled training set of the RAMIE dataset or if classification is performed in a zero-shot manner from pretraining alone. If features are extracted from self-supervised pre-trained backbones while the TeCNO module is trained on the labeled dataset, this would not constitute purely self-supervised pretraining as it is showcased in Table 1.

    • The comparison with TeCNO architecture on both RAMIE dataset (Table 2) and Cholec80 (Table 3) appears uneven due to backbone differences. It is unclear whether performance improvements are from the use of a spatio-temporal transformer-based backbone (TimeSformer) against a ResNet-50 spatial-only backbone or from the advantage of training the backbone using the proposed semi-supervised learning strategy. A more informative comparison would have included results using TimeSformer backbone + TeCNO temporal module.

    • The state-of-the-art comparison omits Surgformer [1] across all datasets, despite it being the current state-of-the-art surgical workflow analysis model on Cholec80.

    • Qualitative results suggest that temporal consistency regularization does not directly improve temporal consistency in predictions, which is instead achieved by TeCNO temporal module. However, the increase in video-level Jaccard metrics with temporal consistency regularization (Table 1) indicates improved temporal consistency, creating an interesting inconsistency. The videos selected for qualitative analysis may not adequately represent this finding.

    Writing and presentation:

    • The architectural visualization would be more informative with the inclusion of real input images to clearly demonstrate the practical differences between weak and strong augmentations.

    • The introduction lacks cohesive flow between paragraphs and sentences, requiring refinement to create a stronger narrative that better establishes the problem context, situates the work within relevant literature, and clearly articulates how the proposed approach builds upon existing research to address the identified challenges within a smoother storyline.

    • In the introduction, the statement “Annotating surgical phases is labor-intensive, requiring frame-by-frame review” is too strong and not entirely accurate. Surgical phases are typically annotated using timestamps or time ranges, rather than through a frame-by-frame process. While the task remains labor-intensive—requiring a surgical expert to watch the full procedure and annotate key intervals—it is not as granular as frame-by-frame annotation.

    • The methodology section implies that both weak and strong augmentations are temporal, while the implementation details indicate that visual augmentations are also applied. It would improve clarity to briefly mention in the methodology section that the data undergoes both visual and temporal augmentations. Additionally, the term “rand-m9-n5-mstd0.8-inc1” from the AutoAugment library is not self-explanatory, the paper should better describe which augmentations are applied to the input data.

    References: [1] S. Yang, L. Luo, Q. Wang, and H. Chen, ‘Surgformer: Surgical transformer with hierarchical temporal attention for surgical phase recognition’, in International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024, pp. 606–616.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    To strengthen this work, I would suggest:

    • Clarifying the experimental setup, particularly regarding how self-supervised methods are integrated with TeCNO. The supervision type should be explicitly stated to avoid ambiguity.
    • Improving writing flow in the introduction. Also, enhancing the related work section with critical analysis that identifies specific gaps your method addresses, rather than simply listing previous approaches.
    • Better explaining the discrepancy between qualitative and quantitative results regarding temporal consistency. Perhaps including additional qualitative examples that better demonstrate the quantitative improvements would help.
    • Improving the clarity of augmentation descriptions in the methodology section by explicitly stating which visual and temporal augmentations are applied.
    • Including a comparison with Surgformer, which represents the current state-of-the-art for Cholec80.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper addresses a relevant problem in the surgical domain by proposing a semi-supervised learning approach for surgical workflow analysis, a direction that remains underexplored despite the abundance of unlabeled surgical video data. The integration of a Teacher-Student framework with temporal consistency regularization and contrastive learning with prototypes—while not novel in isolation—represents a valuable contribution to the field when applied in this context.

    The paper is technically sound, with clear and correct mathematical formulations, a well-articulated training algorithm, and comprehensive ablation studies conducted across two datasets. The related work is sufficiently broad in scope, covering the necessary domains and helping situate the paper within the relevant literature, though deeper critical analysis of prior works would improve its positioning.

    Despite these strengths, there are some notable weaknesses. The experimental comparisons lack clarity in terms of supervision levels for competing methods, and differences in backbone architectures make it difficult to fairly assess the source of the proposed method’s improvements. Additionally, the omission of Surgformer, a strong model on Cholec80, weakens the completeness of the state-of-the-art comparison. Writing and presentation aspects—including architectural visualization, augmentations description, and introduction flow—also require refinement for better clarity and readability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    In this paper the authors propose a semi-supervised learning framework for surgical phase recognition in videos. This framework incorporates unlabeled data into training in the following two ways: 1.) by utilizing a teacher-student approach to enforce temporal consistency and 2.) by doing contrastive learning with prototypes to ensure that in the feature space different classes are far away from each other. The authors evaluate their framework on two surgical video datasets, and find that 1.) each of the semi-supervised learning components introduced in the paper contributes meaningfully to the classification metrics improvement compared to training without them and 2.) this framework allows the authors to beat the fully-supervised baseline with only a fraction of labeled videos (25%), utilizing the rest as unlabeled data within the semi-supervised framework.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Strong results: the authors demonstrate improvements over fully-supervised, semi-supervised, and self-supervised-pretrained state-of-the-art methods, establishing new best results for the two datasets considered. Detailed algorithm description and reproducibility promise: not only do the authors promise to make their code available, but Algorithm 1 provides a detailed description of the framework for those seeking to re-implement it.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No major weaknesses

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A strong paper with interesting results and no major flaws

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We would like to thank all the reviewers for their thorough and thoughtful feedback. We are grateful for the recognition of our work’s novelty, strong clinical motivation, sound experimental design, and clear writing (R#1,3,4). The code will be made publicly available with the camera-ready version.

In response to Reviewers #3 and #4: We agree that incorporating comparisons with additional state-of-the-art models such as SurgFormer would strengthen our study. SurgFormer, which incorporates a well-designed hierarchical attention module and was trained using 40 labeled videos on the Cholec80 dataset, serves as a strong benchmark. In practice, semi-supervised learning is most effective when the amount of unlabeled data equals or exceeds the labeled portion. Accordingly, we partitioned the Cholec80 training set into labeled and unlabeled subsets. In our experiments, we trained a pure TimeSformer model using 20 labeled and 20 unlabeled videos. Under these conditions, our method did not surpass the performance reported for SurgFormer. However, we believe that re-implementing SurgFormer within our semi-supervised framework could potentially yield improved results. This is a promising direction for future research. Moreover, applying the SurgFormer architecture with our semi-supervised training paradigm to our in-house RAMIE dataset—which contains a larger pool of unlabeled data—may further boost performance. We thank the reviewers for this valuable suggestion.

In response to Reviewer #4: (1) The self-supervised models (EndoFM, SurgeNetXL) were fine-tuned on the labeled training set, making their comparison with our proposed semi-supervised approach both appropriate and fair. (2) We experimented with both spatio-temporal and spatial-only backbones; however, due to space limitations, these results were not included in the table. The TeCNO temporal module typically contributes a 1–2% improvement in accuracy. As such, the first row of Table 1 (TimeSformer without additional temporal modeling) can serve as a reference point to approximate the benefit of adding TeCNO. The findings further support our conclusion that semi-supervised learning consistently enhances performance. If space permits, we agree that an explicit comparison of backbones would be a valuable addition. (3) Your comments on temporal consistency are much appreciated. This remains a complex and underexplored aspect to evaluate; indeed, the visualizations do not always reflect consistent temporal improvements. Metrics such as edit score and F1 with overlap are better suited for capturing temporal consistency and will be incorporated in future analyses. The explainability of the temporal regularization mechanism during training is also an open challenge and merits further investigation.

In addition, we will improve the architectural and results visualizations (R#3,4), clarify the augmentation strategies (R#4), refine the writing in the introduction and related work sections (R#4), and expand the comparison with additional state-of-the-art methods (R#3,4). We sincerely appreciate all the constructive suggestions, and the time invested in reviewing our work.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


