Abstract

The complexity and diversity of surgical workflows, driven by heterogeneous operating room settings, institutional protocols, and anatomical variability, present a significant challenge in developing generalizable models for cross-institutional and cross-procedural surgical understanding. While recent surgical foundation models pretrained on large-scale vision-language data offer promising transferability, their zero-shot performance remains constrained by domain shifts, limiting their utility in unseen surgical environments. To address this, we introduce Surgical Phase Anywhere (SPA), a lightweight framework for versatile surgical workflow understanding that adapts foundation models to institutional settings with minimal annotation. SPA leverages few-shot spatial adaptation to align multi-modal embeddings with institution-specific surgical scenes and phases. It also ensures temporal consistency through diffusion modeling, which encodes task-graph priors derived from institutional procedure protocols. Finally, SPA employs dynamic test-time adaptation, exploiting the mutual agreement between multi-modal phase prediction streams to adapt the model to a given test video in a self-supervised manner, enhancing the reliability under test-time distribution shifts. SPA is a lightweight adaptation framework, allowing hospitals to rapidly customize phase recognition models by defining phases in natural language text, annotating a few images with the phase labels, and providing a task graph defining phase transitions. The experimental results show that the SPA framework achieves state-of-the-art performance in few-shot surgical phase recognition across multiple institutions and procedures, even outperforming full-shot models with 32-shot labeled data.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1469_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/CAMMA-public/SPA

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YuaKun_Recognizing_MICCAI2025,
        author = { Yuan, Kun and Chen, Tingxuan and Li, Shi and Lavanchy, Joël L. and Heiliger, Christian and Özsoy, Ege and Huang, Yiming and Bai, Long and Navab, Nassir and Srivastav, Vinkle and Ren, Hongliang and Padoy, Nicolas},
        title = { { Recognizing Surgical Phases Anywhere: Few-Shot Test-time Adaptation and Task-graph Guided Refinement } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        page = {469 -- 479}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript addresses the task of few-shot learning for surgical phase recognition. It employs PeskaVLP as the backbone and introduces several components: few-shot spatial adaptation, temporal adaptation using task graphs, and test-time adaptation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript addresses an important task: few-shot learning. To tackle this challenge, the authors combine several components. Their approach is evaluated on multiple datasets, and an ablation study is conducted to assess the contribution of each component.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Phase recognition algorithms typically follow a two-step approach: first, images are embedded into a latent space, and then a temporal model (e.g., LSTM, TCN++, Transformer) integrates the information over time. In traditional approaches, temporal patterns are learned directly from video data. However, this study tackles the few-shot setting, where learning temporal dynamics from video is not feasible. Instead, the authors propose learning from task graphs. Still, it is unclear what the source of these task graphs is. Constructing a task graph for a specific institution would require analyzing multiple procedures—essentially timing several surgeries—which is comparable to annotating the data. As such, this approach may not truly qualify as few-shot learning. Furthermore, in this study, the source of the task graphs for the analyzed datasets is not clearly described. Second, the authors compare their approach to several other methods. However, I am not convinced these baselines are appropriate. Both models were not trained on surgical data, and it is unclear how the authors adapted them for this task. A more relevant comparison would be few-shot fine-tuning of PeskaVLP, as demonstrated in the original PeskaVLP manuscript using 10% of the data. The authors should consider implementing this for 1, 16, and 32 samples.

    Additionally, there are some minor typos and inconsistencies. For example, the following lines are confusing and do not align: "K-shot N-class"; "N-shot means using N × K labeled samples, with K samples per class"; "phase class K in {1, …, K}". In some cases, it seems that N and K may have been mistakenly flipped.
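To make the notation concrete, here is a minimal sketch of the label budget under the usual convention, assuming N is the number of phase classes and K the number of labeled images per class. The function name `label_budget` is hypothetical; the phase counts (7 for Cholec80, 8 for AutoLaparo) and the shot settings (1, 16, 32) are the ones quoted elsewhere on this page.

```python
def label_budget(n_classes: int, k_shots: int) -> int:
    """Total labeled images in a K-shot, N-class setup: N * K."""
    return n_classes * k_shots


# Phase counts as stated in the author feedback; shot settings as discussed in the reviews.
phase_counts = {"Cholec80": 7, "AutoLaparo": 8}
shot_settings = (1, 16, 32)

budgets = {
    (dataset, k): label_budget(n, k)
    for dataset, n in phase_counts.items()
    for k in shot_settings
}

for (dataset, k), total in budgets.items():
    print(f"{dataset}: {k}-shot -> {total} labeled images")
```

Under this reading, the 32-shot setting on Cholec80 corresponds to 7 × 32 = 224 labeled images in total.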

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I was not convinced that the use of task graphs can be considered ‘few-shot.’ Constructing a task graph typically requires multiple annotated procedures, which contradicts the few-shot premise. The authors should provide a stronger justification for this claim. Additionally, I believe the reference models in Table 1 are not well chosen. If the authors consider their work to be the first to truly address few-shot learning on these datasets, this should be clearly stated, rather than relying on comparisons to external models.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My main concern was regarding the claim of ‘few-shot’. I feel the paper does not strictly adhere to the conventional definition of few-shot learning. However, in the rebuttal, the authors provide a detailed response. I believe they should include a limitations section to clearly communicate these limitations to the readers of the manuscript.



Review #2

  • Please describe the contribution of the paper

    The paper addresses a challenge in surgical action segmentation: developing generalizable models that can work across different institutions and surgical procedures. The key innovation is the combination of few-shot learning with task-graph guided temporal modeling, which allows the system to adapt to new institutional settings with minimal manual effort while maintaining temporal coherence in surgical phase recognition. The authors propose the Surgical Phase Anywhere (SPA) approach with three modules: spatial adaptation via few-shot learning, temporal adaptation via task-graph guided diffusion modeling, and test-time adaptation. They claim better performance than other few-shot approaches.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Approach: The SPA network is lightweight, relying on textual information and few annotations. Similar approaches have previously been successful in non-surgical domains. The paper introduces an approach combining task graphs with diffusion modeling for temporal adaptation, which can be a productive and cost-efficient technique in the surgical domain to generate data.
    2. The solution looks clinically feasible for practical use, although the results could be more elaborate to justify practical usage.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Results are not clear: the RN50 row is populated with values from SOTA papers (Cholec80 [1], Bypass [9]). What does that mean? What are these SOTA methods? Maybe add more recent works?
    2. Are there numbers showing which surgical phases get the largest performance gain? How many phases are there in the reported institutions? How different are the definitions of these phases from those of SOTA Cholec80 and Bypass? Also, was the fact that some bypass procedures contain Cholec segments considered? A limitation study or discussion should at least be included.
    3. Writing: I suggest keeping the same naming convention for figures and tables. There is inconsistency: Tab, tables, fig…
    4. There is not enough description of the task graphs: how are they generated, and is the same task graph used?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a novel approach, the SPA network, applied to surgical videos, but lacks validation of results (details are listed in the section above).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Cost and Generation of Task Graph (R1,3) We clarify the task graph generation process and its role in few-shot learning. Both clinician-informed and LLM-generated graphs are low-cost to produce and far less expensive than full frame-wise annotation, supporting their practicality. In this work, task graphs are generated specifically for different datasets and institutions. For Cholec80 and MultiBypass, we construct task graphs based on surgical phase summaries reported in their respective official papers, e.g., the original authors have illustrated different task graphs for StrasBypass and BernBypass to highlight variations in phase transitions across institutions. For AutoLaparo, lacking official summaries, we consulted a clinician to define the hysterectomy task graph, which took a few seconds. To determine the minimum and maximum durations of each surgical phase, we either rely on reported statistics in the official papers or use experience-based estimations (e.g., “this phase usually lasts between X and Y minutes”). Importantly, generating these task graphs does not require annotating full surgical video frames. Instead, it is a high-level process grounded in procedural knowledge that clinicians are already trained in. We verified through informal feedback from several senior clinicians that creating such task graphs and estimating the relative duration of each surgical phase from a mental workflow map is not annotation-intensive given their expertise. Task graphs can also be automatically generated using encoded knowledge from LLMs (e.g., GPT-4o). We prompt the LLM to infer phase transitions and estimate durations, generating graphs to guide our refinement module. On Cholec80, using LLM-generated graphs yields F1 scores of 54.84 (16-shot) and 55.30 (32-shot), outperforming models without refinement (51.89 and 51.23) and matching those using manual graphs (55.05 and 55.69), highlighting the value of LLM-generated knowledge while reducing annotation effort.
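As a rough illustration of what such a task graph might encode, the sketch below stores allowed phase transitions and per-phase minimum/maximum durations, then checks a frame-wise phase sequence against them. This is an assumption-laden toy, not the SPA implementation: the `TaskGraph` class, the three-phase linear graph, and all duration bounds (counted in frames) are hypothetical, and SPA itself uses the graph as a prior for diffusion-based refinement rather than for rule checking.

```python
from dataclasses import dataclass


@dataclass
class TaskGraph:
    """Toy task graph: allowed transitions plus (min, max) duration per phase, in frames."""
    transitions: dict[str, set[str]]
    duration_bounds: dict[str, tuple[int, int]]

    def violations(self, frame_phases: list[str]) -> list[str]:
        """Return a description of each transition or duration that the graph forbids."""
        # Collapse the frame-wise sequence into [phase, length] segments.
        segments: list[list] = []
        for phase in frame_phases:
            if segments and segments[-1][0] == phase:
                segments[-1][1] += 1
            else:
                segments.append([phase, 1])

        problems = []
        for i, (phase, length) in enumerate(segments):
            lo, hi = self.duration_bounds[phase]
            if not lo <= length <= hi:
                problems.append(f"{phase}: duration {length} outside [{lo}, {hi}]")
            if i + 1 < len(segments):
                nxt = segments[i + 1][0]
                if nxt not in self.transitions.get(phase, set()):
                    problems.append(f"{phase} -> {nxt}: transition not in graph")
        return problems


# Hypothetical graph and bounds, for illustration only (not taken from any dataset paper).
graph = TaskGraph(
    transitions={
        "Preparation": {"Calot Triangle Dissection"},
        "Calot Triangle Dissection": {"Clip and Cut"},
        "Clip and Cut": set(),
    },
    duration_bounds={
        "Preparation": (1, 5),
        "Calot Triangle Dissection": (2, 10),
        "Clip and Cut": (1, 4),
    },
)

valid = ["Preparation"] * 2 + ["Calot Triangle Dissection"] * 3 + ["Clip and Cut"] * 2
noisy = ["Preparation"] * 2 + ["Clip and Cut"] + ["Calot Triangle Dissection"] * 3

print(graph.violations(valid))
print(graph.violations(noisy))
```

The point of the sketch is only that the graph needs a handful of edges and duration ranges per procedure, which is consistent with the claim that it can be written down from procedural knowledge without frame-level labels.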

Baseline Methods (R1,3) We clarify that all baselines are SOTA few-shot adaptation methods (Tip-Adapter, Linear Probing, LP-Text) that fine-tune the PeskaVLP encoders on surgical data with 1, 16, or 32 samples per class, ensuring a true few-shot setting. This setup is more realistic and challenging than PeskaVLP’s “10%” protocol, which still involved thousands of frames and full annotation of several videos. The RN50 SOTA row in Table 1 refers to results reported by prior fully supervised, full-shot methods that use a ResNet-50 backbone without temporal modeling. These methods train an RN50 model end-to-end on the entire labeled dataset and do not involve temporal models like LSTMs or Transformers. They serve as fair visual-only baselines for our few-shot setup with SPA. To highlight SPA’s visual adaptation ability, we exclude recent works like TeCNO, which use extra temporal modules. We will clarify this in the final version.

Surgical Phase Definitions (R2,3) Phase definitions differ across datasets due to procedure and institutional variations, e.g., Cholec80 (7), AutoLaparo (8), Bypass (9–10). Bypass videos do not have Cholec phases, so models aren’t directly transferable; SPA addresses this gap. We observe the largest performance gains in mid-procedure phases like Calot Triangle Dissection and Clip and Cut, which are typically visually ambiguous. A limitation is that task graph quality can affect performance. While our method reduces annotation effort, minimal expert input is still needed when adapting to new domains. We will discuss this in the final version.

Comparison to Prior Works (R2) None of the cited works (Teed & Deng, CVPR 2023; Kalfaoglu, ICCV 2021; Luo et al., AAAI 2024) appear to exist, suggesting a serious miscitation in the review. Our novelty lies in combining multi-modal mutual-agreement TTA with task-graph-guided diffusion refinement, both absent in prior work. We strictly follow standard TTA protocols, operating solely on unlabeled test data with no data leakage.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


