Abstract
We introduce the Surgical Action Planning (SAP) task for cholecystectomy procedures, which generates future action plans from visual inputs to address the absence of intraoperative predictive planning in current intelligent applications. SAP shows great potential for enhancing intraoperative guidance and automating procedures.
However, it faces challenges such as understanding instrument-tissue relationships and tracking surgical progress. Large Language Models (LLMs) show promise in understanding surgical video content but remain underexplored for predictive decision-making in SAP, as they focus mainly on retrospective analysis. Challenges like data privacy, computational demands, and modality-specific constraints further highlight significant research gaps.
To tackle these challenges, we introduce LLM-SAP, a Large Language Model-based Surgical Action Planning framework that predicts future actions and generates text responses by interpreting natural language prompts of surgical goals. The text responses potentially support surgical education, intraoperative decision-making, procedure documentation, and skill analysis. LLM-SAP integrates two novel modules: the Near-History Focus Memory Module (NHF-MM) for modeling historical states and the prompts factory for action planning.
We evaluate LLM-SAP on our constructed CholecT50-SAP dataset using models like Qwen2.5 and Qwen2-VL, demonstrating its effectiveness in next-action prediction. Pre-trained LLMs are tested in a zero-shot setting, and supervised fine-tuning (SFT) with LoRA is implemented. Our experiments show that Qwen2.5-72B-SFT surpasses Qwen2.5-72B with a 19.3% higher accuracy.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0426_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/XuMengyaAmy/SAP
Link to the Dataset(s)
https://github.com/XuMengyaAmy/SAP
BibTex
@InProceedings{XuMen_Surgical_MICCAI2025,
author = { Xu, Mengya and Huang, Zhongzhen and Zhang, Jie and Zhang, Xiaofan and Dou, Qi},
title = { { Surgical Action Planning with Large Language Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {566--575}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces Surgical Action Planning (SAP), a novel approach that builds upon LLMs to generate long-horizon sequences of discrete surgical actions directly from visual inputs. To address data privacy constraints common in clinical settings, the authors emphasize supervised fine-tuning rather than relying on zero-shot capabilities. Central to the proposed method is the LLM-SAP framework, which incorporates a Near-History Focus Memory module to capture recent surgical context, and a Prompt Factory for generating structured and context-aware action plans. Additionally, the authors propose a new evaluation metric, Relaxed Accuracy, designed to assess the correctness of predicted actions more flexibly by also crediting predictions that appear in near-future steps.
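The Relaxed Accuracy idea described here can be made concrete with a minimal sketch. This is an illustrative reconstruction, not the paper's exact definition: the function name, the `window` parameter, and the scoring rule (a prediction counts if it matches the ground truth at the current step or within the next `window` steps) are assumptions.

```python
def relaxed_accuracy(predicted, ground_truth, window=1):
    """Illustrative Relaxed Accuracy: a prediction is counted as correct
    if it matches the ground-truth action at the current step or at any
    of the next `window` steps (hypothetical formulation)."""
    correct = 0
    for t, pred in enumerate(predicted):
        # Ground-truth actions at step t and up to `window` steps ahead.
        near_future = ground_truth[t : t + 1 + window]
        if pred in near_future:
            correct += 1
    return correct / len(predicted)
```

For example, a prediction that arrives one step "early" relative to the ground-truth ordering would be penalized by strict accuracy but credited under this relaxed scheme.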
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper presents a clearly defined problem statement, with well-articulated objectives and success criteria.
- The paper is well-written and thoughtfully structured, with a particularly clear and accessible methodological section. The mathematical formulations are presented in a concise and intuitive manner, making the proposed approach easy to follow even for readers less familiar with the domain. Additionally, the figures and diagrams are well-designed (perhaps too many colours), effectively illustrating the workflow and key components of the system, which significantly enhances readability and comprehension.
- The introduction of Relaxed Accuracy reflects a thoughtful adaptation of evaluation methodology, accounting for the temporal flexibility inherent in surgical action sequences. It recognizes the value of capturing whether a predicted action occurs in the immediate or near-future steps.
- The reported performance gains through supervised fine-tuning are notable; however, it remains unclear why smaller models (e.g., 32B) often outperform larger ones (e.g., 72B) - is this due to overfitting, architectural differences, or the nature of the fine-tuning data? (Please also publish the 118 samples (fine-tuning) in your codebase)
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The set of surgical actions already constitutes structured label information. It is unclear why an LLM, which typically excels at unstructured text generation, is essential for this task, particularly when its outputs are constrained to a predefined set of structured labels.
- The utility of LLMs in this context would be more compelling if the action labels were unknown a priori or if the label space was highly diverse and variable. The paper does not provide sufficient background on the frequency, ordering, or variability of surgical actions. Including basic descriptive statistics or a visual representation (e.g., a Markov chain diagram) could help assess the added value of the proposed method compared to traditional machine learning approaches.
- The experimental section is relatively limited, relying solely on Qwen models without comparison to existing surgical action prediction baselines. This lack of benchmarking makes it difficult to assess the true performance and contribution of the proposed method within the broader context of prior work.
- The proposed Relaxed Accuracy metric, while well-motivated, may not meaningfully alter the evaluation outcome. It appears to scale the overall performance but does not fundamentally change model ranking or insight. If incorrect predictions rarely occur in the subsequent steps, the metric may have limited practical benefit.
- Code availability is currently lacking. The anonymous GitHub repository (accessed April 13, 2025) contains only a placeholder README.md file with the header “# SAP,” which limits reproducibility and transparency.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper is clearly written with a strong methodological presentation and introduces thoughtful components like the Near-History Focus Memory and Relaxed Accuracy metric, it falls short in several critical areas. The motivation for using LLMs over standard classification methods is not convincingly argued, the experimental evaluation is thin and lacks comparison to existing surgical action prediction models, and the absence of a functioning codebase limits reproducibility. These gaps make it difficult to fully assess the validity and impact of the proposed approach.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I appreciate the authors’ effort in responding comprehensively to the critiques and for making the code and fine-tuning results available. However, I remain unconvinced on several core aspects, particularly regarding the paper’s methodological fit and scientific framing. I will outline the key reasons below.
- Unclear Motivation for LLM Use in a Structured Prediction Task: While LLMs can offer valuable text-based justifications, this is not sufficient motivation alone to choose them over well-established methods for structured action prediction. The core issue is this: the task, as currently framed, is inherently a structured prediction problem with a closed label set—yet it is approached using an open-ended, generative method. This creates a mismatch between method and problem. Moreover, if the label space is predefined, one would expect the use of standard metrics like precision, recall, F1-score, and confusion matrices to rigorously evaluate factual correctness prior to justifications. Justifications should complement—not substitute—accurate predictions.
- Contribution Remains Conceptually Ambiguous: The paper currently blends two research goals without fully committing to either: (A) Can an LLM predict surgical actions? This would be a classification-style benchmark problem and requires comparative evaluation with existing structured prediction baselines in surgical workflow modeling. (B) Can LLMs generate helpful explanations for predicted surgical actions? This would lean toward human-centered AI or clinical decision support evaluation—perhaps involving a human study or qualitative assessment. Right now, the paper attempts to answer both but ends up doing neither rigorously. Without a clearer focus, either toward outperforming baselines in prediction or demonstrating the value of generated text in clinical contexts, the contribution is difficult to interpret and justify.
- Appreciation for Released Results and Models: The inclusion of AntGPT, Most Prob, and PS results is a welcome addition. It does demonstrate that Qwen models are especially well-suited for this task.
I acknowledge the promising direction of integrating language models into surgical planning workflows. However, I believe the current paper does not convincingly resolve the methodological mismatch nor provide sufficient evidence for the added value of its approach. I suggest the authors consider splitting the framing more clearly in future revisions: one version focused on prediction accuracy, another on human-AI interaction and explainability. Thank you again for the thoughtful rebuttal.
Review #2
- Please describe the contribution of the paper
The paper introduces the SAP task, incorporating prospective decision-making into robot-assisted minimally invasive surgery and filling the gap in intraoperative predictive planning in current research. The LLM-SAP framework, through the integration of the NHF-MM module and prompts factory, effectively combines visual and language models, demonstrating the potential of LLMs in surgical action planning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The contributions of the paper include: 1) An interesting task, Surgical Action Planning (SAP) for Robot-assisted Minimally Invasive Surgery (RMIS), is introduced. While existing work focuses more on retrospective analysis, this task is to generate future surgical action plans from visual inputs, which can support decision-making intra-operatively; 2) LLM-SAP is developed for the SAP task; the method consists of a Near-History Focus Memory module (NHF-MM) to model historical states and a prompts factory to generate action plans; 3) An evaluation metric, ReAcc, and a dataset, CholecT50-SAP, are introduced in this work.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. How does the proposed method compare to existing methods (such as sequential models like LSTM/Transformer, or SAP approaches based on VLM/diffusion models)? Extensive comparison is needed to clearly demonstrate the technical advantages of LLM-SAP in surgical planning. 2. The self-constructed dataset (225 samples) and SFT tuning samples (118) are too small in scale and are focused only on a single surgical procedure (cholecystectomy), so the model's generalization to other surgical scenarios is not verified, limiting the reliability of the conclusions. 3. The criteria for distinguishing between "near" and "distant" history in the NHF-MM module are not clearly defined, and the description of the visual-text fusion mechanism (Equations 1 and 3) is ambiguous, making technical reproduction difficult; moreover, the framework's inference latency and hardware requirements are not discussed, raising concerns about its practicality in actual surgical settings.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is well presented and the proposed task and method are interesting. The contribution lies in both the methodology and the application. However: 1) with such a big model, if the evaluation dataset is not big, is the accuracy of the fine-tuned model not most probably high? 2) if the dataset is not diverse, is it not easy for the model to be overfitted? 3) how do existing methods related to the task and methodology perform? It is suggested to present the contribution of the work compared to SOTA methods more clearly. 4) GPT-4o is used; how about other models? What would the accuracy be if other models were used?
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Most of my concerns are answered.
Review #3
- Please describe the contribution of the paper
This paper introduces Surgical Action Planning (SAP) as a novel task in robot-assisted minimally invasive surgery (RMIS), aimed at generating future surgical action plans from visual inputs to support intraoperative decision-making. To address challenges in modeling instrument-action relationships, temporal dependencies, and data privacy, the authors propose LLM-SAP, a Large Language Model-based framework that interprets natural language prompts of surgical goals to generate predictive and interpretable textual responses. LLM-SAP incorporates two key modules: the Near-History Focus Memory Module (NHF-MM) for capturing recent contextual information, and the Prompts Factory for dynamically guiding the model’s planning capabilities. The framework is evaluated on a modified dataset, CholecT50-SAP, adapted from CholecT50 for surgical action planning, using both LLMs and vision-language models (VLMs) like Qwen2.5 and Qwen2-VL. The authors also introduce a Relaxed Accuracy (ReAcc) metric that accounts for the natural variability in surgical workflows.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed LLM-SAP framework is methodologically sound and introduces two original components: the Near-History Focus Memory Module, which effectively models recent surgical context, and the Prompts Factory, which enables dynamic, goal-conditioned planning. Additionally, the paper is clearly written with nice illustrations, and the experimental results are in depth.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
While the paper introduces a promising framework, there are several notable weaknesses. Firstly, the link referenced in the abstract is empty. I’m interested to see an example of the dataset. Secondly, although the authors present Surgical Action Planning (SAP) as a general task, the study is limited to the CholecT50 dataset, which only contains laparoscopic cholecystectomy procedures. This narrow scope makes the term “Surgical” Action Planning somewhat misleading, and the task should be explicitly framed as specific to cholecystectomy. Thirdly, in Section 2.2, the LLM-SAP is described as predicting the next action, which aligns more closely with action anticipation or classification, rather than planning of a surgery. I think standard classification metrics such as precision, recall, and confusion matrices should be used as evaluation metrics to provide better comparison with other methodologies. Lastly, on page 6, under Section 3 Dataset, the dataset statistics are shown. The dataset used is actually very small, with only 225 samples and 5 action classes, raising concerns about the robustness and generalizability of the model’s performance, particularly for a classification task.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The authors may consider incorporating additional datasets such as SARAS-ESAD and JIGSAWS to broaden the scope of Surgical Action Planning and improve model generalizability. Additionally, aligning SAP terminology and task definition more closely with established benchmarks in action anticipation literature may enhance clarity and comparability with prior work.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper is well written with a sound methodology, my overall recommendation is tempered by concerns about the limited scope and scale of the study. The model is evaluated on a small dataset with only five action classes from a single surgical procedure, which restricts the generalizability and robustness of the proposed approach. The lack of broader data and standard classification metrics also limits the strength of the evaluation. Expanding the dataset and refining the task formulation would significantly strengthen the impact and applicability of the work.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
AC: We are encouraged that most reviewers lean toward acceptance.
Reviewer #1
- Compare the proposed method with existing methods. Ans: We have incorporated comparative results from existing models under supervised fine-tuning, such as AntGPT [1], Most Prob [2], and PS [3]. In the zero-shot setting, these baseline models fail to generate valid outputs because their predictions cannot be properly mapped back to the original action labels. In our LLM-SAP, the action prediction output is accompanied by clear and explainable text-based justifications, as the focus extends beyond numerical results alone. This explanatory capability is absent in many baseline models.
            SLAcc   VLAcc   ReSLAcc   ReVLAcc
AntGPT      19.10   19.20   33.17     29.76
Most Prob   49.25   44.86   62.81     59.33
PS          48.15   46.70   65.28     60.26
- The dataset is small, raising concerns about generalizability. Ans: (1) Annotating large-scale data is challenging. (2) The zero-shot experiments demonstrate that LLM-based methods achieve generalization without requiring retraining. (3) Explainable text-based justifications help compensate for the small-scale setting. (4) The framework design can be adapted to other surgical types.
- The “near” and “distant” are not defined; the visual-text fusion is ambiguous; the inference latency is not discussed. Ans: (1) The ‘near’ history refers to the most recent historical action clip. The ‘distant’ history comprises all historical action clips that precede the ‘near’ history. (2) The fusion mechanism dynamically aligns and integrates visual features with textual embeddings through cross-modal attention. (3) Inference latency: 10s per inference, encompassing progress assessment, safety evaluation, and next-action generation; less than 1s for action prediction only. SAP’s value may lie in assisting automated task completion rather than real-time execution.
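The near/distant partition described in this answer can be sketched in a few lines. This is a hypothetical illustration of the stated definition only; the helper names (`split_history`, `build_context`) and the prompt wording are assumptions, not the authors' implementation.

```python
def split_history(clips):
    """Partition historical action clips per the rebuttal's definition:
    the 'near' history is the most recent clip; the 'distant' history
    comprises all clips that precede it."""
    if not clips:
        return [], None
    return clips[:-1], clips[-1]

def build_context(clips):
    """Hypothetical prompt construction: distant clips are summarized
    coarsely (action labels only), while the near clip is kept in focus."""
    distant, near = split_history(clips)
    summary = ", ".join(c["action"] for c in distant)
    return f"Distant history: {summary}. Near history: {near['action']}."
```

Under this reading, the memory module's "focus" amounts to privileging the last clip over an increasingly compressed view of everything before it.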
Reviewer #2
1. The link is empty. Ans: The code and dataset are available at https://anonymous.4open.science/r/SAP-C
2. The task should be framed as specific to cholecystectomy. Ans: We will change SAP into SAP-C.
- Standard classification metrics should be used. Ans: In our LLM-SAP, providing text-based justifications alongside action prediction outputs is critical. This explanatory capability is absent in many classification models.
Reviewer #3
1. The motivation for using LLMs over standard classification methods. Ans: Compared with traditional classification methods, LLMs offer key advantages, such as: (a) Explainable text-based justifications: in our LLM-SAP, the action prediction output is accompanied by clear and explainable text-based justifications; this explanatory capability is absent in classification methods. (b) MLLMs can handle open-ended actions beyond fixed labels, adapt to unseen scenarios without retraining, and integrate text and video data seamlessly.
- The experimental evaluation lacks comparison to existing surgical action prediction models. Ans: We have provided comparative results from existing models under the supervised fine-tuning, such as AntGPT [1], Most Prob [2], and Probabilistic Sequence (PS) [3]. In the zero-shot setting, these baseline models fail to generate valid outputs because their predictions cannot be properly mapped back to the original action labels. Notably, LLM-based methods yield more flexible outputs and text-based justifications than classification models. The focus should extend beyond numerical results alone.
            SLAcc   VLAcc   ReSLAcc   ReVLAcc
AntGPT      19.10   19.20   33.17     29.76
Most Prob   49.25   44.86   62.81     59.33
PS          48.15   46.70   65.28     60.26
- The absence of a codebase limits reproducibility. Ans: Our code is available at https://anonymous.4open.science/r/SAP-C
References: [1] https://arxiv.org/abs/2307.16368 [2] https://arxiv.org/abs/2304.09179 [3] https://ieeexplore.ieee.org/document/9065078
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Reviewer comments and suggestions should be included in the final version if the paper is accepted. In particular, concerns expressed by R3.