Abstract

Visual Planning for Assistance (VPA) in Robot-Assisted Minimally Invasive Surgery (RMIS) holds significant potential for intraoperative guidance and procedural automation. This paper presents the Collaborative Surgical Action Planning (CSAP) task, which focuses on generating cooperative action plans based on linguistic surgical goals, highlighting the crucial need for coordinated multi-tool interactions in surgical procedures. The CSAP task emphasizes two core challenges: understanding tool-action interdependencies in the timeline and managing concurrent multi-tool interactions. To address these challenges, we propose CSAP-Assist, a VLM-based framework consisting of two key modules: a Recency-Centric Focus Memory Module (ReFocus-MM), which prioritizes recent surgical history while summarizing distant events to improve performance in complex scenes and long sequences; and a Hybrid Multi-Agent Module (HMM), featuring a central agent that provides an initial plan, prompting a dialogue with local agent instruments to iteratively refine their collaborative actions. We evaluated CSAP-Assist on datasets that include phantom and real surgical scenarios. Our extensive experiments show that CSAP-Assist substantially outperforms the baseline method, achieving a 15% higher planning precision for surgical action planning. The source code and dataset are available at https://github.com/einnullnull/Collaborative-Surgical-Action-Planning-Assist.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0427_paper.pdf

SharedIt Link: https://rdcu.be/eHw04

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05114-1_14

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/einnullnull/Collaborative-Surgical-Action-Planning-Assist

Link to the Dataset(s)

https://github.com/einnullnull/Collaborative-Surgical-Action-Planning-Assist

BibTex

@InProceedings{ZhaJie_CSAPAssist_MICCAI2025,
        author = { Zhang, Jie AND Xu, Mengya AND Wang, Yiwei AND Dou, Qi},
        title = { { CSAP-Assist: Instrument-Agent Dialogue Empowered Vision-Language Models for Collaborative Surgical Action Planning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {139 -- 148}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper addresses the novel Collaborative Surgical Action Planning (CSAP) task in robot‑assisted minimally invasive surgery, where multiple instruments must coordinate to achieve user‑specified goals. The authors propose CSAP‑Assist, a vision‑language model (VLM) framework composed of a Recency‑Centric Focus Memory Module (ReFocus‑MM) and a Hybrid Multi‑Agent Module (HMM).
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

1) Very Novel and Useful Task + Challenges Defined.

2) ReFocus‑MM effectively balances short‑term detailed frames with long‑term summaries, reducing context overload. HMM leverages multi‑agent dialogue to refine plans, introducing a reflection mechanism that improves alignment fidelity. — Both modules are well motivated and carefully integrated.

3) Persuasive experiments + reasonable metrics defined. Introduction of CA‑mAcc and CA‑ED aligns evaluation with surgical priorities (strict for critical actions, relaxed for idle steps).
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1) There is no discussion of how CSAP‑Assist handles ambiguous or conflicting visual cues (e.g., occlusions) beyond anecdotal qualitative results.

2) Explain how the size of the distant summary and the number of HMM dialogue rounds affect performance and computation.

3) What collaboration protocol (CP) messages are you sending between the central and local agents? Are they free‑form LLM prompts or structured JSON? A more structured schema (e.g., “tool_id”, “proposed_action”, “confidence”) could reduce LLM ambiguity.

4) I performed a literature search and found some related work missing:

https://arxiv.org/pdf/2503.18296 https://arxiv.org/pdf/2405.00716 https://arxiv.org/pdf/2503.17900
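
To make the structured-schema suggestion in weakness 3 concrete, here is a minimal sketch of what a validated CP message could look like. Only the three field names come from the review; the `CPMessage` class, the `parse_cp_message` helper, the action label set, and the example tool name are hypothetical and are not taken from the paper or its code.

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical label set, for illustration only (not from the paper).
ALLOWED_ACTIONS = {"grasp", "retract", "cut", "idle"}


@dataclass
class CPMessage:
    """One structured reply from a local instrument agent (assumed schema)."""
    tool_id: str
    proposed_action: str
    confidence: float

    def validate(self) -> None:
        # Reject labels outside the closed vocabulary and out-of-range scores.
        if self.proposed_action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {self.proposed_action!r}")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


def parse_cp_message(raw: str) -> CPMessage:
    """Parse a JSON string into a CPMessage and validate its contents."""
    msg = CPMessage(**json.loads(raw))
    msg.validate()
    return msg


reply = '{"tool_id": "left_grasper", "proposed_action": "retract", "confidence": 0.82}'
msg = parse_cp_message(reply)
print(asdict(msg))
```

Constraining replies to a closed vocabulary like this would let the central agent discard malformed answers instead of interpreting free text, which is the ambiguity the reviewer is concerned about.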
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Suggestions (for your follow up work on this) & Minor questions:

Rather than one monolithic LLM for each tool, consider parameter‑efficient fine‑tuning of a shared base model with per‑tool adapters.

For HMM, you might gain robustness by hybridizing prompting with gradient‑based fine‑tuning on a small number of annotated clips.

How exactly do you summarize “distant” action labels? Is it a fixed‑length learnable embedding (e.g., via an LSTM or Transformer encoder) or simple averaging?
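
For reference, the simplest non-learned answer to the question above would be a textual summary: keep the most recent actions verbatim and compress older ones into counts. The sketch below illustrates that option only; the function name `build_memory_prompt`, the `recent_k` parameter, and the action labels are assumptions, and nothing on this page confirms that ReFocus-MM works this way.

```python
from collections import Counter


def build_memory_prompt(history: list[str], recent_k: int = 3) -> str:
    """Keep the last `recent_k` actions verbatim; compress older ones to counts."""
    if recent_k > 0:
        distant, recent = history[:-recent_k], history[-recent_k:]
    else:
        distant, recent = history, []
    counts = Counter(distant)  # preserves first-seen order of action labels
    summary = ", ".join(f"{action} x{n}" for action, n in counts.items()) or "none"
    detail = " -> ".join(recent) or "none"
    return f"Earlier (summarized): {summary}\nRecent (verbatim): {detail}"


history = ["grasp", "retract", "grasp", "cut", "idle", "suture"]
print(build_memory_prompt(history, recent_k=3))
```

A count-based summary loses the ordering of distant actions, so a learned encoder, as the reviewer mentions, would be the alternative if that ordering matters.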
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Very cutting-edge and useful work. Quoting Yann LeCun: “Future of LLM is new architecture, reasoning and planning.” This paper falls into that category (planning). Must-accept work with minimal flaws. Would definitely raise to strong accept after my questions have been answered.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The paper introduces the CSAP task and the CSAP-Assist framework, which effectively address the challenges of multi-tool collaborative planning and long sequence dependencies in surgical scenarios, demonstrating clear clinical significance.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper introduces CSAP, a novel formulation of surgical action planning that explicitly addresses the multi-instrument coordination problem in robotic-assisted minimally invasive surgery (RMIS). While previous research on VPA focuses on retrospective analysis of operative videos, this work contributes to intra-operative AI assistance and is more clinically applicable. The CSAP-Assist framework contains a Recency-Centric Focus Memory Module (ReFocus-MM) and a Hybrid Multi-Agent Module (HMM), for efficient long-term memory management and coordinated multi-instrument planning.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

1. The comparative analysis with the latest applications of VLMs in surgical planning (such as dynamic visual target prediction) is insufficient, which may weaken the argument for the method’s cutting-edge nature.

2. Although the experiments cover both phantom and real scenarios, the dataset scale is small (e.g., SAR-RARP50 includes only 3 sequences), and the data sources or annotation details are not disclosed, which may affect the reproducibility of the results. Moreover, the rationality of the new metrics, CA-mAcc/CA-ED, requires further validation, for example, by comparing them with assessments made by clinical experts. In addition, it is recommended to also report the existing mAcc and ED metrics.

3. The description of the interaction mechanism between the “Collaboration Protocol” (CP) and the local agents in the HMM module is vague, and the termination conditions for the iterative process in Equations (3) and (4) are not clearly specified, which may hinder the reproducibility of the technique. In addition, the framework lacks a discussion on the integration of real-time visual feedback (e.g., delayed processing), which impacts the evaluation of its clinical applicability.
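
As an illustration of the kind of termination condition weakness 3 asks for, the sketch below shows a generic refinement loop that stops when no local agent changes its action or when a round budget is exhausted. It is not a reconstruction of Equations (3) and (4): the `Plan` representation, the `refine_plan` function, the `max_rounds` parameter, and the toy agents are all assumptions made for this example.

```python
from typing import Callable

# Hypothetical plan representation: tool_id -> action label.
Plan = dict[str, str]


def refine_plan(
    initial: Plan,
    local_agents: dict[str, Callable[[Plan], str]],
    max_rounds: int = 3,
) -> tuple[Plan, int]:
    """Refine until no agent changes its action, or until max_rounds is reached."""
    plan = dict(initial)
    for round_idx in range(1, max_rounds + 1):
        # Each local agent sees the current joint plan and proposes its own action.
        proposals = {tool: agent(plan) for tool, agent in local_agents.items()}
        revised = {**plan, **proposals}
        if revised == plan:
            return plan, round_idx  # fixed point: nobody changed anything
        plan = revised
    return plan, max_rounds  # round budget exhausted


# Toy stand-ins for VLM-backed agents (illustration only).
agents = {
    "left": lambda plan: "retract",
    "right": lambda plan: "cut" if plan["left"] == "retract" else "idle",
}
final_plan, rounds_used = refine_plan({"left": "idle", "right": "idle"}, agents)
print(final_plan, rounds_used)
```

Reporting both the stopping rule and the round budget would also address the computation question raised in Review #1, since each round costs one model call per instrument.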
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The problem the paper is trying to solve is clinically valuable and the proposed method is interesting. However: 1) Regarding the evaluation, what would be the gold standard for evaluating the methods? How can the proposed CA-mAcc/CA-ED be shown to be reliable? 2) This seems to be an interesting application/integration of a VLM-based planning framework. What would be the main contributions of the methodology part?
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper describes a novel method for collaborative surgical action planning, utilising a Visual Language Model with novel components allowing it to better infer local context and handle historical actions.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Identifies weaknesses of competing action planning methods when considered in the specific context of collaborative surgical action planning, and proposes an interesting pipeline.

The contribution is novel, using a visual language model with components allowing it to better infer local context (which the authors label a Hybrid Multi-Agent Module) and a component which improves the LM’s ability to rely on historical context without becoming overloaded by long action sequences (which the authors label a Recency-Centric Focus Memory Module).
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Given LMs’ reputation for hallucinating when presented with inadequate context, it would have been useful to mention examples where the algorithm fails, and in what manner.

Furthermore, the authors claim that the algorithm achieves acceptable clinical performance, but do not clearly qualify what this means in practice.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

HMM is bound to confuse readers with Hidden Markov Models, even if appropriately defined in the article. I would suggest another acronym (e.g. HMAM).

A short description of mAcc and ED would be appreciated (and an appropriate reference for mAcc)

This kind of paper would greatly benefit from a video demonstration of your result. Perhaps you could consider providing such a video alongside the source code and dataset.
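
For readers who want the short description requested above, the sketch below gives one common reading of the two baseline metrics: mAcc as the fraction of time steps where the predicted action matches the reference, and ED as the Levenshtein edit distance between the predicted and reference action sequences. These readings are assumptions; the paper's exact definitions, and those of the CA-mAcc/CA-ED variants, are not given on this page and may differ.

```python
def mean_accuracy(pred: list[str], gold: list[str]) -> float:
    """Fraction of time steps whose predicted action equals the reference."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def edit_distance(pred: list[str], gold: list[str]) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(gold) + 1))  # distances from the empty prefix of pred
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if p == g else 1
            curr.append(min(prev[j] + 1,          # delete p
                            curr[j - 1] + 1,      # insert g
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


pred = ["grasp", "cut", "idle"]
gold = ["grasp", "retract", "cut"]
print(mean_accuracy(pred, gold), edit_distance(pred, gold))
```

Step-wise accuracy penalizes a plan that is correct but shifted by one step, whereas edit distance tolerates such shifts, which is one reason the two are usually reported together.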
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Contribution is novel and solves an interesting clinical problem. Methodology could be further elaborated, but is probably sufficient for the purposes of the conference.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

Thanks for the positive feedback.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A

CSAP-Assist: Instrument-Agent Dialogue Empowered Vision-Language Models for Collaborative Surgical Action Planning

Author(s):