Abstract

In recent years, Visual Question Localized-Answering in robotic surgery (Surgical-VQLA) has gained significant attention for its potential to assist medical students and junior doctors in understanding surgical scenes. The rapid development of Large Language Models (LLMs) has recently offered promising solutions for this task. However, current methods struggle to establish complex dependencies between text and visual details, and have difficulty perceiving the spatial information of surgical scenes. To address these challenges, we propose Surgical-MambaLLM, the first method to combine Mamba2 with an LLM in the surgical domain. It leverages Mamba2's ability to effectively capture cross-modal dependencies and perceive spatial information in surgical scenes, thereby enhancing the LLM's understanding of surgical images. Specifically, we propose the Cross-modal Bidirectional Mamba2 Integration (CBMI) module, which exploits Mamba2's cross-modal integration capabilities for effective multimodal fusion. Additionally, tailored to the geometric characteristics of surgical scenes, we design the Surgical Instrument Perception (SIP) scanning mode, which guides how Mamba2 scans surgical images and enhances the model's spatial understanding of the scene. Extensive experiments demonstrate that Surgical-MambaLLM outperforms state-of-the-art methods on the EndoVis17-VQLA and EndoVis18-VQLA datasets, significantly improving performance on the Surgical-VQLA task.
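To make the two ideas in the abstract concrete, the sketch below illustrates (not reproduces) the flavor of the method: visual patch tokens are reordered by an illustrative "instrument-aware" scan before being concatenated with text tokens, and the joint sequence is scanned in both directions with a toy gated linear recurrence standing in for a Mamba2 block. All function names, the center-outward ordering, and the recurrence itself are assumptions for illustration only; the paper's actual SIP ordering and Mamba2 layers differ.

```python
import numpy as np

def sip_scan_order(h, w):
    """Illustrative 'instrument perception' ordering: visit patches by
    increasing distance from the image centre, on the assumption that the
    operative field sits near the centre of a surgical view.
    (The paper's actual SIP ordering may differ.)"""
    cy, cx = (h - 1) / 2, (w - 1) / 2
    coords = [(r, c) for r in range(h) for c in range(w)]
    coords.sort(key=lambda rc: (rc[0] - cy) ** 2 + (rc[1] - cx) ** 2)
    return [r * w + c for r, c in coords]

def gated_scan(x):
    """Toy stand-in for a Mamba2 selective scan: a data-dependent gated
    linear recurrence h_t = a_t * h_{t-1} + (1 - a_t) * x_t."""
    a = 1.0 / (1.0 + np.exp(-x.mean(axis=-1, keepdims=True)))  # per-token gate
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a[t] * h + (1.0 - a[t]) * x[t]
        out[t] = h
    return out

def cbmi_fusion(text_tokens, vis_tokens, h, w):
    """Bidirectional cross-modal fusion in the spirit of CBMI: concatenate
    text tokens with SIP-ordered visual tokens, scan the joint sequence
    forward and backward, and sum the two passes."""
    order = sip_scan_order(h, w)
    seq = np.concatenate([text_tokens, vis_tokens[order]], axis=0)
    fwd = gated_scan(seq)
    bwd = gated_scan(seq[::-1])[::-1]
    return fwd + bwd

rng = np.random.default_rng(0)
fused = cbmi_fusion(rng.standard_normal((4, 8)),   # 4 text tokens
                    rng.standard_normal((9, 8)),   # 3x3 grid of patch tokens
                    3, 3)
print(fused.shape)  # (13, 8): one fused embedding per text + visual token
```

The bidirectional pass is what lets early text tokens depend on later visual tokens (and vice versa) despite the recurrence being causal in each direction, which is the intuition behind using a bidirectional state-space scan for cross-modal fusion.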

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2478_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HaoPen_SurgicalMambaLLM_MICCAI2025,
        author = { Hao, Pengfei and Wang, Hongqiu and Li, Shuaibo and Xing, Zhaohu and Yang, Guang and Wu, Kaishun and Zhu, Lei},
        title = { { Surgical-MambaLLM: Mamba2-enhanced Multimodal Large Language Model for VQLA in Robotic Surgery } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {576--586}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a Mamba2 Integration module (CBMI) as an addition to VLM architectures and a custom scanning mode (SIP).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper proposes two clear methodological contributions that are well defined, explained and ablated. The evaluations show improved performance compared to the previous SOTA.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Dataset Choice: Unfortunately, the evaluations are conducted on only two similar datasets, EndoVis17-VQLA and EndoVis18-VQLA. Different and more diverse datasets would have given a more complete picture of the true capabilities of the proposed model (e.g. Cholec80-VQA, SSG-VQA, PSI-AVA-VQA, etc.).

    Limited Fundamental Novelty: The proposed method is sound and does improve the performance, but can be understood as a variation on established techniques and approaches (Mamba2, Scanning Modes). Because of this, the work does not seem to provide any surprising novel insights into Surgical VQLA.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though the contributions are well described and show improved performance, the paper does not provide any surprising novel insights, which has me leaning slightly toward recommending rejection.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The study addresses the Surgical-VQLA task with a novel approach that combines the Mamba2 architecture with an LLM. The authors propose a Cross-modal Bidirectional Mamba2 Integration module and a scanning mode for surgical images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Surgical-VQLA is an emerging development in Surgical VQA that provides localization along with the answer, so the answer and its reliability can be better understood. This study uses the recent Mamba2 architecture to solve this task. The authors use Mamba2 for text-vision fusion via a Cross-modal Bidirectional Mamba2 Integration (CBMI) module and also propose a Surgical Instrument Perception (SIP) scanning mode for the vision embedding. The proposed Surgical-MambaLLM also outperforms SOTA methods on the EndoVis17-VQLA and EndoVis18-VQLA tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper fails to explain the different components used: it does not describe which vision encoder or which LLM is used. Are both of these also Mamba-based?
    2. When the study already uses a large language model capable of providing the answer as text, what is the need for the answer head and location head?
    3. It is not clear at which stage the action head and location head are trained; this is not mentioned in the training strategy.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    There are spelling mistakes: in training strategy section 2.2, “vision” encoder is misspelled. Also, in Table 2, fusion module is written as “moudule”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study contributes novel components to the surgical VQLA task, such as using the Mamba2 architecture for feature fusion and proposing a scanning method specific to surgery. Some components are missing explanations that would add value to the paper and improve readability.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose Surgical-MambaLLM, which combines Mamba2 with an LLM for Visual Question Localized-Answering (VQLA) on surgical data, and are the first to integrate Mamba2 with an LLM in this domain. The work introduces a Cross-modal Bidirectional Mamba2 Integration (CBMI) module, which incorporates a Surgical Instrument Perception scanning mode to increase Mamba2's spatial awareness. The method is evaluated on the expanded EndoVis17-VQLA and EndoVis18-VQLA datasets (from the Surgical-VQLA++ paper) and shows improved results compared to the Surgical-VQLA++ method that could be of interest to the community.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors do a good job of motivating the problem by describing the limitations of current VQLA approaches that rely on Transformer-based methods (focusing on global features while neglecting local information)
    • The approach is easy to follow, with most components being off-the-shelf models integrated with CBMI, and the figures provide sufficient context for the different operations in the CBMI module.
    • The use of geometric priors through the Surgical Instrument Perception scanning mode within CBMI is interesting and promises improved results for spatial reasoning
    • Results show detailed comparison with other methods, and the ablations study the benefits of the CBMI module and the proposed SIP scanning mode
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • There is no mention of the loss functions used for training. A thorough description is beneficial for reproducibility, even if it is reusing losses from the original papers
    • Based on the results from the Surgical VQLA++ paper, the results appear to be either about the same or lower. Could the authors comment on this difference, considering they are very close?
    • Authors do not mention if the codebase will be made public.
    • Please specify the computational complexity of the operations and whether all six 3090s were used for training. This appears to be drastically different from the single 3090 needed for Surgical-VQLA++
    • Could the authors please clarify how the N term is specified for SIP mode?
    • It would be helpful to include 1-2 sentences on where the model is currently limited for the two datasets – either with location of instruments, instrument type, action or something else
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Rephrase the second sentence in the first paragraph – describes locating instrument twice
    • Typo - Section 2.2, line 3 has “spision” encoder instead of vision encoder
    • Typo - Table 2, column 3 header reads Fusion “Moudule” instead of module
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Comments above – overall a good paper in the Surgical VQA space

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

Thank you for your review comments. We have incorporated your valuable feedback and provided detailed explanations in the final version of the paper. Your input has been essential in refining our research work. We sincerely appreciate your thorough review and guidance.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    If the paper is accepted, the final version should thoroughly address all reviewer comments and concerns, particularly those regarding clarity and reproducibility.


