Abstract
Automatic analysis of cholecystectomy surgical videos has significant clinical value. However, current models are limited to simple tasks such as single-frame phase recognition and multi-tool classification, failing to effectively utilize video context for complex clinical reasoning, and they lack the ability to integrate medical textual knowledge with cholecystectomy images and long surgical videos. We propose CholecMamba, a model that compresses video feature sequences through the Mamba architecture and integrates them deeply with large reasoning language models to achieve multimodal reasoning over surgical videos. Our main contributions are: 1) a novel architecture that enables visual feature compression and knowledge feature injection, supporting multi-task analysis of videos of varying lengths; 2) the incorporation of segmentation category information generated by large language models into the decoder, enhancing surgical video understanding and reasoning segmentation through logical reasoning over medical knowledge; 3) the Surgical Reasoning Synthesis method, which leverages physician annotations and reinforcement learning with large language models to create the CholecReason dataset of 49K multi-round dialogues, establishing a new benchmark for surgical video understanding and reasoning segmentation. Experimental results demonstrate that our model achieves the best performance on existing datasets and CholecReason, with a closed-test score of 0.822, significantly outperforming the best competing model’s score of 0.728. Our code is available at https://github.com/displaywz/CholecMamba.
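The abstract's central mechanism (compressing a variable-length sequence of per-frame visual features into a fixed-length token sequence before fusing it with a language model) can be illustrated with a minimal, hypothetical sketch. Everything below is an assumption for illustration only: the class name VideoFeatureCompressor, the use of a GRU as a stand-in for the paper's Mamba blocks, the learned-query pooling, and all dimensions. It is not the authors' released implementation.

```python
# Minimal sketch (NOT the authors' implementation): compress a variable-length
# sequence of frame features into a fixed number of tokens for an LLM.
# A GRU stands in for the paper's Mamba (state-space) blocks.
import torch
import torch.nn as nn

class VideoFeatureCompressor(nn.Module):
    def __init__(self, feat_dim=1024, hidden_dim=512, num_query_tokens=32, llm_dim=4096):
        super().__init__()
        # Sequence model over frame features (stand-in for Mamba blocks).
        self.seq_model = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Learned query tokens that pool the whole video into a fixed length.
        self.queries = nn.Parameter(torch.randn(num_query_tokens, hidden_dim))
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        # Project the compressed visual tokens into the LLM embedding space.
        self.to_llm = nn.Linear(hidden_dim, llm_dim)

    def forward(self, frame_feats):                      # (B, T, feat_dim), T varies
        states, _ = self.seq_model(frame_feats)          # (B, T, hidden_dim)
        q = self.queries.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        pooled, _ = self.attn(q, states, states)         # (B, num_query_tokens, hidden_dim)
        return self.to_llm(pooled)                       # fixed-length tokens for the LLM

compressor = VideoFeatureCompressor()
print(compressor(torch.randn(1, 300, 1024)).shape)       # torch.Size([1, 32, 4096])
print(compressor(torch.randn(1, 40, 1024)).shape)        # torch.Size([1, 32, 4096])
```

The only point of this sketch is that videos of different lengths (300 frames vs. 40 frames) map to the same number of visual tokens, which is what would allow a single LLM context budget to cover both short clips and long surgical recordings.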
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4415_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/displaywz/CholecMamba
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WanZip_CholecMamba_MICCAI2025,
author = { Wang, Zipei and Pan, Sitian and Fang, Mengjie and Zhang, Ruofan and Tian, Jie and Dong, Di},
title = { { CholecMamba: A Mamba-based Multimodal Reasoning Model for Cholecystectomy Surgery } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a multitask framework for integrating multimodal data (text, image, video) to facilitate semantic understanding of surgical videos. It presents an annotation approach aimed at capturing surgical reasoning for visual question answering (VQA) and segmentation. Using this protocol, it constructed a dataset of approximately 49K multimodal data points, consisting of dialogues paired with images and videos. It then implemented a hierarchical Mamba-based network to demonstrate the effectiveness of the proposed approach across various downstream tasks, including segmentation and VQA.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Technical Contribution: The authors propose “CholecMamba”, a fusion of Mamba and LLM for multimodal learning, targeting semantic comprehension of surgical activities through segmentation and visual question answering.
- Non-technical Contribution: The paper introduces “CholecReason”, a dataset built upon raw cholecystectomy data and annotated with around 49K multimodal data points using a protocol termed Surgical Reasoning Synthesis (SRS).
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Novelty and Originality
- The model combines existing architectures (Mamba and LLM), with no clear novel components in model design.
- The dataset constitutes a new addition to the domain of multimodal surgical VQA; however, there is no comparative analysis demonstrating that the proposed SRS protocol is superior to existing ones.
- Overall, the paper presents incremental progress in multimodal learning and VQA within the surgical domain.
Technical Quality
- The method is technically sound, and the anonymized code integrates well-established frameworks and third-party tools.
- The model is evaluated against strong baselines, but it neglects several relevant recent works within the surgical domain.
- Multiple claims are presented without theoretical or empirical justification, including:
- Accuracy and efficiency of the SRS protocol
- The characterization of the identified limitations of current surgical video analysis as the three major ones
- The assertion that current methods lack domain-specific surgical knowledge
- The assertion of the limitation of visual token stacking with respect to dynamic video length
Clarity and Communication
- The manuscript lacks a thorough review of related surgical VQA works, many of which would be more appropriate baselines.
- While the paper follows a logical structure, the introduction lacks paragraph-level coherence and logical flow.
Experimental Evaluation
- The segmentation baseline is reasonable, but the VQA section omits key domain-specific methods such as:
- SurgicalGPT (Seenivasan 2023)
- SurgicalVQA (Seenivasan 2022)
- Surgical-VQLA and VQLA++ (Bai 2023, 2025)
- SSG-VQA-Net (Yuan 2024)
- PitVQA-Net (He 2024)
- Text in several tables (e.g., dialogues) is too small to read. Missing values in some tables raise concerns of potential cherry-picking.
- Experimental details are only partially reported. While code availability aids reproducibility, failure cases are not discussed.
- The limitations of the SRS protocol and its applicability to other procedures are not addressed.
Impact and Significance
- The paper has potential to inspire future research, but it lacks discussion on its real-world relevance and practical utility.
- More emphasis is needed on bridging the gap between academic research and clinical impact, outlining how the proposed method addresses practical surgical challenges.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Section-by-Section Analysis
Title
- The title highlights only the model and omits the equally important contribution of dataset creation and formalization of reasoning.
Abstract
- The introductory statement inaccurately claims that prior works are limited to simple or single tasks. This contradicts evidence from Alabi et al. (2025), which surveys deep multitask learning in surgical vision.
- With the above point, the abstract lacks a clear problem statement.
- Contributions 1 and 2 are redundant, as both relate to the same model architecture.
- The significance of the results is not discussed, making it difficult to gauge practical impact.
Introduction
- The paper mistakenly conflates multitask learning with multimodal learning. This confusion pervades the entire manuscript and should be corrected.
- Paragraph transitions are weak, and the narrative lacks logical progression.
- Results are mentioned prematurely, without any insight on the evaluation protocols used.
- Code and data release information is missing.
Methods
- Annotation Protocol: Claimed accuracy and efficiency of SRS are unsupported. Evaluation through inter-rater agreement, speed analysis, or robustness testing is needed.
- Protocol Novelty: No comparisons are made with existing VQA datasets (e.g., Cholec80-VQA, SSG-VQA, EndoVis18-VQA, PitVQA) to establish the novelty of the SRS protocol and the CholecReason benchmark.
- Missing Citations: Pretrained Qwen-RL-32B is used but not cited.
- Dataset Choice: The rationale for selecting CholecT45 and CholecSeg8k is unclear. Were existing labels inherited from the datasets? Why not larger versions of the datasets? Why limit the study to only cholecystectomy procedure?
- Equations:
- Equation 2 introduces unexplained variables (L, P).
- Equations 6 and 7 fail to describe how channel mixing is implemented; it is clearly not intuitive with just dropout and activation functions.
- Equation 8 uses an undefined ‘o’ operator.
- Architecture Discussion: The claim that Fig. 2 demonstrates “efficient feature mapping” is vague. Efficiency should be quantified, not inferred from architectural illustrations.
- Figure Readability: Dialogue text in Fig. 2 is not readable.
Experiments
- Missing details on dataset preprocessing, data splits, and size. No way to ascertain the absence of data leaks.
- Missing experimental setup details: device configuration, GPU usage, training time, loss functions, optimizers, batch size, learning rate, regularization strategies, etc.
Results
- Table 1 omits data for some models, possibly skewing comparisons.
- Figure 3 lacks readability and a supporting discussion.
- Abbreviations like SRS are used before (or without) introduction.
- Ablation Studies:
- It is unclear how the “w/o MixMamba” ablation model was constructed.
- It is unclear what data the “w/o SRS” ablation variant was trained on.
- The rationale for selecting 20% data ablation is unexplained. Why only 20%?
- No qualitative visualizations are provided for segmentation results, a core aspect of this work.
References
- Alabi, O., Vercauteren, T., & Shi, M. (2025). Multitask learning in minimally invasive surgical vision: A review. Medical Image Analysis, 103480.
- Seenivasan, L., Islam, M., Kannan, G., Ren, H.: Surgicalgpt: end-to-end language-vision gpt for visual question answering in surgery. In: International conference on medical image computing and computer-assisted intervention. pp. 281–290. Springer (2023)
- Seenivasan, L., Islam, M., Krishna, A.K., Ren, H.: Surgical-vqa: Visual question answering in surgical scenes using transformer. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 33–43. Springer (2022)
- Bai, L., Islam, M., Seenivasan, L., Ren, H.: Surgical-vqla: Transformer with gated vision-language embedding for visual question localized-answering in robotic surgery. In: 2023 IEEE International Conference on Robotics and Automation (ICRA). pp. 6859–6865. IEEE (2023)
- Bai, L., Wang, G., Islam, M., Seenivasan, L., Wang, A., Ren, H.: Surgical-vqla++: Adversarial contrastive learning for calibrated robust visual question-localized answering in robotic surgery. Information Fusion 113, 102602 (2025)
- Yuan, K., Kattel, M., Lavanchy, J.L., Navab, N., Srivastav, V., Padoy, N.: Advancing surgical vqa with scene graph knowledge. International Journal of Computer Assisted Radiology and Surgery pp. 1–9 (2024)
- He, R., Xu, M., Das, A., Khan, D.Z., Bano, S., Marcus, H.J., Stoyanov, D., Clarkson, M.J., Islam, M.: Pitvqa: Image-grounded text embedding llm for visual question answering in pituitary surgery. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 488–498. Springer (2024)
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Incremental addition of data and model to the domain of multimodal learning and VQA
- Inappropriate baselines for VQA
- Lack of comparison of benchmark SRS protocol to existing datasets
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The rebuttal failed to justify the authors’ choice to exclude a wider comparison with existing methods and datasets.
Review #2
- Please describe the contribution of the paper
- This paper proposes a hierarchical Mamba visual architecture, which serves as the vision encoder of video MLLMs.
- This paper introduces a new CholecReason dataset containing 49K multi-round dialogues, establishing a new benchmark for surgical video understanding and reasoning segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper is well organized.
- This paper introduces a new instruction-following dataset for surgical videos, which is beneficial to the community.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper does not compare with existing surgical instruction-following datasets such as [1].
- Many important training details are missing, including the amount of data used for each stage. While the training pipeline seems similar to LLaVA (pretraining + SFT), the authors do not clarify how much data was used in either stage.
- The model uses a randomly initialized vision encoder, whereas most MLLMs use CLIP-ViT with vision-language pretraining. Without such alignment, it’s unclear why this new encoder would perform better. Further justification is needed.
- All baseline models are image-based. No video MLLMs (e.g., LLaMA-VID [2]) are included, which weakens the evaluation given the temporal nature of surgical videos.
- UNet and ConvNeXt-UNet are included in Table 2 under the reasoning-guided segmentation setting, but these models do not take prompts and cannot perform reasoning. It’s unclear how their results were obtained. Clarification is needed.
[1] Li, Jiajie, et al. “LLaVA-Surg: Towards multimodal surgical assistant via structured surgical video learning.” arXiv preprint arXiv:2408.07981 (2024).
[2] Li, Yanwei, Chengyao Wang, and Jiaya Jia. “LLaMA-VID: An image is worth 2 tokens in large language models.” European Conference on Computer Vision. Springer Nature Switzerland, 2024.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Although this paper introduces a new dataset and proposes a novel video MLLM for surgical video understanding, it lacks many important training and evaluation details. For example, it does not report the amount of data used in the different training stages, nor does it provide comparisons with other datasets or existing video MLLMs. See the weaknesses section for more details.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Thanks for the authors’ rebuttal. Although I feel that the paper lacks many implementation details and comparisons with existing datasets, I find its ability to perform reasoning-based segmentation assistance in the surgical domain quite promising. Therefore, I choose to accept it.
Review #3
- Please describe the contribution of the paper
- The paper addresses the challenge of long-sequence comprehension and risk localization in the analysis of cholecystectomy videos - a task currently hindered by single-frame annotations and the limited ability of visual encoders to perform long-term understanding and knowledge reasoning.
- It highlights the limitations of pre-trained foundation models, particularly their lack of domain-specific knowledge, making them unsuitable for direct application in surgical video analysis.
- The authors propose CholecMamba, a novel model that enables multimodal fusion analysis (text, image, and video), facilitating complex reasoning/recognition tasks such as bleeding detection.
- The paper also introduces a new benchmark dataset, CholecReason, which supports question-answering tasks in surgical video contexts.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Leveraging professional surgeons’ voice-overs to annotate surgical frames is a strong point that grounds the model in expert clinical knowledge.
- The paper includes a well-designed Streamlit tool that showcases the model’s performance and usability.
- CholecMamba achieves state-of-the-art performance on video question-answering tasks by incorporating multi-frame reasoning, thus advancing the field of surgical video understanding.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The rationale behind the inclusion of mathematical data to enhance reasoning remains unclear. The manuscript lacks a methodological explanation of how this data impacts the model’s performance or reasoning capabilities.
- The voice-over annotations are provided only for single frames. Given the focus on long-term understanding and temporal relationships, the dataset would benefit significantly from sequential frame annotations with corresponding audio. Relying on an AI model to infer temporal context from isolated frame-level descriptions may omit critical information - such as surgical technique consistency, action effectiveness, or stylistic cues - that human annotators cannot express when presented with single frames.
- Minor: In the caption: “Table 2. Performance on segmentation of models across mDice, mIoU, and F1 score.” Consider rephrasing to: “Segmentation performance of models measured by mDice, mIoU, and F1 score.”
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The work shows good results on question-answering tasks; however, it still builds upon frame-wise annotations. Thus, the temporal information relies on cues derived from single-frame descriptions.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While the paper makes an incremental methodological advance, it offers a clear and timely emphasis on data-centric reasoning in surgical video understanding. The CholecReason dataset and the proposed CholecMamba architecture contribute to risk localization and long-term comprehension. In particular, the authors’ explanation of the “physician-reviewed, coherent temporal narratives” as an extension of single-frame annotations clarifies a valuable methodological choice.
Author Feedback
Thanks for the valuable feedback. Our responses:
Regarding Novelty (R1, R3): We acknowledge R1’s citation of the Alabi et al. survey, which analyzes GPT2/BERT variants in surgical VQA and anticipates large model development challenges, aligning with the EJC editorial article [1] on GPT-4o’s potential in complex surgical task accuracy and medical reasoning. Alabi et al. foresee (Discussion 6.1, 6.3) multi-turn, multi-task video analysis and stress (6.4, 6.5) the need for datasets with surgical complexity and full-video reasoning. Our framework models such multi-turn reasoning in variable-length gallbladder videos, and our SRS dataset (multi-task, variable lengths, complex reasoning) directly addresses these needs, better measuring large models’ clinical benefits.
CholecMamba: A novel MixMamba for variable-length videos (addressing Tang [2]’s long-video challenge), enabling end-to-end video reasoning and an innovative LLM decoder integration for segmentation guidance.
CholecReason: Addresses R1’s SRS concerns. It is the first dataset for complex multi-step cholecystectomy reasoning. SRS’s advantages lie in its generation/annotation pipeline (leveraging existing labels, VLMs, efficient physician speech supervision, and LLM-structured multi-turn dialogue, echoing [3]’s accuracy/efficiency), surpassing SSG-VQA (single-frame, automatic scene graphs, accuracy issues) and Cholec80-VQA (single-frame, triplets). Ablations (Table 3) show SRS boosts the VQA score by >10%, offering public, physician-validated reasoning data and proving efficiency by surpassing baselines with only 20% of the training data.
Comparison with Baselines (R1, R2): R1/R2 find the VQA baselines insufficient. We focus on gallbladder surgery reasoning. The suggested baselines (SurgicalGPT (GPT-2), SurgicalVQA (BERT), SSG-VQA-Net (classification), LLaVA-Surg (general videos)) have different focuses or older architectures, unsuitable for reasoning. EndoVis18-VQA and PSI-AVA-VQA target non-human kidney/prostate procedures, outside cholecystectomy, and are frame-level, unlike CholecReason’s second-to-hour-level videos. We fine-tune the latest VLMs (e.g., GPT-4o) on CholecReason, which are superior for long-video multi-shot tasks [4], and use updated baseline versions (e.g., GPT-4 vs. GPT-2) for fairness. LLaMA-VID (R2) is weaker than our Gemini/GPT-4 comparisons ([4] benchmarks). Table 1 (R1) omits LLaVA-1.6 video results due to its issues with long visual tokens and weaker performance than Qwen. In Table 2 (R2), UNet-like encoders/decoders replace CholecMamba’s.
Experimental Details (R1, R2, R3):
Training/Dataset (R1, R2): Preprocessing, segmentation (CholecT45 standard), data size, and parameters match VLM standards (Sec. 2.1, Fig. 1). The visual encoder is pre-trained on cholecystectomy segmentation (Sec. 2.2), not randomly initialized.
Math Augmentation (R3): Rationale: enhanced LLM math reasoning yields broad benefits; medical data alone is insufficient for large medical models [5].
Ablations “w/o MixMamba”/“w/o SRS” (R1): Sec. 3.4 details: MixMamba is replaced by an MLP; fine-tuning is performed without SRS data.
Temporal Info (Single-Frame Labels, R3): LLMs merged the initial keyframe voice annotations into physician-reviewed, coherent temporal narratives.
Our released code and CholecReason dataset provide evaluation criteria and models for cholecystectomy video understanding in the large model era. We aim to elevate video analysis from basic tool/phase recognition to deeper clinical reasoning.
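For concreteness, the “w/o MixMamba” ablation described above (MixMamba replaced by an MLP) could plausibly be built by swapping the cross-frame sequence mixer for a frame-wise MLP of the same width. The sketch below is a hypothetical illustration of that idea only: the function name build_mixer, the GRU stand-in for MixMamba, and all dimensions are assumptions, not the released code.

```python
# Hypothetical sketch of a "w/o MixMamba"-style ablation (assumption, not the
# released code): the cross-frame mixer is swapped for a frame-wise MLP, so no
# information flows between frames in the ablated variant.
import torch
import torch.nn as nn

def build_mixer(dim=512, use_mixmamba=True):
    if use_mixmamba:
        # Stand-in for the paper's MixMamba block: a sequence model that mixes
        # information across frames (the real model uses Mamba-style blocks).
        return nn.GRU(dim, dim, batch_first=True)
    # Ablation: a per-frame MLP with no cross-frame interaction.
    return nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

feats = torch.randn(2, 64, 512)                     # (batch, frames, feature dim)
mixed, _ = build_mixer(use_mixmamba=True)(feats)    # temporal mixing across frames
ablated = build_mixer(use_mixmamba=False)(feats)    # frame-wise only, same shape
print(mixed.shape, ablated.shape)                   # both: torch.Size([2, 64, 512])
```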
[1] Zhu N, et al. “OpenAI’s GPT-4o in surgical oncology: Revolutionary advances in generative artificial intelligence.” Eur J Cancer (2024).
[2] Tang Y, et al. “Video understanding with large language models: A survey.” IEEE T Circ Syst Vid (2025).
[3] Deitke M, et al. “Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models.” arXiv:2409.17146 (2024).
[4] Fang X, et al. “MMBench-Video: A long-form multi-shot benchmark for holistic video understanding.” NeurIPS 2024.
[5] Akter S N, et al. “MIND: Math Informed syNthetic Dialogues for Pretraining LLMs.” ICLR 2025.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A