Abstract
Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference in online video streams. Second, to improve data efficiency, we propose a self-supervised hierarchical pre-training paradigm that enhances EndoMamba’s representation learning. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a pretrained general-domain video foundation model. Extensive experiments on four downstream tasks—classification, segmentation, surgical phase recognition, and localization—demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
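The backbone design the abstract describes — bidirectional scans over tokens within a frame, causal past-to-present scans across frames — can be illustrated with a toy sketch. The linear recurrence below is a hypothetical stand-in for Mamba's selective state-space scan; the function names, shapes, and decay constant `a` are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def causal_scan(x, a=0.9):
    """Toy linear recurrence h_t = a*h_{t-1} + x_t, standing in for a
    selective-SSM (Mamba) scan: each output depends only on past inputs."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(x, a=0.9):
    """Bi-Mamba stand-in: a forward scan plus a reversed backward scan,
    so every position sees the full sequence (non-causal)."""
    return causal_scan(x, a) + causal_scan(x[::-1], a)[::-1]

def endomamba_style_block(video_tokens):
    """Factorized spatiotemporal mixing on tokens shaped (T, N, D):
    bidirectional over the N spatial tokens of each frame, then a
    causal scan over the T frames at each spatial position."""
    T, N, D = video_tokens.shape
    spatial = np.stack([bidirectional_scan(video_tokens[t]) for t in range(T)])
    temporal = np.stack([causal_scan(spatial[:, n]) for n in range(N)], axis=1)
    return temporal

x = np.random.default_rng(0).normal(size=(4, 6, 8))  # T=4 frames, N=6 tokens, D=8
y = endomamba_style_block(x)

# Causality check: perturbing the last frame must not change earlier outputs.
x2 = x.copy(); x2[-1] += 1.0
y2 = endomamba_style_block(x2)
assert np.allclose(y[:-1], y2[:-1])
```

The final check reflects why this factorization suits online video streams: outputs for earlier frames never depend on later ones.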
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2130_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/TianCuteQY/EndoMamba
Link to the Dataset(s)
N/A
BibTex
@InProceedings{TiaQin_EndoMamba_MICCAI2025,
author = { Tian, Qingyao and Liao, Huai and Huang, Xinyan and Yang, Bingyu and Lei, Dongdong and Ourselin, Sebastien and Liu, Hongbin},
title = { { EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {223 -- 233}
}
Reviews
Review #1
- Please describe the contribution of the paper
The authors propose a foundation model designed for real-time endoscopic video analysis. The method focuses on efficient spatiotemporal representation learning and inference, aiming to make it suitable for online applications. A hierarchical self-supervised pretraining strategy is also introduced to enhance generalization using both endoscopic and general video data.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The key strengths of this work include the creation of a large-scale dataset for endoscopic video analysis and the consequent efficient training of a foundation model specifically designed for endoscopic tasks. The integration of Mamba-based blocks for spatiotemporal modeling enables online inference, making the approach both practical and scalable for clinical use. While Mamba has been explored in other domains, its adaptation to medical imaging can still be better explored. The effectiveness and adaptability of the proposed strategy are demonstrated across multiple surgical datasets and tasks, highlighting its potential for generalization in diverse clinical scenarios.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
A weakness of the work is the absence of qualitative results, making it difficult to assess how the reported improvements translate into practical performance in real surgical scenarios. Additionally, although the authors created a large dataset for training, the evaluation is limited to only a few surgical datasets. It would strengthen the work to include results on widely used benchmarks such as Cholec80. Furthermore, assessing performance on datasets not used during training would help validate the generalizability of the proposed foundation model.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a weak acceptance for this paper. The work presents a meaningful contribution through the development of a large-scale dataset and an efficient foundation model tailored for endoscopic video analysis. The integration of Mamba-based spatiotemporal modeling for real-time inference is novel within the medical imaging domain and shows promising potential. However, the paper lacks qualitative results, making it difficult to assess the practical impact of the reported improvements. Additionally, the evaluation is limited to a small set of datasets, without demonstrating the model’s generalizability on broader benchmarks. Despite these limitations, the strengths justify acceptance with minor revisions.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces EndoMamba, a foundation model pre-trained on a mixed collection of 12 public endoscopic video datasets. For the architecture, EndoMamba integrates Bidirectional Mamba (Bi-Mamba) for spatial modeling within individual frames and vanilla Mamba for causal reasoning across the temporal domain. For the training strategy, the authors adopt a masked-autoencoder strategy like VideoMAE and leverage general-domain video foundation models, trained on large-scale vision-text data, for effective feature alignment. EndoMamba provides a pre-trained foundation model that is highly effective on multiple downstream tasks, indicating strong potential impact for research based on the modality of endoscopic video.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The model exhibits strong downstream performance on multiple tasks. With released code and pre-trained weights, this foundation model will facilitate future endoscopy-based studies, boosting performance and expediting research. This is of great value to the research community.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
There is no obvious fundamental flaw in this work. However, I do have some minor questions about the architecture and study design.
- The architecture design: my understanding is that the motivation for a specific design, rather than directly using VideoMamba, is to improve inference efficiency, as shown in Table 5. However, does this modification sacrifice model performance? For example, if a VideoMamba were trained on the same data with the same training regime, would it outperform EndoMamba? Or, rather than training VideoMamba from scratch under the same regime, would starting from a VideoMamba pre-trained on general data perform even better?
- What does the word “hierarchical” in “hierarchical pre-training” represent? To me, it seems there is no multi-stage or layered design in the training. (In Section 3.3 there is also a typo in the word “Hierarchical”.)
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
For surgical phase recognition, there are some new one-stage methods that achieve better performance than hierarchical architectures such as SKiT. The mechanisms applied there might be more suitable for this work.
These methods include: [1] Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition; [2] Neural Finite-State Machines for Surgical Phase Recognition; [3] MoSFormer: Augmenting Temporal Context with Memory of Surgery for Surgical Phase Recognition.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper shows potential for great impact on the research community. The experiments are also adequate to showcase the power of the proposed foundation model.
There is no obvious fundamental flaw identified in the paper.
I did not rank strong accept because of some minor questions I had.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose a foundation model named EndoMamba, with an architectural design incorporating both vanilla and bidirectional Mamba blocks, for endoscopic videos. To enhance representation learning, the authors propose aligning EndoMamba’s features with VideoMamba’s using a cosine-similarity loss. The experiments show that the proposed model, trained on a combination of 12 public datasets, achieves state-of-the-art performance and demonstrates the effectiveness of data scaling and feature alignment.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The experiments are solid, and the performance of the proposed method is outstanding.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The format is not correct. Please use the MICCAI format for the manuscript.
- The authors should also conduct ablation study on the architectural design of vanilla and bidirectional Mamba blocks to demonstrate the effectiveness of the design choice.
- The phrase “data limitations” is not a good motivation for the proposed pretraining paradigm, since it can only enhance representation learning rather than solve the data-scarcity problem.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experiments are solid (four downstream tasks with performance gains) and the method is intuitive. Only a few revisions need to be made, so I voted to accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their thoughtful feedback and positive comments. We address the main concerns as follows:
- Evaluation protocol (R1). We clarify that there is no overlap between the training and testing datasets. Specifically, we intentionally excluded Cholec80 from testing because it shares substantial overlap with CholecTriplet, which is included in our training dataset to align with prior work EndoFM. To avoid data leakage, we instead evaluated surgical phase recognition on AutoLaparo, a well-established benchmark with no overlap with our pretraining data. All downstream evaluations in our study follow established and accepted practices in the endoscopic video analysis community, and we selected benchmarks that align with our goal of assessing both performance and real-time applicability across diverse tasks.
- Architecture design (R2, R3). EndoMamba incorporates Bi-Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for causal reasoning across the temporal axis. This architectural design ensures a balance between expressive spatiotemporal modeling and efficient real-time inference. Although substituting Bi-Mamba with vanilla Mamba in temporal modeling slightly decreases performance due to limited temporal context, this compromise is necessary to enable efficient recurrent inference without recomputing past frames. Experimental results confirm that EndoMamba consistently delivers superior performance and significantly higher inference speeds compared to transformer-based or fully Bi-Mamba architectures, making it particularly suitable for real-time clinical deployment.
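The efficiency claim in this response can be made concrete with a minimal sketch. Again using a toy linear recurrence in place of Mamba's selective scan (the class name and constant below are hypothetical, not from the paper), a causal temporal model lets a streaming pass carry a hidden state and process each new frame in O(1), reproducing the offline result exactly:

```python
import numpy as np

def batch_causal_scan(x, a=0.9):
    """Offline causal scan over T frames: h_t = a * h_{t-1} + x_t."""
    h = np.zeros_like(x[0])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = a * h + x[t]
        out[t] = h
    return out

class StreamingScan:
    """Online equivalent: carry the hidden state across calls, so each
    incoming frame is processed without recomputing past frames."""
    def __init__(self, a=0.9):
        self.a = a
        self.h = None
    def step(self, frame):
        self.h = frame.copy() if self.h is None else self.a * self.h + frame
        return self.h

x = np.random.default_rng(1).normal(size=(5, 8))   # 5 frames, 8-dim features
offline = batch_causal_scan(x)
stream = StreamingScan()
online = np.stack([stream.step(f) for f in x])
assert np.allclose(offline, online)   # streaming matches the offline pass
```

A bidirectional temporal model, by contrast, would need the full sequence (and hence recomputation over past frames) each time a new frame arrives — the trade-off the authors describe.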
- Hierarchical pretraining (R2). We use the term “hierarchical pretraining” to describe our two-level representation learning strategy. This strategy involves (1) low-level video reconstruction to capture spatiotemporal information and (2) high-level feature alignment, transferring broader knowledge from a pretrained general-domain video foundation model. This hierarchical supervision effectively enhances EndoMamba’s generalization capability, particularly beneficial when training on relatively limited endoscopic data.
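As a sketch of how these two supervision levels might combine into a single objective — the MSE form, the masking convention, and the weight `lam` below are assumptions for illustration; the text confirms only masked reconstruction plus cosine-similarity alignment:

```python
import numpy as np

def masked_reconstruction_loss(pred, target, mask):
    """Low level: VideoMAE-style MSE computed on masked tokens only.
    pred/target: (B, N, D) token features; mask: (B, N) with 1 = masked."""
    diff = (pred - target) ** 2
    return float(np.sum(diff * mask[..., None]) / (np.sum(mask) * pred.shape[-1]))

def cosine_alignment_loss(student, teacher):
    """High level: 1 - cosine similarity between pooled student features
    and features from a frozen general-domain video teacher."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def hierarchical_loss(pred, target, mask, student_feat, teacher_feat, lam=1.0):
    """Combine low-level reconstruction with high-level alignment."""
    return (masked_reconstruction_loss(pred, target, mask)
            + lam * cosine_alignment_loss(student_feat, teacher_feat))

rng = np.random.default_rng(0)
target = rng.normal(size=(2, 16, 8))
mask = (rng.random((2, 16)) < 0.75).astype(float)   # ~75% of tokens masked
feat = rng.normal(size=(2, 32))
# Perfect reconstruction and identical features give (near-)zero loss.
assert abs(hierarchical_loss(target, target, mask, feat, feat)) < 1e-9
```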
- Motivation regarding data limitations (R3). We clarify that our hierarchical pretraining addresses data scarcity by enhancing representation learning, an approach identified as effective for data-efficient learning [1]. By combining low-level reconstruction with high-level feature alignment, it makes more effective use of the available endoscopic data. Our experimental results demonstrate that EndoMamba achieves superior performance despite dataset-size limitations compared to models pretrained on large-scale general-domain datasets.
- Other issues. We appreciate the suggestions for additional ablation studies on architectural design (R3), pretraining strategies (R2), and qualitative analysis (R3). Due to space constraints, we prioritized comprehensive quantitative results across four representative tasks. We plan to include more qualitative analyses and in-depth ablations in an extended journal version. As stated in the abstract, we will release the source code and pretrained weights to support reproducibility. [1] Henaff, Olivier. “Data-efficient image recognition with contrastive predictive coding.” ICML 2020.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Dataset, interesting methodology, and extensive experiments.