Abstract
Reconstructing 3D scenes from monocular surgical videos can enhance surgeons' perception and therefore plays a vital role in various computer-assisted surgery tasks. However, achieving scale-consistent reconstruction remains an open challenge due to inherent issues in endoscopic videos, such as dynamic deformations and textureless surfaces. Despite recent advances, current methods either rely on calibration or instrument priors to estimate scale, or employ SfM-like multi-stage pipelines, leading to error accumulation and requiring offline optimization. In this paper, we present Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical video, without any priors or extra optimization. Our model unifies these tasks by predicting globally aligned pointmaps, scale-consistent video depths, and camera parameters without any offline optimization. The core contribution of our method is expanding the capability of the recent pairwise reconstruction model to long-term incremental dynamic reconstruction via an uncertainty-aware dual memory mechanism.
The mechanism maintains history tokens of both short-term dynamics and long-term spatial consistency. Notably, to tackle the highly dynamic nature of surgical scenes, we measure the uncertainty of tokens via the Sampson distance and filter out tokens with high uncertainty. To address the scarcity of endoscopic datasets with ground-truth depth and camera poses, we further devise a self-supervised mechanism with a novel dynamics-aware flow loss. Extensive experiments on the SCARED and Hamlyn datasets demonstrate our superior performance in zero-shot surgical video depth prediction and camera pose estimation with online efficiency. Project page: https://wrld.github.io/Endo3R/.
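A note on the Sampson-distance criterion mentioned in the abstract: given a fundamental matrix F relating two frames, the Sampson distance is a first-order approximation of the geometric error of a point correspondence, so tokens whose correspondences show a large distance can be treated as dynamic or unreliable. Below is a minimal sketch, assuming NumPy, a known F, per-token pixel correspondences x1/x2, and a hypothetical threshold tau; it is an illustration under these assumptions, not the authors' implementation.

import numpy as np

def sampson_distance(F, x1, x2):
    # F: (3, 3) fundamental matrix between two frames.
    # x1, x2: (N, 2) corresponding pixel coordinates for N tokens.
    n = x1.shape[0]
    p1 = np.hstack([x1, np.ones((n, 1))])  # homogeneous coords, (N, 3)
    p2 = np.hstack([x2, np.ones((n, 1))])
    Fp1 = p1 @ F.T   # rows are (F @ p1_i)^T: epipolar lines in image 2
    Ftp2 = p2 @ F    # rows are (F^T @ p2_i)^T: epipolar lines in image 1
    num = np.einsum('ij,ij->i', p2, Fp1) ** 2  # (p2^T F p1)^2 per point
    den = Fp1[:, 0]**2 + Fp1[:, 1]**2 + Ftp2[:, 0]**2 + Ftp2[:, 1]**2
    return num / np.maximum(den, 1e-12)

def keep_reliable_tokens(F, x1, x2, tau=2.0):
    # Boolean mask: True for tokens with low epipolar (Sampson) error,
    # i.e. likely static and safe to keep in long-term memory.
    return sampson_distance(F, x1, x2) < tau

In the paper the uncertainty is used to filter memory tokens rather than raw pixels; how tokens are mapped to correspondences is part of the model and is not reproduced here.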
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0780_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0780_supp.zip
Link to the Code Repository
https://github.com/wrld/Endo3R
Link to the Dataset(s)
N/A
BibTex
@InProceedings{GuoJia_Endo3R_MICCAI2025,
author = { Guo, Jiaxin and Dong, Wenzhen and Huang, Tianyu and Ding, Hao and Wang, Ziyi and Kuang, Haomin and Dou, Qi and Liu, Yun-Hui},
title = { { Endo3R: Unified Online Reconstruction from Dynamic Monocular Endoscopic Video } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {170 -- 180}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents Endo3R, a unified 3D foundation model for online scale-consistent reconstruction from monocular surgical videos, addressing challenges such as dynamic deformation and textureless surfaces in endoscopic scenes. The authors further propose an uncertainty-aware dual memory mechanism for long-term incremental dynamic reconstruction, together with a self-supervised training strategy based on a novel dynamics-aware optical flow loss. Extensive evaluations on the SCARED and Hamlyn datasets show that the model excels at zero-shot depth prediction and pose estimation while achieving online computational efficiency. The authors promise to provide the code in a future release.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The proposed Endo3R framework jointly predicts globally aligned pointmaps, scale-consistent video depths, and camera parameters within an end-to-end framework, eliminating reliance on calibration, instrument priors, or multi-stage optimization pipelines. It introduces an uncertainty-aware dual memory mechanism for long-term online dynamic scene reconstruction and proposes a self-supervised training strategy with a novel dynamics-aware optical flow loss to mitigate label scarcity in datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The proposed uncertainty-aware dual memory mechanism is claimed to enable long-term online dynamic reconstruction. How is this claim supported by the experiments? Which results show that the long-term and dynamic challenges are actually overcome? The authors should add the corresponding discussion and analysis.
2. The authors use optical flow as a self-supervised signal. However, optical flow estimates carry no stability guarantee: errors can occur in low-texture regions, under strong specular highlights, or during fast motion in the video. How do the authors address this issue in the self-supervised process, or does it make little difference in practice? The authors are encouraged to provide more clarification.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experimental results presented in the paper look good and the metrics are competitive.
- Reviewer confidence
Not confident (1)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The authors introduce Endo3R, a unified 3D reconstruction model designed for online and scale-consistent reconstruction from monocular endoscopic video, operating without the need for priors or offline optimization. The key innovation lies in an uncertainty-aware dual memory mechanism that extends pairwise 3D reconstruction to dynamic long-term sequences. This memory system effectively distinguishes between short-term dynamics and long-term stable scene elements, improving robustness in highly deformable surgical scenes. The framework is trained in a hybrid supervised + pseudo-supervised manner, leveraging a novel dynamics-aware flow loss to enforce temporal consistency. Experiments demonstrate superior performance in video depth estimation across the SCARED and Hamlyn datasets, and in pose estimation on SCARED only.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The dual-memory mechanism, guided by Sampson-distance-based uncertainty filtering, is shown as a good approach to handling dynamic content in surgical scenes, which are particularly challenging due to non-rigid deformations and occlusions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper might overclaim "zero-shot" reconstruction and does not sufficiently explore dataset generalization or different surgical procedures. (1) The quantitative results on the SCARED [14] and Hamlyn datasets are strong for depth estimation, but the method was trained on the training split of SCARED [14]; thus, its zero-shot generalization capability is tested only on Hamlyn. Additionally, quantitative results for pose estimation are reported exclusively on SCARED [14]. (2) The only two datasets with reported results are from laparoscopic procedures, leaving out evaluations on other surgical modalities such as endoscopy, despite endoscopy datasets like Endomapper [3] or C3VD [4] being included during training. Given these points, the claim that the method achieves superior performance in zero-shot surgical video depth prediction and camera pose estimation without any prior information might be overly ambitious. Additionally, the chosen baseline Monst3R appears for the first time in the results section; including it in the introduction and highlighting its differences from the current method would benefit the paper. As the ultimate goal of the article is to provide online reconstructions, I advise showing some RMSE metrics on the accuracy of the final joint point clouds in addition to the per-frame results in Table 1.
I think it is somewhat confusing to call a method "self-supervised training" when it uses prior knowledge from the pretrained models [21] and [7]. None of the training losses are based, for example, on photometric errors on the images, which are the raw input to the system. I am also curious whether the system improves on the performance of [21] or [7] on its own.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a novel and well-motivated approach with strong quantitative results, particularly in depth estimation. The method shows potential for real-world surgical applications, and the idea of leveraging geometric consistency and domain adaptation is relevant. However, the evaluation lacks breadth in terms of dataset diversity and generalization, especially across different surgical modalities. Some additional discussion of pseudo-supervised training and joint surface accuracy could increase the impact of the paper. If the authors can address these issues in the rebuttal, the paper could make a solid contribution to the field.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper presents a deep-learning model to predict video depth, camera intrinsics, camera pose and dense reconstructions from monocular endoscopic videos online. Although the method follows the general framework presented in DUSt3R [24], architectural changes such as the dual memory mechanism are proposed to handle challenges unique to surgical videos. In addition, the authors show how a self-supervised learning scheme can be adapted in the absence of labeled data. The proposed method is compared to the state-of-the-art using publicly available datasets, and both quantitative and qualitative results suggest that the proposed method outperforms the state-of-the-art in terms of accuracy while maintaining efficiency (frame rate).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Broadly, the proposed method is based on the DUSt3R framework [24]. However, the authors propose important structural changes to handle challenges in surgical videos. In addition, the authors propose a self-supervised learning strategy when labeled data is not available.
- The proposed method is compared to the state-of-the-art using two publicly available datasets. The results suggest that the proposed method is more accurate without sacrificing efficiency.
- The paper reads well with an adequate introduction, clearly defined methods and well-presented experimental results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limitations of the proposed method are not discussed. Failure cases are not presented.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The paper begins with an adequate introduction to the clinical problem and cites prior art with its limitations highlighted. The contributions are identified. The methods are well described and sufficient for a conference paper. The results are summarized and presented well. Overall, the paper reads well and contains adequate information for a conference paper.
- In Table 1, the authors compare the performance of the proposed method to the state-of-the-art. When the proposed architecture is based on that published in [24], one would expect results for the original method. If the results are way off, at least mention it in the text.
- Without comments on the limitations of the method with some failure cases, the paper seems incomplete. Also, add some pointers to future research directions.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes deep learning architecture to predict video depth, camera intrinsics, camera pose and dense reconstructions from monocular endoscopic videos. Although based on [24], the authors propose architectural changes to address challenges present in surgical videos. In addition, the superiority of the proposed method is demonstrated using publicly available datasets. Therefore, the paper contains adequate scientific novelty and contributions to be presented at MICCAI. I would still encourage the authors to address the minor concerns listed under additional comments.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank all reviewers and the AC for their recommendations on our work. The reviewers remark that our approach is novel and effective at handling dynamic scenes, with state-of-the-art and well-presented experimental results. We address the major concerns below and will carefully incorporate the suggestions into the camera-ready version.

R1: Thanks for the constructive suggestions. 1) Long-term reconstruction: To support our claim of effective long-term dynamic reconstruction, we conducted zero-shot evaluations on the Hamlyn dataset, which features highly dynamic surgical scenes with instrument motion, tissue deformation, and illumination changes. Notably, while the compared methods [8][19] were trained on the Hamlyn dataset, our method was not. The ablation study in Tab. 3 also demonstrates the effect of the dual memory mechanism in improving reconstruction performance. We will revise the manuscript to highlight these results and discuss how they demonstrate the advantage of our dual memory mechanism in preserving spatial consistency under dynamic conditions. 2) Flow loss: Yes, optical flow can be less accurate in some conditions, but it is sufficiently robust in our method. First, our dynamics-aware flow loss is applied only to consecutive frames, where changes are relatively small due to video continuity, making the RAFT-predicted flow stable enough for supervision. Second, we restrict the loss computation to valid regions, excluding areas affected by occlusions or large motions, to reduce the influence of unreliable flow. Additionally, the experimental results in Tab. 3 show that the flow loss helps improve the performance of our method. As a future direction, it would also be valuable to learn the optical flow within our model with confidence estimation, which could further enhance robustness in highly dynamic scenes.

R2: Thanks for the constructive suggestion. We will include the limitations and future directions in our final version. Like other online methods, Endo3R may suffer from performance degradation over very long sequences due to the lack of global bundle adjustment, which can lead to drift in camera pose estimation. This issue is exacerbated in highly dynamic surgical scenes, where transient motion makes it hard to maintain long-term static tokens for 3D-consistent, stable predictions. It will therefore be an interesting direction to enhance Endo3R with global bundle adjustment to handle extra-long sequences with large dynamics.

R3: Thanks for the constructive suggestions. 1) Zero-shot claim: We follow previous depth estimation works [8][19][20] in adopting the commonly used SCARED dataset for training/evaluation and the Hamlyn dataset for zero-shot validation. Due to the lack of ground-truth poses in the Hamlyn dataset, we only evaluate pose estimation on the SCARED dataset. In future work, we will also evaluate our method on other surgical modalities with diverse datasets. 2) The difference from Monst3R will be included in our final version. 3) Thanks for the suggestion. Due to the page limit, we will report the reconstruction metrics in our future work. 4) Self-supervision claim: We use the raw RGB data to generate pseudo-ground-truth depth and optical flow with the pretrained models of [21] and [7]. These pretrained models are not updated or fine-tuned during training. Therefore, no ground-truth data is used for training, which aligns with the broader use of the term self-supervised training. We will revise the manuscript to describe our training process and avoid confusion.
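As a concrete illustration of the valid-region masking described in the response to R1 above, the following is a minimal sketch of a masked flow-consistency loss between consecutive frames. It is an assumption-laden reconstruction, not the paper's actual loss: the PyTorch tensor shapes, the RAFT pseudo-ground-truth flow, and the masking criterion are all hypothetical.

import torch

def masked_flow_loss(flow_pred, flow_raft, valid_mask):
    # flow_pred, flow_raft: (B, 2, H, W) flows between consecutive frames;
    # flow_raft is pseudo-ground-truth from a frozen pretrained flow model.
    # valid_mask: (B, 1, H, W) in {0, 1}; 1 where supervision is trusted
    # (e.g. static, unoccluded regions), 0 elsewhere.
    diff = (flow_pred - flow_raft).abs() * valid_mask
    # Average over supervised pixels only (x2 for the two flow channels).
    return diff.sum() / (2.0 * valid_mask.sum()).clamp(min=1.0)

Restricting the loss to consecutive frames and to a validity mask, as the rebuttal argues, keeps the pseudo-labels within the regime where pretrained flow estimators are most reliable.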
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A