Abstract
Limited perspectives and complex tissue deformations pose significant challenges in accurately reconstructing monocular dynamic surgical scenes. Many existing methods fail to fully exploit inter-frame relationships, resulting in suboptimal performance in processing complex tissue deformations and synthesizing novel views. To address these challenges, we propose Endo-GSMT, an accurate and high-quality method for dynamic endoscopic reconstruction from monocular surgical videos. Our method begins by comprehensively extracting both intra-frame information and inter-frame relationships from the raw monocular videos. We incorporate monocular depth priors and dense displacement field priors to generate pixel-wise 3D trajectories during the training phase. Then, we design a set of compact and low-dimensional $\mathrm{Sim}(3)$ motion bases, with each point’s motion represented as a weighted combination of these motion bases. Furthermore, we develop a novel depth loss function to address the scale inconsistency inherent in monocular depth priors. We evaluate our method using two distinct evaluation strategies, and the experimental results demonstrate that our method achieves state-of-the-art reconstruction quality. The code is available at https://github.com/M11pha/Endo-GSMT.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1655_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1655_supp.zip
Link to the Code Repository
https://github.com/M11pha/Endo-GSMT
Link to the Dataset(s)
EndoNeRF Dataset: https://github.com/med-air/EndoNeRF
StereoMIS Dataset: https://zenodo.org/records/7727692
BibTex
@InProceedings{GouHao_EndoGSMT_MICCAI2025,
author = { Gou, Hao and Wang, Changmiao and Yang, Jiahao and Liu, Yaoqun and Jia, Fucang and Xiao, Deqiang and Qin, Feiwei and Luo, Huoling},
title = { { Endo-GSMT: Endoscopic Monocular Scene Reconstruction with Dynamic Gaussian Splatting and Motion Tracking } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {213 -- 222}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper presents Endo-GSMT, a novel approach for dynamic reconstruction of endoscopic surgical scenes from monocular videos, addressing key challenges such as limited camera views and complex tissue deformation. It introduces depth and displacement field priors to effectively exploit inter-frame relationships, utilizes a compact set of motion bases to efficiently model the motion of canonical 3D Gaussians, and proposes a new ordinal depth loss to handle depth inconsistencies across frames.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Integrates inter-frame relationships using dense displacement field and depth priors.
- Introduces a compact Sim(3) motion base representation for dynamic modeling.
- Proposes an ordinal depth loss to address inter-frame scale inconsistencies.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- This paper claims to address limited perspectives and complex tissue deformations in dynamic surgical scene reconstruction. However, the proposed method fails to resolve the issue of limited perspectives. As for complex tissue reconstruction, it is questionable whether the 2D trajectory loss can reflect complex 3D motion.
- The link provided in the abstract is invalid and leads to nothing (checked April 13, EST).
- The order-based loss is taken directly from MoDGS/SparseNeRF, so its novelty is minimal.
- This paper appears to be inspired by MoDGS and applies the idea to the endoscopic domain, raising questions about its originality.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
In fact, after two years of development in dynamic tissue reconstruction, I strongly question the practical application and the novelty of this task. Moreover, once PSNR > 35, the significance of differences in image quality is limited.
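For context on the PSNR threshold mentioned above: PSNR relates to mean squared error as PSNR = 10 * log10(MAX^2 / MSE). The sketch below assumes 8-bit images (MAX = 255) and converts a PSNR value into the corresponding root-mean-square pixel error. The function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def rmse_from_psnr(psnr_db, max_value=255.0):
    """Invert PSNR = 20 * log10(max_value / RMSE) to recover the RMSE."""
    return max_value / (10.0 ** (psnr_db / 20.0))


# Hypothetical PSNR values on an assumed 8-bit intensity scale.
rmse_35 = rmse_from_psnr(35.0)
rmse_38 = rmse_from_psnr(38.0)
print(round(rmse_35, 2), round(rmse_38, 2))
```

Under these assumptions, 35 dB corresponds to an RMSE of roughly 4.5 gray levels out of 255, and a further 3 dB gain lowers it to roughly 3.2.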
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes Endo-GSMT, a novel framework for dynamic 3D scene reconstruction from monocular endoscopic surgical videos. The approach leverages 3D Gaussian Splatting (3DGS) with several enhancements: it integrates depth and dense displacement field priors, models motion using Sim(3) motion bases, and introduces an ordinal depth loss to address inter-frame scale inconsistency. Evaluations on the EndoNeRF and StereoMIS datasets demonstrate clear improvements in both novel view synthesis and reconstruction quality, outperforming state-of-the-art NeRF- and 3DGS-based methods. The novel parts of the paper can be summarized as: (1) incorporation of dense displacement priors for capturing inter-frame relationships; (2) Sim(3) motion bases, which offer an efficient way to model complex deformations; and (3) an ordinal depth loss that effectively addresses scale inconsistency in monocular depth estimation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Both frame extraction and novel view synthesis (NVS) strategies are used, strengthening the validity of results. Ablation studies demonstrate the effectiveness of each component. Strong quantitative and qualitative improvements over prior methods are clearly shown by using two different datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The biggest weakness is the lack of a runtime and efficiency discussion. Since this is a 3D reconstruction problem during surgery, real-time performance matters. The other concern is that the method appears applicable only to stereo cameras. It’s unclear how this approach generalizes to non-stereo or significantly different surgical domains. A comment on extensibility to non-endoscopic or robotic settings would be appreciated.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Table 3: Slight improvements beyond 60 motion bases seem negligible—consider adding error bars to clarify statistical significance. The abstract could better emphasize the practical impact for real-time surgical navigation. Typo in Section 3.2: “conducte ablation study” → “conduct an ablation study.”
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This is a well-executed and impactful paper with a technically sound and novel approach. The integration of inter-frame cues via Sim(3) motion bases and ordinal depth loss enhances both reconstruction and motion tracking quality. Minor additions—particularly on computational efficiency and robustness—will elevate the clarity and applicability of the work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper proposes another 3D Gaussian splatting-based method for dynamic scene reconstruction. This method represents motion of each Gaussian by a linear combination of displacement field vectors. The model is evaluated on frame extraction and novel view synthesis against SOTA models and performance improvement is demonstrated across 2 datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Nice ablation study.
- Elegant and compact solution to storing dynamic Gaussians.
- Demonstrated good performance improvements across a variety of metrics and datasets over several comparison models.
- Visualization results are strong but limited in number.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The initial trajectories seem to operate in image space, i.e., optical flow. What about motion along the z-axis for the computed bases?
- Both intrinsic calibration and motion tracking seem to rely entirely on DROID-SLAM rather than on novelty in the proposed method, which differs slightly from what the paper title implies.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Strong results and an interesting formulation, but the description of the methodology is slightly ambiguous.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their valuable feedback. We appreciate the comment that our work is “technically sound and novel” (R1) and the recognition of our evaluation strategy and experimental results (R1, R2). Our responses are below:
Runtime (R1): Our method takes ~0.5 h. This is slower than SOTA (~5 min), but typical for 3DGS methods and faster than NeRFs (hours). Our strength is high-quality reconstruction, which suits applications that prioritize quality over real-time performance (e.g., postoperative analysis, surgical skill assessment, case review, and medical training).
Only applicability for stereo camera (R1): Our method reconstructs from monocular videos, requiring no stereo input. It relies solely on monocular streams and priors (Abstract).
Scalability (R1): Our method has excellent scalability. Its core components (the fusion of two types of priors, the Sim(3) motion bases, and the use of GS for scene representation) are scene-agnostic.
DROID-SLAM Dependence (R2): DROID-SLAM provides only calibration (Sec. 3.1). “Motion Tracking” primarily uses the Sim(3) motion bases and trajectory optimization: (1) fuse the two priors into 3D trajectories; (2) generate Sim(3) motion bases from these trajectories. Gaussian motion is then a weighted, optimized combination of these bases. This is key to our dynamic scene representation and to “Motion Tracking”.
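The weighted combination of motion bases described above can be illustrated with a minimal Python sketch. It assumes each basis is a similarity transform (scale, rotation, translation) and that a point is moved by blending the per-basis transformed positions with normalized weights. The blending rule and the names `sim3_apply` and `blend_motion` are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: the blending rule and all names are assumptions.

def sim3_apply(scale, rot, trans, p):
    """Apply a similarity transform x -> scale * rot @ x + trans to a 3D point."""
    return tuple(
        scale * sum(rot[i][j] * p[j] for j in range(3)) + trans[i]
        for i in range(3)
    )


def blend_motion(bases, weights, p):
    """Move p by a normalized weighted sum of the per-basis transformed positions."""
    total = sum(weights)
    out = [0.0, 0.0, 0.0]
    for (scale, rot, trans), w in zip(bases, weights):
        q = sim3_apply(scale, rot, trans, p)
        for i in range(3):
            out[i] += (w / total) * q[i]
    return tuple(out)


IDENTITY = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))

# Two hypothetical bases: "stay in place" and "shift 2 units along x".
bases = [
    (1.0, IDENTITY, (0.0, 0.0, 0.0)),
    (1.0, IDENTITY, (2.0, 0.0, 0.0)),
]
moved = blend_motion(bases, [0.25, 0.75], (1.0, 0.0, 0.0))
print(moved)
```

With weights 0.25 and 0.75, the point at x = 1 moves three quarters of the way toward its shifted position, ending at x = 2.5. Only the per-point weights differ between points, which is what keeps such a representation compact.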
Z-axis Motion (R2): Initial trajectory generation is not limited to 2D optical flow in image space. We lift 2D pixels to 3D using monocular depth priors (providing initial Z-coordinates, Sec. 2.1), then combine them with dense displacement field priors to generate initial 3D trajectories.
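The lifting step described above can be sketched as follows, assuming a pinhole camera with known intrinsics (`fx`, `fy`, `cx`, `cy`), a 2D track obtained from the displacement field, and one depth value per tracked pixel. The function names and all numbers are hypothetical.

```python
# Illustrative sketch only: names, intrinsics, and values are assumptions.

def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with depth z to a 3D camera-space point (pinhole model)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_track(pixels, depths, fx, fy, cx, cy):
    """Turn a 2D pixel track plus per-frame depths into a 3D trajectory."""
    return [
        backproject(u, v, z, fx, fy, cx, cy)
        for (u, v), z in zip(pixels, depths)
    ]


# A pixel tracked over two frames; its depth prior changes from 50 to 55.
track_2d = [(320.0, 240.0), (330.0, 240.0)]
depths = [50.0, 55.0]
traj = lift_track(track_2d, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(traj)
```

Although the 2D track moves only horizontally, the lifted trajectory also changes in z (50 to 55), because the depth prior supplies the third coordinate at every frame.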
Address Limited Perspectives (R3): We did not intend to claim that we have completely “solved” the problem. In most cases, optimizing dynamic 3D Gaussians from a single video is severely ill-posed. To cope with this ambiguity, our method maximizes the use of the information available under limited perspectives: monocular depth estimation and dense displacement field priors provide complementary but noisy signals of the underlying 3D scene. Our method fuses these two priors and, through the design of the Sim(3) motion bases and loss functions, forms a globally consistent representation of scene geometry and motion. SOTA novel view synthesis results (quantitative and qualitative) show that our model learns the underlying 3D structure and dynamics, substantially mitigating these limitations.
Invalid Link (R3): We sincerely apologize and will open source the code if the paper is accepted.
2D Trajectory Loss Reflects 3D Motion (R3): Our 2D L1 loss (Eq. 7) supervises projected inter-frame Gaussian positions, not a standalone 2D motion model. It operates within our 3D framework, supervising 2D projections of dynamic 3D Gaussians (transformed by learned 3D Sim(3) bases, Sec. 2.2). This ensures 3D model dynamics match 2D observations, while fundamentally optimizing 3D parameters/motion.
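To illustrate how a 2D loss can still constrain 3D motion, the sketch below projects 3D points with a pinhole model and takes the mean L1 distance to tracked 2D targets. It is a simplified stand-in for the loss described above, not a reproduction of Eq. 7. The intrinsics, names, and values are assumptions.

```python
# Illustrative sketch only: not the paper's Eq. 7; all values are assumptions.

def project(p, fx, fy, cx, cy):
    """Project a 3D camera-space point to pixel coordinates (pinhole model)."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def track_l1_loss(points_3d, targets_2d, fx, fy, cx, cy):
    """Mean L1 distance between projected 3D points and their 2D track targets."""
    total = 0.0
    for p, (u_t, v_t) in zip(points_3d, targets_2d):
        u, v = project(p, fx, fy, cx, cy)
        total += abs(u - u_t) + abs(v - v_t)
    return total / len(points_3d)


K = dict(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
target = [(330.0, 240.0)]

# Same x and y, different depth: only the first point matches the 2D target.
loss_near = track_l1_loss([(1.0, 0.0, 50.0)], target, **K)
loss_far = track_l1_loss([(1.0, 0.0, 100.0)], target, **K)
print(loss_near, loss_far)
```

Moving the point along z alone changes its projection from u = 330 to u = 325, so the loss rises from 0 to 5 pixels. Purely radial motion along the viewing ray through a pixel would remain unobserved by such a term, which is where a depth-based loss becomes relevant.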
Order-based Loss Novelty (R3): To our knowledge, we are the first to apply an order-based depth loss to 3DGS-based endoscopic reconstruction, and we make key improvements for the task. MoDGS Eq. 3 uses tanh and α to approximate sign. One problem is that a pair of pixels with the correct order but a small predicted depth difference still contributes a non-zero loss value. In endoscopic scenes, because the tissue surface is often smooth, lacks texture, or is affected by lighting, the depth values output by monocular depth estimators in many areas are very close and noisy. This defect in the formula is amplified in such situations. Endo-GSMT Eq. 5 directly uses sign to determine the order of predicted depths. This “purely order” design aims to avoid overfitting to the small and unreliable depth differences common in this task. It allows the model to focus more on pixel pairs with significant depth differences and more reliable ordinal relationships, thereby learning the overall structure of the scene more stably.
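The difference between the two formulations can be sketched numerically. The two functions below are simplified illustrations of a tanh-relaxed ordering term and a pure sign-based term for one pixel pair; they are assumptions made for this example and do not reproduce MoDGS Eq. 3 or Endo-GSMT Eq. 5. How gradients are handled with a hard sign is not specified on this page and is left out here.

```python
import math

# Illustrative sketch only: simplified per-pair terms, not the papers' equations.

def sign(x):
    """Return -1.0, 0.0, or 1.0 according to the sign of x."""
    return float((x > 0) - (x < 0))


def soft_ordinal_loss(d_i, d_j, prior_i, prior_j, alpha=10.0):
    """tanh-relaxed term: compares tanh(alpha * depth gap) with the prior order."""
    return abs(math.tanh(alpha * (d_i - d_j)) - sign(prior_i - prior_j))


def hard_ordinal_loss(d_i, d_j, prior_i, prior_j):
    """Pure-order term: compares only the signs of the two depth differences."""
    return abs(sign(d_i - d_j) - sign(prior_i - prior_j))


# A correctly ordered pair with a tiny rendered depth gap (hypothetical values).
soft = soft_ordinal_loss(1.01, 1.00, 2.0, 1.0)
hard = hard_ordinal_loss(1.01, 1.00, 2.0, 1.0)
print(round(soft, 3), hard)
```

For this pair the relaxed term is about 0.90 even though the order is already correct, while the sign-based term is exactly 0. A misordered pair gives the sign-based term its maximum value of 2.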
We will also fix minor problems for better clarity (R1, R2).
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper is borderline+ with no strong support for acceptance or rejection. The rebuttal clarified that there is no dependency on DROID-SLAM for the motion tracking (answer to R3) and that z-axis motion has been taken into account (answer to R2), while acknowledging that the limited-perspectives problem is not fully solved (answer to R3). Several other works currently focus on 3D Gaussian Splatting for endoscopic applications; however, the differences from 3DGS/MoDGS were clarified, and despite being incremental, the inter-frame relationships and the efficient (R2) deformation model have raised interest among reviewers (R1, R3). An important weakness of the method is its low computational efficiency, which was not mentioned in the rebuttal and should be integrated in a revision of the paper. In the final version of the paper, the authors should also correct or clarify the significance of PSNRs > 35 and discuss the practical application of this research in an OR.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A