Abstract

Efficient three-dimensional reconstruction and real-time visualization are critical in surgical scenarios such as endoscopy. In recent years, 3D Gaussian Splatting (3DGS) has demonstrated remarkable performance in efficient 3D reconstruction and rendering. Most 3DGS-based Simultaneous Localization and Mapping (SLAM) methods rely only on appearance constraints to optimize both the 3DGS and the camera poses. However, in endoscopic scenarios, photometric inconsistencies caused by non-Lambertian surfaces and dynamic motion from breathing degrade the performance of SLAM systems. To address these issues, we additionally introduce an optical flow loss as a geometric constraint, which effectively constrains both the 3D structure of the scene and the camera motion. Furthermore, we propose a depth regularisation strategy to mitigate the problem of photometric inconsistencies and ensure the validity of 3DGS depth rendering in endoscopic scenes. In addition, to improve scene representation in the SLAM system, we improve the 3DGS refinement strategy by focusing on viewpoints corresponding to keyframes with suboptimal rendering quality, achieving better rendering results. Extensive experiments on the C3VD static dataset and the StereoMIS dynamic dataset demonstrate that our method outperforms existing state-of-the-art methods in novel view synthesis and pose estimation, exhibiting high performance in both static and slightly dynamic surgical scenes. Our code is available at https://github.com/vamWu/EndoFlow-SLAM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3495_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/vamWu/EndoFlow-SLAM

Link to the Dataset(s)

C3VD Dataset: https://durrlab.github.io/C3VD/

StereoMIS Dataset: https://zenodo.org/records/7727692

BibTex

@InProceedings{WuTao_EndoFlowSLAM_MICCAI2025,
        author = { Wu, Taoyu and Miao, Yiyi and Li, Zhuoxiao and Zhao, Haocheng and Dang, Kang and Su, Jionglong and Yu, Limin and Li, Haoang},
        title = { { EndoFlow-SLAM: Real-Time Endoscopic SLAM with Flow-Constrained Gaussian Splatting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {202 -- 212}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors claim to introduce optical flow as a geometric constraint, a hybrid approach to mitigate the scale-less nature of depth estimation, and a final refinement step for better scene representation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors conducted a thorough quantitative evaluation on two datasets (C3VD and StereoMIS), ensuring the robustness of their proposed method.
    2. The use of keyframes for global refinement is innovative and effective. The ablation study convincingly demonstrates the efficacy of this approach.
    3. The figures are well-designed, clearly conveying complex ideas and enhancing the readability of the paper.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Despite claiming to be a SLAM system, the paper does not provide benchmarks for speed, such as frames per second (fps) or training speed, which are crucial for SLAM applications.
    2. The loss formulation is not sufficiently novel. The MICCAI 2024 paper “Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction” has demonstrated similar results with a more innovative loss formulation.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While EndoFlow-SLAM has several strengths, including comprehensive benchmarking and effective figure design, it falls short in terms of speed benchmarking and loss formulation innovation.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Performance did not match my expectation.



Review #2

  • Please describe the contribution of the paper

    The authors proposed an EndoFlow-SLAM method to achieve real-time SLAM for endoscopic videos. A key contribution is the use of optical flow loss to constrain camera pose updates. To mitigate the scale ambiguity introduced by monocular depth estimation, gradient differences of the depth map are supervised.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper uses optical flow loss to constrain camera pose updates. To mitigate the scale ambiguity introduced by monocular depth estimation, gradient differences of the depth map are supervised.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The authors adopted different optimization strategies for keyframes and non-keyframes. How are these keyframes selected? Why is the flow constraint only applied to keyframes?

    2. As the camera moves, new scenes are typically introduced. How do the authors achieve the expansion and initialization of Gaussian points?

    3. The authors do not seem to mention the training time or total training iterations. Considering that the authors introduced optical flow and additional Keyframe Optimization and Global Refinement compared to conventional SLAM, the reviewer would like to know whether this prolongs the training time.

    4. The authors should specify the training iterations for Keyframe Optimization and Global Refinement in the paper. While increasing additional training iterations clearly improves reconstruction quality (as seen in the ablation study), the reviewer notes that the baseline methods being compared do not employ such extra training. If these additional iterations are removed, the “w.o. Refine” results appear worse than the original baseline models. Is this comparison fair?

    5. Free-SurGS [1] is another method that uses flow to constrain surgical video reconstruction and should be included in the comparative experiments.

    6. Does EndoFlow-SLAM account for dynamic scenes? The authors do not describe any design for dynamic scenarios in the method section, yet dynamic scenario evaluation is tested in the experiments.

    7. What learning rate is used for the camera pose parameters, given that camera movements in surgical scenes are generally smaller than in natural scenes?

    8. Regarding depth constraints, the authors emphasize the gradient difference loss. Since scale-invariant loss is a commonly used approach, the reviewer is curious about the effectiveness of gradient difference loss. However, the ablation study’s “w.o. Depth” does not reflect this.

    9. The ATE metric results differ significantly between the C3VD dataset and the StereoMIS dataset. Can the authors explain this discrepancy?

    [1] Guo J, Wang J, Kang D, et al. Free-SurGS: SfM-Free 3D Gaussian Splatting for Surgical Scene Reconstruction[C]//International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2024: 350-360.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    According to the experimental results presented in the paper, EndoFlow-SLAM demonstrates improved quality in surgical scene reconstruction. However, the paper appears to lack sufficient novelty, as the approach of using optical flow to constrain camera poses has already been adopted in prior methods. The authors should explicitly clarify how their method differs from existing approaches, and justify the fairness and experimental validity of their claimed contributions.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.
    1. In the rebuttal, the authors provide new experimental results for Free-SurGS, the comparison proposed by the reviewers, which does not meet the rebuttal requirements. The reviewers did not know of these restrictions when they asked for the comparison, and the authors could simply have ignored these experimental requests. However, the authors presented the experimental results, which is unfair to other papers in the rebuttal phase.
    2. The authors did not respond to the reviewers’ questions about the design of dynamic scene reconstruction, so the reviewers assume that there is no additional design. The experimental results on StereoMIS therefore seem meaningless, especially when compared with other static scene reconstruction algorithms.
    3. An ATE of 15 mm on StereoMIS seems unreasonable, although the authors claim that this is due to the deformation of the scene. Such a path tracking error suggests that tracking went wrong and that the reconstruction failed. The reviewers see no need to include such an experiment in the comparison.



Review #3

  • Please describe the contribution of the paper

    The paper introduces EndoFlow-SLAM, a real-time SLAM framework for endoscopic surgery. The key contribution is the integration of optical flow as a geometric constraint into a 3D Gaussian Splatting (3DGS)-based SLAM system, improving robustness to soft tissue motion such as breathing. The method also addresses scale ambiguity in monocular settings through depth normalization and scale-invariant loss. The approach outperforms prior methods on both static (C3VD) and dynamic (StereoMIS) datasets, achieving state-of-the-art results in tracking accuracy and novel view synthesis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel Integration of Optical Flow Constraints: Introduces a geometric constraint via optical flow into 3DGS-based SLAM, enabling robust tracking in dynamic endoscopic scenes (e.g., breathing motion), which is a key limitation in prior SLAM systems.

    Depth Regularization with Scale-Invariant Loss: Tackles monocular scale ambiguity using a hybrid depth normalization and regularization strategy, improving geometric accuracy in the absence of reliable depth sensing.

    Targeted 3DGS Refinement: Proposes a two-stage bundle adjustment refinement that prioritizes suboptimal keyframes, enhancing novel view synthesis and rendering fidelity without compromising real-time performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper’s novelty is in integrating existing components rather than introducing a fundamentally new formulation. The use of optical flow as a constraint draws on prior work such as GaussianFlow [4], and the scale-invariant depth loss is based on known formulations like Ranftl et al. [17]. The core SLAM architecture closely follows 3DGS-based systems like EndoGSLAM [22], making the contribution more incremental in nature.

    A limitation is the reliance on an initial RGB-D frame for scale initialization, which all mentioned methods suffer from. This assumption may not hold in typical monocular endoscopic systems, potentially limiting the method’s applicability unless stereo endoscopy or additional calibration is available. There is not much discussion of how it would affect their pipeline.

    While the method improves robustness to smooth tissue deformations (e.g., due to respiration), it does not explicitly handle more complex dynamic elements, such as moving instruments or rapidly deforming anatomy. Without a mechanism for segmenting or modeling independently moving objects, performance may degrade in such cases. This is acknowledged at the end of the paper, but the authors should modify the claims about handling dynamic scenes throughout the manuscript.

    The method is evaluated on high-end hardware (RTX 4090), but the paper does not report runtime (FPS?) or memory usage. As a result, it is unclear whether the system can maintain real-time performance on standard clinical computing platforms.

    Finally, some minor clarity issues remain. For instance, details such as the specific optical flow algorithm used and the choice of hyperparameters for the loss functions are not fully explained, which may affect reproducibility.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a solid and well-executed SLAM system, EndoFlow-SLAM, designed for challenging endoscopic environments. However, the novelty is incremental, largely combining existing techniques in a new context. The system also relies on an initial depth frame, which may limit applicability in typical monocular endoscopy.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper presents a 3D Gaussian Splatting (3DGS)-based SLAM system for endoscopic videos, which pose the challenges of dynamic, non-Lambertian scenes. The major technical contributions are threefold: optical flow minimization, depth regularization, and post-processing global pose and Gaussian alignment. The system is evaluated using endoscopic video benchmarking datasets (C3VD and StereoMIS) and compared against existing neural-rendering and 3D Gaussian-based SLAM systems.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed regularization for dynamic non-Lambertian scenes looks straightforward but well-founded and technically sound. It is reasonably evaluated using existing datasets in comparison to existing approaches.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper structure is slightly confusing and does not clearly and rigorously indicate where the technical contributions lie. For example:

    • Fig. 2 consists of too many colors and line types to highlight the added contributions to existing 3DGS-based SLAM systems. It does not show global bundle adjustment (Sect. 2.5), and the Active/Frozen indicators are misleading because they are noted for both T_i and T_i+1. I would advise the authors to use fewer colors and line types and to indicate specifically which elements are Active/Frozen.
    • Sect. 2.4 and 2.5, “Keyframe-Oriented Local Bundle Adjustment,” seem redundant. They can be combined.
    • NICE-SLAM has an updated version, namely NICER-SLAM. Why did you not compare with the latest one?
    • MonoGS [15] is not included in the evaluation. Why?
    • No performance (i.e., processing speed) evaluations are provided.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall quality is good and worth reporting at MICCAI. If the authors provide rebuttal comments regarding the questions I had, it would be greatly appreciated.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their constructive feedback.

Common Questions

1. Efficiency (R1, R2, R3, R4): On the C3VD dataset, our method achieves tracking and mapping speeds of 171.7 ms/frame and 198.6 ms/frame, respectively, with rendering performance exceeding 100 FPS.

2. Comparison with Free-SurGS (R1, R4): First, in 3D Gaussian Splatting, each 2D pixel is determined by multiple projected 3D Gaussians. Therefore, we associate each 2D pixel with a set of deformable and dynamic 3D Gaussians. Conversely, in Free-SurGS, each 2D pixel corresponds to a single fixed 3D point, so the flow loss is computed solely from this fixed 3D point. This loss is only an approximation and is not accurate enough. Our formulation is more reasonable because we calculate the weighted sum of multiple flow errors through the alpha-weighted blending of the contributions from the set of 3D Gaussians. Second, in the tracking stage, Free-SurGS optimizes only the camera poses while keeping the 3D Gaussians fixed. As a result, errors in the 3D Gaussians are propagated to the camera pose estimation. By contrast, our method jointly optimizes both the 3D Gaussians and the camera poses using a flow loss, resulting in improved accuracy. In particular, we observe that the performance improvement is more pronounced in dynamic scenes. Our formulation effectively models subtle dynamic deformations such as respiratory motion and minor tissue movements. On the dynamic StereoMIS dataset, our method achieves superior performance with a PSNR of 21.96 dB and an ATE of 15.47 mm, compared to 19.38 dB and 17.93 mm for Free-SurGS.
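
The alpha-weighted blending of per-Gaussian flow described in the Free-SurGS comparison can be illustrated for a single pixel. The following is a minimal sketch and not the authors' implementation: the function names `blend_weights`, `blend_flow`, and `flow_l1` are hypothetical, the front-to-back compositing order follows standard 3DGS rendering, and blending the flows before taking an L1 error against the reference flow is an assumption about how the loss is assembled.

```python
from typing import List, Sequence, Tuple

Flow = Tuple[float, float]

def blend_weights(alphas: Sequence[float]) -> List[float]:
    """Front-to-back compositing weights: w_i = alpha_i * prod_{j<i} (1 - alpha_j)."""
    weights, transmittance = [], 1.0
    for a in alphas:
        weights.append(a * transmittance)
        transmittance *= 1.0 - a
    return weights

def blend_flow(alphas: Sequence[float], gaussian_flows: Sequence[Flow]) -> Flow:
    """Rendered flow at one pixel: weighted sum of the 2D displacements
    of all Gaussians that cover the pixel (not a single fixed 3D point)."""
    w = blend_weights(alphas)
    u = sum(wi * f[0] for wi, f in zip(w, gaussian_flows))
    v = sum(wi * f[1] for wi, f in zip(w, gaussian_flows))
    return (u, v)

def flow_l1(rendered: Flow, reference: Flow) -> float:
    """Mean absolute difference between rendered and reference flow
    (the L1 norm is an assumed choice of error)."""
    return (abs(rendered[0] - reference[0]) + abs(rendered[1] - reference[1])) / 2.0

# Two Gaussians cover the pixel: the nearer one moves right, the farther one moves down.
alphas = [0.5, 0.5]
flows = [(2.0, 0.0), (0.0, 2.0)]
rendered = blend_flow(alphas, flows)  # compositing weights are 0.5 and 0.25
print(rendered, flow_l1(rendered, (1.0, 1.5)))
```

Because the rendered flow depends on both the Gaussian parameters and the camera pose, a loss of this form can in principle pass gradients to both, which is the joint optimization the rebuttal contrasts with pose-only tracking.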

R1

1. Keyframe management: Keyframes are selected based on the covisibility between frames, measured as the intersection-over-union (IoU) of observed 3D Gaussians, and on relative translation.
2. Flow constraint on keyframes only: Our current strategy is the most cost-effective. We found that applying the flow constraint optimization to all frames incurs a high computational cost with limited accuracy gains.
3. Expansion and initialization of Gaussian points: We identify newly observed 3D regions based on inter-frame covisibility and determine the corresponding 2D pixels. For these pixels, we initialize the centers of the 3D Gaussians by back-projection using monocular depth estimates, and the colors from the RGB values of the pixels.
4. Comparison without refinement: Both EndoGSLAM and our method incorporate an optional refinement strategy. We conducted two types of experiments on the C3VD dataset. When both methods disable refinement, we achieve a PSNR of 21.26 dB versus EndoGSLAM’s 17.64 dB. When both methods use refinement, our method reaches 25.18 dB compared to EndoGSLAM’s 22.16 dB. The main paper’s ablation study compares EndoGSLAM with refinement against our method without refinement, which explains the lower performance of our method there.
5. ATE discrepancy: The performance gap stems mainly from dataset complexity. C3VD captures static scenes, whereas StereoMIS involves real clinical data with higher noise and dynamics.
6. Depth ablation study: We find that gradient regularization is more important, as it leverages local differential constraints and emphasizes depth gradient variations rather than absolute values. We will include a detailed ablation study in the final version.
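
The keyframe criterion and the Gaussian initialization described in the R1 responses can be sketched as follows. This is an illustration under stated assumptions rather than the released code: the helper names, the rule of promoting a frame when the IoU falls below a threshold or the translation exceeds one, the threshold values `iou_thresh` and `trans_thresh`, and the pinhole intrinsics are all hypothetical, since the rebuttal does not specify them.

```python
import math
from typing import Sequence, Set, Tuple

def covisibility_iou(visible_a: Set[int], visible_b: Set[int]) -> float:
    """IoU of the sets of Gaussian IDs observed in two frames."""
    union = visible_a | visible_b
    return len(visible_a & visible_b) / len(union) if union else 0.0

def is_new_keyframe(visible_cur: Set[int], visible_kf: Set[int],
                    t_cur: Sequence[float], t_kf: Sequence[float],
                    iou_thresh: float = 0.9, trans_thresh: float = 0.05) -> bool:
    """Assumed rule: promote the current frame when covisibility with the last
    keyframe drops below iou_thresh or the camera has translated far enough."""
    low_overlap = covisibility_iou(visible_cur, visible_kf) < iou_thresh
    moved = math.dist(t_cur, t_kf) > trans_thresh
    return low_overlap or moved

def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) with the given depth into the
    camera frame; a new Gaussian center would start here before being moved
    to world coordinates with the estimated camera pose."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

last_kf = {0, 1, 2, 3}   # Gaussian IDs seen by the last keyframe
current = {2, 3, 4, 5}   # Gaussian IDs seen by the current frame
print(covisibility_iou(current, last_kf))
print(is_new_keyframe(current, last_kf, (0.0, 0.0, 0.0), (0.0, 0.0, 0.01)))
print(backproject(420.0, 240.0, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0))
```

In this sketch the depth passed to `backproject` stands in for the monocular depth estimate, so the resulting centers inherit its scale ambiguity.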

R2

Comparison with other methods: We outperform NICER-SLAM and MonoGS in accuracy, as we additionally consider a geometric constraint.

R3

1. Novelty clarification: GaussianFlow only optimizes the 3DGS, whereas our method jointly optimizes both the 3D Gaussians and the camera poses using a flow loss.
2. Limitations of RGB-D reliance: We clarify that our method relies on monocular depth estimation rather than depth cameras. To address scale ambiguity, we incorporate scale-invariant and gradient regularization.
3. Current limitations in dynamic scenes: We acknowledge the limitations of our method in highly dynamic scenes and plan to incorporate dynamic masking for explicit foreground separation.
4. GT flow: We use GMFlow to obtain it.
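
The scale-invariant and gradient regularization mentioned in the R3 responses can be illustrated with a small sketch. The exact losses are not given on this page, so everything below is an assumption: the normalization by median and mean absolute deviation follows the style of Ranftl et al. [17], which Review #3 names as the basis of the depth loss, and the function names and the L1 norm are hypothetical.

```python
from statistics import median
from typing import List

Depth = List[List[float]]

def normalize(depth: Depth) -> Depth:
    """Remove shift (median) and scale (mean absolute deviation), so that
    depth maps related by a*d + b with a > 0 normalize to the same values."""
    flat = [d for row in depth for d in row]
    shift = median(flat)
    scale = sum(abs(d - shift) for d in flat) / len(flat)
    scale = scale if scale > 0 else 1.0  # guard against constant maps
    return [[(d - shift) / scale for d in row] for row in depth]

def scale_invariant_l1(rendered: Depth, prior: Depth) -> float:
    """Mean absolute difference between the normalized depth maps."""
    a, b = normalize(rendered), normalize(prior)
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def gradient_l1(rendered: Depth, prior: Depth) -> float:
    """Mean absolute forward difference (horizontal and vertical) of the
    normalized residual, which penalizes mismatched local depth gradients."""
    a, b = normalize(rendered), normalize(prior)
    r = [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]
    terms = []
    for i, row in enumerate(r):
        for j, val in enumerate(row):
            if j + 1 < len(row):
                terms.append(abs(row[j + 1] - val))
            if i + 1 < len(r):
                terms.append(abs(r[i + 1][j] - val))
    return sum(terms) / len(terms) if terms else 0.0

prior = [[1.0, 2.0], [3.0, 5.0]]                            # monocular depth prior
rescaled = [[2.0 * d + 1.0 for d in row] for row in prior]  # same shape, other scale
distorted = [[1.0, 2.0], [3.0, 9.0]]                        # different shape
print(scale_invariant_l1(rescaled, prior), gradient_l1(rescaled, prior))
print(scale_invariant_l1(distorted, prior), gradient_l1(distorted, prior))
```

In this sketch a rendered depth map that differs from the prior only by scale and offset incurs essentially zero penalty, while a change in relative structure is penalized by both terms.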

R4

Please refer to the common questions.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Considering the scores, this is a borderline paper. I make my recommendation based on the found violation (reporting new experimental results in the rebuttal is not allowed). It is up to PCs to decide if such a violation is a serious one.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This AC has read the rebuttal, confidential comments, original reviews, and the manuscript in detail. This AC agrees with the confidential comments and has taken them into consideration.

    According to https://conferences.miccai.org/2025/en/REBUTTAL-GUIDELINES.html (accessed June 5, 2025), it is stated that “New/additional experimental results in the rebuttal are not allowed, and breaking this rule is grounds for automatic desk rejection. It is, however, allowed to amend the presentation of existing results.” Indeed, the authors have introduced new results in their rebuttal; however, as it was done at the request of the reviewer, and these results can be used to amend the presentation of existing results, this AC considers it to be a non-issue for this rebuttal.


