Abstract

3D scene reconstruction from stereo endoscopic video data is crucial for advancing surgical interventions. In this work, we present an online framework for real-time, dense 3D scene reconstruction and tracking, aimed at enhancing surgical scene understanding and assisting interventions. Our method dynamically extends a canonical scene representation using Gaussian splatting, while modeling tissue deformations through a sparse set of control points. We introduce an efficient online fitting algorithm that optimizes the scene parameters, enabling consistent tracking and accurate reconstruction. Through experiments on the StereoMIS dataset, we demonstrate the effectiveness of our approach, outperforming state-of-the-art tracking methods and achieving comparable performance to offline reconstruction techniques. Our work enables various downstream applications, thus contributing to advancing the capabilities of surgical assistance systems.
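
Below is a minimal, self-contained sketch of the kind of control-point deformation described above: a sparse set of control points, each carrying a translation and rotation offset, is blended onto the dense Gaussian centres with distance-based weights. This is an illustrative assumption about the general technique, not the authors' implementation; the function names, the (w, x, y, z) quaternion convention, and the inverse-distance k-nearest-neighbour weighting are hypothetical choices.

```python
# Illustrative sketch (not the authors' code): warping dense Gaussian centres with a
# sparse set of control points, each carrying a translation offset and a rotation
# quaternion. The inverse-distance k-NN blending is an assumed, generic scheme.
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert unit quaternions (N, 4) in (w, x, y, z) order to rotation matrices (N, 3, 3)."""
    w, x, y, z = q[:, 0], q[:, 1], q[:, 2], q[:, 3]
    row0 = np.stack([1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)], axis=-1)
    row1 = np.stack([2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)], axis=-1)
    row2 = np.stack([2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)], axis=-1)
    return np.stack([row0, row1, row2], axis=1)

def deform_centres(mu, ctrl_xyz, ctrl_dt, ctrl_dq, k=4):
    """Deform Gaussian centres `mu` (M, 3) by blending the k nearest control points.

    ctrl_xyz: (N, 3) control-point positions in the canonical scene
    ctrl_dt:  (N, 3) per-control-point translation offsets
    ctrl_dq:  (N, 4) per-control-point unit quaternions (w, x, y, z)
    """
    dist = np.linalg.norm(mu[:, None, :] - ctrl_xyz[None, :, :], axis=-1)  # (M, N)
    knn = np.argsort(dist, axis=1)[:, :k]                                  # (M, k) nearest controls
    w = 1.0 / (np.take_along_axis(dist, knn, axis=1) + 1e-8)               # inverse-distance weights
    w /= w.sum(axis=1, keepdims=True)
    rot = quat_to_rotmat(ctrl_dq)                                          # (N, 3, 3)
    warped = np.zeros_like(mu)
    for j in range(k):                                                     # accumulate the k contributions
        idx = knn[:, j]
        local = mu - ctrl_xyz[idx]                                         # position relative to the control point
        warped += w[:, j:j + 1] * (ctrl_xyz[idx] + np.einsum('mij,mj->mi', rot[idx], local) + ctrl_dt[idx])
    return warped

# Toy usage: 1000 Gaussian centres, 16 control points with small random offsets.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = rng.uniform(-1, 1, size=(1000, 3))
    ctrl = rng.uniform(-1, 1, size=(16, 3))
    dt = 0.01 * rng.standard_normal((16, 3))
    dq = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (16, 1))                  # identity rotations
    print(deform_centres(mu, ctrl, dt, dq).shape)                          # (1000, 3)
```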

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0373_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0373_supp.zip

Link to the Code Repository

https://github.com/mhayoz/online_endo_track

Link to the Dataset(s)

https://zenodo.org/records/10867949

BibTex

@InProceedings{Hay_Online_MICCAI2024,
        author = { Hayoz, Michel and Hahne, Christopher and Kurmann, Thomas and Allan, Max and Beldi, Guido and Candinas, Daniel and Márquez-Neila, Pablo and Sznitman, Raphael},
        title = { { Online 3D reconstruction and dense tracking in endoscopic videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This submission proposes a method for tracking anatomical landmarks in robotic surgery video (stereo and kinematics are required).

    The method is based on the popular Gaussian Splatting technique for 3D scene reconstruction, which is extended with dynamic addition of Gaussians as new parts of the scene are visualised, and a Gaussian deformation field that enables modelling non-rigid tissue over time.

    The method is compared against 2D tracking techniques on the StereoMIS dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • For the most part, the methodology is very clearly explained with careful and rigorous notation.

    • The method is a substantial extension to Gaussian Splatting that may find uses beyond 2D point tracking.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The experimental comparisons against PIPs++ and RAFT are not entirely fair, since neither of these methods fully exploits the information available in StereoMIS, namely camera kinematics.

    • Despite evident effort in explaining the methodology, further details need to be explained for reproducibility (see detailed comments)

    • Some design choices with potential impact on method performance (e.g., the random selection of control points) have not been evaluated in the submission (see detailed comments)

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Under the assumption that no code will be released (no information in the paper), reproducibility is dependent on further clarification of some details.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    — Formulation details —

    • I’d suggest explicitly defining H as the quaternion space, since this symbol may not be immediately familiar to a broad audience
    • The rotation vector field is defined as a map from R^3 -> H. Why not an H -> H map, and what exactly are the 3 input parameters of this mapping with respect to the output quaternion?
    • Control points are initialised as a random subset of all Gaussians? Why random? This seems a risky and sub-optimal choice. Don’t you run the risk of not covering certain portions of the scene well? Wouldn’t it make more sense to enforce some guarantee that points are sampled across the whole scene? All these questions would warrant testing the method with different random seeds and assessing whether performance is affected.
    • There are no details on how the rotation deformation (delta_q) of control points is initialised. Zero?
    • Why is isometric deformation (Eq. 8) only enforced for translation, and not rotation as well?

    — Experiments —

    • The authors select 200-frame subsequences from StereoMIS, but no justification or reproducible details are provided. Why limit to 200 frames? Is it a field-of-view issue? Which specific 200 frames are selected, and is this a manual selection? How can readers reproduce this exact experiment?
    • No limitations of the experimental setup are discussed. The authors simply say that the method “outperforms baselines”; however, there are key biases in this experiment, namely that the proposed method uses camera pose information as part of the tracking process while the competing approaches do not. This limits the applicability of the method to robotic surgery scenarios (or settings with accurate stereo laparoscope tracking, which is rare), and it opens up the discussion of whether the fundamental approach is really superior, since the other methods could also be adapted to include camera pose information.
    • The method is in essence a 3D reconstruction pipeline, where 2D tracking is done by simple post-processing point projection. This invites comparison against deformable SLAM methods, where similar tracking could be achieved, e.g. Rodriguez et al., “SD-DefSLAM: Semi-Direct Monocular SLAM for Deformable and Intracorporeal Scenes”.
    • The incremental and dynamic extension of the Gaussian map as new portions of the scene come into view is definitely interesting, but I feel this aspect hasn’t been appropriately validated. In the current experimental setting, only landmarks visible in the first frame are tracked, so it would seem that newly added parts of the scene do not interfere with the measured accuracies/tracking results.
    • From the videos shared in the supplementary material, I can see that surgical instruments are removed, which seems similar to EndoNeRF and related works. However, this is never mentioned in the paper. Do you require instrument segmentation masks to run your method?
    • The supplementary material does not seem to strictly follow the MICCAI guidelines: “Authors will be able to submit supplementary materials in the form of supporting images, tables, and proof of equations that do NOT represent additional results…” The experiments vs. EndoSURF/EndoNeRF should have been in the main paper. For review fairness, I am discarding the supplementary material’s experiments and methodology details from my assessment.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method contribution is definitely interesting and relevant, and the results are interesting. However, I have concerns about some of the decisions made, namely about utilising random control points, and the selection of baselines for experiments. I could change my assessment if response in rebuttal is convincing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ response is satisfactory with respect to my concerns.

    For the final version of the manuscript I’d highly suggest:

    • Please release reproducible information about the manual selection of StereoMIS (e.g., the indices of the images used). Such information can easily be provided in the documentation of the released code. This will facilitate future comparisons against your algorithm.
    • Please put the EndoSURF/EndoNeRF quantitative results in the main paper, not in the supplementary material (you’ll have extra space in the final version)



Review #2

  • Please describe the contribution of the paper

    The paper presents an online framework for 3D scene reconstruction and tracking using stereo endoscopic video data, aiming to improve the understanding and assistance during surgical interventions. The authors introduce a method that extends the traditional scene representation using Gaussian splatting while integrating tissue deformation through a sparse set of control points. The online fitting algorithm adjusts scene parameters, optimizing the tracking and reconstruction accuracy. They highlight their approach’s performance over traditional offline reconstruction methods, with faster fitting times and more reliable tracking, validated using the StereoMIS dataset. Though somewhat overclaimed and straightforward, the idea and implementation are pretty good.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed anchor Gaussians are interesting, since most previous dynamic endoscopic reconstruction methods do not consider physical constraints.
    2. Dynamic tissue reconstruction is rushing into the 3D-GS era. While many methods simply adapt 4DGS to endoscopic videos, this paper introduces a promising way of handling rigid transformations.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Non-compliant supplementary material. As described in https://conferences.miccai.org/2024/en/PAPER-SUBMISSION-AND-REBUTTAL-GUIDELINES.html, the supplementary material may not satisfy the guidelines.
    2. Missing important citations: many closely related previous methods are not cited in this paper, e.g.:
       • Neural LerPlane representations for fast 4D reconstruction of deformable tissues
       • LightNeuS: Neural surface reconstruction in endoscopy using illumination decline
       • Efficient deformable tissue reconstruction via orthogonal neural plane
       • Semantic-SuPer: a semantic-aware surgical perception framework for endoscopic tissue identification, reconstruction, and tracking
    3. Ambiguous title. “Online 3D reconstruction and dense tracking in endoscopic videos” reads more like a SLAM method, but the task is actually deformable tissue tracking rather than camera tracking.
    4. The authors claim “dense tracking – the latter being essential for most downstream applications.” in the introduction. But I wonder: since tissue deformations are non-rigid transformations, does dense tracking of soft tissue have any specific benefits? In fact, I think this method is more suitable for instrument tracking than for tissue reconstruction.
    5. In the introduction, the statement “Unlike traditional methods that assume a fixed topology at initialization [9,14]” is questionable: I don’t think [14] assumes a fixed topology at initialization.
    6. In Section 3, “In each frame, we manually annotated 3 to 4 distinct landmarks to evaluate the tracking.” As this paper claims dense tracking, an evaluation protocol based on “3 to 4 distinct landmarks” seems unreasonable.
    7. Over-claim. The reported speed, “resulting in an average processing time of 2 seconds per frame”, does not support the claim of online, real-time reconstruction.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    1. The SH (spherical harmonics) setting.
    2. The choice of anchor Gaussians. I wonder how the anchor Gaussians are selected. Randomly chosen points may not meet the anchor requirements.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the “weaknesses” part.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I maintain that this method may work better for rigid transformations than for tissue reconstruction, and the wording of this paper should be more rigorous. Many related papers are not cited, the evaluation is not convincing, and the performance is not superior to 4DGS-based methods. But considering that this area is flourishing and requires more effort to advance, I lean toward accepting this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After considering the rebuttal, I have decided to maintain my original opinion. This area still needs more effort to advance.



Review #3

  • Please describe the contribution of the paper

    The paper describes a method that leverages Gaussian splatting for online 3D reconstruction and tracking in endoscopic videos. The authors extend existing work by accounting for tissue deformations, which is very challenging. Experiments on a public dataset show very impressive quantitative and qualitative results, clearly outperforming state-of-the-art approaches.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Strong fundamental backing
    • Clear focus and description of the work
    • Impressive results for a challenging problem
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Contributions w.r.t. existing (cited) computer vision papers could be better emphasized
    • Quantitative evaluation is hard to follow if you are not familiar with relevant work
    • Limitations are not explicitly described
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    First of all, I’d like to congratulate the authors with an excellent piece of work. Reading the paper was a delight and the results are very impressive. I do have some minor comments that could potentially further improve the paper.

    1 EXPERIMENTAL SETUP First, for me, it is not entirely clear how the data set was used in the development of the proposed approach. As I read in [11], the data set consists of 16 recorded sequences of human and porcine subjects. But it is unclear whether all of these recordings were used in the development process of the proposed method. If so, there’s a risk that the results are overly optimistic, as the authors had access to the performance while developing the method. This could be more clearly described, or, at least be mentioned as a potential limitation. Preferably, evaluation on an external set is performed and added.

    2 DESCRIPTION OF THE METHOD AND CONTRIBUTIONS The authors build upon several papers in the field of computer vision, especially [6], but it is not entirely clear how the proposed method improves on those papers. While for some methods the authors clearly mention that they have specific drawbacks for this application (e.g. requiring a static scene), this is not clear for all of the papers. I would recommend the authors to (more) explicitly state their contributions and differences w.r.t. existing works. Additionally, for readers not so familiar with these related works, it might be hard to follow what exactly is the input and output of the proposed approach. Figure 1 may give the impression that the task is more or less trivial, as an image and a depth map go in and are predicted for the same time stamp. However, the authors enable 3D tracking and scene reconstruction, which is not clearly shown or described as a potential output of the algorithm. One of the applications is only briefly mentioned at the end of Section 2.

    3 QUANTITATIVE EVALUATION For me, it is unclear how exactly the metrics are computed, as I am not completely familiar with [5]. One of the first questions I have is how the ground truth reference is acquired here. The authors already indicate that no 3D tracking could be performed, but it is unclear what then is performed, based on the information in the paper. To me, it is a mystery how a metric in millimeters is obtained in Table 1, which I also couldn’t find in [5]. Quantitative evaluation of 3D reconstruction / tracking methods in endoscopy is extremely challenging; hence, it is important to clearly communicate the boundaries of the presented evaluation.

    4 CLINICAL APPLICABILITY While, as the authors rightfully point out, 3D scene reconstruction would have a wide variety of applications within endoscopy / laparoscopy, it is unclear how close the proposed method is to actually enabling those. The authors briefly mention that their method requires the actual camera pose, but do not elaborate on how this would be obtained with current endoscopy systems. To me, this seems still quite a challenge and I would be very curious on a brief discussion on how to overcome the existing challenges that would hamper the road to clinical practice. I understand space is limited here, but I would recommend the authors to include such a discussion in a follow-up journal paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a mostly novel and fundamentally rigorous method for a challenging and relevant clinical problem. Although the experimental setup and quantitative evaluation could have been more clearly described in this paper (without referring too much to other work), the results are very convincing and promising.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    I will stand by my original decision.




Author Feedback

We would like to thank the reviewers for their constructive feedback and provide answers to the following major concerns.

Clinical Application [R1/R5]:

The reviewers have concerns regarding our method’s clinical value. Dense tissue tracking is necessary for numerous downstream tasks, such as AR for surgical guidance. For instance, in scenarios where information from pre-operative images must be overlaid intra-operatively, dense tissue tracking becomes essential to dynamically update the displayed location and shape throughout the surgery.

Our method, as demonstrated with 3D semantic segmentation, holds promise for such applications. However, we acknowledge the need to address certain limitations and advance its practical utility. Future efforts will focus on enhancing speed, testing on extended sequences, and validating robustness in real-world scenarios encompassing challenges like smoke and bleeding.

While it is noted that utilizing the endoscope pose as input may limit applicability to robotic surgery, we believe it serves as a valuable foundation for further exploration and refinement in clinical settings.

Sampling Anchor Gaussians [R3/R5]:

R3 and R5 express concerns about the random sampling of anchor Gaussians potentially leading to suboptimal modeling of deformations due to underrepresentation of certain regions. While we acknowledge this possibility, our experiments have not shown significant underrepresentation issues.

In our experiments, we subsampled by a factor of 64, which leads to fairly dense coverage, considering that tissue deformations are typically smooth except at the borders of anatomical structures. While not mentioned in our manuscript, running our method on each sequence 100 times with different random seeds leads to slightly better average performance than reported in Table 1, with standard deviations of 3.8 px for MTE, 5.1% for \delta_AVG, and 5.8% for survival, which we consider fairly robust. We will clarify this in Table 1. We believe our approach strikes a balance between computational efficiency and effective deformation modeling.
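
For illustration only, a minimal sketch of how such seed-controlled random subsampling of control points could be made repeatable is shown below; the function name, the use of NumPy, and the seed handling are assumptions for this sketch and do not reflect the released code.

```python
# Illustrative sketch (assumed, not the released code): selecting control points as a
# random 1/64 subset of the Gaussian centres with a fixed seed, so that the
# seed-robustness experiment described above can be repeated.
import numpy as np

def sample_control_points(gaussian_xyz: np.ndarray, factor: int = 64, seed: int = 0) -> np.ndarray:
    """Return indices of a uniformly random subset of size ~len(gaussian_xyz) / factor."""
    rng = np.random.default_rng(seed)
    n_ctrl = max(1, gaussian_xyz.shape[0] // factor)
    return rng.choice(gaussian_xyz.shape[0], size=n_ctrl, replace=False)
```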

Choice of Baselines [R3/R5]:

We acknowledge the suggestion to include citations for relevant methods such as Neural LerPlane, Orthogonal Neural Plane, and Semantic-SuPer. However, we maintain that our selected baselines are well-aligned with the focus of our evaluation.

Our method is specifically tailored for robotic surgeries utilizing forward kinematics and stereo endoscopes, distinguishing it from techniques designed for non-robotic surgeries. We emphasize that our intention is not to claim superiority over offline methods, including 4D-GS. Rather, we highlight the value of our method’s online (incremental) paradigm in facilitating the clinical applications outlined in our above response.

Regarding comparisons with RAFT and PIPs++, we acknowledge potential biases due to missing camera poses or depth information, and we will state this more clearly in the revised manuscript. However, we point to our comparisons against methods like [15], which demonstrate the competitiveness of our approach with SOTA methods utilizing the same inputs.

Evaluation and Reproducibility [R5]:

The reviewer expresses doubt about the reproducibility of our work, but code and data will be released upon acceptance (footnote, p. 2).

In our evaluation, we manually selected sequences of 200 frames to capture challenging key moments with various factors such as camera movement, tissue deformations, and camera loops. This length was chosen based on the datasets used in previous 3D reconstruction methods (i.e., [14]). While our method is not inherently limited to this length, longer sequences may pose challenges for long-term tracking.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers recommend acceptance, and I recommend acceptance based on the universal agreement.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers recommend acceptance, and I recommend acceptance based on the universal agreement.



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers agreed for acceptance. Reviewers recommend to add reproducible information about the manual selection of StereoMIS in the documentation and placing the EndoSURF/EndoNerf quantitative results in the main paper instead of the supplementary materials.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    All reviewers agreed for acceptance. Reviewers recommend to add reproducible information about the manual selection of StereoMIS in the documentation and placing the EndoSURF/EndoNerf quantitative results in the main paper instead of the supplementary materials.


