Abstract

Real2Sim is becoming increasingly important with the rapid development of surgical artificial intelligence (AI) and autonomy. In this work, we propose a novel Real2Sim methodology that leverages 3D Gaussian Splatting to provide fully controllable 3D reconstruction of surgical instruments from monocular surgical videos. To maintain both high visual fidelity and manipulability, we introduce a geometry pre-training to bind Gaussian point clouds on part mesh with accurate geometric priors and define a forward kinematics to control the Gaussians as real instruments. Afterward, to handle unposed videos, we design a novel instrument pose tracking method leveraging semantics-embedded Gaussians to robustly refine per-frame instrument poses and joint states in a render-and-compare manner, which allows our instrument Gaussian to accurately learn textures and reach photorealistic rendering. We validated our method on 2 surgical videos and 4 videos collected on ex vivo tissues and green screens. Quantitative and qualitative evaluations demonstrate the effectiveness and superiority of the proposed method.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3069_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/jinlab-imvr/Instrument-Splatting

Link to the Dataset(s)

N/A

BibTex

@InProceedings{YanShu_InstrumentSplatting_MICCAI2025,
        author = { Yang, Shuojue and Wu, Zijian and Hong, Mingxuan and Li, Qian and Shen, Daiyun and Salcudean, Septimiu E. and Jin, Yueming},
        title = { { Instrument-Splatting: Controllable Photorealistic Reconstruction of Surgical Instruments Using Gaussian Splatting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {305--315}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a novel application of Gaussian splatting for reconstructing surgical instruments in endoscopic (robotic) video sequences, requiring only an RGB sequence and the instrument’s CAD model as input. Their key contributions include: a geometric pretraining module that leverages a procedural algorithm to identify instrument pose and joint angles in the camera reference frame, training a 3DGS representation of the surgical instrument; and a tracking module initialized by solving a PnP problem with manually selected 3D-to-2D correspondences, which subsequently refines and tracks the instrument’s Gaussian point cloud throughout the video frames. The authors evaluate their approach in two scenarios: real surgical videos and controlled environments with chroma-keyed instruments against a green screen background.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a novel approach for representing surgical instruments with 3D Gaussians. Compared to the related work, the method presented is very different and novel.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper presents a multi-component methodology encompassing kinematics, Gaussian representations, initial pose estimation, tracking, and feature alignment. However, these critical elements are described with excessive brevity and insufficient detail, obscuring the significance of the contributions and impeding clear comprehension. I recommend that the authors leverage their figures more effectively to convey concepts, as the current visualizations are underdeveloped and fail to communicate the richness of the approach.

    Some specific concerns regarding the methodology:

    • The approach appears constrained by the requirement for manual selection of 2D-3D correspondences during pose estimation initialization. The paper fails to clearly specify whether this manual process occurs exclusively during training or if it’s also necessary (and how it’s implemented) during inference.

    • The work lacks a quantitative ablation study, making it impossible to evaluate the effectiveness and relative importance of each proposed key contribution. Without this analysis, it remains unclear which components are truly essential to the method’s performance and to what degree they contribute to the overall results.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    An ablation study demonstrating the individual and combined contributions of each proposed optimization is missing from the paper. This analysis would be essential to understand the relative impact and necessity of each component within the overall framework.

    Additionally, the paper requires a thorough revision of both language and graphics to enhance clarity and comprehension. The current presentation makes it difficult for readers to fully grasp the technical contributions and their significance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The authors failed to adequately address the major concerns raised in the review, providing poor justifications such as “it is not needed” or “it is not meaningful,” without explaining why. For this reason, I maintain my recommendation for rejection.



Review #2

  • Please describe the contribution of the paper

    The authors demonstrate the effectiveness of their method not only through qualitative renderings but also via rigorous quantitative evaluation. They adopt a dual-loss strategy (silhouette and RGB) to supervise the texture learning of the Gaussian point cloud, achieving highly photorealistic reconstructions. Furthermore, they introduce a newly collected in-house dataset using the dVRK system under both ex vivo tissue and green-screen settings, complementing the public EndoVis2017/2018 data. Compared to strong baselines such as EndoGaussian and Deform3DGS, the proposed method shows superior reconstruction quality, particularly in complex articulated regions like the gripper.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors complement public datasets (EndoVis2017/2018) with a partially novel in-house dataset collected using the dVRK system.

    • The in-house dataset includes sequences under both ex vivo tissue and green-screen conditions, providing diversity in visual context.

    • Part-level semantic masks are generated using SAM and refined, enabling evaluation of articulated pose tracking and part-level reconstruction.

    • The dataset is planned for public release, which could benefit the surgical AI community and support reproducibility.

    • The proposed method addresses a clinically relevant and challenging problem: realistic and controllable reconstruction of surgical instruments from monocular videos.

    • The ability to reconstruct and animate articulated instruments in a photorealistic manner could be valuable for simulation-based training, data augmentation for learning, and autonomous robotic surgery research.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The pose estimation module relies on per-frame optimization (render-and-compare), making it computationally expensive and unsuitable for real-time or large-scale deployment.

    • The pipeline requires manual initialization (2D–3D correspondences for PnP) in the first frame, limiting full automation.

    • Tip detection is heuristic-based, which may lead to error accumulation across long sequences.

    • The method is validated on only a single instrument (LND), with no evidence of generalizability to other tools.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the technical novelty is moderate and some concerns remain regarding scalability and reproducibility, the paper addresses a challenging and underexplored problem—photorealistic and controllable reconstruction of articulated surgical instruments from monocular videos. This task is particularly difficult due to the lack of ground-truth pose data and the complexity of instrument kinematics. The authors make a notable contribution by designing a full pipeline capable of tackling this challenge, and their method shows solid performance both quantitatively and qualitatively. Additionally, the authors complement public datasets with a partially novel in-house dataset collected using the dVRK system, which includes sequences under diverse conditions (ex vivo tissue and green screen). This contributes to the community by providing new resources for evaluating Real2Sim methods in the surgical domain.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After carefully considering the authors’ rebuttal, I have decided to raise my score from Weak Accept to Accept. The rebuttal addresses key concerns raised in the initial review with clear and well-reasoned explanations. In particular: The authors clarify that the method is designed for simulation and data generation, where real-time inference is not essential. The reported processing time of 7–9 minutes per sequence is acceptable for this intended use. The manual initialization step is minimal (6 2D points in a single frame), and the authors propose a plausible path toward full automation using existing keypoint detection networks. The concern regarding error accumulation is adequately addressed through per-frame independent tip detection with loose regularization. Regarding the use of only one instrument, the authors justify this choice based on CAD availability and community practice, while also highlighting the generalizability of their method to other articulated tools. Moreover, I appreciate the authors’ commitment to releasing code and the partially novel dataset, which will benefit the community. Overall, the paper presents a well-motivated solution to a challenging problem with convincing results and clear writing. I believe the contributions are solid and relevant to the MICCAI audience.



Review #3

  • Please describe the contribution of the paper

    The paper proposes Instrument-Splatting, a novel framework for photorealistic and controllable 3D reconstruction of surgical instruments from monocular videos, which can serve as a Real2Sim technique. The key contributions include 1) controllable GS reconstruction with geometry pretraining; 2) robust pose tracking even under large motions; 3) photorealistic texture learning for high-fidelity appearance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Existing methods have limitations including unrealistic CAD models, inability to handle articulated motions, and limited manipulation flexibility. This paper proposes a novel pipeline: 1) geometry pretraining effectively binds the GS to the mesh models with accurate geometry; 2) pose tracking and joint state estimation are realized based on the pretrained GS; 3) correspondence matching is used to guide the GS toward the current frame pose, which is robust to large inter-frame instrument motions.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The validation is based on only one type of instrument. Performance on other instrument types is not reported. 2) The rendering speed is not reported. 3) A demo video is recommended to show the claimed manipulation flexibility of the articulated instrument. 4) The concept of Real2Sim is mentioned but not explained.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical innovation is impressive. The controllable Real2Sim is valuable for surgical AI. However, the experiments only involved one type of instrument and the demonstration video is not provided.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for the valuable comments, and we are encouraged by the positive comments on our organization and clarity (R1&R2), ‘novel method’ (R2&R3) to address this ‘challenging and underexplored’ task (R1), ‘new dataset’ (R1), and ‘solid performance’ (R1). Below, we address specific comments, which will be added in revision. Code will also be available for reproduction.

——R1:

  1. Time for Pose Estimation: As our reconstruction is intended for downstream simulation and data generation, real-time reconstruction is not essential. With fast GS rendering and early stopping, overall pose estimation takes 7 to 9 minutes per sequence, which is sufficient for our use case.
  2. Limited Autonomy: Manual correspondences are only used to initialize the wrist pose in the first frame; the subsequent process is automatic. For initialization, only 6 2D points on the wrist are annotated, adding minimal workload. A DNN (e.g., [7]) can replace this step to accurately detect keypoints given distinctive landmarks. This will be added to our future work.
  3. Error Accumulation: Tips are detected independently per frame via SVD, avoiding error accumulation. To reduce the impact of inaccuracies, loose regularization is used in the tip loss.
  4. Only one type: We acknowledge the limitation and clarify that the LND instrument is used because:
     – Only this instrument has a publicly available CAD model.
     – For mesh-based instrument tasks, evaluating only on LND is standard, such as in [2].
     – LND features a representative two-joint structure (shared by many other articulated instruments) with a small domain gap from others.
     As a template (CAD)-based method, our method is type-independent. Thus, it can be easily extended to other articulated EndoWrist tools once their CAD models are known.
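Editorial illustration (not the authors' code): the rebuttal states only that tips are detected per frame via SVD and does not describe the procedure. One plausible reading is sketched below, assuming a binary mask of a single gripper part and a known reference pixel near the joint. The function name `detect_tip`, the `anchor_xy` argument, and the choice of "farthest pixel along the principal axis" are all assumptions made for illustration.

```python
import numpy as np

def detect_tip(mask, anchor_xy):
    """Hypothetical tip detector: farthest mask pixel along the principal axis.

    mask      : 2D boolean array, True on the gripper part (assumed input).
    anchor_xy : (x, y) pixel near the joint, used only to orient the axis.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)   # (N, 2) pixel coordinates
    center = pts.mean(axis=0)
    # The first right-singular vector of the centered points is the
    # direction of largest spatial extent of the part.
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    axis = vt[0]
    # Orient the axis so it points away from the joint-side anchor.
    if np.dot(center - np.asarray(anchor_xy, dtype=float), axis) < 0:
        axis = -axis
    # The tip is taken as the pixel with the largest projection on the axis.
    proj = (pts - center) @ axis
    return pts[np.argmax(proj)]
```

Because this estimate uses only the current frame's mask, an error in one frame does not propagate to the next, which is consistent with the rebuttal's argument about error accumulation.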

——R2:

  1. Only one type of instrument: Kindly refer to Q4 in R1
  2. Rendering Speed: 172 FPS on an A5000 GPU
  3. Demo: As shown in Fig. 5, our reconstruction handles complex articulations, showing high flexibility. A demo video of final instrument GS with varying joint angles and views will be released with code.
  4. Real2Sim: Refers to our process using real surgical videos to create a controllable instrument digital twin for photorealistic simulation.

——R3:

  1. Clarification of manual selection:
     – Only training requires manual initialization in the first frame. At inference, we already have the instrument GS and take a user-defined pose as input for controllable data generation; thus there is no further need for manual selection or pose estimation.
     – Initialization: Manual initialization is only used in the first frame, and only the wrist pose is needed; shaft and gripper poses are fixed relative to it. Only 6 3D-2D pairs on the wrist are sufficient for PnP pose calculation.
     – Testing for Quantitative Evaluation: We simulate user-defined input poses in the testing phase, since pose-frame paired data are required for quantitative evaluation. Specifically, we perform per-frame pose estimation; frames are then split into a training set for reconstruction and a testing set for evaluation, where the estimated poses simulate user-defined inputs and the frames serve as ground truth.
  2. Figure: In Fig. 3, we will add arrows among modules, highlight the input/output of each module, and add symbols to assist clarity.
  3. Ablation Study: We respectfully clarify that a quantitative ablation study has been provided in Table I, validating our major components: the effect of geometry pretraining on reconstruction quality, and the proposed enhancements to render&compare (i.e., tip loss and loose regularization). Indeed, our framework consists of other elements as you mentioned, but they are coupled in a sequential pipeline where each element is a prerequisite for the next, e.g., pose estimation/control requires kinematics. Removing these elements would break the pipeline and make it non-functional, e.g., removing the pose tracking can lead to no overlap between the rendered mask and GT, disabling pose optimization. Thus, removal analysis of these elements is not that meaningful in this context.
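Editorial illustration (not the authors' code): the first-frame initialization described above solves a PnP problem from 6 manually selected 3D-2D pairs. The rebuttal does not say which PnP solver is used, so the sketch below swaps in a plain Direct Linear Transform (DLT), which happens to need at least 6 non-coplanar correspondences. The function name `pnp_dlt` and the intrinsics matrix `K` are assumptions for illustration.

```python
import numpy as np

def pnp_dlt(pts3d, pts2d, K):
    """Estimate camera-frame rotation R and translation t from >= 6
    non-coplanar 3D-2D correspondences using the DLT (assumed solver)."""
    n = len(pts3d)
    ones = np.ones((n, 1))
    # Remove the intrinsics so the unknown reduces to [R | t] up to scale.
    xn = (np.linalg.inv(K) @ np.hstack([pts2d, ones]).T).T   # (N, 3)
    Xh = np.hstack([pts3d, ones])                            # (N, 4)
    rows = []
    for (x, y, _), X in zip(xn, Xh):
        rows.append(np.concatenate([X, np.zeros(4), -x * X]))
        rows.append(np.concatenate([np.zeros(4), X, -y * X]))
    A = np.asarray(rows)                                     # (2N, 12)
    # The null vector of A holds the 12 entries of the scaled [R | t].
    _, _, vt = np.linalg.svd(A)
    P = vt[-1].reshape(3, 4)
    # Resolve the sign ambiguity: points must lie in front of the camera.
    if np.mean(Xh @ P[2]) < 0:
        P = -P
    # Project the left 3x3 block onto the nearest rotation and undo the scale.
    U, S, Vt = np.linalg.svd(P[:, :3])
    if np.linalg.det(U @ Vt) < 0:
        U[:, -1] *= -1
    R = U @ Vt
    t = P[:, 3] / S.mean()
    return R, t
```

In practice a pose obtained this way would only serve as a starting point; according to the abstract and rebuttal, the per-frame poses and joint states are subsequently refined in a render-and-compare manner.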




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The final version should thoroughly include reviewer comments and suggestions if the paper is accepted. In particular, concerns expressed by R3.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper introduces Instrument-Splatting, a Gaussian splatting-based framework for photorealistic 3D surgical instrument reconstruction from monocular videos, leveraging geometry pretraining with CAD models and robust pose tracking to handle complex motions. It achieves high-fidelity results through a dual-loss texture learning strategy and outperforms existing methods, validated on both real surgical sequences and controlled ex-vivo/green-screen datasets. However, the simple validation on only one instrument type without demonstrating generalizability to other surgical tools, and its reliance on manual 2D-3D correspondence selection without clarifying whether this requirement persists during inference, degrade the quality of the manuscript. Besides, the lack of rendering speed metrics, comprehensive ablation studies, and detailed explanation of the Real2Sim concept leaves key performance aspects and methodological contributions unclear.


