Abstract

Computer vision technologies markedly enhance the automation capabilities of robotic-assisted minimally invasive surgery (RAMIS) through advanced tool tracking, detection, and localization. However, the limited availability of comprehensive surgical datasets for training remains a significant challenge in this field. This research introduces a novel method that employs 3D Gaussian Splatting to generate synthetic surgical datasets. We propose a method for extracting 3D Gaussian representations of surgical instruments and background operating environments, then transforming and combining them to generate high-fidelity synthetic surgical scenarios. We developed a data recording system capable of acquiring images alongside tool and camera poses in a surgical scene. Using this pose data, we synthetically replicate the scene, enabling direct comparison of synthetic image quality (27.796±1.796 PSNR). As further validation, we trained two YOLOv5 models on the synthetic and real data, respectively, and assessed their performance on an unseen real-world test dataset. Comparing the two, we observe an improvement in neural network performance: tested on real-world data, the synthetic-trained model outperforms the real-world-trained model by 12%.
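The synthetic-image quality above is reported as PSNR against ground-truth images of the same pose. For readers unfamiliar with the metric, here is a minimal sketch of how such a score is computed (illustrative only, not the authors' evaluation code; the `psnr` helper name is ours):

```python
import numpy as np

def psnr(rendered: np.ndarray, ground_truth: np.ndarray, max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two same-shaped images (higher is better)."""
    mse = np.mean((rendered.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# A uniform error of 5 grey levels gives MSE = 25:
value = psnr(np.full((8, 8), 120.0), np.full((8, 8), 125.0))  # ≈ 34.15 dB
```

A PSNR near 28 dB, as reported in the abstract, corresponds to a per-pixel error of roughly 10 grey levels on an 8-bit scale.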

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1448_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1448_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zen_Realistic_MICCAI2024,
        author = { Zeng, Tianle and Loza Galindo, Gerardo and Hu, Junlei and Valdastri, Pietro and Jones, Dominic},
        title = { { Realistic Surgical Image Dataset Generation Based On 3D Gaussian Splatting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a surgical data generation pipeline based on 3D Gaussian splatting. The 3D model allows for pose editing, thus synthesizing photorealistic images with annotation information. The paper also demonstrates improvement in object detection network training with generated synthetic datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work proposes an application for synthesizing novel surgical scenes by leveraging the recently popular 3D Gaussian splatting technique. The paper proposes reconstructing the instrument and background separately, fusing the two 3D Gaussian models, and then rendering 2D novel scenes to generate photorealistic datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This pipeline assumes that the background scene is static, with no tool-to-tissue interaction. However, such a scenario is unrealistic in clinical cases. Though the original 3D Gaussian splatting can only reconstruct static scenes, some works consider deformations, such as EndoGaussians [2] and NeRF-based methods such as EndoNeRF [1], which make it possible to drop the static-scene assumption. The proposed pipeline also depends heavily on a complex physical setup. For instance, employing circular sampling for reconstruction is impractical in most surgical scenarios, especially minimally invasive surgery. Furthermore, following the preprocessing stages, patient anatomy and surgical instruments must remain rigid, which is also unattainable in clinical settings. Although each iteration of the reconstruction process produces different combinations of tool poses against a static background, it fails to capture the dynamic nature of minimally invasive surgeries, in which backgrounds typically consist of soft tissues that naturally deform. Consequently, the need for frequent background reconstructions persists, adding to the laborious task of generating more realistic datasets.

    This paper claims that “Our generated datasets have been proven effective for training neural networks,” but this point is not well supported by showcasing improvement on only one downstream object detection network. It is not explained why “models trained on generated images outperform those trained on true images by up to 12%.” Moreover, the training dataset, as described in the paper, features the same rigid background, potentially leading to overfitting issues. Additionally, the absence of cross-fold validation on different datasets raises concerns about the generalizability of the results.

    Reference:

    [1] Wang Y, Long Y, Fan S H, et al. Neural rendering for stereo 3D reconstruction of deformable tissues in robotic surgery[C]//International conference on medical image computing and computer-assisted intervention. Cham: Springer Nature Switzerland, 2022: 431-441.

    [2] Chen Y, Wang H. EndoGaussians: Single View Dynamic Gaussian Splatting for Deformable Endoscopic Tissues Reconstruction[J]. arXiv preprint arXiv:2401.13352, 2024.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors do not state that the code and dataset will be released.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the weakness part with corresponding comments.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is constrained by significant limitations in methodology, experimentation, and practical application, as elaborated in the weaknesses. It relies on unrealistic assumptions to generate surgical datasets, which undermines the credibility of its findings. The experimental setup is clinically infeasible. Furthermore, there is not sufficient evidence or explanation to support the claim that it improves network training.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    The authors’ response does not convince me of the necessity and uniqueness of the proposed method in clinical scenarios. Although the authors claim that the goal of the method is to pre-train neural networks, the evaluation experiments do not provide clear evidence that pre-training based on the proposed synthesis method would enhance clinical applications. Moreover, the authors did not explicitly address my concern about the heavy reliance on a complex physical setup to synthesize data for each new scene. Overall, the objective and feasibility of the proposed work remain unconvincing to me, so I will adhere to my initial decision.



Review #2

  • Please describe the contribution of the paper
    • The introduction of 3D Gaussian Splatting for generating synthetic surgical datasets is a significant contribution to the field, addressing the challenge of limited availability of comprehensive surgical datasets for training.

    • The research significantly enhances the automation capabilities of robotic-assisted minimally invasive surgery through advanced tool tracking, detection, and localization, which are crucial for improving surgical outcomes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper demonstrates a thorough experimental setup, including the training of YOLOv5 models on synthetic and real data, with clear performance metrics for comparison.

    • The evaluation of synthetic image quality using metrics like PSNR, SSIM, and LPIPS, along with comparative analysis with ground truth images, showcases a comprehensive assessment of the proposed methodology.

    • The practical application of the synthetic datasets in training neural networks for object detection, specifically in robotic surgery, highlights the real-world relevance and impact of the research.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper compares models trained on synthetic and real data, a more in-depth analysis of the advantages and disadvantages of using synthetic datasets would provide valuable insights. Additionally, further experiments are needed to comprehensively evaluate the proposed method.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors could explore practical implications of their methodology, including scalability to diverse surgical scenarios, robustness to environmental variations, and potential integration with other computer vision technologies for enhanced automation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    By providing constructive feedback and suggestions for further exploration, the authors can enhance the impact and relevance of their research.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a simple, yet effective synthetic data generation method for minimally invasive robotic surgery based on 3D Gaussian Splatting (GS). They evaluate the generated data, present one downstream application (object detection), and show that the model trained on synthetic data outperforms the baseline.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Gaussian splatting is a promising approach to generate synthetic surgery datasets, potentially overcoming the downsides of classical rendering, generative deep learning, and NeRFs for synthetic data generation.

    The idea of systematic editing in the GS representation is interesting and could be transferred to other applications.

    The automated label generation is interesting and definitely useful.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    As the method does not handle manipulation of the “background”, i.e. the surgical scene, synthetic data from tool-tissue interaction cannot be generated (which is actually the most important data for the training of useful models in computer aided surgery). The background remains static and tissue deformation cannot be modeled.

    While the authors state that the method outperforms classical rendering and generative deep learning-based approaches, they only compare against NeRF-based methods.

    The technical contributions of the work are rather weak as the authors mainly combine existing methods.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The keywords are missing.

    The authors compare the results with Instant-NGP and Nerfacto, both are novel view synthesis methods. It would be interesting to see the performance of the proposed method in comparison with the previous approaches.

    Tool-tissue interaction and tissue deformities are not modeled in the proposed framework. The authors should discuss if and how this can be done in the discussion section of the manuscript.

    The authors should explain in more detail how the Spherical Harmonics function is used to automatically generate labels. In general, the automated label generation process could be better illustrated in a figure.

    While the proposed method is able to synthesize new tool poses, it cannot directly be employed to alter specific properties of the scene, which can be done using generative approaches [1]. The authors should discuss if and how this can be done in the discussion section of the manuscript.

    Why did the authors use synthetic data only for comparison? A more common approach is to pre-train on synthetic and refine with real data.

    The paper is lacking a proper discussion section, which is required to put the results into context and to explain implications and potential applications.

    Why did the authors choose PSNR? There are also other, more modern metrics, such as the Inception Score, that can be used to assess the quality of synthetic images.

    [1] Shawn Mathew, Saad Nadeem & Arie Kaufman, CLTS-GAN: Color-Lighting-Texture-Specular Reflection Augmentation for Colonoscopy, MICCAI 2022

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the technical novelty is limited, the approach could be very useful for the generation of automatically labeled synthetic data for (pre-)training of machine learning models in computer aided minimally invasive surgery. However, certain aspects such as tissue deformation and a detailed description of the automated label generation are currently missing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors sufficiently explained why static backgrounds can also be valuable in the generation of synthetic training data. In case of acceptance, the camera-ready version should include more detailed information about the label generation. Therefore, I stay with my previous decision of “weak accept”.



Review #4

  • Please describe the contribution of the paper

    Introduces a 3D Gaussian splatting-based method to generate synthetic surgical datasets. This method includes extracting and combining 3D Gaussian representations of surgical instruments and the background operating environment. The object detection model trained on the generated synthetic data performed well on real-world test data compared to the model trained on real-world training data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Employing two Gaussian splatting models (one for the instrument and one for the background tissue), one of the main strengths of this work is the ability to generate synthetic datasets with instrument pose and segmentation mask annotations.
    2. The proposed method performs significantly better than two NeRF-based models.
    3. The object detection model trained on the synthetic dataset performed better on a real-world test set.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. A major weakness of this work is the writing. The methodology and experiment sections seem a little superficial and lack specific details. The experiment section also needs better organization (specifics are highlighted in the constructive feedback).
    2. The SOTA comparison (Table 1) seems limited. Only 2 NeRF-based models are compared.
    3. While the performance difference of the model trained on a synthetic dataset.
    4. Lacks multi-fold test/performance comparison using other models to remove any dataset/model bias.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Lacks specific details on methodology, experimental setup and dataset. Code is not available. The information in the manuscript is insufficient to effectively reproduce this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Include specific details in the manuscript:
       a. Section 2.2, para 1, line 1: How are the two Gaussian models trained? Techniques, parameters, platform?
       b. Provide more specific details on the datasets (the dataset used to train the Gaussian model and the generated dataset). How many frames were in each video? What is the resolution?
       c. Consider adding dataset and experiment subsections for better information organization.
       d. Improve the figure and table captions to be more informative.
       e. Add citations to the SOTA models in the tables.
    2. Consider doing multi-fold test and alternate model test for object detection task to remove any dataset/model bias.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, the idea and effort are commendable. The proposed method enables synthetic dataset generation for given surgical scenes and instruments. The experimental setup also allows the possibility of generating ground-truth images to effectively evaluate the generated synthetic dataset. While the idea has great value, this work is limited by its writing and limited experiments. The writing requires revision to add specific details on the methodologies, datasets, and experimental setup; as it stands, it lacks enough detail to effectively reproduce this work. Additionally, the SOTA comparison is limited and lacks a multi-fold test/alternate-model test for the object detection task. Taking these into account, I recommend weak acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Based on the rebuttal response, I stand by my previous recommendation.




Author Feedback

We thank the reviewers for their insightful comments. Below, we summarize their comments into three main points.

1. The method assumes a static background (no tool-to-tissue interaction or deformation).
Two key aspects of generating synthetic surgical images are the relative positioning of the tools and realistic background deformations. Our work focuses on the tool representation, generating synthetic images by adding, removing, and altering tools on realistic backgrounds without modifying the backgrounds themselves. The goal is to pre-train neural networks so they gain a strong realistic prior before real data is used.
While works like EndoNeRF and EndoGaussians (a non-peer-reviewed arXiv paper) consider deformations, they focus on replicating and reprojecting deformable scenes that have already occurred (while removing occluding objects) and are unable to synthesize new surgical states or unseen deformations. We agree with the reviewers that this is an interesting direction and wish to move toward it, but it is beyond the scope of the novel-pose synthesis method we present here. For this study, the goal is to generate synthetic images of surgical tools in novel poses to train neural networks, and static scenes are sufficient to achieve this objective.

  2. The comparison scope is limited to NeRF-based methods, and cross-fold validation should be performed to better evaluate the method.
  One major challenge in image generation is obtaining ground-truth (GT) images for direct comparison. We addressed this using Gaussian editing and tool pose tracking, enabling us to acquire precise GT images for comparison with the synthesized ones. This capability is uncommon in other generative methods, which typically lack precise GT data; comparing our results against GT images thus gives a more accurate evaluation of our method’s effectiveness. As 3D Gaussian Splatting (3DGS) is a novel view synthesis technique, we primarily compared it with NeRF, the current SOTA in this field, selecting Nerfacto and Instant-NGP for their comparable training times to ensure a rigorous experiment. For the selected downstream task, tool detection, we trained a YOLO model. Although we did not perform multi-fold validation, our test dataset, as mentioned in the manuscript, includes backgrounds and tool poses entirely independent of the training and validation sets. This hold-out set demonstrates the model’s generalization ability and robustness beyond the training data distribution. Thanks to the reviewers’ comments, we recognize the value of multi-fold validation and plan to incorporate it in future research. If allowed, we would include such analysis in a camera-ready version.

  3. Technical contributions are considered limited, as the method mainly combines existing techniques and depends on a complex setup; descriptions in the paper could be improved.
  Our method is based on 3DGS, but we have made significant advancements to make it suitable for laparoscopic scenarios. The original 3DGS method could not achieve scene fusion, editing, or automatic annotation generation. Our method incorporates these capabilities, all while reducing the need for recording and annotating surgical videos and addressing the paucity of available data. Our lighting-accurate tool scans may be inserted into any preexisting trained surgical scene. In MIS, lighting conditions and tool geometries remain consistent, but tool poses are constrained by trocar placement. Our method can synthetically project the tool in any pose at any entry position, increasing the amount of available data by utilizing existing scenes. This innovation is crucial for enhancing the flexibility and applicability of our approach. Regarding the reviewers’ comments on the lack of detailed descriptions, we will supplement and improve these in the paper. We thank all the reviewers once again for their comments.
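The scene fusion described above (inserting a lighting-accurate tool scan into a pre-trained background scene at an arbitrary pose) reduces, at its core, to rigidly transforming the tool's Gaussian parameters and concatenating the two point sets. Below is a minimal NumPy sketch of that step; the function name and data layout are illustrative, not the authors' code, and a real 3DGS implementation would also transform per-splat rotations/scales and spherical-harmonics coefficients:

```python
import numpy as np

def fuse_scenes(tool_means, tool_covs, bg_means, bg_covs, R, t):
    """Rigidly transform tool Gaussians and append them to the background scene.

    tool_means: (n, 3) splat centers; tool_covs: (n, 3, 3) covariances.
    Each mean maps as mu' = R @ mu + t, each covariance as Sigma' = R @ Sigma @ R.T.
    """
    moved_means = tool_means @ R.T + t
    # Batched R @ Sigma @ R.T over all n tool splats
    moved_covs = np.einsum("ij,njk,lk->nil", R, tool_covs, R)
    return np.vstack([bg_means, moved_means]), np.concatenate([bg_covs, moved_covs])
```

Rendering the fused splat set from the tracked camera pose then yields the synthetic image, and because the tool pose is known exactly, bounding-box and mask labels come for free.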




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Information about label generation and explanations of the experimental environment need to be clarified more, but it seems that the major issues have been addressed in the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


