Abstract

In contemporary surgical research and practice, accurately comprehending 3D surgical scenes with text-promptable capabilities is particularly crucial for surgical planning and real-time intra-operative guidance, where precisely identifying and interacting with surgical tools and anatomical structures is paramount. However, existing works address surgical vision-language models (VLMs), 3D reconstruction, and segmentation separately, lacking support for real-time text-promptable 3D queries. In this paper, we present SurgTPGS, a novel text-promptable Gaussian Splatting method to fill this gap. We introduce a 3D semantic feature learning strategy incorporating the Segment Anything Model and state-of-the-art vision-language models. We extract segmented language features for 3D surgical scene reconstruction, enabling a more in-depth understanding of the complex surgical environment. We also propose semantic-aware deformation tracking to capture the seamless deformation of semantic features, providing a more precise reconstruction of both texture and semantic features. Furthermore, we present semantic region-aware optimization, which utilizes region-based semantic information to supervise training, particularly improving reconstruction quality and semantic smoothness. We conduct comprehensive experiments on two real-world surgical datasets to demonstrate the superiority of SurgTPGS over state-of-the-art methods, highlighting its potential to revolutionize surgical practices. Our code is available at https://github.com/lastbasket/SurgTPGS.
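
To make "real-time text-promptable 3D queries" concrete, below is a minimal, hypothetical sketch (not the paper's implementation): each Gaussian is assumed to carry a distilled language feature, and a text-prompt embedding (e.g., from CLIP) selects matching Gaussians by cosine similarity. The function name, feature dimension, and threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def query_gaussians(gaussian_feats: torch.Tensor,
                    text_embed: torch.Tensor,
                    threshold: float = 0.5) -> torch.Tensor:
    """Select Gaussians whose language feature matches a text prompt.

    gaussian_feats: (N, D) per-Gaussian language features (assumed pre-distilled).
    text_embed:     (D,)  embedding of the text prompt (e.g., from CLIP).
    Returns a boolean mask of shape (N,) marking matching Gaussians.
    """
    sims = F.cosine_similarity(gaussian_feats, text_embed.unsqueeze(0), dim=-1)
    return sims > threshold

# Toy usage: 1,000 Gaussians with 16-D language features and a random prompt embedding.
feats = F.normalize(torch.randn(1000, 16), dim=-1)
prompt_embed = F.normalize(torch.randn(16), dim=0)
mask = query_gaussians(feats, prompt_embed, threshold=0.2)
print(f"{int(mask.sum())} Gaussians match the prompt")
```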

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1324_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1324_supp.zip

Link to the Code Repository

https://github.com/lastbasket/SurgTPGS

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuaYim_SurgTPGS_MICCAI2025,
        author = { Huang, Yiming and Bai, Long and Cui, Beilei and Yuan, Kun and Wang, Guankun and Hoque, Mobarak I. and Padoy, Nicolas and Navab, Nassir and Ren, Hongliang},
        title = { { SurgTPGS: Semantic 3D Surgical Scene Understanding with Text Promptable Gaussian Splatting } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {587--597}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper establishes a relatively complete framework for 3D understanding in surgical scenes. The authors introduce a 3D semantics feature learning strategy and utilize the latest SAM and VLM methods. Additionally, the paper proposes semantic-aware deformation tracking to capture the seamless deformation of semantic features. Experimental results show that this method not only enables the reconstruction of 3D scenes but also possesses certain scene understanding capabilities in an open-vocabulary setting.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper establishes a well-rounded framework for 3D understanding in surgical scenes, which is crucial for real-world medical applications.
    2. The proposed semantic-aware deformation tracking technique effectively captures the seamless deformation of semantic features, making it highly relevant for dynamic surgical environments where deformation is common.
    3. The paper provides experimental results that demonstrate the method’s ability to reconstruct 3D scenes and its strong performance in understanding scenes within an open-vocabulary context.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In Figure 4, the visualization results do not effectively demonstrate the advantages of the method. In the second row of Figure 4, the strength of “ours full” lies in the black edge regions, but there is no significant advantage in the actual tissue areas. I believe this does not sufficiently illustrate the superiority of the method.
    2. Open-Vocabulary is not reflected in the current version. Open-Vocabulary should refer to using text types during testing that did not appear in the training set, but the current experiments do not demonstrate this. Since the segmentation categories in CholecSeg8K and EndoVis18 likely all appear in the training set, the claim of Open-Vocabulary is not quite appropriate.
    3. Regarding the data split, you should provide a more detailed explanation. In EndoSurf, the training and test sets are divided in a 7:1 ratio, but this paper does not use the same dataset as EndoSurf. You need to clarify how the training and test sets are partitioned within the same sequence.
    4. In the last-row case of Fig. 3, there are two surgical instruments, but the method appears to segment only one of them.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper establishes a relatively complete framework for 3D understanding in surgical scenes. The combination of Gaussian splatting and semantic features is interesting, but the "Open-Vocabulary" claim is not suitable. Additionally, the data split is not discussed comprehensively in this manuscript.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors' rebuttal addressed my questions.



Review #2

  • Please describe the contribution of the paper

    The proposed method integrates a 3D semantic feature learning strategy, combining the Segment Anything Model with a vision-language model to extract semantic features. It employs semantic-aware deformation tracking using convolutional networks to enhance spatial consistency of semantic features and capture deformations. Additionally, semantic region-aware optimization is introduced by incorporating a region loss for supervised training, improving reconstruction quality and semantic smoothness. Experiments on the CholecSeg8K and EndoVis18 datasets demonstrate that SurgOVGS outperforms similar methods in open-vocabulary segmentation accuracy, training time, and query speed.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work applies open-vocabulary Gaussian splatting to 3D surgical scene understanding and proposes a 3D semantic feature learning strategy that integrates the Segment Anything Model with a vision-language model. This enables the extraction of richer and more accurate semantic features from surgical scenes, laying a solid foundation for subsequent processing.
    2. A semantic-aware deformation tracking mechanism is introduced, utilizing convolutional networks to enhance spatial consistency. This addresses the challenge of misaligned semantic features caused by tissue deformation in surgical scenes, allowing precise capture of semantic-level deformations and enabling more accurate reconstruction of textures and semantic features.
    3. Experiments conducted on two representative real surgical datasets—CholecSeg8K and EndoVis18—demonstrate that SurgOVGS achieves outstanding performance in open-vocabulary segmentation accuracy.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The contribution statement in the first section needs to be reorganized. Among the five listed points, the 2nd, 3rd, and 4th are all specific technical improvements under the first contribution, while the 5th is merely a statement of the experimental results. Apart from validating the proposed method, the experiments do not involve a comprehensive or extensive comparative analysis, and therefore are not sufficient to be considered a standalone major contribution.
    2. The description of key modules, such as the structure of the semantic tracking network, the coding details of the autoencoder, and how the semantic points are mapped to the Gaussian points, is not sufficient, which may affect reproducibility. It is suggested that the authors further supplement the implementation details, for example with parameter-setting tables and simplified network-structure diagrams.
    3. The ablation experiment design is not comprehensive enough: it is only performed on a single sequence of CholecSeg8K, which fails to verify the universality of the proposed modules in different surgical scenarios. It is suggested that more evaluation metrics be introduced.
    4. The graphical layout of the paper is also slightly confusing, with excessive information density and an inconsistent colour scheme.
    5. Although the experimental part of the paper covers two datasets, it lacks an evaluation of the ability to handle instrument aliases, near-synonyms, etc. It is suggested to demonstrate the open-vocabulary generalisation ability by extending the linguistic query test set.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strengths and weaknesses are as mentioned in the comments above. This paper does not provide or promise to release the code and related data, but the key techniques described are relatively clear. Although some wording should be revised to avoid overclaiming, the overall structure is clear, and the proposed method demonstrates a certain level of novelty. It is recommended that the authors consider releasing the code in the future to ensure reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of the paper is the proposal of a novel method that combines feature extraction using SAM and VLM with Gaussian Splatting, leading to improved 3D reconstruction quality and enhanced semantic smoothness.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    One major strength of the paper is the proposal of SurgOVGS, which enables real-time 3D visual-language modeling and segmentation—something that was previously not possible. This advancement is particularly impressive, and the results significantly outperform existing methods by a wide margin.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    One potential weakness of the paper is that the current model demonstrates its effectiveness on relatively easy-to-recognize structures with sufficient size, such as fat, instruments, and kidneys. However, it may struggle to achieve sufficient accuracy on more challenging and smaller anatomical structures, such as the ureter or nerves.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I gave an Accept recommendation because the paper implements a novel approach that is not found in other existing works and achieves high performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for the valuable feedback. We appreciate that the reviewers find our work to be “clear structure & novel” (R1), “well-rounded & crucial for real-world application” (R2), and “impressive & novel” (R3). We address the major concerns as follows:

Reorganize the contribution statement. (R1) We will reorganize the contribution statements, in particular the one concerning the experiments.

Descriptions of key modules are not sufficient. (R1) Our semantic tracking network is formed by integrating a 1D convolutional layer with MLPs, as described in the preliminaries and Eq. 3. We will emphasize these details in the paper. Learning-rate and loss-weight settings are given in Section 3.2. Due to the page limit, we will release all setting details in the code implementation.
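
For illustration only, a minimal PyTorch sketch of how a 1D convolutional layer and MLPs could be combined into such a semantic tracking head; the layer sizes, input layout, and time conditioning here are assumptions, not the settings of the paper or Eq. 3.

```python
import torch
import torch.nn as nn

class SemanticTrackingNet(nn.Module):
    """Hypothetical tracking head: a 1D convolution over each Gaussian's
    semantic feature vector followed by an MLP that predicts a time-dependent
    feature offset. Dimensions are placeholders, not the paper's values."""

    def __init__(self, feat_dim: int = 16, hidden_dim: int = 64):
        super().__init__()
        # 1D conv over the feature axis (each feature vector treated as a 1-channel sequence).
        self.conv = nn.Conv1d(1, hidden_dim, kernel_size=3, padding=1)
        # MLP maps (pooled conv features, timestamp) to a semantic-feature offset.
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + 1, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, feat_dim),
        )

    def forward(self, feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # feats: (N, feat_dim) per-Gaussian semantic features; t: (N, 1) timestamps.
        h = self.conv(feats.unsqueeze(1)).mean(dim=-1)   # (N, hidden_dim)
        offset = self.mlp(torch.cat([h, t], dim=-1))     # (N, feat_dim)
        return feats + offset                            # deformed semantic features

# Toy usage.
net = SemanticTrackingNet()
out = net(torch.randn(8, 16), torch.rand(8, 1))
print(out.shape)  # torch.Size([8, 16])
```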

The ablation experiment design is not comprehensive enough. (R1) The ablation study was conducted on sequence 01_00240 because it is the best representative of surgical challenges, including deformable tissues and instruments. This sequence encompasses more diverse semantic regions and dynamic deformations than the others, making it a robust testbed for evaluation. While the ablation study focuses on one sequence, the main experiments validate our method across multiple sequences, where it outperforms all other baselines. These results demonstrate the generalizability of our modules across diverse surgical scenarios. We understand the reviewer's concern and will include more evaluation metrics in the table, within the available space, for discussion.

Inappropriate claim of Open-vocabulary and generalisation ability. (R1, R2) We acknowledge that the "Open-vocabulary" claim is inappropriate, and we will adjust the title and the claim in the paper to "Text-Promptable Gaussian Splatting". Due to the page limit, we chose to evaluate all baselines with the same linguistic queries (using the common prompt templates from CLIP [18]), and we will extend the experiments to various aliases in a future journal extension.
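
As an illustration of building such a linguistic query with the common CLIP prompt templates, a small sketch using the OpenAI CLIP API (https://github.com/openai/CLIP); the template list and class name are assumptions, not the exact prompts used in the paper.

```python
import torch
import clip  # OpenAI CLIP package

# Illustrative prompt templates (the exact list used in the paper may differ).
TEMPLATES = [
    "a photo of a {}.",
    "an image of a {} in a surgical scene.",
    "a close-up photo of a {}.",
]

def build_query_embedding(class_name: str, model, device: str = "cpu") -> torch.Tensor:
    """Average the CLIP text embeddings of the class name over the templates."""
    prompts = [t.format(class_name) for t in TEMPLATES]
    tokens = clip.tokenize(prompts).to(device)
    with torch.no_grad():
        embeds = model.encode_text(tokens)                   # (T, D)
        embeds = embeds / embeds.norm(dim=-1, keepdim=True)  # unit-normalize per template
    query = embeds.mean(dim=0)
    return query / query.norm()

model, _ = clip.load("ViT-B/32", device="cpu")
query = build_query_embedding("grasper", model)
print(query.shape)  # torch.Size([512])
```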

Ineffective visualization results. (R2) The mIoU results are evaluated inside the black border (i.e., the black border itself is excluded). The segmentation result of our full model shows clearer boundaries, e.g., the grasper edge. Since Gaussians are distributed in space, the cleaner black border suggests that our full model constrains the Gaussians to the correct region, indicating that our full method performs better.
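
A minimal sketch of how mIoU can be computed only on pixels inside a validity mask (i.e., excluding the black border); the mask construction, image size, and class count are illustrative assumptions, not the paper's evaluation code.

```python
import numpy as np

def masked_miou(pred: np.ndarray, gt: np.ndarray,
                valid: np.ndarray, num_classes: int) -> float:
    """Mean IoU over classes, counting only pixels where valid is True."""
    p, g = pred[valid], gt[valid]
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(p == c, g == c).sum()
        union = np.logical_or(p == c, g == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy usage: the validity mask excludes a 10-pixel black border.
H, W = 64, 64
pred = np.random.randint(0, 3, (H, W))
gt = np.random.randint(0, 3, (H, W))
valid = np.zeros((H, W), dtype=bool)
valid[10:-10, 10:-10] = True
print(masked_miou(pred, gt, valid, num_classes=3))
```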

Details for the data split. (R2) Since we aim to achieve real-time 3D segmentation queries on novel viewpoints (not included in the training of the Gaussians), we split the data as in EndoSurf [27] (likewise Deform3DGS [29], Endo-4DGS [6], EndoGaussian [14]): for every 8 images, the first 7 are selected for training and the last one for testing (typically, Gaussians are trained and tested on the same sequence, as in 3DGS [8]).
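
A minimal sketch of this 7:1 split, assuming a hypothetical list of frame paths; it mirrors the rule stated above (the first 7 of every 8 frames for training, the last for testing) but is not the authors' actual data-loading code.

```python
def split_sequence(frame_paths):
    """7:1 split following EndoSurf: within every consecutive block of 8
    frames, the first 7 go to training and the last one to testing."""
    train, test = [], []
    for i, path in enumerate(frame_paths):
        (test if i % 8 == 7 else train).append(path)
    return train, test

# Toy usage with 24 hypothetical frames.
frames = [f"frame_{i:04d}.png" for i in range(24)]
train_frames, test_frames = split_sequence(frames)
print(len(train_frames), len(test_frames))  # 21 3
```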

Segment two surgical tools in Fig. 3. (R2) In the last row of Fig. 3, there are two types of instruments, but we prompt only one, so the segmentation result contains only one instrument. We prompt the segmentation with the text "instrument-wrist", which corresponds to the instrument (clasper) on the left side. To segment the body of the instrument (scissors) on the right side, the corresponding text prompt would be "instrument-shaft", according to the definition in EndoVis [1].

Limitation on challenging structures. (R3) We examined the challenging cases (veins, small connective tissue) and find that, although the accuracy is not as satisfactory as for large structures, our method still outperforms the other baselines. We will provide a related discussion of these limitations.

We will also fix the minor problems (graphical layout, information density, color scheme) for better clarity. All suggested changes will be added to the paper (no additional experiments). The code, dataset, and checkpoints will be made public upon acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


