Abstract
Surgical scene understanding is crucial for computer-assisted intervention systems, requiring visual comprehension of surgical scenes that involves diverse elements such as surgical tools, anatomical structures, and their interactions. To effectively represent the complex information in surgical scenes, graph-based approaches have been explored to structurally model surgical entities and their relationships. Previous surgical scene graph studies have demonstrated the feasibility of representing surgical scenes using graphs. However, certain aspects of surgical scenes—such as diverse combinations of tool-action-target and the identity of the hand operating the tool—remain underexplored in graph-based representations, despite their importance. To incorporate these aspects into graph representations, we propose the Endoscapes-SG201 dataset, which includes annotations for tool–action–target combinations and hand identity. We also introduce SSG-Com, a graph-based method designed to learn and represent these critical elements. Through experiments on downstream tasks such as critical view of safety assessment and action triplet recognition, we demonstrated the importance of integrating these essential scene graph components, highlighting their significant contribution to surgical scene understanding. The code and dataset are available at https://github.com/ailab-kyunghee/SSG-Com.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2642_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ailab-kyunghee/SSG-Com
Link to the Dataset(s)
Endoscapes-Bbox201: https://github.com/CAMMA-public/Endoscapes
Endoscapes-SG201: https://github.com/ailab-kyunghee/SSG-Com
BibTex
@InProceedings{ShiJon_Towards_MICCAI2025,
author = { Shin, Jongmin and Cho, Enki and Kim, Ka Young and Kim, Jung Yong and Kim, Seong Tae and Oh, Namkee},
title = { { Towards Holistic Surgical Scene Graph } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {620 -- 629}
}
Reviews
Review #1
- Please describe the contribution of the paper
- The manuscript presents an extension to a public dataset of endoscopic scene segmentation with new annotations, notably multiple instrument classes, operator identity (right, left, and assistant), and instrument–tissue interaction (triplet) labels. This provides a structured approach to modeling interactions with instance annotations.
- The manuscript reports model performance improvements on two downstream tasks: triplet recognition and CVS assessment.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript is well organized and easy to understand.
- The proposed dataset is interesting, given that ES201 does not provide details on instrument classes or interaction labels, missing the full scene context.
- The hand/operator identity information is crucial for distinguishing the nature of the instrument–tissue interaction.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Inadequate Experimental Details
- There is no mention of the performance of Faster R-CNN, either overall or per class, making it difficult to gauge how the detector's performance impacts the downstream tasks.
- No details on how many proposals are handled and how many proposals per image are used to construct the graph.
- No details provided on the feature dimensions and number of layers for both edge classifier and the hand identity classifier.
- Unclear description of how triplet recognition is performed. How are the detections handled, and since the model uses a graph structure, how does the model arrive at logits of shape B x 34? Is it averaging graph node features or max-pooling? These details are critical to the method.
- For the triplet recognition, why are standard baseline models such as Rendezvous not included?
- Is the model SSG-Com based on top of LG-CVS as the manuscript does not specifically mention this information?
- Is there message passing involved between instrument and target instances?
- Lack of technical novelty
- The manuscript does not offer sufficient technical novelty in its utilization of the interaction features or the hand identity information.
- The identity of the hand or operator would be more intuitive as part of the edge features, as it provides distinctive cues as to which operator performs which action.
- Lack of analysis of failure cases - Are there cases where the model fails to identify instrument–tissue interactions? Which interactions are more prone to misclassification than others?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the manuscript presents a nice approach to enriching an existing dataset of surgical entity segmentations. However, it lacks many experimental details and clarification on how the model obtains the final logits needed for the downstream-task predictions. Moreover, the lack of sufficient baselines leads to an inadequate analysis of model performance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors presented a thorough rebuttal that answered all questions in detail. This dataset also opens a new avenue to explore not only scene understanding but also the complex interplay of different operators.
Review #2
- Please describe the contribution of the paper
This paper introduces a new dataset, Endoscapes-SG201, which extends the existing Endoscapes-Bbox201 with refined annotations and hand identity labels. Based on this dataset, the authors propose a graph-based learning framework, SSG-Com, that incorporates spatial relations, surgical actions, and hand identity into scene graph representations. The method is evaluated on two downstream tasks and shows improved performance over existing approaches.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- This paper introduces a new dataset that incorporates hand identity information into surgical scenarios, and this dataset will be made publicly available.
- A new graph-based method is designed and achieves state-of-the-art results in both CVS prediction and triplet recognition tasks.
- The paper includes both quantitative comparisons and ablation studies to demonstrate the effectiveness of the proposed design.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The qualitative results in Fig. 3 are insufficient to fully demonstrate the effectiveness of the proposed method.
- Given that the Endoscapes dataset contains only laparoscopic views without operating room environment views, how was the annotation of operating hands achieved? Could there be potential inaccuracies that might affect the method and experimental results?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
It is suggested that the author discuss the generalization of the proposed graph construction method and whether it can be successfully applied to other datasets for understanding surgical scenarios, such as the CholecT50 surgical triplet dataset.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper proposes a dataset with a new hand identity task, which has potential for future research. A graph-based modeling approach is proposed. More experimental results are needed to demonstrate its effectiveness.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The authors propose a surgical scene graph dataset, Endoscapes-SG201, that builds on the existing Endoscapes-BBox201, adding tool classes, operator hand labels, and tool relation labels. Then, they show that pre-training latent graph representations on these scene graph annotations can lead to performance benefits in 2 different downstream tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The proposed dataset is quite unique, and can accelerate various popular research areas that have begun to leverage graphs (scene graph prediction, representation learning, visual question answering, etc.). 2) The method formulation is very sensible - the separation of relations into spatial and action-based.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) The method description can be unclear at times. 2) The differences between the proposed SSG-Com and the baseline LG-CVS need to be better delineated, both from a method description perspective and in the evaluation. For example, why can’t LG-CVS be trained using the full Endoscapes-SG201 dataset (including relation labels)? While there may be clear answers to such questions, they aren’t necessarily clear from reading the manuscript.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
I would strongly encourage the authors to revisit the writing of the paper, focusing on distinguishing SSG-Com from LG-CVS. Perhaps a brief preliminary section describing LG-CVS could help here, with following sections explaining added/modified method components.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper offers a very clear contribution: a new scene graph dataset along with an incrementally improved latent graph learning method. Together, they enable improved downstream task performance on 2 different tasks.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I am retaining my original rating, as my original comments still stand. The comments from the authors clarifying aspects that LG-CVS cannot handle (operator hand classification, action classification) as well as the downstream classification details (3 linear heads for multi-label triplet classification) improve clarity. I would be careful saying that existing methods ‘cannot’ handle certain modeling aspects as there are usually naive extensions that can enable this (for example, LG-CVS could be extended to handle the new labels by training the object detector to differentiate left-hand tool vs. right-hand tool for handedness, then including action-based edges as a 4th edge class in addition to the 3 spatial edges). With that said I definitely agree that the proposed method is a much more elegant and generalizable formulation.
Regarding concerns to method comparisons (Rendezvous for triplet classification), I agree with the authors that it is not strictly necessary to evaluate their framework, which is aimed to improve LG-CVS. However, it would have been generally informative as the proposed triplet dataset is completely new and different from the commonly used CholecT50. I also think that there is enough evidence that directly leveraging dense labels (e.g. tool bounding boxes) is better than relying on class activation mapping or other forms of weak supervision, as shown by Sharma et. al. in [1].
[1] Sharma, S., Nwoye, C. I., Mutter, D., & Padoy, N. (2023, October). Surgical action triplet detection by mixed supervised learning of instrument-tissue interactions. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 505-514). Cham: Springer Nature Switzerland.
Author Feedback
We appreciate all reviewers for their valuable comments and constructive feedback. The reviewers recognized that our new dataset, Endoscapes-SG201, is interesting (R3), unique and can accelerate research areas that leverage graphs (R2). The reviewers also noted that our graph formulation is sensible (R2), and our experiments demonstrate its effectiveness (R1). We will address all comments in the camera-ready version and release all code and datasets.
[R2,R3] LG-CVS and SSG-Com We clarify that SSG-Com builds on LG-CVS with several extensions. In LG-CVS, the graph is defined with tools and anatomy as nodes and spatial relations as edges. To leverage Hand identity and Action, we add a hand identity classifier and an action classifier. Consequently, in Tab. 3(a), LG-CVS cannot learn Hand identity and Action, whereas SSG-Com effectively integrates both.
[R1] Hand Identity Annotation The Endoscapes dataset consists of laparoscopic cholecystectomy videos in which 4 trocars are used: 2 for the operator, 1 for the assistant, and 1 for the camera. Since trocar positions remain fixed during surgery, the operator’s left-hand tool always appears on the left, the right-hand tool in the lower right, and the assistant’s tool on the right. Therefore, accurate annotation can be achieved solely from laparoscopic views. Furthermore, two clinical experts performed the annotation, with multiple validation steps.
[R3] Experimental Details (Detector Performance) Faster R-CNN’s bbox mAP is 31.1, with high performance on gallbladder (62.3) and hook (63.4) and low performance on cystic plate (6.4) and irrigator (5.1). (Number of Proposals) Per image, 120 candidate edges from 16 nodes are handled, and 4 edges per node are selected, resulting in 64 edges processed by the GNN. (Classifiers) We use MLPs for edge, hand identity, and action triplet classification. The Edge and Hand Identity Classifiers take 256-D features as input and output 7 classes for Action Edge and 4 for Hand Identity, each including a null class. The Action Triplet Classifier takes a 1024-D graph latent as input, consists of three linear layers, and predicts 34 classes via multi-label classification. (Message Passing) Message passing occurs as node and edge features pass through an MLP, updating the node features. (Comparison with Rendezvous) Our goal is to show how Action and Hand Identity enrich graph latents. Thus, in Tab. 3(a), we ablate tool-class granularity, Action, and Hand Identity, passing the graph latents from both LG-CVS and SSG-Com through the same MLP decoder. Including Rendezvous may conflate graph effects with architectural differences, so we decided not to include it.
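The classifier dimensions stated in the rebuttal can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the hidden sizes, weight initialization, and all function names are assumptions; only the input/output dimensions (256-D node/edge features, a 1024-D graph latent, 7 action-edge classes, 4 hand-identity classes, 34 triplet classes, 16 nodes, and 64 selected edges) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """Build a small MLP as a list of (W, b) pairs; ReLU between layers.
    All sizes beyond the stated input/output dims are assumptions."""
    return [(rng.standard_normal((i, o)) * 0.01, np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(layers, x):
    for k, (W, b) in enumerate(layers):
        x = x @ W + b
        if k < len(layers) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    return x

# Heads with the dimensions stated in the rebuttal:
edge_head    = mlp([256, 256, 7])         # action edge: 7 classes incl. null
hand_head    = mlp([256, 256, 4])         # hand identity: 4 classes incl. null
triplet_head = mlp([1024, 512, 256, 34])  # three linear layers -> 34 triplets

edge_logits    = forward(edge_head, rng.standard_normal((64, 256)))   # 64 edges kept for the GNN
hand_logits    = forward(hand_head, rng.standard_normal((16, 256)))   # 16 nodes per image
triplet_logits = forward(triplet_head, rng.standard_normal((2, 1024)))  # batch of graph latents
print(edge_logits.shape, hand_logits.shape, triplet_logits.shape)
# -> (64, 7) (16, 4) (2, 34)
```

Note that the 120 candidate edges per image are consistent with fully connecting 16 nodes (C(16, 2) = 120) before the top-4-per-node selection keeps 64; for the 34-way triplet head, a multi-label (sigmoid/BCE) loss matches the rebuttal's description.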
[R3] Technical Novelty While LG-CVS provides a solid graph foundation by encoding tools, spatial relationships, and anatomy, it does not consider Hand Identity and Action. We therefore extend its architecture with a hand-identity classifier to enrich node features and an action classifier to augment edge features. In scene graphs, objects are represented as nodes, and relationships as edges. Hand identity is an attribute of each tool, regardless of the presence of a relationship, so we use it as a part of a node rather than an edge. However, it can still influence the edge features through message passing.
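The node update described above — node and edge features passing through an MLP to refresh the node features, letting a node attribute like hand identity influence edges — can be sketched as a toy message-passing step. Mean aggregation over incident edges, the single-layer MLP, and the residual update are assumptions; the rebuttal does not specify these details.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy message-passing step in the spirit of the rebuttal's description.
N, E, D = 16, 64, 256                     # nodes, edges, feature dim (from the rebuttal)
nodes = rng.standard_normal((N, D))       # node features (would include hand-identity cues)
edges = rng.standard_normal((E, D))       # edge features
src = rng.integers(0, N, size=E)          # edge endpoints (hypothetical connectivity)
dst = rng.integers(0, N, size=E)

W = rng.standard_normal((3 * D, D)) * 0.01  # single-layer MLP weight (assumption)

# Message per edge: concat(source node, destination node, edge feature) -> MLP
msg = np.concatenate([nodes[src], nodes[dst], edges], axis=1) @ W  # (E, D)

# Mean-aggregate incoming messages at each node, then residual update
agg = np.zeros((N, D))
cnt = np.zeros((N, 1))
np.add.at(agg, dst, msg)
np.add.at(cnt, dst, 1.0)
nodes_updated = nodes + agg / np.maximum(cnt, 1.0)
print(nodes_updated.shape)  # -> (16, 256)
```

Because each message concatenates both endpoint features, any hand-identity signal carried in a tool node propagates into the updated representations of its neighbors, which is the mechanism the rebuttal appeals to.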
[R3] Analysis of Failure Cases We have empirically observed that the model occasionally fails to identify specific instrument-tissue interactions. The model fails to predict interactions mainly for two reasons: (1) Missing detections lead to missing nodes, with top errors being ‘Irrigator’ and ‘Cystic Plate’. (2) Tool-target interactions are mispredicted despite node generation, with top errors being ‘Grasp’ and ‘Dissect’.
[R1] Qualitative Results We present Fig. 3 to visually demonstrate that the SSG-Com graph contains richer information than that of LG-CVS. Although we include only a few examples due to space limitations, the effectiveness of the information-rich SSG-Com graph is quantitatively verified in Tab. 3.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Most of the reviewers’ concerns are sufficiently addressed.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A