Abstract

Understanding the intricate workflows of cataract surgery requires modeling complex interactions between surgical tools, anatomical structures, and procedural techniques. Existing datasets primarily address isolated aspects of surgical analysis, such as tool detection or phase segmentation, but lack comprehensive representations that capture the semantic relationships between entities over time. This paper introduces the Cataract Surgery Scene Graph (CAT-SG) dataset, the first to provide structured annotations of tool-tissue interactions, procedural variations, and temporal dependencies. By incorporating detailed semantic relations, CAT-SG offers a holistic view of surgical workflows, enabling more accurate recognition of surgical phases and techniques. Additionally, we present a novel scene graph generation model, CatSGG, which outperforms current methods in generating structured surgical representations. The CAT-SG dataset is designed to enhance AI-driven surgical training, real-time decision support, and workflow analysis, paving the way for more intelligent, context-aware systems in clinical practice. The dataset is available at github.com/felixholm/CAT-SG.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1687_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1687_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

CAT-SG dataset: https://github.com/felixholm/CAT-SG

BibTex

@InProceedings{HolFel_CATSG_MICCAI2025,
        author = { Holm, Felix and Ünver, Gözde and Ghazaei, Ghazal and Navab, Nassir},
        title = { { CAT-SG: A Large Dynamic Scene Graph Dataset for Fine-Grained Understanding of Cataract Surgery } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {96--106}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper makes an important contribution with the release of a new scene graph dataset for cataract surgery, enhancing the CATARACTS benchmark. Additional contributions include the presentation of a new scene graph generation model (CatSGG).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper is very well written and easy to follow.
    2) The annotation and release of a new surgical scene graph dataset for CATARACTS.
    3) Although not novel, the query embedding approach for semantic relation prediction is interesting for its flexibility to incorporate semantic constraints, leading to efficient training.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) A major weakness of the paper is the absence of an inter-rater reliability analysis and a detailed description of the annotation process. I understand the page limits; however, simply stating that 9 trained students completed this very complex set of annotations is inadequate for understanding how the process was carried out. The agreement/disagreement between the annotators and the process used to reach consensus on disputes are important elements in appreciating the validity of the annotations. Metrics such as Fleiss’ Kappa and Krippendorff’s Alpha should have been provided to report the inter-rater reliability.

    2) The geometric relation prediction needs more information. What is meant by “being close to each other”, and how is it evaluated?

    3) Table 3 shows CatSGG outperforming ORacle. However, for some relation predictions (holding, and especially pushing with F1=0), CatSGG/CatSGG+ perform very poorly. These cases are not discussed and no explanation is provided. This is another major weakness of the paper.

    4) I don't fully understand Table 4 and the comparison on surgical workflow recognition, specifically columns 3-4 and how Holm et al. [10] was developed.

    • Was it trained on the original CATARACTS dataset without the semantic relationships?
    • What are the “spatial encodings” available in the original CATARACTS? Are these the locations from the segmentations?
    • Are these the spatial encodings in CAT-SG?

    • Table 2 in [10] also includes a dynamic graph model with a maximum performance of A: 75.15 and F1: 68.56, which is close to GATv2. Why is this variant not part of the comparison?

    5) The technique recognition task needs more information.

    • Was this recognition done during a particular phase, multiple phases, or the whole procedure?
    • Was it a binary classification task between “Stop and Chop” and “Divide and Conquer”? If yes, the obtained performance is not very high.
    • Can both techniques happen during the procedure?

    6) The authors mention release of annotations but not source code and models. Results will not be reproducible if these are not released.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe the paper makes a cogent contribution; however, the weaknesses would need to be addressed in the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes CAT-SG, a large dataset containing accurate dynamic scene graph annotations that enables automatic comprehensive analysis of surgical workflow.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • This paper proposes the large-scale CAT-SG dataset with detailed annotations building upon the publicly available CATARACTS dataset, which addresses the lack of datasets for dynamic scene graph modeling in surgical workflow analysis.

    • The proposed CatSGG+ method seems to convincingly leverage surgical domain-specific prior information.

    • The authors also conducted quite thorough experiments on downstream tasks such as surgical workflow recognition and surgical technique recognition to demonstrate the dataset's applications.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The benchmark for the main task, scene graph generation, is not comprehensive: only the proposed method and ORacle are evaluated. The spatial proximity relation “close to” should be evaluated against segmentation-based methods such as [1]. The improvement on the semantic relation prediction task in Table 3 (considering the semantic relation prediction task in isolation) is not that significant, especially for the pushing relation, with an F1 score of 0.0 for the proposed CatSGG+ method.

    • Lack of ablation studies for the proposed method CatSGG+ to demonstrate the effectiveness of each component.

    [1] Dynamic scene graph representation for surgical video.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper tackles the dynamic scene graph generation task with a newly introduced large-scale dataset and a novel method, offering good insight and contributions to the field. My only concerns are listed above. In my opinion, it could be a highlight if my concerns are convincingly addressed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The primary contribution of this paper is the introduction of the CAT-SG dataset, the first large-scale scene graph dataset for cataract surgery. CAT-SG captures over 1.8 million fine-grained annotations detailing tool-tissue interactions, procedural variations, and temporal dependencies, offering a rich semantic representation of surgical workflows. To demonstrate its utility, the authors benchmark CAT-SG across three key tasks—scene graph generation, phase recognition, and technique recognition. In addition, they propose CatSGG, a novel scene graph generation model that leverages large-scale pretraining and spatio-temporal attention to outperform existing methods. Together, the dataset and model advance the development of intelligent, context-aware systems for surgical training, decision support, and workflow analysis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The introduction of the CAT-SG dataset is a particularly strong contribution. It is not only large-scale, with over 1.8 million annotated relations, but also unique in its granularity, capturing detailed tool-tissue interactions, procedural variations, and temporal relationships that go beyond traditional phase segmentation. This level of annotation enables a novel formulation of surgical understanding as a structured graph problem, which is a powerful and underexplored approach in the surgical domain. Additionally, the proposed CatSGG model leverages spatio-temporal attention and large-scale pretraining, showcasing the effectiveness of the dataset in enabling better structured surgical understanding.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the paper makes a strong contribution, there are a few areas that could be improved. In Table 2, the comparison would benefit from a more comprehensive comparison with recent and diverse surgical datasets, such as CholecT50 and JIGSAWS. Additionally, in Table 4, the evaluation of the CatSGG model would be strengthened by a broader comparison with other LLMs to contextualize the performance gains. This would provide a clearer picture of how the proposed model stands relative to the latest advances in multimodal learning.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The introduction of the CAT-SG dataset fills a significant gap in surgical AI by offering a large-scale, fine-grained resource for modeling tool-tissue interactions and procedural workflows. However, the comparison to existing datasets should be more comprehensive. Table 4 and Table 5 should both include more algorithms for better comparison with existing methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their thoughtful feedback and constructive suggestions, which we believe will help further improve the quality and impact of our work. We address the main concerns below, grouped by topic.

  1. Inter-Rater Reliability and Annotation Process (R1, R2): We appreciate the emphasis on annotation reliability. Our annotation protocol included rigorous training of the 9 student annotators, with particular emphasis on building a fundamental understanding of phacoemulsification cataract surgery. We used detailed guidelines to break down each annotation into simple components and basic interactions (such as Pushing, Cutting, etc.). Due to time constraints, we did not have overlapping annotations and therefore could not provide inter-rater agreement metrics, but we followed an iterative expert review process to verify the validity of our labels.

  2. Geometric Relation “Close to” Definition and Evaluation (R1, R2): The “close to” relation is defined based on the spatial proximity of predicted masks: two instances are considered “close to” each other if their boundaries are predicted to be adjacent by the segmentation model, following the protocol in [10] (an illustrative adjacency check is sketched after this list). We will clarify this in the final version.

  3. Performance of CatSGG/CatSGG+ for Specific Relations (R1, R2): We agree that the low F1 scores for rare relations such as “pushing” are a limitation. This is largely due to extreme class imbalance (Tab. 1). Training samples containing at least one semantic relation were selected at random across the experiments; fixed (over)sampling of rare relations instead of random sampling might improve the performance in future experiments (an illustrative oversampling scheme is sketched after this list). We appreciate that the reviewers see our methods as a strong initial baseline for this new task and dataset. We hope that the performance can be improved in future work, both by us and by the community.

  4. Comparisons and Ablations (R1, R2, R3):
     • Scene Graph Baselines: We compared to the strongest available surgical scene graph baseline (ORacle), as it is the only public method for this domain. We hope there will be more diverse models to compare against in the future.
     • Dataset Comparisons (R3): We searched for all comparable datasets that contain scene graphs (4D-OR) or graph-like structures (CholecT45/CholecT50). JIGSAWS unfortunately does not contain graphs or relations.
     • Ablations on CatSGG (R2): Due to space constraints in the paper, we did not include detailed ablation studies for CatSGG. However, it is important to note that each component plays a significant role in the overall scene graph generation; hence, we focused on presenting the final predictions.
     • Holm et al. [10] (R1): The results we report for Holm et al. are taken directly from their paper; their model was trained on CATARACTS. For more details, please refer to [10]. We did not include their results using embedded visual features, in order to have a fair comparison against our other methods.

  5. Workflow and Technique Recognition Details (R1):
     • Workflow Recognition: “Spatial encodings” refer to node groundings (position and size) derived from segmentations, in both [10] and CAT-SG.
     • Technique Recognition: Technique recognition was performed over the nucleus-breaking phase. The task is binary, and each procedure is labeled with only one of the two techniques. Recognizing the difference between these techniques remains very challenging, even for humans, as it requires recognizing minute differences in motion patterns and interactions. We will clarify these task definitions.
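
  The following is a minimal illustrative sketch (not code released with the paper) of how the “close to” mask-adjacency check described in item 2 could be implemented, assuming binary instance masks from a segmentation model; the function name and the dilation radius are hypothetical and not taken from the paper or [10].

    import numpy as np
    from scipy.ndimage import binary_dilation

    def close_to(mask_a: np.ndarray, mask_b: np.ndarray, radius: int = 5) -> bool:
        """Return True if two boolean HxW instance masks are (nearly) adjacent.

        mask_a is dilated by `radius` pixels; if the dilated region overlaps
        mask_b, the two instances are treated as "close to" each other.
        `radius` is an illustrative tolerance, not a value from the paper.
        """
        structure = np.ones((2 * radius + 1, 2 * radius + 1), dtype=bool)
        return bool(np.any(binary_dilation(mask_a, structure=structure) & mask_b))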
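
  Similarly, a minimal sketch of the fixed (over)sampling idea mentioned in item 3, assuming PyTorch data loading; the per-sample relation labels and the inverse-frequency weighting are assumptions for illustration, not the training setup used in the paper.

    from collections import Counter
    import torch
    from torch.utils.data import WeightedRandomSampler

    def rare_relation_sampler(sample_relations, num_samples):
        """Oversample training frames that contain rare semantic relations.

        sample_relations: list of sets, one per training sample, holding the
        semantic relation labels present in that sample (e.g. {"pushing"}).
        Each sample is weighted by the inverse frequency of its rarest
        relation, so frames with rare relations are drawn more often.
        """
        freq = Counter(r for rels in sample_relations for r in rels)
        default = 1.0 / max(freq.values()) if freq else 1.0  # frames without semantic relations
        weights = [1.0 / min(freq[r] for r in rels) if rels else default
                   for rels in sample_relations]
        return WeightedRandomSampler(torch.as_tensor(weights, dtype=torch.double),
                                     num_samples=num_samples, replacement=True)

  Such a sampler could then be passed to a PyTorch DataLoader via its sampler argument in place of shuffling.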

Summary: We thank the reviewers for recognizing the novelty and relevance of our dataset and method. We will address all major concerns and are confident that these revisions will further strengthen the paper and its value to the community.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


