Abstract

Surgical simulation plays a pivotal role in training novice surgeons, accelerating their learning curve and reducing intra-operative errors. However, conventional simulation tools fall short of providing the necessary photorealism and of capturing the variability of human anatomy. In response, current methods are shifting towards generative model-based simulators. Yet, these approaches primarily focus on using increasingly complex conditioning for precise synthesis while neglecting fine-grained human control. To address this gap, we introduce SG2VID, the first diffusion-based video model that leverages Scene Graphs for both precise video synthesis and fine-grained human control. We demonstrate SG2VID’s capabilities across three public datasets featuring cataract and cholecystectomy surgery. SG2VID not only outperforms previous methods qualitatively and quantitatively, it also enables precise synthesis, providing accurate control over the size and movement of tools and anatomy, the entrance of new tools, and the overall scene layout. We qualitatively motivate how SG2VID can be used for generative augmentation and present an experiment demonstrating its ability to improve a downstream phase detection task when the training set is extended with our synthetic videos. Finally, to showcase SG2VID’s ability to retain human control, we interact with the Scene Graphs to generate new video samples depicting major yet rare intra-operative irregularities.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0234_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0234_supp.zip

Link to the Code Repository

https://github.com/MECLabTUDA/SG2VID

Project page: https://ssharvienkumar.github.io/SG2VID/

Link to the Dataset(s)

https://huggingface.co/datasets/SsharvienKumar/SG2VID

BibTex

@InProceedings{SivSsh_SG2VID_MICCAI2025,
        author = { Sivakumar, Ssharvien Kumar and Frisch, Yannik and Ghazaei, Ghazal and Mukhopadhyay, Anirban},
        title = { { SG2VID: Scene Graphs Enable Fine-Grained Control for Video Synthesis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {513--523}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, the authors introduce SG2VID, a diffusion-based video synthesis model that leverages Scene Graphs (SGs) for human-controllable video generation for surgical simulation. The key innovation is the structured SG conditioning, which encodes spatio-temporal relationships of surgical tools and anatomy. The authors demonstrate SG2VID’s effectiveness on three surgical datasets (Cataract-1k, CATARACTS, Cholec80) and showcase its ability to generate rare intra-operative anomalies (e.g., pupil contraction) and improve downstream phase detection via generative augmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. SG-based video synthesis: Prior works (e.g., Endora, MedSora, LVDM) rely on unconditional generation, text prompts, or optical flow, which lack fine-grained control. SG2VID uses SGs for spatio-temporal video synthesis, enabling interpretable conditioning.
    2. The practical first-frame-conditioning architecture allows patient-specific synthesis for personalized simulation.
    3. Potential to improve the downstream task of phase recognition by augmenting training data with synthetic sequences.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The 2-D spatial spread feature is not clearly explained.
    2. The authors claim that the additional scene-graph information (the 2-dim average optical-flow direction and the 1-dim average depth) helps model the motion trajectory by capturing temporal dynamics within the surgical scene. However, no ablation study shows the improvement these features bring (one plausible reading of these node features is sketched after this list).
    3. This further raises the question: what is the performance gain of the proposed work compared to SurGrID [6]?
    4. SG construction relies on SASVi-generated masks, which may introduce errors. The paper does not evaluate how mask inaccuracies affect generation quality.
    5. The evaluation using Mask R-CNN for object detection on the synthesised sequences is not clear. How is the Mask R-CNN trained and tested on the synthesised sequences, and what is the goal of this experiment?
    6. For the downstream task of data augmentation for phase recognition, only one dataset is validated, which is limited.
    7. While the quantitative metrics are strong, no surgeon feedback is provided to assess perceptual realism and educational value.
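
    To make the feature questions in items 1 and 2 concrete, below is a minimal sketch of how such per-node features could be computed from a component mask, an optical-flow field, and a depth map. The exact definitions (e.g., spread as the standard deviation of pixel coordinates) are assumptions for illustration, not the paper’s confirmed implementation.

    import numpy as np

    def node_features(mask, flow, depth):
        """Per-node SG features for one mask component (illustrative only).

        mask:  (H, W) bool array for a single connected component
        flow:  (H, W, 2) optical-flow field for the frame
        depth: (H, W) monocular depth map for the frame
        """
        ys, xs = np.nonzero(mask)
        # 2-D spatial spread: standard deviation of the component's pixel
        # coordinates along each image axis (one plausible definition)
        spread = np.array([xs.std(), ys.std()])
        # 2-dim average optical-flow direction over the component
        mean_flow = flow[mask].mean(axis=0)
        # 1-dim average depth over the component
        mean_depth = np.array([depth[mask].mean()])
        return np.concatenate([spread, mean_flow, mean_depth])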
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The overall framework and method look inspiring to me. However, the experiments are not clear or thorough enough to demonstrate the effectiveness of the method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes SG2VID, the first diffusion-based video synthesis model using Scene Graphs (SGs) for fine-grained control over surgical tool/anatomy dynamics and scene layout. The authors demonstrate the effectiveness of SG2VID in generating rare surgical irregularities and improving downstream tasks via generative augmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Enabling intuitive control over tool/anatomy size, movement, layout, and rare intra-operative irregularities via interactive SG manipulation.

    2) Demonstrating improved downstream task performance (surgical phase recognition) through generative augmentation using synthetic videos.

    3) Validating the method across three surgical datasets (Cataract-1k, CATARACTS, Cholec80) with superior quantitative metrics (FVD, FID, LPIPS) and qualitative results compared to prior works.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Technical Novelty: The use of SGs for video generation builds directly on SurGrID, which uses SGs for static image synthesis. While extending this to video is non-trivial, the core SG conditioning idea is not entirely novel.

    • Clinical Validation Gaps: No user study with surgeons to validate the realism or educational utility of generated videos. Metrics like FID/FVD may not fully capture clinical relevance.

    • Limited resolution: the 128×128 output may discard fine-grained details necessary for surgical realism.

    • Node/edge definition of the SG: Nodes are defined as connected components in masks, which may oversimplify complex surgical scenes. The edge definition (spatial adjacency) ignores semantic relationships critical for surgical workflows (a minimal sketch of this construction follows this list).
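
    To ground this criticism, here is a minimal sketch of a nodes-as-connected-components, edges-as-spatial-adjacency construction of the kind described above. The build_scene_graph helper, the dilation-based adjacency test, and the adjacency_px parameter are hypothetical, introduced only to illustrate why such edges carry no semantic relation labels.

    import numpy as np
    from scipy import ndimage

    def build_scene_graph(seg, adjacency_px=3):
        """Nodes = connected components per class; edges = spatial adjacency.

        seg: (H, W) integer array of semantic class IDs (0 = background).
        The adjacency test and threshold are illustrative assumptions,
        not the paper's confirmed method.
        """
        node_classes, masks = [], []
        for cls in np.unique(seg):
            if cls == 0:
                continue
            labeled, n = ndimage.label(seg == cls)
            for i in range(1, n + 1):
                node_classes.append(int(cls))
                masks.append(labeled == i)

        # Two nodes are connected if their masks come within a few pixels
        # of each other (checked by dilating one mask and testing overlap).
        struct = np.ones((adjacency_px, adjacency_px), dtype=bool)
        edges = []
        for i in range(len(masks)):
            grown = ndimage.binary_dilation(masks[i], structure=struct)
            for j in range(i + 1, len(masks)):
                if (grown & masks[j]).any():
                    edges.append((i, j))
        return node_classes, masks, edges

    Note that every edge produced this way encodes only "is near", with no label such as "grasps" or "incises", which is exactly the semantic gap this weakness points out.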

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    SG2VID represents a valuable incremental advance in controllable surgical video synthesis. While methodological and validation gaps exist, the core idea, leveraging SGs for human-in-the-loop simulation, is compelling and aligns with the community’s shift toward generative surgical training tools.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes the first diffusion-based video model incorporating scene graphs, thereby enabling fine-grained control for surgical video synthesis. The proposed method is effective and generalizable across multiple datasets and scenarios.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed method is novel and solid by leveraging the power of scene graph for surgical video synthesis. It tackles the challenges of fine-grained control in surgical scenes.
    • The method has good potential for downstream tasks such as surgical phase recognition.
    • The supplementary is very helpful and the videos are well prepared.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • What is the training time of the model?
    • Can this method achieve real-time synthesis? What is the inference speed? Please also include some failure cases in Fig. 4.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See the major strengths listed above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all the reviewers for their positive comments and constructive feedback. We are grateful for your recognition of SG2VID’s capability in precise video synthesis, particularly its accurate control over tool and anatomical size and movement, entrance of new tools, and the overall scene layout, all while still maintaining intuitive, fine-grained human control. We also deeply appreciate your acknowledgement of the effort we invested in the supplementary videos to comprehensively demonstrate SG2VID’s capabilities. We will gladly incorporate the minor revisions suggested by the reviewers into the final manuscript.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


