Abstract
Surgical video generation can enhance medical education and research, but existing methods lack fine-grained motion control and realism. We introduce SurgSora, a framework that generates high-fidelity, motion-controllable surgical videos from a single input frame and user-specified motion cues. Unlike prior approaches that treat objects indiscriminately or rely on ground-truth segmentation masks, SurgSora leverages self-predicted object features and depth information to refine RGB appearance and optical flow for precise video synthesis. It consists of three key modules: (1) the Dual Semantic Injector, which extracts object-specific RGB-D features and segmentation cues to enhance spatial representations; (2) the Decoupled Flow Mapper, which fuses multi-scale optical flow with semantic features for realistic motion dynamics; and (3) the Trajectory Controller, which estimates sparse optical flow and enables user-guided object movement. By conditioning the Stable Video Diffusion model on these enriched features, SurgSora achieves state-of-the-art visual authenticity and controllability in surgical video synthesis, as demonstrated by extensive quantitative and qualitative comparisons. Our human evaluation in collaboration with expert surgeons further demonstrates the high realism of SurgSora-generated videos, highlighting the potential of our method for surgical training and education.
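As a rough mental model of the pipeline described in the abstract, the minimal sketch below composes the three modules on dummy tensors. Every function here is a hypothetical stand-in, not the authors' implementation; the real system conditions a Stable Video Diffusion backbone on the resulting features.

```python
# Conceptual sketch only: illustrative stand-ins for the three modules named in the abstract.
import torch

def dual_semantic_injector(rgb, depth, semantic):
    # Hypothetical stand-in: combine frame, depth, and object cues into one feature map.
    return torch.cat([rgb, depth, semantic], dim=1)

def decoupled_flow_mapper(features, flow):
    # Hypothetical stand-in: associate the fused features with the flow field.
    return features + flow.mean(dim=1, keepdim=True)

def trajectory_controller(clicks, shape):
    # Hypothetical stand-in: a dense flow field derived from sparse user clicks
    # (the clicks are ignored in this stub).
    return torch.zeros(1, 2, *shape)

frame = torch.rand(1, 3, 256, 256)       # single input frame
depth = torch.rand(1, 1, 256, 256)       # self-predicted depth
semantic = torch.rand(1, 1, 256, 256)    # self-predicted object cue
clicks = [(120, 80), (140, 100)]         # user-specified motion cue

flow = trajectory_controller(clicks, (256, 256))
cond = decoupled_flow_mapper(dual_semantic_injector(frame, depth, semantic), flow)
print(cond.shape)  # conditioning tensor a video diffusion backbone could consume
```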
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0811_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0811_supp.zip
Link to the Code Repository
https://github.com/DavisMeee/SurgSora.git
Link to the Dataset(s)
N/A
BibTex
@InProceedings{CheTon_SurgSora_MICCAI2025,
author = { Chen, Tong and Yang, Shuya and Wang, Junyi and Bai, Long and Ren, Hongliang and Zhou, Luping},
title = { { SurgSora: Object-Aware Diffusion Model for Controllable Surgical Video Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15969},
month = {September},
pages = {520 -- 530}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces SurgSora, a framework designed to generate motion-controllable surgical videos from a single input frame and user-specified motion cues. The system consists of three key modules: (1) Dual Semantic Injector – Integrates RGB-D features with segmentation cues to refine spatial representations; (2) Decoupled Flow Mapper – Merges multiscale optical flow with semantic features to produce realistic motion dynamics; and (3) Trajectory Controller – Estimates sparse optical flow and facilitates user-guided object movement. These components collectively condition Stable Video Diffusion, enabling precise and controllable video synthesis.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The key strengths of the paper are: (1) The paper is generally well-written and easy to follow. (2) The use of diffusion models for synthetic surgical video generation is a timely and relevant research topic. (3) The proposed architecture, though complex, appears to be novel and meaningful. (4) The experimental results are promising, demonstrating state-of-the-art performance on the publicly available CoPESD dataset. (5) Ablation studies validate the advantages of the various modules.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The negative aspects of the paper are: (1) A more detailed discussion of the absence of an experimental comparison with MedSora [24] would be beneficial. (2) The study relies on only one dataset for training and evaluation. Including additional datasets, such as those used in Endora and MedSora, would make the paper stronger; it is unclear why additional datasets were not utilized. (3) SurgSora's improved performance over competing approaches is expected, given that it appears to be the only model trained on this specific dataset. Could the authors confirm this assumption?
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents an interesting new architecture, and the experimental results are promising, surpassing the state of the art. However, as mentioned above, I have concerns regarding the following: (1) the absence of a comparison with MedSora; (2) the reliance on a single dataset for experiments; and (3) the apparent specialization of SurgSora for the evaluation dataset compared to competing approaches. If the authors address these concerns, and depending on the feedback received, I would be open to reconsidering my rating.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The feedback provided by the authors addressed most of my concerns. Hence, I am willing to accept the paper.
Review #2
- Please describe the contribution of the paper
The paper proposes a surgical video generation method in which the generated video can be controlled by the user's clicks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Two pre-trained models, SAM and DAMv2, are adopted to provide cues.
- A pre-trained trajectory decoder is utilized to estimate optical flow from user interaction, which means the user can edit the video to some degree.
- Features from the frame, semantics, and depth are fused and warped by optical flow to improve the quality of the generated video.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Semantic features are generated by a pre-trained SAM. When no prompt is given, SAM uses a grid of points as the prompt; in surgical video, in my experience, this usually results in a messy semantic mask (over- or under-segmented organs and tissues). From Fig. 1 I notice the authors directly use the intermediate features from SAM's encoder as the segmentation feature. I am not sure how useful these features are, and I wonder whether changing SAM's prompt (such as manually assigning some points) would largely affect the result (see the sketch after this list).
- The optimization is missing (optimization function, etc.).
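For reference on this point, SAM's image-encoder embedding can be read out before any prompt is applied, which is presumably what the intermediate-feature usage refers to; the sketch below contrasts that prompt-free path with grid-prompted automatic masks. The checkpoint path and input frame are placeholders, and whether SurgSora consumes the embedding exactly this way is an assumption.

```python
# Sketch: prompt-independent encoder embeddings from SAM versus grid-prompted masks.
# Checkpoint path and image file are placeholders.
import cv2
from segment_anything import sam_model_registry, SamPredictor, SamAutomaticMaskGenerator

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")            # placeholder checkpoint
image = cv2.cvtColor(cv2.imread("frame.png"), cv2.COLOR_BGR2RGB)         # placeholder frame

# (a) Prompt-free path: the image encoder's embedding, available before any prompt is set.
predictor = SamPredictor(sam)
predictor.set_image(image)
embedding = predictor.get_image_embedding()   # shape [1, 256, 64, 64] for a 1024-px input

# (b) Grid-prompted path: automatic masks, which (as noted above) can be over-/under-segmented.
masks = SamAutomaticMaskGenerator(sam).generate(image)
print(embedding.shape, len(masks))
```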
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Controllable video generation for surgical video is important for surgical training, and using pretrained models to refine generated videos is promising.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
I generally agree with the comments from R2 and R3 that:
1) Some technical details are not well discussed or are completely missing. The semantic feature goes into the RGB feature and the depth feature as shown in Fig. 1, but how they are combined is missing from this submission. In the rebuttal, the authors mention that the semantic feature (from SAM) is concatenated with the depth feature (from DAMv2) or the RGB features (where the details of the RGB encoder are also missing), respectively. However, only one arrow in Fig. 1 and no notation for concatenation in Eq. 1 make it hard to understand whether they are concatenated, added together, or combined in some other form. For the optimization, the authors claim they optimize the model with a standard MSE loss, but I cannot find any related explanation in the original submission. This makes it very difficult to re-implement the method from the original submission alone;
2) The lack of comparisons that was mentioned. For example, MedSora was evaluated on colonoscopic and laparoscopic videos from different institutes, while this submission is evaluated only on CoPESD (collected from 4 in-vivo porcine sites). Answering R2's questions on generalizability by simply saying that SurgSora performs better than the compared methods on CoPESD seems unconvincing to me;
3) In the rebuttal, the authors answered R3's question on how the RGB and depth features are transferred by optical flow. But warping the RGB and depth features with the same optical flow seems strange. Suppose the camera zooms in (moves closer to the focus point): the surrounding RGB content moves toward the center, so the optical flow points toward the center. The corresponding depth should change as the camera gets closer, yet the authors state the warping happens "without directly altering depth values" (see the generic warping sketch below).
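As background for the warping concern in item 3, the sketch below shows the generic flow-based backward warping (bilinear sampling) that such flow-guided pipelines typically use; it is my own illustration under stated assumptions, not the authors' code. Note that it relocates values without rescaling them, which is precisely why warping depth with the same flow as RGB leaves depth magnitudes unchanged under zoom-like motion.

```python
# Generic flow warping of a feature/depth map (bilinear backward warping).
# Values are moved to new positions but not rescaled.
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: [B, C, H, W]; flow: [B, 2, H, W] in pixels (dx, dy)."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=feat.dtype),
                            torch.arange(w, dtype=feat.dtype), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)  # [B, H, W, 2]
    grid = grid + flow.permute(0, 2, 3, 1)                                   # displaced sample points
    grid[..., 0] = 2.0 * grid[..., 0] / max(w - 1, 1) - 1.0                  # normalize x to [-1, 1]
    grid[..., 1] = 2.0 * grid[..., 1] / max(h - 1, 1) - 1.0                  # normalize y to [-1, 1]
    return F.grid_sample(feat, grid, mode="bilinear", align_corners=True)

depth = torch.rand(1, 1, 64, 64)
flow = torch.zeros(1, 2, 64, 64)                 # zero flow: output equals input
print(torch.allclose(warp(depth, flow), depth, atol=1e-5))
```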
Besides, regarding the usage of SAM, one of my initial questions was whether, if the prompt is a grid and the generated masks are noisy, the semantic feature would still be useful. The authors did not directly answer my question but instead claimed that the fusion design makes the model robust to segmentation noise, without much evidence.
In general, some ideas in this submission are interesting, especially the trajectory control part, which resembles DragGAN. However, the missing technical details and lack of experimental results suggest this submission is not yet ready for publication.
Review #3
- Please describe the contribution of the paper
The paper proposes SurgSora, a tool that uses a single initial image to generate a surgical video. The method claims to use self-predicted object features and depth information to refine RGB appearance and optical flow for precise video synthesis. The provided demo video shows nice results, although it is very short, which limits the ability to examine its exact effectiveness.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- A number of technical details, particularly some key elements, are not clearly presented, which undermines the reliability of the proposed method. For example: how the depth information and structure are obtained and constructed from the single image is unclear or inadequately described. Section 2.1 mentions using Depth Anything V2, but gives no information on how this works. A brief description of this topic and details about the depth information (data format and precision) are necessary. How the depth information and structural insight are transformed by the optical-flow prediction is also unclear or inadequately described.
- The authors claim that SurgSora does not need segmentation masks. However, the method mentions using segmentation features (Fig. 1, Section 2.1, Equation 1), object segmentation (Section 2.1), and segmented images. It also uses tools such as the Segment Anything Model (Section 2.1). Please clarify and ensure these contents are consistent.
- Using optical flow to transfer pixels in the RGB image is understandable. How it may modify the depth information to reflect the current depth relations of the different objects in the scene, so as to simulate camera view navigation, is entirely unclear.
- Equation 1: an explanation of the mathematical symbols is needed. What is the Epsilon function?
- Section 2.2: what exactly are the depth information's structural insights?
- Equation 2: what are SiLU, Conv3d, and ConvCAT?
- Section 2.3: it is not clear what is meant by "click to set trajectory".
- The video demo is very short. It is not possible to verify that the depth information and the relationships of different parts in the scene are correct.
- There is a typo in the video: "Genteraion".
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N.A.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Many necessary technical details are largely missing. It is not possible to reproduce the work as reported. Please see the details outlined in Item #7.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
While the paper can be accepted, it needs to adhere to the standards of a scientific publication by being self-contained. This will enable readers to effortlessly comprehend the material. The authors must avoid presuming that all readers are specialists in the field, despite the commonality of many terms within the domain. Therefore, enhancing the clarity of its presentation is essential.
Author Feedback
We sincerely thank Reviewers R1, R2, and R3 for their thoughtful feedback and for recognizing the novelty and potential impact of SurgSora. We will address all minor issues (e.g., typos) and integrate the reviewers' suggestions in the final version.

R1Q1 (Use of SAM features): SAM is used solely to extract object-aware semantic priors, not hard segmentation masks. The embeddings from SAM's final encoder layer are concatenated with the RGB and depth features (Fig. 1) and fused via dedicated encoders \Phi_{RGB} and \Phi_{D} (Eqn. 1) to provide object-level context. This design makes the fused features more semantically aware while staying robust to segmentation noise. Although manual prompts could improve mask quality, they would undermine the model's fully automatic pipeline and reduce usability in real-world settings.

R1Q2 (Optimization function): Our model is trained end-to-end using a standard MSE loss between the predicted and ground-truth video clips, a common approach in video synthesis tasks.

R2Q1 (Comparison with MedSora): We did not include MedSora in the comparison because it is fundamentally incompatible with our task setting. Specifically, MedSora requires multiple video frames as input to extract motion, whereas SurgSora operates from a single static frame, making MedSora unsuitable for our setting. Furthermore, MedSora performs unconditional generation based on learned motion priors, without support for user control. In contrast, SurgSora enables controllable generation through sparse, user-defined trajectories, offering greater interactivity and practical utility.

R2Q2 (Use of additional datasets): We appreciate the suggestion but emphasize that the datasets used by Endora and MedSora typically lack surgical tool annotations, making them unsuitable for our task of instrument-level controllable generation. Our method targets user-guided tool motion, which is central to surgical training scenarios and not addressed by existing datasets.

R2Q3 (Dataset specialization): All baseline models in Table 1 are trained under the same protocol on CoPESD. The consistent training setup ensures a fair comparison. SurgSora's superior performance stems from its architecture's ability to leverage RGB, depth, and semantic priors with user-defined motion guidance, rather than from dataset overfitting.

R3Q1–Q4 (Technical clarity and terminology): Depth estimation: Depth Anything V2 (DAV2), a vision-transformer-based model, predicts dense floating-point depth maps from single RGB images ([1, 256, 256] format). Flow-depth fusion: RGB and depth features are spatially warped using optical flow (without directly altering depth values), then fused via supervised learning to enhance spatio-temporal coherence. Segmentation use: there is no inconsistency; SurgSora does not use segmentation masks as input. Instead, semantic embeddings from SAM's final encoder layer are used as soft priors to enhance object awareness (please see our answer to R1Q1). Equation clarifications: in Eqn. 1, \Epsilon denotes the encoder; in Eqn. 2, SiLU is the Sigmoid Linear Unit, SiLU(x) = x ⋅ sigmoid(x), Conv3D is a 3D convolution layer, and Concat refers to feature concatenation, all common components in deep learning. The "structural insights" from depth refer to spatial priors (e.g., relative positions) used to guide fusion. Trajectory input: users define instrument paths by clicking points in the 2D input image; these are interpolated into smooth trajectories used to guide video synthesis.
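As an illustration of the trajectory input described above, a handful of clicked points can be interpolated into a smooth per-frame path; the sketch below uses cubic-spline interpolation, with the spline choice, frame count, and coordinates being assumptions rather than details from the paper.

```python
# Sketch: interpolating a few clicked (x, y) points into a smooth per-frame trajectory.
# The cubic spline and the number of frames are illustrative assumptions.
import numpy as np
from scipy.interpolate import CubicSpline

clicks = np.array([[120, 80], [150, 95], [190, 130], [220, 170]], dtype=float)  # user clicks
num_frames = 16

t_clicks = np.linspace(0.0, 1.0, len(clicks))   # parameterize the clicks along [0, 1]
t_frames = np.linspace(0.0, 1.0, num_frames)    # one sample per generated frame

spline = CubicSpline(t_clicks, clicks, axis=0)  # smooth path through the clicked points
trajectory = spline(t_frames)                   # shape [num_frames, 2]

# Per-frame displacement: a sparse flow cue at the clicked object's location.
displacements = np.diff(trajectory, axis=0)
print(trajectory.shape, displacements.shape)    # (16, 2) (15, 2)
```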
Demo & Depth Validation: Longer demo clips (29–38 s) are included in the supplementary materials. Additionally, we validate depth effectiveness indirectly via the optical flow performance in Table 1 (F1-epe/all), which supports the accuracy of spatial understanding.

R3Q5 (Reproducibility): We initially included a link to an anonymized GitHub repository in the submission. Unfortunately, it became inaccessible due to a hosting issue, which we will resolve and make public upon acceptance.
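For readers unfamiliar with the Eqn. 2 components clarified in the feedback above (Concat, Conv3D, SiLU), a small self-contained fusion block in that spirit is sketched below; the channel sizes and exact composition are assumptions for illustration, not the paper's architecture.

```python
# Illustrative fusion block using the components named for Eqn. 2: channel-wise
# concatenation, a 3D convolution over (time, height, width), and SiLU activation.
# Channel counts and the overall composition are assumed for this sketch.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, c_a: int, c_b: int, c_out: int):
        super().__init__()
        self.conv = nn.Conv3d(c_a + c_b, c_out, kernel_size=3, padding=1)
        self.act = nn.SiLU()  # SiLU(x) = x * sigmoid(x)

    def forward(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # a, b: [B, C, T, H, W]; concatenate along channels, then convolve and activate.
        return self.act(self.conv(torch.cat([a, b], dim=1)))

flow_feat = torch.rand(2, 8, 4, 32, 32)   # e.g. flow-aligned features
sem_feat = torch.rand(2, 8, 4, 32, 32)    # e.g. semantic/depth features
out = FusionBlock(8, 8, 16)(flow_feat, sem_feat)
print(out.shape)  # torch.Size([2, 16, 4, 32, 32])
```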
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A