Abstract
Surgical Video Synthesis has emerged as a promising research direction following the success of diffusion models in general-domain video generation. Although existing approaches achieve high-quality video generation, most are unconditional and fail to maintain consistency with surgical actions and phases, lacking the surgical understanding and fine-grained guidance necessary for factual simulation. We address these challenges by proposing HieraSurg, a hierarchy-aware surgical video generation framework consisting of two specialized diffusion models. Given a surgical phase and an initial frame, HieraSurg first predicts future coarse-grained semantic changes through a segmentation prediction model. The final video is then generated by a second-stage model that augments these temporal segmentation maps with fine-grained visual features, leading to effective texture rendering and integration of semantic information in the video space. Our approach leverages surgical information at multiple levels of abstraction, including surgical phase, action triplets, and panoptic segmentation maps. The experimental results on Cholecystectomy Surgical Video Generation demonstrate that the model significantly outperforms prior work both quantitatively and qualitatively, showing strong generalization capabilities and the ability to generate higher frame-rate videos. The model exhibits particularly fine-grained adherence when provided with existing segmentation maps, suggesting its potential for practical surgical applications.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0550_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/0550_supp.zip
Link to the Code Repository
https://github.com/diegobiagini/HieraSurg
Link to the Dataset(s)
N/A
BibTex
@InProceedings{BiaDie_HieraSurg_MICCAI2025,
author = { Biagini, Diego and Navab, Nassir and Farshad, Azade},
title = { { HieraSurg: Hierarchy-Aware Diffusion Model for Surgical Video Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15968},
month = {September},
pages = {309--319}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a two-stage method for surgical video generation. In the first stage, an encoder takes an initial frame and its semantic mask, along with phase and triplet (tool, action, target) labels for each frame to be generated, and outputs new semantic masks. These semantic masks are fed into the second stage to generate their corresponding video frames.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The generated semantic masks at the first stage are conditioned on phases and action triplets. Triplet and phase information are, in my understanding, efficient but sometimes limited descriptors for videos: the variations of triplets and phases span a fixed, limited space, and are therefore efficient to train on.
- Since the final generated video is conditioned on triplets, phases, and semantic masks, some downstream tasks become possible, such as using synthetic video as part of the training data for a triplet or phase classification model, or for a segmentation model.
- Proposes a semantic-mask collection method that combines RADIO features and SAM2.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Since the variations of the conditions seem limited, as mentioned above, the generated semantic masks (and the videos based on them) may lack variety even when the noise fed to the diffusion model is varied.
- Some details are missing, such as how the K-Means parameters are chosen in Sec. 2.1 HieraSurg - HieraSurg-S2M.
- Some descriptions in Sec. 2.1 HieraSurg - HieraSurg-M2V are confusing. In my understanding, this part takes the generated semantic masks as input and produces an initial feature H_{seg}. H_{seg} and the intermediate features H_{i} (i indexing the different blocks) are concatenated, and then self-attention (rather than the cross-attention stated in the paper) is performed. If the authors instead mean H_{cat} = [H_0, H_1, ..., H_n, H_{out}], with cross-attention performed across the different levels of features, I would suggest revising this part. The usage of the final feature H^{out}_{seg} is also unclear: is it fed directly to the decoder?
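To make the ambiguity concrete, here is a toy sketch contrasting the two readings (hypothetical token counts and dimensions, plain single-head dot-product attention; this is an illustration of the two attention patterns, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Standard scaled dot-product attention.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d = 16
h_i = rng.normal(size=(10, d))    # hypothetical intermediate visual tokens H_i
h_seg = rng.normal(size=(4, d))   # hypothetical segmentation tokens H_seg

# Reading 1 ("cross-attention"): visual tokens query the segmentation tokens.
cross = attention(h_i, h_seg, h_seg)           # shape (10, d)

# Reading 2 ("self-attention over the concatenation"): one joint token set.
h_cat = np.concatenate([h_i, h_seg], axis=0)   # shape (14, d)
joint = attention(h_cat, h_cat, h_cat)         # shape (14, d)
```

The two readings produce features of different shapes and mix information differently, which is why the paper's wording matters for reproducibility.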
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Please consider sharing detailed implementation information, such as the baseline model, the shape of each variable in Sec. 2.1 HieraSurg - HieraSurg-M2V, and how they are combined with the decoder.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The potential use of this work in the downstream tasks mentioned in strength 2.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper introduces HieraSurg, a two-stage conditional diffusion framework for synthetic surgical video generation. The proposed approach comprises two components: (i) HieraSurg-S2M, which generates panoptic segmentation maps using a diffusion model, and (ii) HieraSurg-M2V, which leverages these segmentation maps, along with textual information describing the surgical procedure, to produce realistic video sequences via a latent diffusion model (CogVideoX-2B). The framework is evaluated against both conditional and unconditional video generation baselines. Results demonstrate the potential of hierarchical generation in the surgical domain by leveraging textual and segmentation information.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
While conditional diffusion models in video generation are not novel per se, the key strengths of this work include:
- The development of a hierarchical pipeline that integrates surgical textual cues with semantic segmentation maps to guide video generation.
- The use of SAM2 for generating high-quality ground truth panoptic segmentation maps, enabling the training of the first-stage model.
- A post-processing step involving K-Means clustering to deal with inconsistency in the generated map.
- A qualitative and quantitative evaluation comparing the proposed method with state-of-the-art (SOTA) approaches.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The methodology section would benefit from the inclusion of a pseudo-algorithm or flow diagram to clearly illustrate the data pipeline and model interactions. For instance, the subsection on diffusion models could be trimmed or merged if it does not introduce novel technical elements.
- The model heavily relies on the semantic segmentation map, implying that when the ground truth panoptic segmentation map is used, the results should be comparatively better. However, in Table 2, S2M+M2V VAE Pred Seg shows superior performance for FPS1. A clearer explanation of this observation is necessary.
- The paper lacks details about the classes included in the panoptic segmentation maps and the number of clusters determined during the K-Means post-processing step, both of which are critical for reproducibility and interpretation.
- The training procedure for both HieraSurg-S2M and HieraSurg-M2V is under-explained. Information such as dataset splits, optimization settings, training duration, and computational requirements would strengthen the experimental section and support the framework's practical viability.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend a weak accept as the paper presents a novel and well-motivated hierarchical framework for surgical video generation, supported by thoughtful design choices and qualitative results. However, the submission would benefit from greater methodological clarity and expanded experimental details to fully support its contributions.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces HieraSurg, a novel two-stage diffusion-based pipeline for surgical video generation. By explicitly modeling hierarchical surgical semantics, such as phases, action triplets, and segmentation maps, it aims to generate realistic surgical scenes conditioned on both high-level and mid-level information. The proposed framework consists of a segmentation prediction model (S2M) and a video generation model (M2V), and shows improved results over existing baselines on Cholecystectomy datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Hierarchical abstraction: The paper proposes an interesting hierarchical modeling approach that separates high-level semantic scene prediction from low-level texture synthesis, which is conceptually aligned with how surgical workflows are structured.
- Semantic conditioning: The use of phase and action triplets as conditioning inputs for segmentation prediction provides a richer control mechanism than standard unconditional video generation.
- Custom segmentation pipeline: An automatic segmentation labeling method based on SAM2 and RADIO is designed to address data scarcity, which is a practical contribution for surgical data science.
- Comprehensive experiments: The method is evaluated on both 1 FPS and 8 FPS settings with a variety of fidelity and detection-based metrics.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Cascaded design lacks justification: While the authors argue that hierarchical scene modeling benefits from a two-stage architecture, the paper does not present any direct comparison or ablation to validate this claim. Moreover, potential error propagation across the stages is not fully discussed.
- VAE and segmentation encoder design is unclear: The paper does not specify whether the VAE is pretrained or trained from scratch. Additionally, the segmentation encoder in M2V appears to be jointly trained with the DiT block, which is unconventional. The motivation behind this design choice should be clarified.
- Limited temporal modeling in M2V: The video generation model conditions only on the first frame, which may hinder the model's ability to maintain temporal consistency for longer sequences. No mechanism is introduced to account for temporal dynamics beyond the initial input.
- Insufficient detail on semantic representations: The phase and triplet encodings are critical to the model's conditioning but are poorly described.
- Lack of discussion on clinical applicability: The paper would benefit from a clearer explanation of how the generated videos might be used in real-world surgical settings, such as training simulators or decision support tools. The current discussion remains too abstract.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a promising hierarchical approach to surgical video generation with solid experimental results. However, key design choices such as the two-stage pipeline, limited temporal conditioning, and jointly trained segmentation encoder lack sufficient justification. Addressing these limitations could significantly strengthen the paper’s contributions.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive comments and want to shed light on some excellent points raised. Reasoning for the 2-stage design: (R4) While no ablation of the full pipeline was built from the ground up (we instead opted for single-component ablations), we believe that the ability to train with additional visual representations is a real motivation for our approach. And while error propagation is a valid concern, the cascaded design allows failures to be attributed to individual stages during evaluation. As mentioned by (R2), we agree that generalisation to surgical events with no representation in the dataset is limited. However, the 2-stage approach allows flexibility in specifying possibly OOD human-created maps that can bridge the gap, a weak point of single-stage models.
K-means and K: (R2, R3) We evaluate the distortion when choosing K over a sensible range (5 to 20), and then use an algorithmic elbow method, i.e. finding the minimum (across the Ks) of the second derivative of the distortion. We plan to add this missing detail.
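The selection rule described above can be sketched as follows (a minimal illustration on a synthetic distortion curve; the function name and the example curve are hypothetical, not the authors' code):

```python
import numpy as np

def elbow_k(distortions, ks):
    """Pick K at the minimum of the discrete second derivative of the
    K-Means distortion curve, per the rebuttal's 'algorithmic elbow'."""
    d2 = np.diff(np.asarray(distortions, dtype=float), n=2)
    # d2[i] is the second difference centred on ks[i + 1] (interior Ks only).
    return ks[1 + int(np.argmin(d2))]

# Hypothetical distortion (inertia) values for K = 5..10.
ks = list(range(5, 11))
distortions = [100, 80, 65, 40, 38, 36]
chosen = elbow_k(distortions, ks)
```

In practice the distortion values would come from fitting K-Means once per candidate K and reading off the resulting inertia.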
Training choices: (R4) The VAE was not finetuned. With respect to training the semantic part together with the backbone: to be able to start from the CogVideoX weights during the unconditional training, we chose not to remove the attention mechanism between the encoded condition (a text encoding at that point) and the activations. Since we trained M2V starting from the unconditional weights, if we were to plug in the semantic head without finetuning the inner blocks, the network would keep its previously learned behaviour, i.e. ignoring the parts of the features corresponding to the conditioning. Admittedly, we could have finetuned only the multimodal attention blocks plus the semantic encoder, instead of the whole network, but in practice the computational difference between the two did not warrant the less stable approach.
Clarifications: (R2, R3, R4) We will open-source the code before the conference, including training and data settings. (R2) Indeed, the nomenclature of self-/cross-/multimodal attention is blurred even in background work; the more encompassing term "attention" is more appropriate. We failed to make explicit that H_seg should also carry an index, since it is processed through the network; the last H_seg obtained at the output is unused. We will improve clarity on these points. (R3) We attribute the improved fidelity metrics when providing the 1 fps M2V with predictions instead of ground truth to the more chaotic nature of predicted maps. S2M has learned to ignore unclear entities as part of the segmenter noise at dataset creation, reverting to a more average behaviour when presented with such a situation. But when it receives unseen (yet consistent) GT maps, it is genuinely challenged to come up with something tangible, yielding less visually appealing but more semantically correct samples, as verified by the HR metric. This behaviour is not found at 8 fps because: 1) there are slack frames the model can use to correct visual errors, and 2) this model only uses the first 6 frames of a predicted segmentation map, a subset whose distribution is closer to the GT. (R3) The segmentation maps are classless; entities are characterised only as instances (with an in-clip index) that move. While optimal class information would improve our method, we found that obtaining it automatically is too unreliable. (R4) We argue that M2V has a strong temporal consistency mechanism baked in, since the individual frames of the video segmentation map act as stepping points. However, we agree that S2M still lacks strong priors to keep the segmentation in one frame consistent with the next; the post-processing step plays a role in patching this.
We are pleased that the reviewers recognise our contributions in the unsupervised labelling pipeline and its role in helping a generative model parse reality: one of the many steps necessary toward fully data-driven, clinically applicable scenario generators in data- and label-scarce settings.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A