Abstract
Computer-assisted interventions can improve intraoperative guidance, particularly through deep learning methods that harness the spatiotemporal information in surgical videos. However, the severe data imbalance often found in surgical video datasets hinders the development of high-performing models. In this work, we aim to overcome the data imbalance by synthesizing surgical videos. We propose a unique two-stage, text-conditioned diffusion-based method to generate high-fidelity surgical videos for under-represented classes. Our approach conditions the generation process on text prompts and decouples spatial and temporal modeling by utilizing a 2D latent diffusion model to capture spatial content and then integrating temporal attention layers to ensure temporal consistency. Furthermore, we introduce a rejection sampling strategy to select the most suitable synthetic samples, effectively augmenting existing datasets to address class imbalance. We evaluate our method on two downstream tasks—surgical action recognition and intra-operative event prediction—demonstrating that incorporating synthetic videos from our approach substantially enhances model performance.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4192_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/4192_supp.zip
Link to the Code Repository
https://gitlab.com/nct_tso_public/surgvgen
Link to the Dataset(s)
N/A
BibTex
@InProceedings{VenDan_Mission_MICCAI2025,
author = { Venkatesh, Danush Kumar and Funke, Isabel and Pfeiffer, Micha and Kolbinger, Fiona and Schmeiser, Hanna Maria and Distler, Marius and Weitz, Jürgen and Speidel, Stefanie},
title = { { Mission Balance: Generating Under-represented Class Samples using Video Diffusion Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The study evaluates the use of diffusion models to generate data for under-represented classes. It focuses on generating both spatial and temporal data. The method was evaluated on two tasks using two datasets: the first task is gesture recognition on the SAR-RARP50 dataset, and the second is SLB recognition on a custom dataset recorded by the authors.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
They demonstrate how synthetic video data can be used to generate samples for under-represented classes, which in turn can improve activity recognition performance. In addition, they propose a specific model—SurV-Gen—which, in most cases, outperforms two existing models: LVDM and Endora.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The approach is evaluated on only two datasets, and no ablation studies are provided to examine the individual components of the SurV-Gen model.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Class imbalance is a significant challenge, particularly in the surgical domain where some tasks are rare and data collection can be difficult. Developing methods to address this issue is therefore essential. The authors present an interesting approach based on diffusion models, achieving state-of-the-art or near state-of-the-art results.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The authors present a two-stage, text-conditioned diffusion-based framework for generating high-fidelity surgical videos, specifically targeting under-represented classes in datasets like those from robot-assisted radical prostatectomy. The method first uses a 2D latent diffusion model to generate spatial content based on text prompts, and then applies temporal attention layers to ensure consistency across frames. To improve data quality, a rejection sampling strategy is employed to select the most relevant synthetic videos for augmenting existing datasets and addressing class imbalance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors propose an interesting approach to address dataset imbalance in action recognition for surgical videos through text-conditioned diffusion-based video generation. This is a relevant and underexplored area in surgical AI.
The inclusion of image quality evaluation metrics—specifically targeting realism, unbiased quality, and coverage—is a valuable addition. These measures provide a more holistic perspective on the generative model’s outputs and help quantify its effectiveness beyond simple visual inspection.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major Concern – Lack of Ground Truth in Qualitative Comparison: Figure 3 presents a qualitative comparison between the proposed diffusion model and baseline diffusion models. However, no ground truth frames are included, making it difficult to assess how well each method captures real surgical content. For readers unfamiliar with this specific video domain, it is particularly hard to judge which outputs are more realistic or clinically useful. To make the comparison meaningful and representative of actual model performance, it would be highly beneficial to include a diverse set of ground truth images alongside the generated ones. This would enable proper visual benchmarking and strengthen the qualitative evaluation of the proposed method.
Major Concern – Inconsistent and Unclear Evaluation in Table 2: Table 2 presents action recognition results for under-represented classes using real and synthetic data. However, there are several inconsistencies that weaken the conclusions drawn:
Incomplete Baseline Comparisons: The authors report results for the baseline models only with rejection sampling, omitting their performance without rejection sampling. In contrast, for the proposed method, they include results both with and without rejection sampling. This inconsistency makes it difficult to fairly assess the relative impact of rejection sampling across methods, and raises the question of whether baseline models might also benefit similarly without it (this concern applies to Table 3 as well).
Limited Gains and Inconsistencies: For the proposed method, the action recognition performance without rejection sampling is often worse than using real data alone, which undermines the value of the generated data in its raw form. Even with rejection sampling, the improvements are not consistent across all action classes, and only show modest gains in two underrepresented classes. This suggests the proposed method’s benefits are not robust or generalizable.
Minor Concern – Use of “ca.” for Approximation: The manuscript uses the abbreviation “ca.” (e.g., “ca. 5 min”) to indicate approximate values. While this is technically correct and commonly used in some academic fields, it may not be immediately familiar to all readers, especially in more general or interdisciplinary venues. For clarity and accessibility, consider replacing “ca.” with “approximately” or “about.”
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
To strengthen the validity of the results, the authors should report:
Baseline performance with and without rejection sampling.
More detailed analysis on why performance gains vary across classes.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
To strengthen the validity of the results, the authors should report:
Baseline performance with and without rejection sampling.
More detailed analysis on why performance gains vary across classes.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces SurV-Gen, a two-stage video diffusion method for surgical video generation that aims to synthesize videos of under-represented classes for surgical action and event recognition. The authors separate spatial and temporal modeling into consecutive stages, and also introduce a rejection sampling model to select the best synthetic samples.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper is well structured and written, easy to follow.
- Image-level quality metrics are in favor of the presented method for the SLB task.
- Results on task 1 are in favor of the presented method in two out of three selected underrepresented surgical actions (and in five out of seven in total). Results on task 2 are in favor of the presented method.
- The rejection sampling model has a positive impact in both tasks.
- Relatively small standard deviation across all results, indicating the robustness of the method across independent runs.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- It is not clear how separating the spatial and temporal modeling improves the training and inference efficiency.
- No details (even brief) are given on the pre-trained SD model used in stage 1.
- It is not clear why, even though SAR-RARP50 contains videos ca. 5 min long, the generated videos for all three classes are only 4 sec long. Why was the length limited to 16 frames?
- It is not clear if the data split was random and patient-based in both datasets.
- In the rejection sampling model, it is not clear what threshold k represents, and how its values were selected.
- There is no mention of whether the image-level quality metrics behave similarly for the surgical action recognition task. It would be interesting to see these numbers too, especially since the authors link, in the text, the quantitative results in Tab. 1 (task 2 data) with the qualitative results in Fig. 3 (task 1 data).
- There is no mention of how many generated videos ultimately complemented the under-represented classes after the rejection sampling model. What did the class distribution end up being?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see above. This is an interesting work with promising results.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their helpful feedback and find it encouraging that the reviewers agree that class imbalance in surgical datasets is a relevant and underexplored problem and that our approach is promising and interesting for the CAI community.
Major comments regarding experiments and results:
- How do spatial-temporal separation and individual components impact SurV-Gen’s efficiency? (R1,R3) Due to space limitations and MICCAI’s rebuttal policy, we cannot provide full ablation experiments and leave this to future work. Yet, we want to note that the combined training of the spatial and temporal components scales the number of parameters to 1 billion, leading to longer training times without achieving performance gains compared to the proposed method.
- Inconsistent gains across action classes (R2): We specifically augmented the existing real datasets with synthetic samples only for the under-represented classes (marked in green in Table 2). Therefore, it is an expected result that the addition of synthetic videos mostly improves the performance of these under-represented classes. Notably, these gains do not hurt the results of the well-represented classes. We leave the analysis on the addition of synthetic samples also for the well-represented classes for future work.
- Baselines without rejection sampling are missing (R2): This seems to be a misunderstanding. To ensure fair comparison, we applied rejection sampling (RS) to both SurV-Gen and baseline methods, as RS consistently improved downstream performance. Results without RS are shown only as an ablation for SurV-Gen to highlight its importance. While generative models could produce implausible samples, the RS strategy effectively eliminates these, strengthening the overall contribution.
- Image-level metrics for task 2 (R3): We could not add the results due to space restrictions.
Further details:
- Why generate only 16-frame (or 4 sec.) videos (R3): Most actions in the SAR-RARP50 dataset have a median duration of only 3–10 sec. Also, due to computational limitations, the X3D model is trained on 16-frame video clips only, which are sampled from the video. At test time, the model is then evaluated in a sliding-window approach. Therefore, generating 16-frame video clips actually aligns well with the downstream task. Additionally, hardware and time constraints forced us to limit the number of frames to 16. We shall mention this in our limitations.
- How many generated videos for under-represented classes (R3): A total of 800 videos were generated per under-represented class, with 400 retained after rejection sampling. We aimed to equally balance the number of samples only for the under-represented classes.
- What is threshold “k” and how is it chosen (R3): A synthetic video of class c is discarded if c is not in the top-k predictions of a ResNet3D classifier, which was trained on the recognition task using the real data only (see Section 3.2). Because task (2) is a binary classification problem, k = 1 is the only option here. In contrast, there are 7 classes in task (1) and k = 1 would be too strict, leading to the rejection of too many candidates.
- Data splits (R3): A patient-based data split was used in this study.
- No ground truth frames in Fig. 3 (R2): We agree that adding example frames would aid readers unfamiliar with the datasets and plan to include them in the final version.
- No details on pre-trained SD model (R3): We shall add additional details to the final version of the paper.
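The top-k rejection rule described in the feedback above can be sketched as follows. This is a minimal illustration only: the class names, scores, and `keep_sample` helper are hypothetical, standing in for the ResNet3D classifier's per-class confidences described in the paper.

```python
def keep_sample(scores, target_class, k=3):
    """Top-k rejection rule: keep a synthetic video only if its intended
    class is among the classifier's k highest-scoring predictions.
    `scores` maps class labels to classifier confidences (illustrative)."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return target_class in top_k

# Hypothetical classifier confidences for one synthetic clip
scores = {"cutting": 0.40, "suturing": 0.30, "clipping": 0.15,
          "retraction": 0.10, "suction": 0.05}

keep_sample(scores, "clipping", k=3)  # True: "clipping" is in the top 3
keep_sample(scores, "suction", k=3)   # False: the clip is rejected
```

With k = 1 the rule reduces to requiring the intended class to be the classifier's top prediction, which is why the authors use k = 1 only for the binary task (2) and a looser k for the 7-class task (1).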
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A