Abstract

Endoscopic video generation is crucial for advancing medical imaging and enhancing diagnostic capabilities. However, prior efforts in this field have either focused on static images, lacking the dynamic context required for practical applications, or have relied on unconditional generation that fails to provide meaningful references for clinicians. Therefore, in this paper, we propose the first conditional endoscopic video generation framework, namely EndoGen. Specifically, we build an autoregressive model with a tailored Spatiotemporal Grid-Frame Patterning (SGP) strategy. It reformulates the learning of generating multiple frames as a grid-based image generation pattern, which effectively capitalizes on the inherent global dependency modeling capabilities of autoregressive architectures. Furthermore, we propose a Semantic-Aware Token Masking (SAT) mechanism, which enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process. Through extensive experiments, we demonstrate the effectiveness of our framework in generating high-quality, conditionally guided endoscopic content, and show that it improves performance on the downstream task of polyp segmentation. Code released at https://www.github.com/CUHK-AIM-Group/EndoGen.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1015_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/1015_supp.zip

Link to the Code Repository

https://github.com/CUHK-AIM-Group/EndoGen

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiuXin_EndoGen_MICCAI2025,
        author = { Liu, Xinyu and Liu, Hengyu and Wang, Cheng and Liu, Tianming and Yuan, Yixuan},
        title = { { EndoGen: Conditional Autoregressive Endoscopic Video Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {168--178}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a conditional endoscopic video generation method capable of modeling inherent global dependencies through its SGP strategy. Additionally, the authors propose a Semantic-Aware Token Masking mechanism that dynamically determines masking ratios based on token variance, ensuring the retention of the most informative tokens. This approach enhances the model’s ability to produce rich and diverse content by selectively focusing on semantically meaningful regions during the generation process.
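
    To make the variance-based masking concrete, the following is a minimal PyTorch sketch of what such a step could look like; the function name, the segment length seg_len, and the keep-ratio range are illustrative assumptions, not the paper's exact formulation.

        import torch

        def semantic_aware_token_mask(tokens, seg_len, mask_token, keep_range=(0.5, 0.9)):
            # tokens: (B, N, C) flattened video tokens with N = T * L; mask_token: (C,)
            B, N, C = tokens.shape
            assert N % seg_len == 0, "sequence must split evenly into segments"
            segs = tokens.view(B, N // seg_len, seg_len, C)
            var = segs.var(dim=(2, 3))  # per-segment variance, shape (B, S)
            # Normalize per sample and map variance to a keep probability, so that
            # high-variance (semantically rich) segments are retained more often.
            v = (var - var.amin(1, keepdim=True)) / \
                (var.amax(1, keepdim=True) - var.amin(1, keepdim=True) + 1e-6)
            lo, hi = keep_range
            keep = torch.bernoulli(lo + (hi - lo) * v).bool()
            out = torch.where(keep[..., None, None], segs, mask_token.view(1, 1, 1, C))
            return out.view(B, N, C), keep

    The key design point is that the masking ratio is not a fixed hyperparameter but is derived per segment from token statistics, matching the dynamic behavior described above.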

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors explicitly state that temporal dynamics are crucial for simulating endoscopic surgery. To achieve this, they developed the SGP strategy to model the spatial and temporal dependencies between frames, and the SAT strategy to preserve frames with rich semantic content based on variance while masking meaningless frames. Through these two strategies, the proposed method can ensure temporal coherence while maintaining the clinical relevance of the generated video. The experimental results demonstrate that the proposed method achieves state-of-the-art performance in both visual realism and downstream tasks. The method presented in this article can, to some extent, advance the technology of medical video generation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The methodology part is unclear and lacks sufficient detail regarding how conditions are used. The title of this paper is “Conditional Autoregressive Endoscopic Video Generation”, in which “conditional” should be a central component. However, “condition token” appears only 4 times in the entire manuscript. This significantly undermines the paper’s core contribution.

    2. Is the reconstruction error limited to a standard cross-entropy loss? Even if there is only one loss, I think a formula should be added to make the method section more complete.

    3. In the experimental section, the results obtained solely from synthetic data are even higher than those obtained solely from clinical data, which is highly questionable. In fact, many papers have shown that in downstream tasks, synthetic data alone cannot compare with real data. Can this method, which only uses cross entropy as a constraint to synthesize data, really achieve such good results? I think the methodology section should be more focused and clearly described.

    4. In the tables of the experimental section, there is no description of what the reported indicators represent. For someone not familiar with this task, it is difficult to know what these experiments are comparing, and each value is very large.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors should clarify the details of the method, such as the entire network pipeline (how frames are encoded, how they are fed to the VQGAN, and how they are reconstructed), the loss function of the network, how temporal correlation among frames is ensured, and how conditional tokens are used and what impact they have on the results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my concerns regarding the methodology section. Although clinical validation is lacking, I believe the contribution of the methodology section makes this article worthy of acceptance.



Review #2

  • Please describe the contribution of the paper

    The manuscript proposes a conditional endoscopic video generation framework using an autoregressive model. The proposed solution is motivated by the high resource consumption and memory requirements of video generation models based on 3D convolutions, and by temporal inconsistencies in the output of interleaved spatial and temporal modules. The approach introduces Spatiotemporal Grid-Frame Patterning (SGP), rearranging video frames in a grid-like structure as a single image. It further introduces a Semantic-Aware Token Masking (SAT) mechanism to improve the model’s ability to generate diverse sequences. Eventually, the potential of synthetic data is showcased in downstream experiments.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The manuscript presents the work’s motivation clearly.

    The manuscript includes detailed and insightful ablation studies.

    The writing style is clean and easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Figures and tables could be organized better and placed near their text references, further smoothing the reading experience.

    The manuscript claims to present the first conditional endoscopic video generation framework, which neglects prior works such as Endora [C. Li et al., 2024].

    One major advantage of autoregressive models over traditional denoising diffusion models is their capability for long sequence generation. However, the proposed SGP method of reorganizing frames into a single grid-like image counteracts this benefit.

    For a comprehensive evaluation, the manuscript lacks a comparison to a standard 2D diffusion model trained on the SGP data representation.

    The metric usage explanation could be more precise by highlighting superior scores in tables and indicating whether lower or higher scores are optimal. For example, the LPIPS metric can be a similarity measure or quantify sample diversity. It is not clear which is the case here. Since the authors claim improved diversity from SAT, I would expect higher scores to be better.

    The fidelity of the qualitative SurgVisdom results in Figure 4 is rather poor. To further judge the quality of generated sequences, it would be beneficial to include examples of all datasets in the supplementary material, not just results from training on HyperKvasir.

    The downstream experiments are quite similar to [C. Li et al., 2024]. They make no good use of the conditional generation setup, e.g., by evaluating improvements in classification.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the list of strengths and weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal addresses many of the reviewers’ comments, such as formatting issues and unclear points regarding the methodology and experiments, and commits to improving those sections in the revision.

    However, major issues were not fully understood and can potentially not be addressed in the revision due to the MICCAI 2025 guidelines. This includes the comments on the “synthetic only” results (DiffTumor does NOT report synthetic-only results, but results from generative augmentation), the weak and poorly motivated downstream task (new classification results must not be included in the revision), the missing comparison to baselines (a 2D DDM on SGP; additional experiments must not be included), and related work (the manuscript could have compared to Endora for unconditional generation to obtain baseline results).

    The rebuttal also does not fully address my concerns regarding the use of a grid representation in an AR model. The number of frames is limited by the grid size, which is limited by the computational resources. I do not believe a VQ-GAN for 64x3x256x256 images can be trained on a single consumer GPU with 11 GB of VRAM. While the AR model might be fast and efficient in training, the VQ-GAN will be the bottleneck here.

    The response also does not address my concerns regarding the fidelity of the SurgVisdom results in any way.

    Hence, I recommend rejecting the manuscript in its current form.



Review #3

  • Please describe the contribution of the paper

    The paper presents a method for endoscopic video generation called EndoGen. It is an autoregressive model with a Spatiotemporal Grid-Frame Patterning strategy and a Semantic-Aware Token Masking mechanism. The authors compare their method on two datasets, HyperKvasir and SurgVisdom, where it outperforms earlier methods like SimDA, VDM, or VidGPT, most by a large margin with respect to the metrics FVD, CD-FVD, FID, and LPIPS. They further conducted an ablation study and present two downstream tasks (semi-supervised polyp segmentation and endoscopy 3D scene reconstruction).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    To me, the major strengths of the paper are the excellent results of the proposed method. The paper reads well, discusses a large number of high-quality and recent references, and the visual quality of the created images is very high. The methodology seems solid and the experiments seem to have been conducted with the required rigor (although some details may be missing; see below). I like seeing that this paper is one of the few I’m reviewing this year that does not try to cheat by shortening references or abbreviating all journal/conference names. Thanks for setting a good example :)

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Please take these comments with a grain of salt, as my expertise is not video generation.

    I have a hard time understanding how Spatiotemporal Grid-Frame Patterning can map the temporal sequence into a spatial representation. Does it mean that the method creates a large image with the temporal frames next to each other? Could the authors describe this aspect in more detail?

    The formatting of the paper is unfortunate. The text reads quite well, but whenever I saw a figure or table, its reference seemed very far away in the text, so I found it hard to relate the figures and tables to their corresponding sections; e.g., Table 1 seems to be referenced much later. Usually, figures and tables should be placed close to their references, and usually after the reference in the text. Some figures and tables seem not to be referenced in the text at all.

    The captions of the tables and figures are often too short for me. I can hardly understand what the tables and figures show, because I don’t know what I should be seeing. In particular, the tables use abbreviations that were not introduced before. Table 1 states that the best performance is indicated by bold text, but this is only given for the average. Fig. 5b is completely incomprehensible to me.

    The Experiments section may have been written in a rush, as the textual clarity is lower than at the beginning of the paper. Several implementation details seem to be missing. Even if the source code is announced to be published, it would still be nice to read directly about the programming environment used, the libraries and their versions, hardware requirements, and computation time. The ablation study does not seem to indicate on which dataset it was conducted.

    The downstream task of Endoscopy 3D Scene Reconstruction seems very weak to me and feels like an afterthought. I can’t understand the corresponding figure, and I feel this paragraph is the weakest of the paper, which is unfortunate since it is the final part, which the audience might remember most. I think the paper would be improved by either removing this paragraph and spending the gained space on other, stronger sections, or by greatly reworking it to make it equally strong.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I have a few detailed comments:

    Abstract: The first sentence feels very unmotivated and is not convincing as an opening sentence.

    Introduction: It seems to me that Fig. 1 is nowhere referenced in the text? Is Semantic-Aware Token Masking (SAT) an established abbreviation? I find it confusing that the “Masking” is apparently part of the abbreviation but does not get its own letter.

    Methodology: “we fed them into the AR model” -> “feed”? “and reconstructed with AR model in an autoregressive manner” -> “with the AR model”? The SAT abbreviation in Sec. 2.2 has already been introduced before. “split the feature with (T × L)/H segments, with each has a token length of H.” - This sentence seems slightly mis-formulated; furthermore, H seems not to be explicitly defined, as opposed to B, T, L, and C. “With SAT, the model is encouraged to generate videos that are not only temporally coherent but also clinically meaningful, addressing a critical limitation of existing video generation methods.” - This sentence is too general in my opinion; especially the “clinically meaningful” part is too bold. I would argue that there is no single mathematical metric that defines clinical meaningfulness.

    Experiments: Table 1 is hard to read, as the numbers from individual columns are too close to each other. What are “Bar”, Eso, Ecto, and Perf? What is the metric compared here? Why is only the Avg. in bold? Section 3.1 has a multitude of incorrectly placed articles (“the”); please re-read this paragraph! I find Table 5 confusing. What do Lab., Unl. Real, and Unl. Syn. mean? What is compared against what in which rows? “compared to other methods [8, 29] that shows distorted or blurry” -> “show”. Fig. 5b is, to me, impossible to understand without any description of what the rows and columns are; the caption does not help. “From Tab. 5 and Fig. 5(a), replacing real with synthetic data could” - The beginning of this sentence seems incorrectly formulated.

    References: Reference [20] could spell out NeurIPS.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method seems sound to me and the results are very promising. Most of my criticism concerns editorial issues, which can likely be easily fixed. The writing is decent and there are many good explanations and results in the paper. If the editorial issues are fixed, the writing in the second half of the paper is improved, more implementation details are provided, and especially the captions of the tables and figures are improved, I (with my limited background in video generation) recommend this paper for a ‘weak accept’.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I see that the other two reviewers recommend rejection, which gives this paper poor chances.

    I am worried about the other reviewer’s comment: “the results obtained solely from synthetic data are even higher than those obtained solely from clinical data, which is highly questionable.” The authors gave a good answer, but I cannot verify whether it is true.

    Other than that, I think all the changes promised by the authors will make this an “acceptable” paper, especially as many issues from the reviewers seem to be editorial in nature.

    In contrast to my fellow reviewers I will recommend acceptance.




Author Feedback

We appreciate the reviewers’ constructive feedback and address major concerns below.

R1Q1: Lack of detail on conditions. A: Thanks for the comment. The conditional token is indexed from a set of learnable embeddings and serves as the starting prefill token. Starting from it, the model generates a sequence of video tokens autoregressively. Without the conditional token, the model can only generate samples of random classes and fails to produce videos of the desired class when needed by doctors. We’ll expand on this in the revision.
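
To illustrate the described prefilling scheme, here is a minimal sketch of class-conditional AR sampling; the model interface (model(...), model.token_embed) and the embedding lookup are assumptions for exposition, not the authors' implementation.

    import torch

    @torch.no_grad()
    def generate_video_tokens(model, cond_embed, class_id, num_tokens):
        # The learnable conditional embedding for the requested class is the sole
        # prefill token; video tokens are then sampled from it one step at a time.
        seq = cond_embed(torch.tensor([[class_id]]))        # (1, 1, C)
        ids = []
        for _ in range(num_tokens):
            logits = model(seq)[:, -1]                      # next-token logits, (1, V)
            nxt = torch.multinomial(logits.softmax(-1), 1)  # sample one token id
            ids.append(nxt)
            seq = torch.cat([seq, model.token_embed(nxt)], dim=1)
        return torch.cat(ids, dim=1)                        # ids for the VQGAN decoder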

R1Q2: Only CE loss? How does it lead to strong performance? A: Yes. Like advanced large AR language and image models such as LLaMA and Janus, we also use a simple CE loss. The strong performance comes from SGP with the AR training paradigm, which models discrete data distributions by sequentially predicting the next token, unlike the continuous iterative patterns in diffusion. This discrete approach yields more precise generation and reduces noisy artifacts.
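
For completeness, the standard next-token cross-entropy objective the rebuttal refers to can be written in a few lines; this is a generic sketch, with tensor shapes assumed for illustration.

    import torch.nn.functional as F

    def ar_ce_loss(logits, target_ids):
        # logits: (B, N, V) AR model predictions; target_ids: (B, N) ground-truth
        # VQGAN code indices. Shift by one so position t is predicted from tokens < t.
        return F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            target_ids[:, 1:].reshape(-1),
        )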

R1Q3: Synthetic better than real data? A: Synthetic data can now outperform real data because it offers greater diversity and can encompass variations not typically seen in real datasets. Prior work such as DiffTumor (CVPR 2024) also reported such findings.

R1Q4: Metrics, and why the values are large. A: We mainly use FVD [24], following SOTA video generation works: FVD = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}), where μ_r and μ_g are the means of the feature distributions for real and generated videos, and Σ_r and Σ_g are the covariance matrices. The term Σ_r + Σ_g can cause a large FVD when the dataset is diverse. In video generation tasks, the goal of using FVD is to measure differences between models; relatively lower values mean the generated videos are closer to the real data.
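
As a reference, the formula above can be computed directly from fitted feature statistics. The following is a generic sketch of the Fréchet distance computation (using SciPy's matrix square root), not the paper's evaluation code.

    import numpy as np
    from scipy import linalg

    def fvd(mu_r, sigma_r, mu_g, sigma_g):
        # Fréchet distance between Gaussians fitted to real and generated
        # video features (mu: (d,), sigma: (d, d)).
        covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # discard tiny numerical imaginary parts
        return float(((mu_r - mu_g) ** 2).sum()
                     + np.trace(sigma_r + sigma_g - 2.0 * covmean))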

R2Q1: How SGP works. A: SGP arranges video frames in a sequential, row-by-row format within a large image, which ensures the frames maintain their temporal dynamics when processed by the AR model.
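
For illustration, a minimal sketch of this grid rearrangement in PyTorch follows; the function names and the (rows, cols) layout parameters are our own assumptions for exposition, not code from the paper.

    import torch

    def frames_to_grid(video, rows, cols):
        # video: (T, C, H, W) with T == rows * cols. Frames are tiled row by
        # row, so raster-order generation by the AR model follows temporal order.
        T, C, H, W = video.shape
        assert T == rows * cols, "grid must hold exactly T frames"
        grid = video.view(rows, cols, C, H, W)
        return grid.permute(2, 0, 3, 1, 4).reshape(C, rows * H, cols * W)

    def grid_to_frames(grid, rows, cols):
        # Inverse operation: recover the (T, C, H, W) video from the grid image.
        C, GH, GW = grid.shape
        H, W = GH // rows, GW // cols
        frames = grid.view(C, rows, H, cols, W).permute(1, 3, 0, 2, 4)
        return frames.reshape(rows * cols, C, H, W)

A 16-frame 256x256 video, for example, becomes a single 4x4 grid image of size 1024x1024 that a standard image tokenizer and AR model can process.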

R2Q2: 3D task. A: We reconstruct a 3D scene from our generated video. In Fig. 5b, the top row shows views rendered from our 3D scene, and the bottom row shows the GT views. This task is designed following [11] and aims to show that our generated videos have high fidelity and robust geometric structure. Due to the page limit, we admit this part is not sufficiently detailed. We commit to removing it to strengthen the preceding sections and will provide more details in a later extended version. Thank you!

R2Q3: Implementation details. A: EndoGen is trained on a single RTX 4090 GPU with PyTorch 2.2, requiring only 11 GB of memory and 2.8 s to generate a video.

R2Q4: Formatting. A: We commit to addressing the editorial issues, improving the experiments and implementation details, and enhancing the captions. The abbreviations in Tabs. 1 and 2 refer to the 8 and 3 conditions in the corresponding datasets; full class names are omitted due to the page limit.

R3Q1: Neglect of Endora. A: Endora is limited to unconditional generation and fails to produce the desired conditional videos needed by doctors. We respectfully point out that this discussion is in the introduction, and we’ll clarify it.

R3Q2: SGP diminishes AR long-sequence generation. A: We consider that, instead of diminishing the benefit of AR models, SGP harnesses their strength by focusing on inter-frame continuity through a grid-like structure. This reorganization prioritizes temporal consistency and detail preservation, which are essential for endoscopic sequences. Moreover, compared to diffusion-based Endora, which generates only 16-frame videos, our model produces 4x longer videos, demonstrating that SGP can stimulate the long-range capabilities of AR models.

R3Q3: Comparison to diffusion + SGP. A: We fine-tuned Stable Diffusion 1.5 with SGP on 4 GPUs and achieved an FVD of 1759.3 (ours: 507.2).

R3Q4: Metric usage. A: We use LPIPS as a similarity measure, and all metrics are lower-is-better. We’ll add this notation.

R3Q5: Downstream tasks. A: Thanks for the comment. We trained ResNet50 and Swin-T for classification on HyperKvasir with 5-fold CV. With our synthetic data, the performance improved remarkably (ResNet50: 86.1%; +ours: 88.7%. Swin-T: 90.4%; +ours: 93.6%).




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Introducing visual autoregressive models to the surgical domain is a meaningful contribution. However, reviewers pointed out several main issues, including the lack of comparison to baselines and related work such as Endora.

    There was a debate regarding R1Q3 about synthetic-only results. After checking the evidence provided in the rebuttal, I stand with R4 and consider the evidence in the rebuttal to be misleading.

    New results are given in the rebuttal but are ignored as per MICCAI policy.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


