Abstract

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for machine learning. Despite progress in generating 2D medical images, the complex domain of clinical video generation has largely remained untapped. This paper introduces Endora, an innovative approach to generate medical videos to simulate clinical endoscopy scenes. We present a novel generative model design that integrates a meticulously crafted video transformer with advanced 2D vision foundation model priors, explicitly modeling spatial-temporal dynamics during video generation. We also pioneer the first public benchmark for endoscopy simulation with video generation models, adapting existing state-of-the-art methods for this endeavor. Endora demonstrates exceptional visual quality in generating endoscopy videos, surpassing state-of-the-art methods in extensive testing. Moreover, we explore how this endoscopy simulator can empower downstream video analysis tasks and even generate 3D medical scenes with multi-view consistency. In a nutshell, Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting a substantial stage for further advances in medical content generation. Project page: https://endora-medvidgen.github.io/.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0470_paper.pdf

SharedIt Link: https://rdcu.be/dV5xd

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72089-5_22

Supplementary Material: N/A

Link to the Code Repository

https://endora-medvidgen.github.io/ https://github.com/CUHK-AIM-Group/Endora

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Li_Endora_MICCAI2024,
        author = { Li, Chenxin and Liu, Hengyu and Liu, Yifan and Feng, Brandon Y. and Li, Wuyang and Liu, Xinyu and Chen, Zhen and Shao, Jing and Yuan, Yixuan},
        title = { { Endora: Video Generation Models as Endoscopy Simulators } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {230 -- 240}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose Endora, a video diffusion model for endoscopy. Following recent trends in video generation, they incorporate a latent denoising diffusion probabilistic model (DDPM) into transformer blocks (previously coined Diffusion Transformers, or DiTs for short, by Peebles et al., ICCV 2023). They include foundation-model features (DINOv2) to guide video generation and adapt the feature distribution to the target domain via normalization through Pearson correlation. They include qualitative and quantitative evaluations. Furthermore, the authors evaluate the model’s capability on downstream tasks such as semi-supervised learning and 3D reconstruction, demonstrating the potential of synthetically generated video in various applications.
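
    For readers unfamiliar with this kind of guidance, the sketch below illustrates what a Pearson-correlation alignment term between frozen DINOv2 features and the generator's intermediate features could look like; the tensor shapes, function names, and standardization choice are assumptions for illustration, not the authors' implementation.

    import torch

    def pearson_alignment_loss(gen_feats: torch.Tensor, dino_feats: torch.Tensor) -> torch.Tensor:
        """Illustrative correlation-based alignment between generator features and
        frozen DINOv2 features, both assumed to be shaped (batch, tokens, channels)
        and already projected to the same dimensionality."""
        def standardize(x: torch.Tensor) -> torch.Tensor:
            x = x.flatten(1)  # (batch, tokens * channels)
            return (x - x.mean(dim=1, keepdim=True)) / (x.std(dim=1, keepdim=True) + 1e-6)

        g = standardize(gen_feats)   # standardizing removes scale and offset, so only
        d = standardize(dino_feats)  # the correlation structure is matched across domains
        corr = (g * d).mean(dim=1)   # per-sample Pearson correlation, in [-1, 1]
        return (1.0 - corr).mean()   # minimized when the features are fully correlated

    According to the reviews, the correlation in Endora is computed between attention maps rather than raw features, but the normalization idea is analogous.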

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper approaches an important topic, as accurate and controllable video generation will open up many possibilities in endoscopy and laparoscopy.
    • The authors evaluate the model’s capability on downstream tasks such as semi-supervised learning and 3D reconstruction, demonstrating the potential of synthetically generated video in various applications.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The evaluation methods could be more comprehensive. The authors omit a potentially important method [1] presented at ICCV 2023, which has an official implementation available. A detailed comparison or ablation study with respect to DiTs [1] would help gauge the significance of the other technical contributions.
    • While the paper applies current trends in DiTs to endoscopy, the most significant novel contribution appears to be the integration and adaptation of pre-trained Dino features to guide video generation. However, this idea is only briefly explored and, according to Table 2, marginally contributes to performance.
    • To improve the qualitative evaluation, consider including the more recent diffusion-based baseline alongside the older GAN-based methods. For consistency, it would be helpful to include all baselines in both the qualitative comparison and the downstream task evaluation in Table 2 (or only the most relevant, DiT based methods).
    • The introduction could be more focused and concise. Instead of emphasizing the perceived impact of the work, the authors should clearly contextualize the proposed contributions within the related works in video generation, both inside and outside the medical domain. What makes this problem particularly challenging in endoscopy? The authors mention things like “fluidity of real-life medical procedures”, but this is not specific.
    • To better demonstrate the quality of the generated videos, the authors could include supplementary material, such as a video demonstration or comparison.
    • The pipeline figure could be improved to more clearly depict the diffusion and noise injection process.

    1: Peebles, William, and Saining Xie. “Scalable diffusion models with transformers.” In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4195-4205. 2023.

    Minor:

    • The writing style in some parts of the paper could be more objective and tempered. Statements such as “Endora marks a notable breakthrough in the deployment of generative AI for clinical endoscopy research, setting a substantial stage for further advances in medical content generation” and “the dire need for … clinical endoscopy videos” should be rephrased to maintain a more professional and unbiased tone.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No aspects of reproducibility are mentioned, but as code is available for similar methods like DiTs this would not be majorly prohibitive.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The application is impactful but without a comparison to diffusion transformers it is not possible to gauge the novelty of the methods. Instead, the authors could opt to focus on applications which make video generation in this domain particularly challenging due to the unique environment.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The most significant aspects to improve the paper would be adequate comparison to DiTs, better uncovering what makes this domain uniquely challenging for video generation (descriptively and through experiments), and improved presentation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    It is indeed true that the original DiT paper does not mention video generation. However, technical reports of derivative works suggest that incorporating temporal information amounts to little more than an encoder, tokenization, and a decoder around the DiT, which does the heavy lifting. After a closer review, and given the existence of similar work on video diffusion with transformers (using a similar interlacing of spatio-temporal blocks) prior to submission [1], the experiments conducted by the authors leave me unconvinced of the technical novelty. Furthermore, while I agree with the authors that there are unique challenges in this domain, little effort is made to turn this knowledge into technical insights that support the arguments and would incline me to use Endora over a general-purpose method.

    Overall, the domain is interesting and valuable to the community. However, the presentation and the experiments conducted do not convince me that the paper should be accepted. Furthermore, the technical contributions are limited.

    1: Ma, X., Wang, Y., Jia, G., Chen, X., Liu, Z., Li, Y. F., … & Qiao, Y. (2024). Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048.



Review #2

  • Please describe the contribution of the paper

    The authors propose an endoscopic video generation model. They use a diffusion model with spatial-temporal (3D) transformer blocks. They also use a pretrained 2D DINO model to guide the training by distilling information. They use a diffusion loss (ELBO) and a correlation loss between the attention maps of the prior DINO model and their temporal diffusion model. They perform ablation studies and experiments on downstream tasks, namely disease diagnosis and 3D reconstruction. They also compare with other methods.
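
    As a rough sketch of how these two terms might combine (the weighting \lambda and the exact noise-prediction parameterization are assumptions, not taken from the paper):

    \mathcal{L} \;=\; \underbrace{\mathbb{E}_{x_0,\,\epsilon,\,t}\big[\lVert \epsilon - \epsilon_\theta(x_t, t) \rVert_2^2\big]}_{\text{simplified diffusion (ELBO) term}} \;+\; \lambda\,\underbrace{\big(1 - \rho(A^{\mathrm{DINO}}, A^{\theta})\big)}_{\text{attention-correlation term}},

    where \rho(\cdot,\cdot) denotes the Pearson correlation between the frozen DINO attention maps A^{\mathrm{DINO}} and the corresponding attention maps A^{\theta} of the video diffusion transformer, and \lambda balances the two terms.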

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of the paper is that the authors perform various experiments. Indeed, video generation and synthetic data in endoscopy can be very useful. The authors use state-of-the-art components such as diffusion models, spatial-temporal transformers, and foundation models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The main weaknesses of the paper are:

    The lack of a literature review. The authors do not mention any work on endoscopic image generation and do not include a clear literature review of general video generation. Thus, it is unclear how they chose the methods to compare against. The authors should provide a literature review covering both general and endoscopy-specific related work and its limitations, and clearly state the paper’s novelty relative to that related work. As it stands, it is hard to clearly identify the novelty of the paper.

    The introduction is also repetitive. It would be better if it were more concise and to the point.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have code on GitHub; however, they do not include a link in the paper.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. No clear literature review present. Papers that could be included:
      • EndoVAE: Generating Endoscopic Images with a Variational Autoencoder
      • Augmenting Colonoscopy using Extended and Directional CycleGAN for Lossy Image Translation
    2. It is unclear what “CLIP” refers to.
    3. Kvasir-Capsule was referenced with the wrong paper.
    4. The size of the training data is unclear.
    5. Are the compared models also trained on the same endoscopic data?
    6. It is unclear why not all models are compared in every experiment. For example, only LVDM is compared against in Table 2, whereas it is not included in Fig. 2.
    7. In Fig. 3, the last image on the right does not appear to correspond to the image it is paired with.
    8. The numbers in Fig. 3 are not discussed. It would be clearer to include an additional table with averaged quantitative results.
    9. Supplementary material with videos could be a good addition.
    10. Showing the 3D reconstruction point clouds would also be helpful.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Even though it is clear that the authors have put a lot of work into the paper, including the framework and the experiments, the novelty remains unclear without a clear literature review.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    After the rebuttal, the authors clarified that there are no generative endoscopic methods that generate videos, only ones that generate images. This makes it clear why the literature review seemed lacking. I therefore changed my score to accept, since the method is novel in the endoscopic field. However, for the camera-ready version, I still suggest the authors address the minor comments mentioned in my review and, more importantly, make the introduction more concise and to the point and state the exact contributions, so that readers can clearly identify the novelty of the paper.



Review #3

  • Please describe the contribution of the paper

    This work proposes an approach, called Endora, to generate medical videos to simulate endoscopic surgical scenes via generative transformer models (as an alternative to generative diffusion models). It uses cascading spatio-temporal transformer blocks with convolution layers that learn from the real video content and DINO for feature extraction. The model is trained and tested on three public datasets and shows good performance in quantitative evaluations with the FVD, FID, and IS metrics.
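
    For context, FID and FVD both report the Fréchet distance between Gaussian fits of real and generated feature distributions (per-frame Inception features for FID, video-level I3D features for FVD):

    d^2\big((\mu_r,\Sigma_r),(\mu_g,\Sigma_g)\big) = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\big(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}\big),

    where (\mu_r,\Sigma_r) and (\mu_g,\Sigma_g) are the empirical feature means and covariances of the real and generated videos; lower is better for FID and FVD, while higher is better for IS.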

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed method is novel and the idea is very interesting.

    The evaluation results are very promising.

    There are several practical applications for the method (e.g., data augmentation, depth rendering, teaching and training, etc.)

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There is no subjective study – the actual quality of the generated videos should also be evaluated by human experts in a user study.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is usually hard to assess whether really enough details are provided to reproduce the study.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the ablation study it would be interesting to also test removing the horizontal flipping augmentation, since orientation often plays a vital role in surgical videos. Moreover, at least a qualitative study should be performed to assess the generated videos by human experts.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting idea that follows the current trend of generated visual content for the medical domain. The proposed method could have high impact in the medical field.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

First, we thank all the reviewers for their valuable suggestions. We are grateful that the reviewers generally find this work novel, interesting, important, and useful.

[R#1, Q1] Subjective Study for Generated Videos? We use standard video generation metrics, including FVD, FID & IS, for evaluation. Our method outperforms the baselines in quantitative results across three datasets and demonstrates its superiority through visual comparisons and two downstream tasks: semi-supervised video classification and 3D Gaussian reconstruction. In addition, we appreciate your suggestion to include a subjective experiment, which would further enhance the comprehensiveness of our method’s evaluation.

[R#3, Q1] Difference & Comparison to DiT? (I) We respectfully disagree that our model is methodologically similar to DiT, and that a direct comparison with DiT would be appropriate, as DiT is only a 2D image generation backbone and the DiT paper does not involve any video-related model designs or experiments. Thank you for pointing this out; we will clarify it in the revision. (II) Although a direct comparison with the image generation architecture DiT is not feasible, we have presented results of a naive adaptation of DiT to the video format as a baseline: as shown in the first row of Table 3 in the paper, our full model significantly outperforms the DiT-adapted video baseline (FVD↓: 611.9 vs. 460.7, FID↓: 22.44 vs. 13.41). (III) We respectfully argue that our model involves substantial effort and methodological innovation beyond DiT. Adapting the DiT backbone to the video format is a complex process due to the need for interaction along the temporal dimension. As detailed in Sec. 2.1, we apply strategies to harness diffusion models for the video format. In Sec. 2.2, we introduce a spatio-temporal interlaced transformer that effectively extracts features from both the spatial and temporal dimensions.
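
To make the described adaptation concrete, the following minimal sketch shows what interlacing spatial and temporal attention over per-frame patch tokens can look like; the module names, shapes, and example sizes are illustrative assumptions, not the released Endora code.

import torch
import torch.nn as nn

class InterlacedSTBlock(nn.Module):
    """One spatial-attention block followed by one temporal-attention block.

    Input is assumed to be shaped (batch, frames, tokens, channels), i.e. a
    sequence of per-frame patch tokens produced by a video tokenizer.
    """
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, f, n, c = x.shape
        # Spatial attention: tokens within each frame attend to one another.
        s = x.reshape(b * f, n, c)
        s_norm = self.norm1(s)
        s = s + self.spatial_attn(s_norm, s_norm, s_norm)[0]
        # Temporal attention: each spatial position attends across frames.
        t = s.reshape(b, f, n, c).permute(0, 2, 1, 3).reshape(b * n, f, c)
        t_norm = self.norm2(t)
        t = t + self.temporal_attn(t_norm, t_norm, t_norm)[0]
        return t.reshape(b, n, f, c).permute(0, 2, 1, 3)

# Example: 2 videos, 16 frames, 14x14 = 196 patch tokens, 384 channels.
block = InterlacedSTBlock(dim=384)
tokens = torch.randn(2, 16, 196, 384)
out = block(tokens)  # same shape as the input

The key point is that spatial attention mixes tokens within a frame while temporal attention mixes the same spatial position across frames, which is what distinguishes a video backbone from a purely 2D DiT.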

[R#3, Q2] Unique Challenges of Endoscopy Video? Video generation in dynamic scenes, such as endoscopic medical videos, presents unique challenges due to the need to model the temporal consistency and the fluidity and dynamics of endoscopic tissues. As pioneers in this field, our contribution lies primarily in exploring generation for endoscopic medical videos, for which no existing benchmark is available for comparison. This exploration and the insights gained warrant attention from the community.

[R#3, Q3] Comprehensiveness of Experiments? In Tab. 1, we demonstrate a full comparison with GAN and diffusion methods (LVDM: latent video diffusion). In Tab. 2, we present the promising prospect of using generated data to train downstream semi-supervised models, showing a significant improvement (+10.8 F1) over current generation methods. We also present another potential downstream application, 3D reconstruction, in Fig. 3.

[R#4, Q1] Literature review. We acknowledge the suggestion to include additional works on endoscopic image generation for a more thorough literature review. However, we respectfully argue that the Introduction already reviews a range of generative tasks in the medical domain, including GAN- and diffusion-based image generation, image reconstruction, and translation. Compared with this literature, medical video generation, particularly in endoscopic scenarios, is an uncharted area. Our proposed model, distinct from previous studies on medical image generation, presents the first exploration of generating medical videos for endoscopic scenes.

[R#4, Q2] Details.
(I) What is “CLIP”: CLIP [14] denotes Contrastive Language-Image Pre-training. (II) Data size: The training data comprises 210 videos for Colonoscopic, 1000 videos for Kvasir-Capsule, and 580 videos for CholecTriplet. (III) Evaluation protocol: All compared models are trained on the same endoscopic data as ours. (IV) More results for Tab. 2, Fig. 3 & videos: Due to the page limit, we were unable to include all content in the submission; we will include these details in the revision.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While R1 did not provide further details to support their recommendation, the remaining two reviewers were active. The rebuttal provided by the authors was sufficient and resolved the reviewers’ questions. This submission should be accepted as a poster paper. Please also provide justifications for R3’s questions in the final version.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While the paper is interesting, I agree with the reviewer that the technical contribution of the paper is weak compared to recent work.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a promising research direction, and its relative novelty within the endoscopic domain is adequate. The rebuttal provided helpful clarifications.



