Abstract

In ophthalmic surgery, developing an AI system capable of interpreting surgical videos and predicting subsequent operations requires numerous ophthalmic surgical videos with high-quality annotations, which are difficult to collect due to privacy concerns and the labor required for annotation. Text-guided video generation (T2V) emerges as a promising solution to this problem by generating ophthalmic surgical videos based on surgeon instructions. In this paper, we present Ophora, a pioneering model that can generate ophthalmic surgical videos following natural language instructions. To construct Ophora, we first propose a Comprehensive Data Curation pipeline to convert narrative ophthalmic surgical videos into a large-scale, high-quality dataset comprising over 160K video-instruction pairs, Ophora-160K. We then propose a Progressive Video-Instruction Tuning scheme that transfers rich spatial-temporal knowledge from a T2V model pre-trained on natural video-text datasets to privacy-preserved ophthalmic surgical video generation based on Ophora-160K. Video quality evaluation via quantitative analysis and ophthalmologist feedback demonstrates that Ophora can generate realistic and reliable ophthalmic surgical videos based on surgeon instructions. We also validate Ophora's capability to empower downstream tasks in ophthalmic surgical workflow understanding. Code is available at https://github.com/mar-cry/Ophora.
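
As a minimal illustration of the progressive tuning schedule described above (adapt a pre-trained T2V denoiser on the full Ophora-160K corpus, then continue on the smaller privacy-filtered subset mentioned in the reviews below), consider the PyTorch sketch that follows. The tiny denoiser, the random stand-in latents, and all hyperparameters are illustrative assumptions, not the paper's actual CogVideoX-based implementation.

    import torch
    from torch import nn
    from torch.utils.data import DataLoader, TensorDataset

    # Toy stand-in for the video diffusion backbone (the real model is far
    # larger); every name and number here is illustrative, not from the paper.
    class TinyDenoiser(nn.Module):
        def __init__(self, latent_dim: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(latent_dim + 1, 128), nn.SiLU(), nn.Linear(128, latent_dim)
            )

        def forward(self, noisy_latent, t):
            # Condition the denoiser on the (normalized) diffusion timestep.
            return self.net(torch.cat([noisy_latent, t.unsqueeze(-1)], dim=-1))

    def denoising_loss(model, latents):
        """One simplified noise-prediction step (stands in for the diffusion loss)."""
        noise = torch.randn_like(latents)
        t = torch.rand(latents.size(0))  # random timestep per sample, in [0, 1]
        noisy = (1 - t.unsqueeze(-1)) * latents + t.unsqueeze(-1) * noise
        return nn.functional.mse_loss(model(noisy, t), noise)

    def finetune(model, loader, epochs, lr):
        opt = torch.optim.AdamW(model.parameters(), lr=lr)
        for _ in range(epochs):
            for (latents,) in loader:
                opt.zero_grad()
                denoising_loss(model, latents).backward()
                opt.step()

    model = TinyDenoiser()
    # Stage 1 (transfer pre-training): adapt on the full instruction-paired corpus.
    full_corpus = DataLoader(TensorDataset(torch.randn(512, 64)), batch_size=32)
    finetune(model, full_corpus, epochs=1, lr=1e-4)
    # Stage 2 (privacy-preserving fine-tuning): continue on the smaller,
    # privacy-filtered subset, here at a lower learning rate.
    filtered_subset = DataLoader(TensorDataset(torch.randn(64, 64)), batch_size=32)
    finetune(model, filtered_subset, epochs=1, lr=1e-5)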

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1768_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/mar-cry/Ophora

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiWei_Ophora_MICCAI2025,
        author = {Li, Wei and Hu, Ming and Wang, Guoan and Liu, Lihao and Zhou, Kaijing and Ning, Junzhi and Guo, Xin and Ge, Zongyuan and Gu, Lixu and He, Junjun},
        title = {{Ophora: A Large-Scale Data-Driven Text-Guided Ophthalmic Surgical Video Generation Model}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {426--436}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes using a latent diffusion model to generate ophthalmic surgical videos from text. To train the model, the authors curate a dataset of 160K videos, with a smaller subset of 28K videos that is privacy-preserving. The model is pre-trained on the larger dataset and fine-tuned on the smaller one to preserve privacy.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Public code for data generation
    • Thorough data curation and documentation of the dataset refinement process
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Unclear whether any data will be made public, or just the code
    • Fine-tuning on a smaller dataset does not seem to guarantee privacy. Are there any checks afterwards?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • In Fig. 2, there are visible inconsistencies from frame to frame and instruments don’t always appear straight. Please comment on how this would affect downstream tasks.
    • Is Fig. 3 showing the average of the three surgeons’ ratings? Reporting inter-rater reliability for the video quality scores would help in understanding the qualitative evaluation.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The potential to generate privacy-preserving videos for ophthalmic surgery is exciting. However, there is a lack of discussion on how artifacts in the generated videos will affect downstream tasks.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces Ophora, a novel text-guided ophthalmic surgical video generation model. Its key contributions include developing Ophora-160K, a dataset of 160K+ video-instruction pairs created through a comprehensive data curation pipeline, and implementing a Progressive Video-Instruction Tuning approach that transfers knowledge from pre-trained text-to-video models for privacy-preserved surgical video generation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper created Ophora-160K, a large-scale, high-quality dataset of ophthalmic surgical video-instruction pairs through a comprehensive curation pipeline.
    2. The Progressive Video-Instruction Tuning method effectively transfers spatial-temporal knowledge from natural video domains to specialized medical contexts.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper mentions that existing methods struggle to capture fine-grained actions and intricate interactions between instruments and anatomical structures. Their proposed solution is building higher-quality video-text pairs. However, the high-quality Ophora-160K dataset is created merely by removing redundant narrative information and filtering poor-quality clips. The authors fail to demonstrate how these approaches specifically improve the generation of fine-grained actions.
    2. The paper’s architectural innovation is limited. The proposed generation framework is entirely based on CogVideoX-2b, with two-stage training on the Ophora-160K dataset, but lacks framework improvements specifically tailored to the dataset.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper’s proposed framework offers limited architectural innovation, its significant contribution lies in the development of Ophora-160K, a large-scale ophthalmic surgical dataset that will benefit future multimodal surgical research endeavors.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces Ophora, a text-to-video (T2V) generation model specifically designed to synthesize realistic ophthalmic surgical videos from natural language instructions. It is among the first T2V models for ophthalmic surgical video generation. A systematic process is proposed to convert narrative surgical videos (from OphVL) into a large-scale dataset of high-quality video clips paired with generation instructions. A two-stage transfer learning approach is applied: 1) Transfer Pre-Training (TPT) to adapt the general T2V model to the ophthalmic domain using Ophora-160K; 2) Privacy-Preserving Fine-Tuning (P2FT) using a visually filtered subset. Quantitative evaluation shows that Ophora generates videos with higher realism and text-video consistency than the baselines.
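
    As a purely hypothetical illustration of the data-preparation step behind P2FT, the control flow of an LVLM-based privacy filter might look like the sketch below. The paper does not publish its prompt or filtering criteria (a point raised in the weaknesses below), so every name and the prompt text here are assumptions.

        from dataclasses import dataclass
        from typing import Callable, List

        @dataclass
        class Clip:
            path: str
            instruction: str

        # Hypothetical stand-in for an LVLM query; the paper does not disclose
        # its actual prompt or decision criteria, so this only sketches control flow.
        def lvlm_flags_sensitive(frame_paths: List[str]) -> bool:
            prompt = (
                "Do any of these frames contain patient-identifying content "
                "(faces, name tags, on-screen metadata)? Answer yes or no."
            )
            # A real implementation would send `prompt` plus the sampled frames
            # to an LVLM and parse its answer; here nothing is flagged.
            _ = (prompt, frame_paths)
            return False

        def privacy_filter(clips: List[Clip],
                           sample_frames: Callable[[str], List[str]]) -> List[Clip]:
            """Keep only clips whose sampled frames the LVLM judges non-sensitive."""
            return [c for c in clips if not lvlm_flags_sensitive(sample_frames(c.path))]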

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Tackles the significant challenge of acquiring sufficient high-quality, annotated surgical video data, particularly in the field of ophthalmology, by applying advanced text-to-video diffusion models to generate synthetic videos.
    2. Explicitly addresses privacy concerns by filtering sensitive visual information before the final fine-tuning stage.
    3. Qualitative examples and quantitative evaluation using FID, FVD, and CS (see the sketch after this list) indicate the robustness of the method. Using Ophora’s videos for data augmentation demonstrates tangible downstream utility by showing significant improvement in surgical workflow analysis. Ophora significantly outperforms baseline T2V models (even when fine-tuned on the same data) in generation quality and text alignment.
    4. A well-thought-out data curation pipeline was introduced, including LLM-based instruction refinement and LVLM-based visual filtering for privacy.
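
    Of the metrics above, the text-video consistency score (CS) is typically computed as a frame-averaged CLIP similarity between the instruction and the generated frames. A rough, generic sketch using Hugging Face's public CLIP checkpoint (an assumption; not necessarily the paper's exact protocol):

        from typing import List

        import torch
        from PIL import Image
        from transformers import CLIPModel, CLIPProcessor

        model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
        processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

        def clip_score(frames: List[Image.Image], instruction: str) -> float:
            """Average cosine similarity between one instruction and each video frame."""
            inputs = processor(text=[instruction], images=frames,
                               return_tensors="pt", padding=True)
            with torch.no_grad():
                out = model(**inputs)
            text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
            image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
            # One similarity per frame against the single text embedding, then average.
            return (image_emb @ text_emb.T).mean().item()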

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. The study lacks discussion of the training time required for Ophora and the inference time required to generate videos, which could be a practical barrier to large-scale data augmentation.
    2. The generated videos are relatively short. The authors acknowledge exploring longer durations as future work. It is unclear how well the model handles very complex, multi-step instructions or maintains coherence over the significantly longer time spans relevant to full surgical procedures.
    3. More thorough details about the prompts used for refining text instructions and the exact criteria the LVLM used to detect “sensitive visual information” would improve reproducibility and clarity.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) Consider adding a brief discussion on the computational aspects (training/inference time). 2) Expanding slightly on the limitations regarding generation length/complexity and data diversity dependency would provide a more balanced perspective. 3) Providing example prompts or more detailed criteria for the LLM/LVLM filtering stages in an appendix could be beneficial for transparency.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a valuable contribution by successfully applying T2V generation to the ophthalmic surgery field. The approach to leverage pre-trained models via progressive tuning and the emphasis on data curation and privacy are commendable. The demonstration of downstream utility via data augmentation is particularly compelling.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

N/A




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


