Abstract

Automatic surgical video analysis is pivotal in enhancing the effectiveness and safety of robot-assisted minimally invasive surgery. This study introduces a novel procedure planning task aimed at predicting target-conditioned actions in surgical videos to achieve desired visual goals, thereby addressing the question of “What to do to achieve a desired visual goal?”. Leveraging recent advancements in deep learning, particularly diffusion models, our work proposes the Multi-Scale Phase-Condition Diffusion (MS-PCD) framework. This innovative approach incorporates multi-scale visual features into the diffusion process, conditioned by phase class, to generate goal-conditioned plans. By cascading multiple diffusion models with inputs at different scales, MS-PCD adaptively extracts fine-grained visual features, significantly enhancing procedure planning performance in unstructured robotic surgical videos. We establish a new benchmark for procedure planning in robotic surgical videos using the publicly available PSI-AVA dataset, demonstrating that our method notably outperforms existing baselines on several metrics. Our research not only presents an innovative approach to surgical video analysis but also opens new avenues for automation in surgical procedures, contributing to both patient safety and surgical training.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2373_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2373_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zha_See_MICCAI2024,
        author = { Zhao, Ziyuan and Fang, Fen and Yang, Xulei and Xu, Qianli and Guan, Cuntai and Zhou, S. Kevin},
        title = { { See, Predict, Plan: Diffusion for Procedure Planning in Robotic Surgical Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a diffusion model capable of establishing procedural planning when provided with the starting and goal frame information of a specific procedure in surgery. They claim to have enhanced the accuracy of the denoising process by adding a scale dimension to the diffusion process of PDPP [28].

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The authors emphasize that their study is the first to perform procedure planning for surgical videos using a diffusion process.

    • They highlight that the proposed multiscale diffusion process contributes to performance improvement.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the first attempt at procedure planning in surgical videos is highly significant, the technical contribution of [28] appears to be very similar. It is difficult to consider the denoising process for multi-scale selection as a significant contribution.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors utilized publicly available datasets, and it seems straightforward to reproduce their work using the code provided in [28].

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Consideration should be given to the potential for rapid expansion and application of the PDPP model and procedural planning tasks. Additional contributions could include proposing structures that leverage holistic scene information from the given dataset to create useful representations for sequence prediction. Additionally, it may be worthwhile to explore clinically meaningful applications that allow for evaluation in clinical settings.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I’ve been pondering about how much domain contribution should be recognized for the rapid expansion of technology from the computer vision domain to the medical domain. Final rating may be adjusted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I still believe there is a lack of major novelty, even after reviewing the authors’ opinions in the rebuttal.



Review #2

  • Please describe the contribution of the paper
    1. The paper presents a surgical procedure planning task where the model learns to predict a set of action sequences to reach from a current to future surgical scene of robotic surgery videos.

    2. The method introduces a multi-scale formulation to the diffusion model to tackle lower visual variance, and the results show an improvement over the baselines.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is clear and easy to follow.

    2. The task of predicting surgical actions conditioned on the target observation is novel and is crucial for surgical robot automation. The idea to incorporate phase information is straightforward as it provides cues for steps/actions.

    3. The multi-scale approach is clear and shows benefit over fixed scales as the variation in visual cues across steps might be limited.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1.a. Details about the selection branch mechanism (model and its input) are missing.

    1.b. How does it adaptively select the “optimal” input scale as there is no direct supervision for the scale selection?

    2. The ablations are limited. For example, in the implementation details, the model takes 512-d features from an encoder trained on HowTo100M. It would be interesting to observe the performance when the model is pretrained on surgical videos.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please refer to points 1 & 2 in the main weaknesses.

    2. Fig.3 can be improved as it is using a lot of empty spaces.

    3. Dws is not mentioned anywhere else in the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an important task suited for surgical procedural planning and control. The novelty is majorly on the task and less on the architecture. However, the paper introduces multi-scale formulation to improve the performance but does not have enough details to support it. Furthermore, the ablations are limited to only scale selection.

    Based on these observations, I recommend weak-reject.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors provided apt responses to the issues raised on the adaptability of the scale selection method which further impacts the goal observation, all guided through procedure planning objectives. The authors also stressed improving the figure and the descriptions for more clarity.



Review #3

  • Please describe the contribution of the paper

    This paper introduces the task of surgical planning, devises a first benchmark using a public dataset, and establishes the first baseline using a diffusion based approach.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Surgical planning, as introduced by the authors, is an interesting and valuable task.
    • Formulating surgical planning in the form of diffusion is novel in the field of surgical data science, and seems to be effective.
    • The authors provide experimental results for multiple baselines, including their own proposed architecture.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The technical novelty in the authors proposed network compared to the previous works seems to be limited to a multi-scale input used during the diffusion process. While effective, this is only a minor contribution.
    • The organization and writing are unclear in some parts. For example, the exact inputs and outputs of the models are not fully clear, and neither is Figure 1. How the proposed method differs from PDPP, apart from multi-scale diffusion, is not clear. If there are no other differences, why are the results in Table 1 for scale=1 (ours) different from those of PDPP? Simply due to randomness? Figure 3 is difficult to read.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I believe the paper would benefit from focusing more on the task and the novelty of modeling surgical planning using diffusion, as well as on the benchmark. The methodological novelty seems limited compared to PDPP, so placing less emphasis on it would be beneficial. The paper’s writing should be improved according to the suggestions in “weaknesses”. Finally, I encourage the authors to release their code and data to facilitate further research in this new task.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the technical novelty is limited, the new task of surgical planning, as well as the approach of modeling it as a diffusion process are interesting and beneficial for the community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Overall, similar to what I have indicated in my original review, I believe the task, its motivation and formulation are all interesting and valuable.

    The authors addressed my concerns regarding the clarity of their method, and I hope similar clarifications would be made in the final manuscript.

    The authors also promised to release their code, which I appreciate.

    My only concern was the limited architectural novelty, which has not been properly addressed by the authors, and I now believe that this is the biggest weakness of this paper.

    However, I think the overall value of this paper to the SDS community is higher than that of most papers that have more architectural novelty. It is quite rare to see a paper defining, formulating and introducing a new and relevant task to the community. Therefore, regardless of the aforementioned weaknesses, I suggest an acceptance.



Review #4

  • Please describe the contribution of the paper

    The authors ask how to achieve a specific visual goal and incorporate multi-scale visual features into a diffusion process, conditioned by phase class, to generate goal-conditioned plans.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors are proposing new lines of research in the field of computerized procedure planning. Predicting

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The surgical images in the figures (including Fig 4 in the supplement) are very small and hard to understand. Annotating them would be helpful.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors properly reference the source of their data but do not mention the availability of their results, both data and trained network.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    What is the meaning of GT in Fig. 2? Ground truth? Don’t leave the reader guessing

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is interesting work in the field of surgical data science

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thanks for the positive feedback! We are encouraged that the reviewers found the new task of surgical planning interesting (R4), valuable (R4), important, and novel (R5). The idea is considered straightforward (R5), highly significant (R1), clear, easy to follow (R5), and interesting (R3). We are pleased that our method using diffusion is recognized as the first attempt (R1), a new line of research (R3), and the first baseline and benchmark (R4) in the field. The work is also seen as interesting and beneficial for the community (R4). In the final version, we will provide further clarity on the novelty and contributions and expand the discussion section to include more potential improvements and clinical insights related to our approach. We address specific reviewer concerns as follows:

  • Reproducibility (R3, R4): We will make the source code, data, and pre-trained models publicly available by including a link in the final version.

  • Contribution Clarification (R1, R4): Our focus is to introduce a novel task aimed at predicting target-conditioned actions in surgical videos to achieve desired visual goals. This represents a significant shift from existing research that primarily focuses on recognizing or predicting the current or future states in surgical videos. In addition, we construct a dataset and build a strong baseline using the proposed diffusion method to establish a new benchmark for evaluation. Automating the planning of surgical procedures can enhance robotic surgical systems. By predicting the sequence of actions needed to achieve a surgical goal, the system can avoid unnecessary movements and minimize errors, improving patient safety. Our benchmark for procedure planning in surgical videos allows systematic evaluation of different approaches, contributing to future research.

  • Figure Visualization & GT (R3): In the final version, we will add annotations to the surgical images and spell out “Ground Truth” in Fig.2 to avoid ambiguity.

  • Clarity and Organization (R4): Thanks for pointing out the issue. We will provide further clarity on: (1) In Fig.1, the inputs are only the initial and goal observations (Oi and Og), while the outputs are action sequences and the optimal scale selection given the input pair. We will change “start” to “initial” for more clarity. (2) Besides multi-scale diffusion, we propose to employ ViT rather than MLP in PDPP for phase classification (at the end of Sec 2.1), since we found that the phases are very similar, which makes it difficult for MLP to classify accurately. In this regard, our results (s=1) are better than PDPP. (3) We will adjust Fig.3 and add more captions for clarity.

  • Missing Selection Details (R5-1): We clarify that the scale selection branch adjusts input scales as conditions and cascades diffusion models at different steps for training. It is optimized indirectly through the procedure planning objective to avoid additional losses and hyperparameters (as mentioned in the third paragraph of Sec 2.2). The outputs include both predicted action sequences and the optimal scale (see Fig.1), which is then used to update the inputs (Oi and Og) for the next denoising process. The scale selection branch randomly selects the scale during the initial 50 steps of each 200-step epoch. When the scale changes, the initial and goal observations in Fig. 1 are updated, resulting in changes to the action sequence prediction, which in turn affects the loss in Eq. (4). This guides the scale selection branch to achieve more accurate action sequence predictions. We will enrich the descriptions in the final version.

  • Limited Ablations (R5): Thanks for your suggestion. In our future work, we plan to explore features extracted from pre-trained models on surgical videos, e.g., GSViT (arXiv:2403.05949, Mar 2024) for ablation analysis.

  • Empty Spaces & Dws (R5): We will revise Fig.3. ‘Dws’ means diffusion branch for window scale selection, and we will add the descriptions in the final version.
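
The scale-selection schedule described under “Missing Selection Details” can be illustrated with a small sketch. It is not the authors’ implementation: the rebuttal states only that the scale is chosen at random during the first 50 steps of each 200-step epoch and that selection is guided indirectly by the planning loss. The candidate scales, the running-mean score, the greedy choice after the random phase, and every name below (SCALES, ScaleStats, choose_scale, run_epoch, planning_loss) are assumptions made for illustration; in the paper the selection is performed by a learned branch, which is replaced here by a simple statistic.

```python
import random

EPOCH_STEPS = 200    # steps per epoch, as stated in the rebuttal
EXPLORE_STEPS = 50   # initial steps of each epoch that pick the scale at random
SCALES = (1, 2, 3)   # hypothetical candidate scales (not given in the source)


class ScaleStats:
    """Running mean of the planning loss observed at each scale (assumed signal)."""

    def __init__(self, scales):
        self.total = {s: 0.0 for s in scales}
        self.count = {s: 0 for s in scales}

    def update(self, scale, loss):
        self.total[scale] += loss
        self.count[scale] += 1

    def mean(self, scale):
        # Scales that were never tried are treated as worst-case.
        if self.count[scale] == 0:
            return float("inf")
        return self.total[scale] / self.count[scale]


def choose_scale(step_in_epoch, stats, rng):
    """Random choice during the first EXPLORE_STEPS, lowest mean loss afterwards."""
    if step_in_epoch < EXPLORE_STEPS:
        return rng.choice(SCALES)
    return min(SCALES, key=stats.mean)


def run_epoch(planning_loss, stats, rng):
    """Run one epoch; planning_loss(scale) stands in for the loss in Eq. (4)."""
    chosen = []
    for step in range(EPOCH_STEPS):
        scale = choose_scale(step, stats, rng)
        stats.update(scale, planning_loss(scale))
        chosen.append(scale)
    return chosen
```

With a toy loss that depends only on the scale, the random phase samples the candidates and the remaining 150 steps settle on the lowest-loss scale that was tried. This mirrors the described behaviour only loosely, since the real loss also depends on the updated observations Oi and Og.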




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper brings the task of video procedure planning from the computer vision community to the surgical domain, which might be interesting to some of the researchers in surgical data science.

    This AC identified a limitation of this paper: it is restricted to a video analysis model, without real interaction with or control of the surgical robotic system. Currently, there is no clear pathway to clinically meaningful applications that can be sufficiently validated. The authors should note this limitation in the revised paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


