Abstract

Every day, countless surgeries are performed worldwide, each within the distinct settings of operating rooms (ORs) that vary not only in their setups but also in the personnel, tools, and equipment used. This inherent diversity poses a substantial challenge for achieving a holistic understanding of the OR, as it requires models to generalize beyond their initial training datasets. To reduce this gap, we introduce ORacle, an advanced vision-language model designed for holistic OR domain modeling, which incorporates multi-view and temporal capabilities and can leverage external knowledge during inference, enabling it to adapt to previously unseen surgical scenarios. This capability is further enhanced by our novel data augmentation framework, which significantly diversifies the training dataset, ensuring ORacle’s proficiency in applying the provided knowledge effectively. In rigorous testing on scene graph generation and downstream tasks on the 4D-OR dataset, ORacle not only demonstrates state-of-the-art performance but does so while requiring less data than existing models. Furthermore, its adaptability is demonstrated through its ability to interpret unseen views, actions, and appearances of tools and equipment. This shows ORacle’s potential to significantly enhance the scalability and affordability of OR domain modeling and opens a pathway for future advancements in surgical data science. We will release our code and data upon acceptance.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0311_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0311_supp.pdf

Link to the Code Repository

https://github.com/egeozsoy/ORacle

Link to the Dataset(s)

https://github.com/egeozsoy/ORacle

BibTex

@InProceedings{Özs_ORacle_MICCAI2024,
        author = { Özsoy, Ege and Pellegrini, Chantal and Keicher, Matthias and Navab, Nassir},
        title = { { ORacle: Large Vision-Language Models for Knowledge-Guided Holistic OR Domain Modeling } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a pipeline marking an interesting attempt at modelling editable scenes of digital operating rooms by harnessing the reasoning ability of LLMs and the scene understanding and generation abilities of large vision-language models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel experimental setting and research direction that could raise a lot of interest in the surgical data science community.
    2. According to the two visual results, the objects are generated at the right locations.
    3. The results seem to be better than the previous state-of-the-art, with good performance on downstream tasks as well.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The given visual results imply that the model seems to be doing only 2D modelling of the surgical scene. For example, in Fig. 3, the added scissors look like a 2D picture pasted into a 3D scene, which is not very realistic. Could the authors elaborate on why this happens and on potential methods to make the result more realistic?

    2. One baseline is lacking: zero-shot evaluation of a pretrained large model. Large models can sometimes be “zero-shot” learners, so it would be interesting to see how detailed prompts can make existing generative visual models perform without fine-tuning on the 4D-OR dataset. That would provide a more convincing assessment of the effectiveness of the proposed pipeline. However, MICCAI does not allow extra results in the rebuttal, so maybe in future work.

    3. No failure cases are discussed, which should be addressed in the rebuttal. All large models suffer from hallucinations; the authors should mention or discuss the success rate of generating reasonable scenes.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The paper only vaguely says that the temporal information is embedded in the prompt text, or maybe I have missed it. Could the authors elaborate on how the temporal information is actually embedded and perhaps give a few examples of the prompts?

    2. The figure could be improved for clarity. For example, in Figure 2, the authors could add a dotted box on the left enclosing both the text attributes and the scene from 4D-OR, labelled as “generation/sampling”, with an extra arrow from the left scene image to the final generated scene image.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the visual results are not perfect yet, I think this could be an interesting research direction.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose an adaptable knowledge-guided OR modeling approach. Their goal is to be able to do this in a robust way from a single viewpoint.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors introduce multimodal knowledge guidance, which, for the first time, allows adapting to previously unseen concepts at inference time. The combination of computer vision and LLMs is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The tools injected into the scenes are not aligned with the coordinate system of the scenes. In Figure 3, the robot's base is not on the floor (left, non-adaptable), and the surgeon is not suturing; instead, yellow scissors appear between the surgeon's hand and the patient's knee (right).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Page 3: You claim that you will release the data publicly, but the claim lacks details and specifics. Will the data include the trained network? Where is the data set publicly available? How long will the data set be available?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work in this paper is of interest to the community.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes a novel approach for operating room scene graph generation by treating the task as a language generation problem. The paper uses Large Vision and Language Models, specifically LLaVA, to generate text strings corresponding to scene graphs from multi-view operating room images. The model also incorporates scene graph strings from previous time points as inputs. The paper proposes a data augmentation strategy that generates images of alternative tools (with Stable Diffusion) and overlays them on appropriate parts of the room views. Also, they input visual or text descriptions of the present objects and interactions to enhance generalizability, but during training, they replace names with random symbols to avoid class name memorization. Finally, they train and validate their method on the 4D-OR dataset and demonstrate its superiority over previous models that require point clouds. They present ablations to support their proposals’ impact and prove their method’s generalizability.
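
    To make the anonymization idea described above concrete, below is a purely illustrative sketch; the symbol pool, descriptor texts, and helper names are assumptions rather than the authors' implementation.

```python
import random

# Object names in the knowledge descriptors are swapped for arbitrary symbols
# during training, so the model must rely on the descriptions instead of
# memorized class names. Symbol pool and descriptor format are hypothetical.

SYMBOL_POOL = [f"sym{i}" for i in range(50)]

def anonymize_descriptors(descriptors):
    """Replace each object name with a random symbol in the knowledge prompt."""
    names = list(descriptors)
    symbols = random.sample(SYMBOL_POOL, len(names))
    mapping = dict(zip(names, symbols))
    prompt = "\n".join(f"{mapping[n]}: {d}" for n, d in descriptors.items())
    return prompt, mapping

# Hypothetical textual descriptors for two tools.
descriptors = {
    "saw": "an electric bone saw with a narrow oscillating blade",
    "drill": "a handheld surgical drill with a slender bit",
}
knowledge_prompt, mapping = anonymize_descriptors(descriptors)
# Ground-truth scene graph triplets would be rewritten with `mapping`, so the
# model predicts symbols that are mapped back to names at evaluation time.
```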

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Relevance of the task and scope of the paper:

    • The paper reformulates semantic scene graph generation in the operating room as a language string generation task, which differs from previous approaches in this domain that follow a strict graph-generation formulation.
    • The paper highlights significant flaws in current approaches, like their reliance on costly point clouds and their poor generalization, and demonstrates how their proposed approach can overcome these limitations.

    Technical novelty:

    • The research leverages modern large language and vision models to generate operating room scene graphs, being the first application of such models in this context. Specifically, this work adapts LLaVA’s architecture by incorporating transformers’ self-attention mechanisms to handle multiple views of the operating room scenes (a minimal illustrative sketch follows this list). Also, it integrates the temporal reasoning methodology of previously proposed models by feeding previous scene graph strings into the model. Notably, this paper proposes a strong architecture and removes the need for input point clouds.
    • The paper introduces reproducible, resourceful, and innovative methods to improve the training process and the model’s generalizability. These include data augmentation using modern pretrained diffusion models, the inclusion of textual and visual descriptions, and the anonymization of object names during training. Such strategies are well justified in the paper and demonstrate significant improvements.
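
    As a purely illustrative companion to the multi-view point above, the following sketch shows one way visual tokens from several camera views might be fused with self-attention before being passed to the language model; the dimensions, head count, and residual design are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiViewFusion(nn.Module):
    """Hypothetical self-attention fusion over visual tokens from all views."""

    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, view_tokens: torch.Tensor) -> torch.Tensor:
        # view_tokens: (batch, n_views * tokens_per_view, dim), i.e. visual
        # tokens from all views concatenated along the sequence axis.
        fused, _ = self.attn(view_tokens, view_tokens, view_tokens)
        return self.norm(view_tokens + fused)

tokens = torch.randn(1, 4 * 256, 1024)   # e.g. 4 views, 256 visual tokens each
fused = MultiViewFusion()(tokens)        # same shape, now mixing information across views
```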

    Experimental Validation:

    • The paper presents a robust validation benchmark, comparing different versions of its approach against existing models in the original test set of 4D-OR and their proposed altered set. The paper includes ablation studies for each of their methodological proposals that demonstrate the impact of all components of their approach. This work achieves state-of-the-art performance with very high metrics.

    Writing and presentation:

    • The paper is well-written and organized, with correct terminology and coherent discussions.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Related Work:

    • The paper does not cite previous works that have approached scene graph generation as a language/caption generation problem in other domains (e.g. See Refs. [1, 2, & 3]). This literature omission limits the contextual understanding of how the proposed method compares with alternative approaches.

    Technical Correctness:

    • The paper uses LLaVA as the main architecture with an additional self-attention module for multi-view processing. However, LLaVA itself is a very large model, and the paper does not compare model parameters or FLOPs with previous state-of-the-art models. This lack of complexity comparison makes it unclear how the improved results are affected by model design versus computational cost, and it weakens the claim of affordable OR modeling.

    Presentation:

    • Most of the paper’s explanations are presented in text, and some lack graphic aids, which could significantly enhance understanding of the content. Specifically, the paper does not provide visual examples of the altered test set or the symbolic representation of descriptors. Also, the method explanation provides minimal mathematical formulation.

    Minor weaknesses in experimental validation:

    • The generalizability experiments are conducted only on the non-temporal version of the model, which restricts the validation of the complete proposed architecture and impedes direct ablation comparisons with the current state-of-the-art model LABRAD-OR.
    • The paper does not conduct experiments incorporating parts from previous methodologies, like including point clouds or using an additional model to predict the roles of the human nodes in the graphs. These experiments would provide a more direct comparison and assess the impact of removing or including these parts.

    References:
    [1] Yiwu Zhong et al. Learning To Generate Scene Graph From Natural Language Supervision. ICCV 2021.
    [2] Yikang Li et al. Scene Graph Generation from Objects, Phrases and Region Captions. ICCV 2017.
    [3] Itthisak Phueaksri et al. An Approach to Generate a Caption for an Image Collection Using Scene Graph Generation. IEEE Access.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    This work claims it will provide a link to its source code after acceptance. The paper provides the most general information to ensure reproducibility, and the majority of the missing detailed information (e.g., self-attention module dimensions and learning schedule hyperparameters) could be provided in a future public code release. However, the paper does not provide enough detail to reproduce the altered test set, although this might be due to space limitations in the format.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper significantly contributes to holistic operating room modeling by presenting a novel formulation of operating room scene graph generation that aligns with recent advances in vision and language models. It leverages novel large vision and language models to surpass the current state of the art and introduces an innovative training scheme utilizing modern image generation models. The comprehensive validation benchmarks, including a newly altered validation set and complete ablation studies, demonstrate a thorough investigation of the proposed approach. Despite its strengths, the paper lacks a detailed computational complexity analysis of parameter counts and FLOPs to clarify whether improvements are due to the model’s efficacy or its potentially increased parameter count. However, even though the paper presents a robust experimental validation, it could be further enhanced by extending generalizability experiments to include the entire model with temporal predictions. Also, incorporating elements from previous methodologies, like point clouds and additional role prediction models, would provide a more detailed assessment of how the model compares to previous approaches and how these removed parts could contribute to the task and the model. Finally, the paper would benefit from visual examples of the altered test sets and symbolic descriptor representations to improve readability and comprehension. Still, the paper is well-written and provides satisfactory explanations for their multiple contributions in the limited space. The paper, as it is, presents a complete proposal and empirical validation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces an innovative methodology that achieves state-of-the-art results, thoroughly supported by a robust evaluation benchmark. It makes significant contributions to the field of computer-assisted interventions and overcomes multiple limitations of previous approaches. While there are some shortcomings, particularly in the area of computational complexity comparison, these are outweighed by the paper’s novelty and robust experimental backing.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank all the reviewers for their insightful and valuable comments. We are pleased that the reviewers recognized the novelty and strengths of our work: the novel experimental setting and research direction (R1), the introduction of multimodal knowledge guidance (R3), the innovative application of large vision-language models (R5), and better results than the previous state-of-the-art, with good performance on downstream tasks (R1).

Regarding the zero-shot performance of existing LVLMs in our setting (R1), we want to point out that both the operating room domain and the scene graph generation task are uncommon for these models, and our initial experiments showed very poor results in the zero-shot setting. Furthermore, these models are incapable of accepting multi-view input. Therefore, adapting these models, both by modifying the architecture and by fine-tuning them, was crucial to our work.

Regarding the integration of the temporal prompt (R1), all details will be included in the code, which will be available by the time this work is published at MICCAI 2024. Nonetheless, we want to provide a short example here: if at timepoint T the triplet <head surgeon, sawing, patient> is predicted, we add it to our memory. For timepoint T+1, the model prompt then looks like this (simplified): “Memory: <head surgeon, sawing, patient> Entities… Predicates…”. The model then predicts new triplets for timepoint T+1, and we add these to the memory as well.
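
To make this concrete, the following minimal sketch shows how such a triplet memory could be folded into the next prompt; the entity and predicate lists and the exact string format are simplified placeholders, not the authors' actual prompt template.

```python
def build_prompt(memory, entities, predicates):
    """Compose a simplified prompt for the next timepoint from the triplet memory."""
    memory_str = " ".join(f"<{h}, {p}, {t}>" for h, p, t in memory)
    return (
        f"Memory: {memory_str} "
        f"Entities: {', '.join(entities)}. "
        f"Predicates: {', '.join(predicates)}."
    )

# Timepoint T predicted this triplet, so it enters the memory.
memory = [("head surgeon", "sawing", "patient")]

# Prompt for timepoint T+1; the new predictions are appended to `memory` afterwards.
prompt_t1 = build_prompt(
    memory,
    entities=["head surgeon", "patient", "saw", "operating table"],  # illustrative
    predicates=["sawing", "holding", "lying on"],                    # illustrative
)
print(prompt_t1)
```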

Regarding the data / code that will be released (R3), we will indeed release everything: the entire code, any data we have generated beyond the previous work, and the relevant model checkpoints.

Regarding the computational complexity (R5), we agree that LVLMs can have high costs. While this was not the focus of this initial work, we want to briefly address some concerns. Unlike previous works, ORacle is a fully end-to-end approach, making it more efficient in terms of compute time; previous methods relied on three or more steps, usually consisting of different neural networks. Additionally, we see a great effort in the community to reduce the runtime complexity of LVLMs, including quantization techniques as well as custom-designed hardware, which we believe will make it possible to run these models in operating rooms in real time in the future.

Overall, we are grateful for the constructive feedback of all the reviewers and will incorporate these improvements in the final version. As acknowledged by the reviewers, our approach presents a significant advancement in holistic OR modeling using large vision-language models, achieving state-of-the-art results and demonstrating practical adaptability.




Meta-Review

Meta-review not available (early accepted paper).
