Abstract

A comprehensive understanding of surgical scenes allows for monitoring of the surgical process, reducing the occurrence of accidents and enhancing efficiency for medical professionals. Semantic modeling within operating rooms, cast as a scene graph generation (SGG) task, is challenging since it involves consecutive recognition of subtle surgical actions over prolonged periods. To address this challenge, we propose a Tri-modal (i.e., images, point clouds, and language) confluence with Temporal dynamics framework, termed TriTemp-OR. Diverging from previous approaches that integrated temporal information via memory graphs, our method offers two advantages: 1) we directly exploit bi-modal temporal information from the video stream for hierarchical feature interaction, and 2) prior knowledge from Large Language Models (LLMs) is embedded to alleviate the class-imbalance problem in the operating theatre. Specifically, our model performs temporal interactions across 2D frames and 3D point clouds, including a scale-adaptive multi-view temporal interaction (ViewTemp) and a geometric-temporal point aggregation (PointTemp). Furthermore, we transfer knowledge from the biomedical LLM, LLaVA-Med, to deepen the comprehension of intraoperative relations. The proposed TriTemp-OR aggregates tri-modal features through relation-aware unification to predict relations and generate scene graphs. Experimental results on the 4D-OR benchmark demonstrate the superior performance of our model for long-term OR streaming. Code is available at https://github.com/RascalGdd/TriTemp-OR.
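The following is a minimal, self-contained sketch (not the authors' released code) of the relation-aware unification idea described in the abstract: per-pair visual features from the 2D and 3D temporal branches are fused and scored against frozen LLM embeddings of relation prompts. All module names, feature dimensions, and the similarity-plus-classifier head are illustrative assumptions; see the repository linked above for the actual implementation.

```python
# Conceptual sketch only: relation-aware unification of tri-modal features.
# Module names, dimensions, and the fusion scheme are assumptions for exposition.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RelationAwareUnification(nn.Module):
    """Fuse 2D-temporal, 3D-temporal, and language features per subject-object pair."""

    def __init__(self, dim_2d=256, dim_3d=256, dim_text=768, dim_fused=256, num_relations=14):
        super().__init__()
        self.proj_2d = nn.Linear(dim_2d, dim_fused)
        self.proj_3d = nn.Linear(dim_3d, dim_fused)
        self.proj_text = nn.Linear(dim_text, dim_fused)
        self.classifier = nn.Linear(dim_fused, num_relations)

    def forward(self, feat_2d, feat_3d, text_emb):
        # feat_2d / feat_3d: (num_pairs, dim_*) visual features of candidate pairs
        # text_emb: (num_relations, dim_text) frozen LLM embeddings of relation prompts
        visual = self.proj_2d(feat_2d) + self.proj_3d(feat_3d)       # (num_pairs, dim_fused)
        prompts = self.proj_text(text_emb)                            # (num_relations, dim_fused)
        # Align fused visual features with relation-prompt embeddings (cosine similarity),
        # then add a learned classifier head on top of the visual features.
        sim = F.normalize(visual, dim=-1) @ F.normalize(prompts, dim=-1).T
        return self.classifier(visual) + sim                          # (num_pairs, num_relations)


if __name__ == "__main__":
    model = RelationAwareUnification()
    logits = model(torch.randn(8, 256), torch.randn(8, 256), torch.randn(14, 768))
    print(logits.shape)  # torch.Size([8, 14])
```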

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0281_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0281_supp.pdf

Link to the Code Repository

https://github.com/RascalGdd/TriTemp-OR

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Guo_Trimodal_MICCAI2024,
        author = { Guo, Diandian and Lin, Manxi and Pei, Jialun and Tang, He and Jin, Yueming and Heng, Pheng-Ann},
        title = { { Tri-modal Confluence with Temporal Dynamics for Scene Graph Generation in Operating Rooms } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework for semantic modeling in operating rooms in the form of scene graph generation. Their tri-modal approach combines images, point clouds, and language, and experimentally achieves better results than previous methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The way the authors integrate 2D and 3D features using RoIAlign is unique in the operating-room setting
    • The use of language embeddings for dealing with rare classes seems to be effective
    • The proposed method seems to be technically sound and effective, achieving superior results compared to previous methods
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The authors claim to “directly exploit … temporal information” unlike previous works, however they only consider a temporal window size of 3 frames and do not provide any experimental results on the effect of using temporality vs working on a single timepoint. Is a significantly bigger temporal context than 3 possible? Is temporality having any positive impact on the results? Without this context, it is hard to decide if the authors are indeed proposing a more effective way of using temporal information compared to previous works.

    • According to Table 2, only a specific kernel size combination outperforms the prior state-of-the-art, raising concerns about the robustness of the model and the potential for overfitting. How was this kernel size combination chosen? Do the authors have any intuition for why the results are sensitive to this choice?

    • The selection rationale for LLaVA-Med, which was trained on PubMed articles, is currently under-explained in the paper. While using it leads to results that are higher than SOTA, using either CLIP embeddings or no embeddings performs worse than or comparable to SOTA. Surprisingly, CLIP even seems to hurt compared to not using any embeddings. This therefore raises the question: how was the choice to use LLaVA-Med made?

    • In Figure 3, it is unclear what the baseline (B) refers to. The authors show that both ViewTemp and PointTemp improve upon this baseline, but it is unclear how their method would work without either of these components.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While providing code will help the reproducibility of the results on 4D-OR, it is unclear how this method could be applied to a slightly different dataset or setting, as it seems to be quite dependent on the exact choice of hyperparameters such as kernel size.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While the proposed methodology is mostly technically sound, the novelty is limited and the experimental results are lacking. I think the paper would benefit from a clear formulation of the unique contributions of this work, as well as more thorough experimental results clearly highlighting the contributions of the individual components. It would be important to clarify how some decisions were made, particularly regarding the kernel size and the language model.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the proposed framework seems to be effective, there are significant questions remaining about the contribution of individual components, the effectiveness of the use of temporality, and the decision process behind some hyperparameter choices. Current experiments are lacking, and do not fully support the claims made in the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision
    • This work originally argued that it “directly exploit[s] … temporal information” unlike previous works, while only considering a temporal window of 3. I raised my concerns about this very limited temporal context. The rebuttal did not address my concern in this regard; in fact, it even validated the point that the proposed methodology does not effectively integrate long temporal context with the following comment: “When it (the temporal context) is equal to or greater than 3, the performance of our model stabilizes.”
    • I also had concerns regarding potential overfitting of the method parameters, possibly to the testing set, as only a specific kernel size combination in Table 2 gave good results. The authors did not properly justify why only this combination works and, more importantly, whether it was chosen based on their results on the test set.

    Overall, because of the limited novelty, combined with the limited number of evaluations to support the different components, I suggest a rejection of this paper.



Review #2

  • Please describe the contribution of the paper

    State-of-the-art results in most of the metrics. An end-to-end multi-modal model combining three different elements to focus on different components.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a new multi-modal model for 3D scene graph generation. The model combines three sources of information: multi-view temporal interaction, geometric-temporal point aggregation, and a medical LLM. The authors perform an ablation study to demonstrate the contribution of each component. In addition, they compare the use of two language models and analyze the advantage of using the LLaVA-Med model, which includes medical data.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The model is analyzed on only one dataset and compared to only two previous models; this might be a limitation of the domain, but it is nevertheless still limiting.
    The ablation study in Fig. 3 is a one-way ablation; that is, B+L might be sufficient with no need for V or P, yet this is not examined. Small note: I think Table 2 is a hyperparameter search and not an ablation study.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In this study the authors address the challenge of 3D scene graph generation in the operating room. They suggest a new multi-modal network. In the introduction the authors make claims such as “exhibits wide variations in the duration of activities at different phases, it is inevitable to confront a significant imbalance of classes between frequent ones”. From what I could see, later on in the results, ablation, and discussion, they did not demonstrate how their approach is better at dealing with this issue.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel multi-modal end-to-end model that provides SOTA results in most of the metrics. The logic behind the developed model is explained and justified.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my main concerns. While only one dataset and two prior studies are examined, this is a new domain in which the authors provide SOTA results.



Review #3

  • Please describe the contribution of the paper

    This paper proposes a method for surgical phase recognition in the 4D-OR dataset via scene graph generation. The authors utilize images, point clouds, and text embeddings of prompts as three modalities.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors introduced a novel method using scene graph generation for surgical phase recognition in the 4D-OR dataset. In their method, they proposed using convolution- and attention-based networks for image and point feature extraction. In addition, the authors used additional text embeddings of prompts from the LLaVA-Med vision-language model to distill information.

    2. The authors provided a comparison with the existing work on the same dataset and achieved improved results on average. Moreover, they tested the main components of the model (ViewTemp, PointTemp, LLM) in the ablation study.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors claim that using the Med version of LLaVA can capture medical context better than vanilla LLMs. In the ablation, text embeddings from LLaVA-Med are compared with CLIP. Wouldn’t comparing LLaVA-Med with LLaVA instead of CLIP test this claim better? What is the reason behind this choice?

    Moreover, are the prompts used sufficient to include the rich semantic content for the LLMs to capture, as mentioned? I am concerned that the semantic contribution of LLMs may be limited and the performance improvement might depend on knowledge distillation with distinctive features. The number of scenarios (nodes/edges) is limited. From Figure 4, it is clear that CLIP features are not well separated in low dimensions and consequently do not bring any advantage in performance. This could be due to the semantic representation limitation of the CLIP text encoder. Did the authors experiment with other LLMs (including unimodal text models) that can generate better features? Or could simple, well-separated one-hot encoded feature vectors make a better contribution?

    2. The baseline in Figure 3 is not explained in the body text. Is this model using only encoded image/point features without the ViewTemp/PointTemp modules?

    3. In the introduction section, the authors claim that “Unlike specific surgical assistance interventions, e.g., surgical phase recognition [9], instrument segmentation [21], and anatomy tracking [10], holistic scene modeling of operating theatres [28,27,6] can facilitate coordination and communication among surgical teams, optimize the surgical process, and enhance safety and efficiency during surgery.” Can the authors reference this claim or discuss why the mentioned topics cannot help coordination and communication among surgical teams, optimize the surgical process, and enhance safety and efficiency during surgery?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors used a multimodal LLM in their experiments to match the features with images. However, could using only a text-based BERT/Med-BERT/Sentence-Transformer produce similar or better results?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed image and point models (ViewTemp, PointTemp) are designed based on reasonable claims about spatio-temporal feature extraction, and their effectiveness is shown in Figure 3. Moreover, using LLMs to distill information for this task is an innovative idea, and the authors could further improve their results with this approach. The final model showed better performance than the referenced work. However, the contribution of the LLM could be investigated better with wider comparisons (with more capable medical or vanilla LLMs) to fully support the given claims.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    After the rebuttal, the choice of using a multi-modal LLM instead of unimodal text-based LLMs is clarified (Rebuttal 1.2). However, I still believe that a comparison with vanilla LLaVA would be a natural experiment, which could be done in addition to the CLIP comparison (Rebuttal 1.3). I am not convinced that a rich semantic context can be captured from the given prompts; the performance gain might result from the regularization effect of adding well-separated features. Therefore, I am skeptical about the contribution of the LLM/LLaVA-Med.

    Comments on clarity and statements are addressed sufficiently (Rebuttal 4, 6).




Author Feedback

We thank the reviewers for the valuable comments, and we are encouraged by the positive comments on the ‘novel method’ (R1&R3), ‘very good organization’ (R1&R3), and ‘justified model logic’ (R1). Below, we address the specific comments.

  1. About LLMs
     1.1 Selection rationale for LLaVA-Med (R5): The knowledge encoded in LLMs embeds different semantic relation features, which allows for effective knowledge distillation and thereby improves the model's understanding of intraoperative relations. We therefore propose to leverage LLMs. However, due to the prior-knowledge gap between open-vocabulary and biomedical semantics, CLIP struggles to differentiate the semantic embeddings of different action relations (left of Fig. 4), leading to a decline in performance. In contrast, since the relation prompts used appear in PubMed, on which LLaVA-Med was trained, the relation semantics are well distinguished (middle of Fig. 4), showing that LLaVA-Med helps our model understand the relevant semantics and obtain better results.
     1.2 Effect of LLaVA-Med; why not unimodal text models (R3): Experimental results validate that the rich semantic features extracted by LLaVA-Med can guide visual features and lead to better performance (see Fig. 3 and Fig. 4). Moreover, LLaVA-Med is a multimodal pre-trained model (text & vision), which is well suited to our multimodal model. In contrast, unimodal models may lack feature interaction with visual cues.
     1.3 Comparison with CLIP instead of LLaVA (R3): CLIP has been widely and successfully used in general SGG models that leverage language models, compared with LLaVA. Thus, we used it as the baseline against LLaVA-Med in OR-SGG. We keep this suggestion for future work.
  2. Model Effectiveness
     2.1 Temporal modeling (R5): The effectiveness of our temporal design can be observed in Fig. 3. ViewTemp contains 2D temporal features and PointTemp includes 3D temporal cues; our baseline uses only image features without 2D or 3D temporal information. As shown in Fig. 3, removing either the 2D or the 3D temporal module results in a performance drop. The number of temporal frames was determined by a hyperparameter search; when it is equal to or greater than 3, the performance of our model stabilizes.
     2.2 ViewTemp in Table 2 (R5): Table 2 shows the hyperparameter ablation for ViewTemp rather than comparison studies. Generally, the choice of hyperparameters is unavoidable for learning-based models (Bergstra et al., 2012), and the kernel-size combinations are derived from Inception (Szegedy et al., 2014). We ablate different kernel combinations for View#1 and View#6 to accommodate different granularities of subject/object characteristics; for instance, View#6 displays a finer-grained view with lower feature density, so we use larger kernel combinations (a rough sketch of such parallel multi-kernel branches is given after this list).
     2.3 Class imbalance (R1): We exploit the rich semantics from LLMs to alleviate the class imbalance in ORs. Table 2 shows that our model achieves a remarkable advantage on less-frequent relations, e.g., ‘Saw’ and ‘Clean’, as discussed in Sec. 3.3.
  3. Datasets (R1): To our knowledge, 4D-OR is the only available dataset for this task, as evidenced by the recent MICCAI works (4D-OR and LABRAD-OR). Moreover, these two works are the only peer-reviewed state-of-the-art methods for this task.
  4. Baseline setting (R3&R5): Our baseline in Fig. 3 uses only encoded image features without the 2D and 3D temporal information from ViewTemp and PointTemp. Additionally, we removed the feature alignment based on LLaVA-Med. We will add this in the final version.
  5. Ablation in Fig. 3 (R1): We have actually considered other combinations; due to space constraints, we only display the important settings. L is implemented through feature alignment with the 2D and 3D temporal information (V+P). Therefore, our model benefits more from the complementary multimodal feature unification of L and (V+P).
  6. Statement corrections (R1&R3): We will relabel Table 2 as a hyperparameter search (R1) and rephrase the second sentence of the Introduction (R3).
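As a rough illustration of the Inception-style multi-kernel design that the rebuttal describes for ViewTemp (item 2.2 above), the sketch below runs several convolutions with different kernel sizes in parallel and concatenates their outputs. The class name, channel counts, and kernel sizes are assumptions for exposition and are not taken from the paper.

```python
# Minimal sketch of an Inception-style multi-kernel branch, as described in the rebuttal
# for ViewTemp. Kernel sizes and channel counts below are illustrative assumptions.
import torch
import torch.nn as nn


class MultiKernelBranch(nn.Module):
    """Parallel convolutions with different kernel sizes, concatenated channel-wise."""

    def __init__(self, in_ch=256, out_ch=256, kernel_sizes=(3, 5, 7)):
        super().__init__()
        branch_ch = out_ch // len(kernel_sizes)
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, branch_ch, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        # x: (batch, in_ch, H, W) view features; each branch captures a different granularity.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


if __name__ == "__main__":
    # A coarser view might use smaller kernels, a finer-grained view larger ones (hypothetical choice).
    coarse = MultiKernelBranch(kernel_sizes=(3, 5, 7))
    fine = MultiKernelBranch(kernel_sizes=(5, 7, 9))
    x = torch.randn(2, 256, 32, 32)
    print(coarse(x).shape, fine(x).shape)  # each: torch.Size([2, 255, 32, 32]) since 256 // 3 = 85 per branch
```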




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The issues pointed out by R5 should be thoroughly revised.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviews for this submission after rebuttal are still mixed. Two of the reviewers have been satisfied with the answers provided by the authors, while only one reviewer remains negative about this submission. I would recommend accepting this paper for a poster presentation.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



