Abstract
Analyzing operating room (OR) workflows to derive quantitative insights into OR efficiency is important for hospitals to maximize patient care and financial sustainability. Prior work on OR-level workflow analysis has relied on end-to-end deep neural networks. While these approaches work well in constrained settings, they are limited to the conditions specified at development time and do not offer the flexibility necessary to accommodate the OR workflow analysis needs of various OR scenarios (e.g., large academic center vs. rural provider) without data collection, annotation, and retraining. Reasoning segmentation (RS) based on foundation models offers this flexibility by enabling automated analysis of OR workflows from OR video feeds given only an implicit text query related to the objects of interest. Due to the reliance on large language model (LLM) fine-tuning, current RS approaches struggle with reasoning about semantic/spatial relationships and show limited generalization to OR video due to variations in visual characteristics and domain-specific terminology. To address these limitations, we first propose a novel digital twin (DT) representation that preserves both semantic and spatial relationships between the various OR components. Then, building on this foundation, we propose ORDiRS (Operating Room Digital twin representation for Reasoning Segmentation), an LLM-tuning-free RS framework that reformulates RS into a “reason-retrieval-synthesize” paradigm. Finally, we present ORDiRS-Agent, an LLM-based agent that decomposes OR workflow analysis queries into manageable RS sub-queries and generates responses by combining detailed textual explanations with supporting visual evidence from RS. Experimental results on both an in-house and a public OR dataset demonstrate that our ORDiRS achieves a cIoU improvement of 6.12%-9.74% compared to existing state-of-the-art methods. The code is available at https://anonymous.4open.science/r/ordirs.
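To make the abstract's "reason-retrieval-synthesize" paradigm concrete, the following is a minimal, purely illustrative sketch of what a frame-level digital twin (DT) representation and its retrieval step might look like. All field names (`label`, `bbox`, `depth_m`, `relations`) and the `retrieve` helper are assumptions for illustration, not the paper's actual schema.

```python
# Hypothetical sketch of a digital twin (DT) representation for one OR video
# frame, plus the "retrieve" step of the reason-retrieve-synthesize paradigm.
# Field names and structure are illustrative assumptions, not the paper's
# actual schema.

frame_dt = {
    "frame_id": 42,
    "entities": [
        {"label": "anesthesia machine", "bbox": [10, 40, 210, 320],
         "depth_m": 2.1, "relations": ["left_of:patient_table"]},
        {"label": "patient_table", "bbox": [220, 150, 520, 400],
         "depth_m": 2.8, "relations": []},
        {"label": "scrub nurse", "bbox": [540, 60, 640, 380],
         "depth_m": 3.0, "relations": ["near:patient_table"]},
    ],
}

def retrieve(dt, reasoned_labels):
    """Return DT entities whose label matches any label produced by the
    LLM 'reason' step; their masks would then be synthesized downstream."""
    wanted = {lbl.lower() for lbl in reasoned_labels}
    return [e for e in dt["entities"] if e["label"].lower() in wanted]

matches = retrieve(frame_dt, ["Anesthesia Machine"])
print([e["label"] for e in matches])  # → ['anesthesia machine']
```

The point of such a structured intermediate is that an untuned LLM can reason over plain text (labels, coordinates, relations) rather than pixels, which is what allows the framework to avoid LLM fine-tuning.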
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2221_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTeX
@InProceedings{SheYiq_Operating_MICCAI2025,
  author    = {Shen, Yiqing and Li, Chenjia and Liu, Bohan and Li, Cheng-Yi and Porras, Tito and Unberath, Mathias},
  title     = {{Operating Room Workflow Analysis via Reasoning Segmentation over Digital Twins}},
  booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
  year      = {2025},
  publisher = {Springer Nature Switzerland},
  volume    = {LNCS 15968},
  month     = {September},
  pages     = {416--425}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a training-free method that leverages multi-modality information for referring image segmentation for medical images/videos.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- It investigates leveraging multiple modalities in LLM-based reasoning for the OR.
- It demonstrates the use of an agent-based workflow in automatic OR analysis.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The results shown by the paper are not convincing and leave many key questions unanswered. For example, the improvement shown in Fig. 3 over the fine-tuned method may come from the bounding-box-prompts + SAM2 segmentation workflow. Though this might lead to more holistic segmentation and deal with the sub-part granularity confusion, the false negative detection issue in the bounding box proposals would arise. At least according to Fig. 1, only one anesthesia machine is detected. As the testing set of the proposed dataset only contains 80 images and not enough qualitative results are presented, the actual improvement of the proposed pipeline is unclear when dealing with missing detections. A much more comprehensive evaluation of the proposed pipeline should be carried out (maybe consider including standard datasets outside the medical setting, as it is a training-free method).
- The contribution of each individual module is unclear. For example, an ablation on the depth information could have been conducted. As the whole pipeline seems computationally demanding, the added value of those modalities, along with running time and feasibility in real life, has to be shown.
- The comparison with baseline methods is not convincing. No qualitative results on the MVOR dataset have been presented, and fine-tuning the baseline methods with only 40 images is not convincing enough.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Unfortunately, the code link did not work during the review period.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, this paper presents a training-free method leveraging object detection, SAM, LLaVA, and an LLM for reasoning segmentation. However, a much more rigorous proof of the added value, and of its ability to deal with false negative detections, has to be shown. The experiments are not strong enough to support the claims made in the introduction.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
My major concern is regarding the experiment design. As also reported in the authors’ response, the OWLv2 image detection module is a key contributor to the large accuracy improvement over baseline methods. However, this module could easily make the whole system sensitive to false negative detections (an issue none of the baseline methods have). As the proposed dataset only contains 82 images and no visualization from MVOR is shown, I would argue that more comprehensive experiments have to be conducted for a fairer comparison and to demonstrate the method’s generalizability and downstream applications.
Review #2
- Please describe the contribution of the paper
This work tackles analyzing operating room (OR) video workflows using reasoning segmentation. For better flexibility, the authors propose a digital twin representation (a JSON format) used in combination with an LLM, without the need for fine-tuning, in addition to an LLM agent. Results are reported on two datasets, one of which is private.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- More flexible approach.
- No LLM fine-tuning required.
- Good results.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Lack of comparison to existing work; this drastically limits the assessment of the performance of the proposed method.
- Compared to only two methods, one of which is a very weak baseline.
- ~62x slower than method [5].
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- There is a huge lack of comparison to existing work. One cannot assess the performance of the proposed method by comparing to only two methods, one of which is a very weak baseline.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal did not resolve my main concern about the performance. The tackled task is relatively new; therefore, there are no previous works to compare to head-to-head. However, in this particular case, the authors could have done more experiments on very closely related works to assess the method’s performance and carried out extensive comparisons. Right now, we do not know how good the proposed method is; they used a very weak baseline.
Review #3
- Please describe the contribution of the paper
This work introduces ORDiRS, a novel framework for analyzing operating room (OR) workflows from video data to improve OR efficiency. ORDiRS uses a digital twin (DT) representation to preserve semantic and spatial relationships in the OR and reframes reasoning segmentation (RS) into a three-step process: reason, retrieve, and synthesize—all without the need for LLM fine-tuning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The generalizable agentic approach proposed in this work would serve as a template for solving similar workflow analysis tasks.
- A detailed analysis of various foundation models was provided in the ablation study, showing the effectiveness of the proposed template.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- To show the overall working of the system, additional supplementary material showing how the actual agent works on one or multiple videos, perhaps 1-2 minutes in length (to show enough variation in OR movement), would be helpful.
- There is a lot of speculation in the community on the reasoning abilities of LLMs. So, just a curious question: do the authors think the models are able to generate valid reasoning queries? Is there any automatic way to check this? I am not sure whether the filtering function used in this work could handle this; some examples of its explicit working would have given more clarity.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- The authors did not comment on one aspect: although the template as a whole looks agnostic to OR rooms, its limitations lie in the size of the rooms, positioning, the camera configurations, and the “receptive field” of the vision models used. These will limit the ability of the agent to identify everything going on in the OR, especially objects that occupy smaller spatial regions in the video.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the work is novel and has great potential, opening up the space to solve more complex problems in surgical environments.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The work is novel and has the necessary evidence to back its claims; the other minor issues were clarified by the authors in the rebuttal.
Author Feedback
Lack of Comparison to Existing Work (R1,4): As pointed out in the introduction, reasoning segmentation in the context of OR efficiency analysis is a new task, with this manuscript making a first attempt to explore it. Therefore, we are unaware of adequate baselines. While substantial prior work exists on surgical phase recognition and workflow analysis at the OR level, our approach differs fundamentally in both task formulation and methodology. Specifically, traditional OR workflow analysis methods typically rely on closed-set classification or detection frameworks with predefined categories (e.g., phases, actions, or instruments). In contrast, our approach leverages reasoning segmentation with open-set text queries that require multi-step reasoning about semantic and spatial relationships. This fundamental difference in problem formulation makes direct comparison with this existing work inappropriate.
Computational Efficiency (R1): The processing time of our method mainly comes from constructing the digital twin representation, and we agree that there is significant opportunity for acceleration. For example, using FastSAM instead of SAM2 can accelerate processing by ~20x, and parallel processing can add a further ~3x, leading to runtimes competitive with baselines like LISA. These improvements are left to future work, since this manuscript focuses on feasibility.
Validity of LLM reasoning queries (R2): We acknowledge that LLM reasoning capabilities have inherent limitations and can potentially fail, especially for complex medical reasoning tasks. In our experiments, we evaluated GPT-4o’s performance in decomposing implicit queries into explicit reasoning requirements across 100 different OR workflow analysis queries. The model demonstrated satisfactory performance in most cases, correctly identifying the reasoning components needed for accurate segmentation with a success rate of 96%. When errors occurred (4% of cases), they typically involved insufficient decomposition of spatial relationships or incomplete consideration of domain-specific medical knowledge. For example, the model occasionally struggled with queries requiring understanding of specific surgical protocols. We confirm that the filtering function indeed helps here, improving the success rate to 99%.
Convincing results and concerns of false negative detection (R4): SAM2 focuses on instance segmentation, while our work focuses on reasoning segmentation. Therefore, ORDiRS is fundamentally different from simply applying SAM2 with bounding box prompts, because those operations are insufficient to answer the actual query. ORDiRS uses SAM2 (among other models) to construct a structured digital twin representation that enables an LLM to perform complex reasoning over the video content without direct fine-tuning. Moreover, in the DT construction process, the LLM is also involved (e.g., by prompting OWLv2) to mitigate false negative detections.
Individual module contributions (R4): Removing depth information reduces performance to 71.42% cIoU and 73.81% gIoU while improving processing time to 62.31 seconds. Without LLaVA-7B, performance drops to 68.93% cIoU and 70.62% gIoU with 45.28 seconds processing time. Finally, removing OWLv2 (using SAM2 directly) decreases performance to 58.25% cIoU and 61.04% gIoU, though processing time improves to 35.86 seconds. We will add these results to the ablation study.
Results for MVOR dataset (R4): These are already provided in the caption of Tab. 2 and in Sec. 3. The trends are similar to those on the in-house dataset; some repetitive qualitative descriptions were omitted.
Insufficient samples for fine-tuning (R4): We use a parameter-efficient fine-tuning method (i.e., LoRA), so it does not rely on large-scale datasets. Moreover, beyond the reasoning segmentation images we annotated ourselves, we also include general reasoning segmentation samples from ReasonSeg, MVOS, and other sources, resulting in a total of 2000+ samples for fine-tuning.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A