Abstract

Ultrasonography has revolutionized non-invasive diagnostic methodologies, significantly enhancing patient outcomes across various medical domains. Despite these advances, integrating ultrasound technology with robotic systems for automated scanning remains challenging because of limited command understanding and limited dynamic execution capabilities. To address these challenges, this paper introduces a novel Ultrasound Embodied Intelligence system that synergistically combines ultrasound robots with large language models (LLMs) and domain-specific knowledge augmentation, enhancing ultrasound robots’ intelligence and operational efficiency. Our approach employs a dual strategy: firstly, integrating LLMs with ultrasound robots to interpret doctors’ verbal instructions into precise motion planning through a comprehensive understanding of ultrasound domain knowledge, including APIs and operational manuals; secondly, incorporating a dynamic execution mechanism, allowing for real-time adjustments to scanning plans based on patient movements or procedural errors. We demonstrate the effectiveness of our system through extensive experiments, including ablation studies and comparisons across various models, showcasing significant improvements in executing medical procedures from verbal commands. Our findings suggest that the proposed system improves the efficiency and quality of ultrasound scans and paves the way for further advancements in autonomous medical scanning technologies, with the potential to transform non-invasive diagnostics and streamline medical workflows. The source code is available at https://github.com/seanxuu/EmbodiedUS.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0242_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/seanxuu/EmbodiedUS

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Xu_Transforming_MICCAI2024,
        author = { Xu, Huan and Wu, Jinlin and Cao, Guanglin and Chen, Zhen and Lei, Zhen and Liu, Hongbin},
        title = { { Transforming Surgical Interventions with Embodied Intelligence for Ultrasound Robotics } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The main contribution of this paper lies in the integration of large language models (LLMs) into automated ultrasound robotics. The authors propose a novel embodied intelligence robot system capable of comprehending verbal instructions from clinicians and autonomously conducting the requested ultrasound-related task. To this end, known concepts are combined in a novel way. A text embedding model (bge-large-en-v1.5) is fine-tuned on augmented data to help transform concise verbal queries into contextually enriched “Ultrasound Assistant Prompts” using a predefined knowledge base. The new prompt serves as input to an LLM (GPT-4 Turbo) and is complex and detailed enough to enable task execution.
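    The retrieval step described above (embedding the query and ranking knowledge-base entries by cosine similarity) can be sketched as follows. This is a minimal sketch, assuming the query and the knowledge-base entries have already been embedded; the toy 3-dimensional vectors, the entry names, and the `retrieve_top_k` helper are hypothetical illustrations, not the authors' code.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_vec, knowledge_base, k=2):
    """Return the names of the k entries most similar to the query.

    knowledge_base maps a hypothetical entry name (an API or a handbook
    section) to its precomputed embedding vector.
    """
    scored = [(cosine_similarity(query_vec, vec), name)
              for name, vec in knowledge_base.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored[:k]]

# Hypothetical toy embeddings; bge-large-en-v1.5 itself produces 1024-d vectors.
kb = {
    "robot_arm_move": [0.9, 0.1, 0.0],
    "vessel_segmentation": [0.1, 0.9, 0.1],
    "force_sensor_read": [0.2, 0.1, 0.9],
}
query = [0.8, 0.3, 0.1]  # stands in for the embedded verbal instruction
print(retrieve_top_k(query, kb, k=2))  # → ['robot_arm_move', 'vessel_segmentation']
```

    The top-k entries would then be placed into the assistant prompt; the choice of k and of the similarity threshold, if any, is not specified in the reviews.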

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Translation and complexity: Combination and application of advances in language understanding, embodied intelligence and robotics for clinical application within a complex setup.
    • Clinical usefulness: The robotic system and its capabilities as shown in the provided videos are promising for future use in clinical settings.
    • Clear evaluation: The evaluation of the methods is clear, and the ablation study performed gives valuable insights into the contribution of each improving step.
    • Modular setup: The setup seems modular, and every task or system (e.g., the robotic system or the image segmentation module) has its own API.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    -Unclear methodology:
      o The authors propose to retrieve information for contextually enriching the input query by comparing embedded entries from a knowledge base with the embedded input query. Where does the knowledge base come from? It consists of APIs and Robotic Handbooks. The authors state that the training data consist of synthetically generated instances for the Robotic Handbook and Ultrasound APIs. Do these also serve as the knowledge base?
      o Why is model training conducted on synthetic data (e.g., APIs) when “real” APIs (e.g., the robot system APIs) exist? Where does the LLM learn the name of, e.g., the real robot API if it has never seen it? Is this information provided in the knowledge base? That would mean the knowledge base is not synthetically generated, and the question remains where it comes from.
      o How was the knowledge base data paired with the extra information (e.g., for API retrieval, the data were paired with narrative descriptions, and for the Robotic Handbook, with detailed instructions), and where does this information come from? Are these pairings also used for training?
      o Was the LLM trained directly on the synthetic training data or on the embeddings of the fine-tuned bge-large-en-v1.5?
      o It is never explained how information retrieved from the knowledge base (domain-specific knowledge, API and Handbook) is transformed into the Ultrasound Assistant prompt; only the final result is shown.
      o It is unclear which training data were used for which model.
      o It is unclear whether the ReAct framework is actively coupled with the LLM or whether it “takes over” the task.
      o It is unclear how task execution is initiated: does the LLM actively invoke the APIs with the correct parameters?

    -Unclear setup and components:
      o No information about the robotic system is given. Does it have to be pretrained in any way (e.g., on how to sense the human), or can any robotic arm be used?
      o Which components offer APIs, and how are they connected to communicate with each other? For example, does the ultrasound machine offer APIs that the LLM can use to change ultrasound settings directly?

    -Unclear contribution:
      o It is unclear what has been done before and whether the contribution lies mainly in the integration of the language models.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The terms Robotic Handbook, API and the evaluation metric Recall@k are used without context or explanation what they are.
    • What happens if a new API is introduced to the system?
    • Figure 3 is unclear. Is that the info retrieved that serves for the US Assistant prompt?
    • In Table 2, LLMs are compared, but the best model is not listed in the table. Were all other models trained with the “LLMs + UAR + RHR” configuration?
    • The authors show in the video added to the supplementary material, that image segmentation is also performed. An “Image Seg” API is also depicted in the manuscript but never mentioned in the text. Does that mean, image segmentation is always performed?
    • A figure showing the full architecture of the setup would be very helpful.
    • The structure of the paper is good and the idea of generating an enriched assistant prompt from a simple input query is nicely presented and motivated.
    • The authors provide examples for the dataset to improve reproducibility within the supplementary material section. They also provide a working demonstration of the setup showcasing their success in implementing their desired system.
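    Since Recall@k is used without explanation, a common definition in retrieval is the fraction of queries whose relevant entry appears among the top-k retrieved entries. The sketch below assumes exactly one relevant entry per query; the paper's precise definition is not stated in the reviews, and the ranked lists and entry names are hypothetical.

```python
def recall_at_k(ranked_results, relevant, k):
    """Fraction of queries whose relevant item appears in the top-k results.

    ranked_results: one ranked candidate list per query.
    relevant: the single relevant item for each query (assumption).
    """
    if not ranked_results:
        return 0.0
    hits = sum(1 for ranked, target in zip(ranked_results, relevant)
               if target in ranked[:k])
    return hits / len(ranked_results)

# Hypothetical retrieval results for three queries.
ranked = [
    ["robot_arm_move", "force_sensor_read", "vessel_segmentation"],
    ["vessel_segmentation", "robot_arm_move", "force_sensor_read"],
    ["force_sensor_read", "vessel_segmentation", "robot_arm_move"],
]
gold = ["robot_arm_move", "robot_arm_move", "robot_arm_move"]

print(recall_at_k(ranked, gold, k=1))  # only query 1 hits at rank 1
print(recall_at_k(ranked, gold, k=2))  # queries 1 and 2 hit within the top 2
```

    Under this definition Recall@k can only grow as k increases, which is why papers usually report it at several k values.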
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the authors succeed in presenting an embodied intelligence for autonomous ultrasound scanning, they fail to present the methodology and the overall setup in a clear and understandable way.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    While the authors answer most questions, reproducibility is still unclear, as some aspects remain hard to understand or follow due to insufficiently detailed descriptions. However, the authors state that they will publish the code, which will most likely help and be valuable. The heavy reliance on a domain-specific knowledge base makes the generalizability, and therefore the translation, of the framework questionable. In the rebuttal, however, the authors clarify that this is the first approach to coupling ultrasound robots with LLMs for optimizing human-robot interaction logic. They therefore provide valuable insights and novel ideas for solving the task, which can benefit the community despite unclear implementation details.



Review #2

  • Please describe the contribution of the paper

    This paper is inspired by the ReAct framework [1], which forms the basis of its dynamic robot execution. The paper proposes an ultrasound robotic system that allows medical staff to command robots verbally. The proposed ultrasound domain knowledge augmentation relies on cosine similarity to measure the similarity between user queries and existing knowledge. The ultrasound assistant prompt is then used to integrate the resulting prompts.
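    ReAct interleaves reasoning steps (Thought), tool calls (Action), and tool results (Observation) in a growing prompt until the model emits a final answer. The loop below is a minimal sketch of that pattern, assuming a scripted stub in place of a real LLM; the `Action: name[arg]` and `Finish[...]` syntax follows the ReAct paper's convention, while the tool names and their outputs are hypothetical and not the authors' APIs.

```python
import re

# Hypothetical tools standing in for robot-side APIs.
TOOLS = {
    "locate_neck": lambda arg: "neck found at (0.42, 0.10, 0.31)",
    "move_probe": lambda arg: f"probe moved to {arg}",
}

ACTION_RE = re.compile(r"Action: (\w+)\[(.*)\]")

def react_loop(llm, task, max_steps=5):
    """Run a ReAct-style Thought/Action/Observation loop.

    llm is any callable mapping the prompt so far to the next
    'Thought: ...\\nAction: tool[arg]' text; 'Finish[...]' ends the loop.
    """
    prompt = f"Task: {task}\n"
    for _ in range(max_steps):
        step = llm(prompt)
        prompt += step + "\n"
        match = ACTION_RE.search(step)
        if match is None:
            prompt += "Observation: could not parse an action\n"
            continue
        name, arg = match.groups()
        if name == "Finish":
            return arg, prompt
        tool = TOOLS.get(name)
        observation = tool(arg) if tool else f"unknown tool {name}"
        prompt += f"Observation: {observation}\n"  # fed back to the model
    return None, prompt

def scripted_llm():
    """Stub that replays fixed steps instead of calling a real LLM."""
    steps = iter([
        "Thought: I need the neck position.\nAction: locate_neck[]",
        "Thought: Move the probe there.\nAction: move_probe[(0.42, 0.10, 0.31)]",
        "Thought: The probe is positioned.\nAction: Finish[probe positioned on neck]",
    ])
    return lambda prompt: next(steps)

answer, trace = react_loop(scripted_llm(), "place the probe on the carotid artery")
print(answer)
```

    Because each Observation is appended to the prompt, a failed or unexpected tool result can change the model's next Thought, which is the kind of dynamic re-planning the paper attributes to this framework.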

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Autonomous ultrasound robot based on large language model is interesting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The method needs large-scale knowledge datasets produced by doctors, robotics specialists, etc., which clearly limits its ability to accomplish more complex ultrasound inspection tasks. It must also rely on core modules that are not described in this paper, such as environment sensing, online robot trajectory planning, and robot obstacle avoidance. This paper is clearly preliminary work applying the ReAct framework [1] to autonomous ultrasound robots.
    [1] Yao S, Zhao J, Yu D, et al. ReAct: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper is preliminary work on US robots, and it relies on many key modules, such as environment sensing, online robot trajectory planning, and robot obstacle avoidance. The authors should elaborate on these modules to support the reproducibility of the paper. In addition, more of the prompt datasets should be shown, and the range of applicability of the US robot must be illustrated.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is preliminary work, and the US robot applies the existing ReAct framework to autonomous ultrasound robots.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces an innovative system that integrates ultrasound robotics with advanced Large Language Models (LLMs) and domain-specific knowledge to improve the precision and efficiency of medical ultrasound scans.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces an innovative system that integrates ultrasound robotics with advanced Large Language Models (LLMs) and domain-specific knowledge to improve the precision and efficiency of medical ultrasound scans. The system improves the understanding and execution of doctors’ verbal instructions by ultrasound robots. The dynamic execution framework that adapts to real-time feedback can minimize errors and improve performance. The experiments presented in the paper appear thorough, demonstrating improvements in performing noninvasive medical procedures based on verbal commands. The application of this technology can improve medical workflows.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The presented model relies heavily on the quality of domain-specific knowledge and the accuracy of the data used to train the LLMs, which could potentially restrict its performance. The paper claims significant advancements in carrying out medical procedures based on verbal commands, but it doesn’t compare its approach with existing methods or technologies in the same field of application.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The description of how LLMs (Large Language Models) integrate with robotic systems could benefit from more detailed explanations of the technical implementations and specific challenges encountered. Such explanations would help readers understand not only the conceptual framework but also the practical aspects of integrating these technologies.
    • Since the models are trained on a specifically tailored dataset, there is a risk of overfitting, where the models perform well on the training data but may not perform as effectively on new, unseen data. To improve the proposed system’s robustness, more diverse and extensive datasets could be used for evaluation.
    • A detailed comparative analysis against current standards could help highlight the system’s actual advancements and limitations.
    • The supplementary video demonstrates the model’s interactions with humans, which is promising. However, the model’s interactions with medical staff could be evaluated further and improved.
    • The authors should address the potential costs associated with implementing such a system and discuss the economic feasibility of widespread adoption.
    • To enable others to replicate the work, the authors may consider making the source code available or providing a detailed description.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea is novel and interesting, but the rating reflects missing details on technical implementation and economic feasibility.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors partially addressed my comment on overfitting but didn’t include details on mitigation plans. Additionally, the comments about potential costs, detailed comparative analysis, replication of the work, and further evaluation of the model’s interactions with medical staff were not answered.




Author Feedback

We sincerely thank the reviewers for their insightful comments and constructive feedback.

[R1#Q1] Where does the knowledge base come from? The knowledge base originates from medical professionals and robotics experts. GPT-4 is used to synthesize data, ensuring independent and identically distributed (i.i.d.) instances for training the embedding model.

[R1#Q2] Why use synthetic data (e.g., APIs) for model training? If the LLM has never seen a real robot API, how does it know its name? Real-world APIs come in all sorts of formats, and synthesizing datasets is the only way we can unify the API formats, which makes our framework easier to implement. The LLM does not need to know the name of an API in advance: information about the API is retrieved from the user instructions by the embedding model via cosine similarity.

[R1#Q3] How is knowledge base data paired with additional information (e.g., data paired with narrative descriptions for API search, data paired with detailed descriptions for robot manuals), and where do these data come from? Are these pairings also used for training? These data are synthetic and are used to train the embedding model.

[R1#Q4] Was the LLM trained directly on the synthetic training data or on the embeddings of the fine-tuned bge-large-en-v1.5? We did not train the LLM.

[R1#Q5] How is information retrieved from the knowledge base (domain-specific knowledge, API and Handbook) transformed into the Ultrasound Assistant prompt? We retrieve the relevant information with the cosine similarity matching algorithm and then combine it according to the structure shown in Fig. 1. Specifically, the API list is combined in the format shown in Fig. 2, and the Handbook in the format shown in Fig. 3.

[R1#Q6] Which training data were used for which model? The training data are used for the embedding model, which is a fine-tuned bge-large-en-v1.5. For the LLM, we use several different models in the experiments.

[R1#Q7] It is unclear whether the ReAct framework is actively coupled with the LLM or whether it “takes over” the task. The ReAct framework is only a prompting strategy; it cannot take over the task.

[R1#Q8] Does the LLM actively invoke the APIs with correct parameters? The LLM actively calls the corresponding API with parameters provided by the robot system.

[R1#Q9] Information about the robotic system: The robotic arm is the RM65-B from RealMan-Robotics, the ultrasound machine is the H20 from Angell, the depth camera is an Intel RealSense, and the force sensor is the M3815B from Sunrise. The control algorithms for the robot system were written by the authors.

[R1#Q10] Which components offer APIs, and how are they connected to communicate with each other? For example, does the ultrasound machine offer APIs that the LLM can use to change ultrasound settings directly? API communication takes place within the program, using Python’s value passing. The ultrasound machine does not offer APIs. Specifically, for the carotid ultrasound sweep experiment, our APIs are: depth camera, force transducer, vessel reconstruction, vessel segmentation, and robotic arm control.

[R1#Q11] What has been done before, and does the contribution lie mainly in the integration of the language models? Our contribution lies in integrating ultrasound robots with large language models to optimize the existing human-robot interaction logic.
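The flow described in the rebuttal (retrieved APIs and Handbook entries combined into the Ultrasound Assistant prompt, the LLM calling an API with parameters, and components communicating through ordinary Python calls) can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, their signatures, the prompt layout, the handbook entry, and the JSON reply format are all hypothetical, since the real APIs and the formats in Figs. 2 and 3 are not reproduced here, and the LLM reply is hard-coded rather than generated.

```python
import json

# Hypothetical stand-ins for the APIs listed in the rebuttal; real signatures are not given.
def read_depth_camera():
    return {"neck_xyz": [0.42, 0.10, 0.31]}

def read_force_sensor():
    return {"force_n": 4.8}

def move_robot_arm(x, y, z, force_n=5.0):
    return f"arm moved to ({x}, {y}, {z}) with {force_n} N contact force"

# Registry: API name -> (callable, description shown to the LLM).
API_REGISTRY = {
    "read_depth_camera": (read_depth_camera, "Return the 3D position of the neck from the depth camera."),
    "read_force_sensor": (read_force_sensor, "Return the current probe contact force in newtons."),
    "move_robot_arm": (move_robot_arm, "Move the probe to (x, y, z) with the given contact force."),
}

def build_assistant_prompt(instruction, api_names, handbook_entries):
    """Combine retrieved API descriptions and handbook text into one prompt (layout assumed)."""
    api_lines = "\n".join(f"- {name}: {API_REGISTRY[name][1]}" for name in api_names)
    handbook = "\n".join(f"- {entry}" for entry in handbook_entries)
    return (
        "You are an ultrasound robot assistant.\n"
        f"Available APIs:\n{api_lines}\n"
        f"Handbook:\n{handbook}\n"
        f"Doctor's instruction: {instruction}\n"
        'Reply with JSON: {"api": <name>, "args": {...}}'
    )

def dispatch(llm_reply):
    """Call the API named in the LLM's JSON reply via an ordinary Python function call."""
    call = json.loads(llm_reply)
    func, _ = API_REGISTRY[call["api"]]
    return func(**call.get("args", {}))

prompt = build_assistant_prompt(
    "Scan the left carotid artery.",
    ["read_depth_camera", "move_robot_arm"],  # e.g. output of the retrieval step
    ["Hypothetical entry: approach the neck slowly before making probe contact."],
)
reply = '{"api": "move_robot_arm", "args": {"x": 0.42, "y": 0.1, "z": 0.31}}'
print(dispatch(reply))
```

Keeping the registry as a plain dictionary of Python callables is one way to make adding a new API (a question Review #1 raises) a matter of registering one more entry; whether the authors' code is organized this way is not stated.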

[R3#Q1] The method requires large knowledge datasets generated by, for example, physicians and roboticists, which would obviously limit the accomplishment of more complex ultrasound tasks. We recognize the need for larger, more accurate datasets for more complex ultrasound tasks.

[R4#Q1] No comparison with other technologies. To the best of our knowledge, at the time of the original submission we were the first to do this type of work on an ultrasound robot, so there was no prior work to compare against.

We will publish the code in the future.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although some concerns remain, the rebuttal has adequately addressed the major issues. Considering the contributions of this paper, it can be presented as an oral presentation at MICCAI 2024.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Based on the reviewers’ feedback and the authors’ rebuttal, I would recommend accepting the paper. While there are concerns about clarity in methodology and reproducibility, the novel integration of ultrasound robotics with advanced language models presents promising advancements for medical procedures. The authors adequately addressed some concerns in the rebuttal, but further clarification of the technical implementation and of the remaining concerns should be provided in the final version.



