Abstract

The advancement and maturity of large language models (LLMs) and robotics have unlocked vast potential for human-computer interaction, particularly in the field of robotic ultrasound. While existing research primarily focuses on either patient-robot or physician-robot interaction, the role of an intelligent virtual sonographer (IVS) bridging physician-robot-patient communication remains underexplored. This work introduces a conversational virtual agent in Extended Reality (XR) that facilitates real-time interaction between physicians, a robotic ultrasound system (RUS), and patients. The IVS agent communicates with physicians in a professional manner while offering empathetic explanations and reassurance to patients. Furthermore, it actively controls the RUS by executing physician commands and transparently relays these actions to the patient. By integrating LLM-powered dialogue with speech-to-text, text-to-speech, and robotic control, our system enhances the efficiency, clarity, and accessibility of robotic ultrasound acquisition. This work constitutes a first step toward understanding how an IVS can bridge communication gaps in physician-robot-patient interaction, giving physicians more control over, and therefore more trust in, physician-robot interaction while improving patient experience and acceptance of robotic ultrasound.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2021_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/2021_supp.zip

Link to the Code Repository

https://github.com/stytim/IVS

Link to the Dataset(s)

N/A

BibTex

@InProceedings{SonTia_Intelligent_MICCAI2025,
        author = { Song, Tianyu and Li, Feng and Bi, Yuan and Karlas, Angelos and Yousefi, Amir and Branzan, Daniela and Jiang, Zhongliang and Eck, Ulrich and Navab, Nassir},
        title = { { Intelligent Virtual Sonographer (IVS): Enhancing Physician-Robot-Patient Communication } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {286--296}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper is on a tele-operated robotic ultrasound imaging system. The claimed novelty is enhanced user-friendliness (for doctors and patients) through avatars, LLMs, and low latency. The system’s hardware and software are thoroughly described and the user experiment is explained. The paper is concisely written and easy to understand. References are carefully chosen and sufficient.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    For strengths, please see my comments below.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    For weaknesses, please refer to my comments below.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    General remarks: Authors should make sure that blanks are in front of all “[”.

    Section-wise comments:
    Abstract: should contain real results, not just tease them.
    1 Introduction: fine and clear.
    2 Method: straightforward structure and easy to read.
    3 Experiment and results: the subjective ratings of the participants remain unclear: what does the scale look like? How was it defined (interspaced)? Which number is “good”, which is “bad”? The black horizontal lines on the blue boxes (median or average? the figure caption does not explain) make for poor contrast. The “communication accuracy” looks more like a rate of the amount of information exchanged between user and patient than a percentage.
    4 Discussion and Conclusion: fine.

    • Is the topic of interest to the MICCAI community? Probably yes
    • Does it present innovative ideas, new insights, or relevant impact? Not top-of-the-pops innovative, but pretty nice work
    • Is the evaluation sound? But remember: it is a conference paper. Yes: well explained experiment
    • Is the paper reproducible (ideally)? Yes
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    see my comments above

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I am confident that the authors will improve their manuscript according to my initial concerns.



Review #2

  • Please describe the contribution of the paper

    The authors introduce an Intelligent Virtual Sonographer (IVS), designed to bridge the interaction gap between the physician, the RUS (robotic ultrasound system), and the patient. For this, the authors leverage LLMs and have the IVS 1) translate the physician’s verbal instructions into robotic ultrasound commands and 2) relay system updates back to the physician. There are therefore two IVS agents, one physician-facing and one patient-facing.

    The contribution is both the system and an evaluation from the perspective of the physician-facing IVS. The authors assess its impact in bridging physician-robot-patient interactions and analyze its usability, communication effectiveness (referred to as interaction quality in the method), perceived intelligence, and overall user satisfaction.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Although the concept of an Intelligent Virtual Sonographer (IVS) is not novel, the novelty of this paper lies in using it on both the patient and physician sides, leveraging LLMs to enable natural-language interaction.

    I commend the authors for having fully developed a system that works beyond the level of a research prototype. Despite its limitations (latency, simple looks in VR), the system appears to be robust.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I have several concerns about the paper.

    1. My first concern is the use case for the IVS, for two reasons: I question the adequacy of the approach, and I believe that it could lead to more errors than a simpler approach.
      • It is unclear why a physician-facing IVS is adequate, as having an IVS mediate communication can lead to errors from misunderstandings and flattens the nuance of what the patient says. For example, the authors report that at times the LLM hallucinated patient-specific information; that is, the IVS made up information about the patient when the physician requested it. Moreover, the IVS removes cues that let physicians pick up nuance in the message, such as verbal manifestations of fear and hesitation conveyed by tone of voice. Simpler approaches, such as simply letting the physician hear the patient, do not have these problems, as the information the physician obtains would be 100% accurate. The authors propose an LLM to solve the problem of remote examination, which introduces more issues than a simpler approach, and there is no discussion whatsoever of simpler alternatives that avoid these issues.
      • The physician has to verbalize commands to control the probe, which is suboptimal: complex commands probably result in wordy sentences (low efficiency), and the LLM may misinterpret a command and wrongly execute it on the patient (low efficacy). A simpler alternative, in which the physician controls the probe through a dedicated interface, is not discussed, yet would probably lead to a faster and more accurate exam.
    2. Some elements are missing for reproducibility:
      • How is the arm mask obtained using skin segmentation? Is this based on color segmentation, depth data, or a trained ML system? How is the wrist location identified?
      • A UNet model [20] is used to segment the blood vessels in real time. How was this model trained? What is its performance (can we trust it)?
      • 14 participants were recruited, of whom 7 were medical doctors, but we do not know who the other 7 are. This is key, because the authors later say “Perceived intelligence was rated xxx and xxx by physicians”. Who are these physicians: the 7 doctors, or all 14 participants? The authors then state: “System usability was rated xxx by novices but lower by physicians.” For the first time we hear about novices. Are these the other 7 participants? If so, why did the authors choose 7 physicians and 7 novices? And these two groups should perhaps be analyzed separately, given their difference in experience.
      • There is no information on how any of the measures are computed. For example, the authors report a communication-effectiveness accuracy of 90.48%, but which formula is used for this computation?
    3. Some claims need supporting evidence: “the robot’s end effector maintains the smallest possible angle with respect to the normal direction of each path point, and, as a result, high-quality ultrasound images can be obtained” - how do we know this? This is just one example; other claims in the paper also need revising.

    4. Potential confound: The authors of this paper acted as patients, which may have biased the study. Do they have enough medical knowledge to be acting as physicians? How did the authors ensure consistency across the participants they guided? Is there an analysis showing that participants under each author acting as physician obtained similar results?

    5. Data analysis issues: The authors compute mean scores on Likert data, which is not good practice, as this scale is ordinal rather than continuous. The correct practice is to report the median.
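    The distinction the reviewer draws can be illustrated with a short sketch using hypothetical ratings (the values below are made up for illustration):

```python
from statistics import mean, median

# Hypothetical 5-point Likert ratings from seven participants
ratings = [5, 5, 4, 4, 2, 1, 1]

# The mean treats ordinal response codes as if they were interval data
print(round(mean(ratings), 2))  # 3.14

# The median is the appropriate summary for ordinal data
print(median(ratings))  # 4
```

    With skewed ordinal data the two summaries can diverge noticeably, which is why the median is preferred for Likert scales.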
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The reason for my recommendation is that the approach the authors have chosen to address the problem (an IVS) might be adding further communication problems (false patient information, plus disruption of the natural communication process between physician and patient). While the approach is still promising, the submission does not discuss the benefits and challenges of the proposed approach with respect to other approaches. Moreover, there is a significant lack of information for reproducing this work.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Thank you, authors, for submitting a rebuttal. I believe that key issues I raised have not been adequately addressed; therefore my recommendation is to reject. I detail my response to the rebuttal below.

    1. use case for the IVS.
      • It is unclear why a physician-facing IVS is adequate. While the authors argue that the IVS could overcome language barriers, this was never the goal of this research project, and it thus seems like an afterthought to justify why this approach could be interesting. If this is the reason the IVS is explored, then there is a whole literature on systems that overcome language barriers in medical consultations that needs to be acknowledged; existing approaches in the literature may be sufficient. I do understand that this paper introduces a concept and that LLMs will only improve with time. However, the LLM interface on the physician side comes without justification and will most likely lead to unnecessary errors, and the authors neither acknowledge nor reflect on other approaches (e.g., directly communicating with the patient). The approach thus seems overly complex, perhaps creating problems rather than addressing them.
      • Probe control through verbal instructions. I acknowledge that the authors have addressed my concerns #2.1 and #2.2 for reproducibility.
      • Participants. I’m OK with not having real patients as participants, as the authors argue in the rebuttal. However, the issues I raised about the participants acting as physicians (7 + 7) remain unanswered. Why are results reported segregating novices and physicians? Are the “novices” a group composed of a mix of physicians and engineers? Why are the authors making this distinction in their analysis? These are fundamental flaws in the data analysis.

    I acknowledge that authors have addressed my concern #2.4.

    My point #3 is partially answered, and I trust authors could revise all of their claims.

    My point #4 remains unanswered. How did authors ensure consistency across trials? Is there data that shows this consistency?



Review #3

  • Please describe the contribution of the paper

    The paper describes technology, and a user study thereof, to facilitate communication between clinician, robot, and patient during remotely performed ultrasound. The technology is well motivated and well described, featuring two locally deployed LLMs facilitating interaction between patient, robot, and clinician. There is an interesting element in that direct communication between patient and clinician is replaced by indirect communication through the virtual sonographer. It would be good if the paper further explored the motivation and consequences of this design choice.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    I very much like the paper’s focus on enabling communication between all participants (clinician/patient/robot) to ensure patient benefit. This is in contrast to many studies which have a more narrow technological focus, and exclude the human involvement. The use of two local LLM to facilitate communication between the three participants is interesting and potentially beneficial.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Importantly, the claims of the paper (quoting from the discussion: “IVS effectively facilitates communication, improving the efficiency and transparency of robotic ultrasound”) are not supported by the data in the paper. Though the paper reports communication accuracy (e.g., 86 to 90%), there is no baseline to compare it to, so an improvement cannot be claimed. The authors either need to report baseline communication-accuracy data (from the literature?) or reword their claims.

    For me the reported accuracy is worryingly low. To paraphrase the report, 1 in 10 communications regarding medical history is wrong and may contain hallucinated data (90% accuracy), and 3 in 20 requests by the patient for the robot to pause or release pressure were ignored (86% accuracy). Both of these failure modes could represent significantly harmful events for the patient, so this needs to be addressed further. The paper appears to be an honest assessment of the system’s current development, but to me it is overly optimistic about such devices’ current safety.

    It’s a shame that the authors didn’t recruit representative patient participants. It appears that the authors used people involved in the research as patients; it would have been better to recruit people more representative of the patient population, or at least people with no involvement in the project.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Figure 2 and the accompanying descriptive text could be improved. It’s not clear from figure 2 what the different arrows and colours mean. Is a straight arrow different to a curvy arrow? I think I get it from the text, but the figure itself could be more precise. Here’s my understanding of the process, to see if I’ve understood correctly. The clinician communicates with a local LLM. The physician side LLM communicates either with the RUS or the patient via the IVS. The patient communicates with a local LLM. The patient side LLM communicates either with the RUS or the clinician via the IVS. So there is no direct communication between clinician and patient? This is quite an interesting and perhaps counterintuitive idea and could be explored further. The general idea of replacing direct communication with structured communication through an intelligent agent is presumably not unique, perhaps the authors can find some other examples in the wider literature to support the proposal.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper investigates interesting ways to facilitate communication and effective treatment that will be of interest to the MICCAI audience. The paper requires improvement (change of claims, discussion of acceptable failure rates and failure modes, more precise technical descriptions) before publication but is generally sound.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal acknowledges some of the paper’s weaknesses. Some of these weaknesses will be addressed by additional technical information and release of the software. Whilst some of the weaknesses remain (high failure rate), the authors have committed to discussing these in a revised submission. The idea of facilitating clinical communication via two LLMs (whilst potentially flawed) is very interesting and would make for a good discussion at MICCAI.




Author Feedback

Dear Reviewers,

We would like to thank you for your positive remarks on the paper’s contribution to enabling communication among all parties in robotic US (R2). We appreciate the recognition of the system’s novelty and robustness (R3) and of the paper’s structure and clarity (R1). We have carefully considered the reviewers’ valuable feedback and provide the following responses:

We acknowledge the reviewers’ comments about our claims and will revise to present IVS as an initial step toward enhanced communication in terms of efficiency and transparency in robotic US (R2/R3). The proposed system does not aim at direct communication between physician and patient (R2), as it reduces robot-patient and physician-robot communication. As healthcare becomes more autonomous, it is crucial for patients to understand the robot’s capabilities to improve acceptance, and equally important for physicians to trust the robot by experiencing its intelligence and ability to discuss clinical subjects. Another point in having the IVS as an intermediary is enabling the system to bridge language barriers by interacting with the patient in one language while reporting findings and discussing them with the physician in another. This capability represents another important and unique advantage of our indirect communication approach.

We would like to emphasize that this paper serves to introduce a novel concept. While current LLMs have limitations in complex medical contexts (R2/R3), rapid advancements are actively addressing challenges in contextual understanding and reliability. Our contribution establishes a framework that will naturally benefit from these advancements. Moreover, with better sensing capabilities such as eye tracking, heart-rate monitoring, and facial-expression or voice analysis, the patient’s emotional cues and mental state could be inferred by the IVS and relayed to the physician (R3).

As the intelligence of the IVS expands, the interaction paradigm can evolve from low-level robot control to higher-level semantic clinical communication. Physicians could request specific anatomical views rather than specifying precise probe manipulations (R3). This approach aligns with clinical workflows where physicians may have diagnostic expertise but less procedural experience than specialized sonographers in acquiring certain US views.

For reproducibility (R1/R2/R3), we will provide code upon acceptance. To clarify implementation details (R3), we employ the trained MediaPipe models SelfieMulticlass for skin segmentation and HandLandmarker for detecting wrist points; both models take RGB images as input. The UNet model (Dice = 0.954 ± 0.012, precision = 0.942 ± 0.021) was trained on 3000 US images from three volunteers.
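As an aside, the Dice coefficient reported above is the standard overlap measure for segmentation masks; a minimal sketch of its computation (not the authors' code) looks like this:

```python
import numpy as np

def dice_score(pred, target, eps=1e-8):
    """Dice coefficient between two binary segmentation masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # 2|A ∩ B| / (|A| + |B|); eps guards against two empty masks
    return 2.0 * intersection / (pred.sum() + target.sum() + eps)

# Toy 2x2 masks: one overlapping pixel, three foreground pixels in total
pred = [[1, 1], [0, 0]]
true = [[1, 0], [0, 0]]
print(round(dice_score(pred, true), 3))  # 0.667
```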

Regarding study design (R2/R3): We recruited 7 doctors and 7 biomedical engineers to act as physicians, while an author served as the patient. We did not recruit representatives of the patient population for two reasons: first, prior research ([22] in the original submission) examined patient experiences with the IVS, allowing us to focus on physician-facing aspects and overall communication; second, our concept prototype did not require real patient participation, which would have necessitated additional ethical approval.

For quantitative data evaluation (R1/R3), we measured accuracy as (total items - errors) / total items, where errors are failed information transfers. Taking patient information as an example: each of the 14 sessions included 3 patient-specific information items, giving 42 items in total; with 4 errors, this yields (42 - 4)/42 = 90.48%.
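The stated formula can be checked with a one-line computation (a sketch; the item counts are those given above):

```python
def communication_accuracy(total_items, errors):
    """Accuracy as the fraction of successfully transferred items."""
    return (total_items - errors) / total_items

# 14 sessions x 3 patient-specific items = 42 items; 4 failed transfers
acc = communication_accuracy(14 * 3, 4)
print(f"{acc:.2%}")  # 90.48%
```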

Lastly, we will clarify that our questionnaire uses a (non-continuous) Likert scale on which higher scores indicate better performance in the respective category (R1). We will report medians for the ordinal data (R3) and enhance Figure 2 with additional caption details. Regarding the probe angle and other claims (R3), we will provide supporting references.

We hope these responses adequately address reviewers’ concerns.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The paper presents an Intelligent Virtual Sonographer (IVS) system, integrating two local large language models (LLMs) to support communication between clinicians, patients, and a teleoperated robotic ultrasound system.

    Summary of Reviews:

    • R1 highlights the paper’s technical completeness and thoughtful design but expresses concerns over the clarity of the evaluation metrics and the subjective results’ interpretation. They recommend clearer descriptions of the rating scale and visuals, and caution against overstated claims.
    • R2 appreciates the human-centered framing and technological integration but raises significant concerns about the validity of the claims (especially communication “improvement”), the low communication accuracy for medical contexts, and the use of non-representative participants.
    • R3 finds the dual-IVS setup and LLM-mediated interaction interesting and robust but shares concerns regarding evaluation and representativeness.

    The authors should consider a rebuttal to address the concerns of the reviewers including:

    • Clarifying claims around communication improvement. Either introduce a comparative baseline or clearly state the absence of one and reword accordingly.
    • Justify or contextualize the acceptability of reported communication accuracy (86–90%) in a medical setting, including any mitigation strategies for potential failures or hallucinated responses.
    • Address the limitations of participant pool, particularly the use of team members as “patients.”
    • Improve explanations around evaluation scales (subjective metrics), clarify visualizations (e.g., figure captions, contrast, meaning of lines), and define terms like “communication accuracy” more precisely.
    • Enhance details on how your system can be reproduced or studied independently (e.g., open-source components, detailed architecture diagrams, demo video link if available).
  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


