Abstract

Understanding medical ultrasound imaging remains a long-standing challenge due to the significant visual variability caused by differences in imaging and acquisition parameters. Recent advances in large language models (LLMs) have been used to automatically generate terminology-rich summaries oriented toward clinicians with sufficient physiological knowledge. Nevertheless, the increasing demand for improved ultrasound interpretability and basic scanning guidance among non-expert users, e.g., in point-of-care settings, has not yet been explored. In this study, we introduce, for the first time, the scene graph (SG) for ultrasound images to explain image content to non-expert users and to provide guidance for ultrasound scanning. The ultrasound SG is first computed using a transformer-based one-stage method, eliminating the need for explicit object detection. To generate a graspable image explanation for non-expert users, the user query is then used to further refine the abstract SG representation through LLMs. Additionally, the predicted SG is explored for its potential to guide ultrasound scanning toward anatomies missing from the current imaging view, assisting ordinary users in achieving more standardized and complete anatomical exploration. The effectiveness of this SG-based image explanation and scanning guidance has been validated on images of the left and right neck regions, including the carotid and thyroid, across five volunteers. The results demonstrate the potential of the method to democratize ultrasound by enhancing its interpretability and usability for non-expert users.
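The abstract describes a pipeline in which predicted SG triplets are serialized into a grounding prompt that an LLM refines in response to a user query. A minimal sketch of that idea is shown below; the entity names, predicate vocabulary, and prompt wording are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch: flatten scene-graph triplets into a grounding
# prompt for an LLM. All names and the prompt template are assumptions
# for illustration, not the paper's real code.
from dataclasses import dataclass

@dataclass
class Triplet:
    subject: str    # detected anatomy, e.g. "carotid artery"
    predicate: str  # spatial/anatomical relation, e.g. "left of"
    obj: str        # related anatomy, e.g. "thyroid"

def build_grounding_prompt(triplets, lateral_side, user_query):
    """Serialize the SG into text that a (lightweight) LLM can consume."""
    relations = "; ".join(f"{t.subject} {t.predicate} {t.obj}" for t in triplets)
    return (
        f"Ultrasound view ({lateral_side} neck). "
        f"Detected relations: {relations}. "
        f"Answer in plain, non-expert language: {user_query}"
    )

sg = [
    Triplet("carotid artery", "left of", "thyroid"),
    Triplet("thyroid", "contiguous to", "cartilage ring"),
]
prompt = build_grounding_prompt(sg, "left", "What am I looking at?")
print(prompt)
```

The key design point, as described in the abstract and rebuttal, is that the SG acts as a compact intermediate representation, so the LLM reasons over a short structured prompt rather than raw pixels.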

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3129_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiXue_Semantic_MICCAI2025,
        author = { Li, Xuesong and Huang, Dianye and Zhang, Yameng and Navab, Nassir and Jiang, Zhongliang},
        title = { { Semantic Scene Graph for Ultrasound Image Explanation and Scanning Guidance } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {502--512}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    Ultrasound Scene Graph Generation: The authors introduce a framework that integrates semantic scene graph (SG) representations with LLMs for ultrasound image explanation and for providing further scanning guidance to non-expert users. They use the SOTA RelTR (transformer-based) model to extract anatomical relationships and generate the SG, which is then fed to an LLM to handle ultrasound image understanding tasks. Using RelTR, the authors predict the SG in a single-stage approach, detecting both the entities (5 different anatomical structures) and predicates (3 different characteristics) simultaneously. This also yields the lateral side and movement, which are then used as input to the LLM in the form of a grounding prompt.

    Ultrasound Understanding Tasks: The authors focus on two main tasks, namely image summarization and scanning guidance. For summarization, the authors focus on a general description, the focus area, and the relationships between entities. For scanning guidance, the authors provide the user query and an instruction prompt to indicate the relative motion direction of the probe.

    Lightweight vs. Heavy LLMs: The authors used both light and heavy models in their experiments to showcase the effectiveness of such an application in the clinical field.

    Carotid Artery Dataset: The paper also introduces 289 ultrasound images acquired by the authors for training and testing purposes. They annotated the dataset using a lightweight annotation tool, which they state will be shared upon acceptance of the paper.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Unique Approach in Ultrasound: The authors note the use of SGs in classical vision applications and surgical data science, but no such work has been done for ultrasound using LLMs, which makes this work worth exploring to identify the real potential of such applications.

    RelTR Utilization: The authors use a single-stage approach, which avoids the need for an explicit object detection task. Also, the identification of lateral side and movement can be useful in many ways, such as identifying abnormal movement.

    Dataset Acquisition: The authors acquired the dataset from different volunteers and then developed a lightweight annotation tool for SG labeling. This tool could indeed be used to annotate triplets in other medical applications as well, and I would be very interested to see how effective and user-friendly it is to interact with.

    LLM Comparative Analysis: The paper also provides a comparison of how the proposed application works with lightweight and heavyweight LLMs.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weak Language: The language of the paper is weak, with many grammatical mistakes. “we first introduce the scene graph (SG) for ultrasound images to explain image content to ordinary and provide guidance” -> what does “ordinary” refer to here? “generating graspable US explanation” -> graspable?

    Lack of Ablations: There is no compelling evidence that the SG is essential; no ablation study shows performance with versus without the SG for the proposed application. I suggest the authors also explain why the SG works better. Also, why wouldn't a simple object detection or triplet detection model work as well?

    Fixed Predicates: The SG construction is predefined manually, with fixed predicates; there is no learning of relations or adaptation.

    Lack of Generalization: Only 289 images were used, which makes the generalization claims limited. Also, how did the authors check for overfitting? Such details are not clearly provided in the paper.

    Limited Evaluations: The evaluation was constrained to the carotid region, which is also why generalization to other anatomies is unproven. Will the proposed method also work for other anatomies if only such a limited set of images is used? How well can the authors relate it to a real-time clinical perspective? Furthermore, the solution is intended for non-expert people, so how can it be helpful if someone is not familiar with terminologies like “Cartilage Ring” or “Common Carotid Artery”? Do the authors have any suggestions to help such non-expert users better understand the ultrasound imagery?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors proposed an interesting application of scene graphs, and it would be better to explore the generalization capability of such a solution.

    Also, the authors are encouraged to validate the overfitting issue of the proposed method as the size of dataset is very limited.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents interesting work that has not yet been explored for ultrasound imagery. However, it lacks necessary ablation studies (discussed in the weaknesses) and has generalization issues, which limit the scope of the proposed work. The authors are encouraged to consider these points in order to make the work more impactful.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors responded to the questions asked and answered most of them. Although the paper lacks some ablations, it still has the potential to benefit the MICCAI community. I therefore suggest the authors further explore the different ablations and improve generalization, as that would also benefit the research community.



Review #2

  • Please describe the contribution of the paper

    The authors propose a method for generating ultrasound (US) image interpretation and scanning guidance. This approach involves object detection using a scene graph (SG) and subsequent image explanation and scanning guidance via large language models (LLMs). The model was trained on 262 carotid artery US images and validated on an additional 27 images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper is well-structured and easy to follow. 2) It addresses a clinically relevant and useful problem. 3) The application of SG and LLM in ultrasound imaging appears to be relatively novel.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The dataset size, particularly with a validation set of only 27 images, may be insufficient and could impact the reliability of the model’s performance evaluation. 2) The authors should clarify the level of expertise of the annotators to ensure the quality and reliability of the ground truth labels.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Consider incorporating a more extensive and diverse dataset to better evaluate the model’s efficacy across various anatomical structures and imaging parameters, such as image size, depth, and resolution.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The issue addressed in the paper could be valuable for non-expert users; however, the performance still requires further consideration.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This study introduces the scene graph for ultrasound images so as to explain image content to non-expert users of ultrasound and provide guidance. The effectiveness is validated on images from the left and right neck regions. The authors claim that this is the first work introducing SG and LLMs to boost intuitive ultrasound explanation and guidance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study is working to address a very clear challenge in the ultrasound imaging space: user-dependency. By leveraging advances in semantic scene graph research in computer vision research, this work aims to create explainability measures and guidance for those with less experience. In my opinion, a major strength is that the authors intentionally address the challenge of how the model will ultimately be deployed by considering its use on resource-constrained portable ultrasound devices, and aim to balance model size and predictive performance. I find this work to be quite novel and could lead to very interesting future work if expanded to other clinical applications with ultrasound, and further validated. Overall, the paper is well written, and claims are cited properly. The figures, especially Figure 1, are very well done and make the overall work very clear. The model implementation is well described, and the team noted that they would eventually make their code available.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    I recognize this is early-stage work, but I believe the paper could benefit from further explanation to contextualize the validation results. The intuitive presentation in Figure 2 is very helpful for understanding deployability, but the evaluation metrics could be explained further so that the reader can better understand the performance results. Specifically, from a clinical perspective, what are the consequences if the LLM is wrong: does this lead to a dangerous misdiagnosis?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I personally find the use of “ordinaries” or “ordinary” to describe users of ultrasound to be nonspecific. I think using terms like, “non-expert users” or something similar would be more appropriate in this context.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Though this work is limited to a relatively small sample size for just one clinical application, I found it to be quite innovative and could have broad impact when expanded upon to other indications in the future.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    As mentioned in my original review, I believe that this is interesting and novel work. I acknowledge that there is limited validation data, but I think this research (though preliminary) is suitable for this conference.




Author Feedback

We thank the reviewers for their valuable feedback. (R2/R1) highly praised our idea of integrating scene graphs (SG) with LLMs to explain US images and guide probe navigation for non-expert users. (R2) noted the challenging problem, novel idea, clear presentation, and potential for broad future impact. (R1) also highlighted the clinical relevance and novelty of using SG and LLMs in ultrasound. (R3) acknowledged the work as a new and interesting application of scene graphs.

Scene Graph Role (R3): The SG provides a compact, semantically structured input that enables LLMs to perform downstream tasks with reduced reasoning complexity. This allows even lightweight LLMs to achieve good results, supporting efficient deployment on portable US devices, as highlighted by R2. Traditional end-to-end vision-language alignment (without SG) in computer vision often requires large-scale triplet datasets, which are scarce for US images. Due to limited data (further discussed in the next response), a direct comparison with SOTA vision-language methods would underperform and risk being misleading. In addition, compared to other representations, object detection lacks relational reasoning, and triplet detection relies on object detection while capturing only part of the graph. Although we use a limited set of predicates, they are adaptable and vary with scanning view, force, and anatomy; e.g., the thyroid may “encase” or be “contiguous to” the cartilage ring, and other structures may appear “superior to” the vertebral body depending on the view. The results of both tasks confirm the effectiveness of SG integration.

Limited Validation/Data (R1/R2/R3): This combination of LLMs, SGs, and US imaging is presented for the first time to tackle the long-standing challenges of explainability and probe guidance for non-experts in POCUS. While larger evaluations across anatomies would bring broader impact, the current dataset size reflects real-world limitations: there are no existing public US SG datasets, and manual annotation of objects and relations is highly labor-intensive without mature tools. To clearly describe all method-related details within the strict MICCAI 8-page limit and demonstrate the feasibility of this novel concept, we selected the CCA and thyroid region (the most common examination region) as a representative application. For this specific setting, a modest dataset (10 scans from both sides of 5 human necks) is sufficient for initial validation. Extending to other anatomies is straightforward by adding anatomy-specific predicates, as only the auxiliary relation transformer requires retraining. We will expand the discussion to acknowledge the current dataset limitations and emphasize the importance of broader validation.

Overfitting (R3): To prevent overfitting, we apply horizontal flipping and Gaussian noise for augmentation, taking care that the spatial predicates remain unchanged. To inspect for potential overfitting, we compared the loss on the training and validation data.

Clinical Risk (R2): Since the system is manually operated by users, incorrect predictions pose no significant safety risk. For Task 1, errors may cause minor knowledge inconsistencies, while for Task 2, they may lead to suboptimal imaging planes without clinical consequences.

Support for Non-Expert Users / Language Clarity (R2/R3): Beyond the suggested changes in language, we also asked a native speaker for a thorough proofreading (R2/R3). To assist non-expert users, the system overlays detection boxes with anatomical labels and allows follow-up queries to the LLM, which provides accessible explanations based on its general medical knowledge (R3).

Real-Time Relevance (R3): We aim to enable real-time scanning guidance on a portable device, where image acquisition and SG generation run at 20 fps. The lightweight LLM is currently used as a back-end service.

Expertise Level of Annotator (R1): All annotations were made by a US expert with 4+ years of experience.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


