Abstract

We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ∼77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.
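
To make the interleaving idea above concrete, here is a minimal PyTorch sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the encoder interfaces, the 1x1-conv fusion, the channel widths, and the top-down decoder are hypothetical placeholders for the Hiera/DINOv2 interleaving and U-Net-style decoding that the abstract describes.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class InterleavedSegmenter(nn.Module):
        """Illustrative sketch (not the authors' code): enrich multi-scale
        Hiera features with DINOv2 patch features, then decode U-Net style."""

        def __init__(self, hiera_encoder, dinov2_encoder,
                     hiera_dims=(96, 192, 384, 768), dino_dim=768, num_classes=2):
            super().__init__()
            self.hiera = hiera_encoder   # assumed to return 4 feature maps, coarse to fine
            self.dino = dinov2_encoder   # assumed to return (B, N, dino_dim) patch tokens
            # 1x1 convs that inject resized DINOv2 features at every Hiera scale
            self.fuse = nn.ModuleList(
                nn.Conv2d(d + dino_dim, d, kernel_size=1) for d in hiera_dims)
            # lightweight top-down (U-Net-style) aggregation
            self.up = nn.ModuleList(
                nn.Conv2d(hiera_dims[i + 1], hiera_dims[i], kernel_size=1)
                for i in range(len(hiera_dims) - 1))
            self.head = nn.Conv2d(hiera_dims[0], num_classes, kernel_size=1)

        def forward(self, x):
            feats = self.hiera(x)                     # list of 4 multi-scale maps
            tokens = self.dino(x)                     # (B, N, C) patch tokens
            b, n, c = tokens.shape
            s = int(n ** 0.5)                         # assumes a square token grid
            dino_map = tokens.transpose(1, 2).reshape(b, c, s, s)
            # interleave: concatenate a resized DINOv2 map at each Hiera scale
            fused = []
            for f, conv in zip(feats, self.fuse):
                d = F.interpolate(dino_map, size=f.shape[-2:], mode="bilinear",
                                  align_corners=False)
                fused.append(conv(torch.cat([f, d], dim=1)))
            # top-down decoding with skip additions, as in a U-Net
            y = fused[-1]
            for i in range(len(fused) - 2, -1, -1):
                y = F.interpolate(self.up[i](y), size=fused[i].shape[-2:],
                                  mode="bilinear", align_corners=False) + fused[i]
            return self.head(y)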

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1458_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

CAMUS dataset: https://www.creatis.insa-lyon.fr/Challenge/camus/
CardiacUDA dataset: https://www.kaggle.com/datasets/xiaoweixumedicalai/cardiacudc-dataset
TN3K dataset: https://drive.google.com/file/d/1reHyY5eTZ5uePXMVMzFOq5j3eFOSp50F/view
DDTI dataset: https://drive.google.com/file/d/1reHyY5eTZ5uePXMVMzFOq5j3eFOSp50F/view
Stanford dataset: https://aimi.stanford.edu/datasets/thyroid-ultrasound-cine-clip
TG3K dataset: https://drive.google.com/file/d/1reHyY5eTZ5uePXMVMzFOq5j3eFOSp50F/view

BibTex

@InProceedings{ZhaXia_Adapting_MICCAI2025,
        author = { Zhang, Xiaoran and Chen, Eric Z. and Zhao, Lin and Chen, Xiao and Liu, Yikang and Maihe, Boris and Duncan, James S. and Chen, Terrence and Sun, Shanhui},
        title = { { Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {23 -- 33}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce a real-time 2D ultrasound segmentation model that leverages the hierarchical vision foundation model used in SAM2 (Hiera) and interleaves its features with those of a second foundation model, DINOv2. The methodology is evaluated on six public datasets as well as one in-house dataset for the segmentation of cardiac and thyroid images. Their key contribution is a novel architecture for ultrasound image segmentation that integrates semantic features from a foundation model with hierarchical features during mask decoding, similar to a U-Net architecture. The authors also emphasize the inference speed of the proposed architecture, which is designed to be suitable for real-time applications.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The manuscript is well organized, written, and clear in terms of the motivation.
    • Integrating the foundation model DINOv2 with the SAM2 encoder, Hiera, is novel and should interest readers.
    • The authors include a detailed qualitative and quantitative evaluation of the model on multiple supervision levels and multiple datasets. The selected models are state-of-the-art for semantic segmentation and suitable for comparison with the proposed approach.
    • The model demonstrates strong performance even in low-resource settings, which are common in the medical field.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The authors selected DINOv2 as a complementary vision foundation model for feature extraction from ultrasound images. While it has demonstrated the ability to represent images with rich semantic understanding, it has not been pre-trained on domain-specific (ultrasound) data. The rationale for choosing DINOv2 as the foundation model is unclear. Were the image encoders of MedSAM2 [52] or SAMUS [23] considered for this integration, or were there issues that prevented their use? Would the proposed model gain more from their image encoders?

    It is stated that the proposed method achieves real-time inference speed. However, the specifics of the environment used for these real-time tests remain unclear: it is unknown whether it simulates resource-limited settings or relies on high-performance GPUs. Given the architectures employed, it is uncertain whether the model can be practically deployed in hospital settings while performing in real time.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    It is not stated whether the codes or the gathered in-house dataset will be made publicly accessible if accepted. Providing access to the codes and resources would enhance the reproducibility of the research and facilitate its use in future studies.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method contains a novel methodology for 2D ultrasound segmentation that utilizes a foundational model for better semantic understanding of the images, and showcases an improved performance over the state-of-the-art methods. The manuscript is well structured and written. There are a few points that should be clarified, but depending on the rebuttal, the paper could be accepted.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    A method is proposed for adapting hierarchical vision foundation models to real-time ultrasound image segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    An adaptive framework was introduced, leveraging the vision foundation model Hiera to extract multi-scale features and interleave them with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No major weaknesses noted.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper adapts hierarchical vision foundation models for real-time ultrasound segmentation, significantly improving performance and robustness while meeting real-time deployment requirements.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a method that interleaves features from the foundation models Hiera and DINOv2 for effective ultrasound segmentation. They perform experiments on six public datasets and one in-house dataset and show improved performance over from-scratch methods as well as other foundation model adaptation techniques for ultrasound images.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed method is effective against other foundation model based methods as well as methods trained from scratch in the authors' extensive experiments, while also allowing for rapid inference.
    2) The proposed method is extremely effective against from-scratch trained methods in low-data settings, as demonstrated in Table 1.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The FPS of the baselines during inference would have been interesting to compare against, since the authors highlight the real-time nature of their work. A single column in a table would be a good addition.
    2) A minor weakness is that the from-scratch methods are extremely limited in settings such as 1% data; they would probably be more effective if pretrained on a large medical image dataset. But this is a trivial issue, as the authors are demonstrating a point with the comparison, so it does not affect my score.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The authors are highly encouraged to release their code and trained models.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors present a well-written manuscript with effective experiments to demonstrate the efficacy of their method against reasonably powerful baselines. I believe it is a good paper for acceptance to MICCAI 2025.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their thoughtful feedback and positive evaluations. We are encouraged by the recognition of our work’s novelty (R3), clarity (R1, R3), and effectiveness (R2, R3). Below, we address the specific comments raised.

  1. Code Availability and Reproducibility (R2 and R3). We recognize the importance of reproducibility and have included detailed descriptions of the algorithm, training setup, and evaluation protocol in the manuscript. We are in the process of releasing source code, trained models, and preprocessing scripts, pending institutional approval. While the in-house dataset cannot be publicly released due to institutional restrictions, we are actively exploring options with our collaborators to provide anonymized samples or to host a benchmarking interface in the near future.

  2. FPS of Baselines (R2). Our method runs at ~30 FPS on a single GPU without TensorRT, slower than lightweight CNNs but faster than foundation model-based approaches like SAMUS (15 FPS) and MedSAM2 (17 FPS) on the same GPU, due to our efficient adaptation of Hiera. We will add a detailed discussion in the next revision.

  3. From-Scratch Methods in the Limited Data Setting (R3). We agree that from-scratch models could benefit from domain-specific pretraining. We intended to highlight the data efficiency of our approach by contrasting it with purely supervised training under identical limited-data constraints.

  4. Choice of DINOv2 over the Image Encoders of MedSAM2 or SAMUS (R3). To clarify, our framework integrates hierarchical vision encoders adapted from SAM2, aligning with the broader family of SAM-based approaches, including MedSAM2 and SAMUS (see Section 2, first paragraph). Incorporating MedSAM2 or SAMUS, both fine-tuned specifically for medical image segmentation, would likely be redundant and unlikely to provide the complementary semantic information necessary to enhance segmentation performance within our framework.

We selected DINOv2 for its generalization capabilities and its ability to capture semantically rich features through self-supervised learning on large-scale image datasets. DINOv2 has demonstrated effectiveness in medical imaging tasks, showcasing its capacity to extract meaningful semantic representations (Song et al., MICCAI 2024; Ayzenberg et al., ISBI 2024). This contrasts with SAM-based encoders, which are primarily optimized for object boundary refinement. In ultrasound imaging, where anatomical boundaries are often ambiguous due to speckle noise, DINOv2's semantically rich features offer a complementary advantage and improve robustness across diverse imaging conditions. The effectiveness of this integration is validated in our ablation study (Table 6 in the main text).
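
For readers unfamiliar with the DINOv2 features discussed here, the snippet below shows how patch-level representations can be extracted with the publicly released DINOv2 weights via torch.hub. The input size and the reshaping into a spatial feature map are illustrative choices, not taken from the paper's pipeline.

    import torch

    # Load a public DINOv2 backbone (ViT-B/14) from torch.hub.
    model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
    model.eval()

    # Input sides must be divisible by the 14-pixel patch size, e.g. 518 = 37 * 14.
    img = torch.randn(1, 3, 518, 518)

    with torch.no_grad():
        out = model.forward_features(img)
        tokens = out["x_norm_patchtokens"]    # (1, 37*37, 768) patch features

    # Reshape tokens into a (C, H, W) semantic map suitable for fusion/decoding.
    feat_map = tokens.reshape(1, 37, 37, 768).permute(0, 3, 1, 2)
    print(feat_map.shape)  # torch.Size([1, 768, 37, 37])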

  5. Real-time Inference Environment (R3). Our experiments were implemented in PyTorch and conducted on NVIDIA L40S GPUs (48 GB), where our approach achieves ~30 FPS (Sec. 4, Implementation Details and Inference Speed). For the TensorRT-optimized version, inference was performed on an NVIDIA RTX A4000 (16 GB), achieving ~77 FPS. To our knowledge, this GPU is commonly deployed in clinical settings, making the results more reflective of real-world usage scenarios.
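
For context on how throughput figures like these are typically obtained, below is a generic PyTorch timing loop with explicit GPU synchronization. The input shape, warmup count, and iteration count are illustrative assumptions, not the measurement protocol used in the paper.

    import time
    import torch

    @torch.no_grad()
    def measure_fps(model, input_shape=(1, 3, 256, 256), warmup=20, iters=200):
        """Generic GPU throughput measurement; not the authors' exact protocol."""
        device = "cuda"
        model = model.eval().to(device)
        x = torch.randn(*input_shape, device=device)
        for _ in range(warmup):          # warm up kernels / CUDA autotuning
            model(x)
        torch.cuda.synchronize()         # flush queued work before timing starts
        t0 = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()         # wait for async GPU work before stopping
        return iters / (time.perf_counter() - t0)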

Once again, we thank the reviewers for their constructive feedback. We believe the additional clarifications and improvements will further strengthen the paper, and we look forward to incorporating these in the final version.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


