Abstract

With access to large-scale, unlabeled medical datasets, researchers are confronted with two questions: Should they attempt to pretrain a custom foundation model on this medical data, or use transfer learning from an existing generalist model? And, if a custom model is pretrained, are novel methods required? In this paper we explore these questions by conducting a case study, in which we train a foundation model on a large regional fetal ultrasound dataset of 2M images. By selecting the well-established DINOv2 method for pretraining, we achieve state-of-the-art results on three fetal ultrasound datasets, covering data from different countries and spanning classification, segmentation, and few-shot tasks. We compare against a series of models pretrained on natural images, ultrasound images, and supervised baselines. Our results demonstrate two key insights: (i) Pretraining on custom data is worth it, even if smaller models are trained on less data, as scaling in natural image pretraining does not translate to ultrasound performance. (ii) Well-tuned methods from computer vision are making it feasible to train custom foundation models for a given medical domain, requiring no hyperparameter tuning and little methodological adaptation. Given these findings, we argue that a bias towards methodological innovation should be avoided when developing domain-specific foundation models under common computational resource constraints.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4487_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/jakobamb/UltraDINO

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AmbJak_General_MICCAI2025,
        author = { Ambsdorf, Jakob and Munk, Asbjørn and Llambias, Sebastian and N. Christensen, Anders and Mikolaj, Kamil and Balestriero, Randall and Tolsgaard, Martin G. and Feragen, Aasa and Nielsen, Mads},
        title = { { General Methods Make Great Domain-specific Foundation Models: A Case-study on Fetal Ultrasound } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a foundation model, UltraDINO, pretrained with DINOv2 on a large-scale dataset of 2 million fetal ultrasound images. Through experiments on two segmentation tasks and one classification task, the authors aim to demonstrate three key findings:

    1. Pretraining on domain-specific data is beneficial, validating the value of large-scale custom medical datasets.
    2. Well-established methods (like DINOv2) can perform effectively without extensive hyperparameter tuning when applied to medical imaging tasks.
    3. Novel or domain-specific pretraining strategies may not be necessary to achieve strong performance: off-the-shelf models can already yield competitive results with minimal adaptation.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper pretrained a foundation model for fetal ultrasound images.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The primary limitation of the paper lies in its lack of novelty, both in methodology and in the research questions posed. The core investigation—evaluating the effectiveness of DINOv2 pretraining on domain-specific fetal ultrasound data—does not offer new insights, particularly given that prior work (e.g., RAD-DINO [1] and Ray-DINO [2]) has already demonstrated the utility of DINO-based self-supervised learning in medical imaging domains. As such, the paper mainly confirms previously established findings on a different dataset, without introducing new techniques, insights, or challenges unique to fetal ultrasound. Even as an application paper, it does not push the boundary of the field.

    Additionally, the experimental design related to the third research question (whether tuning novel pretraining methods is necessary) is unclear and weakly justified. Specifically, Figure 3 and the subsection “Tuning DINOv2 for fetal ultrasound” are difficult to interpret. It is not clearly explained what experimental conditions were varied, what conclusions are intended from the results, or how these results support the claim that tuning is unnecessary. As presented, the analysis lacks rigor and fails to convincingly support the authors’ conclusions.

    Lastly, a significant methodological concern is the lack of experimental repetitions and statistical reporting. All results appear to be reported from a single run, without any indication of variance or confidence (e.g., standard deviation or confidence intervals), and no p-values are computed to support claims of performance differences. This weakens the reliability and reproducibility of the conclusions and makes it difficult to assess whether observed improvements are statistically significant or simply due to chance.

    [1] Pérez-García, Fernando, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, et al. "Exploring Scalable Medical Image Encoders Beyond Text Supervision." Nature Machine Intelligence (2025): 1-12.
    [2] Moutakanni, Théo, Piotr Bojanowski, Guillaume Chassagnon, Céline Hudelot, Armand Joulin, Yann LeCun, Matthew Muckley, Maxime Oquab, Marie-Pierre Revel, and Maria Vakalopoulou. "Advancing Human-Centric AI for Robust X-ray Analysis Through Holistic Self-Supervised Learning." arXiv preprint arXiv:2405.01469 (2024).

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents a large-scale pretraining effort using DINOv2 on fetal ultrasound images, it lacks sufficient novelty in both methodology and research questions. Prior work (e.g., RAD-DINO, Ray-DINO) has already established the effectiveness of DINO-style self-supervised learning in medical imaging, making the findings here largely incremental. Additionally, the experimental design is unclear—particularly in addressing whether tuning is necessary—and key claims are not convincingly supported. The absence of statistical rigor (e.g., no repetitions or p-values) further undermines the reliability of the reported results. Overall, the contribution is too limited for acceptance at this stage.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I agree that application-oriented papers do not necessarily need to introduce novel methods, but they must contribute meaningfully to scientific understanding or research findings. While this paper presents the first foundation model trained on fetal ultrasound data using DINOv2, its key findings largely confirm what has already been demonstrated in prior studies, even those involving different modalities. As such, the claim of being the first to apply DINOv2 to fetal ultrasound is not, in itself, sufficiently innovative to meet the expectations of a MICCAI contribution.



Review #2

  • Please describe the contribution of the paper

    The paper evaluates different strategies for optimizing fetal ultrasound segmentation and classification models when working with a large unlabeled dataset and smaller, task-specific labeled datasets. The authors compare pretraining the foundation model DINOv2 from scratch with transfer learning from existing models pretrained on natural images, various ultrasound tasks, or not pretrained at all. They demonstrate that pretraining DINOv2 from scratch consistently outperforms fine-tuned pretrained models across all evaluated tasks, including few-shot settings. Based on their results, they highlight three key findings:

    1. Domain-specific pretraining is advantageous: even smaller models trained on less data can outperform larger models pretrained on natural images.
    2. Selecting a well-established self-supervised learning method can yield state-of-the-art performance without additional hyperparameter tuning.
    3. A heavy emphasis on methodological novelty in pretraining may be counterproductive.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The key findings, particularly 1 and 2, are well-supported by thorough empirical evidence.

    • Overall, the experimental evaluation is careful, detailed, and credible. The scope of the experiments is appropriate for the length of the paper, and both the choice of pretraining dataset and the task-specific datasets are well-justified. The selected evaluation tasks are relevant and well-motivated.

    • The paper is well-written, with clear and informative figures that effectively support the reader's understanding of the findings.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is unclear why the authors chose to compare their approach to the foundation model USFM, pretrained on ultrasound images, but not to URFM or UltraFedFM. Since UltraFedFM is concurrent work, this omission is understandable, but the authors do not explain why URFM was excluded. Clarifying this choice would strengthen the comparative analysis.

    • It would be highly valuable for the community if the authors planned to release their code, and, most importantly, their pretrained foundation model. It is disappointing that this is not addressed at all, given that the pretrained model arguably constitutes the main contribution of the paper. Furthermore, since the authors do not provide sufficient experimental detail to fully reproduce their results, sharing the model becomes even more important from a reproducibility standpoint.

    • Table 1 appears incomplete. The final six methods lack information about backbone architecture and pretraining data. While it is implied in the text that no pretraining was used, the absence of entries in the table is confusing — especially since one method is explicitly marked “From scratch.” Additionally, classification, few-shot, and segmentation results are missing for these methods, but no explanation is provided. If these evaluations were omitted because they were deemed less relevant, that rationale should be clearly stated in the text.

    • Key Finding 3 is not well supported by the experimental results. The only related evidence is a short paragraph in Section 4, “Tuning DINOv2 for fetal ultrasound,” where the authors state that hyperparameter tuning was challenging and therefore not worthwhile compared to using default settings. This is a strong conclusion based on limited experimentation. It would be more convincing to either provide additional evidence or soften the claim. In fact, the paper would be sufficiently strong if it focused on the first two key findings, which are more thoroughly substantiated.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe this could be a highly valuable paper for the community if the authors plan to release the code and the pretrained foundation model. Without that, the impact is more limited. However, the paper does offer meaningful lessons, particularly through Key Findings 1 and 2, which are well supported by the experiments. If the authors can either provide stronger experimental support for Key Finding 3 or choose to de-emphasize it, and address the other concerns outlined in the weaknesses, especially clarifying missing comparisons and incomplete results, I would be inclined to recommend acceptance, particularly if they commit to releasing the code and pretrained model.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed my main concerns constructively and with appropriate clarity. They explained the rationale behind the missing URFM comparison, clarified the omissions in Table 1, and stated that the presentation will be revised to increase clarity. Most importantly, they confirmed that the code will be made publicly available and that trained model weights will be shared with researchers granted access to the underlying data. I appreciate that the authors are doing what they can to make the models available. I also welcome their decision to revise the framing of the third key finding to better reflect the limited evidence supporting it. I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    This paper introduces UltraDINO, a domain-specific self-supervised foundation model for fetal ultrasound. The key contribution lies in demonstrating that self-supervised techniques developed for general-purpose computer vision, such as DINO, can be applied with minimal domain-specific tuning to train medical foundation models for specific, narrow anatomical regions that generalize across multiple tasks (classification, segmentation, and few-shot learning). The work argues that domain-specific pretraining is beneficial in the medical setting and shows that hyperparameter configurations from general vision models can be reused effectively. The authors evaluate UltraDINO across a variety of ultrasound datasets and tasks to support these claims.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clear Motivation and Scope: The paper presents a well-defined research question and situates the contribution clearly within ongoing discussions on domain-specific foundation models in medical imaging.

    • Thorough Dataset Curation: The authors provide detailed descriptions of the datasets used for training and evaluation.

    • Logical Flow and Clarity: The manuscript is well-structured and easy to follow, from problem formulation through to evaluation and discussion.

    • Transparency Regarding Limitations: A particularly commendable aspect is the authors’ honest and balanced discussion of the limitations of their approach, including transferability of findings.

    • Multi-task Evaluation: The model is tested on diverse tasks (classification, segmentation, few-shot learning), demonstrating its versatility and supporting the claim of UltraDINO as a multi-purpose foundation model for ultrasound.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Choice of Classification Metric: The decision to use the F1-score as the main evaluation metric for classification is not fully justified. Since F1-score requires selecting a specific operating point, it may obscure performance trends across different thresholds. Area under the receiver operating characteristic curve (AUC-ROC) or precision-recall curves could offer a more comprehensive view since these metrics provide evaluations across all operating points. Furthermore, the paper does not specify how the F1-score threshold was selected or whether this was particularly optimized.

    • Single-Seed Evaluation: If the results are indeed based on a single random seed, this limits the robustness of the performance claims. This is particularly important given the known variance of self-supervised training methods (e.g., the iBOT loss), as highlighted by the tuning experiment on DINOv2 (Figure 3). Multiple runs with standard deviations or confidence intervals would be needed to assess the statistical reliability of the results. It would be helpful if the authors could state how many random seeds were used for the evaluation and what the reasoning behind that choice was.

    • Fine-Tuning Protocol: While linear probing is aligned with the DINOv2 protocol, details on full fine-tuning are sparse. Key aspects such as the number of epochs/iterations, batch size, learning rate schedules, and use of early stopping are not described. This makes it difficult to assess the comparability of results or replicate the experiments.

    • Image Resolution and Preprocessing: It is unclear what input image sizes were used for the ViT-S and ViT-B architectures. DINOv2 typically uses 518×518 images, which is not divisible by patch size 16. Were the images resized to 224×224 or padded to a standard size? Clarifying these preprocessing steps for each evaluated model would be helpful to enable a fair comparison of the presented downstream results.

    • Computational Complexity: The paper does not address the training and inference costs associated with the evaluated architectures and training schemes. Including metrics such as FLOPs, runtime, or memory usage would provide a more comprehensive comparison beyond model performance alone and would make it easier to assess the benefits and limitations of each variant.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Lack of Supervised Baseline: Given the relatively small datasets used in evaluation, it would have been helpful to include a fully supervised baseline trained end-to-end. Additionally, a LoRA-based fine-tuning setup (as a more lightweight adaptation method) would provide further context for understanding how UltraDINO performs relative to typical strategies in data-limited medical tasks. Although there is no time/space to do this in this paper, this might be something to consider for future work.

    • No Evaluation on Large-Scale US Dataset: The model’s effectiveness is demonstrated on several curated datasets, but an evaluation on a large-scale ultrasound dataset would be valuable to assess whether the observed performance gains translate to real-world settings. Again not feasible to do here, but maybe the authors want to consider this for future work.

    • Manuscript Quality: The paper contains some minor typos (e.g. under section 3.3 Classification: “validaiton”) and incomplete sentences (e.g. under I Introduction: “URFM [15] combines MIM 1 million ultrasound images of different organs with knowledge distillation from a BiomedCLIP [26] model.”), which should be addressed.

    • Terminology Use: In the conclusion, the word “significantly” is used to describe performance improvements. Since only one run appears to be reported and no statistical tests are shown, this wording should be revised unless further evidence is provided.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper combines methodological soundness with clarity and relevance to the community. The evaluation is extensive across tasks and the core claim, that general self-supervised pretraining methods can be translated effectively to medical domains without major hyperparameter changes, is both well-argued and practically useful. Still, the work has some limitations: results are based on a single random seed and important implementation details (e.g., fine-tuning, image resolution, computational complexity) are missing. These gaps make it difficult to fully assess the robustness and generalizability of the findings and require additional clarification.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    After reviewing the authors’ rebuttal and considering the reviews and concerns raised by the other reviewers, I find that the authors have satisfactorily addressed my main points, and I am now inclined to recommend acceptance. Their responses were clear, well-argued, and reflected a strong understanding of the technical issues raised in the initial review.

    While the other reviewers raise reasonable concerns, I believe these do not outweigh the strengths of the paper. The work makes a meaningful contribution, is technically sound, and is thoroughly evaluated. The rebuttal significantly strengthened the case for acceptance, and I believe the paper will be of interest to the community.




Author Feedback

We would like to thank all reviewers for their thorough and constructive evaluation. We are glad to see our contribution recognized as combining “methodological soundness with clarity and relevance to the community” [R3], and “offering meaningful lessons, particularly through Key Findings 1 and 2, which are well supported by the experiments” [R1]. UltraDINO is the first foundation model trained on a large fetal US dataset, reaching SOTA performance using well-established methods (KF1), while providing insights on the lack of scalability of natural image datasets (KF2), “which are well supported by the experiments” [R1].

We are especially grateful for the constructive criticism, to which we respond in detail below. We believe these amendments substantially improve the quality of the manuscript:

[R1,2,3 KF3 not well supported] We agree that the evidence presented for KF3 in the paper is currently anecdotal in quality; we have therefore decided to weaken the claims throughout the paper. This demotes KF3 to a more nuanced discussion, which in turn better highlights the well-supported KF1 and KF2. We further add the computational requirements of PT and the hyperparameter search (FLOPs, runtime) to the results and discussion.

[R1 Code and weights] We are publicly releasing the complete pretraining (PT) and finetuning (FT) codebase with the published paper. For legal reasons, model weights are shared with researchers who are granted access to the privacy-sensitive PT data (after application to a third party). Details on this process are omitted to preserve the anonymity of the submission.

[R1,2,3 Multiple seeds and folds] The reported FT results are averaged across 3 folds, while few-shot results use 5 folds; this was not stated in the original submission and represents a major oversight on our end. A result x_i is bolded only when, for all k != i, it holds that x_i - SEM_i > x_k + SEM_k. We add SEM to the table and limit usage of "significant" to its statistical meaning. The SEM for UltraDINO-B is 0.38 for few-shot FASS and 0.11 for full FASS; other SEMs are similar.
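
For concreteness, the bolding rule can be sketched as follows (a minimal sketch with hypothetical means and SEMs, not our reported results; the function name bold_mask is illustrative only):

    # Minimal sketch of the bolding rule above (hypothetical values).
    # A result is bolded only if its mean minus its SEM exceeds every other
    # method's mean plus that method's SEM.
    def bold_mask(means, sems):
        bold = []
        for i, (m_i, s_i) in enumerate(zip(means, sems)):
            # Upper bounds (mean + SEM) of all other methods.
            others_upper = [m_k + s_k
                            for k, (m_k, s_k) in enumerate(zip(means, sems)) if k != i]
            bold.append(all(m_i - s_i > upper for upper in others_upper))
        return bold

    # Hypothetical example with three methods:
    print(bold_mask([85.2, 86.9, 84.7], [0.30, 0.38, 0.25]))  # -> [False, True, False]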

[R1,2 Incomplete results table] We understand that the presentation of results is unclear; this will be improved. In general, the last six methods (incl. ViTs) are non-pretrained baselines; we therefore omitted linear-probing performance. Further, segmentation architectures (e.g., UNets) are not evaluated on classification. We omit few-shot results for the weaker non-PT baselines because their performance is generally much lower than that of nnUNet. URFM results are missing because code and checkpoints are still not publicly available.

[R1,2,3 Lack of training details] Full recipes are currently omitted due to space constraints; we instead provide full configurations with our code release. We add information on the image resolution to the methods section (224 px for all models, for a fair comparison on segmentation tasks).

[R3 Choice of F1] We chose F1 without operating-point optimization for two reasons: (i) the dataset is highly imbalanced, leading to poor separation when using AUROC (e.g., UltraDINO ViT-B 0.9940 vs. ViT-S 0.9947), while the ranking of methods remains the same as with F1; (ii) it follows the convention of previous works and respects the space constraints of the table.
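
To illustrate the saturation effect on imbalanced data, a small synthetic sketch (illustrative only, assuming numpy and scikit-learn; this is not our evaluation code or data):

    # Synthetic illustration: on a highly imbalanced test set, two hypothetical
    # models can have AUROC values within a few thousandths of each other, while
    # F1 at a fixed 0.5 threshold (no operating-point tuning) still separates them.
    import numpy as np
    from sklearn.metrics import f1_score, roc_auc_score

    rng = np.random.default_rng(0)
    n_pos, n_neg = 100, 9900                       # roughly 1% positive class
    y_true = np.concatenate([np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)])

    def evaluate(pos_logit_mean):
        # Hypothetical classifier: logits drawn from normals; higher mean = better model.
        logits = np.concatenate([rng.normal(pos_logit_mean, 2.0, n_pos),
                                 rng.normal(-6.0, 2.0, n_neg)])
        probs = 1.0 / (1.0 + np.exp(-logits))
        y_pred = (probs >= 0.5).astype(int)        # default threshold, no tuning
        return f1_score(y_true, y_pred), roc_auc_score(y_true, probs)

    for name, mu in [("model A", 2.0), ("model B", 3.5)]:
        f1, auc = evaluate(mu)
        print(f"{name}: F1 = {f1:.3f}, AUROC = {auc:.4f}")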

[R2: Lack of contribution] We respectfully disagree with R2's critique that our paper does not offer methodological innovation and lacks novel insights compared to RAD-DINO and RAY-DINO. Firstly, this work is submitted as an application study not intended to introduce novel methods, in line with this conference track. Secondly, the paper differs from previous studies in several aspects: (1) we present the first study applying DINOv2 to fetal US, investigating clinically relevant tasks; (2) the fetal US dataset is of unprecedented size, offering insights into scaling PT data; (3) we demonstrate that the performance of natural image models does not scale with increased PT dataset size, indicating hard limits for using general-purpose models for medical vision.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper investigated DINO on fetal ultrasound images. Several concerns led to my recommendation.

    1. The paper directly applies DINO to fetal ultrasound, and its findings have already been confirmed by other papers, as raised by Review #2.
    2. The paper lacks generalizability. Although it claims to address domain-specific challenges, it focuses solely on fetal ultrasound imaging, which is only a subdomain of the broader ultrasound modality. As a result, the proposed findings and methods may not generalize to other ultrasound applications, such as breast cancer imaging or other anatomical sites.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although the paper has one reject, the other reviewers have recognized the potential of the work due to the nature of the study, and based on their assessments, I recommend acceptance.


