Abstract

Congenital Heart Disease (CHD) is one of the leading causes of fetal mortality, yet the scarcity of labeled CHD data and strict privacy regulations surrounding fetal ultrasound (US) imaging present significant challenges for the development of deep learning-based models for CHD detection. Centralized collection of large real-world datasets for rare conditions such as CHD from large populations requires significant coordination and resources. In addition, data governance rules increasingly prevent data sharing between sites. To address these challenges, we introduce, for the first time, a privacy-preserving, zero-shot CHD detection framework that formulates CHD detection as a normality modeling problem integrated with model merging. In our framework, dubbed Sparse Tube Ultrasound Distillation (STUD), each hospital site first trains a sparse video tube-based self-supervised video anomaly detection (VAD) model on normal fetal heart US clips with a self-distillation loss. This enables site-specific models to independently learn the distribution of healthy cases. To aggregate knowledge across the decentralized models while maintaining privacy, we propose a Divergence Vector-Guided Model Merging approach, DivMerge, that combines site-specific models into a single VAD model without data exchange. Our approach preserves rich, domain-agnostic spatio-temporal representations, ensuring generalization to unseen CHD cases. We evaluated our approach on real-world fetal US data collected from 5 hospital sites. Our merged model outperformed site-specific models by 23.77% and 30.13% in accuracy and F1-score, respectively, on external test sets.
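The abstract names DivMerge but defines it only at a high level. The sketch below shows one plausible reading, assuming the geometric median of the flattened site-specific parameter vectors is found with Weiszfeld iterations and each model is weighted inversely to the magnitude of its divergence vector from that median; the function names, softmax weighting, and temperature are illustrative assumptions, not the authors' implementation (which also involves selective parameter retention, omitted here).

```python
import torch

def geometric_median(stacked, iters=50, eps=1e-8):
    """Weiszfeld iterations for the geometric median of flattened
    parameter vectors (shape: [num_models, num_params])."""
    median = stacked.mean(dim=0)
    for _ in range(iters):
        dists = (stacked - median).norm(dim=1).clamp_min(eps)
        inv = 1.0 / dists
        median = (inv[:, None] * stacked).sum(0) / inv.sum()
    return median

def divmerge(state_dicts, temperature=1.0):
    """Merge site-specific models of identical architecture, weighting
    each one inversely to its divergence from the geometric median."""
    keys = list(state_dicts[0].keys())
    flat = torch.stack([
        torch.cat([sd[k].flatten().float() for k in keys])
        for sd in state_dicts
    ])
    median = geometric_median(flat)
    divergence = (flat - median).norm(dim=1)               # per-model divergence magnitude
    weights = torch.softmax(-divergence / temperature, 0)  # closer to median -> larger weight
    merged_flat = (weights[:, None] * flat).sum(0)
    merged, offset = {}, 0                                 # unflatten into a state dict
    for k in keys:
        ref = state_dicts[0][k]
        merged[k] = merged_flat[offset:offset + ref.numel()].view_as(ref).to(ref.dtype)
        offset += ref.numel()
    return merged
```

Down-weighting models far from the geometric median damps outlier sites, which matches the robustness intuition behind using a median rather than a mean.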

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4958_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: https://papers.miccai.org/miccai-2025/supp/4958_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{SahPra_Selfsupervised_MICCAI2025,
        author = { Saha, Pramit and Mishra, Divyanshu and Hernandez-Cruz, Netzahualcoyotl and Patey, Olga and Papageorghiou, Aris T. and Asano, Yuki M. and Noble, J. Alison},
        title = { { Self-supervised Normality Learning and Divergence Vector-guided Model Merging for Zero-shot Congenital Heart Disease Detection in Fetal Ultrasound Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The study presents a self-supervised framework for congenital heart disease detection with privacy-preserving multi-center data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors introduce a Divergence Vector-guided Model Merging strategy to selectively update the teacher model during training.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Comparative experiments are limited; could the authors discuss more image- and video-level unsupervised anomaly detection methods to better illustrate the task’s difficulty?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1. An automatic cropping model is mentioned (data section) but not explained: how was it trained, and was multi-center data used?
    2. In Fig. 3, is the example image normal or abnormal, and is it cropped? The image edge shows activation; can the full activation map be provided? Visually, the images appear similar, though some focus on the septum while others do not. Is this due to a specific threshold or parameter?
    3. What does “a sampling rate of 3” mentioned on page 6 refer to?
    4. Several tubes are used, but no ablation study is provided. Why?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript is well motivated, proposing a parameter update strategy to preserve data privacy across sites, but includes few comparative algorithms in the experiments.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    While the authors have provided a rebuttal, it does not resolve my concerns about the completeness of the comparative and ablation studies. I am therefore unable to recommend acceptance.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel privacy-preserving, zero-shot framework for detecting congenital heart disease in fetal ultrasound videos. The method was validated on a multi-site dataset and achieved promising results.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The application of this paper is new and interesting.
    2. The methodology is novel.
    3. This paper is well-written.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The experiments could be improved.
    2. The mechanism for zero-shot anomaly detection is not thoroughly explained.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. The experiments could be improved
      • While this paper compares its model with several state-of-the-art merging techniques, there is no comparison with federated learning approaches, which are also designed to address privacy concerns in distributed settings.
    2. The mechanism for zero-shot anomaly detection is not thoroughly explained
      • The paper mentions using a KNN classifier on extracted features, but doesn’t elaborate on the detailed hyperparameter setup.
      • Considering that the model achieved superior performance in zero-shot anomaly detection, the relevant methodology and experiments could be described in more detail.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel, technically sound approach to addressing a real-world challenge in CHD detection. Overall, it is good.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed all my concerns, I have no further comments.



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel, zero-shot CHD detection pipeline that (rightfully) models CHD as an anomaly in fetal ultrasound (US) videos. As the first step, a site-specific video anomaly detection network is trained on healthy US videos using sparsely sampled space-time tubes and a self-distillation loss. Second, the site-specific models are combined in a weighted fashion using specially computed divergence vectors: the magnitude of each model's divergence from the geometric median dictates its weight in the combined framework. Data was available from five distinct sites. Three individual site-specific models were trained, and the remaining two sites were used for zero-shot testing. Evaluations compared the individual and combined models with two SOTA models, and also compared five other combination strategies against the proposed DivMerge strategy.
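    The paper describes the site-specific training only in prose. Below is a minimal sketch of a generic DINO-style self-distillation step, under the assumption that STUD follows the usual teacher-student recipe (sharpened teacher targets, EMA teacher update); all names and hyperparameters are illustrative, and the paper's tube sampling and selective teacher updating are not reproduced here.

```python
import torch
import torch.nn.functional as F

def self_distillation_step(student, teacher, clip_views, optimizer,
                           tau_s=0.1, tau_t=0.04, ema_m=0.996):
    """One self-distillation update on two augmented views of a normal
    fetal-heart clip. The teacher is a no-gradient EMA copy of the
    student; its sharper softmax (lower temperature) provides targets."""
    view_a, view_b = clip_views
    with torch.no_grad():
        t_a = F.softmax(teacher(view_a) / tau_t, dim=-1)
        t_b = F.softmax(teacher(view_b) / tau_t, dim=-1)
    s_a = F.log_softmax(student(view_a) / tau_s, dim=-1)
    s_b = F.log_softmax(student(view_b) / tau_s, dim=-1)
    # cross-view distillation: teacher on one view supervises the student on the other
    loss = -(t_a * s_b).sum(-1).mean() - (t_b * s_a).sum(-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():  # EMA teacher update
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_m).add_(p_s, alpha=1.0 - ema_m)
    return loss.item()
```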

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is very well written and clearly structured. Each architecture decision is clearly justified.
    • Formulating CHD detection as a video anomaly detection problem solves the challenge of having limited CHD US training data.
    • The Sparse Tube Ultrasound Distillation (STUD) model gives a (comparatively) lightweight model for a ViT; see the tokenizer sketch after this list.
    • The DivMerge procedure combines geometric median computation, dynamic model weighting, and selective parameter retention in a way that makes sense computationally.
    • The comparison results show a significant improvement considering the other SOTA models are computationally more expensive.
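    For intuition on why sparse tubes keep the ViT lightweight, here is a minimal TubeViT-style tokenizer sketch; the tube shapes and strides are hypothetical, chosen only to show that a clip yields far fewer tokens than dense patch embedding.

```python
import torch
import torch.nn as nn

class SparseTubeTokenizer(nn.Module):
    """Illustrative multi-tube tokenizer: each Conv3d samples space-time
    "tubes" of a different shape with large strides, so a clip produces
    far fewer tokens than dense tubelet embedding."""
    def __init__(self, embed_dim=768):
        super().__init__()
        self.tubes = nn.ModuleList([
            nn.Conv3d(3, embed_dim, kernel_size=(8, 8, 8),   stride=(16, 32, 32)),
            nn.Conv3d(3, embed_dim, kernel_size=(16, 4, 4),  stride=(6, 32, 32)),
            nn.Conv3d(3, embed_dim, kernel_size=(4, 12, 12), stride=(16, 32, 32)),
        ])

    def forward(self, clip):                                # clip: (B, 3, T, H, W)
        tokens = []
        for tube in self.tubes:
            feat = tube(clip)                               # (B, D, t, h, w)
            tokens.append(feat.flatten(2).transpose(1, 2))  # (B, t*h*w, D)
        return torch.cat(tokens, dim=1)                     # sparse token sequence

clip = torch.randn(1, 3, 32, 224, 224)                      # 32-frame clip
print(SparseTubeTokenizer()(clip).shape)                    # 343 tokens vs ~3136 for dense 2x16x16 tubelets
```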
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The model architecture needs a large amount of data for training; without it, performance suffers significantly, as can be seen with the site-specific models: site 1 (8878 training), site 2 (16074 training), site 3 (1573 training). Model 1 struggles to detect normal cases, while model 3 classifies all cases as abnormal in zero-shot testing. Individual model 1's precision and F1 scores on site 1 data are also quite poor. On the other hand, individual model 2 outperforms the DivMerge model on site 2 data.
    • So this model benefits the data-starved sites but only if there are at least a few data-rich sites to train on.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The work presents a novel, privacy-preserving, zero-shot CHD detection framework.
    • The DivMerge framework takes advantage of the individual trends in the site-specific models making it a much stronger model than a single model trained on all available data.
    • See strengths.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

R2, Q1. No comparison with federated learning (FL) approaches: Our approach assumes a single communication round, with fully trained local models available at each site. By contrast, FL frameworks require multiple communication rounds and iteratively update models, so a direct comparison is not appropriate. We will add a comment on this in the revised paper.

R2, Q2. Hyperparameter setup and experimental details: For zero-shot evaluation, we extract CLS-token features from the STUD teacher encoder and use a weighted kNN classifier with k=20, chosen for optimal F1-score on the validation set. This information will be incorporated into the revised paper.

R3, Q1. This approach helps data-scarce sites, but only when supported by data-rich sites: We agree with the reviewer's comment. Our approach is specifically designed to boost performance at data-scarce sites by leveraging knowledge from data-rich sites through model merging, which is a core motivation of our work. We will clarify this in the revised paper.

R4, Q1. More image- and video-level unsupervised anomaly detection methods to better illustrate the task's difficulty: One of our main contributions is to show that our proposed STUD, based on sparse tube sampling, achieves comparable or superior anomaly detection performance while requiring only a fraction of the tokens used by SOTA video models such as VideoMAE. Most image-level unsupervised anomaly detection methods, such as autoencoders and contrastive learning, operate on single frames and cannot leverage temporal information, making them insufficient for capturing the dynamic and subtle anomalies present in fetal ultrasound. For video, masked modeling approaches such as VideoMAE [1,2] focus on reconstructing low-level features and typically process only short clips, limiting their ability to model high-level semantic patterns and long-range temporal relationships. Contrastive learning methods [3,4] also struggle with high inter-frame similarity and limited augmentation options, resulting in less discriminative representations. Our comparison is designed to benchmark STUD against a representative state-of-the-art model in terms of both performance and parameter efficiency.

R4, Q2. Explanation of the automatic cropping model: The cropping model mentioned in our paper is a 3D ResNet-18, trained on 738 ultrasound scans of 401 participants from one site; its robustness was evaluated on data from a second site [5]. The code and trained model are available at https://github.com/QianyeYang/FEHT.

R4, Q3. Sampling rate of 3: This refers to the frame sampling rate: we select one frame out of every 3 consecutive frames to reduce temporal redundancy.

R4, Q4. Clarification of Fig. 3: The feature maps shown in Fig. 3 are derived from the cropped images obtained by applying the cropping model. All feature maps are visualized using the same threshold and visualization parameters to ensure fairness. Regions such as the septum appear more prominent because the model's attention scores are higher in these areas, highlighting its ability to focus on important anatomical landmarks relevant to anomaly detection.

R4, Q5. Sparse-tube ablation missing?: The efficiency and optimal number of tubes have already been extensively investigated in previous work and validated in our preliminary experiments. We will add ablation results in the revised paper.

[1] Tong et al. VideoMAE: Masked Autoencoders Are Data-Efficient Learners for Self-Supervised Video Pre-Training. NeurIPS 2022.
[2] Fan et al. Motion-Guided Masking for Spatiotemporal Representation Learning. ICCV 2023.
[3] He et al. Momentum Contrast for Unsupervised Visual Representation Learning. CVPR 2020.
[4] Oord et al. Representation Learning with Contrastive Predictive Coding. arXiv:1807.03748, 2018.
[5] Yang et al. A Deep Learning Framework for Fetal Heart Tracking in Ultrasound Videos: Toward Enhanced Congenital Heart Defects Detection. FIMH 2025 (accepted).
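For concreteness, here is a minimal sketch of the weighted kNN evaluation described in the feedback above (R2, Q2): cosine similarity between CLS-token features and a labeled feature bank, with exponentially weighted votes over the k=20 neighbors, following the standard DINO kNN recipe. The temperature and function names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def weighted_knn_predict(test_feats, bank_feats, bank_labels,
                         k=20, temperature=0.07, num_classes=2):
    """Weighted kNN over CLS-token features: cosine similarity to a
    labeled feature bank, with exp(sim / T) vote weights. k=20 follows
    the rebuttal; the temperature is a hypothetical default."""
    test_feats = F.normalize(test_feats, dim=1)
    bank_feats = F.normalize(bank_feats, dim=1)
    sims = test_feats @ bank_feats.T                  # (N_test, N_bank) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=1)
    topk_labels = bank_labels[topk_idx]               # (N_test, k), long dtype
    weights = (topk_sims / temperature).exp()
    votes = torch.zeros(test_feats.size(0), num_classes)
    votes.scatter_add_(1, topk_labels, weights)       # weighted class votes
    return votes.argmax(dim=1)                        # e.g. 0 = normal, 1 = CHD

# usage with hypothetical precomputed features:
# preds = weighted_knn_predict(test_cls, train_cls, train_labels)
```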




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


