Abstract

Echocardiography is a vital non-invasive modality for cardiac assessment, with left ventricular ejection fraction (LVEF) serving as a key indicator of heart function. Existing LVEF estimation methods depend on large-scale annotated video datasets, which are costly and limit adaptability across various clinical settings. Recent vision-language models for echocardiography, such as EchoCLIP, apply image-to-text pretraining but fail to capture crucial temporal dynamics and localized cardiac structures essential for accurate diagnosis. To address these challenges, we propose CardiacCLIP, a video-based framework that enhances LVEF prediction through attention-based frame aggregation and multi-resolution input scaling. Specifically, we introduce MFL (Multi Frame Learning), a novel attention-based mechanism for selectively fusing informative frames, and EchoZoom, a multi-scale feature extraction strategy that refines spatial representations of cardiac structures. As a novel adaptation of CLIP models for few-shot echocardiogram video analysis, our approach significantly improves diagnostic accuracy, reducing MAE by 2.07 on the EchoNet-Dynamic dataset under the 1-shot setting.

The code is available at https://github.com/xmed-lab/CardiacCLIP.
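To make the two components above concrete, here is a minimal PyTorch sketch of attention-based frame aggregation and multi-scale input encoding. This is an illustration under stated assumptions, not the authors' implementation: the names (`AttentionFramePool`, `echo_zoom_features`), the gated-attention form, and the averaging fusion are all hypothetical, and `encoder` stands in for a CLIP image encoder mapping a batch of frames to feature vectors.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFramePool(nn.Module):
    """Gated attention pooling over per-frame features (in the spirit of
    attention-based MIL); a hypothetical stand-in for the paper's MFL module."""

    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.V = nn.Linear(dim, hidden)  # tanh branch
        self.U = nn.Linear(dim, hidden)  # sigmoid gating branch
        self.w = nn.Linear(hidden, 1)    # scalar attention score per frame

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) -> pooled (B, D), attention weights (B, T)
        scores = self.w(torch.tanh(self.V(frame_feats)) *
                        torch.sigmoid(self.U(frame_feats)))      # (B, T, 1)
        attn = torch.softmax(scores.squeeze(-1), dim=1)          # (B, T)
        pooled = torch.einsum("bt,btd->bd", attn, frame_feats)   # weighted sum
        return pooled, attn


def echo_zoom_features(encoder, frames, scales=(1.0, 1.5)):
    """Encode every frame at several zoom levels (scales >= 1.0) and average
    the resulting features; a hypothetical stand-in for EchoZoom."""
    B, T, C, H, W = frames.shape
    flat = frames.reshape(B * T, C, H, W)
    feats = []
    for s in scales:
        zh, zw = int(H * s), int(W * s)
        x = F.interpolate(flat, size=(zh, zw), mode="bilinear",
                          align_corners=False)
        top, left = (zh - H) // 2, (zw - W) // 2
        x = x[..., top:top + H, left:left + W]   # center-crop back to (H, W)
        feats.append(encoder(x))                 # (B*T, D)
    return torch.stack(feats).mean(dim=0).reshape(B, T, -1)  # (B, T, D)
```

The pooled clip-level embedding can then be matched against text embeddings exactly as a single CLIP image embedding would be.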

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0034_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/xmed-lab/CardiacCLIP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DuYao_CardiacCLIP_MICCAI2025,
        author = { Du, Yao and Guo, Jiarong and Li, Xiaomeng},
        title = { { CardiacCLIP: Video-based CLIP Adaptation for LVEF Prediction in a Few-shot Manner } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {45--55}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce an adapted CLIP model specialized for LVEF prediction based on LVEF-relevant clinical text and echocardiograms. They introduce an attention-based aggregation scheme to selectively fuse frame representations from the CLIP image encoder (Multi-Frame Learning). In addition, they fuse features generated by inputting the EchoNet frames at multiple scales. They evaluate their method in the few-shot setting using the EchoNet dataset. They compare with traditional methods that do not incorporate textual information and other CLIP-based methods that do not contain the multi-frame learning mechanism or multi-resolution feature fusion.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a reasonable and apparently effective method for adapting CLIP to the video setting by aggregating frames via the attention mechanism in ref [14]. They also introduce a "multi-resolution scaling strategy".

    The evaluations effectively demonstrate that their method works well compared to SOTA methods in the low-data regime (where low-data means few samples per EF bin). They also perform a thorough ablation study, demonstrating the importance of each component in their model.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    From my perspective, the authors do not adequately justify why they frame EF prediction as an ordinal regression problem via classification and refinement. This choice feels at least as significant as many of the variants in their ablation study, but they do not include it in their tests. In connection with this choice, they frame their work as "few shot" (even in the title), where the sense of "few shot" seems to be that there aren't many samples per ejection fraction bin. But since these bins are artificially introduced in order to convert the problem into an ordinal regression, it seems that "few shot" is also artificial. The use of this term might mislead many to think that only a few total clips are used to condition a model for effective EF prediction, which is not true (the smallest variant has 84 clips). Better to frame this as the "low data regime", not "few shot".

    While the adaptation of CLIP to video settings is constructive, the current study only considers the use of this architecture on the EchoNet dataset. This limits the degree to which more general conclusions can be made. Ultrasound video is used across many anatomies and tasks, and "mileage may vary". Compared to many US datasets, EchoNet is relatively clean with consistent views and relatively low artifacts. It is good that the authors explore the low-data regime, but it would be a much stronger work if they explored a broader range of medical video tasks in order to understand the general utility of the approach in more realistic settings.

    A more minor point, but the authors discuss the use of "GPT-4 to generate diverse descriptions corresponding to LVEF intervals, enhancing data efficiency and serving as a form of text data augmentation during training". How significantly does this impact the performance of the resulting model? This choice might be highly significant, but we can't tell because it is not included in the ablation study.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I couldn't make sense of the clause "where text embeddings serve as classifier weights" in section 2.1. What does this mean? Perhaps the meaning should be clarified in the text.

    On page 4, the text says, "Loss 1 should be updated with Loss 5". I think these are references to equation numbers, but the sentence is hard to follow as written.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the application domain is narrow and there are weaknesses in the problem framing as well as open methodological questions, the overall idea of adapting CLIP for video predictions is constructive and likely an interesting model for others in the community who would like to perform multimodal inference on video and text data.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is the combination of the popular CLIP framework with an attention-based frame aggregator module inspired by multiple instance learning (MIL) to process echocardiogram videos for ejection fraction regression. The authors proposed a multi-scale encoder of the input frames to combine global and regional features of each frame. The authors demonstrated improved performance of the proposed model against other SOTA CLIP-based and classical methods in the few-shot setting for ejection fraction regression on an open-source dataset. The authors conducted a thorough ablation study of the various proposed components of the model.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper was well presented and clearly explained the proposed method. The evaluation of the proposed methods was thorough and compared against a number of SOTA methods on a publicly available dataset. A thorough ablation study of the proposed components was undertaken, and the authors investigated the impact of frame length, which is often omitted in the literature but is an important practical consideration for real-world ultrasound applications. Whilst the individual components of the proposed method are not particularly novel (i.e., the application of the popular CLIP framework to echo and the use of MIL to aggregate temporal sequences), existing works applying CLIP to echo have only used random image frames from a video. This work utilised ideas from MIL to leverage full video sequences, which feels like a logical step to improve video processing with CLIP-based models.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Whilst the EchoZoom component appeared empirically to improve the performance of the model, I felt this component did not have as clear a motivation behind it as the rest of the paper.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    As the MIL aggregator in theory provides a measure of importance for each frame, it would have been interesting to see if a greater importance had been assigned to end-diastole and end-systole frames, which are used to determine EF in clinical practice. This is by no means a necessity, but would have been a nice addition to the paper if this were commonly the case.
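    Since an attention-based aggregator exposes per-frame weights, such a check is straightforward to run. A hypothetical sketch, assuming the attention weights for one clip and its labelled end-diastole (ED) / end-systole (ES) frame indices (available in EchoNet-Dynamic) are at hand; the function name and top-k criterion are illustrative:

```python
import torch

def attention_hits_ed_es(attn, ed_idx, es_idx, top_k=4):
    """attn: (T,) per-frame attention weights for one clip. Returns True if a
    labelled ED or ES frame lands among the top-k most-attended frames."""
    top_frames = torch.topk(attn, k=top_k).indices.tolist()
    return ed_idx in top_frames or es_idx in top_frames

# Toy example with random weights standing in for a trained model's output.
attn = torch.softmax(torch.randn(32), dim=0)
print(attention_hits_ed_es(attn, ed_idx=3, es_idx=18))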

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I thought the paper was well presented and the evaluation of the proposed methods was strong. The proposed method was a logical extension to the original EchoClip paper which used only random image frames.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper
    • The paper introduced CardiacCLIP, a novel adaptation of the CLIP framework for left ventricular ejection fraction (LVEF) prediction.

    • The authors integrated two key components within the CLIP framework: Multi Frame Learning (MFL), which employs a self-attention mechanism to selectively aggregate informative frames across time; and EchoZoom, a multi-scale feature extraction strategy that enhances the model’s ability to capture fine-grained spatial representations of cardiac structures.

    • The paper demonstrated strong performance improvement of the proposed framework on LVEF prediction under a few-shot setting on a public benchmark.

    • The broad ablation study indicated the effectiveness of each proposed component and highlighted the framework’s reliability for potential clinical application.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The proposed MFL module, which selectively aggregates temporal motion dynamics from informative frames, presents a clear novelty over EchoCLIP’s random frame selection strategy. Its contribution is further justified through the corresponding ablation study.

    • EchoZoom, an innovative multi-scale feature extraction strategy, appears highly effective in capturing even subtle anatomical information within the frames, and its contribution is also validated through ablation study.

    • Adapting the two proposed modules within the CLIP framework led to superior performance compared to state-of-the-art traditional and CLIP-based methods for LVEF estimation, particularly under limited training data conditions.

    • The paper stands out by presenting a broad and clear ablation study that justifies the model’s components and other design decisions, effectively proving the superiority of the proposed CardiacCLIP framework within the limited space.

    • The paper was very easy to follow, and the model components and underlying computations are well supported by the presented diagrams and mathematical notations.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Not many weaknesses were found in the paper, but here are a few to mention:

    • Although MAE and RMSE are well-known metrics, it would be helpful to state the full form of these abbreviations at least once in the text.

    • Including the units (px, mm, or cm?) for the presented numerical result values would provide a clearer picture of the model’s accuracy.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presented a novel adaptation of a CLIP-based approach for LVEF estimation with limited data. The novelty of the model was clearly stated, and the actual contribution was effectively distinguished from existing works. The proposed model was well-supported by an overall diagram, and the computations were easy to follow due to the clearly presented equations and notations. Performance evaluations against state-of-the-art methods were thoroughly presented, and the justification for adding different components was well-supported by an ablation study. Therefore, this paper appears to be a very good contribution for readers, and CardiacCLIP could be highly useful for LVEF prediction in clinical settings.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their constructive feedback and thoughtful comments. We are encouraged by the overall positive recommendations (two Accepts and one Weak Accept), and we appreciate that all reviewers acknowledged the methodological contributions of CardiacCLIP, its clinical potential, the clarity and organization of the paper, and the comprehensive evaluation under a low-data regime. Below we address the main concerns raised.

R1.1 — Justification for classification + regression and the few-shot setting: LVEF prediction is inherently a regression task involving continuous percentage values. However, the original CLIP model is optimized for classification through image-text representation matching. Directly using numerical values as text prompts poses challenges due to their rarity in CLIP pretraining, the infinite numeric range, and CLIP’s insensitivity to fine-grained numeric differences. To address this, we reformulate the task as a coarse-to-fine prediction, using classification as an auxiliary stage to guide the regression. This classification-then-regression paradigm has been adopted in many deep regression tasks such as depth estimation [1,3], counting [2], and CLIP-based estimation [3,4]. Regarding the few-shot setting, our goal is to simulate real-world scenarios where annotated LVEF data is limited. We discretize continuous EF values into integer levels and sample up to k clips per level to ensure diverse EF coverage and reduce imbalance. After sampling, we group the clips into coarse EF bins (e.g., <30%, 45–54%) and associate each with a descriptive text prompt for classification, followed by regression refinement.

References: [1] Deep ordinal regression network for monocular depth estimation, CVPR 2018. [2] Uniformity in heterogeneity: Diving deep into count interval partition for crowd counting, ICCV 2021. [3] Can language understand depth? ACM MM 2022. [4] Teach CLIP to develop a number sense for ordinal regression, ECCV 2024.

R1.2 — Dataset limitation: We acknowledge the reviewer’s concern about the dataset scope. Extending our framework to broader anatomical regions, datasets, and tasks is an important direction for future work, which we are actively pursuing.

R2 — Motivation for EchoZoom: EchoZoom is motivated by the strong correlation between cardiac pathology and region-specific structural variation. By applying multi-resolution input scaling, EchoZoom enhances representations of critical cardiac structures, enabling the model to highlight clinically relevant regions while preserving spatial fidelity. This design reflects how cardiologists analyze both global and regional patterns to examine subtle structural abnormalities.

R3 — Metric and unit clarification: We thank the reviewer for the helpful suggestions. We will define MAE and RMSE in full upon first use in the revised paper and ensure all relevant units are clearly stated for clarity.
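To illustrate the coarse-to-fine scheme described in R1.1 (and the phrase "text embeddings serve as classifier weights" that R1 asked about), the sketch below shows one common form of CLIP-style bin classification with a soft regression readout: bin descriptions are embedded by the text encoder, the video embedding's cosine similarity to each bin embedding acts as a classification logit, and a continuous EF is read out as a softmax-weighted average of bin centers. The prompts, bin edges and centers, temperature, and the expectation-based refinement are illustrative assumptions; the paper's exact refinement step may differ.

```python
# Sketch of classification-then-regression for EF, assuming precomputed
# CLIP-style embeddings. All names, prompts, and bin values are illustrative.
import torch

bin_prompts = [
    "severely reduced ejection fraction, below 30 percent",
    "moderately reduced ejection fraction, 30 to 44 percent",
    "mildly reduced ejection fraction, 45 to 54 percent",
    "normal ejection fraction, 55 percent or above",
]
bin_centers = torch.tensor([22.5, 37.0, 50.0, 62.5])  # midpoints of EF bins

def predict_ef(video_emb, text_embs, tau=0.07):
    """video_emb: (D,) clip-level embedding; text_embs: (K, D) embeddings of
    the K bin prompts. The text embeddings act as classifier weights: the
    similarity to each prompt is the logit for that EF bin."""
    v = video_emb / video_emb.norm()
    t = text_embs / text_embs.norm(dim=-1, keepdim=True)
    logits = (t @ v) / tau                 # (K,) cosine similarities as logits
    probs = torch.softmax(logits, dim=0)   # coarse classification over bins
    return (probs * bin_centers).sum()     # soft readout: expected EF value

# Example with random embeddings standing in for real encoder outputs.
ef = predict_ef(torch.randn(512), torch.randn(4, 512))
print(f"predicted EF: {ef.item():.1f}%")
```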




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


