Abstract
Echocardiography, a vital cardiac imaging modality, faces challenges due to limited annotated data, impeding the application of deep learning. This paper introduces EchoCardMAE, a customized masked video autoencoder framework designed to leverage unlabeled echocardiography data and enhance performance across diverse cardiac tasks. EchoCardMAE addresses key challenges in echocardiogram analysis through three innovations built upon masked video modeling (MVM): (1) Key Area Masking, which concentrates feature learning on the diagnostically relevant sector of the image; (2) Temporal-Invariant Alignment Loss, promoting feature consistency across different clips of the same echocardiogram; and (3) Reconstruction Denoising, improving robustness to speckle noise inherent in echocardiography. We comprehensively evaluated EchoCardMAE on three public datasets, demonstrating state-of-the-art results in ejection fraction (EF) estimation, coronary heart disease (CHD) prediction, and cardiac segmentation. For example, on the EchoNet-Dynamic dataset, EchoCardMAE achieved an EF estimation MAE of 3.78 and a left ventricular segmentation mDice of 92.96, surpassing existing methods. The code is available at https://github.com/m1dsolo/EchoCardMAE.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2462_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/m1dsolo/EchoCardMAE
Link to the Dataset(s)
N/A
BibTex
@InProceedings{YanXua_EchoCardMAE_MICCAI2025,
author = { Yang, Xuan and Xu, Rui and Ye, Xinchen and Wang, Zhihui and Zhang, Miao and Wang, Yi and Fan, Xin and Wang, Hongkai and Yue, Qingxiong and He, Xiangjian and Chen, Yen-Wei},
title = { { EchoCardMAE: Video Masked Auto-Encoders Customized for Echocardiography } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15972},
month = {September},
pages = {171--180}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes EchoCardMAE, a masked video autoencoder tailored for echocardiography. It introduces three key innovations—key area masking, temporal-invariant alignment loss, and reconstruction denoising—to address domain-specific challenges. EchoCardMAE achieves state-of-the-art performance on EF estimation, cardiac segmentation, and CHD prediction across multiple datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The three introduced components—key area masking, alignment loss, and reconstruction denoising—are not complex, but are well-motivated and empirically effective in this medical context. 2) EchoCardMAE outperforms or matches prior work on EF estimation, cardiac segmentation, and CHD prediction, showing its multi-task potential.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) The temporal-invariant alignment loss is intended to enforce consistency between features extracted from two clips of the same echocardiography video. However, this rests on a flawed assumption: that all such clips should share similar semantic content. In reality, ejection fraction (EF) is defined by the difference between end-diastole (ED) and end-systole (ES). These phases are structurally and functionally distinct, and a clip that captures both ED and ES (or transitions between them) will have very different features from a clip that misses one or both. If one clip contains a full cardiac cycle (ED + ES) and the other captures only diastole or systole, their features should differ—and enforcing similarity via alignment loss may actively harm representation learning. This undermines the effectiveness of the alignment loss, especially for EF regression, where temporal variation is the signal, not noise. Without a mechanism to ensure that sampled clips are phase-consistent or cycle-complete, the alignment objective can become misleading or counterproductive.
2) It does not describe the loss functions used for CHD classification or segmentation, nor the architectures of the downstream heads (e.g., whether MLPs were added). This weakens clarity and reproducibility.
3) The paper omits any detail of the decoder architecture used for pretraining. Since it plays a role in reconstruction, this hurts reproducibility and makes it difficult to assess the contribution of the reconstruction path. Also, what loss functions were used for the different downstream tasks?
4) How was the Key Area chosen? What method was used?
5) The paper lacks visualizations or interpretability analysis (e.g., attention maps, phase alignment) that could validate whether the model is truly learning meaningful clinical features or simply relying on data priors.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I recommend this paper with moderate enthusiasm, based on its practical, domain-aware contributions, while also recognizing significant limitations in its temporal modeling assumptions and methodological clarity.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have satisfactorily addressed the core concerns raised in the initial review, particularly regarding the assumptions behind the temporal-invariant alignment loss.
Review #2
- Please describe the contribution of the paper
This paper proposes a new masked video autoencoder for echocardiography. Key region masking, temporal-invariant alignment, and a reconstruction denoising strategy are introduced to cope with the specific challenges of echocardiography. Extensive experimental results demonstrate the superiority of this approach.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The extensive experiments validate the superiority of EchoCardMAE.
- The overall writing is easy to follow.
- The formulation and presentation are clear.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- What is the strategy for finding the central sector region? Is this strategy accurate and robust across different videos? In addition, are the tokens of the background region processed in the encoder and decoder, given the statement "By focusing feature extraction on the diagnostically relevant sector" from the paper?
- Are the results of previous methods in Table 1 from their original papers or retraining?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see weakness above.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper explores how MVM can be incorporated into building a sequence-aware encoder while adding mechanisms for focusing on the region of interest and denoising. The methods are applied to unsupervised training of TTE sequences and evaluated on multiple downstream tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Using the temporal information from US sequences to build a powerful feature encoder is underexplored in current research, as most networks operate on single images
- The idea of locally restricting the masking makes a lot of sense
- The method outperforms SOTA on multiple different tasks and two ablation studies are provided
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- There are some non-transparent aspects of some experiments, for example Table 3: it is not clear how MemSAM and SAMUS were trained/finetuned in the same fashion. The numbers reported for MemSAM are the ones for "CAMUS semi" in the original paper - how is it ensured that there is no data leakage or unfair comparison?
- Some related work could be mentioned here, e.g., EchoPrime, which claims to work on entire sequences, or perhaps even https://papers.miccai.org/miccai-2024/paper/2686_paper.pdf from last year, which is also related even though it does not operate on sequences
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- There was no 10-fold split used in Ref. 12, nor was it on CAMUS. Ref. 12 also did not provide 5-fold splits; is it Ref. 11?
- How do you get the cone shape in the first place? I assume it is not that trivial, with EchoNet images being downsampled and noisy (so it is not simply a 0 threshold)
- Is EchoCoTr in Table 2 also trained on EchoNet and finetuned on CAMUS in the same fashion?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Even though there are some unclear aspects in the experiments section, I would still call it a rather solid paper that could be interesting for the community.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Based on my comments and the authors' answers, I maintain my decision and recommend the paper for publication.
Author Feedback
To ALL Revs.: Basic image processing is conducted to produce a unified central sector region for each video before key area masking. This is only required for building the foundation model, not for downstream tasks. Given that the image size in the EchoNet-Dynamic dataset is 112x112, we spatially divide the image grid into 14x14 blocks, each with a spatial size of 8x8. For each video, we calculate the average intensity value of all pixels inside each block along the whole time axis. This gives a 14x14 map, in which each pixel position corresponds to a block and each pixel intensity is the averaged value. After thresholding this map, we iteratively erase 0-pixels from the outside toward the map center. Finally, the remaining pixels indicate the central sector region for the video. We find this processing to be robust.
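For concreteness, a minimal sketch of this preprocessing, assuming a NumPy video of shape (T, 112, 112); the threshold value and function name are illustrative assumptions, not taken from the rebuttal:

import numpy as np
from collections import deque

def central_sector_blocks(video: np.ndarray, thresh: float = 10.0) -> np.ndarray:
    """Estimate the 14x14 block map of the central sector for one video."""
    t = video.shape[0]
    # Average intensity of each 8x8 block over all frames -> one (14, 14) map.
    block_map = video.reshape(t, 14, 8, 14, 8).mean(axis=(0, 2, 4))
    foreground = block_map > thresh
    # Erase 0-blocks from the outside in: flood-fill below-threshold blocks
    # that touch the border; 0-blocks enclosed by the sector are kept.
    erased = np.zeros((14, 14), dtype=bool)
    queue = deque((i, j) for i in range(14) for j in range(14)
                  if (i in (0, 13) or j in (0, 13)) and not foreground[i, j])
    for i, j in queue:
        erased[i, j] = True
    while queue:
        i, j = queue.popleft()
        for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
            if 0 <= ni < 14 and 0 <= nj < 14 and not foreground[ni, nj] and not erased[ni, nj]:
                erased[ni, nj] = True
                queue.append((ni, nj))
    return ~erased  # remaining blocks form the central sector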
To Rev. 1:
- The temporal-invariant alignment loss is designed to weaken the variance introduced by clip selection. We hypothesize that two clips from the same video have more similar features than clips from different videos, rather than that their features are exactly the same. This is why we use an InfoNCE-based loss rather than an L2 loss. Besides, one clip (16 frames in total) is sampled by randomly selecting a starting frame and then selecting the remaining 15 frames at an interval of 4 frames. Thus, one clip can cover a time range of 64 frames. For the majority of cases in the EchoNet-Dynamic dataset, this range can cover at least half of a cardiac cycle. In addition, the ablation study (Table 5) shows the effectiveness of this loss. Fig. 2 also shows a case in which EF prediction becomes more stable when this loss takes effect. (A sketch of the sampling and loss is given after this response block.)
- Binary cross entropy is used for CHD classification, while a Dice loss is used for segmentation. The CHD classification head is a single MLP, while the segmentation head has a structure similar to the foundation model's decoder, which is a 4-layer, 3-head self-attention transformer. More details can be found in our code, which will be released if accepted. (A sketch of these objectives also follows this response block.)
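A minimal sketch of the clip sampling and an InfoNCE-style alignment loss as described above, in PyTorch; the temperature value and all names are assumptions, not taken from the paper:

import torch
import torch.nn.functional as F

def sample_clip(video: torch.Tensor, num_frames: int = 16, stride: int = 4) -> torch.Tensor:
    # Random start frame, then 16 frames at an interval of 4 (a 64-frame span).
    t = video.shape[0]
    span = (num_frames - 1) * stride + 1
    start = torch.randint(0, max(t - span + 1, 1), (1,)).item()
    idx = (torch.arange(num_frames) * stride + start).clamp(max=t - 1)
    return video[idx]  # video: (T, C, H, W) -> clip: (16, C, H, W)

def alignment_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1) -> torch.Tensor:
    # z1[i] and z2[i] are features of two clips from the same video (positives);
    # all other rows in the batch serve as negatives.
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.T / tau  # (B, B) cosine similarities
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(logits, targets)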
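Likewise, a minimal sketch of the stated downstream objectives; the hidden width and smoothing term are assumptions, and the transformer segmentation head is omitted for brevity:

import torch
import torch.nn as nn

def dice_loss(logits: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Soft Dice loss for binary segmentation; target is a float 0/1 mask.
    prob = torch.sigmoid(logits).flatten(1)
    target = target.flatten(1)
    inter = (prob * target).sum(dim=1)
    dice = (2 * inter + eps) / (prob.sum(dim=1) + target.sum(dim=1) + eps)
    return 1 - dice.mean()

class CHDHead(nn.Module):
    # Single-MLP classification head on mean-pooled encoder tokens,
    # trained with binary cross entropy (BCEWithLogitsLoss).
    def __init__(self, dim: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) encoder outputs.
        return self.mlp(tokens.mean(dim=1)).squeeze(-1)  # logits, shape (B,)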
To Rev. 2:
- Fig. 1 shows that the background tokens, together with the mask tokens, are concatenated with the features extracted by the encoder and then fed into the decoder. Note that both the background and mask tokens are learnable parameters. (A sketch of this token assembly follows this response block.)
- The results of prior methods in Table 1 are taken from their original papers. For a fair comparison, all methods used the official EchoNet-Dynamic dataset split.
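A minimal sketch of one way to assemble such a decoder input, in PyTorch; positional embeddings are omitted, the sketch places tokens by position rather than by literal concatenation, and all names are assumptions:

import torch
import torch.nn as nn

class DecoderTokens(nn.Module):
    # Visible positions take encoder features; masked positions take a learnable
    # mask token; background positions take a learnable background token.
    def __init__(self, dim: int):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.background_token = nn.Parameter(torch.zeros(dim))

    def forward(self, enc_feats: torch.Tensor, is_masked: torch.Tensor,
                is_background: torch.Tensor) -> torch.Tensor:
        # enc_feats: (B, N_vis, dim); is_masked / is_background: disjoint (B, N)
        # boolean maps; assumes the same number of visible tokens per sample.
        b, n = is_masked.shape
        tokens = self.mask_token.expand(b, n, -1).clone()
        tokens[is_background] = self.background_token
        tokens[~is_masked & ~is_background] = enc_feats.reshape(-1, enc_feats.shape[-1])
        return tokens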
To Rev. 3:
- The results of MemSAM and SAMUS in Table 3 are derived from the MemSAM paper, where both methods are trained in the same fashion. The MemSAM paper reported two kinds of results, denoted "CAMUS semi" and "CAMUS full". The former uses only the labels of the ED and ES frames for training, while the latter uses the labels of all frames. Since we could only download the ED and ES labels, we use the available labels to train our method and compare it with the "CAMUS semi" results.
- We will add EchoPrime and EchoFM papers to the reference list in the revised paper.
- We apologize for the citation mistake. We will cite Ref. 11 instead of Ref. 12 in the revised paper.
- The results of EchoCoTr and CoReEcho in Table 2 are derived from the CoReEcho paper. We follow this paper in pretraining on EchoNet and finetuning on CAMUS for our method.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors propose EchoCardMAE, a masked video autoencoder for echocardiography, featuring key area masking, temporal-invariant alignment loss, and reconstruction denoising. It shows strong results on EF estimation, cardiac segmentation, and CHD prediction. Strengths include well-motivated domain-specific design and competitive performance. Main concerns were addressed in the rebuttal, and the paper was well received overall. I recommend acceptance.