Abstract

The identification of cardiac phase is an essential step for analysis and diagnosis of cardiac function. Automatic methods, especially data-driven methods for cardiac phase detection, typically require extensive annotation, which is time-consuming and labour-intensive. In this paper, we present an unsupervised framework for end-diastole (ED) and end-systole (ES) detection through self-supervised learning of latent cardiac motion trajectories from 4-chamber-view echocardiography videos. Our method eliminates the need for manual annotations (including ED/ES indices, segmentation, or volumetric measurements) by training a reconstruction model to encode interpretable spatiotemporal motion patterns. Evaluated on the EchoNet-Dynamic benchmark, the approach achieves a mean absolute error (MAE) of 3.0 frames (58.3 ms) for ED and 2.0 frames (38.8 ms) for ES detection, matching state-of-the-art supervised methods. Extended to fetal echocardiography, the model demonstrates robust performance with an MAE of 1.5 frames (20.7 ms) for ED and 1.7 frames (25.3 ms) for ES, even though the fetal heart model is built using non-standardized heart views due to fetal heart positioning variability. Our results demonstrate the potential of the proposed latent motion trajectory strategy for cardiac phase detection in adult and fetal echocardiography. This work advances unsupervised cardiac motion analysis, offering a scalable solution for clinical populations lacking annotated data. Code is released at https://github.com/YingyuYyy/CardiacPhase.
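The frame and millisecond errors reported above can be related with a short calculation. The sketch below is illustrative only and is not taken from the released code: it assumes each video's frame error is converted to milliseconds using that video's own frame rate before averaging, and the function name `mae_frames_and_ms` and the example numbers are hypothetical.

```python
def mae_frames_and_ms(predicted, annotated, fps):
    """Mean absolute error in frames and in milliseconds.

    predicted, annotated: frame indices, one per video
    fps:                  frame rate of each video, in frames per second
    """
    frame_errors = [abs(p - a) for p, a in zip(predicted, annotated)]
    # Convert each error with the frame rate of the video it came from.
    ms_errors = [1000.0 * e / f for e, f in zip(frame_errors, fps)]
    n = len(frame_errors)
    return sum(frame_errors) / n, sum(ms_errors) / n


# Two hypothetical videos recorded at different frame rates.
mae_f, mae_ms = mae_frames_and_ms(
    predicted=[12, 40], annotated=[10, 43], fps=[50.0, 100.0]
)
```

Under this assumption, the same frame error counts for less time in a video with a higher frame rate, which is why a time-based figure is the more comparable one across datasets.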

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4211_paper.pdf

SharedIt Link: https://rdcu.be/eHxcl

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05185-1_31

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/YingyuYyy/CardiacPhase

Link to the Dataset(s)

EchoNet-Dynamic: https://echonet.github.io/dynamic/

BibTex

@InProceedings{YanYin_Latent_MICCAI2025,
        author = {Yang, Yingyu AND Yang, Qianye AND Cui, Kangning AND Peng, Can AND D’Alberti, Elena AND Hernandez-Cruz, Netzahualcoyotl AND Patey, Olga AND Papageorghiou, Aris T. AND Noble, J. Alison},
        title = {{Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15973},
        month = {September},
        pages = {316--325}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper under review proposes a novel method for ED/ES detection in ultrasound recordings via self-supervised learning, with no need for annotation of the cardiac phases. A structure-motion decomposition approach is combined with a video reconstruction approach, allowing the learning of both the structural components and the motion components. The approach is then tested on a large, publicly available dataset of adult recordings and shows promising results. The method also shows promising results for the fetal heart, evaluated on an in-house dataset.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel approach that disentangles motion and structure in ultrasound sequences of the heart
- The myocardial motion is decomposed into lateral and septal movements, which are anatomically relevant
- Comparison to other state-of-the-art methods and competitive results with self-supervised learning
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Limited to 4-chamber recordings only, although extension to other views is possible
- Processing of the fetal data includes manual alignment and extraction of the 4-chamber view. The frame rate and imaging parameters of the fetal data should be given. Informed consent and ethical considerations should be added.
- Only general results are given; how well the method performs in the presence of pathology is not known (especially interesting are pathologies that affect the motion of the myocardium, such as post-systolic contraction)
- Can the authors comment on the inference speed of the method?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method is novel and relies on self-supervision to learn timing events in the cardiac cycle. It also shows potential for generalization, since it performs reasonably on both adult and fetal data. For the proposed task, large datasets with known ground truth are already available, so the method will not have a big impact; however, applications such as motion artifact detection and pathology-based clustering might be far more promising.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper introduces an autoencoder to obtain a latent motion profile to detect cardiac phases in echocardiography videos, inspired by the work in references [15] and [17]. The autoencoder provides a spatially and temporally smoothed video and a structure-motion decomposition. Only two motion parameters per frame are needed to detect the cardiac phase.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Echocardiography video structure-motion decomposition is original.
- The temporal coefficients of the decomposition make it possible to interpret the cardiac motion and to detect the cardiac phase.
- The decomposition is learned unsupervised without annotation of the cardiac phases. The training criterion is set to the reconstruction.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The presentation of the autoencoder is too concise. The architecture of the convolutional networks and the network extracting the motion parameters should be given, including the depth of the networks, the size of the feature maps, and the number of total autoencoder parameters.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The following points should be addressed:
- Please clarify whether the motion components in Figure 1 are extracted after the mean frame is subtracted. This matters because the static structural component is appended to them during the reconstruction process.
- Please clarify whether the first term of the loss function makes the structure component similar to every frame or to their mean.
- What does the variation (+-) in the MAE values mean?
- The frame rate must be considered in the evaluation if it varies across the videos.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The novelty of the approach is the primary factor for recommending acceptance. Nonetheless, questions remain about how the proposed methodology works, and the presentation needs substantial improvement.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

This study presents an innovative, annotation-free approach to unsupervised cardiac phase detection in apical 4-chamber echocardiography. It leverages a latent motion subspace where cardiac motion is encoded in two orthogonal directions that are physiologically interpretable. Evaluation is carried out in both adult and fetal echocardiography datasets.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The proposed phase detection method is innovative and intuitively motivated. It uses an autoencoder framework to decompose structure and motion in the images. It is observed that the latent motion subspace learns the movement of the septal and lateral ventricular walls, such that the latent motion trajectory follows a cyclic motion pattern whose extremities relate to phase changes (end diastolic and end systolic frames).

Evaluation is carried out in both adult and fetal echocardiography (EchoNet and an internal dataset, respectively). In the adult dataset, the method outperforms another unsupervised method and is comparable to supervised methods. Accuracy in the fetal dataset is comparable to performance on adult echocardiography.

The figures are illustrative of the concepts described in the text.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The detail of ED/ES detection from the latent motion trajectory is somewhat difficult to follow. It seems that the output is a group of ED/ES frame indices. Is a group of indices output because there are multiple cardiac cycles within each video clip? Is it guaranteed that one ED and one ES frame is identified per cardiac cycle? Is the method able to distinguish ED from ES, or is the method identifying only extrema?

As described in the introduction, the ED/ES frames are best defined by valve motion. However, the latent motion subspace is predominantly learning motion of the ventricular walls. This raises the question of whether the accuracy is negatively impacted by cardiac pathology, particularly poor ventricular function.

In addition to reporting error in terms of frame numbers and time, it would have been more compelling to discuss or present how these errors impact clinical measurements, like ejection fraction and longitudinal strain.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The unsupervised approach to learning cardiac motion for phase estimation is interesting. The written presentation and figures are high quality. Application to both adult and fetal echocardiography datasets is compelling.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We are grateful for the valuable feedback from all reviewers. Below are our responses to the questions and suggestions raised:

ED/ES Extraction Explanation (R1): Regarding ED/ES detection from the latent motion trajectory, the proposed method detects all ED/ES frame indices within a video clip. If multiple cardiac cycles are present, it outputs a group of indices—one ED and one ES frame per cycle. From our experiments on fetal data, where ground truth ED/ES annotations are available for each video, we successfully identified 95% of ED/ES frames across all full-length multi-cycle videos. The identification is based on extrema within the latent motion trajectory. Through interpretable motion disentanglement, we know the approximate latent regions where ED and ES occur. These extrema allow us to localize the exact ED/ES frames.
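The extrema-based localization described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the latent motion trajectory has already been reduced to one scalar per frame, and it does not decide which kind of extremum corresponds to ED and which to ES, since that mapping depends on the learned latent space. The function name `find_extrema` and the `min_gap` parameter are hypothetical.

```python
import math


def find_extrema(signal, min_gap=5):
    """Return (maxima, minima) frame indices of a 1-D signal.

    A frame counts as a local maximum (minimum) if its value is the
    largest (smallest) within `min_gap` frames on either side. Ties go
    to the earliest frame, and the first and last frames are skipped
    because they only have neighbours on one side.
    """
    maxima, minima = [], []
    n = len(signal)
    for i in range(1, n - 1):
        lo, hi = max(0, i - min_gap), min(n, i + min_gap + 1)
        window = signal[lo:hi]
        is_first_occurrence = window.index(signal[i]) == i - lo
        if signal[i] == max(window) and is_first_occurrence:
            maxima.append(i)
        if signal[i] == min(window) and is_first_occurrence:
            minima.append(i)
    return maxima, minima


# Synthetic three-cycle trajectory with a period of 20 frames.
trajectory = [math.cos(2 * math.pi * i / 20) for i in range(60)]
maxima, minima = find_extrema(trajectory, min_gap=5)
```

On a clip with several cardiac cycles, each list holds one index per cycle, which matches the group-of-indices output the rebuttal describes. The `min_gap` window is one simple way to suppress spurious extrema caused by noise; a real pipeline may smooth the trajectory instead.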

Further Clinical Evaluation (R1, R2): We agree it is important to evaluate the method’s robustness in the presence of cardiac pathology, especially conditions affecting ventricular wall motion, such as poor function or post-systolic contraction. Additionally, we acknowledge that understanding the downstream impact of ED/ES detection errors on clinical measurements like ejection fraction and longitudinal strain is clinically relevant. We plan to investigate both aspects in future work.

Inference Speed (R2): We appreciate the suggestion. The inference of the latent motion trajectory is efficient, taking approximately 0.2 ms per frame (with batch size 1) on an NVIDIA RTX 5000 GPU, as it involves only forward passes through a lightweight network. The extraction of ED/ES frame indices from the latent motion trajectory takes around 6 ms per video (average video length: ~184 frames) on a modern Dell workstation with an Intel Xeon processor. We will include the average inference time in the revised manuscript to provide a complete performance overview.
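As a rough worked example of the figures quoted above, the per-video cost can be estimated by adding the per-frame inference time to the one-off extraction time. The sketch assumes the two stages run one after the other and ignores data loading and host-to-device transfer, which the rebuttal does not quantify; the function name is illustrative.

```python
def estimated_video_latency_ms(num_frames, per_frame_ms=0.2, extraction_ms=6.0):
    """Estimate the processing time for one video, in milliseconds."""
    return num_frames * per_frame_ms + extraction_ms


# Average-length video quoted in the rebuttal (~184 frames).
latency = estimated_video_latency_ms(184)
print(f"{latency:.1f} ms")  # 0.2 * 184 + 6 = 42.8 ms
```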

Architecture Explanation (R3): To preserve space in the paper, we will include the full architecture of the autoencoder in our public code repository. Briefly, the motion component consists of a 2-dimensional subspace coefficient and two basis vectors. These basis vectors are learned globally during training, while the 2D coefficients are predicted frame-wise by the MLP2 module shown in Figure 1. The final reconstruction is a combination of the static structural component and the motion component. Regarding the loss function, we penalize deviations of the structural component from each frame individually, rather than from their mean. This design follows the Fréchet mean formulation and helps avoid reconstruction blur caused by averaging. The notation “MAE (+/-)” was intended to reflect the standard deviation. However, we acknowledge that this format could be misleading (as mean absolute error cannot be negative). In the revised paper, we will report only the mean MAE for clarity. Lastly, since frame rates vary across videos in both datasets, we report both frame-based and time-based error metrics. The time-based error is consistent across datasets and provides a fair basis for comparison.
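The structure-plus-motion reconstruction described above can be illustrated with a toy example. This is a sketch under stated assumptions rather than the authors' architecture: the rebuttal says only that the reconstruction is "a combination" of the two components, so the plain sum over flattened pixel vectors used here is an assumption, and the names `reconstruct_frames`, `structure`, `basis`, and `coefficients` are hypothetical. In the actual model the coefficients would come from the MLP2 module and the basis vectors would be learned during training.

```python
def reconstruct_frames(structure, basis, coefficients):
    """Rebuild each frame as structure + c1 * basis[0] + c2 * basis[1].

    structure:    list of P floats, shared by every frame
    basis:        two lists of P floats, shared by every frame
    coefficients: list of (c1, c2) pairs, one pair per frame
    """
    b1, b2 = basis
    frames = []
    for c1, c2 in coefficients:
        frame = [s + c1 * u + c2 * v for s, u, v in zip(structure, b1, b2)]
        frames.append(frame)
    return frames


# Toy example: frames of four "pixels" at three time points.
structure = [1.0, 1.0, 1.0, 1.0]
basis = ([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])
coefficients = [(0.0, 0.0), (0.5, -0.5), (1.0, 2.0)]
frames = reconstruct_frames(structure, basis, coefficients)
```

Because the structure and the basis vectors are shared across frames, the only per-frame quantity is the coefficient pair, so the sequence of pairs traces the 2-D latent motion trajectory from which ED and ES are read off.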

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A

Latent Motion Profiling for Annotation-free Cardiac Phase Detection in Adult and Fetal Echocardiography Videos

Author(s):