Abstract

Automated segmentation of the left ventricular endocardium in echocardiography videos is a key research area in cardiology. It aims to provide accurate assessment of cardiac structure and function through Ejection Fraction (EF) estimation. Although existing studies have achieved good segmentation performance, their results do not perform well in EF estimation. In this paper, we propose a Hierarchical Spatio-temporal Segmentation Network (HSS-Net) for echocardiography video, aiming to improve EF estimation accuracy by synergizing local detail modeling with global dynamic perception. The network employs a hierarchical design, with low-level stages using convolutional networks to process single-frame images and preserve details, while high-level stages utilize the Mamba architecture to capture spatio-temporal relationships. The hierarchical design balances single-frame and multi-frame processing, avoiding issues such as local error accumulation when relying solely on single frames or neglecting details when using only multi-frame data. To overcome local spatio-temporal limitations, we propose the Spatio-temporal Cross Scan (STCS) module, which integrates long-range context through skip scanning across frames and positions. This approach helps mitigate EF calculation biases caused by ultrasound image noise and other factors. We achieved state-of-the-art results on three datasets. Our code is available at https://github.com/DF-W/HSS-Net.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2745_paper.pdf

SharedIt Link: https://rdcu.be/eHwPM

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04947-6_26

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/DF-W/HSS-Net

Link to the Dataset(s)

N/A

BibTex

@InProceedings{WanDon_Hierarchical_MICCAI2025,
        author = { Wang, Dongfang AND Yang, Jian AND Zhang, Yizhe AND Zhou, Tao},
        title = { { Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {268 -- 278}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes an encoder-decoder network for echocardiography video segmentation. Initially, convolutional networks are employed to process the spatial content of the data. Subsequently, the Mamba architecture is utilized to capture the inter-frame relationships. The objective of this study is to enhance the accuracy of ejection fraction estimation through the integration of these advanced methodologies.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The combination of spatial convolutional networks with a Mamba architecture that captures temporal features is a novel approach.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The presentation lacks a comprehensive comparison with other related works. Specifically, the article by H. Wei et al., titled “Co-learning of appearance and shape for precise ejection fraction estimation from echocardiographic sequences,” published in Medical Image Analysis in February 2023, could be a valuable addition. In this study, the ejection fraction estimated exhibited a 91.4% correlation with the ground truth. The bias was determined to be -0.1, and the standard deviation was recorded as 4.3. Moreover, the article by N. Painchaud et al., entitled “Echocardiography Segmentation with Enforced Temporal Consistency,” published in the IEEE Transactions on Medical Imaging in October of 2022, is pertinent to the subject under discussion.
- The absolute value of the bias in estimating the ejection fraction is elevated in the reported implementation of various algorithms, suggesting the presence of systematic errors, potentially attributable to the implementation process.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- It would be preferable to express the Hausdorff distance in millimeters.
- It is essential to ascertain whether the absolute value of the bias or its positive value is given.
- It would be useful to calculate the mean absolute value of the estimation error.
- A clear explanation is necessary to elucidate the discrepancies between the results obtained from the implementation of the method described in reference [19] and those reported in the original article.
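The EF error statistics discussed in this review (signed bias, standard deviation, mean absolute error, root mean square error, correlation) can be computed together from paired EF values. The following is a minimal sketch, not the authors' evaluation code; the function name `ef_error_metrics` and the use of the population standard deviation are assumptions made for illustration.

```python
import numpy as np

def ef_error_metrics(ef_pred, ef_true):
    """Summarise EF estimation error (all EF values in percent).

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    ef_pred = np.asarray(ef_pred, dtype=float)
    ef_true = np.asarray(ef_true, dtype=float)
    err = ef_pred - ef_true                      # signed error per video
    bias = err.mean()                            # signed bias (can be negative)
    std = err.std()                              # population std of the error (ddof=0)
    mae = np.abs(err).mean()                     # mean absolute error
    rmse = np.sqrt((err ** 2).mean())            # root mean square error
    corr = np.corrcoef(ef_pred, ef_true)[0, 1]   # Pearson correlation
    return {"bias": bias, "std": std, "mae": mae, "rmse": rmse, "corr": corr}

# Made-up EF values, used only to exercise the function.
metrics = ef_error_metrics([55, 60, 40, 35], [50, 62, 45, 30])
```

Because RMSE^2 = bias^2 + std^2 when the population standard deviation is used, these quantities always satisfy |bias| <= MAE <= RMSE, which gives a quick consistency check on any reported table.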
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The approach is novel, but questions exist regarding the authors’ assertion that they have obtained the best results on the ejection fraction estimation, and important related work has not been referenced.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Reject
[Post rebuttal] Please justify your final decision from above.

My comments on the authors’ feedback: [Comparison] In the article by Wei et al. the EF estimation result on the CAMUS Test dataset (https://www.creatis.insa-lyon.fr/Challenge/camus/databasesTesting.html) is also given with a good root mean square error metric. It would be better to use this dataset as a benchmark for fair comparisons. Its size is also 10% of the total dataset. In any case, if the paper is accepted, the comparison with the above mentioned article should include bias and standard deviation, not just correlation, as the evaluation metric. Fairness issues may arise if the comparison depends on the splitting strategy and the implementation is not independent. Continuing the discussion on the comparisons with other methods, it should be noted that in the article [13] in the references a root mean square error equal to 4.98 is reported with the same splitting strategy, while in the present article the evaluation result is 8.68.

[EF estimation bias] Feedback from authors clearly indicates that bias is reported as an absolute value. It would be useful to mention this explicitly in the manuscript. The authors respond that “The reported absolute value of the bias appears relatively high, mainly due to our different dataset splitting strategy. The distribution of test samples directly affects this metric.” This raises a question of robustness. As a related case, it is observed that for the H2Former method and the CAMUS dataset, the absolute value of the bias in the present article is 5.79, while in the article [5] in the references a value of 0.69 is reported. Finally, there is the question of the accuracy of the reported results, since the absolute value of the bias should always be smaller than that of the root mean square error. In Tables 1 and 2, this was violated in three cases.

[Hausdorff distance] I agree with the authors that the HD95 for the EchoNet-Dynamic and EchoNet-Pediatric datasets may not be given in mm because the pixel size is not directly known. However, in reference [19] the results on HD are in mm. For the CAMUS dataset, the pixel spacing on the original images is [0.308mm, 0.154mm] as specified in the metadata, with a pixel ratio of 2:1. It may be inaccurate to measure the Hausdorff distance in pixels with a non-square ratio. In addition, the real-world distance allows comparisons with other published results.

[Discrepancy in Ref. [19]] Claiming that “the discrepancy is caused by differing dataset splits” can be accepted. However, regardless of the splitting strategy and the specific split, the best trained network for each method should be tested on the same unknown dataset to get a fair comparison.
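To illustrate the Hausdorff-distance point about non-square pixels, the sketch below scales contour coordinates by the pixel spacing before measuring distances. It is a simplified illustration under stated assumptions: contours are given as (row, column) point lists, HD95 is taken as the larger of the two directed 95th-percentile distances (definitions vary between toolkits), and the assignment of 0.308 mm and 0.154 mm to rows and columns is assumed rather than taken from the CAMUS metadata.

```python
import numpy as np

def hd95(points_a, points_b, spacing=(1.0, 1.0)):
    """95th-percentile Hausdorff distance between two 2D point sets.

    points_a, points_b: sequences of (row, col) contour coordinates in pixels.
    spacing: physical size of one pixel along (row, col); (1, 1) keeps pixel units.
    Hypothetical helper for illustration only.
    """
    scale = np.asarray(spacing, dtype=float)
    a = np.asarray(points_a, dtype=float) * scale
    b = np.asarray(points_b, dtype=float) * scale
    # Pairwise Euclidean distances between every point of a and every point of b.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    a_to_b = d.min(axis=1)   # nearest-neighbour distance for each point of a
    b_to_a = d.min(axis=0)   # nearest-neighbour distance for each point of b
    return max(np.percentile(a_to_b, 95), np.percentile(b_to_a, 95))

# The same 10-pixel offset corresponds to different physical distances
# depending on its direction when the spacing is anisotropic.
along_cols = hd95([(0, 0)], [(0, 10)], spacing=(0.308, 0.154))
along_rows = hd95([(0, 0)], [(10, 0)], spacing=(0.308, 0.154))
```

In pixel units both offsets give a distance of 10, so a pixel-based HD95 cannot distinguish errors along the two axes even though their physical sizes differ by a factor of two.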

In conclusion, the proposed algorithm is novel, but questions remain regarding the evaluation results. It is difficult to accept the paper in its current state.

Review #2

Please describe the contribution of the paper

This paper introduces a Hierarchical Spatio-temporal Segmentation Network (HSS-Net) for echocardiography video segmentation, aiming to improve the accuracy of ejection fraction (EF) estimation. In the proposed framework, low-level stages leverage convolutional networks for fine-grained single-frame feature extraction, and high-level stages integrate the Mamba architecture to model spatio-temporal dependencies across frames. The global dynamic modeling improves robustness against interference, mitigating segmentation boundary jitter and ensuring stable EF estimation. Experimental results on three datasets show the effectiveness of the proposed model.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

This paper introduces a Hierarchical Spatio-temporal Segmentation Network (HSS-Net) for echocardiography video segmentation, aiming to improve the accuracy of ejection fraction (EF) estimation. In the proposed framework, low-level stages leverage convolutional networks for fine-grained single-frame feature extraction, and high-level stages integrate the Mamba architecture to model spatio-temporal dependencies across frames. The global dynamic modeling improves robustness against interference, mitigating segmentation boundary jitter and ensuring stable EF estimation. Experimental results on three datasets show the effectiveness of the proposed model.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. How to rearrange spatial sequence positions into diagonal and anti-diagonal patterns?
2. Please evaluate the value of the proposed model in real clinical application.
3. Why use a combination of Dice loss and binary cross-entropy loss for segmentation?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. How to rearrange spatial sequence positions into diagonal and anti-diagonal patterns?
2. Please evaluate the value of the proposed model in real clinical application.
3. Why use a combination of Dice loss and binary cross-entropy loss for segmentation?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Contributions: This paper introduces a Hierarchical Spatio-temporal Segmentation Network (HSS-Net) for echocardiography video segmentation, aiming to improve the accuracy of ejection fraction (EF) estimation. In the proposed framework, low-level stages leverage convolutional networks for fine-grained single-frame feature extraction, and high-level stages integrate the Mamba architecture to model spatio-temporal dependencies across frames. The global dynamic modeling improves robustness against interference, mitigating segmentation boundary jitter and ensuring stable EF estimation. Experimental results on three datasets show the effectiveness of the proposed model.

Weaknesses:
1. How to rearrange spatial sequence positions into diagonal and anti-diagonal patterns?
2. Please evaluate the value of the proposed model in real clinical application.
3. Why use a combination of Dice loss and binary cross-entropy loss for segmentation?
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The authors proposed a temporal-Mamba architecture for injecting temporal awareness into their 2D echo segmentation. There is a patch-based temporal cross scan module that looked interesting.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- There is architectural innovation, using an interesting spatio-temporal cross-scan module within the Mamba framework.
- The algorithm seems to be fast compared to alternatives, and yet performance is enhanced.
- The algorithm is tested across various echo databases, demonstrating generalizability.
- There is a good amount of ablation study to justify the design.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. There are other temporal attention models out there, and generally results show that inputting temporal information improves results. I wonder how temporal attention compares to the current temporal-Mamba method?
2. The algorithm requires the input of all time frames at the same time, and there may not be a need for that. A further step to improve this could be to test for tolerance to temporal sparseness; perhaps only a few time frames are needed to achieve good results. This may be important because, at present, the method is not easy to extend to 3D echo images, and we all know that Ejection Fraction quantification in 2D is not accurate. If all time frames are required for a 3D echo network, there won't be enough memory.
3. The pitch for the proposed method does not need to be limited to ejection fraction (see title). Segmentation can be used for quantifying wall thickness, LVIDD (diameter), LV length, shortening, etc., all of which have many clinical uses.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

I don’t see a link to the code, and the network sizes and hyperparameters are not shared, so the work is not yet fully reproducible.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

There is architecture innovation, testing is sufficient, results are good. I think this is a good piece of work.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

Innovative and good results.

Author Feedback

We thank the area chairs and reviewers. The code will be released to provide more implementation details.

To R1: [Comparison] We will cite these two related works and provide analysis and comparison in the final version. The work [Wei et al.] achieves a 91.4% correlation with the ground truth for EF estimation, as it uses 10-fold cross-validation with a 90/10 training/validation split. Different from this setting, following prior works [21, 23], we split the data into training, validation, and testing sets in an 8:1:1 ratio. Due to these different data splitting strategies, the results are not directly comparable. We have implemented this model’s code provided by the authors under our dataset split strategy, and our method achieves a 2.38% relative improvement in Pearson correlation for EF estimation. It is worth noting that our model achieves the best results compared to other methods under the same setting for a fair comparison.

[EF estimation bias] The reported absolute value of the bias appears relatively high, mainly due to our different dataset splitting strategy. The distribution of test samples directly affects this metric. For example, in Ref. [5], EF estimation bias on the CAMUS dataset ranges from 0.69 to 13.34, with over half of the compared models showing biases above 11, which are higher than our reported values.

[Hausdorff distance] Since the EchoNet-Dynamic and EchoNet-Pediatric datasets do not provide pixel spacing information, we cannot map image pixels to real-world distances. To ensure consistent units for the same metric throughout the paper, HD95 metrics are reported in pixels.

[Discrepancy in Ref. [19]] The discrepancy is attributed to the differences in dataset splits. Ref. [19] uses 5-fold cross-validation with an 80/20 training/validation split, while we split the data into training, validation, and testing sets in an 8:1:1 ratio, following prior works [21, 23]. All methods are evaluated using our dataset split, while the best model on the validation set is used for testing to ensure fair comparison.

To R2: [Spatial sequence rearrangement] We reshape the input into a multi-frame sequence and swap the spatial and temporal dimensions. Next, we build an index map with row-wise offsets across frames to align elements along diagonal or anti-diagonal directions. Finally, we perform index-based extraction to rearrange the sequence.
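One plausible reading of the rearrangement described above can be sketched with an index map and a gather operation. This is an illustrative interpretation only, not the authors' implementation: the function name `cross_scan`, the (frames, positions) input layout, and the wrap-around at the sequence boundary are all assumptions.

```python
import numpy as np

def cross_scan(x, direction="diag"):
    """Rearrange a (T, L) array so that the scan order crosses frames diagonally.

    x: T frames, each flattened to L spatial positions.
    direction: "diag" shifts by +t per frame, anything else shifts by -t.
    Hypothetical sketch; wrap-around at the boundary is an assumption.
    """
    T, L = x.shape
    sign = 1 if direction == "diag" else -1
    offsets = sign * np.arange(T)[:, None]              # row-wise offset per frame
    index_map = (np.arange(L)[None, :] + offsets) % L   # (T, L) source positions
    shifted = np.take_along_axis(x, index_map, axis=1)  # index-based extraction
    # Swap the temporal and spatial axes, then flatten: consecutive elements
    # now step one frame forward and one position sideways.
    return shifted.T.reshape(-1)

# Toy input where x[t, l] = 4 * t + l, so positions are easy to trace.
x = np.arange(12).reshape(3, 4)
diag_seq = cross_scan(x, "diag")
anti_seq = cross_scan(x, "anti")
```

With this toy input the diagonal sequence starts with x[0, 0], x[1, 1], x[2, 2], while the anti-diagonal sequence starts with x[0, 0], x[1, 3], x[2, 2]; both are permutations of the original elements.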

[Clinical application] Our model accurately segments the left ventricle, improving the precision of EF estimation. It also aids in quantifying key parameters such as wall thickness, ventricular diameter, and length. This facilitates clinical cardiac function assessment, enhances diagnosis and treatment planning, and supports disease progression monitoring.

[Combination of losses] This study requires precise segmentation to ensure accurate EF estimation. Dice Loss focuses on region overlap, addressing small targets and class imbalance. BCE Loss emphasizes pixel-wise classification, refining boundaries. Combining both optimizes global and local structures simultaneously.
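A combined Dice and binary cross-entropy objective of the kind described above can be sketched as follows. The equal weighting, the smoothing constant, and the NumPy formulation are assumptions for illustration; the rebuttal does not state the weights or implementation used in the paper.

```python
import numpy as np

def dice_bce_loss(prob, target, w_dice=1.0, w_bce=1.0, eps=1e-6):
    """Weighted sum of soft Dice loss and binary cross-entropy.

    prob: predicted foreground probabilities in [0, 1].
    target: binary ground-truth mask of the same shape.
    Weights and eps are illustrative assumptions.
    """
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    # Dice term: region-level overlap, insensitive to the number of background pixels.
    intersection = (prob * target).sum()
    dice = 1.0 - (2.0 * intersection + eps) / (prob.sum() + target.sum() + eps)
    # BCE term: per-pixel classification penalty, averaged over all pixels.
    bce = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)).mean()
    return w_dice * dice + w_bce * bce

mask = np.array([[0, 1], [1, 1]])
loss_perfect = dice_bce_loss(mask, mask)       # prediction equals the mask
loss_inverted = dice_bce_loss(1 - mask, mask)  # prediction is the complement
```

The Dice term responds to overall overlap while the BCE term penalises each misclassified pixel, so the sum stays close to zero only when both the region and the individual pixels agree with the mask.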

To R3: [Temporal attention models] In the comparison experiments, we have included two temporal attention-based methods (PKEchoNet [17] and Vivim [22]). The results demonstrate that our model outperforms these two methods. In future work, we plan to conduct more comprehensive comparisons between our method and additional temporal attention methods.

[Tolerance for temporal sparseness] Following prior studies [5, 17], we uniformly sample 10 frames per heartbeat cycle to reduce memory load from full-frame input. In future work, we will evaluate the model’s tolerance to temporal sparsity as a means of improvement. We will also explore its extension to 3D ultrasound imaging.
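Uniform sampling of a fixed number of frames from one heartbeat cycle can be sketched as below. The function name and the choice of inclusive start and end frame indices are assumptions; the rebuttal only states that 10 frames are sampled uniformly per cycle.

```python
import numpy as np

def sample_frame_indices(start, end, num_frames=10):
    """Return num_frames evenly spaced frame indices from start to end (inclusive).

    Hypothetical helper; if the cycle has fewer than num_frames frames,
    some indices repeat.
    """
    return np.round(np.linspace(start, end, num_frames)).astype(int)

# A cycle spanning frames 0..27 yields every third frame.
indices = sample_frame_indices(0, 27)
```

Varying `num_frames` in such a helper is one simple way to probe the temporal-sparseness tolerance raised by Reviewer #3.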

[Clinical uses] Our method is not limited to EF estimation and holds broad clinical potential. We will explore its wider applications to further enhance its clinical utility.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Reject
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The authors propose a Hierarchical Spatio-temporal Segmentation Network that integrates convolutional layers with the Mamba architecture for echocardiography video analysis, aiming to improve ejection fraction estimation. While Reviewer #1 raised concerns regarding bias metrics and fairness in comparisons, the authors provided reasonable clarifications. Reviewers #2 and #3 acknowledged the novelty, robustness, and potential clinical utility of the method. I recommend acceptance.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Hierarchical Spatio-temporal Segmentation Network for Ejection Fraction Estimation in Echocardiography Videos