Abstract

Left ventricular segmentation and landmark detection from echocardiograms are routine practices in clinical settings for comprehensive cardiovascular disease evaluation. Recently, deep learning-based models have been developed to interpret echocardiograms. However, existing methods face challenges in handling sparse annotations, limiting their clinical applicability. Additionally, their robustness can be significantly influenced by temporal inconsistency (i.e., abrupt prediction fluctuations between consecutive frames) and inter-task conflict (i.e., detected landmarks deviating from segmentation boundaries). To address these issues, we propose a novel semi-supervised framework that integrates: 1) a knowledge distillation method for generating pseudo labels of the numerous unlabeled frames to improve the performance; 2) a Task-aware Spatial-Temporal Network (TSTNet) along with consistency constraints that enhances robustness by enforcing temporal consistency across frames, and inter-task consistency between segmentation and landmark detection. Experimental results on two datasets (a public dataset with 1,000 subjects and a private dataset with 1,950 subjects) show that our proposed framework significantly outperforms the previous approaches. The source code and dataset are publicly available at https://github.com/chenhy-97/TSTNet.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0938_paper.pdf

SharedIt Link: https://rdcu.be/eHwX8

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04984-1_4

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/chenhy-97/TSTNet

Link to the Dataset(s)

https://www.creatis.insa-lyon.fr/Challenge/camus/

BibTex

@InProceedings{CheHao_ASemiSupervised_MICCAI2025,
        author = { Chen, Haoyuan AND Li, Yonghao AND Yang, Long AND Wu, Han AND Zhou, Lin AND Sun, Kaicong AND Shen, Dinggang},
        title = { { A Semi-Supervised Knowledge Distillation Framework for Left Ventricle Segmentation and Landmark Detection in Echocardiograms } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15967},
        month = {September},
        pages = {34 -- 43}
}

Reviews

Review #1

Please describe the contribution of the paper

This work presents a semi-supervised framework for automatic left ventricle segmentation and landmark detection in echocardiograms. The proposed framework employs a teacher-student architecture to generate pseudo-labels for unlabeled frames, which are subsequently used to train a task-aware spatial-temporal network. Multiple constraints are proposed to ensure temporal and inter-task consistency. The method was validated on two datasets, one public and one private, and demonstrated superior performance compared to previous approaches.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Multi-consistency constraints: The authors address the challenge of simultaneously predicting left ventricle segmentation maps and landmarks by introducing two loss terms that enforce both temporal consistency (across pseudo-labeled frames) and inter-task consistency. These constraints are sufficiently general to be applicable to other contexts involving similar tasks, enhancing the significance of the paper.
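
The two kinds of constraint described above can be illustrated with a small sketch. This is not the paper's loss formulation, which is not reproduced in this review. It assumes a temporal term that penalises squared changes between consecutive per-frame probability maps, and an inter-task term that measures how far each landmark lies from the segmentation boundary. The function names and the list-based data layout are hypothetical.

```python
from math import hypot


def temporal_consistency(probs):
    """Mean squared change between consecutive per-frame probability maps.

    probs: list of frames, each a flat list of foreground probabilities.
    """
    total, count = 0.0, 0
    for prev, curr in zip(probs, probs[1:]):
        for p, q in zip(prev, curr):
            total += (p - q) ** 2
            count += 1
    return total / count if count else 0.0


def boundary_pixels(mask):
    """Foreground pixels with a 4-neighbour that is background or off-image."""
    height, width = len(mask), len(mask[0])
    border = []
    for r in range(height):
        for c in range(width):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < height and 0 <= cc < width
                if not inside or not mask[rr][cc]:
                    border.append((r, c))
                    break
    return border


def inter_task_consistency(mask, landmarks):
    """Mean distance in pixels from each landmark to the nearest boundary pixel."""
    border = boundary_pixels(mask)
    if not border or not landmarks:
        return 0.0
    nearest = [min(hypot(r - br, c - bc) for br, bc in border) for r, c in landmarks]
    return sum(nearest) / len(nearest)
```

A landmark that sits exactly on the mask boundary contributes zero to the inter-task term, while one placed inside or outside the mask contributes its distance to the boundary. A trainable version would need a differentiable surrogate for the boundary distance.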

Convincing results: The results clearly demonstrate the proposed method’s superior performance over SOTA (though see Comment 12). Furthermore, the ablation studies provide compelling evidence of the effectiveness of semi-supervision and the multi-consistency constraints.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Unclear novelty: The novelty associated with the use of a knowledge distillation method for generating pseudo-labels and improving performance is potentially misleading. In the abstract, the authors suggest that their approach to pseudo-label generation via knowledge distillation is novel. Later, in the introduction, they claim to be the first to apply this strategy to “segmentation and landmark detection in echocardiograms”. Although I am unaware of any work simultaneously targeting these two tasks (and note that I’ve not made an exhaustive review of the literature), such a strategy has been proposed elsewhere for segmentation in echocardiography, namely for left ventricle in DOI: 10.1109/ICASSP48485.2024.10446167 and 10.1016/j.eswa.2025.127084, and for mitral valve in 10.1007/s11517-024-03275-w. Other examples of related semi-supervised strategies, such as those based on dual-stage procedures (e.g., 10.1109/IUS54386.2022.9958670) or image registration (e.g., 10.1007/978-3-031-44521-7_19 and 10.1007/978-3-030-59725-2_45), also exist. These works have not been referenced in the manuscript. While the proposed method may be novel in the context of multi-task learning in echocardiography, the teacher-student approach for pseudo-labeling is well-established and remains fundamentally unchanged whether applied to single- or multi-task settings in echo (or other imaging modalities).

Incomplete methodological description: The description of the proposed TSTNet is insufficient, making it difficult—if not impossible—to fully reproduce the method based on the information provided. Details are missing regarding the “temporal merge”, “spatial weights”, “distance field”, or “task head” blocks seen in Fig. 2. If these components, or even the whole network, are based on prior work, relevant citations should be included and the specific contributions of the current paper clarified. If these elements are novel, the description should be expanded to allow understanding the methodology within the manuscript itself, even if the code is made available later.

Limited ablation studies: While the ablation studies convincingly demonstrate the benefits of semi-supervision and the multi-consistency constraints, they do not address the impact of key algorithmic decisions made at the network level (e.g., the multi-task aggregation module, spatial vs. temporal blocks, etc.). If the network architecture is not entirely novel (see paragraph above), this remark should be reframed accordingly (ignoring it if not novel, or considering only the novel aspects if based on a previous model).
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

1) Based on the aforementioned strengths and weaknesses, consider revising the description of your study’s novelty and contributions. Additionally, while the introduction mentions two contributions, the conclusion references three. Please ensure consistency.

2) In the abstract, you state that challenges in handling sparse annotations limit algorithms’ clinical applicability. Could you elaborate? While sparse annotations may affect accuracy due to limited supervised training data or difficulties in generalizing to non-ED and non-ES frames, they do not necessarily limit clinical applicability. Please comment.

3) Fig. 1 appears to suggest the use of 5 frames as input, while Fig. 2 points to “k” frames (as does the corresponding section of the main text). The implementation details section does not specify the value of k. Please explicitly indicate its value in the text and consider modifying Fig. 1 to replace “5” with “k”.

4) Fig. 1 also implies that “5”/“k” represents the sequence length from ED to ES, as the first and last frames contain GT annotations. This would suggest that k is variable. If so, please explicitly state this in the text and clarify what exactly is the input given to the network. Does it process the whole sequence at once? How was the variable length between ED and ES dealt with?

5) In the first subsection of section 2.1, what are the units of the Gaussian kernel? Pixels or mm? Please specify.

6) Your proposed teacher-student architecture appears to be based on the Mean Teacher method. Please confirm, and if so, include the appropriate reference.

7) In section 2.2, the sequence notation “{I_{i-k}, …, I_k}” contains k+1 frames rather than k frames, as indicated. Please correct this discrepancy.

8) Given that S and L represent the ground truth annotations, shouldn’t Equation (3) instead use the corresponding predictions (either the predicted probability maps or their thresholded versions)? Please verify and correct if necessary.

9) Ensure that all terms and variables used in the formulas are explicitly defined in the text.

10) Be mindful of variable reuse with different meanings across equations. For example, N is initially defined as the number of pixels but is later used in Equation (4) to denote the number of frames. Please revise for consistency.

11) Section 3.1: CAMUS consists of 500 patients (not 1000), covering the systolic portion of the cardiac cycle (although not necessarily half of it). Please correct this and similarly verify the details provided for your private dataset.

12) Regarding your private dataset, please include further details about the included population (exam indication or pathology, gender proportion, etc.), image acquisition (including machine and probe used, image resolution, temporal resolution, etc.), and annotation procedure (by whom and how).

13) From the description, it seems you used custom data splits (and not the official one from CAMUS). If so, did you retrain or apply SOTA methods using your splits? How were their performance metrics obtained? The values in Table 1 differ from those reported in the original publications. Please clarify.

14) What exactly is meant by “high-confidence contrast” augmentation? Please improve the description or provide a reference where this technique is detailed.
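
For context on comment 6: in the Mean Teacher method, the teacher's weights are an exponential moving average (EMA) of the student's weights rather than being trained by gradient descent. The sketch below shows that update on a plain dictionary of scalar weights. Whether the authors use this scheme is exactly what the comment asks them to confirm, and the `decay` value and the dictionary layout are illustrative assumptions.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher weight toward the student: t <- decay * t + (1 - decay) * s.

    teacher, student: dicts mapping parameter names to float values (hypothetical layout).
    decay: smoothing factor in [0, 1); the default here is illustrative only.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }
```

A higher `decay` makes the teacher change more slowly, which smooths the pseudo-labels it produces at the cost of lagging behind the student.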
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the approach demonstrates strong results, concerns regarding its technical novelty and methodological clarity require further clarification. The primary issue lies in the framing of novelty, particularly with respect to the pseudo-labeling strategy and multi-task network design. As noted above, similar knowledge distillation-based pseudo-labeling approaches have been previously explored in echocardiographic segmentation tasks, albeit not necessarily in a multi-task setting. The manuscript does not sufficiently differentiate its contributions from these prior works. The novelty associated with the network design is also unclear. Another concern is the scope of the ablation studies. While the authors provide compelling evidence of the benefits of semi-supervision and multi-consistency constraints, they do not assess the impact of the architectural choices made at the network level. If these architectural components are not original contributions of the authors, the level of technical novelty is questionable. Conversely, if they are, the current ablation studies do not provide sufficient justification for the design decisions made. If the authors’ rebuttal effectively clarifies the novelty of their approach and substantiates its added value, I would reconsider my rating.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have provided helpful clarifications regarding the novelty of their approach, particularly in articulating how their framework differentiates from prior work on semi-supervision in echocardiography by simultaneously integrating spatial, temporal, and multi-task learning. While the manuscript would still benefit from additional ablation studies to isolate the contributions of the spatial-temporal components (and other aspects of the model itself), the rebuttal presents a clearer description of the architecture and a reasonable explanation for the absence of such studies due to space limitations. Other minor concerns (related to the equations, units, dataset description, etc.) were not directly addressed but are expected to be resolved in the final version. Given the clarified novelty, strong results (that I’ve praised before), and considering the notes from other reviewers, I am updating my recommendation to Accept.

Review #2

Please describe the contribution of the paper

The authors propose a semi-supervised framework that integrates knowledge distillation with a task-aware spatiotemporal network (TSTNet), reinforced by temporal and inter-task consistency constraints to enhance robustness across frames and between segmentation and landmark detection tasks. Results are shown on the two echo datasets.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. A semi-supervised framework that integrates knowledge distillation with a task-aware spatiotemporal network (TSTNet) is interesting and effective.
2. The coupling of spatial, temporal, and multi-task aggregation in TSTNet is interesting to incorporate the spatio-temporal information.
3. The framework is reinforced by temporal and inter-task consistency constraints, which are essential for improving cross-frame coherence and effective for estimating ejection fraction.
4. Cross-dataset performance.
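
As background for the ejection fraction mentioned in point 3: it is computed from the end-diastolic and end-systolic volumes as EF = (EDV - ESV) / EDV x 100. The sketch below pairs that formula with a single-plane method-of-disks volume estimate, in which the long axis would come from the landmarks and the per-disk diameters from the segmentation. That pairing is an assumption about typical clinical practice, not a description of the paper's pipeline, and the function names are hypothetical.

```python
from math import pi


def disk_volume(diameters, long_axis_length):
    """Single-plane method of disks: stack of cylinders, each of height L / n."""
    n = len(diameters)
    if n == 0:
        raise ValueError("at least one disk diameter is required")
    height = long_axis_length / n
    return sum(pi / 4.0 * d * d * height for d in diameters)


def ejection_fraction(edv, esv):
    """Ejection fraction in percent from end-diastolic and end-systolic volumes."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv
```

Because both volumes depend on the segmentation and on the landmark-defined long axis, an inconsistency between the two tasks propagates directly into the EF estimate.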
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Details of spatial self-attention in the spatial block, temporal self-attention in the temporal block, and spatial weights and cross-attention in the MTA block are missing.
2. Detailed ablation of each consideration and component could highlight the effectiveness and rationale of the authors’ contributions.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Details of spatial self-attention in the spatial block, temporal self-attention in the temporal block, and spatial weights and cross-attention in the MTA block are missing. These are important innovations here in TSTNet. In Fig. 2(d), is it weights or wights?

Again, their ablations in the result section could show the effectiveness of each component in forming a better TSTNet.

In the skip connection path, the signal comes from the encoder’s pooling output. Why is there average pooling again? Doesn’t it downsample the image two times? How do the authors add this to the decoder path if this is the case? Confusing design, or does it need a clear presentation?

How do the authors select those hyperparameters in the estimation of total loss? The combinations of many losses may cause local minima trapping during the training; any rationale for not trapping?

In Table 2, the performances of semi and semi+cons seem similar. If not, are they statistically significant, and what is the p-value? The proposed one performs better than the supervised one (Table 2). What are the computational overheads in both versions?
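
One common way to answer the significance question is a paired t-test on per-case scores from the two variants. The sketch below computes only the test statistic and degrees of freedom with the standard library; the inputs are assumed to be per-case metric values for the same cases in the same order. The p-value would then come from Student's t distribution, for example via `scipy.stats.ttest_rel` with `alternative="greater"` where SciPy is available.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples a and b.

    t > 0 indicates that a scores higher than b on average.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two pairs are required")
    spread = stdev(diffs)  # sample standard deviation of the differences
    if spread == 0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (spread / sqrt(n)), n - 1
```

Pairing matters here because both variants are evaluated on the same cases, so the test operates on per-case differences rather than on two independent groups.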

Cross-domain (echo vs. MRI) and cross-dimension (2D vs. 3D) could be interesting to demonstrate scalability and genericity.

Fig. 3 shows the results of ED and ES frames, which are supervised frames. How is it in the other intermediate frames? Was it coherent over the frames?
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper proposes a semi-supervised framework combining knowledge distillation and a custom-designed Task-aware Spatial-Temporal Network (TSTNet) for improving LV segmentation and landmark detection in echocardiograms. The method addresses two key challenges: 1) sparse annotations and temporal inconsistency, and 2) inter-task conflicts between segmentation and landmark detection. The authors employ a teacher-student network using pseudo-labels and introduce multi-consistency constraints.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper addresses a real clinical problem: the lack of frame-wise annotations in echocardiogram videos and the highly relevant temporal inconsistency.
2. As the authors claim, this is the first work to use knowledge distillation-based pseudo-labeling for joint segmentation and landmark detection.
3. The design of TSTNet, which incorporates spatial-temporal attention mechanisms and multi-task aggregation, is well-motivated and relevant for echocardiogram video analysis.
4. The method shows significant quantitative and qualitative improvements over state-of-the-art segmentation and landmark detection baselines.
5. Detailed quantitative and qualitative comparisons.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. While the authors mention the number of patients (1,000 in CAMUS and 1,950 in the private dataset), it is not clearly stated how many frames in total were used. Since the paper relies heavily on semi-supervised learning with pseudo labels, this information is essential to assess the scale of unlabelled data used. Please clarify how many frames there are per scan and how many of them are labelled (ED and ES).
2. Since pseudo-labels are generated by the teacher, did the authors observe any accumulation of pseudo-label noise over training? How did the authors control the quality of the pseudo-labels?
3. Could the authors clarify the reasoning behind selecting two spatial and two temporal blocks in the TSTNet architecture? I assume computational complexity might have influenced this choice, but it would be helpful if you could elaborate on how this decision balances model complexity, computational cost, and performance. Additionally, have the authors experimented with increasing the number of blocks, and if so, how does it impact both performance and inference/training efficiency?
4. Since this is a semi-supervised approach, could the authors comment on the impact of different amounts of labelled data? E.g., how does performance change if only 10% of the keyframes are labelled?
5. Are there failure cases where the temporal or inter-task consistency constraints introduce artefacts or force unrealistic anatomy?
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The authors present a novel semi-supervised framework based on knowledge distillation to address key challenges in echocardiogram segmentation and landmark detection. The proposed methodology is technically sound and demonstrates clear improvements over both supervised and existing semi-supervised approaches. However, the manuscript lacks certain ablation studies and analysis that could further improve methodological clarity and interpretability. That said, considering potential page limitations and the scope of the supplementary material, this is understandable. The questions and suggestions I have raised can be appropriately addressed in the camera-ready version, as they do not significantly affect the core contributions of the manuscript.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

I read the rebuttal, and the authors have addressed my concerns. I also went through the other reviewers’ comments and concerns, and I believe the rebuttal addressed those as well. I am satisfied with the proposed method and hence recommend acceptance.

Author Feedback

We thank the reviewers for their constructive comments. We are delighted with the positive feedback, including on the novelty of our framework (R3, R4), its superior performance (R1, R3, R4), and its valid clinical application (R3, R4). Our responses to the major concerns are presented below.

Q1. Clarifications of novelty (R1)
As R1 mentioned, there have been many works on semi-supervised learning in echocardiography. However, we would like to highlight that the major novelty of our method is a semi-supervised framework that integrates spatial, temporal, and multi-task learning through knowledge distillation, whereas most existing approaches (including those R1 cited) focus primarily on either spatial or temporal learning. Our framework explicitly models the interactions among multiple related tasks across these three perspectives. The results in Table 1 demonstrate the superior performance of our proposed method, and the ablation studies in Table 2 show incremental improvements from complementary learning across the three perspectives.

Q2. Details of the TSTNet (R1, R3)
To facilitate a better understanding of the architecture of TSTNet, which is our original design, we provide additional descriptions of the following components:
1) Spatial self-attention: Tokenizes spatial patches within frames for intra-frame interaction via self-attention.
2) Temporal merge: Aggregates multi-time-step features using 1×1 convolutions.
3) Temporal self-attention: Tokenizes temporal patches across frames to model inter-frame dependencies.
4) Spatial weight: The attention map from spatial self-attention reweights decoder features via element-wise multiplication and is fused with decoder outputs.
5) Distance field: A dense Euclidean distance map from each pixel to anatomical landmarks, with each channel representing one field.

Q3. About the experiments (R1, R3, R4)
Hyperparameter setting (R3): A grid search was used to tune the hyperparameters on the CAMUS validation set.
Model components (R1, R3, R4): We validated the effectiveness of the model components during network design. Due to the length limit, we did not include these results in the ablation study. However, our supervised results in Table 2 already demonstrate that our framework substantially outperforms an existing leading supervised segmentation model (e.g., EchoEFNet, shown in Table 3).
Statistical analysis (R3): A one-sided paired t-test was conducted, and the p-value (0.032) is below the significance level of 0.05, statistically verifying the superiority of our loss.

Q4. Clarifications of datasets (R1, R4)
For all datasets, following the clinical workflow, only the ED and ES annotations are available for each sequence. Therefore, issues related to annotation ratio are not applicable. For the CAMUS dataset, we followed the dataset paper’s protocol. In addition, we performed multiple random splits to validate robustness.

Q5. Concerns on intermediate frames (R1, R3, R4)
Clinical application (R1): Segmentation and landmark tracking on intermediate frames are critical for evaluating myocardial motion patterns.
Errors on the intermediate frames (R3, R4): Owing to the spatial-temporal module in TSTNet, the predictions preserve temporal consistency and accuracy (teacher network prediction) across time. However, some errors may remain, caused by low-quality pseudo labels and inconsistent predictions between the two tasks.

Q6. Clarifications of k (R1)
For TSTNet, k was set to 15. If a video contains fewer or more than 15 frames, we applied frame interpolation to resample it to 15 frames.

Q7. Other minor concerns (R1, R3, R4)
We appreciate the reviewers’ considerate comments, which help us improve the quality of our paper. The remaining concerns will be addressed in the final paper, including clarifications on the Gaussian kernel, improvements to figures and formulas, and the inclusion of proper citations.
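
The distance field described in the response to Q2 can be sketched as follows: one channel per landmark, where each value is the Euclidean distance from that pixel to the landmark. This is a minimal illustration under stated assumptions, not the authors' implementation. Distances are in pixels, landmarks are given as (row, column) pairs, and no normalisation is applied, none of which the rebuttal specifies.

```python
from math import hypot


def distance_field(height, width, landmarks):
    """Build one distance channel per landmark.

    landmarks: list of (row, col) positions (hypothetical convention).
    Returns a nested list indexed as field[channel][row][col].
    """
    return [
        [[hypot(r - lr, c - lc) for c in range(width)] for r in range(height)]
        for lr, lc in landmarks
    ]
```

Each channel is zero at its own landmark and grows with distance, so a landmark position can be recovered as the minimum of its channel.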

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

After reading the rebuttal, I agree with the consistent acceptance recommendation raised by three reviewers.

A Semi-Supervised Knowledge Distillation Framework for Left Ventricle Segmentation and Landmark Detection in Echocardiograms

Author(s):