Abstract

Left ventricular (LV) indicator measurements that follow clinical echocardiography guidelines are important for diagnosing cardiovascular disease. Although existing algorithms have explored automated LV quantification, they can struggle to capture generic visual representations because training datasets are typically small. It is therefore necessary to introduce vision foundation models (VFMs) with abundant knowledge. However, VFMs such as the segment anything model (SAM) are usually well suited to segmentation but incapable of identifying the key anatomical points that are critical in LV indicator measurements. In this paper, we propose a novel framework named AutoSAME that combines the powerful visual understanding of SAM with simultaneous segmentation and landmark localization. The framework thus mimics the workflow of cardiac sonographers, achieving LV indicator measurements consistent with clinical guidelines. We further present filtered cross-branch attention (FCBA) in AutoSAME, which leverages the relatively comprehensive features of the segmentation branch to enhance the heatmap regression (HR) of key points from a frequency-domain perspective, optimizing the visual representation learned by the latter. Moreover, we propose spatial-guided prompt alignment (SGPA) to automatically generate prompt embeddings guided by the spatial properties of the LV, thereby improving the accuracy of dense predictions through prior spatial knowledge. Extensive experiments on an echocardiography dataset demonstrate the effectiveness of each design and the superiority of AutoSAME in LV segmentation, landmark localization, and indicator measurements. The code will be available at https://github.com/QC-LIU-1997/AutoSAME.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2383_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/QC-LIU-1997/AutoSAME

Link to the Dataset(s)

CAMUS dataset: https://www.creatis.insa-lyon.fr/Challenge/camus/index.html

BibTex

@InProceedings{LiuTuo_Think_MICCAI2025,
        author = { Liu, Tuo and Yang, Qinghan and Zhang, Yu and Ge, Rongjun and Chen, Yang and Zhou, Guangquan},
        title = { { Think as Cardiac Sonographers: Marrying SAM with Left Ventricular Indicators Measurements According to Clinical Guidelines } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {574 -- 583}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors adapted the AutoSAMUS model to produce 2D segmentations and landmark locations from 2D echocardiography, so that ventricular volumes and dimensions can then be quickly computed according to idealized current clinical processes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of interfacing with the clinical side to more directly produce clinically usable measurements is a good one, and the performance is good. There is a good set of ablation and SOTA comparison studies.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. I don’t completely understand the prompts: are they used only for training, or also for inference? If point and box prompts are needed for inference, then the algorithm is no longer “auto” in its prediction of landmarks, and needing prompts at inference would imply that the algorithm can’t do very well on its own. I would seek clarification on this.

    2. The use of the frequency domain for the cross-branch attention FCBA is interesting; what is the basis for this? Is there evidence that it performs better than regular image-intensity cross-attention? Having this ablation would better justify the design.

    3. The CAMUS dataset has quite a few unsatisfactory ground truths, one weakness here is that the authors did not test it on other datasets to show generalizability.

    4. In essence, the algorithm combines landmark segmentation and LV border segmentation. There are networks for both. It would have been better if the authors had compared their landmark segmentation to algorithms specifically designed for that task.

    5. In fact, there seems to be a need to directly compare to AutoSAMUS, but this is missing. This algorithm is essentially AutoSAMUS with landmark segmentation added. However, there should be a very direct relationship between border segmentation and landmark locations, which means that the output of AutoSAMUS could be used to infer the landmarks quite well, possibly even through simple, non-deep-learning processing. So how would that compare? If AutoSAMUS is enough, then the current proposal may be redundant.

    Coupled to the above question, I don’t see any comparison assessing the compatibility of the predicted landmark locations with the predicted segmentation borders. How does that look?
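    A minimal sketch of the kind of simple, non-deep-learning processing point 5 alludes to (purely illustrative and not from the paper: it assumes an upright binary LV mask whose apex is the topmost foreground pixel and whose basal points lie on the bottom foreground row):

```python
import numpy as np

def landmarks_from_mask(mask):
    """Infer the apex and two basal points from a binary LV mask.

    Strong simplifying assumptions: the ventricle is upright, the apex is
    the topmost foreground pixel, and the two basal (mitral annulus) points
    are the extremes of the bottom foreground row.
    """
    ys, xs = np.nonzero(mask)  # foreground pixel coordinates (row, col)
    apex = (int(xs[ys.argmin()]), int(ys.min()))
    base_row = int(ys.max())
    base_xs = xs[ys == base_row]
    base_left = (int(base_xs.min()), base_row)
    base_right = (int(base_xs.max()), base_row)
    return apex, base_left, base_right
```

    Whether such post-processing matches a dedicated landmark branch is exactly the comparison being requested.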

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    A link to the code is not provided, and the network sizes and hyperparameters are not given, so the work here is not yet fully reproducible.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    If the algorithm needs prompts for inference before it can work well, then it is not very impressive. I seek clarifications on whether this is the case.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed my concern about prompts during inference and about the ablation study versus AutoSAMUS. Although my comment about compatibility evaluations between prompts and segmentation remains unaddressed (I think this can be evaluated quantitatively; it could even serve as a constraint to improve results), I think the basic contribution of showing that including landmark prompts can improve segmentation is worthwhile here.



Review #2

  • Please describe the contribution of the paper

    This work introduces AutoSAME, a novel framework that leverages the Segment Anything Model (SAM) for the simultaneous segmentation of the left ventricle (LV) and detection of associated anatomical landmarks, enabling the computation of clinically relevant LV indices. Built upon the AutoSAMUS architecture, the framework incorporates a new heatmap regression (HR) branch for landmark localization and introduces a frequency-aware cross-branch attention mechanism (FCBA). Additionally, a spatial-guided prompt alignment (SGPA) strategy is proposed for the early training stages to inject prior spatial knowledge, thereby guiding the auto prompt generator (APG) to produce task-specific prompt embeddings. The method is evaluated on the CAMUS dataset, where it demonstrates improved accuracy over state-of-the-art (SOTA) approaches. Ablation studies highlight the contributions of both the FCBA and SGPA components.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novelty: The authors propose a modification to SAM/AutoSAMUS, effectively aligning these models with the clinical requirements for left ventricular analysis in echocardiography. While a straightforward approach might have involved simply reproducing the HR branch from AutoSAMUS, the authors introduce two key modifications that further enhance performance, as demonstrated in the ablation studies.

    Comprehensive experiments: The authors are commended for conducting thorough experiments, which include both SOTA comparisons and ablation studies. Additionally, they assess their strategy from a clinical perspective, evaluating the impact on the accuracy of clinical measurements, thereby providing robust evidence of the method’s added value.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Unclear evaluation details: The baseline model is not clearly defined in the manuscript, creating ambiguity in the comparison. Additionally, it is unclear whether the reported results are based on cross-validation or a single fold, which could affect the robustness and generalizability of the findings.

    Poor English grammar and spelling: The manuscript contains several grammar mistakes and typographical errors. Some sentences also require rephrasing for clarity and readability.

    Lack of code sharing: Sharing the code would significantly benefit the research community, especially since the method is based on the publicly released AutoSAMUS codebase. This is however not mandatory.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) Please proofread the manuscript for English grammar and spelling. Several grammatical errors and typos are present throughout the document and should be corrected for the final version. In addition, some sentences—though grammatically correct—may benefit from rephrasing to improve clarity and readability. A non-exhaustive list of suggested edits follows:
       a. Abstract: “that followed” should be “that follow”;
       b. Page 2: rephrase “and the apex meanwhile, then”;
       c. Page 2: “deep assessment LV function” should be rephrased;
       d. Page 2: consider rephrasing the sentence containing “providing beneficial information (….) adaptively” for clarity;
       e. Page 3: clarify the meaning of “In contrast, HR tasks typically only focus (…)”;
       f. Page 3: “LV quantitation” should be “LV quantification”;
       g. Page 5: “leaned” should be corrected to “learned”;
       h. Page 5: consider rephrasing “and anatomical point peripheral details”;
       i. Page 5: instead of “global and local patterns”, consider talking about low- and high-frequency patterns/information to better reflect the frequency-based design of FCBA;
       j. Page 7: “majority indicators”?
    2) Is there any guarantee of inter-task consistency, i.e. whether predicted landmarks align with the boundaries of your predicted segmentation? If mismatches are observed across your results, consider discussing this limitation in the discussion and how it might be addressed in future work.
    3) Consider modifying Fig. 2c to clearly indicate that both point and box prompts are only used during training. As currently presented, this detail may be unclear to readers until they reach the corresponding text.
    4) In section 2.1 “Training and Inference”, the manuscript states that paired A2C and A4C views are used as input. However, this is not reflected in Fig. 2 or in the accompanying model description. Please clarify whether view pairing is indeed required and update the figure or text as needed.
    5) Despite its simplicity, please comment on the use of a box-shaped mask given the radial symmetry observed in the frequency domain. Would a circular mask with an equivalent radius not be more appropriate?
    6) Fig. 3b seems to imply that SGPA is used only in the HR branch, although both Fig. 2 and the accompanying text indicate it is also applied to the segmentation. Consider explicitly stating in the figure caption that SGPA is used in both branches.
    7) Please clarify whether the reported results are based on cross-validation using the 8:1:1 data split, or if they correspond to a single fold. If the latter, this could potentially limit both the robustness and generalizability of the reported performance.
    8) Please include a description of how the “percentage of correct key points” is computed or, at least, cite any work that provides such a description/formulation. One would assume that a given threshold distance was assumed (or multiples), but no details are given.
    9) Several implementation details are missing and should be included, such as the input image size, the frequency and range of augmentations used, and so on. Additionally, please explicitly state the units (pixels or mm) used for the Gaussian heatmap standard deviations.
    10) In line with the previous comment, consider releasing your source code to support reproducibility. If code sharing is not feasible, ensure that sufficient methodological details are provided. For instance, while it can be inferred that the image encoder adapters, the FCBA implementation, or the input image size are based on the design details of AutoSAMUS and its associated codebase, this is not explicitly stated in the manuscript.
    11) Please discuss the sensitivity of the method to the choice of 10 epochs for prompt alignment. What would be the impact of extending the alignment phase or applying the alignment loss throughout the full training process? Consider including such analysis in your ablation studies or, at least, comment on it in the results/discussion sections.
    12) Please clarify what constitutes the “Baseline” in Table 1. Does it correspond to AutoSAMUS extended with a heatmap regression branch using CBA? This seems to be the case based on the following analysis, but an explicit definition would be helpful.
    13) Please include references for all methods reported in Table 2. For methods originally proposed or applied in different tasks (e.g., FM-based models), clarify whether the reported results come from pretrained/frozen models or if they were fine-tuned on CAMUS. Additionally, for completeness, consider including results for AutoSAMUS with the HR branch and CBA in this table.
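    To make comment 5 concrete, the two candidate low-pass masks can be compared with a small sketch (spectrum size and radius are assumed values; this is not the paper's implementation):

```python
import numpy as np

def frequency_masks(size=64, radius=8):
    """Centered low-pass masks for an fftshift-ed 2D spectrum: a box-shaped
    mask (half-width = radius) versus a circular mask of equivalent radius."""
    c = size // 2
    box = np.zeros((size, size), dtype=bool)
    box[c - radius:c + radius, c - radius:c + radius] = True
    ys, xs = np.mgrid[0:size, 0:size]
    circle = (xs - c) ** 2 + (ys - c) ** 2 < radius ** 2
    return box, circle

def low_pass(image, mask):
    """FFT the image, shift DC to the center, apply the mask, invert."""
    spec = np.fft.fftshift(np.fft.fft2(image))
    return np.real(np.fft.ifft2(np.fft.ifftshift(spec * mask)))
```

    The box keeps the diagonal corner frequencies that the circle discards, so the two masks pass slightly different high-frequency content; which behaves better for FCBA is an empirical question.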

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the manuscript presents some issues related to reproducibility and concerns about the robustness of the evaluation (e.g., unclear definition of the baseline model and the potential use of a single fold instead of cross-validation), along with several instances of poor grammar and spelling, these concerns should be easy to address during the rebuttal process. The novelty of the submission, particularly the proposed modifications to SAM/AutoSAMUS to better align with clinical requirements for left ventricular analysis in echocardiography, is a valuable contribution. The inclusion of both SOTA comparison and ablation studies, along with clinical evaluations, strongly supports the added value of the proposed method. Given that the identified weaknesses should be manageable, and assuming that the authors provide satisfactory clarifications during the rebuttal process, I believe the manuscript merits acceptance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have adequately addressed the main concerns raised in my initial review, particularly those related to baseline definition (AutoSAMUS with HR using CBA), reproducibility (code sharing and clarification of other implementation settings), and evaluation clarity (10-fold validation). While a few minor questions were not directly addressed in their feedback—such as whether view pairing is needed (although this could be clarified later through a modified Fig. 2), the use of a box-shaped mask, and the lack of formal inter-task consistency guarantees (as manual inspection does not constitute a guarantee)—these do not significantly detract from the overall contribution. Assuming the final manuscript reflects the clarifications provided, I consider this a well-constructed and valuable submission, and I recommend acceptance.



Review #3

  • Please describe the contribution of the paper

    A SAM-based network that improves on previous SAM ultrasound methods by tailoring it to complete LV indicator measurements in echo. Specifically, they include contouring of the LV and key anatomical landmarking simultaneously. Their cross-branch attention mechanism uses segmentation features from the frequency domain to improve the landmark regression. They introduce spatial-guided prompts to align embeddings with spatial priors.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Overall excellent work. The motivation for this work is strong and clearly defined, both on a clinical (complete LV measurements in line with clinical guidelines) and a technical (foundation models are necessary due to the lack of data) level.

    FCBA and SGPA modules are interesting additions, which are well explained and have some technical novelty.

    Benchmarked against other multi-task methods for LV measurement [17], demonstrating clear improvements in performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There is a comparison to EchoEFNet [17], but there are also many other existing methods for multi-task LV measurement that deserve a mention, e.g.:

    • Xue et al, Full left ventricle quantification via deep multitask relationships learning, MEDIA 2018
    • Chen et al, DeepCQ, CMPB 2020 (MRI, but relevant) [.. amongst others]

    One of the messages of the paper is that point and contour embeddings are similar and can improve each other. However, it’s not entirely clear why different encoders are used for these two representations. Why is a shared encoder with separate decoders not considered?

    It’s not clear whether the 1-4% improvements against EchoEFNet are clinically relevant, i.e., whether this improvement addresses a genuine clinical need; a comparison to inter-observer variability would help contextualise performance in a clinical setting.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Recommend including references for “Simpson’s method”, “echocardiography societies”, and “most existing models are restricted by limited training data”.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Excellent technical contributions, clearly connected to the problem statement. Strong quantitative results.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    All my questions have been answered, and my opinion has not changed.




Author Feedback

We thank the reviewers for their unanimous recognition of our work’s novelty (R1: “valuable”, R2: “good idea”, R3: “excellent”) and sufficient experiments (R1: “comprehensive”, R2: “a good set”, R3: “strong quantitative results”), and we thank R1 and R3 for accepting the paper directly. We summarize 4 rebuttal points:

  1. {R1, R2} End-to-end AutoSAME inference without prompts. As R1 mentioned, AutoSAME’s inference does not require prompts. During training, the novel SGPA optimizes the APG with prior spatial knowledge. At inference, AutoSAME leverages the learned knowledge to adaptively generate high-quality embeddings without prompts, achieving automatic LV assessment without manual intervention.
  2. Superiority of AutoSAME proven by meticulous experiments. 1) {R1, R2} For a fair comparison across both segmentation and HR, our baseline in Table 1 is “AutoSAMUS extended with a heatmap regression branch using CBA”, as R1 described, and the superiority of AutoSAME (last row) over this AutoSAMUS baseline (first row) is proven through a direct comparison. For instance, the EF correlation is 0.827 for ours and 0.784 for AutoSAMUS, showing the effectiveness of our advanced designs. Specifically, FCBA enhances the HR branch with disentangled spectral knowledge: by replacing the CBA (first row) with our FCBA (second row), all 5 measures increase by an average of 1.9%. These improvements illustrate the advantages of integrating region-to-focus knowledge from segmentation into HR features in the frequency domain, where LV patterns, i.e., overall shape and details, are often more pronounced. 2) {R1} The impressive performance of AutoSAME is supported by fair comparisons, where all pre-trained FM-based models are fully fine-tuned on the CAMUS dataset. Notably, our proposal outperforms all other methods in Table 2, which benefits from the transfer of disentangled spectral knowledge and the introduction of correlated spatial information. 3) {R3} All LV indicators from AutoSAME are calculated following the clinical guidelines. Hence, the improvements of AutoSAME over EchoEFNet and other methods can be meaningful for accurate LV assessment in clinical practice.
  3. High inter-task consistency of results yielded by careful design. 1) {R1, R2, R3} Segmentation and HR analyse multi-level aspects of the same LV. Segmentation aims to extract morphological characteristics for pixel-level classification, while HR focuses on key landmark acquisition from the global image distribution. Given the close chaining between them, we develop two task-specific CNN encoder branches and encourage interactions between them, realizing co-promotion. 2) {R1, R2} As shown in Fig. 4, the LV landmarks and boundaries predicted by AutoSAME can be mutually verified, and the high consistency between them implies strong reliability of AutoSAME for clinical LV assessment.
  4. Others. 1) {R3} In the references on MRI LV assessment mentioned by R3, Xue et al. concatenate a CNN and an RNN for numerical regression of LV indices, and Chen et al. estimate LV parameters with a BiLSTM based on U-Net segmentation. For our target, SAM’s visual understanding is combined with segmentation and HR in echocardiography, achieving accurate and reliable assessment from the co-promotion of closely related tasks. 2) {R1} We appreciate R1’s careful reading, which helps us improve our text, figures, and tables. We will also refine details, e.g., the practice-based choice of epochs and the frequency mask shape. 3) {R1, R2} The code will be released after acceptance and contains detailed implementations, e.g., the dataset setting (10-fold), the units of the Gaussian heatmaps (pixels), the input size (256*256 pixels), and the PCK threshold (1/20 of the input size). 4) {R1, R2, R3} A private dataset is under construction to further explore the generalization and practicability of AutoSAME. Meanwhile, we agree that the further comparisons recommended by R2 and R3 would be valuable, and we will consider them for future work, as the rebuttal guide suggests.
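    For reference, the implementation details listed in rebuttal point 4 (256*256 input, pixel-unit Gaussian heatmaps, PCK threshold of 1/20 of the input size) translate into roughly the following sketch (not the released code; the sigma value is an assumption):

```python
import numpy as np

def pck(pred, gt, image_size=256, frac=1 / 20):
    """Percentage of Correct Keypoints: a landmark counts as correct when
    its Euclidean distance to the ground truth is below frac * image_size
    (here 256 / 20 = 12.8 pixels). pred, gt: (N, K, 2) pixel coordinates."""
    thresh = image_size * frac
    dists = np.linalg.norm(pred - gt, axis=-1)  # shape (N, K)
    return float((dists < thresh).mean())

def gaussian_heatmap(center, size=256, sigma=5.0):
    """Target heatmap for one landmark; sigma is in pixels per the
    rebuttal, although its exact value is not stated (5.0 is assumed)."""
    ys, xs = np.mgrid[0:size, 0:size]
    cx, cy = center
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```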




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This is a clear case that all three reviewers recommend Accept.


