Abstract

Ejection fraction (EF) of the left ventricle (LV) is considered one of the most important measurements for diagnosing acute heart failure and can be estimated during cardiac ultrasound acquisition. While recent deep learning models estimate EF values accurately, they often lack an explanation for the prediction. Providing clear and intuitive explanations for clinical measurement predictions would, however, increase the trust of cardiologists in these models. In this paper, we explore predicting EF measurements with Natural Language Explanation (NLE). We propose a model that, in a single forward pass, combines estimation of the LV contour over multiple frames with a set of modules and routines for computing various motion and shape attributes associated with ejection fraction. It then feeds the attributes into a large language model to generate text that explains the network’s outcome in a human-like manner. We provide an experimental evaluation of our explanatory output, as well as of EF prediction, and show that our model can provide EF estimates comparable to the state of the art together with meaningful and accurate natural language explanations for the prediction. The project page can be found at https://github.com/guybenyosef/EchoNarrator .

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1027_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1027_supp.pdf

Link to the Code Repository

https://github.com/guybenyosef/EchoNarrator

Link to the Dataset(s)

https://echonet.github.io/dynamic/

BibTex

@InProceedings{Tho_EchoNarrator_MICCAI2024,
        author = { Thomas, Sarina and Cao, Qing and Novikova, Anna and Kulikova, Daria and Ben-Yosef, Guy},
        title = { { EchoNarrator: Generating natural text explanations for ejection fraction predictions } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed an ejection fraction (EF) estimation model that also generates natural language explanations from echocardiograms. The proposed method contains several components: 1) a video regressor for echocardiographic sequence feature representation; 2) a spatio-temporal GCN for ES/ED key point estimation; 3) computation of geometrical attributes, including EF estimation (two different regressors); and 4) an LLM module that generates EF explanations in natural language. The authors designed specific training methods and evaluation metrics. By evaluating on a large public dataset, the authors showed that the proposed model not only achieves SOTA EF estimation but also provides meaningful explanations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel human-like explanation of EF prediction: this paper enriches a GCN-based EF prediction with a clinically meaningful explanation that makes it easy for humans to understand the reasoning behind the predicted EF value. It is much more human-friendly than other explainability techniques.
    2. Creation of a self-instruction dataset for EF explanation: the authors made use of different large language models and processing techniques to create a self-instruction training dataset from a small initial dataset, which could be of interest to the community for exploring explainable EF prediction using natural language.
    3. Design of cardiac attributes: the authors considered 6 attributes extracted from the key points around the left ventricle for EF explanation. Further refinement with an LLM helps make the generated explanation more coherent and human-like.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Limited discussion of NLE performance: the authors list 7 metrics in Table 2. It can be observed that the proposed NLE-EF model (with or without self-instruction) shows very mixed performance across the different metrics. It would be nice to see a more detailed discussion of the different metrics.
    2. Unclear definition of the NLE metrics: it is not clear how the Mistral contradictions and hallucinations are calculated. Does a contradiction value of 1.02 mean there is, on average, one contradiction per response?
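    One plausible reading of such a score (an assumption on the reviewer's part, not confirmed by the paper) is a simple per-response average:

    ```python
    def mean_per_response(counts):
        """Average number of flagged items (e.g. contradictions) per generated response."""
        return sum(counts) / len(counts)

    # Under this reading, a reported value of 1.02 would mean roughly one
    # contradiction per generated response on average.
    ```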
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors are going to release the codes and self-instruction dataset with the paper. The echocardiography dataset is public. This work is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Here are several remarks/questions related to the paper for further clarity.

    1. How does the method process/predict the ED/ES frames? How does the proposed method work when the ED/ES frames are unknown?
    2. What were the criteria for selecting videos for the training and testing sets used for video-text annotation? It would be clearer to show the distribution of EF for the samples involved in video-text annotation.
    3. It seems that the evaluation in terms of the Mistral model only depends on the attributes from the GCN model. The authors could extend the evaluation of Mistral accuracy from the 48 test samples (with video-text annotation) to the entire test set of the EchoNet-Dynamic dataset.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is well written and focuses on a very interesting topic: explainable EF estimation using language AI. The authors present the method in a clear way and provide quantitative as well as qualitative results to show the effective performance of the proposed method. It is a good benchmark for future research.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present a combination of models that estimate the left ventricular ejection fraction from echocardiogram videos and justify the estimated ejection fraction with textual output generated by a large language model (LLM). The aim is to enhance the trustworthiness of the estimation in a clinical setting by providing a rationale for the prediction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The approach is innovative, extracting multiple biomarkers from EchoNet-Dynamic videos by integrating existing methods and introducing new ones (sec 2.2).
    • It utilizes a dual method for ejection fraction prediction: (1) direct ejection fraction estimation and (2) volume regression at the end-systole (ES) and end-diastole (ED) phases, followed by calculation of the EF ratio.
    • The study includes a novel exploration of post-hoc electronic health records (EHRs) and text descriptions of echocardiograms using machine learning.
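    The volume-regression branch mentioned above reduces to the standard volumetric definition of ejection fraction. A minimal sketch (illustrative only; function and parameter names are ours, not the paper's code):

    ```python
    def ef_from_volumes(edv_ml: float, esv_ml: float) -> float:
        """Ejection fraction (%) from end-diastolic (EDV) and end-systolic (ESV) LV volumes in mL."""
        if edv_ml <= 0:
            raise ValueError("end-diastolic volume must be positive")
        return 100.0 * (edv_ml - esv_ml) / edv_ml
    ```

    For example, EDV = 120 mL and ESV = 50 mL give an EF of about 58%, which falls in the commonly cited normal range.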
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Employing an LLM to generate paragraphs about the extracted information seems counterproductive. Displaying the extracted biomarkers directly might be more efficient, as reading text is less accurate and more time-consuming for clinicians.
    • The paper lacks standard ejection fraction metrics (R2, MAE, RMSE), which are crucial for evaluating model performance.
    • There is insufficient justification for the numerous choices made in the study, such as the GCN architecture and the LLM base model.
    • Table 1 appears biased or poorly reported. EchoNet [16] offers various configurations, none of which use 16 frames and a single ES-ED cycle. This discrepancy needs correction and a fair acknowledgment of the existing state-of-the-art for EF regression on EchoNet-Dynamic.
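    The standard EF regression metrics the review asks for (MAE, RMSE, R2) are straightforward to compute from paired ground-truth and predicted values. A self-contained sketch (names are ours):

    ```python
    import math

    def regression_metrics(y_true, y_pred):
        """Return (MAE, RMSE, R^2) for paired EF ground truths and predictions."""
        n = len(y_true)
        errors = [t - p for t, p in zip(y_true, y_pred)]
        mae = sum(abs(e) for e in errors) / n
        rmse = math.sqrt(sum(e * e for e in errors) / n)
        mean_true = sum(y_true) / n
        ss_res = sum(e * e for e in errors)
        ss_tot = sum((t - mean_true) ** 2 for t in y_true)
        r2 = 1.0 - ss_res / ss_tot
        return mae, rmse, r2
    ```

    Reporting all three gives a fuller picture than any single number, since MAE is robust to outliers while RMSE penalizes large errors and R2 normalizes by the variance of the test set.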
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper does not provide sufficient detail to enable reproduction of the entire pipeline. Although the open-source repository is intended to address this issue, there is no mention of the newly created expert-labeled data used in this work.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • It is unclear whether the accuracy of the method would improve when applied to multiple heart beats, as with the reference state-of-the-art EchoNet method.
    • All significant limitations are appropriately noted in section 4. However, the limited dataset size is a problem, and the evaluation of an LLM by another LLM does not guarantee accuracy, which significantly weakens the claims related to the LLM’s capabilities.
    • Encapsulating the estimated biomarkers in a textual description seems impractical, while directly providing clear numerical values (sec 2.2) to justify the model’s predictions would be highly relevant in a clinical context.
    • The paper’s novelty primarily lies in the combination of existing components (GCN + LLM), and this combination lacks strong motivation.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper’s novelty stems solely from merging existing elements (GCN + LLM), which is not well motivated. It is difficult to understand why reading text would be preferable to reviewing raw numerical values, which are more precise and quicker to interpret, and do not suffer from the limitations of LLMs (accuracy, hallucination). While the work offers some intriguing insights, the problem it addresses is poorly defined or motivated.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents a novel Natural Language Explanation (NLE) model for predicting the ejection fraction (EF) in cardiovascular ultrasound, specifically designed to provide explanatory text that clarifies EF predictions in a manner understandable to clinicians. The model integrates spatiotemporal analysis of left ventricle (LV) geometrical features with the analytical capabilities of modern Large Language Models (LLMs) to generate human-like textual explanations. This approach aims to enhance the interpretability of EF predictions, addressing the challenge of explaining deep learning model outcomes in clinical settings, thereby fostering trust and usability among medical professionals.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper introduces an original use of Large Language Models (LLMs) to generate natural language explanations of ejection fraction predictions from cardiovascular ultrasound data. This approach is novel as it combines geometric feature analysis from medical imaging with language models to produce explanations that are clinically relevant and understandable, bridging a significant gap in medical AI between complex data outputs and clinical usability.

    • The proposed model is an end-to-end solution that integrates video encoding, spatiotemporal graph convolutional networks, and natural language processing. This comprehensive design allows for simultaneous ejection fraction prediction and explanation generation, which is particularly advantageous for real-time clinical decision-making.

    • The research introduces specific evaluation metrics tailored to assess the quality of the generated textual explanations in terms of clinical relevance and accuracy. These metrics are crucial for validating the explanations in a medical context, where the accuracy of information can significantly impact clinical outcomes.

    • The paper provides thorough experimental evaluations demonstrating that the model achieves ejection fraction prediction accuracy comparable to state-of-the-art methods while also delivering meaningful and clinically applicable explanations. This dual capability is a significant advancement in the field of medical imaging AI, offering a strong validation of the model’s practical utility in healthcare settings.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The study relies primarily on the EchoNet-Dynamic dataset, which, while comprehensive, limits the diversity of training and testing scenarios. This restricted dataset scope might lead to models that are overly optimized for specific dataset characteristics and may not generalize well to other clinical environments or imaging conditions not represented in the data.

    • The paper does not extensively discuss the limitations or potential failures of the proposed model under varying clinical conditions or with different patient demographics. A more detailed exploration of how the model performs with low-quality images or divergent cardiac conditions could enhance understanding of its robustness.

    • While the introduction of novel evaluation metrics for explanation quality is a strength, the paper falls short in providing a comparative analysis with existing metrics or baselines. This makes it challenging to gauge the true innovation or improvement offered by the new metrics over traditional methods.

    • The model’s heavy reliance on sophisticated LLMs like LLaMA and GPT-4 for generating explanations could introduce complexities in terms of computational resources and scalability. These models require significant processing power, which might limit the deployment of the proposed system in resource-constrained environments such as mobile health applications or in developing countries.

    • The methodology for generating natural language explanations is based on attributes that are predefined and potentially biased towards the training data’s characteristics. This approach might lead to explanations that do not accurately reflect unseen cases or rare pathological conditions, potentially misleading clinicians.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper provides a detailed account of the model architecture, training procedures, and the generation of natural language explanations. For enhanced reproducibility, it would be beneficial if the paper included more specifics about the hyperparameters used, version control of the software and libraries, and any preprocessing steps applied to the data.

    While using a public dataset such as EchoNet-Dynamic facilitates reproducibility, the paper could improve by detailing any specific data selection criteria, such as the inclusion or exclusion of certain types of echocardiography videos, or how data splits were determined. This detail helps ensure that other researchers can exactly replicate the training, validation, and testing environments.

    The introduction of novel metrics for evaluating explanation quality is a significant contribution. To aid reproducibility, the paper should provide clear definitions, calculation methods, and perhaps a standalone implementation of these metrics. Comparing these new metrics against established ones in the field could provide a baseline for other researchers to evaluate against and would contextualize the improvements made.

    For complete reproducibility, it would be advantageous if the authors included more comprehensive results, such as the full range of experiments conducted, including failed experiments or those yielding suboptimal results. This inclusion would offer a complete picture of the model’s performance and limitations.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The choice of using specific LLMs (LLaMA and GPT-4) is well justified in terms of their capabilities for generating natural language explanations. However, it would enhance the paper’s clarity if the authors discussed the selection criteria for these models over other potential options. Including comparative insights into why these models were chosen against others could provide a more robust justification for their use.

    While the use of the EchoNet-Dynamic dataset supports reproducibility, a deeper discussion on its limitations would be beneficial. For example, the authors could elaborate on the demographic and pathological diversity within the dataset and its potential impact on the model’s generalizability. Suggestions for future research could include exploring additional datasets or synthetic data to address these limitations.

    The paper introduces novel metrics for assessing the quality of explanations, which is commendable. However, providing a baseline comparison with existing metrics would strengthen the argument for the proposed metrics’ effectiveness. It would be useful for the authors to include a side-by-side comparison of these metrics against traditional ones, discussing any discrepancies in evaluation outcomes.

    The paper could be significantly strengthened by including a section on the clinical impact of the proposed model. This could involve a pilot study or feedback from clinical practitioners who have interacted with the model’s outputs. Understanding how the explanations influence clinical decision-making could provide valuable insights into the practical utility of the research.

    The generation of explanations based on predefined attributes may introduce biases, especially if these attributes do not encompass all clinically relevant scenarios. The authors should discuss potential biases in more depth and explore methodologies to mitigate them, such as incorporating a broader range of attributes or using adaptive algorithms that can learn new attributes from clinical feedback.

    While the experimental validation appears thorough, adding more diverse testing scenarios, such as cross-validation across different institutions or with varied equipment, could further validate the robustness and applicability of the model. This could help in assessing how the model performs under different clinical settings and with varying image qualities.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a novel approach that integrates graph convolutional networks and large language models to generate clinically relevant natural language explanations for ejection fraction predictions in cardiovascular ultrasound. This method enhances the interpretability and usability of AI predictions in clinical settings. The paper is scientifically rigorous, introduces new evaluation metrics for explanation quality, and demonstrates strong performance compared to existing benchmarks. Additionally, the openness of making the model’s code publicly available enhances its impact. Therefore, I recommend accepting this paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their positive feedback and valuable insights. In our response, we clustered the main points to address specific comments and suggestions made by the reviewers.

Extending explanations to multiple cycles: The reviewers suggested extending the explanations beyond single cycles to cover entire heart scans. Our proposed method requires detectors to identify the ED and ES frames. Previous work has shown that this can be automated. We agree that a more comprehensive approach is necessary for videos that cover multiple cycles. Potential strategies include adopting temporally denser graphs and integrating ED/ES detection heads directly into the model.

Design choices: We thank the reviewers for recognizing that one of the main innovations lies in the combination of echocardiography features coming from a vision model that processes the video and an LLM that converts these features - potentially less meaningful to a human reader - into explanatory text. Our decision to use a GCN to extract features and LLaMA as the LLM was based on experimental considerations, following the evaluation of several alternatives. For reproducibility, all details are documented in the repository and supplement, and we plan to make code and data openly available. The 6 features used in our model are based on what a clinician could observe from the video and are limited to the left ventricle. In the future, we plan to extend the features towards the left atrium and the right side of the heart. We anticipate that future research will also explore automated feature selection, employing techniques like reinforcement learning.

Evaluation: Thanks for the opportunity to clarify and discuss our conclusions from the experimental results. The discussion boils down to three points, which we will also emphasize in the camera-ready version. (1) Evaluating free text is a complex task in itself, and our experiments showed that many traditional metrics fall short here because the task involves field-specific and sensitive prior knowledge. Consequently, we developed a Mistral-based metric that aligns more closely with structured data. We acknowledge that the paper lacks independent experiments on this metric due to space constraints. (2) Our model appears to outperform general-purpose models, such as GPT-4 Vision, as well as models specifically trained with medical data, such as LLaVA-Med. (3) The self-instruction component enhances our model’s performance by generating a more diverse dataset, which in turn improves the accuracy and robustness of the explanations.

Clinical Use & Motivation: The reviewers commented on clinical value and general motivation. The proposed model produces not only LV contours and EF, but also additional geometrical features along with a text explanation. The proposed method allows the user to decide whether to trust the visual prediction or to verify it by assessing additional information. This is a first step towards holistic AI-assisted diagnosis and reporting, and may also facilitate better education of inexperienced clinicians. The reviewers also commented on the size and diversity of our annotated data. We note that a lot of data still enters our model implicitly through the pre-training of the GCN (on the full EchoNet train set) and of the LLM. However, we agree that we will need to extend and diversify the data to complement the evaluation. Especially for the LLM, a thorough clinical evaluation will be necessary.

In conclusion, we are grateful for the high scores and constructive comments from the reviewers. We believe that the enhancements discussed, including the extension of explanations to full scans, more rigorous evaluation, and an emphasis on clinical utility, will significantly strengthen our work. We look forward to integrating these improvements and continuing to advance the field of AI in echocardiography.




Meta-Review

Meta-review not available, early accepted paper.
