Abstract

Transthoracic Echocardiography (TTE) is the most widely-used screening method for the detection of pulmonary hypertension (PH), a life-threatening cardiopulmonary disorder that requires accurate and timely detection for effective management. Automated PH risk detection from TTE can flag subtle indicators of PH that might be easily missed, thereby decreasing variability between operators and enhancing the positive predictive value of the screening test. Previous algorithms for assessing PH risk still rely on pre-identified, single TTE views which might ignore useful information contained in other recordings. Additionally, these methods focus on discerning PH from healthy controls, limiting their utility as a tool to differentiate PH from conditions that mimic its cardiovascular or respiratory presentation. To address these issues, we propose EchoFM, an architecture that combines self-supervised learning (SSL) and a transformer model for view-independent detection of PH from TTE. EchoFM 1) incorporates a powerful encoder for feature extraction from frames, 2) overcomes the need for explicit TTE view classification by merging features from all available views, 3) uses a transformer to attend to frames of interest without discarding others, and 4) is trained on a realistic clinical dataset which includes mimicking conditions as controls. Extensive experimentation demonstrates that EchoFM significantly improves PH risk detection over state-of-the-art Convolutional Neural Networks (CNNs).

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2686_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2686_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Fad_EchoFM_MICCAI2024,
        author = { Fadnavis, Shreyas and Parmar, Chaitanya and Emaminejad, Nastaran and Ulloa Cerna, Alvaro and Malik, Areez and Selej, Mona and Mansi, Tommaso and Dunnmon, Preston and Yardibi, Tarik and Standish, Kristopher and Damasceno, Pablo F.},
        title = { { EchoFM: A View-Independent Echocardiogram Model for the Detection of Pulmonary Hypertension } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors present a framework that uses DINOv2 for pretraining on, and feature extraction from, Transthoracic Echocardiography (TTE) videos, enabling a compact, rich representation from multiple video views of the same subject. The extracted per-frame features are used as input to a Vision Transformer (ViT) classifier for the classification of Pulmonary Hypertension. This framework outperforms a baseline model using a weakly supervised learning approach. The authors demonstrate the added value of pretraining with DINOv2 over other approaches (SimCLR and DINO), as well as the benefit of using a ViT over a CNN for classification, particularly when using the less performant pretraining mechanisms.
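
    For concreteness, here is a minimal sketch of the pipeline described above: a frozen self-supervised encoder produces per-frame embeddings that are pooled across all available views and passed to a small transformer classifier. The torch.hub entry point, dimensions, and module names are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn

# Frozen self-supervised image encoder (DINOv2 ViT-S/14 via torch.hub), used only
# as a per-frame feature extractor; its weights are not updated downstream.
encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
encoder.eval()

class FrameSequenceClassifier(nn.Module):
    """Illustrative transformer head over pre-extracted frame embeddings."""

    def __init__(self, dim: int = 384, heads: int = 6, layers: int = 2, num_classes: int = 2):
        super().__init__()
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, num_layers=layers)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, num_classes)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames_total, dim), frames from all views concatenated
        cls = self.cls_token.expand(frame_feats.size(0), -1, -1)
        x = self.transformer(torch.cat([cls, frame_feats], dim=1))
        return self.head(x[:, 0])  # classify from the CLS position

# Per-frame embedding: frames from every available view of one study are stacked together.
with torch.no_grad():
    frames = torch.randn(64, 3, 224, 224)          # placeholder for 64 preprocessed frames
    feats = encoder(frames).unsqueeze(0)           # (1, 64, 384) CLS embeddings per frame

logits = FrameSequenceClassifier()(feats)          # (1, num_classes) PH risk logits
```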

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a novel use of DINOv2 pre-training and feature extraction, followed by a Vision Transformer-based classifier for the classification of pulmonary hypertension from TTE.

    The authors show through multiple quantitative analyses the value of the different steps of the proposed framework, comparing DINOv2 to other pretraining methods, and the ViT to a baseline CNN. A comprehensive ablation study is also presented, to evaluate the benefits afforded by each proposed component. The most notable gains come from pre-training a foundation model with DINOv2 and using this for feature extraction as input to a downstream PH classifier. The ViT outperforms the CNN classifier with all pretraining mechanisms, but particularly so with the older methods (SimCLR and DINO).

    This analysis is helpful for the community to see the effect of different pretraining+feature extraction methods on this task.

    The authors also illustrate the use of attention maps at the video, frame and pixel level, enabling better interpretability of the model predictions, and identification of important video views, frames and image features.
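
    One hedged illustration of how such multilevel importance scores could be derived (an assumption about the general mechanism, not the authors' implementation): frame-level scores can be read off the attention that a CLS query places on the frame tokens, and video-level scores obtained by averaging over the frames belonging to each video.

```python
import torch
import torch.nn as nn

# Hypothetical attention readout: the weights a CLS query places on each frame token
# act as frame-level importance; video-level importance is the mean over each video's frames.
dim, n_frames = 384, 64
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=6, batch_first=True)

frame_feats = torch.randn(1, n_frames, dim)        # pre-extracted per-frame embeddings
cls_query = torch.zeros(1, 1, dim)                 # in practice a learned CLS token
_, weights = attn(cls_query, frame_feats, frame_feats,
                  need_weights=True, average_attn_weights=True)
frame_importance = weights[0, 0]                   # (n_frames,) frame-level attention

# Video-level importance: average attention over the frames belonging to each video (view).
frames_per_video = [16, 16, 32]                    # e.g. three recordings in one study
video_importance = torch.stack(
    [chunk.mean() for chunk in frame_importance.split(frames_per_video)]
)
```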

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The framework as a whole provides a promising use case for the application of recent advancements in image model pretraining (with DINOv2) and the Vision Transformer as a classifier. However, the paper lacks any significant technical novelty, with the main components being largely off-the-shelf models.

    Furthermore, the baseline model is also only briefly described, and it is unclear what the baseline model architecture is, and how it was trained.

    The results section (particularly Tables 2 and 3) lacks clarity on what exactly is being compared; details are provided below in the feedback.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While a lot of detail is provided for how the DINOv2 and ViT models are trained, it is unclear how the baseline model was trained.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Main comments:

    • the baseline weakly supervised learning model does not have any references, and it is difficult to understand exactly what was trained. Please provide more details (e.g. architecture, hyperparameters, and ideally one or more references for the same or similar method(s)). There are several recent works that have developed ML models for identifying PH from ultrasound which would be worth contrasting with: https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.118.034338 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6814224/ https://link.springer.com/article/10.1007/s11263-024-01996-x

    • The use of an off-the-shelf view classification algorithm is introduced in the Data and Experiments section, separately from the baseline model description in “3.2 Weakly Supervised Classification of TTE Videos”. How well did this classifier perform on its own? How much would this have affected the baseline model? Why use this as a benchmark if the baseline model can take multiple views, similarly to the proposed method?

    • The authors describe in the Results: “To evaluate this hypothesis, we methodically implemented the following architectural adjustments: 1) integration of multiple concatenated views within the WSL framework used as baseline, 2) utilization of multiple pre-training encoders, and 3) adoption of varied architectures for the downstream classification.” It is unclear how each of these changes are reflected in Table 3. Is (1) the case for all rows? Does (2) refer to “Training Mechanism”, and (3) refer to “Method” in Table 3? It would be helpful to clarify this in the text.

    • Table 2 is not referenced in the text. Additionally, the dataset being used for evaluation is not specified - is this CIPHER, Sheffield, or both?

    • Table 1 indicates “Train” “Validate” and “Test” sets for each of the Sheffield and CIPHER datasets. However, in the Results section “Generalization Study”, it suggests Table 1 displays results of training on Sheffield data and testing on CIPHER data: “Specifically, when trained on the Sheffield dataset and tested on CIPHER (see Tab. 1), the model achieved an AUC of 0.73 ± 0.02, F1 score of 0.88 ± 0.01, and accuracy of 0.81 ± 0.01.”. Should this text be referencing Table 5 for the cross-dataset validation?

    • Fig. 2: While this appears to be a very helpful way of visualizing the importance of videos, frames and pixels, it is unclear what the colours refer to in (a), (b) and (d). Presumably the lighter colours in (a) and (b) indicate more important videos and frames, respectively, given the PSAX and A4c views highlighted in (b)? How are the pixel-wise attention maps used to identify useful image features here for diagnosing PH? Were there any new structures, or motion features, that were identified with this?

    Minor comments:

    • Typo in Introduction: “pressues” -> “pressures”

    • Typos in caption for Fig. 2: “pizel-wise” -> “pixel-wise”. “anf” -> “and”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper proposes a framework that uses recent advancements in the field of computer vision, leveraging DINOv2 for pre-training and feature-extraction and ViTs for classification, for the task of classifying pulmonary hypertension from multi-view TTE. The authors have done a great job comparing to other pre-training methods, and comparing ViTs with CNNs. This is a helpful contribution to the community that particularly highlights the value of DINOv2 as a pretraining mechanism.

    The paper however could be improved by better clarity on a number of points listed above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors present an architecture that combines a self-supervised learning approach and a transformer model for detection of pulmonary hypertension (PH) in echocardiography. The method is designed to be agnostic to the acoustic views, and provides interpretability at video, frame and pixel levels. Compared to a baseline model developed by the authors, the proposed method performs favourably. Overall, solid groundwork for various relevant downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very engaging read. Sound and versatile approach with a notable potential in echocardiography and possibly other imaging modalities.
    • Incorporates the relatively novel DINOv2 self-supervised learning technique in the context of echocardiography.
    • Provides interpretable output, which, in addition to being relevant for the task at hand, could have significant potential for identifying relevant diagnostic features across a range of conditions.
    • Structured results, detailing experiments across different foundational models, classifiers and parameters that influence the results.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The method section lacks sufficient detail, and the connections between the embedding technique, the transformer, and the weakly supervised learning approach are somewhat unclear.
      o How was the number of frame features selected?
      o How were queries, keys, and values generated?
      o What 2D CNN was used in the WSL setup? What objective was used for the end-to-end training? Training details are also lacking.

    • The cross-validation approach is ambiguous:
      o Clarity is needed on whether it included a separate test set.
      o It should be specified whether the training data for the foundational model overlapped.
      o Explanation is needed on whether training datasets were mixed.

    • There is no discussion on the quality or source of the dataset, such as whether it is from a research setting or an outpatient clinic.

    • For someone with limited knowledge of the difficulty of PH detection from echocardiography, it is not evident whether the baseline models used in this paper are representative of the state of the art, including approaches based on measurements and traditional metrics and thresholding. The same also applies to the final results. A comparison with existing benchmarks would contextualize the findings more effectively.

    • Inconsistencies and errors in methodological notation and typesetting need correction. For instance, the inconsistent use of variables (i, v, m_v) and typesetting styles (upright vs. italic).

    • Limited elaboration on how the video-level attention is used to select the most relevant videos. Were average values across frames used per video? For Fig. 2 it would also be beneficial for the reader to know the colour scales. From the text a flattened interventricular septum and right ventricle enlargement are examples of traits prevalent in patients with PH. However, this was not apparent from the example shown in Fig. 2.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The notation used for video embedding in section 3.1.1, particularly the interchangeable use of m_v and v for frame counts, should be clarified.
    • Transformer-specific terms like head_h and W_O need definitions for clarity and consistency with the variable definitions (a standard formulation is sketched after this list).
    • Did you use multiple views in the WSL approach? If so, which views did you include?
    • The fixed image resolution and the implications of variable pixel spacing and anisotropy should be addressed, especially when referring to differences in the shape and size of the ventricle.
    • The caption of Fig. 2 has several typographical errors, e.g. “pizel”, “vales”, “anf”.
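
    For reference, the standard multi-head attention formulation that the terms head_h and W_O usually denote, following Vaswani et al. (2017); this is generic notation, not notation confirmed from the paper:

```latex
% Standard (assumed) multi-head attention definitions:
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
\qquad
\mathrm{head}_h = \mathrm{Attention}\!\left(Q W_h^{Q},\; K W_h^{K},\; V W_h^{V}\right)
\qquad
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O}
```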
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Interesting work that is believed to be of interest to the MICCAI community. Despite limited methodological novelty in the isolated components, the method integration, use of data, and clinical application are novel. The multilevel interpretability can be of broad interest.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose an echocardiography foundation model, using a training procedure similar to DINOv2 for image embedding, and an attention module to prioritize important parts of the input video.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a novel application of self-supervision for building an embedding layer, which is useful for training on long video sequences where it may be difficult to fit the entire video in GPU memory. In this case, the long video is a series of videos derived from the same echo exam.
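
    A short sketch of that memory argument, under the assumption that frame embeddings are precomputed with the frozen encoder in small chunks so that only compact features, rather than full videos, ever need to sit in GPU memory; the function name and its parameters are hypothetical.

```python
import torch

@torch.no_grad()
def embed_exam(encoder, videos, chunk_size=32, device="cuda"):
    """Embed every frame of an exam in small chunks so raw video never fills GPU memory.

    videos: list of tensors, each of shape (n_frames_i, 3, H, W).
    Returns a single tensor of shape (sum_i n_frames_i, feature_dim) kept on CPU.
    """
    encoder = encoder.to(device).eval()
    features = []
    for video in videos:                            # one echo exam = several view recordings
        for start in range(0, video.size(0), chunk_size):
            batch = video[start:start + chunk_size].to(device)
            features.append(encoder(batch).cpu())   # only compact embeddings are retained
    return torch.cat(features, dim=0)
```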

    The authors provide useful illustrations of the model’s attention, and perform a study of training on one dataset and testing on another, which is helpful for discussion of the generalization capability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some aspects of the quantitative evaluation are difficult to understand. For example, were all views or just the A4C view used to train the encoder? What about the attention/classifier mechanism? Is this held consistent for the comparison studies in Table 3? It is also not entirely clear what “WSL” refers to. Does it refer to a training method that is architecture agnostic? Or does it refer to a training method and a CNN architecture? These details should be clarified to ensure the reader can better approximate the capability of the methods that are being compared against.

    It appears that the authors compare a CNN trained on one view against their method trained on multiple views, in which case the input datasets are quite different and the comparison is not a fair one. See further comments.

    Furthermore, it looks like the supplementary material is just an identical copy of the paper. Did you mean to submit something else?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    n/a

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    3.1.1: Are the weights of the encoder frozen while the downstream attention module and classifier are trained?

    Tables: Another column of the tables should be dedicated to the choice of whether single view or multiple view input was used, both to train the DINO segment and to train the attention/classifier segment. The choice of inputs is an important variable that should be discussed. By training on only A4C videos, the CNN model that you are comparing with is effectively learning with a smaller dataset.

    As a result, the reader cannot entirely ascertain whether the gap in performance is caused by the difference in data volume or by a genuine improvement of the model; this is not adequately decoupled in your experiments.

    In the “baseline experiment” section, you wrote “When addressing the challenge posed by the varying lengths of videos, we utilized a WSL model that employs attention-based CNNs.” Is there a citation for which WSL model was used? If not, more information should be provided, such as the number of parameters of the model. This would do more to convince readers of the fairness of the comparison.
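
    For readers unfamiliar with this family of baselines, below is a minimal sketch of an attention-based multiple-instance (weakly supervised) video classifier in the spirit of Ilse et al.'s attention-based deep MIL; the backbone, attention dimensions, and pooling shown here are assumptions, since the paper's WSL baseline is not specified.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class AttentionMILClassifier(nn.Module):
    """Illustrative attention-based, weakly supervised (MIL) video classifier."""

    def __init__(self, feat_dim: int = 512, attn_dim: int = 128, num_classes: int = 2):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # per-frame CNN features
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (n_frames, 3, H, W), one variable-length video, supervised only by its label
        h = self.cnn(frames).flatten(1)              # (n_frames, feat_dim)
        a = torch.softmax(self.attention(h), dim=0)  # learned per-frame attention weights
        z = (a * h).sum(dim=0)                       # attention-pooled video representation
        return self.head(z)                          # video-level PH logits

logits = AttentionMILClassifier()(torch.randn(40, 3, 224, 224))
```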

    For the six methods compared in Table 3, it is unclear whether all of them use all views or only the four-chamber view, and whether this applies to the training of the encoder or only to the attention/classifier. It is tedious to add these details, but they are still quite important for the reader.

    Table 4: For the ImageNet row, since ImageNet weights are taken directly, does it make sense to have the (multi) after ImageNet? Furthermore, it would benefit the reader to have a fourth line with the optimum model (EchoFM architecture, multi-view DINO pre-training), instead of having the red text for comparison. This would also give you more space, as your table currently oversteps the right margin.

    Results interpretability section: would “patch-level attention” be more accurate than “pixel-level attention”, since the ViT divides images into 16x16 patches? Furthermore, can you elaborate on the sentence “diverging from conventional automated views like A4c but aligning with standard clinical protocol”? If the standard protocol uses the A4C view, does the statement not contradict itself? Do you mean that the protocol typically uses A4C, but when (for example) the visual quality is poor, the model uses other views that are still reasonable from the perspective of detecting pulmonary hypertension?
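
    On the patch- versus pixel-level question, a common way to display ViT attention at image resolution (assumed here for illustration, not taken from the paper) is to reshape the CLS-to-patch attention into its spatial grid and bilinearly upsample it to the frame size:

```python
import torch
import torch.nn.functional as F

# Hypothetical CLS-to-patch attention from a ViT with 16x16 patches on a 224x224 frame:
# 14x14 = 196 patch tokens, upsampled to pixel resolution purely for visualisation.
patch_attn = torch.rand(196)                                   # one attention value per patch
grid = patch_attn.reshape(1, 1, 14, 14)
pixel_map = F.interpolate(grid, size=(224, 224), mode="bilinear", align_corners=False)
pixel_map = (pixel_map - pixel_map.min()) / (pixel_map.max() - pixel_map.min() + 1e-8)
# pixel_map[0, 0] can now be overlaid on the original frame as a heatmap.
```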

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has some interesting ideas and a good evaluation. However, the lack of clarity regarding some parts of the Results section and the tables is a weakness.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Reviewer #1: We thank the reviewer for the feedback. We will add more details pertaining to the Weakly-Supervised Learning (WSL) task in the revised manuscript. We would also like to note that our baseline is a re-implementation of the first reference mentioned in the initial comment by the reviewer.

In response to the second comment, we would like to clarify that the baseline is not solely the WSL, but a combination of a view classifier with a CNN-based classifier following it. This is already an improved version of the baseline implemented in: Zhang, Jeffrey, et al. “Fully automated echocardiogram interpretation in clinical practice: feasibility and diagnostic accuracy.” Circulation 138.16 (2018): 1623-1635.

Regarding the remaining comments, we will rectify the spelling mistakes, add references to the tables, and include figure explanations in the revised manuscript.

Reviewer #2: We appreciate the comments by the reviewer. We will address the first point by adding details about the WSL implementation in the revised manuscript.

We will also add details about the different cross-validation schemes used in various experiments in the paper. We ensured that there was no data leakage at any point in the experiment. The foundational model used is self-supervised and does not cause double-dipping as no labels were used in training it. Furthermore, the validation and test sets were isolated from the downstream training. For the remaining comments, we will ensure that the inconsistencies and figure details are addressed in the final version.

Reviewer #4: We thank the reviewer for the comments. Only the baseline experiment was run on top of an A4c classifier, wherein the detected A4c views were used as input to the downstream classifier. All the remaining experiments use all available views. We will add architectural/training details in the revised manuscript. We confirm that the weights for the foundation model encoder were frozen after self-supervised pre-training.




Meta-Review

Meta-review not available, early accepted paper.


