Abstract

Cardiac Magnetic Resonance (CMR) imaging serves as the gold standard for evaluating cardiac morphology and function. Typically, a multi-view CMR stack, covering short-axis (SA) and 2/3/4-chamber long-axis (LA) views, is acquired for a thorough cardiac assessment. However, efficiently streamlining the complex, high-dimensional 3D+T CMR data and distilling a compact, coherent representation remains a challenge. In this work, we introduce a whole-heart self-supervised learning framework that utilizes masked image modeling to automatically uncover the correlations between spatial and temporal patches throughout the cardiac stacks. This process facilitates the generation of meaningful and well-clustered heart representations without relying on the traditionally required, and often costly, labeled data. The learned heart representation can be directly used for various downstream tasks. Furthermore, our method demonstrates remarkable robustness, ensuring consistent representations even when certain CMR planes are missing or flawed. We train our model on 14,000 unlabeled CMR scans from the UK Biobank and evaluate it on 1,000 annotated scans. The proposed method demonstrates superior performance to baselines in tasks that demand comprehensive 3D+T cardiac information, e.g. cardiac phenotype (ejection fraction and ventricle volume) prediction and multi-plane/multi-frame CMR segmentation, highlighting its effectiveness in extracting comprehensive cardiac features that are both anatomically and pathologically relevant. The code is available at https://github.com/Yundi-Zhang/WholeHeartRL.git.
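
To make the masked-image-modeling idea above concrete, the sketch below shows one plausible way to flatten a multi-view 2D+T CMR stack into a single token sequence, mask most tokens, and reconstruct them with a transformer encoder-decoder. This is a minimal illustration, not the authors' released implementation: all shapes, layer sizes, the 75% mask ratio, and the omission of positional embeddings are simplifying assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyMultiViewMAE(nn.Module):
    """Minimal MAE-style sketch: mask spatiotemporal CMR patches, then reconstruct them."""

    def __init__(self, patch=16, dim=256, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.to_token = nn.Linear(patch * patch, dim)            # patch embedding
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.to_pixels = nn.Linear(dim, patch * patch)

    def patchify(self, stack):
        # stack: (B, S, T, H, W) -- S slices pooled from SA and LA views, T cardiac phases
        B, S, T, H, W = stack.shape
        p = self.patch
        x = stack.reshape(B, S, T, H // p, p, W // p, p).permute(0, 1, 2, 3, 5, 4, 6)
        return x.reshape(B, S * T * (H // p) * (W // p), p * p)  # (B, N, p*p)

    def forward(self, stack):
        patches = self.patchify(stack)
        tokens = self.to_token(patches)                          # positional embeddings omitted
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=tokens.device).argsort(dim=1)   # random token shuffle
        visible = torch.gather(tokens, 1, idx[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(visible)                           # whole-heart representation
        full = torch.cat([latent, self.mask_token.expand(B, N - keep, D)], dim=1)
        recon = self.to_pixels(self.decoder(full))               # reconstruct in shuffled order
        target = torch.gather(patches, 1, idx[:, :, None].expand(-1, -1, patches.shape[-1]))
        # for brevity the loss covers all patches; MAE usually scores only the masked ones
        return F.mse_loss(recon, target), latent


# toy forward pass: 9 slices (e.g. 6 SA + 3 LA), 2 phases, 32x32 pixels
loss, latent = ToyMultiViewMAE()(torch.randn(1, 9, 2, 32, 32))
```

Because the encoder only ever sees a visible subset of tokens, a latent can still be computed when entire planes are dropped, which is the intuition behind the robustness to missing views claimed in the abstract.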

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1466_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1466_supp.pdf

Link to the Code Repository

https://github.com/Yundi-Zhang/WholeHeartRL.git

Link to the Dataset(s)

https://www.ukbiobank.ac.uk/

BibTex

@InProceedings{Zha_Whole_MICCAI2024,
        author = { Zhang, Yundi and Chen, Chen and Shit, Suprosanna and Starck, Sophie and Rueckert, Daniel and Pan, Jiazhen},
        title = { { Whole Heart 3D+T Representation Learning Through Sparse 2D Cardiac MR Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    1) Proposed an MAE-based 3D+T representation learning method for the whole heart, learned from multi-view (SA and LA) planes together with temporal information.

    2)The approach ensures a consistent cardiac representation of the same subject in the absence of a few planes.

    3) Multiple downstream tasks are fine-tuned from the learned representations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Learning representations with MAE for all 2D+T patches from all cardiac planes together. 2) Robustness to missing planes. 3) Segmenting all planes end-to-end during inference.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The overall MAE framework displays minimal innovation. 2) The methodology extensively parallels previous research that employs MAE for processing CT or MRI scans. 3) The study lacks a comparative analysis with other unsupervised learning methods. Including such comparisons could significantly enhance the validation of the proposed method. 4) Lack of comparison with a supervised version of the method for the downstream tasks. 5) What about the complexity of the network? Since it takes 2D+T patches from all planes as input, the number of network parameters should be analyzed when comparing with other methods.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The model would be difficult to reproduce. The dataset and code are not available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See comments in Q6 for details.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The novelty of the method is limited. More experiments are required to justify the superiority of the method, such as ablation studies in which only supervised learning is used for the downstream task with the same backbone, and comparisons with other methods that use the same number of network parameters.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In cardiac MRI, multi-view stacks are typically obtained to assess cardiac morphology and function, which results in multiple 3D+T MRI stacks that are not trivial to analyze together. The authors present a method to obtain a latent self-supervised representation of the heart by combining these different views. They used a masked autoencoder (MAE) to uncover spatial and temporal associations between patches, without the need for labeled data.

    The method seems robust, works well on multiple downstream tasks, and was trained on real-world UK Biobank (UKB) data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a self-supervised method to learn a whole heart representation that fuses information from multiple views. The representation is useful for different downstream tasks and does not require labeled data. Moreover, it leverages both spatial and temporal associations.

    Trained on real-world UKB data: 14k unlabeled cases and 1k labeled.

    Extensive segmentation benchmark

    The representation is robust to missing data / planes.

    Good outlook of potential future applications and limitations

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The predicted phenotypes do not require “ample” temporal information; they can be computed from one or two frames (like the EF). In that sense, why not predict the time-series volume evolution instead of the EDV or the EF?

    Regarding the cropping of the images, how can one be sure that the crop always includes the ROI?

    Would it be possible to have more details of the training process? How many epochs? Was a validation set used for early stopping? Were the 100 subjects of the test set the same for all the different methods?

    For Table 1, would it be possible to report whether the results are statistically significant with respect to the baselines?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    For every subject, 6 SA slices were provided. Nevertheless, I believe the number of SA slices per subject might vary. Thus, were certain SA slices discarded? If so, what were the criteria?

    In the t-SNE visualization, what are the parameters of this projection? And why choose these labels, RVEF and LVM? They do not seem to be the first phenotypes one would choose. Were they chosen empirically, or was a certain metric used to identify which phenotypes correlate best with the t-SNE projection?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I believe the authors did a very interesting work, with extensive validation and a sound methodology. I think the paper would be of interest for the community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents a thorough self-supervised learning framework, leveraging Masked Autoencoders (MAE), for whole-heart analysis. This framework utilizes masked image modeling to automatically unveil correlations between spatial and temporal patches across cardiac stacks and to fuse information from different views. Moreover, the approach is adaptable to various downstream tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) Innovative concept: The paper proposes a 3D+T self-supervised representation learning method, capable of integrating comprehensive spatiotemporal information as well as information from different views, using unlabeled data. (2) Various applications: The method was trained and evaluated on a large dataset with various downstream tasks, including segmentation and cardiac phenotype prediction (clinical applications). (3) Good evaluation: The performance of each task was assessed through comprehensive comparative experiments and ablation studies. Overall, it is an excellent work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Statistical analysis is necessary to examine the correlation and significant differences for the prediction of cardiac phenotypes.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No source code provided. It would be highly beneficial if the authors could release their code as open source.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) Offer a more detailed description of the dataset, including information such as TE/TR, pixel spacing, scanner details, slice thickness, slice gap, field of view, etc. This would provide valuable insights for data pre-processing. (2) The abbreviations are not clarified within the context (LVBP: left ventricle blood pool; LABP: left atrial blood pool …). (3) It might be interesting to compare the phenotypic predictions derived from the segmentation model with those obtained from the prediction model.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Strong Accept — must be accepted due to excellence (6)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, this paper is well written and easy to follow. Two primary factors influence my recommendation: (1) The novelty of this approach. (2) The quality of evaluation, which encompasses not only technical aspects but also clinical measures.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for the thoughtful feedback and the recognition of our work! We appreciate that all reviewers found learning a self-supervised 3D+T whole-heart representation interesting to the community and its implementation important for clinical applications!

We are encouraged by the positive remarks on the following aspects of our study:

  • The novelty of using Masked Autoencoders (MAE) to process complicated and interlaced multi-view CMR sequences (R1, R4).
  • The training on a large-scale dataset of 14k UK Biobank CMR scans (R1).
  • The exhaustive evaluation with various downstream tasks (R1, R3, R4).
  • The robustness to missing planes (R1, R3).

We would now like to address the concerns raised by the reviewers.

  1. Novelty of applying MAE to CMR data and comparison to other self-supervised learning (SSL) methods (R3).

We want to reiterate that our approach is novel and unique in its handling of multi-view and spatiotemporal CMR sequences compared to other SSL methods. A major contribution of our work is adapting MAE to manage interlaced multi-view CMR data, which current state-of-the-art SSL methods cannot achieve. No existing SSL methods, including the vanilla MAE, are capable of handling such complex CMR data.

Leveraging the enhancements from SSL, our model surpasses the performance of supervised methods (ViT in phenotype prediction and UNETR+ in segmentation), underscoring the importance of exploring whole-heart representations. This field is still underexplored in the community.

  2. Experiment details and results presentation (R1, R3, R4).

2.1 Trainable parameters (R3). For phenotype prediction, ResNet50 – 24.9M, ViT – 83.5M, and ours – 83.5M. For segmentation, nnUNet – 60M, UNETR+ – 84.6M, and ours – 86.5M.

2.2 Further training details (R1). All experiments are trained for 300 epochs with an early-stopping strategy. Regarding image preprocessing, we applied center cropping to all subjects. Owing to the consistent acquisition protocol of the dataset, we did not observe cropped subjects with a partly missing ROI. While we found that 4 to 5 centered short-axis slices were sufficient to cover the relevant cardiac information with a clear cardiac contour, we used 6 centered short-axis slices to ensure full coverage.
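
As a rough illustration of the preprocessing described in 2.2, the snippet below keeps the 6 short-axis slices closest to the middle of the stack and center-crops every frame. The crop size, array shapes, and NumPy pipeline are hypothetical; only the choice of 6 centered slices and center cropping comes from the rebuttal.

```python
import numpy as np


def select_central_sa_slices(sa_stack: np.ndarray, n_keep: int = 6) -> np.ndarray:
    """sa_stack: (S, T, H, W) short-axis slices over the cardiac cycle; keep the middle n_keep."""
    start = max(0, sa_stack.shape[0] // 2 - n_keep // 2)
    return sa_stack[start:start + n_keep]


def center_crop(frames: np.ndarray, crop_size: int = 128) -> np.ndarray:
    """Center-crop the last two (spatial) axes to crop_size x crop_size."""
    h, w = frames.shape[-2:]
    top, left = (h - crop_size) // 2, (w - crop_size) // 2
    return frames[..., top:top + crop_size, left:left + crop_size]


# example: 11 acquired SA slices, 50 phases, 210x208 pixels -> (6, 50, 128, 128)
sa = np.random.rand(11, 50, 210, 208)
sa = center_crop(select_central_sa_slices(sa), crop_size=128)
```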

2.3 t-SNE visualization (R1, R4). We use a perplexity of 15 and a learning rate of 10 for t-SNE. Due to the page limit, we only showed two phenotype representations; the other phenotypes, e.g. LVEF, also exhibit well-clustered and well-distributed representations.
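
The stated t-SNE settings can be reproduced, for instance, with scikit-learn (the toolkit actually used is an assumption); here `latents` is a placeholder for the per-subject whole-heart representations.

```python
import numpy as np
from sklearn.manifold import TSNE

latents = np.random.rand(500, 256)           # placeholder: 500 subjects x 256-dim embeddings
emb = TSNE(n_components=2, perplexity=15, learning_rate=10,
           init="pca", random_state=0).fit_transform(latents)
# emb has shape (500, 2) and can be scattered, colored by a phenotype such as RVEF or LVM
```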

2.4 Further dataset details (R4). We conduct all experiments with CMR data from the UK Biobank acquired on a 1.5 Tesla scanner (MAGNETOM Aera, Syngo Platform VD13A, Siemens Healthcare, Erlangen, Germany), with a typical field of view of 380×252 mm for the short-axis view and 380×274 mm for the long-axis view. The TE/TR (ms) of the short-axis view is 1.12/2.6 and of the long-axis view 1.10/2.8. The voxel size for short-axis slices is 1.8×1.8×8.0 mm and for long-axis slices 1.8×1.8×6.0 mm. The temporal resolution is 32 ms.

2.5 Reproducibility (R3, R4). We will make the code of this paper publicly accessible after the double-blind review process.

In the end, we would like to thank reviewers again for their detailed comments and very constructive suggestions!




Meta-Review

Meta-review not available, early accepted paper.


