Abstract

Accurate and contrast-free Major Adverse Cardiac Events (MACE) prediction from Cine MRI sequences remains a critical challenge. Existing methods typically necessitate supervised learning based on human-refined masks in the ventricular myocardium, which become impractical without contrast agents. We introduce a self-supervised framework, namely Codebook-based Temporal-Spatial Learning (CTSL), that learns dynamic, spatiotemporal representations from raw Cine data without requiring segmentation masks. CTSL decouples temporal and spatial features through a multi-view distillation strategy, where the teacher model processes multiple Cine views, and the student model learns from lower-dimensional Cine-SA sequences. By leveraging codebook-based feature representations and dynamic lesion self-detection through motion cues, CTSL captures intricate temporal dependencies and motion patterns. High-confidence MACE risk predictions are achieved through our model, providing a rapid, non-invasive solution for cardiac risk assessment that outperforms traditional contrast-dependent methods, thereby enabling timely and accessible heart disease diagnosis in clinical settings.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2539_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{SuHao_CTSL_MICCAI2025,
        author = { Su, Haoyang and Rui, Shaohao and Xiang, Jinyi and Wu, Lianming and Wang, Xiaosong},
        title = { { CTSL: Codebook-based Temporal-Spatial Learning for Accurate Non-Contrast Cardiac Risk Prediction Using Cine MRIs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {147 -- 157}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    Methodologically, this paper introduces a codebook-based approach for disentangling temporal and spatial information in cardiac MRI sequences. By separately encoding motion dynamics and anatomical structures through vector quantization, the framework resolves ambiguities between cardiac structures while preserving critical motion patterns. Experiments across three cardiac MRI datasets demonstrate that this spatiotemporal disentanglement approach significantly outperforms traditional risk prediction methods without requiring contrast agents or manual segmentation, achieving superior C-index values and clear risk stratification between patient groups.
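
For readers unfamiliar with the metric cited here, the following is a minimal sketch of Harrell's concordance index (C-index) for right-censored survival data. It is illustrative only and is not the authors' evaluation code; the function name `concordance_index` and its argument layout are assumptions made for this example.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index: fraction of comparable pairs ranked correctly.

    times  : observed follow-up times
    events : 1 if the event (e.g. MACE) was observed, 0 if censored
    risks  : predicted risk scores (higher = earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had an observed event
            # strictly before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times by predicted risk.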

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper presents a compelling approach to cardiac risk prediction by disentangling temporal and spatial features through codebook-based representation learning, effectively capturing complex cardiac motion patterns while resolving structural ambiguities without requiring contrast agents. The experimental results demonstrate significant performance improvements across multiple datasets, achieving superior C-index values compared to traditional methods and providing clear risk stratification between patient groups, validating the clinical potential of this contrast-free approach.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weakness of this paper is the writing: quite a lot of technical details need to be better explained. Most importantly, the way the spatial features Z_{\sigma}^{(s)} and temporal features Z_{\tau}^{(s)} are disentangled, which is the core of this work, is not clear. The general idea might be to train two codebooks, one for spatial features and the other for temporal features. But it is not clear why they end up trained to meet these needs; to me it seems we train a bunch of codebook entries and they are magically divided into spatial and temporal entries.

    Besides, several concepts mentioned in Fig. 1, including the motion queries and latent attention map, are not clear. It would also be great if the authors could explain “temporally synchronized positives z_{i}^{(s,+)}” more clearly.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the general idea is convincing and the experimental results look promising, this paper needs more refinement of the text to explain the details more clearly, as mentioned in the weaknesses section.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ clarifications do address my concerns about the methodological details. It would be appreciated if the authors could add those explanations to the final version and avoid unexplained terms appearing in figures.



Review #2

  • Please describe the contribution of the paper

    The contribution of this paper is a contrast-free approach for MACE prediction using cine MRI, eliminating the need for annotated masks or contrast agents. This is achieved through a motion-aware multi-view distillation framework that aligns short-axis and long-axis cine features using a KL-divergence–based teacher-student paradigm to capture holistic cardiac motion and anatomy. It outperforms contrast-dependent methods, enabling more timely and accessible heart disease diagnosis.
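
As a rough illustration of the KL-divergence-based teacher-student alignment mentioned above, here is a minimal NumPy sketch. The direction of the divergence (teacher as the reference distribution), the softmax temperature, and the function names are assumptions for this example; the paper's exact loss may differ.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = np.asarray(logits, dtype=np.float64) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_distillation_loss(teacher_logits, student_logits,
                         temperature=1.0, eps=1e-12):
    """Mean KL(teacher || student) over a batch of logit vectors."""
    p = softmax(teacher_logits, temperature)  # teacher (e.g. multi-view) output
    q = softmax(student_logits, temperature)  # student (e.g. short-axis) output
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge.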

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The writing is clear, and the motivation and methodology are well presented. A key strength of this work is its fully non-contrast approach to cardiac risk prediction, which eliminates the need for manual segmentation masks or contrast imaging, thereby enhancing its practical applicability. By integrating multi-view cine data (short-axis and multiple long-axis views) within a motion-aware distillation framework, the authors effectively capture both spatial and temporal cardiac dynamics. Furthermore, their codebook-based embedding disentangles and encodes these spatiotemporal features into compact ‘tokens’ for efficient representation. The authors conduct comprehensive experiments, including Kaplan-Meier analysis and SHAP-based interpretability assessments, to demonstrate the advantages of the proposed method.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In the ablation study, the lower performance on the RJCCM cohort using disentangled spatiotemporal representations requires further discussion. Similarly, the notably smaller p-value of CoxPH compared to CTSL in the AZCCM cohort should be explained. Additionally, the experimental section lacks details on the sample allocation for training and testing, and the overall training set appears relatively small.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The writing is clear, and the motivation and methodology are well presented. A key strength of this work is its fully non-contrast approach to cardiac risk prediction, which removes the need for manual segmentation masks or contrast imaging, thereby enhancing practical applicability. The proposed method effectively addresses the identified research gap. However, the experimental details require clarification, and there is a lack of discussion regarding certain results—particularly the inconsistencies observed with the proposed method—which warrants further attention.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose a two-stage self-supervised learning framework, named CTSL, to integrate both long-axis (LAX) and short-axis (SAX) view Cine MRI with traditional electronic health record (EHR) data for major adverse cardiac event (MACE) risk prediction. The method first employs a DINOv2 architecture with a UniFormer backbone to generate spatiotemporal representations in the embedding space by aligning SAX dynamics with the anatomical patterns of individual LAX views. Then, a VQ-VAE is used to disentangle the representations into spatial and temporal embeddings and generate “deep features” from the trained codebook. Finally, a Cox model is fitted using the multimodal data (EHR + deep features) for risk stratification, outperforming baseline methods from the literature and simpler architectures.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Good level of novelty: The proposed two-stage framework introduces an effective self-supervised method that concurrently utilises the three LAX views and the SAX view to extract both spatial and temporal representations for deep feature extraction. Additionally, the authors design a well-thought-out loss function that guides the DINOv2 backbone to achieve patient-level alignment based on the KL divergence.

    • Good demonstration of clinical feasibility: The authors compare the proposed method with several baseline approaches and conduct ablation studies under different experimental setups, demonstrating the technique’s potential for clinical application.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Heavy computational burden: The primary concern lies in the computational intensity of the proposed pipeline. The authors have not provided details regarding the time required for each stage, both during training and inference. In particular, the preprocessing step (ROI extraction using Farneback dense optical flow) may be computationally expensive. A brief clarification would be helpful to justify the choice of this method over simpler computer vision techniques, such as isolating the only moving region (i.e., the heart) by taking the differences between two consecutive frames and finding the ROI with Canny edge detection.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Limited reproducibility: Reproducibility would be improved if the authors open-sourced the code used in the study. However, the authors did provide the model’s hyperparameters.

    • Good clarity and organisation: The authors did a commendable job of presenting a complex methodology in a clear and well-structured manner, making the manuscript easy to follow.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a highly novel and technically sound two-stage self-supervised framework that addresses an important and clinically relevant task — risk prediction of major adverse cardiac events. The integration of both Cine MRI views (LAX and SAX) with EHR data through a thoughtful combination of DINOv2, UniFormer, and VQ-VAE demonstrates innovation in both representation learning and multi-modal fusion. The proposed alignment strategy and the use of KL divergence to guide spatiotemporal embedding learning are well-motivated and effectively validated through comprehensive experiments and ablations. While the computational burden is non-negligible, it does not detract from the overall contribution, and could be addressed in future work. The method’s potential for real-world clinical use, combined with its strong experimental performance and clear presentation, makes this paper a valuable addition to the MICCAI community.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors provided sufficient clarifications on the methodology; I do not see a reason to reject the paper.




Author Feedback

We sincerely thank all reviewers (R1, R2, R3) for insightful feedback on our manuscript. We have carefully considered all comments and provide our responses below.

R1.1 Clarity of Disentangling Spatial Features Z_{\sigma}^{(s)} and Temporal Features Z_{\tau}^{(s)} Re: In Stage II, SA data is processed as two distinct 3D inputs: a (T, H, W) volume for the temporal branch and a (D, H, W) volume for the spatial branch, with the respective fourth dimension repurposed as channel features. This design yields distinct temporal and spatial tokens, which define their corresponding codebooks.
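
The two-branch input layout described in R1.1 can be sketched as follows. This is a minimal illustration under stated assumptions: the input is taken to be a 4D array ordered (T, D, H, W), a channels-first convention is used, and the function name `split_branches` is hypothetical.

```python
import numpy as np

def split_branches(cine_sa):
    """Build temporal and spatial branch inputs from a short-axis cine stack.

    cine_sa : array of shape (T, D, H, W) = frames, slices, height, width
              (this axis order is an assumption of the sketch).
    """
    cine_sa = np.asarray(cine_sa)
    if cine_sa.ndim != 4:
        raise ValueError("expected a 4D array (T, D, H, W)")
    # Temporal branch: the 3D volume is (T, H, W); the slice axis D is
    # repurposed as the channel axis, giving (C=D, T, H, W).
    temporal_in = np.transpose(cine_sa, (1, 0, 2, 3))
    # Spatial branch: the 3D volume is (D, H, W); the frame axis T is
    # repurposed as the channel axis, giving (C=T, D, H, W).
    spatial_in = cine_sa
    return temporal_in, spatial_in
```

Both outputs hold the same voxels; only the axis treated as "channels" differs, which is what lets each branch tokenize along a different dimension.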

R1.2 Clarity of Concepts in Fig. 1 Re: ‘Motion queries’ describe tokens carrying spatiotemporal features of myocardial dynamics (collectively termed ‘motion’). In Stage II, these tokens ‘query’ our codebooks to map to the closest embeddings. The ‘latent attention map’ is then generated using the attention rollout technique.
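
The two operations named in R1.2 can be sketched generically as below: a nearest-neighbour codebook lookup (vector quantization) and attention rollout. This is not the authors' implementation; the Euclidean distance, the equal weighting of attention and identity in the rollout, and the function names are assumptions of this example.

```python
import numpy as np

def quantize(tokens, codebook):
    """Map each token to its nearest codebook entry.

    tokens   : (N, d) query tokens
    codebook : (K, d) learned embeddings
    Returns (indices of shape (N,), quantized tokens of shape (N, d)).
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    codebook = np.asarray(codebook, dtype=np.float64)
    # Squared Euclidean distance between every token and every entry: (N, K).
    diffs = tokens[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

def attention_rollout(attentions):
    """Combine per-layer attention maps into one token-to-token map.

    attentions : list of (n, n) head-averaged attention matrices,
                 ordered from the first layer to the last.
    """
    n = attentions[0].shape[0]
    rollout = np.eye(n)
    for attn in attentions:
        # Add the identity to account for residual connections,
        # then re-normalize each row to sum to one.
        a = 0.5 * np.asarray(attn, dtype=np.float64) + 0.5 * np.eye(n)
        a = a / a.sum(axis=-1, keepdims=True)
        rollout = a @ rollout
    return rollout
```

In this reading, the "motion queries" play the role of `tokens`, and the rollout output is what would be visualized as a latent attention map.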

R1.3 Clarity of Temporally Synchronized Positives z_i^{(s,+)} Re: ‘Temporally synchronized’ implies consistent cardiac cycle features for the same patient across 4 CMR views. Strictly following contrastive loss principles, z_i^{(s,+)} represents the sample itself, while z_j^{(s,-)} denotes other samples in the same batch. We adopt this loss to prevent model collapse.
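
A minimal sketch of an in-batch contrastive loss consistent with the description in R1.3, where each sample's positive is its own counterpart and the negatives are the other samples in the batch. The InfoNCE form, cosine similarity, and temperature value are assumptions for illustration; the rebuttal only states that contrastive loss principles are followed.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.1):
    """In-batch contrastive loss.

    anchors, positives : (B, d) embeddings; row i of `positives` is the
    positive for row i of `anchors`, all other rows act as negatives.
    """
    a = np.asarray(anchors, dtype=np.float64)
    p = np.asarray(positives, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = a @ p.T / temperature                # (B, B) cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds each sample's similarity to its own positive.
    return float(-np.mean(np.diag(log_prob)))
```

Because every row must single out its own positive among the batch, a degenerate solution that maps all samples to the same embedding is penalized, which is the usual sense in which such a loss discourages collapse.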

R2.1 Discussion on Lower Performance in RJCCM Cohort (Ablation Study) Re: We hypothesize that misjudging noisy Stage I features could lead Stage II quantization to amplify such noise, resulting in flawed codebook embeddings. This idea is supported by observed temporal artifacts in some RJCCM samples. Given that our research objective is to study the impact of learned representations beyond purely improving results, we decided not to exclude these challenging samples. We will track this issue in future work.

R2.2 Explanation for p-value Difference (CoxPH vs. CTSL in AZCCM Cohort) Re: As observed, potent EHR predictors like D2B time in AZCCM (its impact on MACE is supported by Sawano et al., J Cardiol, 2025) can yield very small p-values. This matches CoxPH’s prediction (purple curve, Fig. 2) on AZCCM. Notably, the inconsistent presence of potent indicators (e.g., D2B time, absent in our RJCCM/TJCCM cohorts) limits the extensibility of models reliant on them, unlike image-based approaches. CTSL’s p-value might be larger because its imaging features modulate EHR predictors (like D2B time). However, this does not detract from our model’s enhanced C-index.

R2.3 Details on Sample Allocation for Training and Testing Re: We apologize for this oversight. Rigorous data splitting was employed, with ratios of train:val = 7:3 for SSL and train:val:test = 7:1.5:1.5 for survival analysis. Details will be in the revision.
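
The stated ratios can be reproduced with a simple seeded split such as the sketch below. Splitting at the level of patient or study identifiers, the fixed seed, and the helper name `split_ids` are assumptions of this example; the rebuttal does not describe the splitting procedure beyond the ratios.

```python
import random

def split_ids(ids, ratios, seed=0):
    """Shuffle ids and partition them by ratios, e.g. (7, 3) or (7, 1.5, 1.5)."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)  # seeded for a reproducible split
    total = float(sum(ratios))
    splits, start, cumulative = [], 0, 0.0
    for k, r in enumerate(ratios):
        cumulative += r
        if k == len(ratios) - 1:
            end = len(ids)  # last split takes the remainder
        else:
            end = round(len(ids) * cumulative / total)
        splits.append(ids[start:end])
        start = end
    return splits

# Hypothetical identifiers, for illustration only.
ssl_train, ssl_val = split_ids(range(100), (7, 3))
surv_train, surv_val, surv_test = split_ids(range(100), (7, 1.5, 1.5))
```

Every identifier lands in exactly one partition, so no study is shared between training and evaluation within a given split.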

R2.4 Comment on Overall Training Set Size Re: To clarify, our CMR dataset is significantly larger (1393 studies with 4-chamber views, 70% of which are used for training) than typical SAX-only public sets like ACDC (150) and M&Ms (375).

R3.1 Concern about Computational Burden Re: We appreciate the concern. Our workflow is designed to be feasible on consumer-grade GPUs. Extracting ROIs during preprocessing reduces the input resolution, which eases GPU memory demands. For reference, the computation times are as follows: 1. SSL -> avg 2.12 min/epoch (train+val), 2.37 min (infer); and 2. survival analysis -> avg 3.02 s (train+val), 0.003 s (infer).

R3.2 Justification for ROI Extraction Method Re: Regarding simpler methods like motion differencing, our trials (e.g., K-means) showed this approach often yielded noisy/poorly-centered ROIs as motion is diffuse, not confined to the myocardium. In contrast, Farneback optical flow localized the myocardium better and took 1.38 s per patient, which we consider to be a practical duration.
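
For context, the simpler frame-differencing baseline discussed in R3.2 can be sketched in a few lines of NumPy. This is an illustration of the baseline idea only, not the authors' preprocessing; the threshold fraction and function names are assumptions. A dense optical-flow magnitude map (such as one computed with Farneback's method) could be passed to `roi_from_motion` in place of the differencing energy.

```python
import numpy as np

def motion_energy(frames):
    """Sum of absolute differences between consecutive frames.

    frames : (T, H, W) cine sequence. Returns an (H, W) motion map.
    """
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

def roi_from_motion(energy, fraction=0.5):
    """Bounding box of pixels whose motion exceeds fraction * max.

    Returns (row_start, row_end, col_start, col_end), end-exclusive.
    """
    peak = energy.max()
    if peak <= 0:
        raise ValueError("no motion detected")
    mask = energy > fraction * peak
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(rows[-1]) + 1, int(cols[0]), int(cols[-1]) + 1
```

On real cine data, motion outside the heart (for example from breathing or flow artifacts) also contributes to this map, which is consistent with the rebuttal's remark that differencing tends to give noisy or poorly centered ROIs.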

We hope our responses have clarified the reviewers’ concerns. We thank all reviewers and the AC and are prepared to revise the manuscript accordingly upon acceptance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


