Abstract

Patients with valvular heart disease often exhibit distinctive motion characteristics, such as artery movements, as well as anatomical characteristics; extracting dynamic features from coronary angiography (CAG) is therefore of great significance for diagnosis. Given the challenge of limited annotated medical imaging data, we propose a novel self-supervised learning framework that integrates masked video modeling (MVM) and video contrastive learning, enabling the model to learn representations with both strong instance discriminability between video segments and local perceptibility between neighboring frames. Specifically, our framework consists of three key components: an off-the-shelf frozen encoder, an online encoder-decoder following the MVM pipeline, and a momentum encoder composed of an exponential moving average of previous students. We enhance the integration of contrastive learning and MVM in two main ways: the frozen encoder converts the supervision of masked reconstruction from low-level pixels to high-level features, and an augmentation strategy called frame shifting is introduced specifically for video contrastive learning. To validate the effectiveness of our proposed method, we first conducted self-supervised pre-training on over 50,000 self-collected, unlabeled CAG sequences. Subsequently, we performed supervised fine-tuning using two small-scale labeled CAG diagnostic datasets, achieving state-of-the-art performance (98.1% and 75.0% F1-score, respectively) in both supervised and self-supervised video recognition domains. Our code is publicly available at: https://github.com/ZmingShao/ConMVM.
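The abstract describes the momentum encoder as an exponential moving average (EMA) of previous students. The update rule behind that phrase can be sketched as follows; the function name, momentum value, and flat parameter dictionaries are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def ema_update(teacher, student, momentum=0.999):
    """Blend each teacher parameter toward the student: an EMA of past students."""
    return {name: momentum * teacher[name] + (1.0 - momentum) * student[name]
            for name in teacher}

# Toy parameter dictionaries standing in for encoder weights.
student = {"w": np.ones(3)}
teacher = {"w": np.zeros(3)}
teacher = ema_update(teacher, student, momentum=0.9)
print(teacher["w"])  # each entry is 0.9 * 0.0 + 0.1 * 1.0 = 0.1
```

With a momentum close to 1, the teacher changes slowly and provides stable targets, which is the usual rationale for this design in momentum-encoder frameworks.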

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3504_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/ZmingShao/ConMVM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ShaZhi_Contrastive_MICCAI2025,
        author = { Shao, Zhiming and Zhang, Yingqian and Wei, Zechen and Ge, Yong and Wang, Chen and Ding, Guodong and Gao, Lei and Zhang, Liwei and Chen, Yundai and Tian, Jie and Hui, Hui},
        title = { { Contrastive Masked Video Modeling for Coronary Angiography Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {129 -- 139}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a novel self-supervised learning (SSL) framework that effectively combines masked video modeling (MVM) and video contrastive learning. The framework consists of three main components: a frozen encoder, an online encoder-decoder, and a momentum encoder. The frozen pre-trained encoder enables the MVM reconstruction task to move from the traditional low-level pixel space to the high-level latent feature space. Additionally, this choice allows the decoder’s output to be directly used for both reconstruction and contrastive loss computation. To address the limitations of standard augmentations in the masked setting (where operations like cropping or flipping can be damaging when a high proportion of masking is used), the authors introduce a novel weaker yet effective form of data augmentation called frame shifting. This tailored augmentation leverages the temporal dimension inherent in video data, generating two equally transformed views by shifting the frame sequence, which can serve as valid inputs for the contrastive loss. The entire framework is first pre-trained on large-scale unlabeled coronary angiography (CAG) videos, aiming to build a video foundation model. The pre-trained model is then fine-tuned on two specific diagnostic tasks involving the classification of severe mitral regurgitation (MR) and severe aortic stenosis (AS) using relatively small labeled datasets. The reported performance improves over state-of-the-art (SOTA) methods for both MR and AS classification. Ablation experiments are presented to isolate the contributions of each component (e.g., frame shifting augmentation, the choice of contrastive loss functions, and the use of a frozen encoder). Quantitative results show improvements in F1 scores on MR detection when compared to the baseline VideoMAEv2 model, suggesting that the proposed integration yields a more discriminative and coherent feature representation.
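    The reviewer's description of frame shifting (two equally transformed views produced by shifting the frame sequence) admits a simple reading that can be sketched as follows. This is a hypothetical illustration only; the function name, the overlapping-window scheme, and the shift size are assumptions, not the authors' implementation.

```python
import numpy as np

def frame_shift_views(video, shift):
    """Return two overlapping views of a clip, offset by `shift` frames.

    video: array of shape (T, H, W). View A covers frames [0, T - shift),
    view B covers frames [shift, T); both keep temporal order intact.
    """
    assert 0 < shift < video.shape[0]
    return video[:-shift], video[shift:]

clip = np.arange(8).reshape(8, 1, 1)  # 8 dummy single-pixel frames
view_a, view_b = frame_shift_views(clip, shift=2)
# view_a holds frames 0..5, view_b holds frames 2..7
```

    Because each view is an unmodified contiguous sub-sequence, spatial content is untouched; only the temporal window differs, which is why the reviewer characterizes it as a weaker augmentation than cropping or flipping.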

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed framework is technically well-posed and offers a significant contribution within the medical domain, where limited annotated data is a key issue. The use of large-scale unlabeled data for pre-training mitigates the common bottleneck of annotated medical imaging/video data, making the approach attractive for real-world applications. The integration of the main paradigms of SSL is thoroughly addressed, and the adaptation to video data is supported by concrete design decisions, both of which reflect an understanding of the specific challenges to address. The idea of performing reconstruction in the feature space, made possible through the use of a frozen encoder, adds conceptual and practical value by allowing the learning objective to target more semantically rich representations rather than low-level pixel information, and the decoder’s output to be directly used for the computation of both loss functions. Moreover, the introduction of a tailored video augmentation that leverages the temporal nature of video data addresses a key limitation in applying standard contrastive learning techniques to masked video inputs, where aggressive spatial augmentations can impair the learning process. The paper situates itself well within the current literature on self-supervised learning and video modeling, referencing contemporary methods and building upon them with clear modifications. The paper’s clinical focus further underscores its practical significance. By centering the analysis on CAG data and validating on two clinical tasks, the research maintains a strong foundation in practical utility. The model outperforms SOTA methods, suggesting that the approach is not only theoretically well-conceived but also capable of delivering meaningful gains in practice.
The ablation studies provide further credibility to the approach, as they effectively illustrate the contribution of each module: the effect of frame shifting, the importance of the two learning paradigms, the choice of contrastive loss, and the dimension of the pre-trained frozen encoder.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    While the integration of MVM and contrastive learning is central to the contribution, other studies (such as VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning - https://doi.org/10.48550/arXiv.2106.11250, and Multi-view Masked Contrastive Representation Learning for Endoscopic Video Analysis - https://proceedings.neurips.cc/paper_files/paper/2024/file/55cb562b1f5af71f6707f3ff3c7941e6-Paper-Conference.pdf) have explored similar combinations of these paradigms. Although the proposed implementation introduces novel elements, particularly in how it leverages video-specific augmentations and operates in the latent space, the paper would benefit from clarification on how the proposed integration offers advantages over these existing methods. Another limitation lies in the scope of the evaluation, which is restricted to two downstream diagnostic tasks based on relatively small labeled datasets. While the results are promising, the clinical robustness and generalizability of the approach remain uncertain. Consider adding experiments or discussions on how the method performs under varying video conditions or on external datasets. This would provide a clearer picture of its clinical robustness. Additionally, the choice of valvular disease classification as a downstream task could be questioned, given that CAG is not a primary modality for valvular assessment. Although the methodological contribution remains valid regardless of the specific clinical context, applying the framework to more standard angiographic tasks (such as identifying stenosis or calcification segments) could enhance its relevance and impact from a clinical perspective. Lastly, the paper would benefit from a discussion of potential limitations, failure cases, or biases in the model’s behavior.
Understanding under what circumstances the model might underperform (e.g., due to data heterogeneity, noise, or domain shifts) is essential for a comprehensive evaluation, especially in a clinical context where the consequences of errors can be significant.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The paper would be strengthened by a more in-depth discussion of the role and implications of using a frozen encoder. For instance, consider exploring whether fine-tuning the encoder could further enhance performance or generalizability. More rigorous statistical analysis or confidence intervals would help assess the true impact of these design choices. Additionally, the complex architecture involving different components raises concerns about computational complexity. The paper could be extended by adding an in-depth analysis of the computational cost, inference speed, or memory requirements. It would also be useful to include a summary table of hyperparameters and training details, particularly whether SOTA models were initialized from scratch or fine-tuned. The paper currently references a work that does not involve masked modeling ([14]): consider citing Masked Contrastive Representation Learning for Self-supervised Visual Pre-training (doi: 10.1109/DSAA61799.2024.10722789) instead. Finally, there are some minor grammar/typo errors:

    • in the abstract “a augmentation” should be corrected into “an augmentation”;
    • both in the introduction and in the experiments, “vessel” should be corrected to “vessels”, since the heart has more than one epicardial vessel;
    • please revise the sentence “and consequently the latter generally performs better, aligning with practical experience” in Section 4.2, as it is not straightforward
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a well-posed technical approach, along with a novel augmentation strategy, and includes empirical results that support its claims. The introduction of frame shifting as a video-specific augmentation and the shift from pixel to feature-level reconstruction led to the integration of the two main paradigms of SSL. These choices result in measurable improvements over strong baselines and indicate that the method is capable of learning more informative video representations from unlabeled data. The method demonstrates to overcome SOTA performance on two classification tasks, with comprehensive ablation studies that support the proposed design choices. However, the novelty related to existing literature needs to be more clearly established, and the limited scope of evaluation constrains the generalizability and clinical significance of the findings. A broader external validation, along with a deeper discussion of the model’s limitations and applicability, would significantly enhance the paper’s impact. A score of 4 (weak accept) reflects the potential of the approach along with the need for a more comprehensive validation and clearer delineation of contributions compared to existing literature.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the manuscript lies in the design and validation of a novel self-supervised learning framework for video-based coronary angiography (CAG) analysis. The authors propose a method that integrates Masked Video Modeling (MVM) with video contrastive learning, specifically tailored for the unique characteristics of CAG sequences, which are dynamic and structurally complex but often lack abundant labeled data.

    The proposed framework introduces several key innovations. First, it leverages a frozen, off-the-shelf encoder to transform the supervision signal of masked reconstruction from low-level pixel space to high-level feature space, enabling masked feature modeling rather than traditional pixel-wise reconstruction. This helps bridge the gap between MVM and contrastive learning, which otherwise rely on incompatible data representations. Second, the authors propose a new data augmentation strategy called frame shifting, which is designed to create augmented video views in a way that is compatible with the partial observability of masked inputs. This avoids the excessive distortion caused by strong spatial augmentations, which can undermine contrastive objectives. The method is pretrained on a large-scale, self-collected dataset of nearly 48,000 unlabeled CAG video sequences, and fine-tuned on two downstream diagnostic tasks: detection of severe mitral regurgitation and severe aortic stenosis. On both tasks, the proposed model outperforms state-of-the-art supervised and self-supervised video recognition methods, indicating its strong potential as a domain-specific model for cardiovascular diagnostics.
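    The two objectives described above, a feature-space reconstruction term supervised by a frozen encoder's outputs plus a contrastive term, can be illustrated with a minimal numpy sketch. The MSE formulation, the InfoNCE form, and the temperature value are assumptions chosen for illustration; the paper's actual loss functions may differ in detail.

```python
import numpy as np

def feature_reconstruction_loss(pred, target):
    """MSE between decoder outputs and frozen-encoder features for masked tokens."""
    return float(np.mean((pred - target) ** 2))

def info_nce(q, k, temperature=0.1):
    """InfoNCE over a batch: row i of q and row i of k form the positive pair."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    k = k / np.linalg.norm(k, axis=1, keepdims=True)
    logits = q @ k.T / temperature                       # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives sit on the diagonal

rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))                  # 4 clips, 8-dim features
rec = feature_reconstruction_loss(feats, feats)  # identical tensors -> 0.0
nce = info_nce(feats, feats)                     # matched views -> small loss
```

    Because both terms operate on feature vectors of the same shape, a single decoder output can in principle feed both losses, which is the unification the reviewer highlights.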

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A primary strength of this work is its well-motivated integration of two complementary self-supervised learning paradigms—masked video modeling (MVM) and contrastive learning—specifically tailored to coronary angiography (CAG). While the general idea of combining MVM and contrastive learning has been explored in the computer vision literature (e.g., CMAE, TPAMI 2023), its application to medical video data is novel and timely. One of the methodological contributions is the formulation of masked feature modeling using an off-the-shelf, frozen encoder to provide high-level feature supervision instead of low-level pixel reconstruction. This formulation not only addresses the information disparity caused by masking but also allows the model to unify reconstruction and contrastive objectives in a more semantically meaningful feature space. The paper also introduces a novel data augmentation strategy called “frame shifting”, specifically designed for video contrastive learning in the context of masked inputs. Frame shifting preserves the temporal continuity of the video and introduces mild temporal variation, avoiding the negative impact of strong spatial augmentations, which are common in contrastive learning but detrimental in masked settings. Another strength lies in the large-scale pretraining on a domain-specific dataset of nearly 48,000 unlabeled CAG sequences (though not publicly available), which is a substantial contribution in itself given the general scarcity of well-curated medical video datasets.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    A major weakness is the lack of clinical interpretability and feasibility analysis. While the method is shown to detect severe mitral regurgitation and aortic stenosis with high accuracy, there is no analysis of what features the model uses for its predictions. This omission reduces the potential for clinical adoption and undermines the claim that the model captures physiologically meaningful motion patterns. Additionally, the evaluation does not account for cross-hospital generalization or real-world deployment challenges. All data appear to be collected from a single institution, and the authors do not discuss domain shift, patient demographic variability, or other confounding factors that often impair model performance in clinical practice. The use of internal-only datasets, although understandable due to privacy constraints, limits the generalizability of conclusions.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The primary factors motivating this decision are the practical relevance, solid performance on a single large dataset, and well-executed engineering behind the proposed framework, which collectively represent a valuable contribution to the field of medical video analysis. While the methodological novelty is somewhat limited—largely combining existing techniques from masked video modeling and contrastive learning—the paper presents a thoughtfully adapted solution for a challenging and clinically important domain: coronary angiography.

    The authors effectively identify and address domain-specific constraints, such as the scarcity of labels, the need to model dynamic vascular and anatomical motion, and the incompatibility between strong augmentations and masked inputs. The proposed use of masked feature modeling with a frozen encoder, combined with a video-specific augmentation strategy (frame shifting), is well-motivated and leads to state-of-the-art results on two real-world diagnostic tasks.

    That said, the recommendation is not stronger due to several factors. First, the novelty is primarily in the integration and adaptation of known components, rather than the introduction of fundamentally new algorithms or modeling principles. Second, the evaluation scope is narrow, focusing only on binary classification tasks and lacking external validation or interpretability analysis. The paper would benefit from deeper clinical grounding, such as aligning model outputs with expert visual markers, or exploring generalizability across institutions. Finally, although the improvements in F1 score are non-trivial, they come at the cost of increased model complexity and training infrastructure.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces a novel framework that integrates the complementary learning strategies of contrastive learning and masked feature learning as a self-supervised learning strategy for diagnostic tasks in coronary angiography. The model is trained and evaluated using actual clinical CAG sequences. The authors compare their approach to state-of-the-art methods and conduct ablation studies. The soundness of the proposed approach for CAG is convincingly motivated through a discussion of limitations in related methods and clinical data constraints.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    As a novel contribution, the authors propose frame shifting as a weak augmentation strategy to reduce the disparity between contrastive encoders, to more effectively combine masked feature learning and contrastive learning. The authors conduct an empirical evaluation, comparing their method against multiple supervised and unsupervised baselines. The proposed method consistently outperforms state-of-the-art approaches across all evaluated diagnostic tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    One notable weakness of the study is the lack of an investigation into the robustness of the proposed approach. In particular, no cross-validation is applied, which the reviewer would expect when working with small, self-labeled datasets. Furthermore, the data splitting strategy remains unclear—for instance, it is not specified whether group-based splitting was used to prevent data leakage, which is especially relevant in scenarios with potentially correlated samples.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the methodological innovation is well justified and the comparison to baseline methods is sufficiently comprehensive, the interpretability of the reported results, especially in the context of the ablation experiments, would benefit from a robustness analysis, which is currently absent.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank all reviewers for their thoughtful and constructive feedback. Below are our responses to some of the main issues they raised:

Reviewer #1 Q1: The complex architecture involving different components raises concerns about computational complexity. A1: In fact, our method is computationally efficient, mainly due to the introduction of a distillation-like strategy, which allows us to use a lightweight ViT-S architecture for the online encoder during inference. Q2: The choice of valvular disease classification as a downstream task could be questioned, given that CAG is not a primary modality for valvular assessment. Applying the framework to more standard angiographic tasks (such as identifying stenosis or calcification segments) could enhance its relevance and impact from a clinical perspective. A2: Indeed, CAG is not the imaging modality typically used for valvular assessment; however, our clinical application scenario involves assisting physicians in rapidly assessing valvular conditions during emergency CAG procedures for acute patients. Since current CAG diagnostic tasks that leverage temporal features are quite limited, we have so far selected only two downstream tasks for evaluation. In future work, we plan to include a wider variety of downstream diagnostic tasks, such as neoatherosclerosis classification, stenosis identification, and calcification segmentation. Q3: It would also be useful to include a summary table of hyperparameters and training details, particularly whether SOTA models were initialized from scratch or fine-tuned. A3: Due to the page limit of the conference, we were unable to provide all configuration details in the paper; many of them are available in our open-source code repository. As for the weight initialization of the SOTA methods mentioned, the self-supervised methods are first pretrained on our unlabeled dataset and then fine-tuned on each downstream task, while the supervised methods are trained directly on the downstream tasks using the pretrained weights released by the original papers for initialization.

Reviewer #2 Q1: The evaluation does not account for cross-hospital generalization or real-world deployment challenges. All data appear to be collected from a single institution. A1: We are currently working on improving the evaluation dataset, including collecting labeled data for specific diagnostic tasks from other hospitals to serve as external validation. It is also worth noting that our pretraining data was collected from multiple medical institutions, theoretically endowing the model with cross-institutional generalizability, which will be fully evaluated in future work.

Reviewer #3 Q1: The data splitting strategy remains unclear—for instance, it is not specified whether group-based splitting was used to prevent data leakage, which is especially relevant in scenarios with potentially correlated samples. A1: In this study, we use a fixed seed to randomly split the data into training and validation sets; there is no risk of information leakage, since each sample corresponds to a unique patient after preprocessing.
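As an illustration of the group-based (patient-level) splitting Reviewer #3 asks about, the following is a minimal sketch; the helper name, sample layout, and validation fraction are hypothetical and do not describe the study's actual pipeline.

```python
import random

def patient_level_split(samples, patient_of, val_frac=0.2, seed=42):
    """Split so that all clips from one patient land on the same side."""
    patients = sorted({patient_of(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_val = max(1, int(len(patients) * val_frac))
    val_patients = set(patients[:n_val])
    train = [s for s in samples if patient_of(s) not in val_patients]
    val = [s for s in samples if patient_of(s) in val_patients]
    return train, val

# 5 hypothetical patients, 2 clips each.
samples = [(f"clip{i}", f"patient{i // 2}") for i in range(10)]
train, val = patient_level_split(samples, patient_of=lambda s: s[1])
# one patient (both of its clips) goes to validation; train/val patients are disjoint
```

When each sample already corresponds to a unique patient, as the authors state, a plain random split and a group-based split coincide; the distinction only matters when one patient contributes multiple correlated clips.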

In addition, we also find several other comments from the reviewers to be valuable, and we will consider these in future revisions for this conference or for follow-up work. Reviewer #1 mentioned several related works with similar motivation, which deserve further comparison and analysis. Reviewer #1 also suggested further exploration of the frozen encoder as well as a discussion on the model’s limitations, and kindly pointed out several minor issues. Reviewer #2 raised the important issue of limited model interpretability. Reviewer #3 highlighted the necessity of incorporating cross-validation.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


