Abstract

High-risk plaque (HRP) detected by coronary CT angiography (CTA) is associated with increased risks of major adverse cardiovascular events such as heart attack. Current identification of HRP characteristics involves labor-intensive segmentation of plaques, requiring substantial time and expert knowledge. In this work, we propose a novel coronary cross-sectional Vision Transformer (ViT) framework that bypasses the need for explicit segmentation by directly predicting the presence of HRP. Our approach extracts cross-sectional slices along the coronary centerline, ensuring that the model focuses on the artery. By leveraging the standard patch-based input of ViT, we capture not only the coronary cross-section itself but also surrounding contextual information (e.g., adipose tissue). Furthermore, we incorporate multiple levels of detail by combining the cross-sections from proximal and distal positions with their corresponding CTA axial planes, forming a comprehensive cross-sectional representation. We also embedded the actual 3D position of each cross-section into the positional encoding of the Transformer to enhance spatial awareness. Experimental results of 3,068 coronary arteries demonstrate that our method outperforms conventional approaches, highlighting its potential to optimize clinical decision-making in the care of coronary artery diseases.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2068_paper.pdf

SharedIt Link: https://rdcu.be/eHdUr

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04978-0_65

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/JZCambridge/ViTAL-CT-MICCAI25

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LeAnj_ViTALCT_MICCAI2025,
        author = { Le, Anjie AND Zheng, Jin AND Gong, Tan AND Sun, Quanlin AND Weir-McCall, Jonathan AND O’Regan, Declan P. AND Williams, Michelle C. AND Newby, David E. AND Rudd, James H. F. AND Huang, Yuan},
        title = { { ViTAL-CT: Vision Transformers for High-Risk Plaque Classification in Coronary CTA } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {681 -- 690}
}

Reviews

Review #1

Please describe the contribution of the paper

The paper proposes an end-to-end approach for classifying high-risk plaques by analyzing cross-sectional slices along the coronary centerline of CTA with Vision Transformers, removing the need to segment the plaque or vessel wall. The model uses a multi-scale input that combines adjacent frames, axial context and the location of the slice.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The main strength of the study lies in incorporating different scales of features from the images, embedding both global and local representations. The ablation study nicely showed the effectiveness of each module.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The details of the three input streams are not sufficiently described, which may hinder other researchers from reproducing the work.
2. Three streams are unified into a 3-channel input that is fed into the ViT. However, the three channels appear to represent features at different levels, including high-resolution cross-sectional images and low-resolution latent features extracted from ConvNeXt and U-Net. This raises concerns about feature compatibility and how effectively the model handles the resolution mismatch during fusion. Further clarification on the fusion strategy and any preprocessing steps to align feature scales would be helpful.
3. The title “Heart Attack Prediction” is not appropriate, as the proposed model is designed for classifying high-risk plaques, which does not directly indicate the occurrence of a heart attack. While high-risk plaques are considered risk factors for adverse cardiac events, they are not equivalent to heart attacks. Therefore, the title is misleading.
4. The comparison study included only a limited set of models, essentially ViT and ResNet. Other SOTA classification models were not included in the comparison.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. The details of the three input streams need to be explained. According to Figure 1 (a), the 9 cross-sectional images were arranged into a 3×3 grid layout. The concern is that each cross-section has a relatively small size, which hinders the detailed representation of the lesion. What is the resolution of the axial plane input? Did it cover the whole scan or only the region of interest indicated by the red box?
2. At what level was the data split into three sets? Patient, vessel or cross-sectional level?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The topic is of clinical importance, the study is novel in incorporating different scales of features, and a clear ablation study is presented. However, the description of the implementation of the multi-stream input needs improvement. The comparison study is weak.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

The authors propose a coronary cross-sectional Vision Transformer framework for the prediction of high-risk plaque features in coronary CT angiography (CCTA). The model integrates cross-sectional views extracted from proximal and distal vessel segments with their corresponding axial CTA slices to enhance contextual understanding and feature representation.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The automated prediction of high-risk plaque (HRP) features represents a valuable advancement for supporting clinical research and improving risk stratification in coronary artery disease.
- The use of the SCOT-HEART dataset, comprising 3,068 annotated coronary arteries, provides a robust foundation for training and evaluating the proposed model, offering both scale and clinical relevance.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The title should be revised to more accurately reflect the model’s objective. Since the model predicts high-risk plaque (HRP) features rather than clinical events, references to “heart attack” are misleading.
- While the Vision Transformer (ViT) architecture is motivated by its ability to capture global context, the application to HRP detection warrants further justification. HRP features are typically small, localized, and often confined to just a few slices. It remains unclear why incorporating information from distal vessel segments would enhance detection in proximal segments. Figure 2, for example, shows that HRP is localized to only 2–3 slices, and the added value of distal context is not evident. A stronger justification or supporting analysis is needed to validate this architectural choice.
- Evaluation should be conducted separately for each HRP feature. The current presentation lacks information on class imbalance across HRP subtypes. For example, features such as spotty calcifications may be more prevalent and easier to detect than rare features like the napkin-ring sign. It is important to demonstrate that the model performs reliably across all clinically relevant HRP features, including the rarer and more subtle ones.
- The training process for the baseline ResNet2D model is unclear. Was the model trained on a per-slice basis? Clarifying the input configuration and training strategy is essential for reproducibility and fair comparison.
- A key limitation of the proposed method is the ambiguity in what the model is actually learning. It is not clear whether the model is detecting local HRP features directly, or merely learning to infer their presence based on surrogate markers such as total plaque burden or compositional features. A more detailed interpretability analysis would be necessary to assess whether the model truly captures the intended fine-grained pathology.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- It would be great to provide a comparison against a strong convolutional baseline, such as a pure ConvNeXt model, to better assess the added value of the transformer-based approach.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- The primary limitation of the study lies in the lack of performance evaluation per individual HRP feature class. Given the heterogeneity and varying prevalence of HRP features, such analysis is crucial to assess the robustness and clinical utility of the model. Furthermore, a direct comparison against a baseline model such as a pure ConvNeXt architecture would be essential to substantiate claims of architectural superiority and to determine whether the proposed Vision Transformer-based design offers a meaningful performance advantage over established convolutional approaches.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have addressed all previously raised concerns. Some questions remain regarding the evaluation per HRP class and its reported superior performance compared to other methods. Nonetheless, I believe these points merit discussion during the conference.

Review #3

Please describe the contribution of the paper

The authors propose an approach for classification of high-risk plaque (HRP) in coronary arteries using CT angiography (CTA). The method uses a Vision Transformer (ViT) that analyses centerline cross-sections together with axial-plane processing, fused into a 3-channel input. The method is validated on a dataset of 3,068 coronary arteries.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Major strengths of the paper:
1. The paper introduces a novel approach to fuse features from different views and merge them into a 3-channel input. The novelty of the method also lies in the way these features are combined with positional embeddings within the ViT architecture.
2. The method is tested on a large dataset of coronary arteries.
3. Ablation studies were performed to prove the added value of each module.
4. References reflect the latest research in the field and adequate comparison with literature is provided.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper does not have any major weaknesses, only minor ones. Some clarifications would help in understanding the method in more detail:
1. It may not be clear to the reader why only 4 patients were included in the testing, when 400 manually segmented arteries were available (320 used for training and 80 for validation). Why, then, was such a small number of arteries used for testing the U-Net module?
2. It may not be clear why a Dice score of 0.72 was adopted for the feature extraction process. Were other scores tested, and how was this threshold determined?
3. It is not clear whether cross-validation was performed on the dataset.
4. It is not clear in the section of comparison with literature results if the methods compared are all tested on the same dataset and same patients.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

The paper would benefit from clarifying the method and including numerical results in the abstract, to draw readers' attention to the approach.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The recommendation is based on the novelty of the proposed methods (a unique approach organised in an interesting way to handle CT images), combined with extensive experiments and results surpassing the state of the art.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We thank all reviewers for their consensus on technical soundness and clinical significance of our work. Our responses:

[R3] Input, Alignment & Fusion (R3-q1, o1) The three streams are stated in Secs 2–3 & Fig 1. Cross-sectional: 0.5 mm resampled sections along the centerline at native spacing (Sec 3.1, Fig 1a). Multi-slice: 9 neighbours (4 proximal + 4 distal) add longitudinal context (Sec 2, Fig 1a). Global: the full axial slice passes through a U-Net (Sec 2.2, Fig 1b). Further clarification will be added on the location matching for each patch. More details can be found in our released code. (R3-o1) Our prior work (>1200 arteries) found the median coronary diameter to be 3.1 mm (2.3–3.9). With 0.35 mm spacing, a ViT patch 16x16 token (5.6 mm) covers the lumen plus a one-diameter peri-vascular margin. (R3-q2) For feature compatibility, we applied spatial alignment by extracting features from the corresponding axial slice (U-Net) and adjacent cross-sections (CTCB) centered at the same cross-section location. We performed channel-wise normalization (batchnorm 2D). The three aligned and normalized outputs are concatenated into a 3-channel input for ViT.
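The patch-size arithmetic and the normalize-then-stack fusion described in this response can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function names and toy 2×2 arrays are hypothetical, and the per-stream standardization is a simplified pure-Python stand-in for the "batchnorm 2D" step (a framework batch-norm layer would normally also carry learnable scale/shift parameters and running statistics, which are omitted here).

```python
import math

def patch_footprint_mm(patch_px, spacing_mm):
    """Physical side length (mm) covered by a square ViT patch."""
    return patch_px * spacing_mm

def standardize(stream, eps=1e-5):
    """Scale one 2D stream (list of rows) to zero mean and unit variance."""
    values = [v for row in stream for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    scale = math.sqrt(var + eps)
    return [[(v - mean) / scale for v in row] for row in stream]

def fuse_streams(cross_section, multi_slice, axial_context):
    """Normalize three equally sized streams and stack them as [3][H][W]."""
    streams = [cross_section, multi_slice, axial_context]
    sizes = {(len(s), len(s[0])) for s in streams}
    if len(sizes) != 1:
        raise ValueError("all three streams must share one spatial size")
    return [standardize(s) for s in streams]

# Figures quoted in the rebuttal: 16-pixel patch, 0.35 mm spacing,
# 3.1 mm median coronary diameter.
side_mm = patch_footprint_mm(16, 0.35)   # about 5.6 mm
margin_mm = (side_mm - 3.1) / 2          # about 1.25 mm on each side

# Toy streams with very different value ranges (hypothetical data).
fused = fuse_streams(
    [[0.0, 1.0], [2.0, 3.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[5.0, 5.0], [5.0, 6.0]],
)
```

After standardization each channel has zero mean regardless of its original value range, which is the property the rebuttal relies on when it concatenates image-level and feature-level streams.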

[R2] Global Context Justification (R2-q2) Plaques, such as positive remodeling, soft plaque, and mixed composition, require examining both proximal and distal segments of the artery in clinical practice. Our use of a global/multi-slice context is designed to mimic this reasoning.

[R2] HRP Feature Evaluation (R2-q3) We used HRP collectively based on the actual clinical concern of ref 16 that HRP collectively is associated with an increased risk of MACEs (Sec 1). We noticed the imbalance of subtypes with the extreme case of the napkin-ring sign (~1% arteries, ref 16), so per-subtype metrics could be unstable and potentially misleading. Nevertheless, our work is among the first to propose an end-to-end segmentation-free framework that detects HRP in CTA and serves as a foundation for more granular subtype-specific analysis.
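The instability argument can be made concrete with a rough calculation. The sketch below is not from the paper: it assumes, purely for illustration, that about 20% of the 3,068 arteries end up in the test set (the split is reported at patient level, so the true artery count may differ), applies the ~1% napkin-ring prevalence quoted above, and uses a standard Wilson score interval for a hypothetical recall of 4 detected cases.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

total_arteries = 3068   # from the abstract
test_fraction = 0.20    # assumption: artery share of the test set
prevalence = 0.01       # ~1% napkin-ring sign, as quoted in the rebuttal

expected_positives = round(total_arteries * test_fraction * prevalence)

# Hypothetical outcome: 4 of the expected positives are detected.
low, high = wilson_interval(4, expected_positives)
```

With only a handful of positive test arteries, the interval around any per-subtype recall spans most of the unit range, which is consistent with the authors' caution about reporting such metrics.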

[R2] Model Interpretability (R2-q5) Ref 16 reported HRP in 40% stenotic patients, a performance much lower than our AUC 0.82, precision 0.74 & recall 0.77. Given a high correlation between stenosis and total plaque burden, our model must implicitly encode plaque composition rather than rely on simple surrogate metrics. However, we acknowledge the value of future work explicitly investigating how the learned features relate to HRP.

[R2, R3] Design Justification & Model Comparison (R2-o1, R3-q4) We would like to emphasise that our ViTAL-CT provides multiple advancements for HRP detection rather than just comparing ViT with CNN. As recognised by R3, ablations confirm each module’s gain for HRP classification. Moreover, provided a ViT token matches a coronary cross-section, we believe ViT is more suitable for handling semantic discontinuities at patch boundaries, unlike CNN or Swin-ViT. Finally, ResNets are clinical SOTA models in coronary CT studies (Mu et al., Radiology (2022), Tan et al., European Heart Journal (2023)), partly because of moderate dataset sizes; we therefore report them as a strong baseline.

[R1] U-Net Usage & Dice (R1-q1,q2) U-Net supplies axial-plane context features only. 400 arteries were split 320/80 for train/val (10 slices each artery). A separate 4 arteries (>400 slices) produced Dice 0.72 to select the bottleneck epoch. Dice is not a classification threshold. We will clarify this in Secs 2.2 & 3.

[R2, R3] Title (R2-q1, R3-q3) We will retitle: “ViTAL-CT: Vision Transformers for High-Risk Plaque Classification in Coronary CTA.”

[R1, R3] Training Details (R1-q3, R3-o2) We adopt a fixed 60/20/20 patient-level split with stratification and no cross-validation, ensuring identical train/val/test cohorts across methods. (R1-q4) Consequently, every baseline is evaluated on the exact same patient set.

[R2] Method Reference (R2-q4) “ResNet50 2D” (Table 1) is ViTAL-CT with the ViT module replaced by a 2D ResNet50; all others remain unchanged.
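A fixed 60/20/20 patient-level stratified split of the kind described under Training Details might look like the sketch below. It is an assumption-laden illustration rather than the authors' code: the rebuttal does not say which variable is stratified on, so a binary per-patient label is assumed, and the function names, patient IDs and artery names are hypothetical toy values.

```python
import random
from collections import defaultdict

def patient_level_split(patient_labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split patient IDs into train/val/test, stratified by a per-patient label.

    patient_labels maps patient_id -> label (assumed binary here).
    A fixed seed keeps the cohorts identical across all compared methods.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for pid, label in patient_labels.items():
        by_label[label].append(pid)

    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        pids = sorted(by_label[label])
        rng.shuffle(pids)
        n_train = round(fractions[0] * len(pids))
        n_val = round(fractions[1] * len(pids))
        splits["train"] += pids[:n_train]
        splits["val"] += pids[n_train:n_train + n_val]
        splits["test"] += pids[n_train + n_val:]
    return {name: sorted(pids) for name, pids in splits.items()}

def assign_arteries(artery_to_patient, splits):
    """Give every artery the split of its patient, so no patient is shared."""
    lookup = {pid: name for name, pids in splits.items() for pid in pids}
    return {artery: lookup[pid] for artery, pid in artery_to_patient.items()}

# Toy cohort: 100 patients, 30 of them labelled positive (hypothetical).
patients = {f"P{i:03d}": int(i < 30) for i in range(100)}
splits = patient_level_split(patients)

# Two arteries per patient; both follow the patient into the same split.
arteries = {f"{pid}_{v}": pid for pid in patients for v in ("LAD", "RCA")}
artery_split = assign_arteries(arteries, splits)
```

Splitting on patients first and only then mapping arteries to splits is what prevents arteries from one patient appearing in both training and test data, which is the concern behind the reviewer's question about the split level.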

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

The authors propose a Vision Transformer-based framework for classifying high-risk plaques in coronary CT angiography, integrating multi-scale cross-sectional and axial views into a unified 3-channel input. Reviewers raised some concerns about clarity in input stream fusion and class-wise performance evaluation, but the authors have addressed key points in the rebuttal. Given the methodological novelty and potential for clinical impact, I recommend acceptance.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

ViTAL-CT: Vision Transformers for High-Risk Plaque Classification in Coronary CTA

Author(s):