Abstract
Accurate coronary artery segmentation is critical for computer-aided diagnosis of coronary artery disease (CAD), yet it remains challenging due to the vessels' small size, complex morphology, and low contrast with surrounding tissues. To address these challenges, we propose a novel segmentation framework that leverages the power of vision foundation models (VFMs) through a parallel encoding architecture. Specifically, a vision transformer (ViT) encoder within the VFM captures global structural features, enhanced by activating the final two ViT blocks and integrating an attention-guided enhancement (AGE) module, while a convolutional neural network (CNN) encoder extracts local details. These complementary features are adaptively fused using a cross-branch variational fusion (CVF) module, which models latent distributions and applies variational attention to assign modality-specific weights. Additionally, we introduce an evidential-learning uncertainty refinement (EUR) module, which quantifies uncertainty using evidence theory and refines uncertain regions through multi-scale feature aggregation and attention mechanisms, further enhancing segmentation accuracy. Extensive evaluations on one in-house and two public datasets demonstrate that the proposed framework significantly outperforms state-of-the-art methods, achieving superior coronary artery segmentation accuracy and strong generalization across multiple datasets. The code is available at https://github.com/d1c2x3/CAseg.
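To make the fusion idea concrete, below is a minimal PyTorch sketch of a cross-branch variational fusion in the spirit the abstract describes: each branch's features are mapped to a latent Gaussian, reparameterized samples drive an attention head, and the resulting weights fuse the ViT and CNN features. Every layer choice and name here is an illustrative assumption, not the paper's exact CVF formulation.

```python
import torch
import torch.nn as nn

class CVFSketch(nn.Module):
    """Hypothetical cross-branch variational fusion (illustration only)."""

    def __init__(self, channels: int = 256):
        super().__init__()
        # Latent Gaussian heads for each branch (3D feature maps assumed)
        self.mu_vit = nn.Conv3d(channels, channels, 1)
        self.logvar_vit = nn.Conv3d(channels, channels, 1)
        self.mu_cnn = nn.Conv3d(channels, channels, 1)
        self.logvar_cnn = nn.Conv3d(channels, channels, 1)
        # Variational attention: latent samples -> two branch weights
        self.attn = nn.Sequential(
            nn.Conv3d(2 * channels, 2, 1),
            nn.Softmax(dim=1),  # modality-specific weights sum to 1
        )

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, f_vit, f_cnn):  # both: (B, C, D, H, W)
        z_vit = self.reparameterize(self.mu_vit(f_vit), self.logvar_vit(f_vit))
        z_cnn = self.reparameterize(self.mu_cnn(f_cnn), self.logvar_cnn(f_cnn))
        w = self.attn(torch.cat([z_vit, z_cnn], dim=1))  # (B, 2, D, H, W)
        return w[:, 0:1] * f_vit + w[:, 1:2] * f_cnn
```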
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0742_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{DonCai_Unleashing_MICCAI2025,
author = { Dong, Caixia and Dai, Duwei and Han, Xinyi and Liu, Fan and Yang, Xu and Li, Zongfang and Xu, Songhua},
title = { { Unleashing Vision Foundation Models for Coronary Artery Segmentation: Parallel ViT-CNN Encoding and Variational Fusion } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {650--660}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a coronary artery segmentation framework that leverages a parallel ViT-CNN encoding architecture, incorporating variational feature fusion and evidential-learning-based uncertainty refinement. The authors conduct comprehensive evaluations on one private dataset (CCTA119) and two public datasets (ASOCA and ICAS-100), demonstrating superior performance compared to listed methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The study targets a clinically relevant challenge—robust and accurate coronary artery segmentation—which is essential for the diagnosis and treatment planning of coronary artery disease.
- The experiments are conducted on three datasets, demonstrating the generalization capability of the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In Fig. 2, the encoder is labeled as the Vision Foundation Model (VFM), specifically SAM-Med3D. However, the AGE module is not part of the SAM-Med3D encoder and should not be included within the encoder block. Since SAM-Med3D outputs image embeddings, the AGE module should be clearly separated as a task-specific enhancement component applied post-encoder. Please revise the figure accordingly to avoid confusion.
- In Fig. 2, the AGE module performs attention-based enhancement on the image embeddings from the ViT encoder, followed by a reshape and a residual connection. However, a direct addition would be invalid due to mismatched shapes; the authors should clarify this operation.
- The paper lacks a clear description of the loss functions used during training. Given the use of evidential learning, variational modules, and uncertainty-guided refinement, the loss design is critical for understanding and reproducing the method. Please clarify the full training objective, including main segmentation loss, uncertainty-related terms, and any auxiliary or regularization losses.
- While the ablation study demonstrates the incremental benefits of key components in the proposed framework, it lacks several critical comparisons needed to fully validate the contribution and interplay of individual modules.
- Although the authors aim to mitigate vessel fragmentation—a common challenge in coronary artery segmentation—the paper lacks dedicated evaluation metrics to quantitatively support this claim.
- Although the method is evaluated on public datasets (ASOCA and ICAS-100), some SOTA performance comparisons with existing methods are not provided on these datasets.
- The method involves a complex multi-module design, but the paper lacks any evaluation of training cost, inference efficiency, or model size, which are critical for assessing real-world usability.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
see major weaknesses
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors state that they will provide the missing details, including the loss function and overall framework, and clarify them in the revised version.
Review #2
- Please describe the contribution of the paper
The paper proposes a coronary artery segmentation framework that leverages Vision Foundation Models through a parallel ViT-CNN encoding architecture. A ViT encoder captures global vessel structures, enhanced by an attention-guided enhancement module, while a CNN encoder captures local details. These features are fused using a cross-branch variational fusion module that models latent distributions and applies variational attention. Additionally, an evidential-learning uncertainty refinement module based on evidence theory refines uncertain regions.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The method effectively combines global and local feature extraction with adaptive variational fusion and uncertainty refinement, achieving superior segmentation accuracy and strong generalization across multiple datasets.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although several SAM-related methods are discussed in the related work section, the experimental section does not provide sufficient comparative experiments involving these approaches. Another point of concern is that the proposed method utilizes a foundation model-based encoder, raising questions about whether the compared baseline methods also benefit from such pre-training. If not, the fairness of the comparisons may be compromised and should be carefully clarified.
- The equations lack proper punctuation and do not follow standard academic conventions—e.g., tensors should be bold, variables italicized, and subscripts non-italic. These issues affect clarity and presentation quality.
- In the experimental setup, it is unclear why only a subset of the ImageCAS dataset was used. Was the subset selectively curated? Why was the full dataset not utilized for evaluation? The authors are encouraged to clarify the selection criteria and justify this choice to avoid potential concerns about data bias.
- It would be helpful if the authors could clearly explain how each of the designed modules addresses the specific challenges outlined in the Introduction, such as small vessel structures, under-segmentation, and over-segmentation. A more explicit mapping between the problem statements and the methodological components would strengthen the overall clarity and motivation of the work.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The major factors influencing my score are insufficient experimental validation against relevant baselines, unclear dataset usage, presentation issues in equations, and a lack of explicit connection between the method and the stated challenges.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
My primary concern lies in the persistent ambiguity of the authors' explanation.
None of the baseline methods leverage pretrained models, whereas the proposed module incorporates a feature extractor derived from the pretrained SAM model. This raises the question of whether the performance gains are attributable to the use of SAM features rather than the proposed design itself. Hence, I am concerned about the fairness of the experimental comparison—would other methods also benefit if SAM-based features were integrated similarly?
Regarding the dataset, the authors claim that the ImageCAS subset was randomly selected, which to some extent hinders reproducibility. Additionally, it is unclear why the full ImageCAS dataset was not used for model comparison experiments, yet a subset was chosen for cross-dataset validation. Furthermore, why is it necessary to ensure similar data scales in cross-dataset experiments? The paper does not provide results illustrating transfer performance from ImageCAS to CCTA119, which further limits the justification.
Review #3
- Please describe the contribution of the paper
The authors propose a framework based on the combination of previously published algorithms to benefit from the advantages of parallel encoding and vision foundation models. The highlights of the algorithm include the combination of attention-based and convolutional encoders, which resulted in higher performance compared to similar coronary artery segmentation approaches. In addition, they added evidential-learning uncertainty refinement to refine mis-segmented regions and enhance segmentation accuracy.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper presents a novel approach to medical image segmentation of small structures, providing a framework that benefits from the advantages of vision foundation models while further strengthening the task by combining them with parallel encoders. The approach pairs an attention-based encoder with a convolutional encoder, addressing global and local details respectively, and adds multi-scale feature aggregation for robustness to variability in feature size. Their algorithm significantly outperformed current state-of-the-art methodologies, which adds value to their research.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors do not comment on the limitations of CCTA, which requires a contrast agent, is limited when severe calcification is present, and depends on heart-rate stability. They do not discuss how the algorithm performs in these cases, nor do they address whether there is a differential effect depending on the risk associated with stenosis and plaque build-up in the coronary arteries (which can influence how scans appear on CCTA).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
They provide a link to the code with all the corresponding folders, but I was not able to open it, as it returned "the requested file does not exist". The authors might want to address this.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper provides a new robust methodological approach based on the combination of vision foundation models and parallel encoding, which leverages the potential of vision transformers and convolutional networks for local and global detail retrieval by adding a multi-scale feature aggregation module. As a result, this methodology could be applied in future research addressing vessel segmentation in other tissues and thus provides a novel approach with potential for clinical impact.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Authors addressed all the comments from the reviewers.
Author Feedback
We sincerely thank all reviewers for their constructive feedback and insightful suggestions. Here we address their main concerns:
CCTA Limitations & Reproducibility (R3): (1) We will discuss CCTA’s limitations—contrast dependency, calcification artifacts, and heart-rate sensitivity—in the Discussion. While case-level calcification / stenosis labels are unavailable, we plan a stratified analysis to assess robustness across healthy and diseased cases in follow-up work. (2) Code Accessibility: The anonymized link is now functional, with all necessary files and folders included.
Architectural Clarifications (R4.1, R4.2): (1) AGE Module Placement: Fig. 2(a) will be revised to depict AGE as a post-encoder module, separate from the ViT encoder. (2) Shape Alignment: In Fig. 2(b), ViT outputs embeddings (C×D×H×W, C=384). After attention enhancement, a reshape ensures shape alignment for residual addition. A Conv layer then adjusts channels (384→256) to match CNN features for CVF fusion. Fig. 2(b) will detail these dimensions and flows.
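For clarity, here is a minimal PyTorch sketch of the post-encoder flow the rebuttal describes: flatten to tokens, attention-based enhancement, reshape back for the residual addition, then a 1x1x1 convolution mapping 384 to 256 channels. The attention layout and names are assumptions, not the paper's exact AGE design.

```python
import torch
import torch.nn as nn

class AGESketch(nn.Module):
    """Hypothetical attention-guided enhancement after the ViT encoder."""

    def __init__(self, c_in: int = 384, c_out: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(c_in, heads, batch_first=True)
        self.proj = nn.Conv3d(c_in, c_out, kernel_size=1)  # 384 -> 256

    def forward(self, x):  # x: (B, 384, D, H, W) ViT embedding
        b, c, d, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, D*H*W, C)
        enhanced, _ = self.attn(tokens, tokens, tokens)
        enhanced = enhanced.transpose(1, 2).reshape(b, c, d, h, w)
        x = x + enhanced                               # residual addition
        return self.proj(x)                            # matches CNN channels
```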
Loss Function Details (R4.3): Our training objective integrates an Adaptive Segmentation Loss—dynamically balancing weighted cross-entropy and Dice losses via learnable scalars—and an Evidential Regularization Loss using a Dirichlet-based term to guide uncertainty estimation. Full equations will be added.
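Until the full equations appear, here is a minimal sketch of such an objective, assuming a binary vessel mask, the common log-variance scheme for the learnable balancing scalars, and the standard Dirichlet regularizer from evidential deep learning; none of these specifics are confirmed by the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveSegLoss(nn.Module):
    """Weighted BCE + Dice balanced by learnable scalars (sketch)."""

    def __init__(self, pos_weight: float = 10.0):
        super().__init__()
        self.s_ce = nn.Parameter(torch.zeros(()))    # learnable log-variances
        self.s_dice = nn.Parameter(torch.zeros(()))
        self.register_buffer("pos_weight", torch.tensor(pos_weight))

    @staticmethod
    def dice_loss(probs, target, eps=1e-6):
        inter = (probs * target).sum()
        return 1 - (2 * inter + eps) / (probs.sum() + target.sum() + eps)

    def forward(self, logits, target):  # target: float mask in {0, 1}
        ce = F.binary_cross_entropy_with_logits(
            logits, target, pos_weight=self.pos_weight)
        dice = self.dice_loss(torch.sigmoid(logits), target)
        # exp(-s) * loss + s: the learned s trades off the two terms
        return (torch.exp(-self.s_ce) * ce + self.s_ce
                + torch.exp(-self.s_dice) * dice + self.s_dice)

def evidential_reg(evidence, target_onehot):
    """KL(Dir(alpha~) || Dir(1)): penalizes evidence placed on wrong
    classes (standard evidential-deep-learning regularizer)."""
    alpha = evidence + 1.0
    a = target_onehot + (1 - target_onehot) * alpha  # keep wrong-class evidence
    s = a.sum(dim=1, keepdim=True)
    k = float(a.shape[1])
    kl = (torch.lgamma(s) - torch.lgamma(a).sum(1, keepdim=True)
          - torch.lgamma(torch.tensor(k))
          + ((a - 1) * (torch.digamma(a) - torch.digamma(s))).sum(1, keepdim=True))
    return kl.mean()
```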
Expanded Ablation Study (R4.4): Table 3 already shows incremental improvements: (1) Enhanced-ViT + AGE: +1.25% DSC; (2) CVF vs. Sum Fusion: +1.21%; (3) EUR: +1.19%. These gains will be explicitly highlighted in the revision.
Vessel Connectivity Metrics (R4.5): We will reintroduce clDice (Centerline Dice), originally omitted for space. On CCTA119, our method achieves 93.81% vs. 90.23% for VSNet—quantitatively confirming improved vessel continuity.
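clDice is a published skeleton-based metric (Shit et al.), so it can be reproduced independently of the paper; a minimal NumPy/scikit-image sketch for binary masks:

```python
import numpy as np
from skimage.morphology import skeletonize  # use skeletonize_3d on older scikit-image

def cl_dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Centerline Dice: harmonic mean of topology precision and sensitivity."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    s_pred, s_gt = skeletonize(pred), skeletonize(gt)
    tprec = (s_pred & gt).sum() / (s_pred.sum() + eps)  # pred skeleton inside GT
    tsens = (s_gt & pred).sum() / (s_gt.sum() + eps)    # GT skeleton inside pred
    return 2 * tprec * tsens / (tprec + tsens + eps)
```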
Dataset Coverage & SOTA Comparisons (R4.6): We conducted the same comprehensive comparisons against nine SOTA methods on ASOCA and ICAS-100 as on CCTA119 (Table 1), confirming consistent superiority. For cross-dataset generalization (Table 2), we reported results for four representative models to ensure clarity. This selection strategy will be clarified.
Efficiency (R4.7): (1) Model Complexity: our model has 85M parameters and 224 GFLOPs, comparable to SOTA (e.g., nnFormer: 149M, 250G). (2) Inference Speed: our method processes a full CCTA volume in 20 seconds on an NVIDIA RTX 3090 GPU, whereas manual segmentation by radiologists requires ≥10 minutes per case, highlighting its practical value. Details will be included in follow-up work.
SAM Comparisons & Fairness (R5.1): All baselines (e.g., nnFormer, TransUNet) were trained from scratch without pre-training. Our model uses SAM-Med3D’s ViT encoder only for feature extraction, with its parameters frozen except for the final two blocks. Thus, performance gains derive from architectural innovations—parallel encoding, CVF, and EUR—not pre-training. SAM-related methods were not included in experiments due to task differences: they rely on prompts or interaction in 2D, whereas our setting is fully automatic 3D segmentation. Identical training across three datasets confirms superior generalization.
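The freezing scheme described here is simple to replicate; a sketch, assuming the encoder exposes its transformer blocks as a `blocks` list (as SAM-style ViT encoders typically do):

```python
def freeze_all_but_last_two(encoder):
    """Freeze a pretrained ViT encoder except its final two blocks."""
    for p in encoder.parameters():
        p.requires_grad = False
    for block in encoder.blocks[-2:]:  # `blocks` is an assumed attribute name
        for p in block.parameters():
            p.requires_grad = True
```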
Equation Formatting (R5.2): All equations will adopt standard formatting: bold tensors, italic variables, and plain subscripts.
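For concreteness, notation along these lines would satisfy that convention (the symbols are illustrative, not taken from the paper):

```latex
% bold tensors, italic scalars, roman (non-italic) subscripts
\mathbf{F}_{\mathrm{fused}} =
  w_{\mathrm{vit}} \, \mathbf{F}_{\mathrm{vit}} +
  w_{\mathrm{cnn}} \, \mathbf{F}_{\mathrm{cnn}},
\qquad \mathbf{F} \in \mathbb{R}^{C \times D \times H \times W}.
```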
ImageCAS Subset Justification (R5.3): ICAS-100 was randomly sampled from ImageCAS (1000 cases) to match the scale of CCTA119 (119) and ASOCA (40), enabling balanced cross-dataset evaluation. Full ImageCAS evaluation will be included in follow-up work.
Module-Challenge Mapping (R5.4): We will map each challenge to its module in the Introduction: small vessels / under-segmentation: CNN encoder for fine details + EUR multi-scale fusion to strengthen weak tiny branches; over-segmentation / fragmentation: ViT global context for anatomical continuity + CVF fusion to balance cues and suppress spurious fragments; uncertain / low-contrast regions: EUR evidential refinement.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
My main concern is that this paper does not compare with various existing foundation models for medical image segmentation, and it is not clear whether the benefits come from the use of pre-trained foundation models over other baselines. The authors unfortunately failed to address this critical issue in their rebuttal.