Abstract

Accurate prediction of stroke functional outcome, particularly the 3-month modified Rankin Scale (mRS), is crucial for personalized treatment. Vision Transformers excel in medical imaging and multimodal fusion but struggle with stroke MRI due to data scarcity and rigid tokenization, which may miss subtle anomalies. In response, we propose the Lesion-Centered Vision Transformer (LC-ViT), integrating lesion-focused MRI preprocessing, adaptive token merging, and multimodal fusion. LC-ViT extracts axial, coronal, and sagittal views centered on ischemic lesions to optimize visibility and employs a pretrained TCFormer (token-clustering transformer) for adaptive token generation. A mutual cross-attention mechanism further integrates imaging and clinical data. Evaluated on a retrospective private cohort comprising DWI MRI and 62 clinical variables (e.g., demographics and neurological assessments) of 119 stroke patients treated with thrombectomy (65% favorable outcome), LC-ViT achieves a new state-of-the-art performance (AUC: 0.80 ± 0.03, Accuracy: 0.77 ± 0.02), significantly outperforming single-modality deep architectures. Our results highlight the potential of lesion-focused tokenization for stroke outcome prediction and interpretability, as well as broader applications in lesion-localized multimodal analysis. Our code will be publicly released upon acceptance.
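
The lesion-centered view extraction mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `lesion_centered_views`, the `(z, y, x)` axis order, the square crop size, and the zero-padding at the volume border are all assumptions made for the example.

```python
import numpy as np

def lesion_centered_views(volume, center, size):
    """Crop three orthogonal size x size slices that pass through `center`.

    volume: 3D array, assumed to be indexed (z, y, x).
    center: (z, y, x) voxel coordinates of the (approximate) lesion center.
    size:   side length of each square crop, in voxels.
    """
    half = size // 2
    # Zero-pad every axis so crops near the border never run out of bounds.
    padded = np.pad(volume, half, mode="constant", constant_values=0)
    # Shift the center into the padded coordinate system.
    z, y, x = (int(c) + half for c in center)

    def window(c):
        # Slice of length `size` positioned so that `c` sits at index `half`.
        return slice(c - half, c - half + size)

    axial = padded[z, window(y), window(x)]     # plane at fixed z
    coronal = padded[window(z), y, window(x)]   # plane at fixed y
    sagittal = padded[window(z), window(y), x]  # plane at fixed x
    return axial, coronal, sagittal

# Usage on a synthetic volume with a single bright voxel as the "lesion".
vol = np.zeros((8, 10, 12))
vol[3, 4, 5] = 1.0
axial, coronal, sagittal = lesion_centered_views(vol, (3, 4, 5), size=6)
```

Each returned view places the chosen center voxel at index `(size // 2, size // 2)`, so the lesion stays in the middle of the image regardless of where it lies in the brain.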

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2108_paper.pdf

SharedIt Link: https://rdcu.be/eG4Dl

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05182-0_29

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/mingtian12345/LC-VIT

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiuMin_Lesioncentered_MICCAI2025,
        author = { Liu, Mingtian AND Hatami, Nima AND Mechtouff, Laura AND Cho, Tae-Hee AND Lartizien, Carole AND Frindel, Carole},
        title = { { Lesion-centered vision transformer for stroke outcome prediction from image and clinical data } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {291 -- 301}
}

Reviews

Review #1

Please describe the contribution of the paper
This work proposes LC-ViT, a new model that predicts stroke recovery by focusing on the lesion area in MRI scans and combining it with adaptive token clustering and multimodal fusion. Compared with other SOTA methods, this methodology gives better results and shows how important it is to focus on the lesion for accurate predictions.
- The Lesion-Centered Vision Transformer (LC-ViT) integrates three modules: Lesion-Centered Views, Adaptive Token Merging, and Mutual Cross-Attention.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Lesion-Centered Triamese-ViT Architecture
- TCFormer: Token Clustering Transformer
- Integration of Triamese and TCFormer into a unified framework
- Combining clinical data with image features is a demanding task for stroke outcome prediction for personalized treatment.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Results are based on internal cross-validation, which may not fully reflect performance on unseen data. No held-out data was used for testing.
- Lack of comparison with non-ViT methods for multimodal fusion.
- The ablation study could be more detailed. For example, the number of tokens retained in TCFormer and the effect of different clinical feature combinations remain unexamined.
- While the model architecture demonstrates strong predictive performance, its overall complexity (multiple TCFormer branches, mutual cross-attention, and multimodal fusion) raises questions about scalability and generalizability to smaller institutions with limited computational resources. A discussion on model simplification or lightweight variants would strengthen the practical applicability.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Even though this work involves some novelty, there are concerns given the computational complexity of the given architecture.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper introduces LC-ViT, a multimodal model for the prediction of 90-day functional outcomes for ischemic stroke patients. The extraction of visual information from stroke lesions is carried out with three Token Clustering Transformers that receive axial, sagittal and coronal views of DWI. The processing of clinical variables was done with an MLP. The integration of multimodal information was done using bidirectional cross-attention modules, followed by classification with a simple subnetwork.

The proposed approach was validated using 10-fold cross-validation on a private dataset with 119 cases. The proposed methodological components demonstrated superior performance over other alternatives from the literature, with the exception of the MLP against Linear Regression in the unimodal clinical analysis. From a multimodal perspective, the proposed model demonstrated enhanced performance and explainability.
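
For readers unfamiliar with this evaluation protocol, the sketch below shows one way a cross-validated AUC can be summarized as mean ± standard deviation. It is an illustration only: the helper names (`auc`, `kfold_indices`, `summarize`) are hypothetical, the folds here are contiguous and unstratified, and whether the reported ± values are standard deviations across folds is an assumption, not something stated in the reviews.

```python
from statistics import mean, stdev

def auc(labels, scores):
    """AUC as the probability that a positive case outranks a negative one.

    Ties between a positive and a negative score count as half a win
    (this is the normalized Mann-Whitney U statistic).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def kfold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def summarize(fold_values):
    """Mean and sample standard deviation of per-fold metric values."""
    return mean(fold_values), stdev(fold_values)

# With 119 cases and 10 folds, nine folds hold 12 cases and one holds 11.
folds = kfold_indices(119, 10)
fold_sizes = [len(f) for f in folds]
```

In practice each fold's held-out cases would be scored by a model trained on the other nine folds, `auc` would be computed per fold, and `summarize` would produce the reported pair of numbers. With so few cases per fold, shuffling and stratifying by outcome before splitting is usually advisable.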
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper in general is well-written. Authors clearly state why the functional outcome prediction is important for stroke patients. Furthermore, the components of the proposed approach are clearly explained, and there exists a clear motivation for their incorporation. The results obtained show an important performance boost from using the proposed approach w.r.t. unimodal baselines and other multimodal approaches.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

This work has a big limitation: the method depends on manual annotations of ischemic stroke lesions (which require time from experts, making them costly) to execute the imaging analysis with the TCFormers. This in turn raises the question of whether the comparisons w.r.t. unimodal imaging approaches were fair, because their inputs were centered on the brain and hence may not contain the same discriminative information.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

N/A
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

The paper proposes LC-ViT, a lesion-centered ViT that effectively integrates structured clinical data with 3D MRI. LC-ViT uses adaptive vision transformers for better stroke lesion representation, an adapted TCFormer with dynamic token merging to refine lesion structure modeling, and a unified framework that fuses imaging and clinical data via mutual cross-attention. The authors apply this method to predict the 3-month modified Rankin Scale (mRS).
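
The mutual cross-attention fusion summarized above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes single-head scaled dot-product attention, random projection matrices, no residual connections or normalization, and a mean-pool-then-concatenate fusion step, all chosen only to keep the example short.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head attention where `queries` attend to `context`."""
    q, k, v = queries @ w_q, context @ w_k, context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

def mutual_cross_attention(img_tokens, clin_tokens, params):
    """Run attention in both directions and fuse the pooled outputs."""
    # Image tokens query the clinical tokens ...
    img_out, att_img = cross_attention(img_tokens, clin_tokens, *params["img_to_clin"])
    # ... and clinical tokens query the image tokens.
    clin_out, att_clin = cross_attention(clin_tokens, img_tokens, *params["clin_to_img"])
    # Assumed fusion: average each stream over its tokens, then concatenate.
    fused = np.concatenate([img_out.mean(axis=0), clin_out.mean(axis=0)])
    return fused, att_img, att_clin

# Usage with random stand-ins for 5 image tokens and 3 clinical tokens.
rng = np.random.default_rng(0)
dim = 8
params = {
    name: tuple(rng.normal(size=(dim, dim)) for _ in range(3))
    for name in ("img_to_clin", "clin_to_img")
}
fused, att_img, att_clin = mutual_cross_attention(
    rng.normal(size=(5, dim)), rng.normal(size=(3, dim)), params
)
```

The two attention maps (`att_img` and `att_clin`) are the kind of quantity that can be inspected for interpretability, since each row shows how strongly one modality's token attends to each token of the other modality.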
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper uses a novel formulation with a clinical tabular data encoder based on a simple MLP. This component specifically handles clinical tabular data, and the way it is integrated (likely in conjunction with image features and attention mechanisms) reflects a novel formulation of clinical data within a multimodal model. The architecture of the TCFormer model is also a good approach to handling the MRI input.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

The ischemic stroke dataset is not large (~100 patients). The clinical data, consisting of 62 variables, should be defined and listed somewhere (it cannot be found in the supplementary material). The AUC and accuracy are not that high for clinical feasibility. There is no significant improvement over using the clinical data alone or over the previous work (XTab). The presentation of the results in the table should be more concise.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The idea is good, but there is a lack of clarity in the clinical feature descriptions, low variability in the dataset input, and no significant improvement compared to clinical data alone.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We sincerely thank the reviewers for their thoughtful and constructive feedback. All of the points raised are valuable and will guide both the final version of this paper and our future work.

Limited dataset (R1 & R3) We agree that the current single-center dataset constrains the evaluation of our model’s ability to generalize. We are currently collecting and processing a multi-center dataset of more than 300 patients. In future work, we will report results on this larger, multi-institutional cohort to provide a more comprehensive evaluation.

Computational complexity (R1) Although the pipeline appears complex, it is lightweight and does not require substantial computational resources. Converting 3D MRI volumes into three 2D slices reduces the parameter count compared with 3D models. Image features are extracted with the pretrained TCFormer-Light model, and the tabular branch uses a simple MLP. Overall, the model remains compact. Nevertheless, we will explore additional simplifications in future work.
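
The parameter-count argument can be made concrete with a small calculation. The sketch below counts the weights of one convolution-style layer in 2D and in 3D; the kernel size and channel widths are illustrative assumptions, not values taken from LC-ViT or TCFormer-Light.

```python
def conv_layer_params(kernel, spatial_dims, c_in, c_out):
    """Weights plus biases of a convolution with a cubic or square kernel.

    kernel:       side length of the kernel along each spatial axis.
    spatial_dims: 2 for a 2D layer, 3 for a 3D layer.
    """
    return (kernel ** spatial_dims) * c_in * c_out + c_out

# Hypothetical 3x3(x3) layer mapping 64 channels to 64 channels.
params_2d = conv_layer_params(3, 2, 64, 64)  # 9 * 64 * 64 + 64
params_3d = conv_layer_params(3, 3, 64, 64)  # 27 * 64 * 64 + 64
ratio = params_3d / params_2d
```

For this hypothetical layer the 3D variant needs roughly three times as many weights as the 2D one, and the gap grows with the kernel side length. The same formula covers a patch-embedding layer, which is a convolution whose kernel equals the patch size.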

More detailed ablation study (R1) Due to page limits for the conference paper, we reported only key ablations. We plan to include a more comprehensive ablation study in the extended journal version, including, for instance, the effect of different clinical feature combinations.

Manual lesion annotations (R2) R2 notes that our method currently depends on manual lesion annotation. Importantly, we only require the user to point to the rough location of the lesion center, not to provide a full, precise segmentation mask. In ongoing work, we are integrating automatic segmentation models to obtain lesion masks without manual input.

Description of 62 clinical variables (R3) We apologize for the brief description of the clinical variables. MICCAI rules prevent putting the full list in supplementary material; however, we will include a more detailed description of all variables in the final version, including the number of features per category.

AUC and ACC results (R3) R3 writes that performance is not that high for clinical feasibility and that there is no significant improvement over using the clinical data alone. Three-month stroke outcome prediction is very challenging, even for experts. Our model achieves an AUC of around 0.8, which is on par with the current state of the art, but our goal is to keep increasing this performance, first by scaling to larger datasets (see our comment on the limited dataset). One reason why our model may not significantly outperform models based on clinical data only is that the considered clinical variables include some important image-related features, such as radiological scores, which are derived by radiologists during their diagnosis. We will discuss this point in the final version of the paper.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A

Lesion-centered vision transformer for stroke outcome prediction from image and clinical data

Author(s):