Abstract

For the early diagnosis of Alzheimer’s disease (AD), it is essential to have effective multiclass classification methods that can distinguish subjects with mild cognitive impairment (MCI) from cognitively normal (CN) subjects and AD patients. However, significant overlap of biomarker distributions among these groups makes this a difficult task. In this work, we propose a novel framework for multimodal, multiclass AD diagnosis that integrates information from diverse and complex modalities to resolve ambiguity among the disease groups and hence enhance classification performance. More specifically, our approach integrates T1-weighted MRI, tau PET, fiber orientation distribution (FOD) from diffusion MRI (dMRI), and Montreal Cognitive Assessment (MoCA) scores to classify subjects into AD, MCI, and CN groups. We introduce a Swin-FOD model to extract order-balanced features from FOD and use contrastive learning to align MRI and PET features. These aligned features and MoCA scores are then processed with a Tabular Prior-data Fitted In-context Learning (TabPFN) method, which selects model parameters based on the alignment between the input data and the prior data seen during pre-training, eliminating the need for additional training or fine-tuning. Evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) dataset (n = 1147), our model achieved a diagnostic accuracy of 73.21%, outperforming all comparison models (n = 10). We also performed a Shapley analysis to quantitatively evaluate the contribution of each modality.
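To make the described stages concrete, the sketch below shows one hypothetical way such a pipeline could be wired together. All module names, feature dimensions, and the class encoding are placeholders chosen for illustration; they do not reflect the authors' actual implementation in the linked repository.

```python
import numpy as np

def diagnose(t1_mri, tau_pet, fod, moca_score, encoders, tabpfn):
    """Hypothetical multistage pipeline: CN / MCI / AD from multimodal inputs."""
    # Stage 1: modality-specific feature extraction.
    fod_feat = encoders["swin_fod"](fod)   # order-balanced FOD features
    mri_feat = encoders["mri"](t1_mri)     # contrastively aligned with PET
    pet_feat = encoders["pet"](tau_pet)

    # Stage 2: fuse the aligned imaging features with the cognitive score.
    x = np.concatenate([fod_feat, mri_feat, pet_feat, [moca_score]])

    # Stage 3: in-context prediction with a pre-trained TabPFN-style classifier,
    # without any additional training or fine-tuning at this stage.
    return tabpfn.predict(x[None, :])[0]   # e.g., 0 = CN, 1 = MCI, 2 = AD
```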


Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2048_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/huangshuo343/multimodalAD

Link to the Dataset(s)

ADNI dataset: https://adni.loni.usc.edu/

BibTex

@InProceedings{HuaShu_Multistage_MICCAI2025,
        author = { Huang, Shuo and Zhong, Lujia and Shi, Yonggang},
        title = { { Multistage Alignment and Fusion for Multimodal Multiclass Alzheimer’s Disease Diagnosis } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15974},
        month = {September},
        pages = {379 -- 389}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper aims to improve Alzheimer’s disease classification accuracy by incorporating multimodal data; more specifically, it uses MRI, PET, FOD, and tabular data containing age, sex, and MoCA scores.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The idea of using multimodal information to improve Alzheimer’s disease classification is useful. 2) Incorporating MRI and PET in a contrastive manner is intuitive and useful. Incorporating tabular data, especially cognitive scores like MoCA, is well-motivated.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) I find the writing of the paper complicated. For example, Section 2.3 is too brief. What is t? Is Equation 6 independent of t? Also, Figure 3 and Section 2.3 do not complement each other. 2) In Table 2, when feeding T1, PET, FOD, and Tab, what does it mean to feed an imaging modality to AdaBoost? It is not clear what information is used as input and how. 3) Section 2.2 is largely inspired by ALBEF. Can you state the individual contributions of L_{mpc}, L_{mim}, and L_{mpm}? Also, why 15% random masking? Can you check the impact of masking by varying the percentage? 4) Why is the momentum model essential? Can you state the numbers with and without momentum? Also, how was \alpha = 0.4 selected?

    Minor suggestions: 1) Text and captions should complement the tables/figures. For example, in Table 4 and Section 3.2, it is not stated what “Contr. (not fused)” stands for. 2) In case of space constraints, you may remove MCC. 3) Clearly highlight the contributions of the paper in the introduction. 4) Table 1 might not be essential, and the space could be used to expand Section 2.3.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I like the idea of incorporating multimodal information to improve accuracy; however, I feel that the technical novelty is limited. The paper would benefit from improved writing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I thank the authors for their response and recommend acceptance. However, I suggest the authors carefully revise the manuscript to improve the writing and address the comments and suggestions raised during the review process.



Review #2

  • Please describe the contribution of the paper

    This paper presents a multimodal framework for multiclass Alzheimer’s disease diagnosis, aiming to distinguish between cognitively normal (CN), mild cognitive impairment (MCI), and Alzheimer’s disease (AD) subjects. The proposed pipeline integrates diverse data sources—T1-weighted MRI, tau PET, fiber orientation distribution (FOD) from diffusion MRI, and Montreal Cognitive Assessment (MoCA) scores. The FOD modality is processed through a Swin-FOD model, while contrastive learning is applied to align features from MRI and PET. These are then combined with MoCA scores and passed to a TabPFN classifier, which leverages prior-data fitting without requiring fine-tuning. The model is evaluated on the ADNI dataset (n=1147) and achieves a 73.21% classification accuracy, outperforming 10 comparative models. The authors also employ Shapley value analysis to interpret the modality contributions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Strong performance and relevance: The method shows a substantial improvement over existing models in multiclass AD diagnosis, a clinically important and challenging problem due to overlapping biomarker distributions across stages.
    • Well-designed multimodal pipeline: The combination of advanced representation learning (Swin-FOD, contrastive alignment) with a training-free TabPFN classifier is both novel and efficient. The inclusion of a broad range of modalities, including cognitive scores, enriches the prediction context.
    • Presentation and clarity: The paper is clearly written, with clean and informative figures that support the narrative and technical claims. The methodology and experiments are well-documented, contributing to reproducibility and ease of understanding.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Limited insights: While the performance gains are clear, the manuscript could benefit from deeper insights into why the various components work so well.
    • Limited related work discussion: The related work section could be expanded to better contextualize the method within the broader landscape of AD classification, especially with respect to multimodal or contrastive learning-based approaches.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The proposed approach is well-thought and demonstrates clear improvements in a relevant clinical task. A few suggestions for refinement and possible directions for future work include:

    • the inclusion of additional insight or ablation analysis to better understand the role of contrastive alignment and how it resolves modality ambiguity.
    • expanding the discussion of related work to include more recent or similar efforts using multimodal or contrastive methods for AD diagnosis.
    • proofreading the manuscript to correct minor typographical errors.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces a robust and well-motivated multimodal approach for Alzheimer’s disease diagnosis. The proposed method demonstrates significant performance improvements on a large public dataset, and its modular design makes it appealing for future extension. While some interpretability aspects and related work discussion could be deepened, the work is technically sound and well-presented.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The decision remains unchanged after the rebuttal, as this paper is well thought out and novel, and has room for improvement that the authors mentioned they will explore in the future.



Review #3

  • Please describe the contribution of the paper
    • This study presents a multistage alignment and fusion framework for multiclass Alzheimer’s disease diagnosis by integrating multimodal data.
    • Diffusion imaging data are embedded at multiple levels through the Swin-Fiber Orientation Distribution (Swin-FOD) model applied to dMRI.
    • Contrastive learning is used to align MRI and PET features from the same case.
    • The Tabular Prior-data Fitted In-context Learning (TabPFN) method was used to combine neuroimaging features and cognitive features (MoCA scores).
    • The SHAP-based analysis of modality-specific feature embeddings revealed clinically relevant information on how each data type contributes to predicting AD.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The state-of-the-art TabPFN method, incorporating both self-attention and cross-attention, was introduced to improve Alzheimer’s classification. The resulting pre-trained priors do not require additional retraining or fine-tuning. The results demonstrated improved classification performance compared to other state-of-the-art models.
    • The contrastive learning with a bidirectional loss demonstrated the effectiveness of aligning MRI and PET features from the same case.
    • A comprehensive ablation study demonstrated the effectiveness of 1) the TabPFN pretraining and 2) the contrastive-learning-based multimodal fusion, and also supports the feasibility of clinical translation in cases where only limited imaging modalities are available due to resource limitations.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The Swin-FOD path is designed as a relatively independent path from the MRI/PET fusion, which seems a bit disconnected in the current version. Given that these are all multimodal neuroimaging data, it would be beneficial to consider incorporating FOD together with MRI/PET in a single multimodal fusion framework (such as by utilizing multi-view or multi-scale contrastive learning), or to include a discussion of why this is not needed or implemented.
    • Page 6, Table 2: Although the proposed contrastive learning + TabPFN approach demonstrated the best performance in the ablation studies, the performance of the other models seems much lower than what is generally reported for AD/MCI/CN classification trained on multimodal neuroimaging data, especially for accuracy, F1, precision, and recall (all of which are around 0.4–0.5 without TabPFN).
    • This might be related to the fact that the study performs multiclass classification over three classes with a naturally ordered relationship reflecting different stages of disease progression severity; in particular, the MCI class is a clinically ill-defined, noisy class. A more accepted and recommended experimental design usually consists of performing CN vs. AD classification, followed by classifying MCI subjects who will convert to AD at follow-up visits, which can be derived from the longitudinal data in ADNI.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The technical novelty and significance of the proposed study (Swin-FOD, contrastive-learning-based fusion) outweigh its limitations (not fully integrated multimodal fusion, low classification performance), which leads to my recommendation above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal addressed my remaining comments, some of which will need to be addressed in future research. That does not affect the novelty and strength of the current work.




Author Feedback

We would like to thank the reviewers and area chair for their time and the encouraging comments.

Reviewer 1: Thank you for the encouraging comments. We agree that analyzing the contribution of each component and comparing with additional multimodal and contrastive learning methods for AD diagnosis are important next steps, and we plan to explore them in future work.

Reviewer 2: Thank you for the helpful comments. We will revise the paper accordingly.

  1. Technical Novelty: Our work addresses the challenge of misalignment between different modalities, which can reduce diagnostic accuracy. To solve this, we first developed a Swin-FOD model to process the complex 4D FODs efficiently; this is the first work to use the Swin Transformer on FODs. For fusing T1-weighted MRI and tau PET, we adapted the ALBEF model to handle 3D volumes effectively. To capture relationships between features, we employed the pre-trained priors in TabPFN, avoiding the need for additional feature alignment. This is also the first study to apply TabPFN to multimodal AD diagnosis.

  2. TabPFN: The P(t) in Eq. (6) should be P; thank you. We will revise Fig. 3 to improve clarity. In Eq. (7), the variable t denotes the function mapping inputs x to outputs y under a given prior distribution P. In AD diagnosis, we use Eq. (6) to identify the most probable prior P given the input features. A Bayesian posterior prediction is then performed using this selected prior, computing a weighted average of predictions from different task hypotheses. This posterior mean approximates the relationship between x_{test} and y_{test}.
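For readers less familiar with the Prior-Data Fitted Network (PFN) formulation that TabPFN builds on, the posterior predictive paraphrased in the rebuttal can be written as follows. This is a sketch of the standard PFN objective (Müller et al.); whether it matches the notation of Eqs. (6)–(7) in the paper exactly is an assumption.

```latex
% Bayesian posterior predictive approximated by TabPFN in a single forward pass;
% t ranges over task hypotheses (mappings from x to y) drawn from the prior P.
p(y_{\mathrm{test}} \mid x_{\mathrm{test}}, D_{\mathrm{train}})
  \;\propto\; \int p(y_{\mathrm{test}} \mid x_{\mathrm{test}}, t)\,
  p(D_{\mathrm{train}} \mid t)\, p(t)\, \mathrm{d}t
```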

  3. Contribution of Different Components in Section 2.2: The MRI-PET contrastive loss L_{mpc} maximizes the similarity between features from the same subject. We introduce the masked image modeling loss L_{mim} to encourage MRI features to retain sufficient information to reconstruct masked PET regions; since PET is strongly associated with AD, this guides MRI features toward information relevant to AD diagnosis. We also introduce the MRI-PET matching loss L_{mpm} to ensure that features from different subjects remain distinguishable by preserving discriminative information. To stabilize training under noisy data, we use a momentum encoder that maintains a slowly updated copy of the feature extractor; this provides more consistent pseudo-targets for contrastive learning and improves convergence. The experiments in ALBEF (Li et al., NeurIPS 2021) demonstrated the effectiveness of each loss component and the momentum model. We selected our hyperparameter values, such as the 15% masking ratio and \alpha = 0.4, based on those in the ALBEF paper. Exploring hyperparameter settings is an interesting direction for future work.
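As an illustration of the bidirectional contrastive objective and the momentum update described here, the PyTorch sketch below follows the general ALBEF/MoCo recipe. It is not the authors' implementation: the encoder interfaces, temperature, and momentum coefficient are assumptions, and the masked image modeling and matching losses are omitted.

```python
import torch
import torch.nn.functional as F

def mri_pet_contrastive_loss(mri_feat, pet_feat, mom_mri_feat, mom_pet_feat,
                             temperature=0.07):
    """Bidirectional InfoNCE loss: match each MRI feature to the PET feature of
    the same subject (and vice versa), using momentum-encoder features as targets."""
    mri = F.normalize(mri_feat, dim=-1)
    pet = F.normalize(pet_feat, dim=-1)
    mom_mri = F.normalize(mom_mri_feat, dim=-1)
    mom_pet = F.normalize(mom_pet_feat, dim=-1)

    targets = torch.arange(mri.size(0), device=mri.device)   # diagonal = positives
    logits_m2p = mri @ mom_pet.t() / temperature              # MRI -> PET direction
    logits_p2m = pet @ mom_mri.t() / temperature              # PET -> MRI direction
    return 0.5 * (F.cross_entropy(logits_m2p, targets) +
                  F.cross_entropy(logits_p2m, targets))

@torch.no_grad()
def momentum_update(encoder, momentum_encoder, m=0.995):
    """Exponential moving average of the online encoder's weights."""
    for p, p_m in zip(encoder.parameters(), momentum_encoder.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)
```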

  4. Features in Table 2: The first half of Table 2 compares different image processing models, whose inputs are volumetric images. The second half evaluates different methods, including AdaBoost, that take vectorized features as their inputs. More specifically, the vectorized input features are the same tabular features extracted from the imaging modalities, along with the MoCA scores.
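To make this distinction concrete, the sketch below feeds the same vectorized tabular features to both a classical AdaBoost baseline and TabPFN. It assumes scikit-learn and the public tabpfn package, and the feature arrays are random placeholders rather than the study's data.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from tabpfn import TabPFNClassifier  # assumes the public `tabpfn` package is installed

# Placeholder arrays standing in for vectorized imaging features plus MoCA scores.
rng = np.random.default_rng(0)
X_train = rng.random((200, 32))            # n_subjects x n_features
y_train = rng.integers(0, 3, 200)          # 0 = CN, 1 = MCI, 2 = AD
X_test = rng.random((50, 32))

# Classical baseline trained on the tabular features.
ada = AdaBoostClassifier().fit(X_train, y_train)
ada_pred = ada.predict(X_test)

# TabPFN: in-context prediction from pre-trained priors, no fine-tuning.
tabpfn = TabPFNClassifier()
tabpfn.fit(X_train, y_train)               # stores the context; no gradient updates
tabpfn_pred = tabpfn.predict(X_test)
```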

  5. Names in Tab. 3: In Tab. 3, “Contr. (not fused)” refers to the unimodal features after E_3 and E_4 in Fig. 2b. Our two-stage alignment after E_6 improves the performance of the fused features (see the “Contr. (fused)” row). We will clarify this.

Reviewer 3: Thank you for the valuable suggestions. Integrating all modalities within a unified contrastive learning framework is attractive and worth future exploration. We currently treat FODs separately due to their complexity: we employed a Swin Transformer backbone to reduce memory usage and an order-balanced encoder to balance the importance of different orders. For performance evaluation, in this work we focused on the multiclass problem, which is challenging given the noisy MCI class. We will test the performance of our method on the recommended two-class problems, CN vs. AD and CN vs. converting MCI, in our future work.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper presents a promising approach that effectively integrates multi-modal information to improve performance. While there were concerns about clarity, technical explanations, and detailed justification, the rebuttal addressed these sufficiently. I recommend acceptance, with the suggestion that the authors carefully revise the manuscript to improve writing clarity and incorporate the reviewers’ feedback for the final version.


