Abstract

In this paper, we conduct an extensive exploration of the Vision Transformer (ViT) for brain medical imaging in a low-data regime. The recent and ongoing success of Vision Transformers in computer vision has motivated their development in medical imaging, but compensating for their weak inductive bias in the brain imaging domain poses a real challenge, since collecting and accessing large amounts of brain medical data is a labor-intensive process. Motivated by the need to bridge this data gap, we investigated alternative training strategies, ranging from self-supervised pre-training to knowledge distillation, to determine the feasibility of producing a practical plain ViT model. To this end, we conducted an intensive set of experiments using a small amount of labeled 3D brain MRI data for the task of Alzheimer’s disease classification. Our experiments yield an optimal training recipe, thus paving the way for Vision Transformer-based models in other low-data medical imaging applications. To bolster further development, we release our assortment of pre-trained models for a variety of MRI-related applications: https://github.com/qasymjomart/ViT_recipe_for_AD
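
Below is a minimal sketch of the fine-tuning stage of the recipe described above, written in PyTorch. The encoder stand-in, embedding size, volume shape, and checkpoint path are illustrative assumptions, not the authors' actual API; the released models are available in the linked repository.

    import torch
    import torch.nn as nn

    class ToyEncoder(nn.Module):
        # Stand-in for a 3D ViT backbone that maps an MRI volume to a feature vector.
        def __init__(self, embed_dim: int = 768):
            super().__init__()
            self.proj = nn.Sequential(nn.Flatten(), nn.LazyLinear(embed_dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.proj(x)

    class ADClassifier(nn.Module):
        # Pre-trained encoder plus a linear head for AD vs. cognitively normal.
        def __init__(self, encoder: nn.Module, embed_dim: int = 768):
            super().__init__()
            self.encoder = encoder
            self.head = nn.Linear(embed_dim, 2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.head(self.encoder(x))

    model = ADClassifier(ToyEncoder())
    # model.encoder.load_state_dict(torch.load("mae_pretrained.pth"))  # hypothetical checkpoint path
    x = torch.randn(2, 1, 96, 96, 96)   # batch of single-channel 3D brain MRI volumes
    logits = model(x)                   # shape (2, 2): class logits per volume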

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2724_paper.pdf

SharedIt Link: https://rdcu.be/dY6f1

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_32

Supplementary Material: N/A

Link to the Code Repository

https://github.com/qasymjomart/ViT_recipe_for_AD

Link to the Dataset(s)

https://brain-development.org/ixi-dataset
https://adni.loni.usc.edu
https://www.med.upenn.edu/cbica/brats
https://sites.wustl.edu/oasisbrains/home/oasis-3

BibTex

@InProceedings{Kun_Training_MICCAI2024,
        author = { Kunanbayev, Kassymzhomart and Shen, Vyacheslav and Kim, Dae-Shik},
        title = { { Training ViT with Limited Data for Alzheimer’s Disease Classification: an Empirical Study } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {334 -- 343}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper evaluates the effect of pretraining on the performance of a ViT used for Alzheimer’s disease detection. Different pretraining methods and configurations are evaluated on combinations of different pretraining and downstream datasets. The general finding is that pretraining is helpful.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper leverages multiple datasets and multiple configurations for the pretraining, such as with and without augmentation, different patch sizes, etc.
    2. The results are presented in a clear manner; no overhyped claims are made.
    3. The authors consistently made use of measures of uncertainty.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. There is only limited technical novelty.
    2. The evaluation focuses on a specific use case and technique.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors included a link to the source code in the paper and used public datasets. In addition, the experiments are well documented, so I am confident that the paper is reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. While the authors did a solid job on the evaluation, there are naturally some limitations, in this case, for example, the focused use case and the single pretraining method. Please include (where possible) a discussion of these and other limitations in the study.
    2. Please report the runtimes for the different approaches, perhaps in an appendix.
    3. Please discuss how the findings correspond to findings from similar studies in other fields.
    4. Since you invested a significant amount of time and effort into the pretraining: will the resulting foundation models be published?
    5. There are multiple (masked) pretraining methods available. Could you elaborate on the reasons for selecting the specific one you chose?
    6. I suggest reducing the list of contributions and writing them more precisely and concisely.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well done and well written. It is interesting to read, and the evaluation is carried out quite well. There is only limited technical novelty, but I would consider this more of an evaluation paper, and for that purpose sufficient data has been used. For a stronger accept, the paper would have required evaluations of more approaches and use cases; right now it is rather specific to a single use case.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper explores various training strategies for Alzheimer’s disease classification from 3D brain MRI using Vision Transformers (ViT). The methods investigated include knowledge distillation, pre-training on cross-task datasets, and regularization techniques. Given that ViTs are data-intensive, the paper offers a tailored regimen for pre-training and fine-tuning ViTs. Although the paper does not introduce novel insights relative to the existing body of work on natural images, its adaptation to medical imaging datasets like ADNI is a valuable contribution to the MICCAI community.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Comprehensive empirical analysis: the paper effectively explores a variety of training strategies tailored to Vision Transformers, including knowledge distillation, pre-training on diverse datasets, and sophisticated regularization techniques, filling a gap in the medical imaging community.
    2) It is well written, with a clear presentation of results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) Limited new insights compared to the natural imaging domain: existing empirical studies on pre-training and knowledge distillation have already shown performance enhancements in limited-data scenarios [1]. Also, research has shown that a CNN teacher (ResNet-152 in this paper’s case) can boost ViTs [2]. Additionally, ViTs have been shown to outperform CNNs even without extensive pre-training or data augmentation when trained with sharpness-aware minimization [3].
    2) The exploration of self-supervised pre-training on cross-task datasets, particularly for transformer-based architectures, has been extensively studied. MAE and MoCo v3 demonstrate the effectiveness of fine-tuning pre-trained models and of linear probing, albeit on natural images rather than out-of-distribution data (ADNI in this case). Nevertheless, other studies have proposed fine-tuning transformers with new insights on out-of-distribution (OOD) data (ADNI can be considered more OOD to natural images than BraTS) [4].
    3) There is no comparison with traditional models under similar parameterizations, such as DenseNet-121, which could provide a clearer evaluation of ViT and the released pretrained models.

    References:
    [1] Rajasegaran, J., Khan, S. H., Hayat, M., Khan, F. S., Shah, M. Self-supervised knowledge distillation for few-shot learning. In British Machine Vision Conference (BMVC), 2020.
    [2] Bai, J., Yuan, L., Xia, S.-T., Yan, S., Li, Z., Liu, W. Improving vision transformers by revisiting high-frequency components. In ECCV, 2022.
    [3] Chen, X., et al. When vision transformers outperform ResNets without pre-training or strong data augmentations. In ICLR, 2022. https://openreview.net/forum?id=LtKcMgGOeLt
    [4] Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. In ICLR, 2022.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) The paper should clarify how its empirical observations differ from existing insights on natural images. 2) Justify the choice of comparison frameworks, particularly why traditional CNN-based methods were not comprehensively compared, despite evidence suggesting that even basic CNN models (e.g., DenseNet-121) might surpass transformer-based approaches.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Though the paper complements insights previously analysed on natural images, these insights could be of interest to the MICCAI community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper conducts a comprehensive empirical study on using Vision Transformers (ViT) to classify Alzheimer’s Disease (AD) with limited 3D brain MRI data. It explores the pre-training of ViTs using self-supervised learning (Masked Autoencoders) with non-homogeneous datasets, and then fine-tunes these models for AD classification. The study investigates the effects of masking ratios and pre-training data sizes, evaluates the model’s performance in low-data scenarios against baseline models, and examines the efficacy of various training strategies like data augmentation, regularization, and knowledge distillation. Additionally, it contributes to ongoing research by releasing the pre-trained ViT models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work addresses an interesting problem: designing data-efficient strategies for applying Vision Transformers to the AD classification problem. The paper is well written. The context is properly introduced, and the methodology is clear. The robustness of the results is assessed with multiple runs. Additionally, the release of the pre-trained ViT models for MRI applications is useful for the reproducibility of the obtained results.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While the paper compares ViT pre-training with training from scratch, a more comprehensive comparison with other state-of-the-art methods for AD classification could have provided better context for the performance evaluation. I personally think that benchmarking the proposed self-supervised pre-training against a baseline of fine-tuning an ImageNet pre-trained ViT would have been useful to better situate the obtained results within the state of the art.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • I personally think that adding a short motivation for the decoder design in the pre-training with masked autoencoders could be beneficial for the reader.
    • I found the knowledge distillation through attention experiment interesting, and I personally think that adding a few more details on its mechanism may benefit the understanding of this experiment.
    • I suggest the authors add an explanation for the bold entries in the caption of Table 2.
    • In the ablation studies, performing statistical tests could provide further information on the importance of each tested component.
    • In the knowledge distillation experiment, the authors mention that a teacher model (3D ResNet-152) is trained “for each seed and fold”. I suggest the authors add a reference to the evaluation protocol to avoid possible confusion in the reader.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a comprehensive empirical study on the application of Vision Transformers for Alzheimer’s Disease classification using limited 3D brain MRI data. The study’s key strength lies in its well-designed methodology, systematically investigating various training strategies to overcome data scarcity. The authors have demonstrated the effectiveness of transferring pre-trained ViT features from non-homogeneous datasets. Despite its strengths, the paper could have been enhanced by a more exhaustive comparison with existing methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

First, we would like to express our gratitude to the reviewers for their thorough and constructive feedback. We note that reviewers’ feedback will be reflected in the final camera-ready submission as much as the rules and space permit.

  • Comparison with other state-of-the-art methods and inclusion of other use cases: In our work, we mainly focused on finding an optimal training recipe for the Vision Transformer, drawing on various methods and techniques that would result in elevated performance on AD classification. Our empirical experiments substantially elevated performance through a range of techniques, including pre-training on MRI datasets unrelated to the downstream task, data augmentation, regularization, and knowledge distillation. Nevertheless, we believe that comparisons with other methods, including convolutional networks such as DenseNet-121, as well as application to other MRI use cases, are important and remain the focus of a future extension of our work.
  • Reason for choosing Masked Autoencoders as the pre-training method: The Masked Autoencoder (MAE) remains one of the strongest state-of-the-art pre-training methods across various fields and recent advances [1], [2], [3]. In addition, MAE is notable for its scalability, simplicity of implementation, and masking of a high proportion of the input (up to 75%), which reduces the computational burden since the encoder processes only the visible patches; a minimal masking sketch is provided after this list.
    [1] A. Kirillov et al., “Segment Anything,” arXiv:2304.02643, Apr. 2023. Available: https://arxiv.org/abs/2304.02643
    [2] C. Feichtenhofer et al., “Masked Autoencoders As Spatiotemporal Learners,” arXiv:2205.09113, May 2022. Available: https://arxiv.org/abs/2205.09113
    [3] S. Woo et al., “ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders,” arXiv:2301.00808, Jan. 2023. Available: https://arxiv.org/abs/2301.00808
  • Limited new insights compared to the natural images domain: First, our core architecture is specifically engineered to handle 3D inputs, whereas the data and associated weights for natural images are predominantly 2D (e.g., ImageNet). This difference poses challenges in transferring pre-trained features from the natural images domain to medical imaging, which typically involves 3D data; hence, in our work we do not incorporate weights of a ViT pre-trained on natural images. Second, given the existing works that have already investigated the effects of pre-training and knowledge distillation in limited-data settings, our work focuses instead on the 3D medical imaging domain under the same limited-data constraints. Nevertheless, reference [1] provided by Reviewer #6 does not appear to consider the Vision Transformer architecture, whereas it is the central architecture of our work. Third, reference [2] by Reviewer #6 on knowledge distillation with a convolutional teacher provides substantial and valuable findings, which will be considered in future work. Yet it is important to note that the dataset used for training in [2] is significantly larger than the ADNI1/2 datasets we use in our work. Also, as shown in our ablation experiments, hard-label distillation without pre-training did not have a significant effect on AD classification; a minimal sketch of this hard-label distillation loss is provided after this list. Finally, comparisons with the sharpness-aware minimization optimizer of reference [3] by Reviewer #6 require additional computational experiments, which we will also consider in future work.
  • The exploration of self-supervised pre-training on cross-task datasets: We note reference [4] provided by Reviewer #6, and we will consider balancing fine-tuning against linear probing to avoid feature distortion when performing downstream training of features pre-trained on datasets unrelated to the downstream task.
  • Performing statistical tests: We will carefully consider statistical tests for inclusion in our extended paper. Given that our experiments were conducted using cross-validation with three repetitions, we believe this evaluation protocol demonstrates the reliability of our observations.
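
To make the MAE point above concrete, here is a minimal sketch of MAE-style random masking in PyTorch. It illustrates why a high masking ratio such as 75% cuts compute: the encoder processes only the kept tokens. The token count and embedding size are illustrative assumptions, not the exact configuration from the paper.

    import torch

    def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
        # tokens: (B, N, D) patch embeddings of a patchified 3D volume.
        B, N, D = tokens.shape
        n_keep = int(N * (1.0 - mask_ratio))
        noise = torch.rand(B, N)                    # one random score per patch
        ids_shuffle = noise.argsort(dim=1)          # random permutation of patch indices
        ids_keep = ids_shuffle[:, :n_keep]          # indices of the visible patches
        visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        mask = torch.ones(B, N)                     # 0 = visible, 1 = masked
        mask.scatter_(1, ids_keep, 0.0)
        return visible, mask, ids_shuffle

    tokens = torch.randn(4, 216, 768)   # e.g. 216 patches from a 3D MRI volume
    visible, mask, _ = random_masking(tokens)
    print(visible.shape)                # torch.Size([4, 54, 768]): 75% fewer tokens to encode

The decoder later reconstructs the masked patches from the encoded visible tokens plus mask tokens; only the encoder is kept for downstream fine-tuning.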
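Similarly, for the hard-label distillation mentioned above, a minimal sketch of the loss (in the style of DeiT's hard distillation) follows, again in PyTorch. The batch size, class count, and equal loss weighting are illustrative assumptions rather than the exact setup of the paper.

    import torch
    import torch.nn.functional as F

    def hard_distill_loss(student_logits, teacher_logits, targets):
        # The teacher's argmax prediction serves as a hard pseudo-label.
        teacher_labels = teacher_logits.argmax(dim=1)
        loss_ce = F.cross_entropy(student_logits, targets)          # ground-truth term
        loss_kd = F.cross_entropy(student_logits, teacher_labels)   # teacher term
        return 0.5 * loss_ce + 0.5 * loss_kd

    student_logits = torch.randn(8, 2, requires_grad=True)  # ViT student outputs
    teacher_logits = torch.randn(8, 2)                      # frozen CNN teacher outputs
    targets = torch.randint(0, 2, (8,))                     # AD vs. CN labels
    loss = hard_distill_loss(student_logits, teacher_logits, targets)
    loss.backward()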




Meta-Review

Meta-review not available, early accepted paper.


