Abstract

The vision transformer (ViT), powered by token-to-token self-attention, has demonstrated superior performance across various vision tasks. The large, even global, receptive field obtained via dense self-attention allows it to build stronger representations than CNNs. However, compared with natural images, medical images are available in far smaller quantities and have lower signal-to-noise ratios, often resulting in poor convergence of vanilla self-attention and introducing non-negligible noise from extensive unrelated tokens. Moreover, token-to-token self-attention incurs heavy memory and computation costs, hindering its deployment on various computing platforms. In this paper, we propose a dynamic self-attention sparsification method for medical transformers that merges similar feature tokens for dependency distillation under the guidance of feature prototypes. Specifically, we first generate feature prototypes with genetic relationships by simulating the process of cell division, where the number of prototypes is much smaller than the number of feature tokens. Then, in each self-attention layer, key and value tokens are grouped based on their distance from the feature prototypes. Tokens in the same group, together with the corresponding feature prototype, are merged into a new prototype according to both feature importance and grouping confidence. Finally, query tokens build pair-wise dependencies with these newly updated prototypes, yielding fewer but still global and more efficient interactions. Extensive experiments on three publicly available datasets demonstrate the effectiveness of our solution, which works as a plug-and-play module for joint complexity reduction and performance improvement of various medical transformers. Code is available at https://github.com/xianlin7/DMA.
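To make the mechanism concrete, below is a minimal PyTorch sketch of prototype-guided attention sparsification in the spirit of the abstract. All names and design choices here (the class name, `num_prototypes`, soft grouping via a softmax, confidence-weighted merging) are illustrative assumptions rather than the authors' implementation; the official DMA code at https://github.com/xianlin7/DMA is authoritative.

    # Illustrative sketch only; not the authors' DMA implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PrototypeSparseAttention(nn.Module):
        """Queries attend to M merged prototypes instead of all N tokens."""

        def __init__(self, dim, num_prototypes=16):
            super().__init__()
            self.q = nn.Linear(dim, dim)
            self.k = nn.Linear(dim, dim)
            self.v = nn.Linear(dim, dim)
            # Learnable feature prototypes; M is much smaller than N.
            self.prototypes = nn.Parameter(torch.randn(num_prototypes, dim))
            self.scale = dim ** -0.5

        def forward(self, x):  # x: (B, N, C)
            q, k, v = self.q(x), self.k(x), self.v(x)
            p = self.prototypes  # (M, C)

            # 1) Soft-group key tokens by similarity to each prototype;
            #    the softmax over prototypes acts as a grouping confidence.
            group = F.softmax(k @ p.t() * self.scale, dim=-1)  # (B, N, M)

            # 2) Merge the tokens of each group into updated key/value
            #    prototypes via a confidence-weighted average.
            w = group / (group.sum(dim=1, keepdim=True) + 1e-6)
            k_p = w.transpose(1, 2) @ k  # (B, M, C)
            v_p = w.transpose(1, 2) @ v  # (B, M, C)

            # 3) Queries attend to the M prototypes only, reducing the
            #    attention cost from O(N^2) to O(N * M).
            attn = F.softmax(q @ k_p.transpose(1, 2) * self.scale, dim=-1)
            return attn @ v_p  # (B, N, C)

For example, with an input of shape (2, 196, 64) and 16 prototypes, each query interacts with 16 merged prototypes rather than 196 tokens, which is the source of the memory and FLOPs savings the abstract claims.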

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1620_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1620_supp.pdf

Link to the Code Repository

https://github.com/xianlin7/DMA

Link to the Dataset(s)

https://github.com/xianlin7/DMA

BibTeX

@InProceedings{Lin_Revisiting_MICCAI2024,
        author = { Lin, Xian and Wang, Zhehao and Yan, Zengqiang and Yu, Li},
        title = { { Revisiting Self-Attention in Medical Transformers via Dependency Sparsification } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This manuscript proposes a dynamic self-attention sparsification method for medical transformers by merging similar feature tokens for dependency distillation under the guidance of feature prototypes.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This manuscript revisits self-attention in medical transformers and proposes dependency merging attention (DMA) for joint complexity reduction and performance improvement in a plug-and-play manner. Extensive experiments on three publicly available datasets demonstrate the effectiveness of the proposed method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This manuscript does not highlight the unique problems encountered by ViT in medical scenarios. In addition, comparative methods based on linear transformers should be considered.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. The idea of leveraging prototypes for attention sparsification is not novel; the authors should highlight the differences between their approach and existing methods.
    2. Most of the comparative methods are based on attention sparsification, but there are numerous studies on linear transformers that should also be considered.
    3. In addition, the comparison baselines in this manuscript were designed for natural scenes. The authors should pay more attention to the differences between medical and natural scenes and address the unique challenges of ViT in medical analysis.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The details are shown in Q6 and Q10.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a new method for dynamic self-attention sparsification in medical transformers, merging similar feature tokens to distill dependencies by grouping tokens based on their proximity to feature prototypes. This approach aims to improve the computational efficiency and reduce the memory consumption of ViTs applied to medical imaging, where challenges include smaller datasets and higher noise levels compared to natural image processing. The authors validate this method across three public datasets, demonstrating its effectiveness as a plug-and-play module that enhances performance while reducing complexity.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Innovative approach: The paper introduces a novel technique for sparsifying the self-attention mechanism in medical transformers, which optimizes computational efficiency and memory usage, addressing significant challenges in medical image processing.
    • Robust validation: The method has been rigorously tested and validated on three publicly available medical datasets, demonstrating its effectiveness and robustness across different types of medical imaging data.
    • Practical implementation: The proposed method is presented as a plug-and-play module, which can be easily integrated into existing transformer models, making it highly practical and accessible for broader adoption in the medical imaging field.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Scalability with larger datasets: Although the proposed method demonstrates effectiveness on the datasets used, its scalability and performance on significantly larger datasets have not been explicitly tested. This raises questions about how well the approach would generalize across a broader range of medical imaging tasks or in clinical environments with much larger data volumes.

    • Dependency on prototype quality: The performance of the method heavily relies on the quality of the feature prototypes generated. If the prototypes are not representative of the underlying medical imagery, the effectiveness of the token merging could be compromised, leading to suboptimal model performance.

    • Computational cost of prototype generation: The initial computational cost involved in generating and updating the feature prototypes also increases the total processing time. Therefore, it would be beneficial to discuss this aspect further to assess its impact on overall system efficiency.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • It would be better to provide statistical analysis of the metrics to verify that the proposed method is significantly better than the other baselines.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The strengths slightly outweigh the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a dynamic self-attention sparsification method for medical transformers, addressing the limitations of token-to-token self-attention in medical image analysis. The authors highlight that medical image datasets are smaller and the images have lower signal-to-noise ratios compared to natural images, leading to poor convergence and noise issues with vanilla self-attention. Moreover, the heavy memory and computation requirements of token-to-token self-attention hinder its deployment on different computing platforms. The proposed method is evaluated on three publicly available datasets, demonstrating its effectiveness in reducing complexity and improving performance across various medical transformer models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present the ideas and graphical representations in this paper with particular clarity, which is highly commendable and serves as an excellent example for colleagues to learn from.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper still has some deficiencies in certain details, such as the method description and implementation details. It is therefore recommended that the authors make the corresponding revisions.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It is recommended that the authors provide additional implementation details to enhance reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. There have been numerous token pruning methods proposed recently, and it is recommended that the authors compare them with DMA.
    2. The authors’ description of the method is well-written, but it does not align well with Figure 3, resulting in a suboptimal reading experience. By contrast, Figure 3 in the supplementary material provides a more reader-friendly explanation.
    3. Additional implementation details need to be provided.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The clarity and organization of the paper, including its adherence to formatting guidelines and the logical flow of ideas, are crucial for my evaluation. Additionally, the novelty and comprehensiveness of the paper also serve as important criteria for the aforementioned assessment. Overall, the manuscript is well-written and technically sound.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for their valuable comments and for recognizing the effective method (R1, R3, R4), extensive experiments (R1, R3), innovative design (R3, R4), practical implementation (R3), and clear presentation (R4). Major concerns are addressed as follows, and modifications will be reflected in the final version.

Q: Unique problems encountered by ViT in medical scenarios. (R1)
A: An analysis of ViT in medical scenarios is presented in the third paragraph on Page 2 and Fig. 1. Overall, compared to natural scenes, medical datasets are of small scale, with similar backgrounds and low signal-to-noise ratios, causing ViT's global dependencies and attention maps to be sub-optimal and collapsed. The visualization results in Fig. 4 (c) indicate that the proposed DMA can solve the problem of uniform dependency in medical scenarios and generate richer attention matrices.

Q: Comparative methods based on linear transformers should be considered. (R1)
A: Linear attention is a type of sparse attention. As summarized in Table 1, we have compared the proposed method with linear attention, e.g., kNN [7], axial attention [16], BRA [18], STA [19], and PaCa [20]. Since linear transformers are typically built on linear attention, the effectiveness of our method can be demonstrated through comparison with linear attention.

Q: Compare with token pruning methods. (R4)
A: Thanks for the suggestion. Considering that most token pruning methods are designed for classification tasks and often come with performance degradation, we did not compare DMA with token pruning methods; we will do so in future work.

Q: Differences between DMA and existing methods. (R1)
A: The differences between DMA and existing methods are presented in Fig. 2. Existing prototype-based methods (i.e., Fig. 2 (g)) generate each prototype by merging all feature tokens, which ignores the differences between prototypes and the heterogeneity across feature tokens. Comparatively, DMA (i.e., Fig. 2 (h)) first divides feature tokens into different groups based on their feature distributions, and only feature tokens located within the same group are merged to form a new prototype. As illustrated in Fig. 2 of the supplementary material, our merging region is adjusted according to the context, the merging strategy is more flexible, and the merged dependencies have a higher signal-to-noise ratio.

Q: Dependency on prototype quality. (R3)
A: Yes, the quality of the prototypes affects the token merging results and, in turn, the model performance. Therefore, we introduced the prototype loss to help the model generate high-quality prototypes. The visualization results in Fig. 2 of the supplementary material demonstrate that DMA can generate high-quality prototypes for reasonable dependency merging.

Q: Computational cost of prototype generation. (R3)
A: Comparison results on GPU memory cost and computational complexity (FLOPs) are summarized in Fig. 4 (b). Although generating and updating the prototypes adds computational cost, it greatly reduces redundant calculations in global dependency modeling, resulting in a significant overall reduction in GPU memory and FLOPs.

Q: Additional implementation details. (R4)
A: The deployment of DMA in ViTs is illustrated in Fig. 3 of the supplementary material. All efficient-attention methods are used to replace the vanilla self-attention in TransUNet for comparison. All models are implemented in PyTorch and trained for 400 epochs with an Adam optimizer, a batch size of 8, and a learning rate of 0.0001 under the same experimental environment. Random rotation, contrast adjustment, and gamma augmentation are adopted for data augmentation.

Q: Scalability with larger datasets. (R3)
A: Thanks for the suggestion. We will evaluate the performance of DMA on larger datasets in the future.
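For concreteness, a minimal sketch of that training setup in PyTorch follows. Here `model`, `train_loader`, and the cross-entropy loss are hypothetical placeholders standing in for the actual networks and segmentation objective; the augmentations listed above (random rotation, contrast adjustment, gamma) would live in the dataset's transform pipeline.

    # Sketch of the stated setup: Adam, lr 1e-4, batch size 8, 400 epochs.
    import torch

    def train(model, train_loader, device="cuda"):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
        criterion = torch.nn.CrossEntropyLoss()  # placeholder loss
        model.to(device).train()
        for epoch in range(400):
            for images, labels in train_loader:  # loader built with batch_size=8
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()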




Meta-Review

Meta-review not available (early accepted paper).
