Abstract

In recent years, transformer-based image classification methods have demonstrated remarkable effectiveness across various image classification tasks. However, their application to medical images presents challenges, especially in the feature extraction capability of the network. Additionally, these models often struggle with the efficient propagation of essential information throughout the network, hindering their performance in medical imaging tasks. To overcome these challenges, we introduce a novel framework comprising a Local-Global Transformer module and a Spatial Attention Fusion module, collectively referred to as Med-Former. These modules are specifically designed to enhance the feature extraction capability at both local and global levels and improve the propagation of vital information within the network. To evaluate the efficacy of our proposed Med-Former framework, we conducted experiments on three publicly available medical image datasets: NIH Chest X-ray14, DermaMNIST, and BloodMNIST. Our results demonstrate that Med-Former outperforms state-of-the-art approaches, underscoring its superior generalization capability and effectiveness in medical image classification.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0867_paper.pdf

SharedIt Link: https://rdcu.be/dV58J

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_42

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Cho_MedFormer_MICCAI2024,
        author = {Chowdary, G. Jignesh and Yin, Zhaozheng},
        title = {{Med-Former: A Transformer based Architecture for Medical Image Classification}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {448--457}
}

Reviews

Review #1

Please describe the contribution of the paper

This manuscript proposes Med-Former, a novel framework comprising a Local-Global Transformer (LGT) module and a Spatial Attention Fusion (SAF) module for medical image classification, evaluated on three public datasets covering chest X-ray, skin lesion, and blood cell images. The LGT module uses parallel paths with different window sizes to capture local and global information, and the SAF module concatenates the features of two different stages with spatial attention. According to the tables provided, Med-Former achieves SOTA performance, although the comparisons in the manuscript are not comprehensive.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The authors propose LGT and SAF modules built on the Swin Transformer and achieve superior performance.
2. Ablation experiments and comparisons with other methods demonstrate the effectiveness of the proposed method and its SOTA performance on the three datasets.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The proposed method lacks interpretation of its results, which matters because it is intended to assist in diagnosis. The authors provide some correctly classified and misclassified samples and Grad-CAM visualizations for their ablation studies, but with limited interpretation.
2. The evaluation uses different metrics for different datasets without justification, and on the BM dataset the method is compared with only two others, which makes the claim of SOTA performance unconvincing.
3. The authors claim that their model addresses the information loss problem of ViT and Swin Transformer, but there is limited justification of what that information loss is and how the proposed method solves it. The experimental settings used for ViT and Swin Transformer on the three datasets are also unclear in the manuscript.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

1. The authors use different metrics for different datasets without justification: AUC for Chest X-ray14, and ACC for DermaMNIST and BloodMNIST (BM), when comparing with other methods in Table 1 and Table 2. Many papers use AUC, ACC, or both to evaluate classification performance. To ensure a fair comparison, it would be better to compare all methods using the same metrics.
2. In Table 2, only two methods are included for comparison on the BM dataset, while the other datasets have four. More comparisons with other methods on the BM dataset should be added to demonstrate the effectiveness of the proposed method.
3. Figure 4 indicates that Med-Former attends to the essential regions of the image while focusing less on the background. It would be better to add the ground truth of the region of interest/lesion to better illustrate that the model has learned the critical features of the chest X-ray images.
4. The details of the datasets are not given, such as the resolution of the images and the number of training and testing samples.
5. The authors argue that previous models, such as ViT and Swin Transformer, suffer from information loss from earlier layers, and that the proposed Med-Former can address this problem. What is the information loss that ViT and Swin Transformer have, and why can Med-Former address it? Discussion with references, visualizations, and justifications should be provided to make this argument valid and convincing.
6. The proposed method lacks interpretation, which matters because it is intended to assist with diagnosis. More visualization results, such as highlighted regions of interest, and more interpretation could be given to provide information that assists with diagnosis.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Reject — could be rejected, dependent on rebuttal (3)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This manuscript is well organized, and many experiments are conducted to demonstrate its effectiveness, indicating SOTA performance on three public datasets. However, reasonable and fair comparisons with SOTA methods are not well presented, and the analysis of the results can be further improved, especially the interpretation of the visualizations. The challenge that the authors aim to solve should be explained in more detail.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Weak Accept — could be accepted, dependent on rebuttal (4)
[Post rebuttal] Please justify your decision

The authors have addressed my questions regarding the compared dataset, metrics, information loss and visualization problems. It would be better to have some clinically relevant evidence or other supporting materials to indicate the clinical value. I am willing to raise my rating.

Review #2

Please describe the contribution of the paper

The paper proposes a variant of “(shifted) window” attention models, where not only windowed attention is shifted to include context, but this happens concurrently for smaller and larger windows. Also, the authors propose to improve information flow between layers (here: stages) of a transformer using a mechanism called spatial attention maps. The paper demonstrates the performance gain on three benchmark classification datasets and also provides an ablation experiment to quantify the incremental benefits of the two architectural additions.
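To make the windowing described here concrete, the following is a minimal sketch, not the authors' implementation. It partitions a toy 8 x 8 token grid into non-overlapping windows at two sizes and shows how a cyclic shift lets one window straddle the original window boundaries. The window sizes (2 and 4), the shift of half a window (the convention used by Swin Transformer), and all function names are assumptions for illustration; the attention masking needed for wrapped-around tokens is omitted.

```python
from typing import List

Grid = List[List[int]]


def cyclic_shift(grid: Grid, shift: int) -> Grid:
    """Roll the grid up and to the left by `shift` positions with wrap-around."""
    h, w = len(grid), len(grid[0])
    return [[grid[(r + shift) % h][(c + shift) % w] for c in range(w)]
            for r in range(h)]


def window_partition(grid: Grid, size: int) -> List[List[int]]:
    """Split an H x W grid into non-overlapping size x size windows (row-major)."""
    h, w = len(grid), len(grid[0])
    if h % size or w % size:
        raise ValueError("grid dimensions must be divisible by the window size")
    windows = []
    for r0 in range(0, h, size):
        for c0 in range(0, w, size):
            windows.append([grid[r][c]
                            for r in range(r0, r0 + size)
                            for c in range(c0, c0 + size)])
    return windows


def regular_window_index(token: int, size: int, width: int) -> int:
    """Index of the unshifted window that contains a given token id."""
    r, c = divmod(token, width)
    return (r // size) * (width // size) + c // size


WIDTH = 8
# Token ids laid out on an 8 x 8 grid (hypothetical toy input).
grid = [[r * WIDTH + c for c in range(WIDTH)] for r in range(WIDTH)]

small = window_partition(grid, 2)   # fine-grained path: 16 windows of 4 tokens
large = window_partition(grid, 4)   # coarse path: 4 windows of 16 tokens

# Shifted layer for the large windows: shift by half a window (assumption).
shifted_large = window_partition(cyclic_shift(grid, 4 // 2), 4)

# The first shifted window mixes tokens from all four unshifted windows.
touched = {regular_window_index(t, 4, WIDTH) for t in shifted_large[0]}
```

In this toy setting, `touched` contains every unshifted window index, which is the sense in which shifting lets attention cross window boundaries.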
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper is presented in clear English and easy to read. The authors seek to motivate their development from a clinical need for reliable and robust classification of images. They hypothesize that even in high-performance transformer architectures like SWIN there’s information loss that needs to be remedied to achieve better performance. Based on this assumption, they propose two mechanisms to regard local and global information, and to ease the information forwarding to subsequent layers.

The main contributions are two architectural building blocks (that are, however, specifically designed to work together, so that they aren’t actually two modular developments). The authors propose to use three datasets for validation of the approach, and set up their experiments so that both a validation study including ablations can be conducted, and comparisons with other approaches on the same data are possible. This structured performance characterization is for me one strength of the paper.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

There are largely two major points I have. (1) to me it is unclear how exactly the two architectural additions are implemented. There is no visual explanation how windows are constructed, how their size relation might help to merge local and “global” information, how they are shifted, and how exactly which “feature maps” are fused in the spatial attention block. It is also not explained what is calculated in this so-called spatial attention fusion. Is this just a concatenation (in channel dimension?) of spatially resampled feature maps? I consequently fail to understand why the spatial attention block is explicitly called “spatial” - how is the notion of spatial relations enforced? Transformers don’t care about ordering of tokens unless encoded explicitly, but there is no such mention. (2) I wondered a lot about the final description of the actual model used for testing. The two “window sizes” m and n are apparently 7 and 3, which is very small if aggregation of larger context blocks should be achieved in one of the windows. What results is that a lot of detail information will be forwarded into the attention (which runs between these windows), yielding feature maps (?) that are then run through MLPs. Those must be huge, I suppose, for larger images, and might actually be the component in this architecture that takes care of long-range dependencies.

I dare to think that your architecture essentially builds a kind of CNN (given the 7/3 pixel wide windows) without weight sharing, where the many MSA between the windows and even larger MLPs take care of information flow. There was the MLP Mixer before that also showed on-par performance with transformers, while not using any attention mechanisms at all…

Of lesser importance, but obviously connected, is the lack of a clear mathematical description (if no intuitive image/graph can be drawn to elucidate the data flow) that would help to understand what exactly is computed where. This applies not only to the SAF, but also the Local-Global Transformer blocks. Also in this context, a breakdown of how many trainable parameters are where in this architecture (and a comparison to ViT and SWIN) would be very helpful.
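As an illustration of the kind of parameter breakdown requested here, the sketch below counts the trainable parameters of one standard pre-norm transformer block. It is a generic calculation, not a count for Med-Former: the embedding width of 96, the MLP ratio of 4, the 7 x 7 window, the 3 heads, and the Swin-style relative position bias table are all assumed values, and the function name is hypothetical.

```python
def transformer_block_params(dim: int, mlp_ratio: int = 4,
                             window: int = 0, heads: int = 0) -> dict:
    """Count trainable parameters of one standard transformer block.

    Assumes biased linear layers, two LayerNorms, and (optionally) a
    Swin-style relative position bias table of (2 * window - 1)^2 * heads.
    """
    # Q, K, V projections plus the output projection, each dim x dim + bias.
    attention = 3 * (dim * dim + dim) + (dim * dim + dim)
    # Relative position bias is only counted when a window size is given.
    rel_bias = (2 * window - 1) ** 2 * heads if window else 0
    # Two-layer MLP: dim -> hidden -> dim, with biases.
    hidden = mlp_ratio * dim
    mlp = (dim * hidden + hidden) + (hidden * dim + dim)
    # Two LayerNorms, each with a weight and a bias vector.
    norms = 2 * (2 * dim)
    return {
        "attention": attention,
        "rel_bias": rel_bias,
        "mlp": mlp,
        "norms": norms,
        "total": attention + rel_bias + mlp + norms,
    }


# Hypothetical configuration: width 96, 7 x 7 windows, 3 attention heads.
breakdown = transformer_block_params(96, window=7, heads=3)
for name, count in breakdown.items():
    print(f"{name:>9}: {count}")
```

Under these assumptions the MLP holds roughly twice as many parameters as the attention layer, which is why a per-component breakdown is informative when comparing architectures.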
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

The paper lacks a detailed enough description of both the implementation and the data preparation to reproduce the results without an implementation. This also affects the review as written before.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

I tried to be detailed and constructive above, so please address those points in a rebuttal. Beyond the above points, a slightly more informative description of the datasets would be good. I would recommend revisiting the figures. I think it should be possible to indicate visually what the different blocks are actually doing. If this is not possible, rather omit Fig. 2, which is rather uninformative to me and/or could be merged with Fig. 1. Instead, provide clear mathematical descriptions of what is actually calculated. This will also help in understanding the code once it gets released. Some more details on a few paragraphs: In the Introduction, p. 2, top paragraph, you summarize “old stories” about vanishing gradients in CNNs; this seems very out of place and can be boiled down to one sentence, if at all. One paragraph later, you state that e.g. SWIN transformers suffer from information loss. How do you know this? This is a claim without justification. Similarly, in your “contributions” you claim that your proposal “enhances the extraction of contextual information”, but I don’t see clearly how you validate this. GradCAM can’t really elucidate this. The GradCAM “explanations” even looked more focussed, contradicting the claim that more context contributed.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Judging from the quantitative evaluation, the proposed architecture seems to provide a performance gain. I miss the above points in the paper, so that some amendments would for me make this publication much better. Trusting that the code will be released, I don’t see this as a point preventing presentation, so that I opt for a weak accept.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

They introduced a transformer-based architecture for medical image classification.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

By designing the Local-Global Transformer and Spatial Attention Fusion modules, they enable the model to learn both local and global information. This innovation facilitates the effective propagation of essential information throughout the network.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Why the proposed method works better than the compared ones is not analyzed in detail.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

Why the proposed method works better than the compared ones is not analyzed in detail. Specifically, in the experiments, the analysis of the results is quite limited. Please elaborate on that.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Accept — should be accepted, independent of rebuttal (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The proposed method is sound.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Author Feedback

We sincerely thank all reviewers (R) for the meticulous review. Below, we address the questions (Q) and comments (C):

[1] Interpretation [R4-Q1/C3, R6] Following R4, we will add GT bounding boxes for ROIs in Fig. 4. The comparison in Fig. 4 shows that integrating cross-layer information enhances Swin’s capability to understand contextual information (Swin + IF concat). Additionally, the SAF module, as a plug-in, effectively propagates essential information (Swin + IF SAF). The proposed LGT module enables the model to extract crucial contextual information by focusing on the ROI, though it still slightly extends its focus beyond the ROI, which the SAF module addresses (LGT + IF SAF). These findings indicate that Med-Former (LGT + IF SAF) attends to essential information necessary for diagnosis.

[2] Different metrics for datasets [R4-Q2] Aligning with other SOTA models [8, 12, 14, 4] that used only AUC for evaluation on the NIH dataset, we utilized AUC for fair comparison. ACC was the primary metric for comparison on the BM and DM datasets, and we used the same metric for fair comparison. We value the feedback and can report both AUC and ACC for all the datasets for our model, though other SOTA methods only report one metric on the three datasets.
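To illustrate why reporting both metrics can matter, here is a small sketch that computes ACC and AUC from the same set of predictions. It is a binary toy example with made-up labels and scores, not data from the paper; the three datasets are multi-label or multi-class, so a real evaluation would apply these per class or per label. The function names and the 0.5 threshold are assumptions.

```python
from typing import Sequence


def accuracy(labels: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the pairwise-ranking definition.

    Equals the probability that a random positive sample receives a higher
    score than a random negative sample, with ties counted as one half.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up predictions: ranking is mostly right, but scores are poorly calibrated.
labels = [0, 0, 1, 1, 1, 0]
scores = [0.10, 0.40, 0.35, 0.80, 0.45, 0.20]

acc = accuracy(labels, scores)   # depends on the chosen threshold
auc = roc_auc(labels, scores)    # depends only on the ranking of the scores
print(f"ACC = {acc:.3f}, AUC = {auc:.3f}")
```

Here the two positives scored below 0.5 lower the accuracy, while the AUC stays high because most positives still outrank the negatives. Two methods can therefore rank differently depending on which metric is reported.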

[3] Information loss in ViT & Swin. Why can Med-Former address it? [R4-Q3] As the network depth increases, there is a loss of information from earlier layers in both ViT and Swin. Techniques like patch-merging layers in Swin can worsen this issue, leading to subpar performance. Med-Former addresses this by combining information extracted by preceding layers (phase/stage) through the SAF module. The connection between different stages also enhances information propagation within the network. This is supported by the ablation results in Table 3, where models with these connections (Swin + IF concat, LGT + IF concat) show improved performance compared to those without the connections (Swin, LGT). Performance is further enhanced with SAF modules (Swin + IF SAF, LGT + IF SAF), evidencing that SAF modules facilitate essential information propagation within the network.

[4] Comparisons on BM dataset [R4-C2] BM is a new dataset released in 2023. At the time of submission, only two published methods were available for comparison.

[5] Details on architecture [R5-Q1]
Window construction/size: Smaller windows capture finer details, while larger windows capture global context. Combining information from different window sizes enhances contextual understanding.
Window shifting: In Fig. 2, the LGT module has two layers, ‘l’ and ‘l+1’. In ‘l’, windows are arranged in a grid-like pattern, while in ‘l+1’, windows shift horizontally and vertically by a certain number of patches, enhancing the model’s ability to capture relationships across window boundaries.
SAF: SAF combines the output feature map of the current stage (fA) with that of the previous stage/phase (fB). It resizes fA to match the dimensions of fB, computes attention maps for both, combines these along the channel dimension, and forwards the result to the next stage.
Small window sizes: Med-Former employs a hierarchical structure, allowing the capture of global context even with smaller window sizes. Shifted-window mechanisms aggregate information across adjacent windows, capturing broader contextual information. We acknowledge that MLPs also capture long-range dependencies, and their size is particularly crucial for larger images.
Relation with CNN: Our model can be seen as a type of CNN with the added capability to handle long-range dependencies and enhanced global contextual extraction.
Comparison with MLP Mixer: Transformers are superior in extracting contextual information (i.e., shape and texture) compared to MLP mixers (arXiv:2106.13122).
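The SAF data flow described in this item (resize fA, attend to both maps, concatenate along channels) can be sketched as follows. This is only an illustration of that flow under stated assumptions, not the authors' module: the rebuttal does not specify how the attention maps are computed, so the sketch substitutes a simple gate (sigmoid of the channel-wise mean at each location), uses nearest-neighbour resizing, and represents feature maps as nested lists. All function names are hypothetical.

```python
import math
from typing import List

FeatureMap = List[List[List[float]]]  # indexed as [channel][row][column]


def resize_nearest(f: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Nearest-neighbour resize of every channel to out_h x out_w."""
    in_h, in_w = len(f[0]), len(f[0][0])
    return [[[ch[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
             for r in range(out_h)] for ch in f]


def spatial_attention(f: FeatureMap) -> FeatureMap:
    """Reweight each spatial location by a gate shared across channels.

    ASSUMPTION: the gate is sigmoid(mean over channels); the paper's actual
    attention computation is not given in the text above.
    """
    n_ch, h, w = len(f), len(f[0]), len(f[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(n_ch)]
    for r in range(h):
        for c in range(w):
            mean = sum(f[k][r][c] for k in range(n_ch)) / n_ch
            gate = 1.0 / (1.0 + math.exp(-mean))
            for k in range(n_ch):
                out[k][r][c] = f[k][r][c] * gate
    return out


def saf_fuse(f_a: FeatureMap, f_b: FeatureMap) -> FeatureMap:
    """Resize f_a to f_b's spatial size, attend to both, concatenate channels."""
    h, w = len(f_b[0]), len(f_b[0][0])
    f_a_resized = resize_nearest(f_a, h, w)
    # List concatenation on the outer axis is channel-wise concatenation.
    return spatial_attention(f_a_resized) + spatial_attention(f_b)


# Toy inputs: current stage is 4 channels at 2 x 2, previous stage 2 at 4 x 4.
f_a = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
f_b = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
fused = saf_fuse(f_a, f_b)  # 6 channels at 4 x 4
```

The fused map keeps the spatial size of the earlier stage and stacks both sets of channels, so a following layer would need to accept the summed channel count.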

We will adopt the following suggestions: providing more details on datasets [R4-C4, R5-C1], updating Fig. 1 and 2 with more architecture details [R5-Q4], comparing the number of trainable parameters [R5-Q5], and simplifying the introduction.

Meta-Review

Meta-review #1

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

A local-global transformer module and a spatial attention fusion module are proposed to help the model efficiently propagate essential information from medical images. The experimental results show that the proposed model improves on the conventional Swin Transformer and achieves state-of-the-art performance on three benchmark classification datasets. Despite these performance gains, the evaluation metrics are inconsistent (AUC for chest X-ray but ACC for DM and BM). The paper could be improved by a theoretical analysis of the proposed modules and a more concise description of the method. There is a lack of explanation and analysis of how Med-Former can mitigate the information loss in earlier layers compared to ViT and Swin. In the rebuttal, the authors clearly addressed the concerns about the dataset, metrics, and information loss. All reviewers agreed that this paper is acceptable.
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).


Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

All reviewers agree that this paper should be accepted.
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

