Abstract

Medical image segmentation is of significant importance for computer-aided diagnosis. In this task, methods based on Convolutional Neural Networks (CNNs) have shown good performance in extracting local features. However, they cannot capture global dependencies, which are crucial for medical images. On the other hand, Transformer-based methods can establish global dependencies through self-attention, providing a supplement to local convolution. However, the expensive matrix multiplication in the self-attention of a vanilla transformer and its memory usage are still a bottleneck. In this work, we propose a segmentation model named EMF-former. By combining DWConv, channel shuffle, and PWConv, we design a Depthwise Separable Shuffled Convolution Module (DSPConv) to reduce the parameter count of convolutions. Additionally, we employ an efficient Vector Aggregation Attention (VAA) that substitutes key-value interactions with element-wise multiplication after broadcasting two vectors to reduce computational complexity. Moreover, we substitute the parallel multi-head attention module with the Serial Multi-Head Attention Module (S-MHA) to reduce feature redundancy and memory usage in multi-head attention. Combining the above modules, EMF-former can perform medical image segmentation efficiently with fewer parameters, lower computational complexity, and lower memory usage while preserving segmentation accuracy. We conduct experimental evaluations on the ACDC and Hippocampus datasets, achieving mIOU values of 80.5% and 78.8%, respectively.
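
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a mean intersection-over-union (mIOU) score is commonly computed from predicted and ground-truth label masks. It is not the authors' evaluation code. The function name `mean_iou`, the flattened-list mask format, and the choice to skip classes absent from both masks are assumptions made for illustration; the paper may average differently (for example, by excluding the background class).

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes that appear in pred or target.

    pred, target: flat sequences of integer class labels, one per pixel.
    Assumption: a class absent from both masks is skipped, not scored.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        # Pixels labelled cls in both masks.
        inter = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        # Pixels labelled cls in at least one mask.
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    print(mean_iou(pred, target, num_classes=3))
```

In this toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.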

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1181_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1181_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hao_EMFformer_MICCAI2024,
        author = { Hao, Zhaoquan and Quan, Hongyan and Lu, Yinbin},
        title = { { EMF-former: An Efficient and Memory-Friendly Transformer for Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces an efficient and memory-friendly transformer for medical image segmentation. Specifically, the Depthwise Separable Shuffled Convolution Module (DSPConv), efficient Vector Aggregation Attention (VAA), and Serial Multi-Head Attention Module (S-MHA) are developed to construct EMFormer for medical image segmentation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    EMFormer is proposed to perform medical segmentation. It seems that the proposed method can achieve an excellent balance between performance and efficiency.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The DSPConv module is similar to the module proposed in shuffleNet. The proposed method is mainly based on SegFormer. As far as I know, many methods have achieved higher performance on the ACDC dataset, such as SwinUnet. Why do the authors not compare the proposed method with SwinUnet? Can the authors claim their method achieves SOTA?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please provide more comparisons between the proposed method and other methods, such as SwinUnet.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Some contributions are overclaimed.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper contributes a new Transformer-based model tailored for medical image segmentation that balances efficiency, memory usage, and segmentation accuracy, which are crucial factors for practical applications in the medical imaging domain.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper introduces a novel Depthwise Separable Shuffled Convolution Module (DSPConv) that significantly reduces parameter count for medical image segmentation tasks. It also presents an efficient Vector Aggregation Attention (VAA) mechanism to lower computational complexity in attention-based models. The proposed EMFormer model achieves competitive performance with reduced memory usage, demonstrating its potential for practical medical imaging applications.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper does not fully explore the generalizability of EMFormer across a broader range of medical imaging datasets. The computational efficiency gains may come at the cost of increased model complexity, which could be a concern for real-time applications. Additionally, the paper could benefit from further discussion on the clinical implications and potential integration of the proposed model into existing diagnostic workflows.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See weaknesses

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a Transformer-based approach to medical image segmentation that stands out for its efficiency and reduced memory demands. It integrates several modules like DSPConv, VAA, and Serial Multi-Head Attention to reduce the number of parameters, computational complexity, and memory usage while maintaining the effectiveness of the model. Despite its lightweight design, it delivers impressive segmentation accuracy, as evidenced by mIOU scores of 80.5% and 78.8% on the ACDC and Hippocampus datasets, respectively. Its ability to improve overall evaluation metrics further validates the model’s effectiveness.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper presents a medical image segmentation model EMFormer, which is clinically significant as it assists in computer-aided diagnosis.
    2. EMFormer introduces a novel Depthwise Separable Shuffled Convolution Module (DSPConv) that significantly reduces the parameter count and computational complexity, making it efficient for medical image segmentation tasks.
    3. The paper presents a new Vector Aggregation Attention (VAA) that simplifies the attention computation process, reducing the computational cost while maintaining the ability to capture global dependencies. A Serial Multi-Head Attention Module (S-MHA) is proposed to decrease memory usage and computational redundancy, which is crucial for processing medical images.
    4. The paper mentions achieving high mIOU values on the ACDC and Hippocampus datasets, implying quality segmentation performance when compared with other models.
    5. The authors discuss designing novel segmentation heads for future research, indicating awareness of limitations and future directions.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The abstract is poorly written in my opinion. It has too many abbreviations (e.g., DWConv, PWConv, ACDC), which makes it difficult for someone with no prior knowledge of the domain to understand.
    2. There is no clear information on the diversity of patient data or disease manifestations.
    3. Results are compared with prior models, but the clinical significance and contribution are not explicitly discussed in the provided context.
    4. The figures are hard to read, and the text in the figures is too small.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Since the source code is not provided, the authors must list the transformer parameters and detailed layer parameters.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. I suggest the authors rewrite the abstract, limiting the use of abbreviations and technical terms to make the article accessible to a wider audience.
    2. Although this is a medical image segmentation work, a previous work has already used the name EMFORMER in 2020 (EMFORMER: EFFICIENT MEMORY TRANSFORMER) for low latency streaming speech recognition. The author needs to clarify and perhaps change the name of the model such that it doesn’t conflict or confuse readers. Also, there is a need to clearly state the difference in methodology between the proposed method and the pre-existing work in 2020 (Same name, same title, different application) to ensure there is no conflict of understanding for potential readers.

    The above stated paper can be found below: @article{Shi2020EmformerEM, title={Emformer: Efficient Memory Transformer Based Acoustic Model for Low Latency Streaming Speech Recognition}, author={Yangyang Shi and Yongqiang Wang and Chunyang Wu and Ching-feng Yeh and Julian Chan and Frank Zhang and Duc Le and Michael L. Seltzer}, journal={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, year={2020}, pages={6783-6787} }

    3. The authors have not explicitly detailed data collection (no reference or links for any of the datasets). This needs to be added, because future researchers working in this direction may want to test their methods on the same datasets for fair comparison. Furthermore, detailed data preprocessing and post-processing methods need to be listed to provide context to readers.
    4. The methodology section needs to be written with a clear and concise structure, eliminating redundancy and focusing on the strengths of the proposed method.
    5. The authors need to check several statements and rewrite them for clarity. There are grammatical errors and poorly structured sentences, for example: a. …EMFormer ensures segmentation accuracy on “several 2D medical image” while…(Conclusion, Page 8) b. And we replace VAA with additive attention proposed in Swiftformer[21], which has a similar attention calculation to our VAA, the mIOU value decreases. (Do not start a sentence with “And”) c. In section 2.3, it is best not to start the subsection with “Meanwhile, the multi-head ….”. There should be a clear logical flow that correlates subsections; for example, what is the relationship connecting 2.2 and 2.3? The authors need to clearly express the workflow of the methodology to enhance reader clarity.
    6. The authors should discuss the clinical significance and contribution, along with patient diversity and potential causes of good/poor segmentation performance.
    7. Consider increasing the font size of the figures, especially figure 2.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    a. Accessibility and Clarity: The abstract’s use of abbreviations and technical terms could limit the paper’s accessibility to a broader audience. b. Naming Conflict: The use of the name “EMFORMER” could lead to confusion with a previously published work. A name change and a clear distinction in methodology would prevent potential misunderstandings among readers. c. Data Transparency: The absence of detailed data collection, preprocessing, and post-processing methods could hinder reproducibility and fair comparison with future research. d. The methodology section requires restructuring to eliminate redundancy and highlight the strengths of the proposed method more effectively. e. The paper contains grammatical errors and poorly structured sentences that detract from its professional quality and readability. f. A discussion on the clinical significance, patient diversity, and factors affecting segmentation performance would enhance the paper’s contribution to the field.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The need for a more accessible abstract, the clarification of the model name to avoid confusion with pre-existing work, the lack of detailed data collection and processing information, the necessity for a clearer methodology section, the presence of grammatical errors, and the absence of a thorough discussion on clinical significance and patient diversity are all valid concerns that warrant attention. However, these issues do not overshadow the paper’s contributions and potential impact, suggesting that with revisions addressing these points, the paper could provide valuable insights into medical image segmentation. My decision remains “weak accept”, contingent upon the authors’ successful revision in response to the feedback provided.




Author Feedback

First of all, we would like to thank the reviewers and chairs for their questions and suggestions on our work. We noticed that the questions involve different aspects, mainly the details of the modules, the generalizability of the model, the origin and processing of the datasets, and readability. We respond to each question below.

To Reviewer#1: Q: Why not compare the method with SwinUnet? A: We are grateful for the suggestion of further experiments, and we will consider SwinUnet as a comparison in future work. In this paper, we chose the CCNet model, an excellent segmentation model proposed recently, which we believe performs better than SwinUnet. Experimental results demonstrate the good performance of our model compared to CCNet.

Q: The DSPConv module is similar to ShuffleNet. A: The ideas in our DSPConv module of “convolution only on part of the channel” and “mixing the convolved and unconvolved parts by channel shuffle” are not found in ShuffleNet. Moreover, we compared against ShuffleNet in the ablation studies, which showed that DSPConv performs better.
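
The two ideas named in this answer can be illustrated with a minimal sketch; it is not the authors' DSPConv implementation. Channels are represented as plain list items, `conv` is a hypothetical stand-in for whatever convolution is applied, and the half/half split and two-group shuffle are assumptions chosen for illustration. The shuffle itself follows the standard ShuffleNet-style interleaving of equal-sized channel groups.

```python
def channel_shuffle(channels, groups):
    """Interleave `groups` equal-sized contiguous channel groups."""
    if len(channels) % groups != 0:
        raise ValueError("channel count must be divisible by groups")
    per_group = len(channels) // groups
    # Take the i-th channel of every group in turn.
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def partial_conv_then_shuffle(channels, conv):
    """Apply `conv` to the first half of the channels only, then mix.

    Assumptions: an even channel count and a half/half split; `conv`
    is a hypothetical per-channel operation supplied by the caller.
    """
    split = len(channels) // 2
    convolved = [conv(c) for c in channels[:split]]
    untouched = list(channels[split:])
    # Two groups: convolved channels and untouched channels.
    return channel_shuffle(convolved + untouched, groups=2)


if __name__ == "__main__":
    # Upper-casing stands in for "this channel was convolved".
    print(partial_conv_then_shuffle(["a", "b", "c", "d"], str.upper))
```

With the toy input above, the convolved channels `A`, `B` and the untouched channels `c`, `d` come out interleaved as `A`, `c`, `B`, `d`, so every neighbouring pair contains one channel of each kind.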

Q: The method is mainly based on SegFormer. A: First, we follow SegFormer because its lightweight decoder fits our lightweight design goals. We then designed the DSPConv, VAA, and S-MHA modules to replace the original modules of SegFormer. As the ablation studies show, our EMFormer is completely different from SegFormer.

To Reviewer#3: Q: The paper does not fully explore the generalizability of EMFormer across a broader range of datasets. A: Thank you for your question. In our experiments, we carefully selected the datasets to include both large and small targets, aiming to maximize the generalizability of our model and to perform well across different target shapes and sizes.

Q: Real-time applications. A: We measured the delay previously, and the delay of EMFormer is lower than that of Unet, reaching 0.03 s. In the future, we will open-source our code to verify our claim.

Q: Further discussion on the clinical implications. A: In the Conclusion section, we mentioned, “Experimental results demonstrate…” This further demonstrates that, for clinical applications, our method can achieve good segmentation results without demanding hardware resources. This will help to improve efficiency.

To Reviewer#4: Q: Naming Conflict. A: Thank you for your question. We did not realize that the name “EMFormer” had already been used, and we will revise the title appropriately. After reading the “EMFormer” paper you provided, we found that it is a method applied in the field of NLP that reduces computation by storing features in “memory”. Therefore, it is still different from our proposed “EMFormer”.

Q: Data Transparency. A: The ACDC dataset is from a previous work, “Deep learning techniques for automatic MRI cardiac multi-structures segmentation and diagnosis: is the problem solved?”. The Hippocampus dataset is from “The medical segmentation decathlon”. Both are publicly available datasets. In data processing, we remove frames that contain only background and split each sequence into 2D slices. Additional details are given in Section 3.1 of the paper.
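
The preprocessing step described in this answer (dropping background-only frames and splitting a sequence into 2D slices) could be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: volumes are nested Python lists indexed as slice, row, column; the function name `foreground_slices` is hypothetical; and label 0 is assumed to denote background.

```python
def foreground_slices(image_volume, label_volume, background=0):
    """Split paired volumes into 2D slices, keeping only labelled ones.

    image_volume, label_volume: lists of 2D slices (lists of rows).
    Returns a list of (image_slice, label_slice) pairs whose label
    slice contains at least one non-background pixel.
    Assumption: `background` is the label value of empty pixels.
    """
    if len(image_volume) != len(label_volume):
        raise ValueError("image and label volumes must have equal depth")
    pairs = []
    for image_slice, label_slice in zip(image_volume, label_volume):
        has_foreground = any(
            pixel != background for row in label_slice for pixel in row
        )
        if has_foreground:
            pairs.append((image_slice, label_slice))
    return pairs


if __name__ == "__main__":
    images = [[[5, 6], [7, 8]], [[1, 2], [3, 4]], [[9, 9], [9, 9]]]
    labels = [[[0, 1], [0, 0]], [[0, 0], [0, 0]], [[2, 2], [0, 0]]]
    kept = foreground_slices(images, labels)
    print(len(kept))  # the middle slice has only background and is dropped
```

Filtering on the label volume rather than the image keeps the image and mask slices aligned, which matters when the pairs are later fed to a 2D segmentation model.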

Q: Need a discussion on the clinical significance. A: In the Conclusion section, we mentioned, “Experimental results demonstrate…” This further demonstrates that, for clinical applications, our method can achieve good segmentation results without demanding hardware resources. This will help to improve efficiency.

Q: Accessibility and Clarity. A: We will revise the abstract to improve readability, rewrite some sentences in the paper to fix grammatical errors, and increase the font size of the figures.

Q: Source code is not provided. A: In the future, we will open-source our code.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


