Abstract

Convolutional neural networks have long dominated 3D medical image segmentation but may be limited by small receptive fields. Transformer models excel at capturing global relationships through self-attention but incur high computational costs at high resolutions. Recently, Mamba, a state space model, has emerged as an effective approach for sequential modeling. Inspired by its success, we introduce a novel Mamba-based 3D medical image segmentation model called EM-Net. It not only efficiently captures attentive interaction between regions by integrating and selecting channels, but also effectively utilizes the frequency domain to harmonize the learning of features across varying scales, while accelerating training. Comprehensive comparisons with other state-of-the-art (SOTA) algorithms on two challenging multi-organ datasets show that our method achieves better segmentation accuracy while requiring nearly half the parameters of SOTA models and training 2x faster. Our code is publicly available at https://github.com/zang0902/EM-Net.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1923_paper.pdf

SharedIt Link: https://rdcu.be/dV51d

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72114-4_26

Supplementary Material: N/A

Link to the Code Repository

https://github.com/zang0902/EM-Net

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Cha_EMNet_MICCAI2024,
        author = { Chang, Ao and Zeng, Jiajun and Huang, Ruobing and Ni, Dong},
        title = { { EM-Net: Efficient Channel and Frequency Learning with Mamba for 3D Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15009},
        month = {October},
        pages = {266 -- 275}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a Mamba-based 3D medical image segmentation model called EM-Net. Specifically, it inserts Mamba blocks and efficient frequency-domain learning (EFL) layers into a U-shaped segmentation network. The authors conduct experiments on the BTCV dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Slightly better performance than other competing methods on BTCV dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It seems that the authors have made a serious mistake. The ‘Synapse’ dataset and the ‘BTCV’ dataset are the same multi-organ segmentation dataset, provided by the Multi-Atlas Labeling Beyond The Cranial Vault (BTCV or BCV) challenge hosted on Synapse (https://www.synapse.org/#!Synapse:syn3193805/wiki/217785). It is confusing that the authors claim to validate their method on ‘two’ datasets and report largely different performance on the ‘two’ datasets.
    • The authors claim that they compare the proposed EM-Net with eight SOTA methods. But the popular 3D segmentation method nnUNet is not compared. Note that nnUNet is still a promising method in most 3D medical image segmentation tasks, especially in multi-organ segmentation on CT image. [1] Isensee, Fabian, et al. “nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation.” Nature methods 18.2 (2021): 203-211. [2] Isensee, Fabian, et al. “nnU-Net Revisited: A Call for Rigorous Validation in 3D Medical Image Segmentation.” arXiv preprint arXiv:2404.09556 (2024).
    • Why integrate FFT and IFFT into the UNet block? Its benefits have not been demonstrated.
    • The proposed method is simply a combination of Mamba and U-Net, which lacks novelty.
    • The adopted metric training speed (TS) in Table 3 is not a rigorous metric to evaluate model efficiency.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • The authors mention the use of both the ‘Synapse’ and ‘BTCV’ datasets; however, these are part of the same challenge, which might have led to a misunderstanding in the experimental design and results interpretation. It is crucial to clarify this point, as it currently presents a serious flaw in the validity of the experimental evaluation.
    • Considering nnUNet’s prominence in the field, a direct comparison would provide a clearer benchmark for assessing the performance of the proposed method.
    • The rationale behind integrating FFT and IFFT into the U-Net architecture is not clearly explained. Describing the specific benefits or improvements these integrations provide would help in understanding the proposed method’s uniqueness and effectiveness.
    • The combination of Mamba blocks and a U-shaped network is presented without sufficient novelty or a clear explanation of how this combination synergistically improves segmentation performance. More detailed discussion on the innovation in the architectural choices or the specific problems these choices address would be beneficial.
    • The use of training speed (TS), which depends on many factors, e.g., GPU type, CPU status, Memory, etc., as a metric is unconventional.
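    The concern about training speed can be illustrated by contrasting it with quantities that depend only on the architecture. The following is a minimal sketch, assuming a single plain 3D convolution layer with a cubic kernel; the function name and the layer sizes are hypothetical and are not taken from the paper. Parameter counts and multiply-accumulate operations (MACs) computed this way are identical on any GPU or CPU, whereas measured training speed is not.

```python
def conv3d_cost(in_channels, out_channels, kernel_size, out_shape, bias=True):
    """Return (parameters, MACs) for one 3D convolution layer.

    kernel_size: edge length k of a cubic k x k x k kernel.
    out_shape:   (D, H, W) of the output feature map.
    Both numbers follow from the layer definition alone, so they do not
    change with the hardware used for training.
    """
    weights_per_filter = in_channels * kernel_size ** 3
    params = out_channels * (weights_per_filter + (1 if bias else 0))
    depth, height, width = out_shape
    # One multiply-accumulate per kernel weight per output voxel per filter.
    macs = out_channels * weights_per_filter * depth * height * width
    return params, macs


# Hypothetical layer: 32 -> 64 channels, 3x3x3 kernel, 16^3 output volume.
params, macs = conv3d_cost(32, 64, 3, (16, 16, 16))
print(f"parameters: {params:,}  MACs: {macs:,}")
```

    Summing such per-layer counts over a network gives a figure that can be compared across papers regardless of the machine used, which is why FLOPs or MACs are commonly reported alongside wall-clock measurements.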
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper is rejected due to the confusing dataset usage, insufficiently comprehensive comparisons, unclear methodological contributions, and a lack of substantial novelty.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks for the feedback. If the authors, other reviewers, and ACs all agree that it is reasonable and proper to regard BTCV and Synapse as two datasets for method evaluation, I am also OK with that. By the way, the performance of UNETR++ mentioned by the authors is 87.22 on Synapse dataset, and 83.28 on BTCV dataset.



Review #2

  • Please describe the contribution of the paper

    This paper introduces a novel efficient 3D medical image segmentation framework named EM-Net, which consists of the following components: 1) CSRM block for channel squeeze-reinforce Mamba that learns to attend to specific regions, 2) EFL layer for efficient frequency-domain learning, and 3) a Mamba-infused decoder to further improve segmentation performance while suppressing memory costs. Comprehensive experimental results reveal that EM-Net outperforms other SOTA methods across two datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. EM-Net effectively captures attentive interaction between regions by integrating and selecting channels, while also utilizing the frequency domain to harmonize feature learning across different scales. This dual approach enhances the model’s ability to extract relevant features for segmentation tasks.
    2. The introduction of the Channel Squeeze-Reinforce Mamba (CSRM) block in the decoding stage demonstrates a unique method for feature selection and integration. This block helps in improving segmentation accuracy while maintaining efficiency in computation and memory usage.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The interpretability of the model’s decisions and feature representations is not extensively discussed. Providing insights into how EM-Net makes segmentation decisions and the clinical relevance of the learned features could enhance the paper’s impact and practical utility.
    2. The paper claims high efficiency and low computational complexity without providing a comparative analysis of computational resources.
    3. The paper does not compare against state-of-the-art methods on these two datasets, including but not limited to [1][2][3], which casts doubt on the claimed state-of-the-art performance.
    4. More ablation study on CSRM block and CSRM-F block should be conducted.

    [1] Jaus, Alexander, et al. “Towards unifying anatomy segmentation: automated generation of a full-body CT dataset via knowledge aggregation and anatomical guidelines.” arXiv preprint arXiv:2307.13375 (2023). [2] Xing, Zhaohu, et al. “Diff-unet: A diffusion embedded network for volumetric segmentation.” arXiv preprint arXiv:2303.10326 (2023). [3] Wu, Junde, et al. “MedSegDiff-V2: Diffusion-Based Medical Image Segmentation with Transformer.” Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 6. 2024.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The interpretability of the model’s decisions and feature representations should be extensively discussed.
    2. A comparative analysis of computational resources should be provided.
    3. State-of-the-art methods on these two datasets should be compared.
    4. More ablation study on CSRM block and CSRM-F block should be conducted.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper introduces two novel Mamba-based blocks to enhance the performance of 3D medical image segmentation, and claims that the model balances efficiency and high performance. However, the experiments are limited: more datasets and SOTA methods should be included, and more investigation of how the two blocks work should be provided.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    Motivated to find an alternative to CNNs, whose receptive field is limited, and Transformers, whose self-attention mechanism is computationally expensive, the authors propose a novel Mamba-based 3D medical image segmentation model called EM-Net. The proposed model is equipped with channel squeeze-reinforce Mamba (CSRM) blocks and efficient frequency-domain learning (EFL) layers. EM-Net is evaluated on two challenging 3D multi-organ segmentation datasets (Synapse, BTCV) with encouraging results.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Sound and innovative methodology, based on Mamba
    • Strong experiments with comparisons with 8 state-of-the-art methods including CNN-based, Transformers-based and U-Mamba.
    • Better segmentation accuracy than state-of-the-art methods while requiring nearly half parameters and 2x faster training speed
    • Robust performance across different organ sizes
    • Methodological contributions rigorously assessed through an ablation study
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Better positioning required in Sect.1 with respect to related works based on Mamba
    • Comparisons with existing methodologies could be confirmed using a statistical analysis through t-tests.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • The code will be released
    • Experiments are performed on publicly-available datasets
    • Implementation details exhaustively given
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The submitted paper is well written and the method is of high interest for the medical community. The methodological contributions are sound and innovative. Experiments are very well conducted. Comments provided below should be taken into account for further improvements.

    Main comments:

    1 - A better positioning is required in Sect.1 with respect to related works based on Mamba, and especially [14].

    2 - Comparisons with existing methodologies could be confirmed using a statistical analysis through t-tests. It would be particularly useful to compare your results with the ones from Swin UNETR and U-Mamba which are close.

    Minor comments:

    3 - At the beginning of Sect.2, the STEM and CSRM-F acronyms are not defined.

    4 - In the equation M = FI + ω(Ms + Me), you should describe in the text what is ω.

    5 - Typos: “we modulate the spectrum of M” instead of “we modulate the spectrum of m” in Sect.2.1; “Spleen” instead of “speen” on page 7
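    The statistical check suggested in comment 2 could look like the following minimal sketch of a paired t-test on per-case Dice scores. The two score lists are made-up illustrative numbers, not results from the paper or from any real method, and the function name is hypothetical. Only the t statistic is computed with the standard library; the comparison uses the two-sided 5% critical value for 5 degrees of freedom (2.571).

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for a paired t-test.

    scores_a and scores_b hold one score per test case, in the same case
    order, so that each pair comes from the same volume.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Made-up per-case Dice scores for two hypothetical methods on six cases.
method_a = [0.90, 0.88, 0.93, 0.85, 0.91, 0.89]
method_b = [0.88, 0.87, 0.90, 0.84, 0.88, 0.88]

t, df = paired_t_statistic(method_a, method_b)
T_CRIT_DF5 = 2.571  # two-sided critical value at alpha = 0.05 for df = 5
print(f"t = {t:.3f}, df = {df}, significant at 5%: {abs(t) > T_CRIT_DF5}")
```

    Pairing matters here because both methods are evaluated on the same cases; an unpaired test would ignore the per-case correlation. An exact p-value requires the t distribution, which is available for instance through `scipy.stats.ttest_rel`.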

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Innovative methodological contributions
    • Strong experiments with many comparisons and ablation study
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision
    • Innovative methodological contributions
    • Strong experiments with many comparisons and ablation study
    • Better positioning required with respect to related works based on Mamba




Author Feedback

We thank all the reviewers for their thoughtful reviews. We will modify the notation and figures of the draft as suggested.

  1. Concerns about novelty and frequency-domain utilization (e.g., FFT/IFFT usage) (R5): We argue that EM-Net proposes two customized layers to address two trade-off challenges faced by existing approaches, rather than merely embedding a Mamba block into the U-Net (e.g., [14]). Firstly, most existing methods struggle to balance capturing long-range dependencies with retaining complex 3D positional encoding. As diverse channels may emphasize different image regions (10.1109/TMI.2022.3197180), we proposed the CSR (channel squeeze-reinforce) Mamba layer (Fig.2 upper) to eliminate redundant channels and calibrate the focus to better adapt to different targets. Note that this layer incorporates Mamba for its efficiency in long-sequence summarization, although it could easily be replaced with other linear transformer blocks. Secondly, the trade-off between global and local information has been a long-standing issue in segmentation. Common approaches tend to either focus solely on the former or lean towards the latter, resulting in decreased accuracy when facing multi-scale targets. As the frequency domain (FD) provides a comprehensive representation of both local and global signals (arXiv.2304.10864, WACV2024), we proposed the EFL (efficient frequency learning) layer (Fig.2 lower), which transforms the task to the FD through the FFT. Its learnable frequency filter can highlight features at diverse scales, regardless of common architecture design choices (e.g., layer depth, kernel size), to suit the needs of a specific target. The final model exploits the combination of the two, and its decoder also utilizes the CSRM blocks to enhance feature selection and merging capabilities while minimizing additional computational and memory costs.
  2. More SOTA methods should be compared, with t-tests and visualization for validation, and the dataset descriptions were inaccurate (R3, 4, 5): We thank the reviewer for pointing this out. Synapse and BTCV do correlate, while the latter contains 5 more organs and thus inevitably yields different results. Following UNETR++ (10.1109/TMI.2024.3398728), we chose the two considering their popularity and representativeness among SOTA works. However, we also validated our approach on the Flare dataset and obtained similar conclusions (ours obtained the highest Dice of 65.3%); these results were omitted due to limited space. In the final manuscript, we can simply replace Synapse’s results with Flare’s for a more rigorous and comprehensive validation of our method. Meanwhile, we also implemented nnUNet and Diff-Unet following the reviewers’ comments, and they scored 78.56% and 62.27% on BTCV, which is in line with the outcomes in the original manuscript. A t-test was also performed to confirm that the performance differences are statistically significant (largest p = 0.035 < 0.05). We also modified Fig.3 accordingly with Grad-CAM visualization to enhance the interpretability of the model’s decisions. We have also made our code public for reproducibility.
  3. Ablation study and other metrics to evaluate model efficiency (R4, 5): In the original manuscript, we ablated the CSRM and CSRM-F blocks by constructing variants A, B, and C of the proposed EM-Net (see Fig.1b). Specifically, CSRM comprises three CSR Mamba layers, and CSRM-F enhances this by adding two EFL layers alongside one CSR Mamba layer. Variants A and C consist of CSRM only or CSRM-F only, while B swaps the positions of the two inserted blocks. Results suggested that while all these variants yielded high Dice scores, the proposed EM-Net exhibited a remarkable enhancement in training speed without compromising accuracy (Table 3, rows 5-8). Following the comment, we also added the GFLOPs metric to Table 3 (besides the MEM and TS metrics). Our GFLOPs are 15.97% of those required by U-Mamba, while also attaining a higher Dice coefficient.
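The frequency-domain filtering idea described in point 1 can be sketched as follows. This is a minimal NumPy illustration, assuming a single real-valued 3D volume and one complex gain per frequency bin; it is not the authors' EFL layer, and the function name, shapes, and filter values are hypothetical. It shows why one element-wise multiplication in the frequency domain can act on all spatial scales at once: each frequency bin depends on every voxel.

```python
import numpy as np


def frequency_filter_3d(x, weight):
    """Filter a real 3D volume by element-wise modulation of its spectrum.

    x:      real array of shape (D, H, W).
    weight: complex array of shape (D, H, W // 2 + 1); stands in for a
            learnable filter with one complex gain per frequency bin.
    """
    spectrum = np.fft.rfftn(x)                 # spatial -> frequency domain
    spectrum = spectrum * weight               # modulate every frequency bin
    return np.fft.irfftn(spectrum, s=x.shape)  # frequency -> spatial domain


rng = np.random.default_rng(0)
volume = rng.standard_normal((8, 8, 8))

# An all-ones filter leaves the volume unchanged.
identity = np.ones((8, 8, 5), dtype=complex)
restored = frequency_filter_3d(volume, identity)

# Keeping only the zero-frequency bin replaces every voxel with the global
# mean, i.e. a single weight controls fully global context.
dc_only = np.zeros((8, 8, 5), dtype=complex)
dc_only[0, 0, 0] = 1.0
smoothed = frequency_filter_3d(volume, dc_only)

print(np.allclose(restored, volume), np.allclose(smoothed, volume.mean()))
```

In a trainable setting the `weight` array would be a parameter updated by gradient descent; here it is fixed by hand only to make the two limiting cases easy to check.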




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes a novel Mamba-based 3D medical image segmentation model called EM-Net, addressing the limitations of CNNs and Transformers. EM-Net incorporates Channel Squeeze-Reinforce Mamba (CSRM) blocks and efficient frequency-domain learning (EFL) layers. It demonstrates strong performance on two 3D multi-organ segmentation datasets (Synapse, BTCV), outperforming state-of-the-art methods with fewer parameters and faster training speeds. The methodology is innovative, the experiments are thorough, and the results are promising. However, some improvements are needed, such as clearer positioning in the related works, statistical analysis of comparisons, detailed ablation studies, and addressing confusion regarding dataset usage. Despite these issues, the paper is well-written, the approach is methodologically sound, and the results are robust.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


