Abstract

Recent advances in Vision Transformers (ViTs) have significantly enhanced medical image segmentation by facilitating the learning of global relationships. However, these methods face a notable challenge in capturing diverse local and global long-range sequential feature representations, particularly evident in whole-body CT (WBCT) scans. To overcome this limitation, we introduce Swin Soft Mixture Transformer (Swin SMT), a novel architecture based on Swin UNETR. This model incorporates a Soft Mixture-of-Experts (Soft MoE) to effectively handle complex and diverse long-range dependencies. The use of Soft MoE allows for scaling up model parameters while maintaining a balance between computational complexity and segmentation performance in both training and inference modes. We evaluate Swin SMT on the publicly available TotalSegmentator-V2 dataset, which includes 117 major anatomical structures in WBCT images. Comprehensive experimental results demonstrate that Swin SMT outperforms several state-of-the-art methods in 3D anatomical structure segmentation, achieving an average Dice Similarity Coefficient of 85.09%. The code and pre-trained weights of Swin SMT are publicly available at https://github.com/MI2DataLab/SwinSMT.
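
For readers who want a concrete picture of the mechanism described above, below is a minimal, illustrative sketch of a Soft MoE layer standing in for a transformer FFN, written in plain PyTorch. The class and argument names (SoftMoE, slots_per_expert, hidden_mult) are ours for illustration; the authors' released implementation in the repository linked above may differ in its details.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Illustrative Soft Mixture-of-Experts layer (after Puigcerver et al.)."""

    def __init__(self, dim: int, num_experts: int = 32,
                 slots_per_expert: int = 1, hidden_mult: int = 4):
        super().__init__()
        self.num_experts = num_experts
        self.num_slots = num_experts * slots_per_expert
        # One learnable d-dimensional parameter vector per slot.
        self.slot_embed = nn.Parameter(torch.randn(dim, self.num_slots) * dim ** -0.5)
        # Each expert mirrors the FFN/MLP it replaces.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden_mult * dim), nn.GELU(),
                          nn.Linear(hidden_mult * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        logits = torch.einsum("btd,ds->bts", x, self.slot_embed)
        dispatch = logits.softmax(dim=1)   # softmax over tokens: fills each slot
        combine = logits.softmax(dim=2)    # softmax over slots: mixes slot outputs per token
        slot_in = torch.einsum("bts,btd->bsd", dispatch, x)
        # Each expert processes its own contiguous group of slots.
        slot_out = torch.cat(
            [f(c) for f, c in zip(self.experts, slot_in.chunk(self.num_experts, dim=1))],
            dim=1,
        )
        return torch.einsum("bts,bsd->btd", combine, slot_out)


# Quick shape check on a toy bottleneck feature map flattened to tokens.
if __name__ == "__main__":
    tokens = torch.randn(1, 4 * 4 * 4, 384)               # (batch, tokens, channels)
    print(SoftMoE(dim=384, num_experts=8)(tokens).shape)  # torch.Size([1, 64, 384])
```

Because routing is soft and fully differentiable, capacity can be scaled by adding experts without the token-dropping and load-balancing issues of sparse MoE.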

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0415_paper.pdf

SharedIt Link: https://rdcu.be/dZxeu

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72111-3_65

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0415_supp.pdf

Link to the Code Repository

https://github.com/MI2DataLab/SwinSMT

Link to the Dataset(s)

https://zenodo.org/records/10047292

BibTex

@InProceedings{Pło_Swin_MICCAI2024,
        author = { Płotka, Szymon and Chrabaszcz, Maciej and Biecek, Przemyslaw},
        title = { { Swin SMT: Global Sequential Modeling for Enhancing 3D Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15008},
        month = {October},
        page = {689 -- 699}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes Swin SMT, which replaces the MLP layer in the Swin Transformer block with Soft MoE to handle multi-class segmentation tasks. Experiments on the TotalSegmentator-V2 dataset demonstrate its performance advantages.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. Swin SMT uses the Soft MoE module to handle complex multi-class segmentation tasks; it is theoretically and experimentally effective.
    2. The description of the paper is clear, and the technical details are sufficient, which allows readers to understand all the details easily.
    3. The experimental results are convincing: the model is trained on a large-scale dataset, and some statistical information is given.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. My main concern is the computational complexity, particularly with Soft MoE, which has quadratic complexity and demands significant GPU memory. The paper restricts Soft MoE to the bottom layer due to these constraints. Training is conducted on A100 40 GB GPUs with a batch size of 1, whereas the Swin UNETR model may require minimal GPU resources. Given these factors, it is crucial for the authors to demonstrate whether adding a Soft MoE module leads to substantial GPU resource consumption.
    2. The paper contains table formatting errors that need attention.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    A more robust demonstration of the computational complexity of MoE could be provided, and a similar approach could be attempted to ascertain if there are performance gains in scenarios with fewer categories.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    It is based on the strengths and weaknesses that were mentioned earlier.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In the work presented, the authors successfully incorporate Soft Mixture of Experts (SMoE) into the famous Swin Transformer segmentation architecture. SMoE was recently introduced by Puigcerver et al. as a differentiable alternative to sparse Mixture of Experts. Mixture of Experts replaces the MLP layer within the transformer architecture and allows for scaling capacity without large increases in training or inference costs. In an extensive (the first of its kind) benchmarking study on the TotalSegmentator-V2 dataset, the authors show that the presented Swin Soft Mixture Transformer outperforms existing segmentation models using comparable parameter counts while being faster at inference.
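
    For context, a brief summary of the Soft MoE mechanism the reviewer refers to, following Puigcerver et al.; the notation below is ours, and the paper's exact formulation may differ in minor details. With m input tokens of dimension d routed to n experts with p slots each, both dispatch and combine are soft:

```latex
% X \in R^{m \times d}: input tokens; \Phi \in R^{d \times np}: learnable per-slot parameters.
% Dispatch: each slot input is a convex combination of all m tokens (softmax over tokens).
% Combine:  each token output is a convex combination of all np slot outputs (softmax over slots).
\begin{align*}
  D_{ij} &= \frac{\exp\!\big((X\Phi)_{ij}\big)}{\sum_{i'=1}^{m}\exp\!\big((X\Phi)_{i'j}\big)},
  \qquad \tilde{X} = D^{\top} X,\\
  \tilde{Y}_{i} &= f_{\lceil i/p \rceil}\big(\tilde{X}_{i}\big)
  \quad \text{(expert $f_e$ processes its own $p$ slots)},\\
  C_{ij} &= \frac{\exp\!\big((X\Phi)_{ij}\big)}{\sum_{j'=1}^{np}\exp\!\big((X\Phi)_{ij'}\big)},
  \qquad Y = C\,\tilde{Y}.
\end{align*}
```

    The layer output Y replaces the FFN output, so model capacity scales with the number of experts while every token still receives gradient through the soft routing.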

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work reflects the efforts of a highly skilled team. The integration of SMoE into a segmentation model represents a novel approach. The thorough evaluation on the TotalSegmentator-V2 dataset as well as the attained results serve as compelling evidence for the success of the presented Swin SMT architecture.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    While reading the work, I encountered some minor difficulties that initially made grasping the paper challenging. Nevertheless, it’s worth noting that both Swin Unetr and SMoE represent highly evolved methods compared to others in the field. The authors have done an admirable job presenting their work.

    • Regarding Fig. 1, it would be helpful if the caption explicitly defined the variable “N” (is it N=HWD/8?) for better clarity. Additionally, defining the “W-MSA” and “SW-MSA” layers briefly in the caption, as they are explained later in the paper, could aid comprehension.
    • Fig. 2 could benefit from a more intuitive presentation, perhaps a workflow diagram similar to the one used in the original SMoE work. This could enhance understanding, even with the well-written caption.
    • Maybe the authors could spend a sentence or two explaining the reason for the behaviour presented in Fig. 3. Why isn’t the patch size (128**3) the limiting factor for the model’s ability to capture global features? It took me some time to understand that it is likely due to the relatively large overlap of 50% during sliding-window (SW) inference (see the inference sketch after this list).
    • How many slots were used in the ablation study? Is it s = m/n?
    • The best model performance is achieved using n = 32 experts; is there a reason for not using more?
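
    For context on the sliding-window question above, here is a minimal inference sketch using MONAI's sliding_window_inference with 128**3 windows and 50% overlap; the placeholder network, the 118 output channels (117 structures plus background), and the Gaussian blending mode are our assumptions rather than settings confirmed by the paper.

```python
import torch
from monai.inferers import sliding_window_inference

# Placeholder predictor; the trained Swin SMT model would be dropped in here instead.
model = torch.nn.Conv3d(1, 118, kernel_size=1).eval()
ct_volume = torch.zeros(1, 1, 192, 192, 192)  # dummy preprocessed CT volume (B, C, D, H, W)

with torch.no_grad():
    logits = sliding_window_inference(
        inputs=ct_volume,
        roi_size=(128, 128, 128),  # the paper's 128**3 patch size
        sw_batch_size=1,
        predictor=model,
        overlap=0.5,               # adjacent windows share half their voxels
        mode="gaussian",           # blend overlapping window predictions smoothly
    )
print(logits.shape)  # torch.Size([1, 118, 192, 192, 192])
```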

    Another minor weakness that the authors could address concerns the benchmarking approach:

    • The team is clearly well equipped with hardware, but for smaller research teams, VRAM utilization during training and inference is quite relevant. It would be interesting to include those numbers, as well as training runtimes, in Table 1.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Depending on the hardware requirements of training, the authors could consider releasing the model weights for reproducibility reasons. As the authors would like to address pre-training in a follow-up, model weights trained on the TotalSegmentator dataset could already be a good starting point for fine-tuning…?

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Including the hardware requirements and training runtimes in the otherwise already well-done and thorough comparison study would be highly beneficial for other research groups. Also, answering the questions raised in the section above could help readers understand the work.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The only real weakness of the work is the missing information regarding training hardware requirements. Otherwise, the benchmark and the performance of the introduced method speak for themselves.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose Swin SMT, a Swin vision transformer that incorporates Soft Mixture-of-Experts (MoE), applied to the whole-body CT organ segmentation task. Experiments evaluate Swin SMT on the large-scale TotalSegmentator-V2 dataset against multiple state-of-the-art models. The results show that Swin SMT improves segmentation performance while also improving inference speed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The proposed method is the first to investigate Soft MoE applied to Swin Transformers for medical segmentation tasks.
    2. Experiments evaluate Swin SMT against multiple state-of-the-art segmentation models, and a significance analysis supports the robustness of the experiments.
    3. The authors present an ablation study that provides more insight into the efficacy of the model components and the efficiency of the various models in terms of model size and inference speed.
    4. The explanation of the proposed method is in general easy to follow, and the figures are self-contained.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The language and paper formatting need refinement; for example, some sentences appear repetitively across sections, and the authors already mention results in the Introduction section. Some abbreviations are not defined before use, for example FFN in Section 2.3 and SwinMT in Section 3.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors will make the implementation public, and the dataset is also available, so the results of Swin SMT should be reproducible. However, the authors might not provide details to reproduce the other models in the evaluation.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In Section 1, the sentence ‘Soft MoE allows for … and inference modes.’ is repeated several times across the abstract and Introduction section.
    2. I suggest not presenting your results in the Introduction section.
    3. The citation of TotalSegmentator-V2 appears to be incorrect. Please refer to https://zenodo.org/records/10047292.
    4. As far as I know, nnU-Net does not allow exchanging the optimizer. Given the statement ‘To enhance the performance parity among the compared methods, we employ the same hyperparameters, including optimizers, learning rates, and learning rate schedulers, following the configurations presented in their original works.’ in the quantitative results section, did the authors also manage to change the hyperparameters for nnU-Net?
    5. The paragraph ‘performance analysis’ in Section 3 could be refactored into a discussion section.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the paper is somewhat sloppy, the proposed method is of interest to Transformer-related medical segmentation research, and the presented experiments are robust.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank all the Reviewers for their insightful comments and constructive suggestions. Below, we address the main concerns raised amid the predominantly positive feedback.

  1. R1 (Clarity problems related to Fig. 1 and Fig. 2) We will resolve those problems in the camera-ready version.
  2. R1 (Limiting factor for the ability of the model to capture global features) Indeed, patch size is the main reason for problems related to capturing global features, but even inside 128**3 patches some models have problems capturing patch-wise global features, which we address with Soft MoE.
  3. R1 (Number of slots in the ablation study) We used s=m/n slots.
  4. R1 (As the best model performance is achieved using n=32 experts?) We used only 32 experts in this article due to computational limitations. In further research, we plan to test more experts to find the number of experts beyond which there is no significant increase in performance.
  5. R3 (The language and paper format needs to refine) We will address those issues in the camera-ready version.
  6. R1 and R4 (GPU resource consumption of the proposed network) Even though we increased the number of experts and extracted global (patch-wise) features, our Swin SMT model took nearly the same time to finish training as the Swin UNETR model. We also have not noticed any appreciable increase in VRAM consumption during training and inference of Swin SMT compared to Swin UNETR (a minimal measurement sketch follows this list).
  7. R4 (Table formatting errors) We will fix those errors in the camera-ready version.




Meta-Review

Meta-review not available, early accepted paper.


