Abstract

Medical imaging data is inherently heterogeneous across different modalities and clinical centers, posing unique challenges for developing generalizable foundation models. Conventional approaches entail training distinct models per dataset or using a shared encoder with modality-specific decoders. However, these approaches incur heavy computational overheads and suffer from poor scalability. To address these limitations, we propose the Medical Multi-Modal Mixture of Experts (M4oE) framework, leveraging the SwinUNet architecture. Specifically, M4oE comprises modality-specific experts, each separately initialized to learn features encoding domain knowledge. Subsequently, a gating network is integrated during fine-tuning to dynamically modulate each expert’s contribution to the collective predictions. This enhances model interpretability and generalization ability while retaining expert specialization. At the same time, the M4oE architecture improves the model’s parallel processing capabilities and allows it to adapt to new modalities with ease. Experiments across three modalities show that M4oE outperforms STU-Net-L by 3.45%, MED3D by 5.11%, and SAM-Med2D by 11.93% across the MICCAI FLARE22, AMOS2022, and ATLAS2023 datasets. Moreover, M4oE reduces training duration by 7 hours while maintaining a parameter count that is only 30% of that of the compared methods. The code is available at https://github.com/JefferyJiang-YF/M4oE.
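The gating idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain softmax gate over fixed per-expert feature vectors, and the names `softmax`, `gated_mixture`, `expert_outputs`, and `gate_scores` are hypothetical. In M4oE the experts and the gate are networks inside a SwinUNet backbone; here they are reduced to small lists so the weighting step is visible.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_mixture(expert_outputs, gate_scores):
    """Weight each expert's feature vector by its gate probability and sum.

    Assumption: one scalar gate score per expert; the real gate is a learned
    network whose scores depend on the input.
    """
    weights = softmax(gate_scores)
    dim = len(expert_outputs[0])
    mixed = [0.0] * dim
    for w, out in zip(weights, expert_outputs):
        for i in range(dim):
            mixed[i] += w * out[i]
    return mixed, weights


# Three hypothetical modality experts (for example CT, MRI, CE-MRI),
# each emitting a 2-dimensional feature vector.
expert_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
gate_scores = [2.0, 0.0, 0.0]  # the gate favours the first expert
mixed, weights = gated_mixture(expert_outputs, gate_scores)
```

Because the gate weights sum to one, they can be read directly as each expert's share of the prediction, which is one way to understand the interpretability claim.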

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1472_paper.pdf

SharedIt Link: https://rdcu.be/dY6kW

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_58

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1472_supp.pdf

Link to the Code Repository

https://github.com/JefferyJiang-YF/M4oE

Link to the Dataset(s)

https://amos22.grand-challenge.org/Dataset/

https://atlas.grand-challenge.org/

https://flare22.grand-challenge.org/

BibTex

@InProceedings{Jia_M4oE_MICCAI2024,
        author = {Jiang, Yufeng and Shen, Yiqing},
        title = {{M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {621--631}
}

Reviews

Review #1

Please describe the contribution of the paper

This work introduces the Medical Multimodal Mixture of Experts (M4oE) framework, a new approach to medical multimodal image segmentation that addresses the challenge of data heterogeneity across different imaging modalities. It is built on the SwinUNet architecture and employs modality-specific experts combined with a gating network to dynamically select the most relevant features for segmentation tasks. The proposed method demonstrates good performance over existing models on multiple medical image segmentation datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) The target of the work that aims to build a unified foundation model to handle heterogeneous multimodal medical data is promising and important for future medical computing and diagnosis.

(2) The proposed M4oE framework is technically sound, with a well-thought-out approach that includes a two-phase training strategy and a dynamic selection mechanism for handling varying class numbers across modalities. The incorporation of a gating network for feature modulation is rational to enhance the model’s adaptability and interpretability.

(3) Extensive experiments across multiple benchmarks verify the effectiveness of the proposed method.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

(1) The main design, the mixture of experts with a gated module, shares similar ideas with some previous dynamic networks, such as [1]. The major differences and novelty should be further illustrated.

(2) In Table 2, the competitor STU-Net-L outperforms the proposed method on both the AMOS-CT and AMOSCT+MRI datasets, and is only slightly inferior to M4oE on FLARE22. This does not fully verify the effectiveness of the multi-modal foundation model and runs counter to the original target of a “foundation model”.

[1] Yanwei Li et al., “Learning Dynamic Routing for Semantic Segmentation”, in Proc. CVPR 2020.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

I believe the experimental results presented in this paper are reproducible. The authors have provided detailed descriptions of their methodology, including the experimental setup, dataset used, and implementation details.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

Please kindly refer to the weakness part.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The merits outweigh the drawbacks, so I currently tend to accept.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper

The paper proposes an innovative foundation model for multimodal medical image segmentation based on a mixture of experts, named M4oE. The authors propose a two-phase training scheme with modality-specific experts, a fusing gating network, and a shuffle-adaptable linear projection architecture for multi-modality and label mapping.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

M4oE achieved competitive performance on multimodal medical datasets and showed promising transferability, with reduced need for reconfiguration across different modalities.

The paper is overall well-written and easy to follow. The gating mechanism to dynamically combine the expert outputs is interesting.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. It is not clear how the proposed model is a foundation model. By definition, foundation models are trained on a huge amount of data and fine-tuned on specific datasets. It is not clear that the datasets used are large enough to call this a foundation model.
2. The authors need to provide more rationale behind the selection of baselines. Are they SOTA models for multimodal medical image segmentation? Also, except for SAM-Med2D, are the other models foundation models?
3. The results in Table 2 should be backed by pairwise statistical comparison between the proposed method and each of the baselines.
4. Minor comment: M4oE is a self-supervised model. I am curious why the authors didn’t include any SOTA self-supervised models as part of the baselines.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Do you have any additional comments regarding the paper’s reproducibility?

No
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

Please see my detailed comments in the weakness section.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper has sufficient methodological contributions with minor flaws in the experiments. I am happy to change my rating from weak accept to accept based on authors responses to my concerns as listed above.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The paper proposes an innovative foundation model named M4oE, which allows efficient scaling to large heterogeneous datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The proposed method is a universal model that seems adaptable to more kinds of medical images than other foundation models. The article also puts forward many novel and interesting ideas, such as two-phase training and shuffle-adaptable linear projection.
2. The overall logic and structure of the article are very clear, making it easy for readers to follow.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

It may be possible to use more types of medical image data, such as endoscopic imaging, retinal imaging, etc.
Please rate the clarity and organization of this paper

Excellent
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
1. It may be possible to use more types of medical image data, such as endoscopic imaging, retinal imaging, etc.
2. In this article, the experts distinguish between imaging modalities (CT, CE-MRI, MRI). Would it be possible to make the experts more granular, for example specific to individual tasks?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Strong Accept — must be accepted due to excellence (6)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The overall logic and structure of the article are very clear, and the method is novel enough.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Author Feedback

Reviewer #3 Q1: Regarding the major differences and novelty of our method compared to the Dynamic Routing paper, could you please provide further clarification?

A: The major differences and novelty of our method compared to the Dynamic Routing paper are as follows:

Methodology and Architecture: The Dynamic Routing paper focuses on dynamic routing for general semantic segmentation in computer vision, while our M4oE paper introduces a novel approach for multi-modal medical image segmentation using the Swin Transformer. Our method specifically addresses the challenges of multi-modal medical imaging and incorporates the M4oE to enhance representation capacity for different modalities.

Technical Focus and Application Domain: While the Dynamic Routing paper covers computer vision tasks in general, our M4oE paper specifically targets multi-modal medical image segmentation. This domain involves handling data from different imaging modalities and integrating information from these modalities to improve segmentation accuracy.

Model Components and Training Strategies: The Dynamic Routing paper proposes a model with a basic unit for feature aggregation and a gate unit for path selection. In contrast, our M4oE paper introduces a model that consists of a Swin Transformer backbone network and an M4oE component comprising a gate network and modality-specific expert networks. Our training strategy involves expert pre-training and fine-tuning of the gate network.
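The two-stage strategy mentioned above (expert pre-training, then fine-tuning of the gate network) can be sketched on a toy problem. This is only an illustration under strong assumptions, not the authors' training code: each "expert" is a single scalar weight fitted by gradient descent on its own modality's data, and the "gate" is a vector of logits trained afterwards with the experts held fixed. All names (`pretrain_expert`, `finetune_gate`, `mixture_loss`, the synthetic data) are hypothetical.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def pretrain_expert(data, lr=0.1, steps=200):
    """Phase 1: fit one scalar expert y ~ w * x on a single modality's data."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def mixture_loss(experts, logits, data):
    """Mean squared error of the gate-weighted mixture of frozen experts."""
    p = softmax(logits)
    mix = sum(pk * wk for pk, wk in zip(p, experts))
    return sum((mix * x - y) ** 2 for x, y in data) / len(data)


def finetune_gate(experts, data, lr=0.5, steps=300):
    """Phase 2: the experts stay frozen; only the gate logits are updated."""
    logits = [0.0] * len(experts)
    for _ in range(steps):
        p = softmax(logits)
        mix = sum(pk * wk for pk, wk in zip(p, experts))
        # Chain rule through the softmax: d(mix)/d(logit_j) = p_j * (w_j - mix)
        dmix = sum(2 * (mix * x - y) * x for x, y in data) / len(data)
        logits = [l - lr * dmix * pj * (wj - mix)
                  for l, pj, wj in zip(logits, p, experts)]
    return logits


xs = (0.5, 1.0, 1.5)
ct_data = [(x, 1.0 * x) for x in xs]     # synthetic "modality A"
mri_data = [(x, 3.0 * x) for x in xs]    # synthetic "modality B"
experts = [pretrain_expert(ct_data), pretrain_expert(mri_data)]

mixed_data = [(x, 2.5 * x) for x in xs]  # synthetic data needing both experts
loss_before = mixture_loss(experts, [0.0, 0.0], mixed_data)
gate_logits = finetune_gate(experts, mixed_data)
loss_after = mixture_loss(experts, gate_logits, mixed_data)
gate_weights = softmax(gate_logits)
```

In this toy setting the second phase only learns how to weight the two frozen experts, which keeps the number of trainable values in that phase small.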

Q2: From Table 2, it appears that the STU-Net-L baseline outperforms our proposed method on both AMOS-CT and AMOSCT+MRI datasets, and is only slightly inferior to M4oE on FLARE22. Could you explain this contradiction to the original target of a “foundation model”?

A: While STU-Net-L performs better than our proposed method on the given datasets, it is important to consider the context and limitations of the comparison. STU-Net-L is trained on a single dataset in each task, while our method leverages multiple datasets for training in multi-tasks. Models trained on a single dataset tend to capture subtle features specific to that modality, while utilizing multiple datasets emphasizes the relative balance of features across different modalities. Additionally, the significant difference in model size between STU-Net-L and our model can impact performance. Comparing our multi-modal foundation model directly to STU-Net-L might not provide a fair assessment of the effectiveness of our approach. We appreciate your feedback and the opportunity to address these concerns.

Reviewer #4:

Q1. How is the proposed model considered a foundation model if the sample size of the datasets used is not extensive enough?

A: The current sample size of our datasets may not meet the criteria of a traditional foundation model due to computational constraints. However, we plan to address this limitation in future work by scaling up M4oE, incorporating more data, and increasing the model’s capacity to align with the definition of a foundation model.

Q2. Could the authors provide more rationale for selecting the baselines and clarify if they are state-of-the-art models for multimodal medical image segmentation?

A: The baselines were chosen based on their relevance to multimodal semantic segmentation and the availability of evaluation datasets. While not all baselines are specifically categorized as foundation models, they serve as suitable benchmarks for comparison in the context of multimodal medical image segmentation.

Q4. Why didn’t the authors include any unsupervised baselines in their comparison, considering that M4oE is a self-supervised model?

A: Although our focus was on supervised methods in this study, we appreciate the suggestion of including unsupervised baselines in the comparison. This addition will provide insights into the performance gap between supervised and unsupervised methods. We will address this point in the revised manuscript to offer a more comprehensive evaluation.

Meta-Review

Meta-review not available, early accepted paper.


M4oE: A Foundation Model for Medical Multimodal Image Segmentation with Mixture of Experts

Author(s):