Abstract
Semi-supervised learning (SSL) has emerged as an effective approach to reduce reliance on expensive labeled data by leveraging large amounts of unlabeled data.
However, existing SSL methods predominantly focus on visual data in isolation. Although text-enhanced SSL approaches integrate supplementary textual information, they still treat image-text pairs independently. In this paper, we explore the potential of jointly learning from related text-image datasets to further advance the capabilities of SSL.
To this end, we introduce TextMoE, a novel text-enhanced Mixture-of-Experts (MoE) model for semi-supervised medical image segmentation.
TextMoE incorporates a universal vision encoder and a text-assisted MoE (TMoE) decoder, enabling it to simultaneously process CT-text and X-Ray-text data within a unified framework.
To achieve effective knowledge integration from heterogeneous unlabeled data, a content regularization based on frequency space exchange is designed, guiding TextMoE to learn modality-invariant representations. Additionally, the proposed TMoE decoder is enhanced by modality indicators, ensuring the effective fusion of visual and textual features. Finally, a differential loss is introduced to diversify the semantic understanding among the visual experts, so that they contribute complementary insights to the overall interpretation.
Experiments conducted on two public datasets demonstrate that TextMoE outperforms both existing SSL and text-assisted SSL methods.
Code is available at: https://github.com/jgfiuuuu/TextMoE.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1870_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/jgfiuuuu/TextMoE
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZenQin_Exploring_MICCAI2025,
author = { Zeng, Qingjie and Luo, Huan and Ma, Xinke and Lu, Zilin and Hu, Yang and Xia, Yong},
title = { { Exploring Text-enhanced Mixture-of-Experts for Semi-supervised Medical Image Segmentation with Composite Data } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {229 -- 239}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a unified text-augmented mixture-of-experts model (TextMoE) for handling heterogeneous image-text data to achieve semi-supervised medical image segmentation. TextMoE integrates a visual encoder and a text encoder, while incorporating a frequency swapping method and using modality indicators to guide weights. It demonstrates excellent performance on two public medical datasets.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- TextMoE, as a unified model, can directly handle heterogeneous data, making it worth promoting in medical federated scenarios.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In the ablation study section, Plain-MoE appears abruptly, and I am unclear about its origin. Additionally, while your TMoE is highly effective, it includes numerous modules; why was there no validation of its individual submodules? I would like to understand whether the integration of each submodule, such as frequency swapping and modality indicators, is effective.
- The paper’s structure lacks logical clarity, the framework diagram is not intuitively explained, and TextMoE’s numerous modules required significant time for me to comprehend.
- DuSSS, also a semi-supervised medical vision-language model, similarly incorporates text-guided pseudo-label generation, yet your experimental results significantly outperform DuSSS, which makes me skeptical.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes TextMoE, which integrates multiple submodules to handle heterogeneous data, but the method’s design is overly complex and lacks clear structural motivation and theoretical support.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed most of my concerns. Although some descriptions in the paper remain unclear, I believe further improvements can lead to a more satisfactory outcome.
Review #2
- Please describe the contribution of the paper
This paper introduces a novel text-enhanced Mixture-of-Experts (MoE) model, augmented with textual information, for semi-supervised medical image segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- TextMoE incorporates a universal vision encoder and a text-assisted MoE (TMoE) decoder, enabling it to simultaneously process CT-text and X-Ray-text data within a unified framework.
- A content regularization with frequency space exchange is designed, guiding TextMoE to learn modality-invariant representations.
- A differential loss is introduced to diversify the semantic understanding between visual experts, ensuring complementary insights to the overall interpretation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- 1. How is the teacher-student network defined in the TextMoE framework? What is the difference between the two branches? From the article's perspective, this looks more like a Siamese network. 2. The gating weights of the Modality Indicator are generated from features; how does the model guarantee accurate class classification?
- In the TextMoE framework, text enhancement is realized by adding text features to the visual decoder for interaction. In the ablation experiments, the influence of the text modality in this process needs to be compared.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Clear motivation and innovative method.
- The experiments and writing are complete, and the reported experimental results are good, but the experiments are not sufficient.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The authors did not address my questions clearly. First, the design of the teacher-student network in this task is very strange, and its motivation has not been clearly expounded. Second, the role of the Modality Indicator has not been verified or explained either. Finally, the authors explain that when using 25% labels the model's metrics decrease compared to using 50% labels, but this does not demonstrate the necessity of the text modality. In many tasks, adding a text modality does not necessarily improve model performance. The authors deliberately avoid this issue.
Review #3
- Please describe the contribution of the paper
In the paper, authors propose a novel text-enhanced Mixture-of-Experts (MoE) model, TextMoE, which jointly processes mixed CT-text and X-ray-text data within a unified framework for semi-supervised medical image segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1) The motivation and related work are clearly described. The authors effectively convey the approaches, limitations, and challenges of traditional SSL methods and text-enhanced SSL methods for this task. This further leads to the authors' core contribution: jointly processing mixed CT-text and X-ray-text data within a unified framework.
2) The proposed method is novel and solid. The authors propose a novel text-enhanced Mixture-of-Experts (MoE) model, TextMoE, which jointly processes mixed CT-text and X-ray-text data within a unified framework for semi-supervised medical image segmentation. Specifically, they propose a content regularization loss and modality indicators in the TMoE decoder to effectively mitigate the modality conflict between the CT and X-ray modalities.
3) The experiments appear to be soundly conducted, with thorough ablation studies and comparisons to relevant prior work.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1) It seems that the paper does not provide detailed information about the network's architecture, including the total number of layers, what each layer consists of, and the specific structures of the universal vision encoder, the TMoE decoder, and the text encoder. Additionally, in the implementation details, the authors do not report the number of training epochs or the batch size used.
2) I have some confusion regarding Section 2.3: Content Regularization based on Frequency Space Exchange. The authors mention that the exchanged image combines the content of x_{i}^{u} with the style of x_{j}^{u} from a different modality. Could you provide more details on how the exchange step is specifically carried out?
3) The authors mention that only the student model is used for inference, whereas, as far as I know, the teacher model usually performs better [1]. What is the consideration here for choosing the student model for inference? If the teacher model were used for inference, would the results be better? [1] Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. NIPS 2017
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Except for the above mentioned suggestions, I would like to discuss the content of “Weights Assignment with Modality Indicator”. I think that the introduction of a one-hot modality indicator is intended to enable Exp_V1 and Exp_V2 to focus on information from different modalities. From this perspective, if I simply let Exp_V1 and Exp_V2 handle CT and X-ray tokens respectively, would it achieve a similar effect? In such a setup, experts might become more attuned to features specific to the particular modality, possibly eliminating the need for a differentiated loss to distinguish between features.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The recommendation is based on the strengths and weaknesses outlined above, with the expectation that the authors will address and improve the identified weaknesses.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
My comments have been sufficiently addressed. I would like to thank the authors for their effort.
Author Feedback
We thank all reviewers for their constructive feedback and for acknowledging TextMoE’s clear motivation, novel design, and strong performance. We are pleased that Reviewer #1 noted its “clear motivation and innovative method,” Reviewer #2 praised the clarity of the motivation and related work, calling the method “novel and solid,” and Reviewer #3 highlighted its practical value, stating it is “worthy of promotion.” We address concerns below and will incorporate suggested improvements in the final version.
R#1 Q1-Definition and difference of teacher-student network TextMoE uses a teacher-student framework, not a Siamese network. The teacher is updated via an exponential moving average (EMA) of the student's parameters, while the student is optimized by the training losses.
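For concreteness, a minimal PyTorch-style sketch of the standard EMA teacher update is given below; the decay value 0.99 is a hypothetical placeholder, as the exact schedule is not stated in this feedback.

import torch

@torch.no_grad()
def update_teacher_ema(teacher: torch.nn.Module, student: torch.nn.Module, decay: float = 0.99):
    # Standard mean-teacher update: teacher <- decay * teacher + (1 - decay) * student.
    # Only the student receives gradients; the teacher is never optimized directly.
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)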
Q2-Gating weights for class classification The gating weights are not used for class prediction; rather, they adaptively modulate expert contributions for feature integration. Guided by the modality indicators, the gating network learns to assign task-relevant weights and mitigate modality conflicts.
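As an illustration only (not the authors' exact implementation), such a gating network could take visual tokens concatenated with a one-hot modality indicator and output softmax weights over the visual experts; all names and dimensions below are assumptions.

import torch
import torch.nn as nn

class ModalityAwareGate(nn.Module):
    # Hypothetical gating net: visual features + one-hot modality indicator -> expert weights.
    def __init__(self, feat_dim: int, num_modalities: int = 2, num_experts: int = 2):
        super().__init__()
        self.fc = nn.Linear(feat_dim + num_modalities, num_experts)

    def forward(self, feats: torch.Tensor, modality_onehot: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, C) visual tokens; modality_onehot: (B, 2), e.g. [1, 0] for CT, [0, 1] for X-ray.
        indicator = modality_onehot.unsqueeze(1).expand(-1, feats.size(1), -1)
        gate_input = torch.cat([feats, indicator], dim=-1)
        return torch.softmax(self.fc(gate_input), dim=-1)  # (B, N, num_experts) mixing weights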
Q3-Influence of text modality The impact of the text data can be assessed by replacing our text-guided visual feature integration module with a vision-only self-attention module. Specifically, with 25% labels, Dice scores dropped by 5.35% (from 88.61% to 83.27%) and 1.96% (from 74.15% to 72.19%) on the two datasets, highlighting the necessity of leveraging the text modality to enhance visual understanding.
R#2 Q1-More details We used ConvNeXt (vision), CXR-BERT (text), and a linear projection layer for each expert within the TMoE decoder. The batch size was 16. Early stopping was used instead of a fixed number of epochs.
Q2-Frequency Space Exchange First, the Fourier Transform is applied to unlabeled images, decomposing each image into high-frequency (content) and low-frequency (style) components. Low-frequency components are then swapped between images. Finally, the Inverse Fourier Transform reconstructs images with swapped styles, preserving content. This creates exchanged images that retain the structural content but exhibit the frequency characteristics (style) of a different modality.
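A minimal NumPy sketch of this low-frequency swap is shown below; it assumes two single-channel images of the same size, and the cutoff ratio beta is a hypothetical hyper-parameter rather than a value reported by the authors.

import numpy as np

def frequency_space_exchange(x_i: np.ndarray, x_j: np.ndarray, beta: float = 0.1):
    # Swap low-frequency (style) components between two 2-D images of equal shape,
    # keeping each image's high-frequency (content) components.
    F_i = np.fft.fftshift(np.fft.fft2(x_i))
    F_j = np.fft.fftshift(np.fft.fft2(x_j))

    h, w = x_i.shape
    ch, cw = h // 2, w // 2
    bh, bw = int(h * beta), int(w * beta)
    mask = np.zeros((h, w), dtype=bool)
    mask[ch - bh:ch + bh, cw - bw:cw + bw] = True  # centered low-frequency band

    F_i_new, F_j_new = F_i.copy(), F_j.copy()
    F_i_new[mask], F_j_new[mask] = F_j[mask], F_i[mask]  # exchange styles

    x_i_new = np.real(np.fft.ifft2(np.fft.ifftshift(F_i_new)))
    x_j_new = np.real(np.fft.ifft2(np.fft.ifftshift(F_j_new)))
    return x_i_new, x_j_new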
Q3-Why use the student for testing We use the student for fair comparisons with BCP and DuSSS, which also test with the student. While the teacher may perform slightly better, aligning testing procedures ensures a valid comparison against these baselines.
Q4-Modality indicator vs. modality-specific training Good insight. We conducted an experiment training modality-specific experts (one for CT and one for X-ray) without modality indicators or the differentiated loss. Results show that the Dice scores of modality-specific training (88.24%, 71.84%) are inferior to ours (88.61%, 74.15%) on the two datasets with 25% labels. The reason is that hard assignments limit cross-modal learning, whereas our indicator allows an adaptive focus on modality cues while learning complementary features.
R#3 Q1-Ablations Plain-MoE is from [9]. The efficacy of the content regularization loss (frequency swapping) and the differentiated loss is validated in Table 2. By introducing modality indicators, Dice gains of 0.45% (88.16% vs. 88.61%) and 1.67% (72.48% vs. 74.15%) are observed on the two datasets with 25% labels.
Q2-Clarity of TextMoE TextMoE is a teacher-student framework with: 1) a text-guided MoE block with modality indicators that enhances visual features using textual cues, 2) a content regularization loss (L_cr) for unlabeled data learning via frequency exchange, and 3) a differentiated loss (L_diff) for diverse expert learning. We will improve the diagram and description.
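The exact form of L_diff is not given in this feedback; purely as an illustrative assumption, one common way to encourage diverse expert representations is to penalize the cosine similarity between the two visual experts' features, as sketched below.

import torch
import torch.nn.functional as F

def differentiated_loss(feat_exp1: torch.Tensor, feat_exp2: torch.Tensor) -> torch.Tensor:
    # Hypothetical diversity objective: push the two experts' features apart by
    # penalizing their absolute cosine similarity (0 when the features are orthogonal).
    sim = F.cosine_similarity(feat_exp1.flatten(1), feat_exp2.flatten(1), dim=-1)
    return sim.abs().mean()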
Q3-Significant gains over DuSSS The gains stem from TextMoE's unified design, which enables joint learning across heterogeneous CT-text and X-ray-text modalities within a single framework. This unified approach exposes the model to a wider variety of data distributions and allows it to learn more generalized and robust representations, leading to larger gains, particularly in low-data settings. The source code will be released for reproducibility.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Although there are some minor issues, including the necessity of the text modality, the overall technical contribution meets the bar of MICCAI.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This is a paper with contrasting views. Several concerns raised by one reviewer seem not to be well addressed in the rebuttal, including the clarity/motivation of the teacher-student model and, more importantly, the necessity of adding the text modality to improve the vision-task performance. My assessment is that this paper might be just around/below the bar.