Abstract
Accurate and reliable brain tumor segmentation, particularly when dealing with missing modalities, remains a critical challenge in medical image analysis. Previous studies have not fully resolved the challenges of tumor boundary segmentation insensitivity and feature transfer in the absence of key imaging modalities. In this study, we introduce MST-KDNet, aimed at addressing these critical issues. Our model features Multi-Scale Transformer Knowledge Distillation to effectively capture attention weights at various resolutions, Dual-Mode Logit Distillation to improve the transfer of knowledge, and a Global Style Matching Module that integrates feature matching with adversarial learning. Comprehensive experiments conducted on the BraTS and FeTS 2024 datasets demonstrate that MST-KDNet surpasses current leading methods in both Dice and HD95 scores, particularly in conditions with substantial modality loss. Our approach shows exceptional robustness and generalization potential, making it a promising candidate for real-world clinical applications. Our source code is available at https://anonymous.4open.science/r/MST-KDNet-FB17.
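For reference, the Dice score reported throughout the evaluation measures the voxel overlap between a predicted and a ground-truth segmentation mask. A minimal sketch on binary masks flattened to lists (an illustration of the standard metric, not the paper's code):

```python
def dice_score(pred, target, eps=1e-8):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) on binary masks,
    with a small eps so empty-vs-empty masks score ~1 instead of 0/0."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

# Perfect overlap -> 1.0; one of two foreground voxels matched -> 0.5
print(round(dice_score([1, 1, 0, 0], [1, 1, 0, 0]), 3))  # → 1.0
print(round(dice_score([1, 1, 0, 0], [1, 0, 1, 0]), 3))  # → 0.5
```

HD95, the other reported metric, is the 95th percentile of surface distances between the two masks and penalizes boundary errors that Dice barely registers.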
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2782_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: https://papers.miccai.org/miccai-2025/supp/2782_supp.zip
Link to the Code Repository
https://github.com/Quanato607/MST-KDNet
Link to the Dataset(s)
N/A
BibTex
@InProceedings{ZhuShe_Bridging_MICCAI2025,
author = { Zhu, Shenghao and Chen, Yifei and Chen, Weihong and Wang, Yuanhan and Liu, Chang and Jiang, Shuo and Qin, Feiwei and Wang, Changmiao},
title = { { Bridging the Gap in Missing Modalities: Leveraging Knowledge Distillation and Style Matching for Brain Tumor Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15967},
month = {September},
pages = {97--107}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper presents a segmentation model tailored to the missing-modality scenario. The proposed model can accept an arbitrary subset of modalities and produce corresponding segmentation results. It employs a knowledge distillation approach in which the full-modality segmentation network serves as the teacher model while the missing-modality segmentation network acts as the student. The model is trained with multiple losses that help the student learn from the teacher at multiple semantic feature levels. The effectiveness of this method is validated on two datasets, demonstrating its practical applicability to missing-modality segmentation tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Designs multiple losses that help the student model learn from the teacher model at multiple semantic feature levels.
- Conducts comprehensive experiments to evaluate the performance of the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Method: The paper intends the student model to mimic the teacher model's style and designs a Global Style Matching module for this purpose. In the medical image processing context, "style" usually refers to semantic characteristics of the target, such as the gray-level distribution, but what "style" means in this paper is not clearly defined.
- Undeclared and inconsistent symbols: Section 2.1: why does the spatial dimension of the initial features become HWD/(4 × 4 × 4) instead of HWD/(8 × 8 × 8) after three convolutional downsampling layers, and how is this feature, once transformed into a 1D sequence, further divided into chunks? Section 2.2: are A_max, A_min, and A_mean in Equation 2 computed from the same attention weights? If so, why do A_max and A_min take A_s as input while A_mean takes A_i? The L_mst of Equation 4 does not appear in Figure 1. Section 2.3: the variable N does not appear in Equation 5; should 1/2 be replaced with 1/N? The meaning of the index j in Equation 5 is not explained. Section 2.4: G_dsc in Equation 10 is not explained, and θ in Equation 11 is not explained. Section 2.5: it is recommended that L_dice in Equation 12 be unified with L_seg in Figure 1.
- Errors: Section 3.1: the BraTS 2024 dataset has multiple sub-tasks; if Task 1 is used, the dataset should contain more than 1080 cases, and according to the official documentation no other sub-dataset contains 1080 samples either. The sentence at the end of the first paragraph, "Additionally, data augmentation techniques, such as random flips, rotations, and cropping.", is incomplete. Section 3.2: there are 15 modality combinations, not 16, and there is a highlighting error in column 8 of Table 1.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
In terms of writing, the symbols in the figure and the method should be consistent, and the symbols in the formula should be explained in more detail to make the article easier to understand.
An introduction to the method's design motivation should be included, e.g., what feature characteristics the mean, maximum, and minimum values in L_MS-TKD correspond to. What does "global style" refer to in Section 2.4? Since f_enc, f_dec, and f_t are separated by only one encoder layer, is there a reasonable correlation and difference between their features to ensure meaningful feature learning? The lack of such explanations makes the method appear over-designed.
Although this paper conducts sufficient experiments on two datasets, it lacks discussion and analysis of the experimental results, possibly due to length limits. Moreover, the FeTS 2024 dataset is derived from the BraTS dataset, with institutional sources additionally annotated for federated learning, so the experimental results on the two datasets are largely redundant. It is recommended to replace the FeTS 2024 results with discussion to make the experimental findings more reliable.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is difficult to understand due to undeclared and inconsistent symbols. The paper constrains the features between teacher and student at multiple spatial scales and semantic levels, but does not explain what "style" is or how these features correspond to style, which is confusing. Although sufficient experiments are conducted, corresponding analysis and discussion are lacking.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors’ answers to my previous concerns were satisfactory for the most part. Although this method provides extensive experiments on two datasets, it lacks analysis of the experimental results, which I believe is equally important as the presentation of the results. Especially since the FeTS dataset is derived from the BraTS dataset, and there is no significant distribution bias between the two datasets, the experimental results on both datasets should have exhibited similar metrics.
Review #2
- Please describe the contribution of the paper
The manuscript proposes a transformer-based knowledge distillation framework tailored for brain tumor segmentation under scenarios with missing MRI modalities. The key contributions include a Dual-Modal Logit Distillation (DMLD) module and a Global Style Matching Module, both designed to enhance the distillation process and improve the segmentation performance when input MRI sequences are incomplete.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The manuscript addresses an important real-world challenge in medical imaging: performing robust segmentation in the presence of missing modalities.
- The integration of transformer-based architecture with a knowledge distillation framework is well-motivated.
- The proposed modules (DMLD and Global Style Matching) are interesting and show potential in improving cross-modal knowledge adaptation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- This work lacks a direct comparison with, or a discussion of its advantages over, existing methods that use knowledge distillation [1–3] or image synthesis/generative models [4] to address missing MRI sequences in BraTS. Including this would strengthen the novelty claim and contextual relevance.
- The architecture diagram (Figure 1) lacks clarity, particularly in distinguishing between the teacher and student models. Visually separating them would improve understanding.
- In Equation (6), the symbol σ is used; it should denote a standard deviation, not a variance. Please check.
- The manuscript does not report the weighting coefficients (λ) used to balance different loss components. Providing these values is essential for reproducibility and understanding the contribution of each module.
- It is unclear how the proposed KD framework behaves when all MRI modalities are available, as shown in Figure 2 and Table 1. Since this is the standard scenario, the proposed KD student model should be compared against the full-modality setting. Also, it would be helpful to add qualitative results for the FeTS 2024 dataset.
- Page 6: The sentence starting with “Additionally, data augmentation techniques…” appears to be grammatically incomplete and should be revised for clarity.
- (Optional): If space allows, it would be valuable to include visualization of feature maps (e.g., Grad-CAM or t-SNE) after each proposed module. This would help illustrate how the modules contribute to representation learning and segmentation quality.
[1] Chen et al., 2021. Learning with privileged multimodal knowledge for unimodal segmentation. IEEE transactions on medical imaging, 41(3), pp.621-632.
[2] Choi et al., 2023. A single stage knowledge distillation network for brain tumor segmentation on limited MR image modalities. Computer Methods and Programs in Biomedicine, 240, p.107644.
[3] Hu et al., 2020. Knowledge distillation from multi-modal to mono-modal segmentation networks. In MICCAI 2020: Part I 23 (pp. 772-781).
[4] Al-Fakih et al., 2024. FLAIR MRI sequence synthesis using squeeze attention generative model for reliable brain tumor segmentation. Alexandria Engineering Journal, 99, pp.108-123.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The work presents a promising approach to address a clinically relevant problem. The results are convincing. However, several areas—particularly clarity of the architectural design, missing baseline comparisons, and clarification of experimental details—require revision.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors clarified and addressed my comments.
Review #3
- Please describe the contribution of the paper
This paper proposes a novel deep learning framework for multimodal brain tumor segmentation. The network (MST-KDNet) is based on a vision transformer. The authors address the challenging scenario of missing image modality by designing a knowledge distillation approach in which the teacher network has access to all modalities while the student network has access to only a subset of the data. The authors propose multi-scale knowledge distillation and logit standardization to improve teacher knowledge transfer. A discriminator is also integrated to align the features of the teacher and student networks. The method is validated on two publicly available datasets BraTS 2024 and FeTS 2024, and shows improved performance compared to state-of-the-art multimodal segmentation models.
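The teacher–student logit distillation with logit standardization described above can be sketched in a few lines of plain Python. This is a hedged illustration of the general technique, not the authors' implementation; the temperature T and class count are arbitrary:

```python
import math

def standardize(logits):
    """Z-score a logit vector (the logit-standardization trick):
    subtract the mean and divide by the standard deviation."""
    mu = sum(logits) / len(logits)
    sd = math.sqrt(sum((z - mu) ** 2 for z in logits) / len(logits)) + 1e-8
    return [(z - mu) / sd for z in logits]

def softmax(logits, T=1.0):
    """Temperature-scaled softmax, numerically stabilized."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on standardized, temperature-scaled logits."""
    p = softmax(standardize(teacher_logits), T)
    q = softmax(standardize(student_logits), T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# After standardization, a student whose logits are an affine copy of the
# teacher's (same ranking, different scale and offset) incurs ~zero loss:
teacher = [2.0, 0.5, -1.0]
student = [5.0, 2.0, -1.0]  # = 2 * teacher + 1
print(round(abs(kd_loss(student, teacher)), 6))  # → 0.0
```

The point of the standardization step is exactly what the example shows: the student is rewarded for matching the teacher's logit *ranking and shape* rather than its absolute scale.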
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This paper provides several incremental methodological contributions to multimodal learning, in particular: (1) logit standardization for knowledge distillation, adapted from [17]; (2) adversarial learning to match teacher and student features. More generally, multimodal learning is highly clinically relevant, as different modalities typically encode complementary information. The proposed model can easily be applied to other regions of interest (not only the brain) and other modalities (not only different MRI sequences, but also CT or PET). Finally, the paper presents a fair evaluation of the method: comparison with the state of the art on two datasets and an ablation study of the proposed modules.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The lack of statistical analysis of the results does not allow the reader to conclude that the gains of the proposed method over the state-of-the-art are significant. The effect of the many components of the proposed loss on segmentation prediction is not clear from the ablation study.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Methodology:
- In equation 2, it is not clear what the variable s used in min and max corresponds to.
- In section 2.4, why are the encoder, transformer and decoder features reshaped into 2D tensors? And what is the feature fusion operation: element-wise multiplication or matrix multiplication?
Experiments:
- The authors should report the values of the different weight loss hyperparameters.
- What is the probability of masking each modality during training of the student network?
Results:
- The authors should perform a statistical analysis of the results to ensure the superiority of the proposed method (Tables 1, 2 and 3).
- Similarly the authors should perform a statistical analysis of the ablation study to demonstrate the significance of each term of the proposed loss (Table 4). From Figure 2, it is also difficult to see what are the benefits/impacts of the different modules (MS-TKD, SKLD, and GSM) on the segmentation outputs.
- From Tables 1, 2 and 3, the authors should comment on which modality is most useful for segmenting brain tumor tissue, and whether this matches what is expected.
- In order to assess the benefit of multimodal learning, it would be beneficial to compare the results of the multimodal models with unimodal approaches such as nn-UNets trained on each modality separately (Isensee, F., Jaeger, P.F., Kohl, S.A.A. et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 18, 203–211 (2021). https://doi.org/10.1038/s41592-020-01008-z).
Recommendation for future work:
- Explore the performance on other multimodal segmentation problems such as the MICCAI HECKTOR challenge (https://hecktor.grand-challenge.org/). Head and neck tumors segmentation in CT and PET imaging.
Misc.:
- In Equation 6, there is a 1 (one) instead of l in the definition of Z(l).
- In Equation 10, spelling error: the variable G_{dsc}^{T} should be G_{dec}^{T}.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Multimodal learning and missing modality approaches are relevant and of interest to the MICCAI community. The paper proposes a novel framework based on knowledge distillation with solid evaluation on two datasets.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed all the technical issues raised by the reviewers. The experiments are robust and the method is novel; the paper is suitable for MICCAI.
Author Feedback
Answer for Reviewer#1 R1-Q1: Lack of comparative ablation analysis and advantage elaboration. Due to space constraints, we streamlined the descriptions in the manuscript, but we have added the detailed comparison to our GitHub repository for interested readers.
R1-Q2, 3, 4, 6: Model diagrams, symbols, and formula issues. In Fig. 1 we use different arrow colors to distinguish the forward passes of the teacher and student models, and we re-elaborate on them in the baseline part of the Method section for readers' convenience. The symbol "σ" denotes the standard deviation; this was a clerical error in the text. The λ values are given in the code, with detailed comments added in the corresponding code areas. The incomplete sentence has been revised to "Additionally, we applied data augmentation techniques, such as random flips, rotations, and cropping."
R1-Q5: Results of KD with full modalities and a qualitative description of FeTS. When computing results for the case where all four modalities are available, we used the student model, not the teacher model. The original presentation may have caused this misunderstanding, and we have revised the paper to avoid confusing readers. A detailed qualitative description of FeTS has been added to our GitHub repository.
Answer for Reviewer#2 R2-Q1: "Style" is not clearly defined. The term "style" in the text refers to the global texture and intensity distribution of the student and teacher models in feature space. We have explained this in the latest revision of the paper.
R2-Q2: Inconsistency between symbols and formula descriptions. The following has been changed: (2.1) We made a clerical error in writing HWD/(4 × 4 × 4); after three convolutional downsampling layers it should be HWD/(8 × 8 × 8). We flatten the blocks into vectors and concatenate the vectors in order into a one-dimensional sequence. (2.2) These are computed from the same attention weights; A_s is the ensemble representation of A_i, used to save space. (2.3) We have changed 1/2 to 1/N, and j is the pixel index. (2.4) G_dsc has been changed to G_dec, and θ/2 has been changed to θ/(4N²), where θ = 0.0001 and N is the total number of pixels. (2.5) L_dice has been changed to L_seg.
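As a concrete illustration of the Gram-matrix style matching these responses describe, where second-order channel correlations are compared between teacher and student features under a θ/(4N²) scaling, here is a hedged pure-Python sketch. The helper names and the tiny feature maps are my own, not from the paper:

```python
def gram(features):
    """Gram matrix of a feature map given as C channel rows,
    each a flattened list of N activations: G[i][j] = <f_i, f_j>."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]

def style_loss(f_teacher, f_student, theta=1e-4):
    """Squared Frobenius distance between the two Gram matrices,
    scaled by theta / (4 * N^2) with N the number of spatial positions
    (mirroring the theta/(4N^2) factor mentioned in the rebuttal)."""
    n = len(f_teacher[0])
    gt, gs = gram(f_teacher), gram(f_student)
    sq = sum((a - b) ** 2
             for rt, rs in zip(gt, gs)
             for a, b in zip(rt, rs))
    return theta * sq / (4 * n ** 2)

ft = [[1.0, 2.0], [0.0, 1.0]]   # teacher features: 2 channels x 2 positions
fs = [[1.0, 2.0], [0.0, 1.0]]   # identical student features
print(style_loss(ft, fs))       # identical styles -> 0.0
```

Because the Gram matrix discards spatial arrangement and keeps only channel co-activation statistics, a loss of this form matches the "global texture and intensity distribution" reading of style given in R2-Q1.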
R2-Q3: Dataset description and wording issues. (3.1) 1080 is the size of the training split under our 8:2 ratio; this was a mistake in the paper, and the latest revision reports 1350. (3.2) The "16" and the color labeling in the paper have been corrected; for the data augmentation sentence, see R1-Q6.
R2-Q4: Optional comments. We have revised the method for notational consistency, added detailed symbol explanations, and described the design motivation in the Method section; for the explanation of global "style", see R2-Q1. f_enc, f_dec, and f_t complement each other by providing local encoding, global self-attention, and detailed decoding, respectively.
Answer for Reviewer#3 R3-Q1: Inadequate experimental analysis. Because of space constraints, we put this section in the GitHub repository. We statistically analyzed the differences in model performance on the Dice and HD95 metrics and elucidated the effects of the different loss functions.
R3-Q2: Optional comments. The following has been modified: (Methodology) A_s is explained in R2-Q2; we characterize second-order statistical correlations between the different features and "calibrate" the style by combining adversarial learning with the MSE loss. (Experiments) The use of matrix multiplication and the hyperparameter values are annotated in the GitHub code. (Results) See R3-Q1. (Misc.) We have fixed all the errors in the equations.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The reviewers have raised several important questions and concerns regarding the clarity of the method, consistency of notation, and the need for more detailed experimental analysis. The paper also omits some important and closely related literature, e.g., ShaSpec, and should show an empirical comparison. The authors should carefully address these points.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper proposes a transformer-based knowledge distillation framework for brain tumor segmentation under missing MRI modality conditions.
While some concerns were raised regarding notation clarity, definition of “style,” and depth of analysis, the rebuttal satisfactorily addressed these issues. All reviewers updated their scores to acceptance, recognizing the method’s soundness, novelty, and practical relevance. I recommend acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This work presents enough technical contributions and meets the bar of MICCAI.