Abstract

Wilms’ tumor (WT) is a prevalent cancer affecting the kidneys of children, and accurate segmentation and prediction of metastasis are vital for treatment planning and prognosis. Current methods for assessing metastasis, such as invasive biopsies and expensive PET-CT scans, hinder their widespread use in clinical settings. Deep learning, especially classification models for 3D data, is currently widely used in tumor metastasis prediction. However, existing models may not have fully accounted for the global significance of cross-sectional slices, and segment-assisted classification frameworks tailored for low-cost clinical CT imaging protocols remain understudied, with systematic validation in clinical settings yet to be comprehensively established. In this study, we propose MT-WilmsNet, a slice-guided multi-task multi-level Transformer fusion network featuring three synergistic components. First, a Wide Reinforced Transformer Feature Pyramid Network integrates multi-scale features to boost preoperative metastasis prediction accuracy. Second, a dedicated UNet-like architecture performs tumor segmentation while providing anatomical context for metastasis analysis. Finally, a global slice attention mechanism combined with multi-level self-distilling transformers emulates radiologists’ cross-slice diagnostic reasoning. Our MT-WilmsNet outperforms many typical classification models for WT metastasis prediction. The source code is available at: https://github.com/wenjing-gg/MT-WilmsNet.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2475_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/wenjing-gg/MT-WilmsNet

Link to the Dataset(s)

We retrospectively constructed a multi-center, annotated Wilms’ tumor CT dataset comprising 197 postoperative pediatric cases. In order to support future research, we have decided not to release our dataset at this time. The ethical considerations pertaining to the dataset are detailed in the “Dataset and Implementation” section of the paper.

BibTex

@InProceedings{ZhuZhu_MTWilmsNet_MICCAI2025,
        author = { Zhu, Zhu and Yu, Wenjing and Ma, Xiaohui and Liu, Shuai and Dong, Jie and Du, Yuxin and Wang, Changmiao and Yu, Gang},
        title = { { MT-WilmsNet: A Multi-Level Transformer Fusion Network for Wilms’ Tumor Segmentation and Metastasis Prediction } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},
        pages = {326 -- 336}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes a novel method for the segmentation and classification of a specific type of tumor (Wilms’ tumor). The method is fairly complex and, component-wise, resembles the YOLO approach to object detection (the YOLO-Med network) in some respects, but it incorporates newer mechanisms ranging from KAN to multi-task learning.

    The method showed superb results over the existing methods on the targeted task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strengths of the paper are:

    1.) Novel neural network model with in-depth analysis of its components and their contributions toward Wilms’ tumor segmentation.

    2.) The components are more or less known as the building blocks, but the way they are merged into a complex architecture with the reasoning behind it is interesting and presents a good foundation for further research.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Major weaknesses of the paper are:

    1.) The dataset is not public, so it is hard to verify how well the model performs on outside tasks.

    2.) Based on my understanding, at inference time the user still needs to provide the VOI to the model, which means that the proposed method requires additional effort from the user. This limitation needs to be addressed.

    3.) The authors emphasize the importance of global slice attention but do not discuss nnU-Net, a method that proposed various mechanisms to capture global connections in the dataset. I would like the authors to reflect on nnU-Net, or on TotalSegmentator, which is based on nnU-Net, because those methods often yield top results on various tasks thanks to their ability to capture the nature of volumetric medical data.

    4.) No limitations of the proposed method are mentioned. For instance: how easy would it be to apply the proposed method to other problems? The input size is relatively small (64x64x64), so if one wants to segment a bigger volume, how big will the model be (parameter-wise)? Is it even feasible to use a bigger input volume?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    I would like authors to discuss and where possible reflect in the manuscript on the following points:

    1) As I mentioned in the third point of the weaknesses, the Authors claim, for instance in the abstract, that there is a lack of consideration of the global significance of slices, whereas the nnU-Net method adjusts the input volume size to capture the most important features for the targeted task. Although it does not operate strictly on a per-slice basis, it still addresses the nature of medical data and, because it does so, achieves top results on many tasks. There is also the novel Medical SAM 2 (https://github.com/SuperMedIntel/Medical-SAM2), which uses a relatively clever way to obtain slice interactions, I would say on a per-slice basis. I believe that the claim made by the Authors is a bit too strong and needs to be softened.

    2) The authors mentioned in the introduction that radiomics-based methods demonstrate notable limitations when applied to other metastatic evaluations of multicentric WT. Can the Authors briefly state what these limitations are?

    3) In the methodology section, the caption of Fig. 1 refers to a student flow and a teacher flow. I would suggest providing more information about these flows, because it is unclear to me what exactly they mean. The same goes for the Auxiliary Loss: from the description given, it is not fully clear how and where the Auxiliary Loss is implemented and computed.

    4) Regarding the dataset, I assume that 197 patients also means 197 CTs/samples. Can the authors confirm this, given that the method was evaluated only on private samples? Also, did the authors address the resolution of the CTs in any way? Resampling and interpolation can significantly impact the quality of the data. Moreover, based on my experience, a 64x64x64 volume is a relatively small patch; what is the median size of the full input volumes (CTs)? The authors mentioned some public datasets such as KiTS23; just to clarify, can they confirm that they did not use the publicly available data in their training/test datasets?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript presents a novel method that was properly evaluated on a private dataset, where it obtained top results. The authors justified the use of every component and building block. I would like the authors to discuss their method more critically and address its potential limitations.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper addresses the clinical problem of segmenting Wilms’ kidney tumors and predicting metastasis from CT images, considered as slices. It describes a new model called multi-task multi-level Transformer fusion network, consisting of a proposed Global Slice Attention module for cross-slice reasoning, a hierarchical 3D Vision Transformer, and two parallel modules: a UNet for segmentation and a proposed Wide Reinforced Transformer Feature Pyramid Network for metastasis prediction. Experiments are performed on a private dataset of 197 patients, collected for the study due to a lack of public datasets (potential datasets are referenced and criticized), with an 80/20 training/testing split. Comparisons are made with 4 radiomics methods for classification (metastasis prediction) and 3 segmentation methods for tumor segmentation. They demonstrate a clear improvement for classification, and a tiny, probably insignificant, improvement for segmentation (with respect to SAM-Med3D).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A clear clinical motivation. The approach to perform both classification and segmentation in parallel is worth being investigated. A solution is proposed that is proven to be effective for the classification task against several other methods. An ablation study is proposed, as well as a brief section about explaining results. The code was released for public use.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weakness concerns the hyperparameters: their values are given, but since no validation set was used, there is a risk of bias, as the hyperparameters could have been fixed to maximize results on the test set. This methodological risk is a major concern. Standard deviations should be provided to assess the statistical significance of the results. Also, the model is rather complex, with many details that are probably left out due to page limitations. But the clarity in general should be improved, especially for the GSA module description.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • the segmentation loss equation is provided, even though it is rather classical, but the auxiliary loss is not, although it is more unusual. At least a reference should redirect to equations to describe it, or a more precise expression should be provided (as is done for the joint loss function).

    • the modality is probably CT, but it should be stated more clearly and imaging parameters should be given when describing the dataset.

    • the dataset description should be more detailed: number of hospitals? annotation process? population statistics?

    • preprocessing: the manual outline of a bounding box is made by a physician. It is unclear whether the bounding box is for the lesion or the kidney? Details are missing, though it probably is for the tumor based on the automatic bounding box determination in the training dataset. But wouldn’t it make more sense and less biased to outline the kidney (models like SAM-Med3D could be used to do it automatically)? Since the input depends on a manual outline, the impact of variations in this manual step should be investigated.

    • VIVIT is adapted to video analysis, not 3D medical image analysis. The authors should better justify its inclusion as a classical classification method in radiomics.

    • while the ablation study demonstrates a gradual improvement in AUC, ACC and F1 scores, it also unveils that the introduction of the GSA module improves sensitivity but degrades specificity, and the introduction of the multi-task modules increases specificity while degrading sensitivity. This counterbalance effect should be discussed.

    Minor

    • please be consistent when reporting results: in 3.2, quantitative comparison, a 13% improvement in AUC is reported as well as a 16% relative increase in F1 score, while the actual improvement is 0.12
    • D,H,W described in 2.4 but already used in 2.2
    • $\sigma$ in equation 1 is not the same as $\sigma$ in equation 2
    • $\gamma, p, t$ in equation 2 are not introduced
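
To make the reporting-consistency point above concrete, the following minimal sketch contrasts an absolute gain (in metric points) with a relative gain (as a fraction of the baseline). The scores used are hypothetical values chosen only for illustration; they are not taken from the paper, and the function names are likewise illustrative.

```python
def absolute_gain(baseline: float, proposed: float) -> float:
    """Difference in metric points between the proposed and baseline scores."""
    return proposed - baseline


def relative_gain(baseline: float, proposed: float) -> float:
    """Gain expressed as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (proposed - baseline) / baseline


# Hypothetical scores (NOT from the paper), used only to show the two conventions.
baseline_f1, proposed_f1 = 0.75, 0.87
print(f"absolute gain: {absolute_gain(baseline_f1, proposed_f1):+.2f}")
print(f"relative gain: {relative_gain(baseline_f1, proposed_f1):+.1%}")
```

With these assumed numbers, the same change reads as 0.12 points in one convention and as a 16% increase in the other, which is why a results section is easier to follow when it sticks to one convention or labels each figure explicitly.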
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A clear performance improvement with an innovative model for metastasis prediction. Not so much for segmentation. The model is complex and the ablation study is not so absolutely convincing of the importance and impact of all subtleties.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The study proposes an end-to-end pipeline to segment and classify 3D medical image input (case study: CT images of Wilms’ tumors).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The code repository is very well structured and documented
    • The technical description is clear and very well detailed
    • The results are promising
    • The ablation study is complete and validates the design
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The big picture is masked by the technical details
    • Only one dataset was used
    • The dataset is relatively small and the authors did not provide a comprehensive description for it
    • [major] everything is based on a single train/test split, which may not be reproducible
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    As a reviewer, my number one goal is to help you improve your manuscript rather than judge if it should be published or not. I am aware of how difficult it can be to prepare a manuscript, and I wish you good luck with this publication. Please feel free to disagree with any comment that I add, as no one knows your research better than you. If you find my style of review helpful, please use it when you serve as a reviewer.

    [minor] Please update your references. The first reference in the intro is number 6.

    Please revise the manuscript and highlight that you work with CT images. Other than Fig 1, it is not clear which modality you focus on.

    Either you introduce a new algorithm or you focus on a specific clinical problem. If your goal is to introduce MT-WilmsNet as a new model that can perform segmentation and classification, please test it on more datasets (I dislike it when reviewers ask for more datasets, because it is easy to say that for any study. Nonetheless, a one-time train/test split with a small dataset can be random). If you want to focus on WT, then you should ensure it is reproducible. You already have an 80/20 data split. It should be doable to turn it into a 5-fold CV. You can also use Monte Carlo data splits and repeat your experiments 10-15 times with different seeds, which is technically easier.
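
The Monte Carlo option suggested above could be sketched roughly as follows: draw several seeded, class-stratified 80/20 splits, score each one, and report the mean and standard deviation. This is a plain-Python illustration under stated assumptions, not the authors' evaluation code; `stratified_split`, `repeated_evaluation`, the toy labels, and the placeholder scorer are all hypothetical.

```python
import random
from statistics import mean, stdev


def stratified_split(labels, test_fraction, seed):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


def repeated_evaluation(labels, evaluate, n_repeats=10, test_fraction=0.2):
    """Score `evaluate` on n_repeats seeded splits; return (mean, std)."""
    scores = []
    for seed in range(n_repeats):
        train_idx, test_idx = stratified_split(labels, test_fraction, seed)
        scores.append(evaluate(train_idx, test_idx))
    return mean(scores), stdev(scores)


# Toy labels and a placeholder scorer (assumptions for illustration only);
# a real run would train and test the model inside `evaluate`.
toy_labels = [1] * 60 + [0] * 40


def majority_class_accuracy(train_idx, test_idx):
    train_labels = [toy_labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(toy_labels[i] == majority for i in test_idx) / len(test_idx)


avg, spread = repeated_evaluation(toy_labels, majority_class_accuracy)
print(f"accuracy: {avg:.3f} +/- {spread:.3f}")
```

Fixing the seed per repetition keeps every split reproducible, and the reported spread gives readers a sense of how much a single split could have moved the headline numbers.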

    Please provide the preprocessing details. Section 2.1 starts with a jump into working with preprocessed VOIs.

    It has already been shown that deep multitask learning (dMTL) effectively increases the performance of ML pipelines. In Table 1, provide one dMTL baseline in addition to what you have. I will leave it up to you to choose a methodology that is SOTA and has an easy-to-run code repository. Again, adding more baselines is a kind of generic comment that I do not like at all. Even without reading a paper, a reviewer can ask for more datasets and more baselines. However, I believe it improves the quality of your work, as yours is also dMTL.

    It is great that you did not provide bootstrapped measurements with a single data split, as it would not help.

    Share the data if you can.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript provides promising results and enough support. There are concerns about reproducibility and randomness measurements, which can hopefully be addressed during the rebuttal.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We express our great appreciation for all comments and questions from reviewers and the Area Chair, as well as for your acknowledgment of the significance of our work. Below, we summarize our responses to the main concerns:

  1. Details of the nephroblastoma (Wilms’ tumor) dataset used in this study (R1, R2, R3);
  2. Concerns about existing methods being able to obtain the VOI only via interactive input during inference (R1, R2);
  3. Other questions raised by the reviewers.

  1. Details of the Wilms’ tumor dataset. We retrospectively assembled a multi-center, annotated Wilms’ tumor CT dataset comprising 197 postoperative pediatric cases (131 from Center 1, 66 from Center 2) imaged between January 2012 and December 2024. All patients underwent contrast-enhanced abdominal CT before any surgery, biopsy, radiotherapy, or chemotherapy. Of these, 109 cases were metastatic and 86 non-metastatic. We excluded studies with missing or low-quality scans (e.g., motion artifacts), preoperative treatment, or ambiguous diagnoses. The study was approved by both institutions’ review boards, and all data were anonymized.

  2. VOI Acquisition. In the initial design of our method, we proposed an interactive model inference framework requiring clinician input to manually delineate volumes of interest (VOI). While this approach aimed to enhance interpretability, we acknowledge that the additional operational steps during the inference phase may introduce practical burdens in clinical workflows. To address this limitation, we plan to refine our framework in future work by integrating an automated VOI cropping module. This module will accept raw CT images of arbitrary dimensions (e.g., 512×512×136 voxels) and autonomously localize and extract tumor-related VOIs, thereby minimizing manual intervention while maintaining analytical accuracy. We believe this improvement will significantly enhance the usability and scalability of our model in real-world clinical settings.

  3. Other questions raised by the reviewers.

     (1) Student Flow and Teacher Flow, and Auxiliary Functions. The “student flow” and “teacher flow” illustrated in Figure 1 are core components of our self-distillation architecture, designed to enhance feature representation through hierarchical guidance. Specifically, the student flow corresponds to the backbone features extracted at each stage of the network, while the teacher flow is derived from the fused multi-scale features generated by the WRT-FPN module. The teacher flow propagates forward to produce classification predictions, which then supervise the student flow’s predictions via knowledge distillation. This interaction is formalized through an auxiliary loss function that combines (i) the cross-entropy losses between both flows’ predictions and the ground-truth labels, and (ii) the KL divergence between the probabilistic outputs of the two flows. We emphasize that this design aims to implicitly refine feature discriminability without introducing additional annotation burdens, as the teacher flow inherently captures richer contextual information from multi-scale fusion.

     (2) Deep Multi-Task Learning (dMTL) Baseline. We appreciate the reviewer’s suggestion to evaluate our approach on a dMTL baseline. Although our model already achieves state-of-the-art results in standalone classification and segmentation, we agree that a unified dMTL evaluation would further bolster our findings. We will therefore systematically assess its performance on dMTL tasks in future work. Thank you for this valuable recommendation.
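
As a rough illustration of the auxiliary loss described in the response on the student and teacher flows, the sketch below combines the two cross-entropy terms with a KL term between the flows' output distributions. It is a plain-Python sketch for a single sample, not the released implementation: the equal weighting of the cross-entropy terms, the `kl_weight` and `temperature` parameters, and the direction KL(teacher || student) are all assumptions, since the rebuttal does not specify them.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index."""
    return -math.log(softmax(logits)[target])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def auxiliary_loss(student_logits, teacher_logits, target,
                   kl_weight=1.0, temperature=1.0):
    """CE for both flows plus a weighted KL(teacher || student) term.

    The weighting, temperature, and KL direction are assumptions of this sketch.
    """
    ce_student = cross_entropy(student_logits, target)
    ce_teacher = cross_entropy(teacher_logits, target)
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    distill = kl_divergence(teacher_probs, student_probs)
    return ce_student + ce_teacher + kl_weight * distill


# Example: a confident teacher and a less certain student for true class 1.
print(auxiliary_loss([0.2, 0.4], [-1.0, 2.0], target=1))
```

In a training framework the teacher's probabilities would typically be treated as constants in the KL term so that gradients from distillation flow only into the student; that detail is also an assumption here rather than something stated in the rebuttal.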

Once again, we sincerely thank the reviewers for their valuable feedback!




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


