Abstract
Transformers have revolutionized medical image processing through long-range information modeling with self-attention mechanisms. However, the adaptability of traditional transformer-based architectures to real-world applications is hindered by their large parameter space and their limited ability to process fine-grained local information, which is vital for high-resolution downstream tasks such as multi-organ segmentation. Several hybrid models combining convolutions and self-attention have been proposed over the years to resolve these issues, yet none of them offers an efficient joint processing of local and global information. Inspired by the top-down mechanism of the human visual system, in this paper we propose a novel 3D transformer architecture, dubbed WaveFormer, that leverages the fundamental frequency-domain properties of features to learn contextual representations. Our encoder efficiently extracts multi-scale features by applying self-attention in a reduced feature space obtained through progressive summarization with the discrete wavelet transform (DWT), keeping the high-frequency detail intact. The decoder, in turn, progressively reconstructs high-resolution segmentation masks using the high-frequency counterparts from the DWT as a global guide through the inverse discrete wavelet transform (IDWT). Leveraging this fundamental data property to decode with a linear transform (IDWT) removes the need for the parameter-heavy upsampling layers of traditional methods. Quantitative evaluation on BraTS2023 and FLARE2021 shows that our model outperforms the state of the art, and it achieves comparable performance on the KiTS2023 challenge dataset.
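As a rough illustration of the core idea (a minimal sketch using the third-party pywt package purely to show shapes, not the authors' implementation), one level of 3D Haar DWT halves each spatial dimension before attention, and the IDWT restores full resolution with a fixed linear transform:

```python
import numpy as np
import pywt

# Toy single-channel feature volume standing in for a 48^3 token grid (size is illustrative).
feat = np.random.rand(48, 48, 48).astype(np.float32)

# One level of 3D Haar DWT: 1 low-frequency band ("aaa") + 7 high-frequency detail bands.
coeffs = pywt.dwtn(feat, "haar")
low = coeffs["aaa"]
highs = {k: v for k, v in coeffs.items() if k != "aaa"}
print(low.shape)    # (24, 24, 24) -> attention can run on this reduced grid
print(len(highs))   # 7 detail bands kept aside for the IDWT-based decoder

# Inverse DWT restores full resolution with a fixed linear transform (no learned upsampling).
recon = pywt.idwtn({"aaa": low, **highs}, "haar")
print(np.allclose(recon, feat, atol=1e-5))  # True
```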
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4968_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/mahfuzalhasan/WaveFormer
Link to the Dataset(s)
BibTex
@InProceedings{AlMd_WaveFormer_MICCAI2025,
author = { Al Hasan, Md Mahfuz and Zaman, Mahdi and Jawad, Abdul and Santamaria-Pang, Alberto and Lee, Ho Hin and Tarapov, Ivan and See, Kyle B. and Imran, Md Shah and Roy, Antika and Fallah, Yaser Pourmohammadi and Asadizanjani, Navid and Forghani, Reza},
title = { { WaveFormer: A 3D Transformer with Wavelet-Driven Feature Representation for Efficient Medical Image Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15963},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper introduces WaveFormer, a novel 3D transformer for medical image segmentation that leverages discrete wavelet transforms to efficiently capture both global context and fine local details. It replaces standard upsampling with inverse wavelet transforms, reducing model complexity. Validated on three benchmarks, WaveFormer achieves competitive performance with significantly lower computational cost.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper addresses a clinically relevant and timely problem in the field of radiological image analysis, focusing on efficient and accurate 3D medical image segmentation. One of its major strengths lies in the novel formulation of a transformer architecture, which integrates discrete wavelet transforms to separate low- and high-frequency components of volumetric data.
The authors clearly explain both the conceptual foundation and technical implementation of the model, providing a solid and interpretable justification for its design. The use of multiple datasets covering diverse anatomical regions and resolution settings demonstrates strong generalization and ensures clinical relevance across a range of use cases.
The experimental evaluation is thorough, including appropriate performance metrics and an ablation study that, although not extensive, supports the design choices. While statistical significance tests are not explicitly reported, the results are convincing and contextualized in comparison to prior work.
Importantly, the proposed method shows competitive accuracy with a much lower computational footprint, highlighting a meaningful trade-off between performance and efficiency. This positions the proposed model as a promising contribution for scalable and deployable medical AI solutions. The authors also acknowledge current limitations and suggest future directions, demonstrating awareness of the challenges ahead.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
I did not identify any substantial weaknesses in the paper. If one aspect were to be improved, it would be the depth of analysis regarding the contribution of individual components of the proposed architecture.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper provides a solid contribution, but a more detailed component-wise analysis and statistical evaluation would strengthen the work. This warrants acceptance with minor reservations.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The manuscript proposes a novel segmentation model for volumetric medical images that simultaneously addresses computational burden and segmentation performance. Instead of operating in the spatial domain of the input images, the work leverages wavelet-domain features to learn both the global context and the fine-grained details of the target regions. Using the frequency domain also helps reduce computational complexity in terms of the number of parameters involved.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Leveraging wavelets for feature learning throughout the network is a good way to reduce the number of tokens processed by the self-attention or window self-attention mechanism, which lowers the computational burden.
- The choice of wavelets over other frequency-domain transforms such as the Fourier transform helps capture which frequency components are present and where they are located in the input. This captures the overall shape and texture of the target regions well, which is crucial for medical image segmentation.
- Exploring IDWT instead of conventional CNN blocks along the decoder path introduces computational efficiency by reducing the number of parameters from 3D convolution blocks.
- Evaluating performance across multiple volumetric datasets demonstrates the generalizability of the model.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- IDWT needs both the low- and high-frequency components to reconstruct the spatial-domain representation. Z_enc is the final processed low-frequency output from the encoder arm and Z_HF^i is the high-frequency output from stage i. For reconstructing the spatial-domain representation at a particular stage i, why have the authors not chosen Z_i (the low-frequency component) and Z_HF^i as the inputs to the IDWT operation?
- The logic behind using both the low-frequency component produced at a particular stage (Z_i) and the final encoder output Z_enc, which is also a low-frequency representation, in the IDWT upsampling block is unclear.
- The authors report only the number of parameters as a metric to show that their model is computationally efficient. However, parameter count alone does not reflect the amount of arithmetic required to process the input; a model with fewer parameters can still involve complex operations that affect inference time.
- The choice of and rationale behind the particular wavelet used for the experiments should be included.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The unclear design choices are the reason behind this decision.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
Three major contributions of the paper: frequency-domain representation learning, an efficient frequency-guided decoder, and enhanced local-global context aggregation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The integration of DWT into a transformer architecture is a novel and well-justified approach. Decomposing features into low-frequency (global) and high-frequency (local) sub-bands effectively addresses the dual challenges of computational efficiency and fine-grained detail preservation in 3D medical image segmentation.
- The biologically inspired design, emulating the top-down visual processing pathway, adds a unique perspective to the model’s development, aligning with neuroscientific principles and potentially enhancing its interpretability.
- The use of DWT and inverse DWT (IDWT) reduces the token count for self-attention and replaces heavy upsampling layers.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Insufficient ablation studies: a comprehensive ablation study is needed to dissect the contributions of individual components (e.g., DWT levels, wavelet-attention blocks, squeeze-and-excitation module). For instance, how does the model's performance change if the DWT decomposition level m is altered? What is the impact of excluding the squeeze-and-excitation module?
- Clarity issues in Section 2.2 and also in Figure 2: some parameters are used in equations without being defined.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The manuscript presents a promising and innovative approach to 3D medical image segmentation, with clear contributions in terms of efficiency and performance. The use of wavelet transforms in a transformer framework is a novel contribution, and the evaluation on benchmark datasets demonstrates its potential. However, the manuscript requires significant revisions to address the insufficient ablation studies, incomplete baseline comparisons, and clarity issues in figures and equations. With these revisions, the manuscript has the potential to make a valuable contribution to the field and would be suitable for publication.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank all reviewers for their valuable feedback and constructive comments. The GitHub repository will be provided in the camera-ready version.
R1: We will provide ablations on the contributions of the different components (e.g., channel calibration) in the final version.
R2-Q1: Please note that Z_i is the final output of the ith stage, not the low-frequency (LF) component. At each stage, the DWT decomposes the input tokens into a low-frequency approximation (Z_LF) and a set of high-frequency (HF) detail coefficients {Z_HF} (Figure 2a). Two rounds of attention and feed-forward layers transform Z_LF into Z_i, the final output of stage i. In contrast, Z_enc (6^3) is the final encoded representation obtained by processing the input tokens Z_token (48^3) through four stages of hierarchical attention (2*4 = 8 attention blocks) on the DWT approximation coefficients of the token features (Figure 1). For every decoder level, we feed the same global representation Z_enc into the IDWT, together with the stage-specific high-frequency set: IDWT(Z_enc, {Z_HF^i}) → Z_dec^i. We therefore had two options: use Z_enc or the per-stage local LF, Z_LF^i (Figure 2a). We chose Z_enc because it captures global context via full-volume self-attention. Pairing it with the stage-specific HF bands in each IDWT block enriches the reconstruction with long-range semantic cues (e.g., tumor vs. edema). Reusing Z_enc across decoder stages also couples gradients across scales, promoting consistent learning. Using separate per-stage LF maps would provide only local context, increase memory usage because they would need to be cached, and complicate the implementation. In contrast, a shared Z_enc path enables a cleaner, more efficient multi-scale decoding pipeline in PyTorch, with better semantic coherence.
R2-Q2: In each IDWT upsampling block, Z_enc serves as the LF input and is combined with the stage-specific HF set {Z_HF^i} to perform the inverse 3D DWT, doubling the spatial resolution. Separately, the encoder output Z_i from stage i is added as a skip connection after the IDWT to provide channel-specific localization cues to the decoded feature. To clarify, Z_enc and Z_i are not both passed into the IDWT operation—only the global Z_enc is used there. We acknowledge the potential confusion due to naming and will address it, along with an ablation evaluating the effect of omitting the skip connection, in the final version.
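To make this decoder step concrete, below is a minimal, hedged PyTorch sketch of what such an IDWT-based upsampling block could look like. It builds a single-level 3D inverse Haar DWT from the filters g_0 and g_1 (see R2-Q4 below) and adds the stage skip connection afterward; the names, the 1×1×1 projection, and the assumption that Z_enc is already on the stage's low-frequency grid are illustrative choices, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SQRT2 = 2 ** 0.5
G = {"a": torch.tensor([1.0, 1.0]) / SQRT2,   # Haar scaling (low-pass) filter g_0
     "d": torch.tensor([1.0, -1.0]) / SQRT2}  # Haar wavelet (high-pass) filter g_1


def haar_idwt3d(low, highs):
    """Single-level 3D inverse Haar DWT.
    low   : (B, C, D, H, W) approximation band ("aaa").
    highs : dict of the 7 detail bands ("aad", ..., "ddd"), same shape as `low`.
    Returns a (B, C, 2D, 2H, 2W) reconstruction using fixed (non-learned) filters."""
    bands = dict(highs)
    bands["aaa"] = low
    channels = low.shape[1]
    out = None
    for key, band in bands.items():
        # 2x2x2 synthesis filter = outer product of the per-axis Haar filters
        w = torch.einsum("i,j,k->ijk", G[key[0]], G[key[1]], G[key[2]])
        w = w.to(band.device, band.dtype).view(1, 1, 2, 2, 2).repeat(channels, 1, 1, 1, 1)
        up = F.conv_transpose3d(band, w, stride=2, groups=channels)
        out = up if out is None else out + up
    return out


class IDWTUpBlock(nn.Module):
    """Sketch of one decoder stage: IDWT upsampling with the global Z_enc and a stage skip."""

    def __init__(self, channels):
        super().__init__()
        self.proj = nn.Conv3d(channels, channels, kernel_size=1)  # light channel mixing (assumed)

    def forward(self, z_enc, z_hf_stage, z_i):
        # z_enc      : shared global LF representation, assumed resized to this stage's LF grid
        # z_hf_stage : dict of the 7 stage-specific high-frequency bands {Z_HF^i}
        # z_i        : stage-i encoder output, added after the IDWT as the skip connection
        z = haar_idwt3d(z_enc, z_hf_stage)  # doubles spatial resolution without learned upsampling
        z = z + z_i
        return self.proj(z)
```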
R2-Q3: We use the number of parameters as a metric of complexity. The new operations introduced come from the DWT, which consists of convolutions with a Haar wavelet. These operations are not complex, and the DWT is known to be optimal and linear; hence the number of parameters can be used as a metric for comparison with classical methods. Additionally, we compared the FLOPs of our network and found that our model has comparable or fewer FLOPs than the SOTA methods. We will present the FLOPs and the complexity of the operations in the final version.
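Both metrics are straightforward to report; here is a minimal sketch assuming fvcore is available, with a toy Conv3d standing in for the actual network (the module and the 96^3 input size are placeholders, not the paper's setup):

```python
import torch
from fvcore.nn import FlopCountAnalysis

# Toy stand-in for a 3D segmentation network; module and patch size are placeholders.
model = torch.nn.Conv3d(4, 3, kernel_size=3, padding=1)
dummy = torch.randn(1, 4, 96, 96, 96)

print(sum(p.numel() for p in model.parameters()))  # parameter count
print(FlopCountAnalysis(model, dummy).total())     # FLOP estimate for one forward pass
```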
R2-Q4: We use the Haar wavelet. The Haar wavelet is the simplest orthogonal wavelet with compact support, defined by the scaling filter g_0 = [1/√2, 1/√2] and the wavelet filter g_1 = [1/√2, -1/√2].
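As a quick illustration of these filters (a self-contained numpy example, not tied to the paper's code), one level of 1D Haar analysis and synthesis on a toy signal shows the perfect, parameter-free reconstruction that the IDWT decoder relies on:

```python
import numpy as np

g0 = np.array([1.0, 1.0]) / np.sqrt(2)   # Haar scaling (low-pass) filter g_0
g1 = np.array([1.0, -1.0]) / np.sqrt(2)  # Haar wavelet (high-pass) filter g_1

x = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 2.0])
pairs = x.reshape(-1, 2)

# Analysis: one low-frequency and one high-frequency coefficient per pair of samples
low = pairs @ g0    # [ 7.07, 15.56,  7.07]
high = pairs @ g1   # [-1.41, -1.41,  4.24]

# Synthesis (IDWT): recombine the two bands to recover the signal exactly
recon = np.stack([low * g0[0] + high * g1[0],
                  low * g0[1] + high * g1[1]], axis=1).ravel()
assert np.allclose(recon, x)  # perfect reconstruction, no learned parameters
```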
R3-Q1, Part 1: In stages 1–3, input tokens undergo wavelet decomposition at levels m = 3, 2, and 1, respectively, down to the final encoder scale (e.g., 48³ → 6³ in stage 1). Reducing m (e.g., to 2) leads to higher-resolution attention (12³), increasing computation—counter to the model's efficiency goal. However, for fine-grained tasks, m can be adjusted; the modular design supports such adaptation.
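The token-count arithmetic behind this trade-off (assuming a 48³ token grid, as in stage 1):

```python
# Side length and token count seen by self-attention after m levels of Haar DWT on a 48^3 grid.
for m in (1, 2, 3):
    side = 48 // (2 ** m)
    print(f"m={m}: attention on {side}^3 = {side ** 3} tokens")
# m=1: attention on 24^3 = 13824 tokens
# m=2: attention on 12^3 = 1728 tokens
# m=3: attention on 6^3 = 216 tokens
```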
Part 2: The SQE (squeeze-and-excitation) block enhances context propagation across channels in the high-dimensional encoded feature (384 × 6³), yielding more holistic representations. While a 3D UNet-style bottleneck (as in UX-Net) is possible, it adds significant parameters for only marginal Dice improvement. We will include ablations on both m and the SQE block in the camera-ready version to highlight the performance–efficiency trade-offs.
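For reference, a generic 3D squeeze-and-excitation block operating on such a feature map can be sketched as follows (the reduction ratio of 16 and the exact layout are assumptions, not the paper's specification):

```python
import torch
import torch.nn as nn


class ChannelSE3D(nn.Module):
    """Generic 3D squeeze-and-excitation block; a sketch of what an SQE module on a
    384-channel, 6^3 feature could look like (reduction=16 is an assumed default)."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)  # squeeze: global spatial context per channel
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                    # excitation: per-channel gates in (0, 1)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c = x.shape[:2]
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w                         # recalibrate channels, shape unchanged


# e.g., ChannelSE3D(384)(torch.randn(2, 384, 6, 6, 6)) keeps the (2, 384, 6, 6, 6) shape
```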
R3-Q2: We will fix the mentioned issues in the camera-ready version.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A