Abstract

Semantic segmentation in colonoscopy images is pivotal in aiding healthcare professionals to interpret images and enhance diagnostic precision. Nonetheless, the detection of polyps and instruments is challenged by the difficulty in capturing the textures and edges of tiny lesions, and these challenges are exacerbated by low contrast, inconsistent illumination, and noise. To address these challenges, we introduce WDNet, a network adopting a multi-tiered feature extraction and fusion approach, with each encoder layer amalgamating local and global information to construct expressive high-level representations. The input of the network is derived from wavelet transform to dissect images into low- and high-frequency sub-bands, utilizing learnable soft-thresholding to diminish noise while maintaining essential features. High-frequency data are adept at capturing details and edges, whereas low-frequency data furnish a global context. Moreover, WDNet harnesses a diffusion-based decoding mechanism with adaptive step sizes to amplify target region features and mitigate background interference, achieving meticulous segmentation. Comprehensive experiments conducted on a new surgical dataset, along with public benchmarks underscore its remarkable performance. WDNet not only exhibits state-of-the-art performance of semantic segmentation in colonoscopy images with remarkable detail and boundary accuracy but also stands out in processing speed, facilitating the swift handling of extensive datasets. The dataset and source code are available at https://github.com/hedongdong6060/WDNet.
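
The wavelet front end described in the abstract can be illustrated with a minimal sketch. It assumes a single-level orthonormal Haar transform on a small grayscale image with even dimensions, and a fixed threshold `tau` standing in for the learnable one. The function names `haar_dwt2` and `soft_threshold` are hypothetical and are not taken from the WDNet repository; the wavelet family actually used by the authors is not stated on this page.

```python
import math


def haar_dwt2(img):
    """One-level 2D Haar DWT of an even-sized grayscale image (list of rows).

    Returns four half-resolution sub-bands: one low-frequency band and
    three high-frequency bands, computed per 2x2 block.
    """
    h, w = len(img), len(img[0])
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, h, 2):
        row_ll, row_lh, row_hl, row_hh = [], [], [], []
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            row_ll.append((a + b + d + e) / 2.0)  # block average: coarse context
            row_lh.append((a - b + d - e) / 2.0)  # left-vs-right column difference
            row_hl.append((a + b - d - e) / 2.0)  # top-vs-bottom row difference
            row_hh.append((a - b - d + e) / 2.0)  # diagonal difference
        ll.append(row_ll)
        lh.append(row_lh)
        hl.append(row_hl)
        hh.append(row_hh)
    return ll, lh, hl, hh


def soft_threshold(band, tau):
    """Shrink each coefficient toward zero by tau: sign(x) * max(|x| - tau, 0).

    In a trainable network tau would be a learned parameter; here it is a
    fixed number chosen only for illustration.
    """
    return [[math.copysign(max(abs(v) - tau, 0.0), v) for v in row] for row in band]


# Usage: small coefficients (likely noise) are zeroed, large ones (edges) survive.
image = [
    [10, 10, 50, 52],
    [10, 12, 50, 50],
    [10, 10, 50, 50],
    [10, 10, 50, 50],
]
ll, lh, hl, hh = haar_dwt2(image)
lh, hl, hh = (soft_threshold(band, tau=1.5) for band in (lh, hl, hh))
```

Soft-thresholding relies on the high-frequency coefficients being sparse: most are near zero, so shrinking them all by a small amount removes noise while leaving strong edge responses largely intact.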

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2468_paper.pdf

SharedIt Link: https://rdcu.be/eHw4L

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05127-1_60

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/hedongdong6060/WDNet

Link to the Dataset(s)

https://github.com/hedongdong6060/WDNet

BibTex

@InProceedings{HeDon_WDNet_MICCAI2025,
        author = { He, Dongdong AND Ma, Fang AND Liu, Ziteng AND Yin, Xunhai AND Liu, Hao AND Gao, Wenpeng AND Zhang, Chenghong AND Fu, Yili},
        title = { { WDNet: A Novel Wavelet-guided Hierarchical Diffusion Network for Multi-Target Segmentation in Colonoscopy Images } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {625 -- 635}
}

Reviews

Review #1

Please describe the contribution of the paper

The authors propose a new multi-target segmentation method for colonoscopy images. It explicitly separates the input into details in high-frequency sub-bands and global context in low-frequency sub-bands, and fuses them using a diffusion-based decoding mechanism. Through this design, the model improves robustness to low contrast, inconsistent illumination, and noise, thereby enhancing segmentation performance. The authors validate the method, which achieves the best performance on their self-collected EPID dataset and on other public datasets.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The authors explicitly use the discrete wavelet transform (DWT) to decompose the original image, effectively enabling separate processing of high-frequency details and low-frequency semantic information.
2. In the performance comparison, the proposed method clearly outperforms other baseline models and also demonstrates real-time inference capability that enables the possible implementation of robot-assisted surgery.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. Regarding the ablation study: The authors claim that “removing the high-frequency branch” (Baseline: Single Branch) leads to a drop in IoU score. However, this comparison seems unfair. To fairly demonstrate the advantage of the dual-branch structure, the single-branch baseline should be given the full input—whereas in the current setup, only the low-frequency component is used.
2. The overall organization of the paper is confusing: It is very hard to locate the “Dual-Stream Hierarchical Feature Aggregation Module” in Figure 1. The authors should clarify where this module fits within the overall architecture. The description of the “Diffusion Feature Refinement Module” also does not seem to align with the “Diffusion Decoder” shown in the figure.
Please rate the clarity and organization of this paper

Poor
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
1. The figures need improvement.
2. In fact, many previous works have applied wavelet decomposition to input images followed by feature fusion. The authors should conduct a more thorough review of the related literature in this area: Zhou, Yanfeng, et al. “Xnet: Wavelet-based low and high frequency fusion networks for fully-and semi-supervised semantic segmentation of biomedical images.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
3. Both mIoU and IoU appear in the manuscript. It is recommended that only mIoU be used throughout the paper.
4. The ablation study on “No Diffusion Module” is somewhat unclear. The authors mention “replacing the diffusion module with conventional convolutions,” but it is not specified whether there is a significant difference in the number of parameters before and after the replacement. Additionally, it would be helpful if the performance comparison section included model parameter counts for reference.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The explanation of the proposed method is unclear and confusing. The authors shall carefully address the concerns in the rebuttal.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The response has addressed most of my concerns, and I think the work is enough for publication at MICCAI. However, the manuscript still needs improvement, particularly in terms of its visual illustrations.

Review #2

Please describe the contribution of the paper

This research work tackles the issue of segmentation in colonoscopy images (structures and instruments). I would say that the model implements a mainstream process with a novel architecture in the era of deep learning, and the work is properly conducted, including an ablation study. The results improve on the state of the art. I would explain the clinical use case in more detail (why instruments are imaged here) for a general audience. Also, regarding Equation (1) and its explanation: the statement “dround over dk signifies the high frequency component extraction” is not clear to me. The paper may also need another read-through for English.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Results are improving the SoTA.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

Perhaps some clarification on the maths
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

I think the paper needs to be read again to make it clearer
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

I now understand the clinical use much better, and I think that the authors answered most of the other reviewers’ comments.

Review #3

Please describe the contribution of the paper

The paper introduces WDNet, a novel deep learning architecture for real-time semantic segmentation in colonoscopy images, specifically targeting both polyps and surgical instruments. Its core innovation lies in a wavelet-guided decomposition module, which separates input images into high- and low-frequency subbands to better preserve texture and boundary information. These multi-frequency components are then processed through dedicated convolutional blocks (Double Basic Block) to enhance feature representation. The encoder adopts a hierarchical multi-level fusion strategy to effectively integrate local and global context, while the decoder includes a novel diffusion-based refinement module inspired by partial differential equations, which iteratively sharpens boundaries and suppresses background interference using adaptive step sizes. Additionally, the authors contribute a new large-scale dataset comprising over 10,000 annotated frames from 100 endoscopic procedures, enabling a thorough evaluation that demonstrates WDNet’s superiority in both segmentation accuracy and inference speed compared to existing approaches.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Clinically relevant real-time dual-target segmentation: The model jointly segments polyps and surgical instruments with speed of around 35 FPS, which aligns well with the needs of computer-assisted interventions. This enhances the method’s translational potential in endoscopic workflows.
- Creative adaptation of wavelet decomposition: Although wavelet transforms are established, the paper presents a thoughtful integration of DWT into a segmentation framework, allowing frequency-aware feature extraction that improves boundary and texture preservation in surgical images.
- Effective diffusion-based feature refinement: The diffusion-inspired decoder module leverages adaptive PDE-like updates to refine feature maps, improving boundary sharpness while maintaining real-time performance. It’s a solid example of principled design tailored to the task.
- Valuable dataset contribution: The release of a large and diverse colonoscopy segmentation dataset (10,000+ annotated frames from 100 cases) is a major contribution that supports reproducibility and further research, especially in the under-resourced space of surgical scene understanding.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Missing deployment metrics. While real‑time performance (35 FPS) is noted, essential deployment details—total parameter count, FLOPs, and memory footprint—are omitted, making it difficult to assess feasibility on resource‑constrained surgical hardware or edge devices.
- Lack of info on the dataset and training setup. There is a lack of information about the dataset, e.g., patient demographics, other than the fact that it contains both polyps and surgical instruments. Besides that, there is no mention of the dataset split used during training for the public and private datasets, or of whether a single model is trained on all datasets.
- Incremental methodological novelty. The application of the proposed network architecture to this topic is novel, but the use of DWT, multi-level feature fusion, and a diffusion decoder to denoise the embeddings is not necessarily novel. That said, the authors did not cite any previous work on these, for example, Diffusion-Wavelet (DiWa) [https://arxiv.org/pdf/2304.01994] and WaveCNets [https://arxiv.org/pdf/2005.03337].
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

Major comments:
C1. Does the EPID dataset capture the “complex environment” as described in the paper?
C2. Where in the network is the “…introduce a sparsity prior-based constraint mechanism …” mentioned in the introduction? Please elaborate.
C3. A few explanations in Sec. 2.1 are unclear. (1) How did the authors “integrated and normalizes the three high-frequency sub-bands”? (2) Based on Fig. 1, it is unclear how the authors incorporate F_fused in the network.
C4. In Sec. 2.2, it is unclear how “… the Sigmoid(X) implements the adaptive noise gating …”. Also, why does “adaptive noise gating” improve boundaries?
C5. In Sec. 2.3, what does “transition” refer to? It is also unclear how the feature aggregation module is implemented in Fig. 1. Please elaborate.
C6. Is there any explanation for why the proposed model achieves higher performance on ClinicDB and Kvasir than on EPID (Table 1)? Why does Table 3 exclude ClinicDB?
C7. There is no discussion of the limitations of this work.
C8. What is the relationship between diffusion steps and FPS? Is it linear?

Minor comments:
C9. Remove the “.” after citations [11] and [12] in the Introduction.
C10. The first few sentences in Sec. 2.2 are repetitive. Please make them more concise.
C11. The radar chart is not necessary, since the authors can extend Table 1 to depict the same information.
C12. How does the Upsample operation work? It is not mentioned at all in the text.
C13. Fix the typos in Table 2. Also, what is the difference between Tables 1 and 2? Does Table 2 separate the evaluation of polyp and instrument segmentation? If it does, please indicate this in the text.
C14. In the Diffusion Decoder (Fig. 1), what are the top-down and bottom-up arrows being concatenated? Besides that, the first “elaborated” Diffusion Block in the Diffusion Decoder could be replaced with the shape for simplicity.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper proposes an interesting application of DWT and a diffusion decoder for segmenting polyps and surgical instruments. The proposed solution achieves quite high accuracy compared to existing methods while maintaining a relatively high FPS. However, the technical details of the proposed network are not yet well described or explained, which leaves some confusion and warrants the Weak Accept.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

Authors addressed most of the questions, expect minor revisions on the paper for better clarity.

Author Feedback

We appreciate the reviewers’ constructive feedback.
[R1] Clinical Use Case: Simultaneous segmentation of polyps and surgical instruments is essential during polypectomy. Their interaction, involving occlusion and deformation, provides crucial context for robotic navigation, surgical skill assessment, and improved visualization.
[R2] Metrics: The model on EPID data requires 117.11M parameters, 2.5 TFLOPs, and 6.2 GB memory for deployment.
[R2] Dataset: The EPID dataset captures complex clinical environments, focusing on the interaction between instruments and polyps during surgery. Factors include exposure, deformation, low intensity, and noise. Upon public release, demographic information will be included. The data are split 7:2:1 for training, validation, and testing. Separate WDNet models were trained on EPID and the public datasets.
[R2] Citations: We will cite relevant references and provide an in-depth comparison. The novel architecture uses a dual-stream wavelet encoder for multi-frequency, hierarchical features and SDE-guided DFR. Unlike DiWa and WaveCNets, DFR iteratively refines learned features, enhancing boundary clarity and reducing noise. This addresses surgical challenges for robust multi-target segmentation.
[R2] Sparsity mechanism: This mechanism applies learnable soft-thresholding to high-frequency wavelet coefficients, leveraging their sparsity to filter noise and enhance fine details, namely the critical edges of targets.
[R1,R2] 2.1: Three high-frequency sub-bands were integrated via element-wise summation, followed by min-max normalization. The Wavelet Transform Module outputs these two processed streams (low-frequency and high-frequency), which serve as the inputs to the Encoder.
[R2] 2.2: Sigmoid(X) acts as an adaptive gate by scaling noise influence with feature strength: strong features, characteristic of boundaries, allow more noise to guide the diffusion process. This targeted refinement of high-signal boundary regions enhances sharpness and clarity.
[R2] 2.3: “transition” refers to operations that change feature map dimensions between encoder levels. In Section 2.3, this module is represented by the stacked blocks within the Encoder. The downward arrows indicate this hierarchical processing and potential spatial downsampling.
[R2] Results and limitations: Performance is lower on EPID due to its more complex multi-target segmentation task with interactions. We focused ablation studies on the novel EPID and representative Kvasir datasets. We will discuss limitations.
[R2] Increasing diffusion steps decreases FPS, reflecting an inverse, non-strictly linear trade-off between accuracy and speed.
[R3] Ablation Study: The ‘single-branch’ baseline, like the full model, undergoes wavelet transform, then removes high-frequency (HF) information and its dedicated branch. This measures the unique contribution of the HF pathway, showing performance gains from processing HF data in the wavelet framework.
[R3] Clarify: The “Dual-Stream Hierarchical Feature Aggregation Module” is the fundamental building block within the Encoder layers shown in Fig 1; downward arrows represent hierarchical processing. We will make its location within the Encoder clearer in Fig 1 and standardize the terminology for the diffusion component to “Diffusion Feature Refinement Module” in the revised manuscript.
[R3] Ablation Module: ‘No Diffusion Module’ ablation (conv replacement): Parameters 112.25M vs 117.11M (original). It’s a relatively small difference.
[R3] Review: We will conduct a more thorough literature review, including the suggested paper by Zhou et al. Unlike Zhou et al., our method employs sparse prior constraints in the high-frequency branch and integrates dual-stream hierarchical aggregation and diffusion refinement, improving boundary and noise processing for colonoscopy surgery.
[R3] Metrics: We used “IoU” for single-target evaluations and “mIoU” for our multi-target dataset.
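
The sub-band fusion (response 2.1) and the sigmoid gate (response 2.2) described in the rebuttal can be sketched as follows. This is an illustration under assumptions, not the authors' code: the helper names `fuse_high_frequency` and `gated_noise`, the small `eps` guard against division by zero, and the plain element-wise product used for gating are all hypothetical choices. The rest of the diffusion update (drift term, step size) is not specified on this page and is left out.

```python
import math


def fuse_high_frequency(lh, hl, hh, eps=1e-8):
    """Element-wise sum of three sub-bands, then min-max normalization to [0, 1].

    eps is an assumed guard so a constant input does not divide by zero.
    """
    summed = [
        [a + b + c for a, b, c in zip(r1, r2, r3)]
        for r1, r2, r3 in zip(lh, hl, hh)
    ]
    flat = [v for row in summed for v in row]
    lo, hi = min(flat), max(flat)
    return [[(v - lo) / (hi - lo + eps) for v in row] for row in summed]


def sigmoid(x):
    """Logistic function; adequate for the moderate values used here."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_noise(features, noise):
    """Scale each noise value by sigmoid(feature): stronger features pass more noise."""
    return [
        [sigmoid(f) * n for f, n in zip(f_row, n_row)]
        for f_row, n_row in zip(features, noise)
    ]


# Usage: the strong feature (4.0) lets through more of the noise than the weak one (0.0).
fused = fuse_high_frequency([[1.0, 0.0]], [[2.0, 0.0]], [[3.0, 0.0]])
gated = gated_noise([[0.0, 4.0]], [[1.0, 1.0]])
```

Because the gate is bounded between 0 and 1, it can only attenuate the injected noise, which is consistent with the rebuttal's description of noise influence scaling with feature strength.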
Regarding other details, we will improve the figures, improve clarity, and add missing references.
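
The IoU versus mIoU distinction raised in Review #1 and answered in the rebuttal can be made concrete with a short sketch. The label convention (0 = background, 1 = polyp, 2 = instrument) and the choice to average only over foreground classes are assumptions made for illustration; this page does not state how the authors handle the background class.

```python
import math


def iou(pred, target, cls):
    """Intersection over union for one class on two equally sized label maps."""
    inter = union = 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p_hit, t_hit = p == cls, t == cls
            inter += p_hit and t_hit
            union += p_hit or t_hit
    # A class absent from both maps has no defined IoU.
    return inter / union if union else math.nan


def mean_iou(pred, target, classes):
    """Average of per-class IoU, skipping classes absent from both maps."""
    scores = [iou(pred, target, c) for c in classes]
    valid = [s for s in scores if not math.isnan(s)]
    return sum(valid) / len(valid) if valid else math.nan


# Usage with hypothetical labels: 0 = background, 1 = polyp, 2 = instrument.
pred = [[1, 1, 0], [2, 2, 0]]
target = [[1, 0, 0], [2, 2, 2]]
polyp_iou = iou(pred, target, 1)          # single-target metric
multi_miou = mean_iou(pred, target, [1, 2])  # multi-target metric
```

On a single-target dataset the two metrics coincide when only the foreground class is scored, which is why reporting IoU there and mIoU on a multi-target dataset is internally consistent.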

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A


WDNet: A Novel Wavelet-guided Hierarchical Diffusion Network for Multi-Target Segmentation in Colonoscopy Images

Author(s):