Abstract

Medical image segmentation plays a vital role in healthcare by identifying and delineating specific structures, such as organs, tumors, or lesions, from medical images. While deep learning has significantly advanced this field, existing methods face two major challenges. First, they rely on pixel-wise discrete representations, which lead to difficulties in scaling to different input sizes and create ambiguity in fine boundary delineation. Second, the presence of noisy labels in medical datasets hinders model accuracy. To address these challenges, we propose a novel approach that leverages continuous representations and incorporates three key components: the Hierarchical Channel-Attention Encoder (HCAE), the Three-Stage Implicit Decoder with Noise-Based Index Selector (NBIS), and the High-Frequency Noise Modulator (HFNM). HCAE enhances feature extraction by capturing both fine and coarse details through hierarchical attention mechanisms. NBIS refines segmentation by identifying stable and unstable feature indices, improving performance in challenging regions. Meanwhile, HFNM selectively introduces noise to high-frequency components, helping the model mitigate the effects of label noise. This comprehensive solution demonstrates improved segmentation accuracy, particularly in the presence of noisy labels, making it a promising approach for medical image analysis.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0665_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{KumSur_Improving_MICCAI2025,
        author = { Kumari, Suruchi and Singh, Harshdeep and Singh, Pravendra},
        title = { { Improving Medical Image Segmentation with Implicit Representation and Noisy Label Robustness } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        page = {269 -- 279}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a novel deep learning framework based on SAM and implicit representations for medical image segmentation. The main contribution of this framework lies in three modules: the Hierarchical Channel-Attention Encoder (HCAE), a three-stage implicit decoder with a Noise-Based Index Selector (NBIS), and the High-Frequency Noise Modulator (HFNM).

    The HCAE leverages hierarchical features and channel attention to effectively capture the structure of medical images. The NBIS helps the model identify feature vectors that require further refinement in the three-stage implicit decoder. Meanwhile, the HFNM uses a Wavelet Transform to help the model handle label noise.
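    To make the HFNM idea concrete, here is a minimal sketch of adding noise only to the high-frequency wavelet sub-bands of an image. It is an illustration under stated assumptions, not the authors' implementation: the single-level Haar transform, the Gaussian noise model, and the names `haar_dwt2`, `haar_idwt2`, `add_high_frequency_noise`, and `sigma` are all hypothetical choices made for this example.

    ```python
    import numpy as np

    def haar_dwt2(x):
        """Single-level 2D Haar transform of an even-sized array -> (LL, LH, HL, HH)."""
        a, b = x[0::2, 0::2], x[0::2, 1::2]
        c, d = x[1::2, 0::2], x[1::2, 1::2]
        ll = (a + b + c + d) / 2.0   # low-frequency approximation
        lh = (a - b + c - d) / 2.0   # horizontal differences
        hl = (a + b - c - d) / 2.0   # vertical differences
        hh = (a - b - c + d) / 2.0   # diagonal differences
        return ll, lh, hl, hh

    def haar_idwt2(ll, lh, hl, hh):
        """Inverse of haar_dwt2."""
        h, w = ll.shape
        x = np.empty((2 * h, 2 * w), dtype=np.float64)
        x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
        x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
        x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
        x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
        return x

    def add_high_frequency_noise(image, sigma=0.1, rng=None):
        """Perturb only the high-frequency sub-bands (assumed Gaussian noise)."""
        rng = np.random.default_rng() if rng is None else rng
        ll, lh, hl, hh = haar_dwt2(np.asarray(image, dtype=np.float64))
        noisy_bands = [band + rng.normal(0.0, sigma, size=band.shape)
                       for band in (lh, hl, hh)]
        # The low-frequency band is passed through untouched.
        return haar_idwt2(ll, *noisy_bands)
    ```

    Because the transform is orthonormal, the low-frequency band of the perturbed image equals that of the original, so the coarse structure is preserved while edges and textures are perturbed.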

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The first strength I can identify in this paper is its performance, as the proposed framework achieves SOTA results. However, the authors only compare the proposed method with methods published in 2023 or earlier. Therefore, it is difficult to draw a definitive conclusion about the current performance of the proposed method. 2) The idea of NBIS is interesting. The motivation and proposed module are well-aligned.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    [Unclear Paper Writing] 1) Is the X^w in Figure 1 similar to \tilde{X} in Section 2.4? 2) Do both \tilde{X} and X go to the Patch Embedding for encoding, or just \tilde{X}? From the sentence “we pass the \tilde{X} to the encoder and decoder and get the final output as \hat{o}^’”, it seems that \tilde{X} works as an independent input, but this still needs clarification. 3) Regarding the Cross-Resolution experiments: are they conducted on the Polyp Segmentation task? What is the original performance at 384×384 resolution? How can readers evaluate the efficiency of the proposed method without this baseline? 4) What is the correct abbreviation for the Channel-Wise and Hierarchical Attention, “CWHA” or “CWHN”? In Table 4, the authors use “CWHN” instead of “CWHA.” Please clarify. 5) The authors do not provide the implementation details for the proposed method.

    [Insufficient Experiments] 1) The choice of baselines for discrete approaches is questionable. The most recent method cited is MedSAM, published in Nature in 2024, but the original version was already available on arXiv in 2023. Between 2021 and 2024, many other SOTA methods have been published. Why are these not included for comparison? 2) The ablation studies for the Three-stage Decoder are insufficient. For example, what is the performance if only a two-stage decoder is used?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Below are some minor suggestions: 1) Compared to the I-MedSAM method, the proposed method uses a Wavelet rather than a Fourier transform to move the image to the frequency domain. It would be informative if the authors could provide ablation studies for the Wavelet Transform. 2) I recommend adding \hat{o}^3 to Figure 1 at the last arrow. 3) For Table 1, the last element of the “Method” column should be “Ours” for consistency. Also, “Method” should be “Methods”. 4) For Table 2, please add lines to separate each column.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I recommend weak reject for this paper because of several reasons: (1) there are many issues with logic and a lack of information in the writing; (2) the ablation study is incomplete.

    Also, the novelty of the Hierarchical Channel-Attention Encoder with CWHA is not significant, as it merely combines existing techniques without clear motivation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I appreciate that the authors have addressed my concerns regarding the unclear paper writing and insufficient experiments. After reading the other reviewers' reviews and the authors’ rebuttal, I recommend acceptance of this paper.

    However, the authors have to provide the ablation study for the Three-stage Decoder, as they stated in the rebuttal. Also, in 2024 and 2025, many papers have reached SOTA results on Polyp Segmentation and Organ Segmentation; the authors have to compare the proposed method with those models, not just Condseg.

    The authors do not need to compare their method with UNet, PraNet, or Res2UNet, as these are considered standard baseline methods. Instead, the proposed method should be compared with more recent SOTAs to demonstrate its effectiveness.



Review #2

  • Please describe the contribution of the paper

    1) Proposed Hierarchical Channel-Attention Encoder (HCAE). Combines SAM’s hierarchical features with a Channel-wise Hierarchical Attention (CWHA) mechanism to enhance multi-scale feature extraction (from coarse-to-fine granularity) for medical images. 2) Designed Three-Stage Implicit Decoder with Noise-Based Index Selector (NBIS). Refines segmentation results progressively via a three-stage decoder that utilizes variance from noise perturbations to identify unstable feature indices (NBIS), enabling targeted refinement of challenging regions. Integrates Implicit Neural Representations (INRs) to achieve resolution-agnostic continuous prediction, addressing limitations of traditional discrete pixel-based methods (e.g., scale restrictions and boundary ambiguity). 3) Proposed High-Frequency Noise Modulator (HFNM). Separates high-frequency components via wavelet transform and introduces controllable noise perturbations in high-frequency regions to improve robustness against label noise.
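    The resolution-agnostic prediction mentioned in point 2) can be illustrated with a small sketch: an implicit decoder is queried at continuous coordinates, so the same feature map can produce an output grid of any size. This is only a toy under stated assumptions; the single linear layer standing in for the decoder and the names `make_coord_grid`, `sample_bilinear`, `predict_occupancy`, `weights`, and `bias` are hypothetical and not taken from the paper.

    ```python
    import numpy as np

    def make_coord_grid(height, width):
        """Pixel-centre coordinates in [-1, 1], shape (height, width, 2) as (y, x)."""
        ys = (np.arange(height) + 0.5) / height * 2.0 - 1.0
        xs = (np.arange(width) + 0.5) / width * 2.0 - 1.0
        return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)

    def sample_bilinear(features, coords):
        """Bilinearly sample a (C, H, W) feature map at continuous coords in [-1, 1]."""
        _, h, w = features.shape
        # Map normalised coordinates back to fractional pixel indices.
        y = np.clip((coords[..., 0] + 1.0) / 2.0 * h - 0.5, 0, h - 1)
        x = np.clip((coords[..., 1] + 1.0) / 2.0 * w - 0.5, 0, w - 1)
        y0, x0 = np.floor(y).astype(int), np.floor(x).astype(int)
        y1, x1 = np.minimum(y0 + 1, h - 1), np.minimum(x0 + 1, w - 1)
        wy, wx = y - y0, x - x0
        top = features[:, y0, x0] * (1 - wx) + features[:, y0, x1] * wx
        bottom = features[:, y1, x0] * (1 - wx) + features[:, y1, x1] * wx
        return top * (1 - wy) + bottom * wy          # (C, H_out, W_out)

    def predict_occupancy(features, weights, bias, out_size):
        """Query a toy implicit decoder (one linear layer, an assumption) on any grid."""
        coords = make_coord_grid(*out_size)
        sampled = sample_bilinear(features, coords)
        # Decoder input per query point: sampled feature plus the coordinate itself.
        inputs = np.concatenate([sampled, np.moveaxis(coords, -1, 0)], axis=0)
        logits = np.tensordot(weights, inputs, axes=([0], [0])) + bias
        return 1.0 / (1.0 + np.exp(-logits))         # occupancy probability per point
    ```

    Calling `predict_occupancy` with `out_size=(128, 128)` or `out_size=(896, 896)` reuses the same features and parameters; only the set of query coordinates changes.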

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Innovative Hierarchical Channel-Attention Encoder (HCAE): Breakthrough in Multi-Scale Feature Fusion. SAM’s four-layer hierarchical features are processed via channel-wise average pooling to generate channel weight vectors. A randomly initialized vector introduces non-linear weighting (ReLU activation), followed by 3D cross-layer pooling and soft attention-based fusion. Overcomes limitations of naive multi-scale feature concatenation, adaptively focusing on critical anatomical structures (e.g., small polyp edges or large organ contours), achieving a 2.3% Dice improvement in fine-grained segmentation.

    2) Noise-Driven Implicit Decoder (NBIS): From Passive Noise Resistance to Active Optimization. Original Design: Proposes the first three-stage implicit decoding framework leveraging noise perturbation (speckle + Gaussian noise) to generate feature variance maps, localize unstable indices, and refine regions progressively. Identifies challenging regions using feature variance v from noise perturbations, dynamically allocating decoding resources. The first stage predicts global results; subsequent stages refine only high-variance regions (e.g., ambiguously labeled lesion boundaries), reducing computational redundancy.

    3) High-Frequency Noise Modulator (HFNM): Pioneering Frequency-Domain Noise Resistance. First integration of wavelet transform with targeted noise injection for enhancing robustness in high-frequency components (edges/textures).

    4) Clinically Oriented Rigorous Evaluation: Beyond Conventional Validation. Achieves 92.07% Dice (only 0.61% drop) when transferring from Kvasir (endoscopic images) to CVC (different imaging devices) without fine-tuning, outperforming nnUNet’s 9.9% performance drop. Supports arbitrary output resolutions (128×128 to 896×896), maintaining 92.8% Dice even at low-resolution inputs (128²), addressing clinical multi-device compatibility needs. Evaluates under three noise types (random noise, structural mislabeling, boundary shifts), retaining 89.1% Dice under extreme noise (noise ratio 0.8), 4.7% higher than I-MedSAM.
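    The variance-driven selection described in point 2) can be sketched roughly as follows: perturb the features several times with speckle and Gaussian noise, measure how much a prediction changes at each index, and keep the most variable indices for refinement. This is a hedged illustration only. The paper's exact variance definition and selection rule are not given on this page, and `perturb`, `select_unstable_indices`, `predict_fn`, `top_fraction`, and `n_samples` are hypothetical names introduced here.

    ```python
    import numpy as np

    def perturb(features, rng, gaussian_std=0.1, speckle_std=0.1):
        """Apply multiplicative (speckle) and additive (Gaussian) noise."""
        speckle = features * rng.normal(0.0, speckle_std, size=features.shape)
        gaussian = rng.normal(0.0, gaussian_std, size=features.shape)
        return features + speckle + gaussian

    def select_unstable_indices(features, predict_fn, top_fraction=0.1,
                                n_samples=16, rng=None):
        """Return indices whose predictions vary most under noise, plus the variances.

        features:   (N, D) array, one feature vector per query point.
        predict_fn: maps an (N, D) array to N scores (stand-in for a decoder stage).
        """
        rng = np.random.default_rng() if rng is None else rng
        preds = np.stack([predict_fn(perturb(features, rng))
                          for _ in range(n_samples)])
        variance = preds.var(axis=0)                 # one value per index
        k = max(1, int(round(top_fraction * variance.shape[0])))
        unstable = np.argsort(variance)[-k:]         # k highest-variance indices
        return np.sort(unstable), variance
    ```

    In this toy setting, points close to a decision boundary change their prediction most under perturbation, so they are the ones flagged for a later refinement stage, while stable points keep their first-stage result.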

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Unclear Motivation for Continuous Representations. The paper claims that discrete pixel-wise predictions lead to suboptimal results but fails to identify the root causes of this suboptimality. While continuous representations (e.g., implicit neural representations) are proposed as an alternative, the authors do not explicitly explain why discrete methods underperform in medical segmentation tasks (e.g., boundary ambiguity, resolution dependency). 2) Incomplete Methodological Explanation in HCAE. The description of the Hierarchical Channel-Attention Encoder (HCAE) focuses on procedural steps but omits critical justifications. 3) CWHA vs. Existing Feature Fusion: No comparison is made between the proposed Channel-Wise Hierarchical Attention (CWHA) and established fusion techniques (e.g., ASPP in DeepLab, multi-scale transformers), leaving its superiority unproven. Three-Stage Decoder Validation: The paper does not validate the necessity of three stages. For example, would a two-stage decoder suffice? How much performance gain is attributable to each stage?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    1) Component Effectiveness Demonstrated. Ablation experiments confirm the effectiveness of key components (e.g., HCAE, NBIS, HFNM) and their contribution to improved accuracy. 2) Insufficient Differentiation from Prior Work. The paper fails to clearly articulate the novelty of its approach compared to existing implicit methods (e.g., I-MedSAM, IOSNet). 3) Weak Motivation: The root causes of suboptimal performance in implicit methods were not thoroughly investigated. The authors claimed that existing implicit methods underperform, but did not analyze the causes of these issues and failed to theoretically explain how HCAE/NBIS specifically address these problems.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors’ rebuttal gave a partially reasonable explanation for my concerns. According to the experimental results, the proposed method is valuable. Hopefully the authors will make the changes promised in the rebuttal.



Review #3

  • Please describe the contribution of the paper

    The paper attempts to solve the problem of imperfect medical image segmentations resulting from (1) expensive discontinuous/discrete coordinate-based representation and (2) noisy labels using a combination of a hierarchical attention encoder, a multi-stage implicit representation-based decoder, and dedicated modules to mitigate the effects of label noise. The provided framework has components that are elegantly placed, and the experiments show that they all work well to generate high-quality segmentation results (at least as per the quantitative results).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model combines a hierarchical attention encoder, a multi-stage implicit representation-based decoder, and dedicated modules to mitigate the effects of label noise. Although it seems like a complicated engineering solution, this method solves the two open problems in Medical Image Segmentation. While the components independently are not too novel, the use of decoder-only implicit representation combined with a powerful SAM-like encoder leveraging hierarchical features, along with the modules to combat noisy labels, looks like a novel method in synergy. The experiments are good enough to show the good performance of the model and its components. The major strengths of this paper are the ease of use (due to the use of a SAM-based encoder) and superior segmentation performance in quantitative metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. My biggest concern about this method is that the baseline alone (SAM with LoRA) is good enough to beat almost all of the comparative methods chosen (as shown in the ablation table vs main table), hence it seems almost unfair to the other methods that did not have the leverage of such a powerful encoder. A more reasonable comparison would have been using the ideas presented in the other methods along with the SAM+LoRA. That would have been the true test of how good the proposed components are as compared to their contemporaries. For example, perhaps the ideas of implicit representation with global and local coordinates used in one of the baselines, SwIPE, would have worked better with SAM+LoRA and even better than the decoder-only implicit representation presented in this paper?

    2. Similarly, only occupancy value-based implicit methods are explored in this paper. Are these the best methods? How about Signed Distance Function (SDF)-based implicit methods? Could SDF methods provide more continuity in the segmentation structure than occupancy value alone?
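    For readers unfamiliar with the distinction raised in this point, the sketch below contrasts the two targets on a small binary mask: occupancy is a 0/1 label per point, whereas a signed distance function stores the distance to the object boundary with a sign for inside or outside. It is a brute-force illustration under stated assumptions (pixel-centre distances, negative inside), and the helper name `signed_distance` is hypothetical; it says nothing about which representation would work better in the reviewed method.

    ```python
    import numpy as np

    def signed_distance(mask):
        """Brute-force signed distance (in pixels) for a small 2D binary mask.

        Negative inside the object, positive outside. Each pixel's distance is
        measured to the nearest pixel of the opposite class, so the mask must
        contain both foreground and background.
        """
        mask = np.asarray(mask, dtype=bool)
        flat = mask.ravel()
        coords = np.argwhere(np.ones_like(mask))     # all pixel coordinates, row-major
        inside, outside = coords[flat], coords[~flat]

        def nearest(points, targets):
            d = np.linalg.norm(points[:, None, :] - targets[None, :, :], axis=-1)
            return d.min(axis=1)

        sdf = np.empty(mask.size, dtype=np.float64)
        sdf[flat] = -nearest(inside, outside)
        sdf[~flat] = nearest(outside, inside)
        return sdf.reshape(mask.shape)
    ```

    Thresholding the result with `sdf < 0` recovers the occupancy mask, while the magnitude additionally encodes how far each point lies from the boundary.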

    3. Perhaps qualitative segmentation results would have told us more about how good the segmentation quality is with this method. If the argument was about discrete methods yielding poor segmentation quality, visual results showing the areas where this method improved the segmentation performance would have been amazing.

    4. There are inconsistent usages of terminology (the component “Hierarchical Channel-Attention Encoder” is referred to as HCAE somewhere and HACE elsewhere).

    5. Table 5 is mentioned in the ablation studies, but the paper doesn’t have a Table 5. Was that part of the supplementary section? Or did you mean Table 4?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the components alone are not novel, the combination of them is novel enough and yields good segmentation performance. I have issues with how the comparisons are made with the existing methods, but since the method beats all of them comprehensively, it should be fine. Qualitative segmentation results would have definitely helped the case of the paper even more.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My concerns were mostly addressed and the authors have agreed to include qualitative segmentation results in the final version. This should make the paper very informative and a good read for the audience.




Author Feedback

We sincerely thank the reviewers for their positive feedback on our work: novel integration of HCAE, NBIS, and HFNM modules for effective multi-scale feature extraction [R1, R3], rigorous evaluation [R1], NBIS decoder was seen as novel [R1, R2, R3], elegant component synergy [R2], superior segmentation performance [R2, R3], robustness to label noise [R1, R2, R3], clinical transferability [R1], and ease of use via a SAM-based encoder [R2].

Common Response:

Existing implicit methods often rely on features extracted from a single resolution and do not sufficiently capture both global context and local details, leading to poor segmentation performance in complex structures and boundary regions. To address this, we utilize a hierarchical encoder based on SAM to capture both fine-grained and coarse-level features. These features are further refined using a Channel-Wise Hierarchical Attention (CWHA) mechanism, which emphasizes the most informative features across spatial scales. Additionally, to further improve performance in boundary regions, we explicitly measure the variance between clean and noisy features and use this information to select unstable indices for refinement within a three-stage decoder. This targeted refinement allows the model to focus on unstable areas, typically found near object boundaries, resulting in significantly better boundary delineation. Compared to I-MedSAM, our method reduces the Hausdorff Distance (a boundary-sensitive metric) by 1.85 voxels. Furthermore, existing implicit methods struggle under noisy label conditions, as shown in Table 3. Even when equipped with explicit noise-handling mechanisms (e.g., I-MedSAM V2), their performance remains limited. In contrast, our method outperforms I-MedSAM, demonstrating greater robustness to noisy annotations.
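As background for the boundary-sensitive metric cited above, the following is a minimal sketch of a symmetric Hausdorff distance between two binary masks. It makes several assumptions not confirmed by this page: distances are taken between all foreground pixels (many papers instead use boundary points or the 95th-percentile variant), units are pixels, and the function name `hausdorff_distance` is hypothetical.

```python
import numpy as np

def hausdorff_distance(mask_a, mask_b):
    """Symmetric Hausdorff distance (in pixels) between two binary masks."""
    a = np.argwhere(np.asarray(mask_a, dtype=bool))
    b = np.argwhere(np.asarray(mask_b, dtype=bool))
    if len(a) == 0 or len(b) == 0:
        raise ValueError("both masks need at least one foreground pixel")
    # Pairwise Euclidean distances between foreground pixels of A and B.
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    # Largest distance from any point in one set to its nearest point in the other.
    return float(max(d.min(axis=1).max(), d.min(axis=0).max()))
```

A single badly placed boundary pixel is enough to raise this value, which is why it complements overlap metrics such as Dice when judging boundary delineation.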

R1W1: Thank you for pointing this out. The main reason is that discrete representations lack spatial continuity, which leads to discretization artifacts and limited flexibility when dealing with arbitrary input sizes or fine-grained boundary details. We will revise the manuscript to make this explanation clearer.

R1W3,R3W7: We first experimented with a two-stage decoder; however, switching to a three-stage decoder led to a further improvement in the Dice score compared to the two-stage version.

R2W1: We appreciate the reviewer’s insightful comment. Our goal was to evaluate the effectiveness of our proposed components within a strong foundation (SAM+LoRA), as modern segmentation methods such as I-MedSAM increasingly rely on such powerful encoders. While combining prior methods (e.g., SwIPE) with SAM+LoRA could be informative, this is orthogonal to our contribution.

R2W3: Due to the space constraint, we could not include the qualitative segmentation results, and we will include them in the revised version.

R3W1: Yes, thank you for pointing this out. We’ll correct it in the revision.

R3W2: Yes, both X and \tilde{X} go through the patch embedding and are treated independently. Eq. (6) and (7) show losses for X and \tilde{X}, respectively. Eq. (8) combines both losses.

R3W3: We apologize for the confusion. The original performance (baseline) at 384×384 resolution is provided in Table 1.

R3W4: The correct abbreviation is Channel-Wise and Hierarchical Attention (CWHA). We will make it consistent.

R3W5: We had included the implementation details in the supplementary material, but they were removed during submission. We will provide a GitHub link to the implementation to support reproducibility.

R3W6: There are notable differences between the version of MedSAM published on arXiv in 2023 and the one published in Nature in 2024. We will include another SOTA method [Lei, Mengqi, et al., Condseg] published in 2025, which reports a Dice score of 89.1 on the Kvasir-Sessile dataset. Our method outperforms this approach, too.

R1, R2, R3 (Typos): Thank you for pointing these out. We will correct them in the revised version.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers believe the authors have properly addressed most of the concerns. I concur.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


