Abstract
Owing to its superior soft tissue contrast, Magnetic Resonance Imaging (MRI) has become a cornerstone modality in clinical practice. This prominence has driven extensive research on MRI-based segmentation, supported by the proliferation of publicly available benchmark datasets. Although public datasets employ multi-expert consensus protocols to ensure annotation quality, inherent label noise, particularly prevalent at lesion boundaries, remains unavoidable. To address this fundamental challenge, we introduce a novel machine learning paradigm that reframes dataset annotations as probabilistic weak supervision rather than deterministic gold standards. We propose AffinityUMamba, a novel dual-branch UNet-like framework that synergistically integrates convolutional operations with state space models, leveraging local feature coherence and global contextual agreement, together with a Local Affinity-guided Label Refinement (LALR) module that identifies potentially noisy labels in the training data and produces refined pseudo labels. A unified uncertainty constraint paradigm combines margin-based logit smoothing with local affinity refinement, enabling simultaneous optimization of segmentation accuracy and confidence calibration. Training is stabilized through a composite objective that combines topological preservation constraints with margin-aware uncertainty penalization, enabling joint optimization of structural coherence and detail fidelity. We comprehensively evaluated the proposed method on 12 public datasets spanning multiple modalities: 10 MRI, 1 Ultrasound, and 1 CT. Our experiments demonstrate improved segmentation performance and reduced prediction uncertainty.
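The "margin-based logit smoothing" named in the abstract is not specified there; the following is a minimal sketch of what such a penalty might look like, assuming a PyTorch implementation, a top-two logit gap, and the margin value 8 cited in the author feedback below. The function name and penalty form are assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def margin_logit_penalty(logits: torch.Tensor, margin: float = 8.0) -> torch.Tensor:
    """Hypothetical margin-based logit smoothing: penalize the gap between
    the two largest class logits wherever it exceeds `margin`, discouraging
    overconfident predictions. The margin 8 follows the rebuttal's answer
    on the constant in Eq. (8); the penalty form itself is an assumption.
    logits: (B, C, H, W)."""
    top2 = logits.topk(2, dim=1).values    # largest and second-largest logits
    gap = top2[:, 0] - top2[:, 1]          # per-pixel confidence margin
    return F.relu(gap - margin).mean()     # penalize only the excess

# Usage: add the penalty to the usual segmentation loss with a small weight.
logits = torch.randn(2, 4, 64, 64)
target = torch.randint(0, 4, (2, 64, 64))
loss = F.cross_entropy(logits, target) + 0.1 * margin_logit_penalty(logits)
```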
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2411_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
ACDC: https://www.creatis.insa-lyon.fr/Challenge/acdc/
iSeg2017: https://iseg2017.web.unc.edu/
BraTS2020: https://www.med.upenn.edu/cbica/brats2020/
ISLES2022: http://www.isles-challenge.org/
PROMISE2012: https://promise12.grand-challenge.org/
MyoPS2020: https://zmiclab.github.io/zxh/0/myops20/
MSD: http://medicaldecathlon.com/
AMOS2022: https://amos22.grand-challenge.org/
ATLAS2022: https://atlas.grand-challenge.org/
CuRIOUS2022: https://curious2022.grand-challenge.org/
BibTex
@InProceedings{ZhaYuk_AffinityUMamba_MICCAI2025,
author = { Zhang, Yukun and Wang, Guisheng and Nailon, William Henry and Cheng, Kun},
title = { { AffinityUMamba: Uncertainty-Aware Medical Image Segmentation via Probabilistic Weak Supervision Beyond Gold-Standard Annotations } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15962},
month = {September},
pages = {34 -- 44}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes the AffinityUMamba framework, which mitigates the effects of noise or flaws in human annotations on training segmentation models. The framework uses a dual-branch architecture, with one branch being a U-Net enhanced with Mamba blocks and the other using CNN decoders. It includes a Local Affinity-guided Label Refinement (LALR) module that identifies noisy labels in the training data and produces refined pseudo labels. Implicit Affinity Enhancement (IAE) and Pixel Neighborhood Enhancement (PNE) modules are also proposed.
Experiments were performed on 12 datasets against five comparison methods.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The goal of addressing discrepancies in human annotations is an interesting problem.
- Using 12 public datasets spanning 10 MRI, 1 Ultrasound, and 1 CT modalities against five comparison methods is impressive, especially for a conference paper.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The methodology is difficult to follow and lacks rigor. The writing and clarity need to be improved.
- “Affinity” is a key concept in the paper, but it is not properly defined. This makes it difficult to connect the formulations with “affinity”.
- In Section 2.1 (IAE), the authors mention that Eq. (1) “encodes semantic relationships among neighboring pixels by comparing each pixel’s value to the neighborhood’s average, enhancing the corresponding features in AM based on these implicit affinities”. Nevertheless, how Eq. (1) achieves this is nontrivial and is not properly explained. Furthermore, why are CNN features used to guide the AM features and not the other way round?
- In Section 2.2 (PNE), “This module enhances local continuity in CNN decoder output features by directly leveraging multi-scale intensity relationships from the input image, addressing ground truth inaccuracies caused by inter-observer variability”. Again, how Eq. (2) addresses ground truth inaccuracies is unclear.
- In Section 2.3 (LALR), Eq. (5) checks for conflicting patterns by comparing cosine similarities of features between pixels with the same and different labels, which makes sense (a paraphrase of this check is sketched after this list). Nevertheless, how do we know whether such conflicts are caused by data uncertainty or model uncertainty?
- In the experiments, it feels contradictory that the human annotations were still treated as gold standard when computing the Dice and HD.
- At inference, which branch provides the segmentation?
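For concreteness, the Eq. (5) conflict check described above can be paraphrased as: a pixel is suspect when its features are more similar to differently-labeled neighbors than to same-labeled ones. A minimal sketch under that reading follows; the function name, neighborhood size, and mean aggregation are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def lalr_conflict_mask(feats: torch.Tensor, labels: torch.Tensor, k: int = 3) -> torch.Tensor:
    """Paraphrase of the Eq. (5) conflict check described in the review:
    flag pixels whose features agree more (in cosine similarity) with
    differently-labeled neighbors than with same-labeled ones.
    feats: (C, H, W) decoder features; labels: (H, W) integer annotation.
    Returns a boolean (H, W) mask of potentially noisy pixels."""
    C, H, W = feats.shape
    f = F.normalize(feats, dim=0)  # unit-norm feature per pixel
    # Collect the k*k neighborhood of features and labels at every pixel
    # (zero padding mildly affects border pixels; acceptable for a sketch).
    fn = F.unfold(f.unsqueeze(0), k, padding=k // 2).view(C, k * k, H * W)
    ln = F.unfold(labels[None, None].float(), k, padding=k // 2).view(k * k, H * W)
    cos = (f.view(C, 1, H * W) * fn).sum(0)                    # (k*k, H*W) cosine sims
    same = (ln == labels.view(1, H * W).float()).float()
    diff = 1.0 - same
    sim_same = (cos * same).sum(0) / same.sum(0).clamp(min=1)  # mean sim, same label
    sim_diff = (cos * diff).sum(0) / diff.sum(0).clamp(min=1)  # mean sim, other labels
    return (sim_diff > sim_same).view(H, W)                    # conflicting pattern
```

As the reviewer notes, such a mask cannot by itself distinguish whether a flagged conflict stems from a noisy annotation or from poorly trained features.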
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(1) Strong Reject — must be rejected due to major flaws
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper is difficult to follow, and important details and rationales are missing. The proposed framework is lacking in rigor.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Although the framework can produce better-calibrated models, it is unclear what types of uncertainty are addressed: they could be model uncertainty, interpatient variability, imaging artifacts, or annotation uncertainty. Neither the paper nor the rebuttal answers this question; they simply state that annotation uncertainty is being addressed, which is unpersuasive, especially without experiments on datasets in which each image has multiple annotations. This is why the paper lacks rigor, and I decided to reject it.
Review #2
- Please describe the contribution of the paper
The authors address the issue of aleatoric uncertainty, which arises from errors in data annotations. To overcome this challenge, they propose a novel end-to-end framework that employs a weak-supervision approach, treating ground truth labels as pseudo labels and refining them throughout the training process. They integrate Mamba with a CNN-based framework to effectively manage complex predictions at object boundaries and mitigate label noise. The authors provide comprehensive analyses across 12 datasets to validate their approach.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths
- The authors present a significant challenge in medical imaging, which is the noise in the annotations.
- To address this issue, they introduce an innovative dual-branch, U-Net-like framework that combines CNNs with SSMs (Mamba).
- Additionally, they employ ground-truth labels as pseudo-labels and propose a novel label refinement strategy.
- To mitigate overly confident predictions resulting from noisy boundaries, they also introduce a margin-based smoothing loss.
- Their extensive validation across 12 datasets further reinforces their assertions.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Weaknesses
- The authors suggest that “when combined with CNNs, SSMs are capable of retaining local details and boundary features while simultaneously capturing global information”. However, they do not provide any evidence to support this claim. Is this information generally accepted knowledge, or was it demonstrated through empirical validation? It would be important for the authors to clarify why SSMs, particularly in conjunction with CNNs, possess this capability or to reference studies that substantiate this statement. This clarification is especially important given that the approach is fundamentally based on SSMs, and understanding the motivation is crucial.
- Moreover, as the authors address the challenges of data uncertainty and label noisiness, it raises the question of how to determine if the observed improvements are statistically significant, considering the potential for noisy labels. Calculating the mean and standard deviations would be useful to illustrate the statistical significance of the proposed approach.
- Furthermore, the authors have tested a limited number of combinations of their proposed components in the ablation study. I am curious about the reasons for not conducting an ablation study on all possible combinations of LALR, IAE, and PNE.
- It appears that the first CNN block in Fig. 1 is missing its input.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The authors are tackling a crucial challenge in biomedical imaging to consider variability/uncertainty in annotated labels when training deep nets. They propose an innovative approach that integrates SSM with CNN in a UNet-like framework, employing various affinity-based refinement techniques. Their validation supports their claims about the improvement. However, it is essential to clarify the motivation behind adopting this framework, either by referencing previous studies or through their own insights. As noted in the weaknesses section, the manuscript does not adequately explain why the combination of CNNs and SSMs is effective in retaining local details and boundary features while also capturing global information. Providing some background to support this assertion is crucial. Additionally, calculating mean and standard deviations would strengthen their claims. Overall, the contributions are well-articulated, and the concerns raised could be easily addressed in the authors’ response. Therefore, I am inclined towards accepting this paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
After reading the rebuttal and the other reviews, I feel that the contribution is worth publishing. The rebuttal could have been stronger, and I am assuming that the Authors will follow up on all of their promises.
Review #3
- Please describe the contribution of the paper
This paper presents a medical image segmentation method based on U-Net/Mamba, guided by uncertainty and the identification of noisy labels during training. The method was evaluated on 12 public datasets against 5 baselines, showing improved performance and reduced uncertainty (expected calibration error).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The formulation is novel in its use of uncertainty for label refinement in segmentation
- While the improvement over baselines is sometimes relatively minor, the consistency for both performance and expected calibration error over all datasets is impressive. This makes for a convincing evaluation.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Some constants in the equations are unexplained, e.g. 8 in Equation 8.
- The uncertainty for Ours in Figure 2 is surprisingly focused on the boundary. I would have expected some more spread particularly with the smaller structures.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The novelty of the formulation and its sensible use of uncertainty along with the consistent improvement shown.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The use of uncertainty in this context is novel and will be a nice contribution to the conference. I found the rebuttal convincing.
Author Feedback
Abbreviations: W = Weakness, R = Reviewer, A = Answer.
Major Concerns:
- Low uncertainty for small structures. (W2 R1) A: Our method refines high-uncertainty (aleatoric) boundaries resulting from annotation difficulty or multi-observer variation at indistinct margins between structures. Small structures with clear boundaries commonly exhibit low aleatoric uncertainty, hence no greater spread in Fig. 2.
- CNN and Mamba synergy. (W1 R2) A: CNNs (local) and SSMs (Mamba, global) are complementary. Our dual-branch architecture (with IAE and PNE interactions) leverages both kinds of feature extraction. Tables 1-2 demonstrate this synergy, which will be further supported by reference [a] in the Camera-Ready (CR) version. [a] HCMA-UNet: A Hybrid CNN-Mamba UNet with Inter-Slice Self-Attention for Efficient Breast Cancer Segmentation.
- Statistical significance and ablation study (W2 and W3, R2) A: Good advice about reporting standard deviations to better demonstrate statistical significance. We do have the standard deviation results but omitted them owing to limited space in Tables 1 and 2; we will include them in future work. The ablation study targeted the core module, LALR, for its performance in identifying potentially incorrect labels. IAE and PNE are auxiliary modules, which is why we did not conduct an ablation study on all possible combinations of LALR, IAE, and PNE.
- Data or model uncertainty (W4 R3) A: Good question. Our dual-branch decoder minimizes model uncertainty, enabling us to approximate data uncertainty with prediction uncertainty, which then better reflects the data’s inherent uncertainty. Fully disentangling model and data uncertainties is a future goal.
- Gold standard issue (W5 R3) A: We acknowledge that evaluating against noisy labels is imperfect; it serves to benchmark against other segmentation methods, and the evaluation remains valuable. Our method optimizes high-uncertainty regions (e.g., boundaries) while preserving expert annotations in other areas. Consistent improvements across datasets and metrics (DSC, HD95, ECE) and the Fig. 2 results show that our method corrects annotations without overfitting to them. R3 also noted that “using 12 public datasets is impressive”. These 12 public datasets are the gold standard for evaluation, just not a ‘perfect’ one.
Clarity Issues:
- Unexplained constant (W1 R1) A: The constant 8 (from [20]) controls the uncertainty margin. The CR will clarify this and ensure that all other constants are explained.
- Fig. 1 CNN input unclear (W4 R2) A: Fig. 1 is simplified. The CNN decoder inputs are UNet-like (skip connections and previous-layer features). The first CNN block receives the final encoder output. The CR will clarify this.
- Writing and clarity (W1, W2, W3, and W6 of R3) A: Overall, thanks for your advice; we will improve the presentation in the CR. W1: “Affinity” denotes similarity/relatedness between pixels or features in the context of image segmentation. Its operation is described before Section 2.1 and in Eq. (4), following references [1,23,28]. We will add an explanation in the Introduction for an easier connection to the equations. W2: As detailed after Eq. (1), IAE enhances AM feature discriminability by contrasting local differences from the CNN, highlighting boundaries. CNNs excel at local features, AMs at global ones; IAE leverages CNN affinity to boost AM feature discrimination, balancing local and global information. The reverse design would be less logically convincing. W3: PNE’s design assumes noisy annotations but relatively reliable image signals (containing tissue boundary information). Eq. (2) shows that PNE aggregates grayscale information using multi-scale Gaussian kernels, correcting CNN features via image continuity and mitigating annotation inaccuracies. W6: Inference uses the AM branch output. This is implicit: 1. the loss functions (Fig. 1 L_aff, L_margin and Sec. 2.3 Eqs. 7-8) supervise the AM branch; 2. the CNN branch assists the AM branch in generating y_refine and enhancing features; 3. Mamba offers strong global modeling, and IAE/PNE integrate the CNN’s local information. The CR will state this explicitly.
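Reading the W2 and W3 answers above literally, IAE and PNE might be sketched as follows. The module signatures, the sigmoid gate, the Gaussian scales, and the additive fusion are all assumptions based on the rebuttal's description, not the paper's code.

```python
import torch
import torch.nn.functional as F

def iae_gate(cnn_feat: torch.Tensor, am_feat: torch.Tensor, k: int = 3) -> torch.Tensor:
    """IAE as paraphrased in the W2 answer: contrast each CNN feature with
    its k x k neighborhood average and use the result to enhance the
    AM-branch features. The sigmoid gate and residual form are assumptions.
    Shapes: (B, C, H, W)."""
    local_avg = F.avg_pool2d(cnn_feat, k, stride=1, padding=k // 2)
    affinity = torch.sigmoid(cnn_feat - local_avg)  # implicit local affinity
    return am_feat * (1.0 + affinity)               # boost AM feature discriminability

def pne_smooth(image: torch.Tensor, cnn_feat: torch.Tensor,
               sigmas=(1.0, 2.0, 4.0)) -> torch.Tensor:
    """PNE as paraphrased in the W3 answer: aggregate grayscale intensities
    with multi-scale Gaussian kernels and inject the resulting continuity
    prior into the CNN decoder features. Scales and the additive fusion
    are assumptions. image: (B, 1, H, W); cnn_feat: (B, C, H, W)."""
    priors = []
    for s in sigmas:
        k = int(2 * round(3 * s) + 1)                          # ~3-sigma support
        x = torch.arange(k, dtype=torch.float32) - k // 2
        g1d = torch.exp(-x ** 2 / (2 * s ** 2))
        g2d = (g1d[:, None] * g1d[None, :]) / g1d.sum() ** 2   # normalized 2-D kernel
        priors.append(F.conv2d(image, g2d.view(1, 1, k, k), padding=k // 2))
    prior = torch.stack(priors).mean(0)                        # (B, 1, H, W) continuity prior
    return cnn_feat + prior                                    # broadcast over channels
```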
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
The reviewers have mixed comments on this work. The authors are invited to clarify several points relevant to the reviewers’ concerns: 1) unclear descriptions, such as for the term “affinity”; 2) the motivation of IAE, PNE, and LALR should be better explained; 3) the use of (noisy) human annotations for calculating Dice and HD seems problematic; 4) it is not clear whether the improvements are statistically significant; 5) more ablation studies are needed; 6) the uncertainty for small structures is not clear.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors proposed a medical image segmentation framework that integrates uncertainty estimation and noisy label identification to improve training with imperfect annotations. However, several weaknesses remain. The motivation for combining CNNs and SSMs is insufficiently justified, and some key concepts like “affinity” are not clearly defined. Important details and rationales behind the modules are missing, limiting clarity and rigor. Considering these issues, I recommend rejection.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
This paper has received mixed scores. Since the reviewers’ views differ, and all bring valid (if differing) points, I decided to read the manuscript myself. After reading the work, I am inclined toward recommending rejection, based on the complexity of the proposed methodology, which is not supported by the results. Indeed, looking at the ablation studies, the average DSC and ECE gains over the base model without any proposed component are rather marginal. In addition, the better performance compared to other models may come from a stronger baseline rather than from the proposed components (given the ablation results). Last, the authors ignore a full body of literature on calibration in both the discussion and the empirical validation.