Abstract

MoDiff is a morphology-emphasized diffusion model designed to address the challenges of ambiguous medical image segmentation by overcoming the consistency and accuracy limitations of conventional probabilistic models. MoDiff utilizes probability-based label maps instead of traditional one-hot encoding to capture inherent uncertainties across diverse medical image annotations and to maintain consistency among sampled segmentation results. Additionally, MoDiff determines the presence or absence of individual radiologist labels, enabling diverse segmentation sampling and providing more comprehensive insights into ambiguous areas. MoDiff improves boundary precision through the Learnable Discrete Frequency Filter (LDF), which is designed to capture high-frequency, detail-specific information within the image domain. By preserving essential structural characteristics, LDF facilitates the retention of fine-scale details crucial for accurate boundary delineation. Integrated with the Morphology-based Cross Attention Network (MCA), LDF enhances feature synthesis, thereby enabling more precise segmentation of anatomical contours. Comprehensive evaluations on the LIDC-IDRI and MS-MRI datasets demonstrate that MoDiff achieves excellent segmentation accuracy, boundary precision, and consistency across samples.
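As an illustration of what a probability-based label map can look like, the sketch below averages several binary rater masks pixel-wise. This follows Reviewer #2's description of "averaged probabilistic label maps"; the function name `probability_label_map` and the toy annotations are hypothetical, and the paper's exact construction may differ.

```python
import numpy as np

def probability_label_map(masks):
    """Average K binary rater masks of shape (K, H, W) into one soft map in [0, 1].

    Assumption: the soft label is the per-pixel fraction of raters who marked
    the pixel as foreground. This is an illustrative choice, not the paper's code.
    """
    stacked = np.asarray(masks, dtype=np.float32)
    if stacked.ndim != 3:
        raise ValueError("expected an array of shape (K, H, W)")
    return stacked.mean(axis=0)

# Toy example: four hypothetical raters annotate a 2x3 region.
raters = np.array([
    [[1, 1, 0], [0, 0, 0]],
    [[1, 0, 0], [0, 0, 0]],
    [[1, 1, 0], [0, 1, 0]],
    [[1, 0, 0], [0, 0, 0]],
])
soft = probability_label_map(raters)
print(soft)
```

Pixels on which all raters agree map to 0 or 1, while disputed pixels take intermediate values, which is the ambiguity a one-hot label would discard.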

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3281_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AhnJun_MoDiff_MICCAI2025,
        author = { Ahn, Jung Su and Kwak, Ki Hoon and Seo, Jung Woo and Cho, Young-Rae},
        title = { { MoDiff: A Morphology-Emphasized Diffusion Model for Ambiguous Medical Image Segmentation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15962},
        month = {September},
        pages = {390 -- 399}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes MoDiff, a diffusion-based probabilistic segmentation model tailored for ambiguous medical image segmentation. Its two main contributions are 1) A module that operates in the frequency domain to extract high-frequency morphological features and suppress noise via learnable filtering. 2) An attention mechanism that fuses image patches and sampled label patches through positional encoding and multi-head self-attention to guide the reverse diffusion process using morphological structure. MoDiff employs probability-based label maps derived from multiple radiologist annotations to model inherent ambiguities in medical images more accurately. MoDiff is evaluated on the LIDC-IDRI (CT) and MS-MRI (brain MRI) datasets using metrics that assess segmentation accuracy, uncertainty calibration, and consistency. The model outperforms state-of-the-art probabilistic and diffusion models across all tested metrics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper tackles ambiguity in segmentation, especially important in multi-rater scenarios common in radiology and neuroimaging. 2) The use of probability-based label maps instead of discrete one-hot annotations is well-motivated and consistent with real-world inter-rater uncertainty. 3) Two novel architectural modules: a) LDF creatively uses frequency-domain filtering with learnable kernels, which is novel in this context. b) MCA encodes cross-attention between morphological features from the image and label domains, helping condition the diffusion process meaningfully. 4) The model consistently achieves state-of-the-art results on LIDC-IDRI and MS-MRI, improving on prior diffusion models like CCDM and CIMD.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) The rationale behind LDF’s use of frequency-space filtering with convolution + sigmoid is empirically motivated but not well justified theoretically. What are the benefits over standard spatial convolutions or adaptive frequency masks? Why does the sigmoid activation improve edge focus? 2) MCA’s contribution is largely implementation-driven. The attention mechanism appears to follow standard transformer-like design, with softmax suppression removed in some stages, which the authors claim improves contrast. However, this claim lacks experimental or theoretical grounding. 3) The proposed method uses T = 250 steps and heavy computation (4x RTX 3090), yet no discussion of inference time or runtime analysis is provided. This is critical for clinical adoption. 4) The model still relies heavily on sampling-based diversity, but no statistical analysis of variability across runs is offered to support claims of consistency. 5) The experiments are limited to only two datasets, both of which are patch-based 2D datasets with localized lesions. 6) The model’s performance under data imbalance, low training data regimes, or multi-modal full-volume data (e.g., 3D MRI or PET/CT) is not studied. 7) The technical writing in Sections 3.1 and 3.2 is dense and at times ambiguous. For example, the notation in Eq. (1)-(5) is inconsistent, the explanation of the MCA encoder and patch-wise attention formulation is hard to follow, and the rationale for omitting softmax normalization in Eq. (2) is not clearly justified.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    • How does LDF differ from traditional frequency-based denoising? Why are 3×3, 5×5, and 7×7 convolutions used? Are they fused?
    • Clarify the impact of omitting softmax in attention scoring. Does this empirically improve contrast? Include an ablation.
    • Add runtime and memory benchmarks. How long does training/inference take per sample? Can it run on 1 GPU? How many parameters?
    • For clinical relevance, consider adding experiments with noisy or misaligned annotations, as often seen in practice.
    • Figure 2 is under-described. Show difference maps or uncertainty maps to justify improved structural fidelity.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a well-motivated solution to the problem of ambiguous segmentation in medical imaging, with two interesting modules (LDF, MCA) and strong experimental performance.

    However, the lack of theoretical clarity, overly complex and unclear technical descriptions, limited dataset diversity, absence of efficiency analysis, and missing code or reproducibility guarantees reduce confidence in the paper’s broad impact and robustness.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a probabilistic segmentation model, MoDiff, a morphology-emphasized conditional diffusion model tailored for ambiguous medical image segmentation. In order to ensure preserving the anatomical structure boundaries, the method integrates two key modules: 1) a Learnable Discrete Frequency Filter (LDF) - a frequency-domain convolutional filter to capture high-frequency anatomical details; 2) a Morphology-based Cross Attention (MCA) module which is a patch-based cross-attention mechanism that enhances morphological feature extraction by jointly processing the input image and sampled label. Experiments on the LIDC-IDRI and MS-MRI datasets demonstrate that MoDiff outperforms state-of-the-art stochastic segmentation methods across several metrics (GED, HM-IoU, NCC, CI). Ablation and case studies further validate the contributions of LDF and MCA.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Using averaged probabilistic label maps instead of randomly sampled one-hot labels provides a more stable representation of ambiguous annotations, an approach not commonly applied in diffusion models.
    • The combination of LDF (FFT-based filtering) and MCA (attention mechanism) to handle anatomical detail and noisy gradients in diffusion steps.
    • LDF extracts high-frequency content without relying on handcrafted filters (like Sobel or Canny).
    • Quantitative and qualitative evaluations are strong across two datasets.
    • Ablation studies are provided for LDF and MCA individually and jointly, supporting their importance.
    • Comparison with traditional edge detectors is useful.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • LDF Not Fully Novel: the LDF module is essentially a learned frequency-domain filter applied with standard FFT/IFFT, using standard convolution and sigmoid gates.
    • The idea of learnable FFT filtering is not new. Prior works (e.g., MedSegDiff [Wu et al. 2024], Fourier U-Nets) already used frequency-domain regularization.
    • Comparison to MedSegDiff is not provided.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Several points are raised below that need clarification in my opinion.

    1. I am not sure I have understood what is considered as input in the MoDiff training scheme described in Fig. 1: are the input images x_b the 2D 128x128-pixel images centered on the lesions, captured on successive slices of the 3D patient data? (The experimental dataset in §4.1 gives this impression.) In this case H = 128, L = 128 and the relationship between the number of patches in the B and S sets (page 3, below eq. 1) would be m = 4n (since a label L_b or x_T contains 4x more patches than the original image x_b)? This also means that the training dataset contains only 128x128 images of lesions and no other “healthy” regions outside the lesions (with no associated labels, or label = 0)? If this is not the case, please clarify the patch size used to partition the input images.

    2. There is an inconsistency in the MCA module description through eqs. 3-5 versus its graphical representation in Fig. 1c. According to equations 3 and 4, LDF and LN are applied to y^B = (f^B concat S’) before entering it into MSA, whereas in figure 1c of the MCA, (f^B concat S’) is shown as the input to the MSA, thus after LDF and LN application (according to figure 1a). Either figure 1c or equations 3 and 4 are incorrect and need revision.

    3. In fig 1c, the transpose block is not shown after W_k module. Also, MCA shows two inputs whereas in fig 1a, MCA receives three inputs - this requires more precision in fig 1c (the text mentions on page 4 a concatenation with x_t and also between original image and “the generated c_t, which is then used to infer the label for the next step” - so the third input should come from there; by the way, c_t is not explained here).

    4. No detail is provided on f^B() and g^B() in the MCA module.

    5. Concerning the LDF, I would have appreciated deeper insight into why it is implemented in the frequency domain. According to eq. 6 and fig 1b, LDF implements a convolution with a learnable filter H in the frequency domain, normalizes the result and performs inverse FFT. This would be equivalent to a tensor multiplication in the spatial domain that could also be learned.

    6. LDF has a 2D matrix as input. From fig 1a, it is suggested that the attention score (eq. 2) ensures a 2D output mapping of the required dimensions. Do you confirm this (implementation details are missing)?

    7. How is y^B_target (eq. 7) generated? Does it correspond to the lower row in fig. 1a?

    8. Concerning the hyperparameters lambda_i (eq. 8), how were they chosen? The ablation study only informs on the interest of including both LDF and MCA. I also wonder how sensitive MCA is to patch size, number of attention heads, or positional encoding strategy.

    9. I regret that no comparison with MedSegDiff cited in §2.2 is provided. Also, it would be interesting to know if the improvements reported in Tables 1 and 2 are statistically significant.

    10. In Fig. 2, do the binary segmentation results shown for each method come from the probabilistic segmentation map (shown in colors) using several thresholds? What are their values?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Several points, listed in the additional comments above, need clarification in my opinion.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors proposed to use a morphology emphasized diffusion model for ambiguous medical image segmentation. The Learnable Discrete Frequency Filter (LDF) extracts high-frequency features for boundary precision, and the Morphology-based Cross Attention Network (MCA) enhances anatomical features. Evaluations showed improvement in segmentation accuracy and boundary precision on CT and MRI datasets.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed LDF is able to detect subtle boundary details and filters high-frequency noise, and the proposed MCA can synthesize the derived features into a robust condition for denoising.

    The evaluation and comparison are comprehensive.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Using a diffusion model to generate segmentation masks can be expensive in time and computation. Especially when annotations are available, is the generative method worth the additional cost? Please comment on this and add a time and memory comparison.

    If both LDF and MCA are removed, will the network be a simple diffusion model as in [2]? If both LDF and MCA are added to CCDM as in [8], will the performance be further improved?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    In the LDF loss, it is unclear how the ideal morphological feature maps were generated or acquired.

    Minor: all the table titles should be on top of each table.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of a morphologically enhanced diffusion model for ambiguous medical image segmentation is interesting and might help with challenging cases and subtle morphological textures. Some additional clarifications and discussions are needed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

The rationale behind LDF’s use of frequency-space filtering with convolution + sigmoid is empirically motivated but not well justified theoretically. What are the benefits over standard spatial convolutions or adaptive frequency masks? Why does the sigmoid activation improve edge focus?

  • We apply learnable frequency-domain kernels to generate image-adaptive high-frequency filters, unlike classical methods such as Canny or Sobel. To capture morphological features at multiple scales, we use 3×3, 5×5, and 7×7 kernels independently and concatenate their outputs channel-wise, followed by a 1×1 convolution. The sigmoid activation helps suppress overly strong frequency responses and stabilizes boundary enhancement.
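The pipeline described in this response (FFT, multi-scale 3×3/5×5/7×7 convolutions, 1×1 fusion, sigmoid gating, inverse FFT) might be sketched as follows. This is a minimal NumPy illustration under several assumptions that are not confirmed by the paper: the kernels act on the log-magnitude spectrum, the 1×1 convolution is modeled as a weighted sum of the three branches, and the random kernels stand in for learned weights. The names `ldf_sketch`, `conv2d_same`, and `mix_weights` are hypothetical.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive 'same' 2D cross-correlation with zero padding (odd square kernels)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(x, pad)
    out = np.zeros(x.shape, dtype=np.float64)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * padded[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ldf_sketch(image, kernels, mix_weights):
    """Gate the centered spectrum of `image` with a multi-scale, sigmoid-bounded mask."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    log_mag = np.log1p(np.abs(spectrum))          # assumed input to the kernels
    branches = [conv2d_same(log_mag, k) for k in kernels]
    # A 1x1 convolution over concatenated single-channel branches is a weighted sum.
    fused = sum(w * b for w, b in zip(mix_weights, branches))
    gate = sigmoid(fused)                         # values bounded in [0, 1]
    filtered = spectrum * gate
    # The gate need not be symmetric, so keep only the real part of the inverse.
    return np.real(np.fft.ifft2(np.fft.ifftshift(filtered))), gate

rng = np.random.default_rng(0)
image = rng.random((16, 16))
kernels = [0.1 * rng.standard_normal((k, k)) for k in (3, 5, 7)]  # untrained stand-ins
edges, gate = ldf_sketch(image, kernels, [1 / 3, 1 / 3, 1 / 3])
```

In a trained model the kernels and fusion weights would be optimized with the rest of the network; here they only demonstrate the data flow and the bounding effect of the sigmoid on the frequency mask.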

MCA’s contribution is largely implementation-driven. The attention mechanism appears to follow standard transformer-like design, with softmax suppression removed in some stages, which the authors claim improves contrast. However, this claim lacks experimental or theoretical grounding.

  • We implemented an attention mechanism inspired by transformer designs but removed softmax at certain stages to avoid suppressing weak but meaningful boundary cues. This design retains raw attention scores, thus preserving contrast between structure and noise.

LDF Not Fully Novel: the LDF module is essentially a learned frequency-domain filter applied with standard FFT/IFFT, using standard convolution and sigmoid gates.

  • Although LDF uses standard FFT/IFFT operations, it extends prior work by applying multi-scale, learnable frequency-domain filters. This design enables LDF to model both sharp edges and broader context, improving upon earlier edge-focused filters.

I am not sure I have understood what is considered as input in the MoDiff training scheme described in Fig. 1.

  • Each x_b corresponds to a 128×128 lesion-centered 2D slice extracted from the 3D patient data. Healthy regions were excluded. For MCA, x_b was divided into 2×2 patches and x_t into 4×4 patches, resulting in m = 4n.
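The patch counts in this response can be checked with a short sketch. It assumes that "2×2 patches" and "4×4 patches" refer to a 2×2 and a 4×4 grid over the 128×128 slice (giving n = 4 and m = 16), which is the reading consistent with m = 4n; the helper `grid_patches` is hypothetical.

```python
import numpy as np

def grid_patches(image, grid):
    """Split an (H, W) array into a grid x grid set of equal, non-overlapping patches."""
    h, w = image.shape
    if h % grid or w % grid:
        raise ValueError("image size must be divisible by the grid size")
    ph, pw = h // grid, w // grid
    # (grid, ph, grid, pw) -> (grid, grid, ph, pw) -> (grid * grid, ph, pw)
    return image.reshape(grid, ph, grid, pw).swapaxes(1, 2).reshape(grid * grid, ph, pw)

x_b = np.arange(128 * 128, dtype=np.float32).reshape(128, 128)  # stand-in image slice
x_t = np.zeros((128, 128), dtype=np.float32)                    # stand-in noisy label

image_patches = grid_patches(x_b, 2)   # n = 4 patches of 64x64
label_patches = grid_patches(x_t, 4)   # m = 16 patches of 32x32
n, m = len(image_patches), len(label_patches)
print(n, m, m == 4 * n)
```

Under the alternative reading, where 2×2 and 4×4 are patch sizes in pixels, the image would yield more patches than the label, which would contradict m = 4n.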

There is an inconsistency in the MCA module description through eqs. 3-5 versus its graphical representation in Fig. 1c. According to equations 3 and 4.

  • We confirm that LDF and LN precede the multi-head attention module. We have corrected Figure 1c and revised the accompanying explanation.

Concerning the LDF, I would have appreciated deeper insight into why it is implemented in the frequency domain.

  • We chose frequency-domain convolution because it captures global context more effectively than spatial filters, which in turn allowed us to reduce the number of diffusion steps. We have highlighted this observation in the revised text.
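For reference, Reviewer #2's remark that a frequency-domain convolution corresponds to a spatial-domain multiplication rests on the convolution theorem for the discrete Fourier transform. The sketch below checks both directions numerically in 1D with NumPy's unnormalized FFT convention. It illustrates only the linear identity; the sigmoid gating described for LDF is nonlinear, so the equivalence would not carry over exactly.

```python
import numpy as np

def circular_convolve(a, b):
    """Direct O(N^2) circular convolution of two length-N sequences."""
    n = len(a)
    return np.array([sum(a[l] * b[(k - l) % n] for l in range(n)) for k in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
h = rng.standard_normal(8)
X, H = np.fft.fft(x), np.fft.fft(h)

# Convolution in frequency <-> pointwise multiplication in space (scaled by 1/N).
freq_conv = circular_convolve(X, H) / len(x)
spatial_product = np.fft.fft(x * h)

# Pointwise multiplication in frequency <-> circular convolution in space.
freq_product = np.fft.ifft(X * H).real
spatial_conv = circular_convolve(x, h)

print(np.allclose(freq_conv, spatial_product), np.allclose(freq_product, spatial_conv))
```

The second identity is the one usually invoked for "global context": a pointwise mask in the frequency domain acts like a spatial kernel as large as the image.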

How is y^B_target (eq. 7) generated? Does it correspond to the lower row in fig. 1a?

  • y^target_B was generated by applying morphological extraction directly to the ground-truth label masks, as shown in the lower branch of Figure 1a. We have clarified this.

Concerning the hyperparameters lambda_i (eq. 8), how were they chosen?

  • We searched λ_1 and λ_2 over {0, 0.25, 0.5, 0.75, 1} and found that 0.5 yielded the best validation results.

In Fig. 2, the binary segmentation results shown for each method come from the probabilistic segmentation map (shown in colors) using several thresholds?

  • The binary segmentation masks in Figure 2 were obtained by thresholding the probability maps at 0.9. We have included this detail in the revised caption.
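The thresholding step described here amounts to a single comparison. In the sketch below, the helper `binarize` and the toy probability map are hypothetical, and treating a pixel at exactly 0.9 as foreground (`>=` rather than `>`) is an assumption, since the response does not say which is used.

```python
import numpy as np

def binarize(prob_map, threshold=0.9):
    """Return a uint8 mask with 1 where the probability reaches the threshold."""
    return (np.asarray(prob_map) >= threshold).astype(np.uint8)

# Toy probability map; values are illustrative only.
prob = np.array([[0.95, 0.90, 0.50],
                 [0.10, 0.99, 0.89]])
mask = binarize(prob)
print(mask)
```

A threshold this high keeps only pixels on which the sampled segmentations agree almost unanimously, so lower thresholds would produce visibly larger masks from the same probability map.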

In LDF loss, how are the ideal morphological maps obtained?

  • y^target_B was computed by applying attention-style weighting to the ground-truth labels followed by LDF, which encourages the model to match realistic morphology.

  • We once again thank the reviewers for their insightful comments. We are confident that the above revisions will substantially improve the clarity and impact of our work.




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


