Abstract

Polyp segmentation is a representative task in computer-aided clinical diagnosis in colonoscopy analysis. However, strict regulations limit the availability of large, high-quality image-mask paired datasets for segmentation. As a result, recent studies have focused on models that generate images conditioned on masks. However, due to rigid annotation constraints and a high reliance on fixed masks, the synthesized images often exhibit limited variation, leading to a lack of generalization in downstream tasks. This study introduces the Semantic Interpolative Diffusion Model (SIDM), which applies interpolation to both the given masks and the colonoscopy images to generate pairs of interpolated masks and images. First, a background semantic label was devised by labeling background regions based on the colonoscopy imaging environment. Both the masks and the background semantic labels are applied as multi-conditions to the diffusion model for colonoscopy image generation. After training, interpolation on both the masks and background semantic labels is performed at a chosen ratio. Applying the interpolated masks and labels to the model generates an intermediate perspective of colonoscopy images that partially incorporates features from each condition. By augmenting the dataset with these pairs of interpolated masks and generated images with interpolated conditions, segmentation models can extend the coverage of possible colonoscopy scenarios and mitigate the limitations of fixed masks, leading to robust generalization. Experimental comparisons against existing generative models, using the same test data across different segmentation models and different test datasets with the same model, demonstrate the effective generalization of the proposed model. The code is available at https://github.com/DSLab-MJU/SIDM.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3072_paper.pdf

SharedIt Link: https://rdcu.be/eHw7p

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05141-7_50

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/DSLab-MJU/SIDM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HeoCha_Semantic_MICCAI2025,
        author = { Heo, Chanyeong AND Jung, Jaehee},
        title = { { Semantic Interpolative Diffusion Model: Bridging the Interpolation to Masks and Colonoscopy Image Synthesis for Robust Generalization } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {519--529}
}

Reviews

Review #1

Please describe the contribution of the paper
- The paper proposes the Semantic Interpolative Diffusion Model, aiming to address the issues of insufficient sample diversity caused by fixed masks in colonoscopy image synthesis and limited generalization ability of downstream segmentation models.
- The paper designs a mask interpolation strategy, breaking through the constraints of fixed masks on lesion regions in traditional methods.
- The paper introduces background semantic labels and controls the background changes of generated images through label interpolation, achieving diversified generation of background environments.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper applies interpolation technology to both lesion masks and background semantic labels for the first time. Through signed distance maps, it realizes interpolation of masks in continuous space, thereby generating synthetic samples that contain both lesion morphology changes and integrated different background features.
- The model has achieved significant performance improvements on public datasets.
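
The signed-distance-map interpolation mentioned in the strengths above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the brute-force distance computation, the sign convention (negative inside the mask), and the meaning of `alpha` (weight on the second mask) are all assumptions made for illustration. In practice a library routine such as `scipy.ndimage.distance_transform_edt` would replace the brute-force loop.

```python
import math


def signed_distance(mask):
    """Brute-force signed distance map (assumed convention: negative inside, positive outside)."""
    h, w = len(mask), len(mask[0])
    inside = [(r, c) for r in range(h) for c in range(w) if mask[r][c]]
    outside = [(r, c) for r in range(h) for c in range(w) if not mask[r][c]]
    sdm = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Distance to the nearest pixel of the opposite class.
            targets = outside if mask[r][c] else inside
            if targets:
                d = min(math.hypot(r - tr, c - tc) for tr, tc in targets)
            else:
                d = float(h + w)  # fallback when the mask is entirely full or empty
            sdm[r][c] = -d if mask[r][c] else d
    return sdm


def interpolate_masks(mask_a, mask_b, alpha=0.5):
    """Blend the two signed distance maps and threshold at zero.

    alpha is the (hypothetical) weight on mask_b: 0.0 returns mask_a, 1.0 returns mask_b.
    """
    sdm_a, sdm_b = signed_distance(mask_a), signed_distance(mask_b)
    return [
        [1 if (1 - alpha) * a + alpha * b < 0 else 0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(sdm_a, sdm_b)
    ]


# Toy example: a single-pixel "polyp" and a 5x5 one on a 7x7 grid.
small = [[0] * 7 for _ in range(7)]
small[3][3] = 1
large = [[1 if 1 <= r <= 5 and 1 <= c <= 5 else 0 for c in range(7)] for r in range(7)]
halfway = interpolate_masks(small, large, alpha=0.5)
```

Blending in distance space rather than pixel space is what lets the intermediate mask have an intermediate size and shape instead of being a union or intersection of the two inputs.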
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- I have significant doubts about the contributions of this paper. The paper proposes to generate diverse image-mask paired data through interpolation. However, diverse data can also be generated by previous methods, such as SDM and ArSDM, combined with manually drawn masks (which does not seem very labor-intensive) or with mask interpolation. The core contribution of the paper seems to be only the mask interpolation algorithm, which the paper does not discuss in sufficient depth: (1) Regarding the interpolation ratio, it is unclear how to find the optimal value. In the experiments, different interpolation ratios (1:1, 1:3, 3:1) have a significant impact on model performance. Manual tuning is required for different models and datasets (for example, U-Net is suited to a 1:1 ratio and U-Net++ to a 3:1 ratio), which increases the complexity of practical application. In addition, the ablation study does not explore other ratios (such as 1:2, 1:4, or 2:3), nor does the paper explain why only the fixed 1:1, 1:3, and 3:1 ratios were selected. Finally, the paper does not discuss adaptive selection of the interpolation ratio. A manually selected ratio may make it difficult to quickly determine the optimal value when facing unknown data distributions. (2) The selection of the two images for interpolation also seems to affect the results. If images with small differences are selected, the effect may be minimal, wasting computational resources. How to select appropriate image pairs and avoid redundant computation therefore deserves in-depth discussion in the paper.
- The paper only uses segmentation metrics to verify the effect of generated images. It should also add image quality evaluation metrics (such as FID, KID, etc.) to assess the authenticity of generated images.
- In the experimental part, the paper only compares classical CNN-based segmentation models and does not compare models based on the Transformer architecture. There are significant differences between these two types of models. In addition, the paper does not compare with the latest (2024 - 2025) polyp segmentation approaches.
- The paper does not provide a code repository, and it does not state the implementation details in the experimental part, such as training hyperparameters, optimizers, training steps, etc., lacking information related to reproducibility.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not provide sufficient information for reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(3) Weak Reject — could be rejected, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, the paper presents an interpolation-based generation method with some demonstrated effectiveness. However, I am concerned about its actual contributions because there seem to be simpler ways to generate diverse samples. In addition, the paper is incomplete in several respects and still requires substantial improvement. I look forward to the authors’ rebuttal.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Reject
[Post rebuttal] Please justify your final decision from above.

This paper proposes a semantic interpolative image synthesis method. However, compared with previous similar methods, this method lacks innovation, and its model is cumbersome. The performance improvement is not significant, and there are many aspects in terms of the method and paper that deserve further optimization and major revision. I suggest that the paper be further revised.

Review #2

Please describe the contribution of the paper

The main contributions of this study are as follows: 1) The introduction of the Semantic Interpolative Diffusion Model (SIDM), which applies interpolation to both lesion masks and background semantic labels to generate diverse pairs of interpolated masks and images for colonoscopy image synthesis. 2) The novel definition of background semantic labels and the incorporation of these labels alongside masks as multi-conditions in the diffusion model, enhancing the generative process. 3) The demonstration, through extensive experiments, that SIDM improves the generalization capability of segmentation models by augmenting datasets with interpolated masks and images, achieving state-of-the-art results across multiple datasets and models.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The proposed SIDM interpolates both masks and background labels to generate diverse colonoscopy images, improving segmentation generalization.
2. Extensive experiments demonstrate SIDM’s superior performance compared to existing models across multiple datasets.
3. SIDM enhances segmentation performance on both seen and unseen datasets, showcasing robust model generalization.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. The paper does not provide experimental evidence to demonstrate the importance of mask size in generating diverse samples. The effect of varying mask sizes on image diversity remains unaddressed.
2. The paper fails to include a comparison with Polyp-Gen[1], a relevant and recently published work, which limits the thoroughness of the evaluation against state-of-the-art methods.
3. The paper lacks a sensitivity analysis regarding the interpolation process, particularly how the interpolation between background semantic labels and masks influences model performance across various datasets.
4. The impact of the size of the augmented dataset generated by SIDM on segmentation performance is not thoroughly examined, and it remains unclear whether adding more synthetic data further improves the model or leads to diminishing returns.
[1] Liu, Shengyuan, Zhen Chen, Qiushi Yang, Weihao Yu, Di Dong, Jiancong Hu, and Yixuan Yuan. “Polyp-Gen: Realistic and Diverse Polyp Image Generation for Endoscopic Dataset Expansion.” arXiv preprint arXiv:2501.16679 (2025).
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

This paper proposes a novel and promising approach, but lacks critical experimental validation (e.g., impact of mask size, interpolation ratio selection) and misses comparison with key related work like Polyp-Gen.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

The authors have addressed most of my concerns clearly and thoroughly.

Review #3

Please describe the contribution of the paper

The paper introduces the Semantic Interpolative Diffusion Model (SIDM), a novel approach for generating diverse colonoscopy images and masks by interpolating both lesion masks and background semantic labels. This method addresses the limitations of fixed-mask conditioning in existing generative models, enhancing the generalization of polyp segmentation models through data augmentation.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The interpolation of both masks and background semantic labels is a unique and innovative approach, addressing the rigidity of fixed masks in prior work. The use of signed distance maps for mask interpolation and multi-conditional diffusion models is well-justified and technically sound. The work directly tackles the challenge of limited annotated medical data, offering a practical solution for polyp segmentation.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
1. It is strange that some results of ArSDM are lower than those of SDM and SPADE, which are inconsistent with the results in ArSDM’s paper.
2. The authors should clarify the mechanism for integrating background semantic labels into the diffusion model, including details on how these labels are encoded and combined with mask conditions during the denoising process.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The author proposes mask interpolation and background semantic label interpolation to enhance the generalization of polyp segmentation models through data augmentation. Compared with previous works, this mask generation strategy is novel.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

My original concerns have been addressed.

Author Feedback

We sincerely thank the reviewers for their positive feedback on the novelty and uniqueness of our method [R1, R2, R3] and on the improved performance on the downstream task [R1, R2]. Below we address the key concerns.

[R1-Q1,Q3 R2-Q1] Contribution of Proposed Method: Our method not only interpolates masks to obtain intermediate annotations, but also generates intermediate images using background semantic labels, thereby introducing intermediate aspects of real data pairs to address data scarcity. Ablation studies (w/o background label) in Tables 1, 2 and Fig. 3 demonstrate the effectiveness of interpolating the image as well as the mask. By comparing results with and without image interpolation using the same interpolated mask with a 1:1 ratio (ours), we observe a 2–3% improvement in average Dice scores across multiple experiments, demonstrating the necessity of applying image interpolation. While a comparison such as “fixed mask + baseline” vs. “interpolated mask (ours) + baseline” (to evaluate mask size) vs. “interpolated mask (ours) + SIDM (ours)” could be done, we focused on verifying the proposed full pair interpolation framework due to the page limit.

[R1, R2-Q2] Interpolation Ratio Selection: We chose 1:1, 1:3, and 3:1 as representative ratios, covering balanced and biased examples. The 1:1 ratio, which equally blends both inputs, incorporates features from both without bias, resulting in high diversity. In contrast, 1:3 and 3:1 are representative cases of interpolation biased toward a specific input, occupying the outer quartiles of the interpolation space. Our experiments show that, on average across all tasks, 1:1 (balanced) is the most generalizable ratio. While 1:1 generally performs best, we fully agree that interpolation ratios should be adaptively selected based on dataset and sample characteristics. We see strong potential in adaptive ratio selection methods (e.g., using pixel distributions, class imbalance, or variations in mask size and location) and view this as a top priority for future work.
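
To make the ratio notation concrete, the sketch below converts a ratio such as 1:3 into a pair of blending weights and applies them to two background semantic labels. How SIDM actually encodes and blends these labels is not described on this page, so the one-hot encoding, the function names `ratio_to_weights` and `blend_labels`, and the convention that the first number weights the first input are all assumptions.

```python
def ratio_to_weights(ratio):
    """Convert a ratio string like '1:3' into normalized weights (w_first, w_second)."""
    a, b = (int(part) for part in ratio.split(":"))
    total = a + b
    return a / total, b / total


def blend_labels(label_a, label_b, ratio, num_classes):
    """Blend two class indices into a soft label vector (assumed one-hot encoding)."""
    w_a, w_b = ratio_to_weights(ratio)
    soft = [0.0] * num_classes
    soft[label_a] += w_a
    soft[label_b] += w_b
    return soft


# The three ratios discussed in the reviews and rebuttal.
for r in ("1:1", "1:3", "3:1"):
    print(r, ratio_to_weights(r), blend_labels(0, 2, r, num_classes=3))
```

Under this convention the three ratios correspond to weights of 0.5, 0.25, and 0.75 on the first input, which is consistent with the rebuttal's description of 1:3 and 3:1 as the outer quartiles of the interpolation space.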

[R2-Q3] Selection of Interpolation Pairs: Pairs were selected based on differences in both image and mask. If the background semantic labels of two images differed, the pair was used for interpolation, as differing labels imply distinct image background regions, regardless of mask similarity. For masks, pairs were selected by thresholding based on size comparison. We will clarify this pairing strategy in the manuscript upon acceptance of the paper.
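
One possible reading of the pairing rule described above is sketched here. The rebuttal does not give the threshold value, the exact size comparison, or how the two criteria are combined, so the area-ratio test, the default `size_ratio_threshold=1.5`, the "either criterion suffices" logic, and the sample dictionary layout are all hypothetical.

```python
from itertools import combinations


def select_pairs(samples, size_ratio_threshold=1.5):
    """Return id pairs worth interpolating.

    samples: list of dicts with keys 'id', 'bg_label', 'mask_area' (hypothetical layout).
    A pair is kept if the background labels differ, or if the mask areas
    differ by at least the given ratio (assumed criterion).
    """
    pairs = []
    for a, b in combinations(samples, 2):
        labels_differ = a["bg_label"] != b["bg_label"]
        small, large = sorted((a["mask_area"], b["mask_area"]))
        sizes_differ = small > 0 and large / small >= size_ratio_threshold
        if labels_differ or sizes_differ:
            pairs.append((a["id"], b["id"]))
    return pairs


samples = [
    {"id": "a", "bg_label": 0, "mask_area": 100},
    {"id": "b", "bg_label": 1, "mask_area": 100},
    {"id": "c", "bg_label": 0, "mask_area": 110},
    {"id": "d", "bg_label": 0, "mask_area": 300},
]
pairs = select_pairs(samples)
```

A filter of this kind addresses Reviewer #1's concern about redundant computation: near-identical pairs (such as "a" and "c" in the toy data) are skipped before any image is generated.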

[R3-Q2] Clarification of the Mechanism: In a mask-conditional model that learns an image distribution from a single mask input, mask interpolation can only induce smooth transitions within the masked region, lacking continuous control over the background or global image style. Thus, as long as the condition space is limited to the mask, interpolation cannot ensure coherent changes across the entire image. To solve this, we introduce a background semantic label that explicitly defines the image background in the condition space, serving as a controllable key for image-level interpolation. As shown in Fig. 2 and Section 2, the decoder of the denoising model receives both the mask and the background label as conditional inputs; training details are provided in Section 2.3.

[R3-Q1] ArSDM Discrepancy: We used ArSDM’s official pretrained model, while SDM and SPADE were reimplemented by us, which may have caused the discrepancy.

[R1-Q2,Q4] Baseline & Dataset Size: We followed ArSDM’s dataset size and settings. Comparison with PolypGen and analysis of dataset size will be addressed in future work.

[R2-Q4] Image Quality Analysis: Indeed, analyzing image quality is important, but we prioritized the segmentation task under data scarcity, due to the page limitation.

[R2-Q5] Segmentation Models: We evaluated both CNN-based (PraNet) and Transformer-based (FCBFormer) models on unseen datasets and plan to extend to recent models in future work.

[R2-Q6] Reproducibility: We will release code and implementation details.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

Two reviewers came back to recommend acceptance. R2 voted for rejection, and I share some concerns (e.g., the unclear significance of the results, given that there is no statistical analysis, and the inconsistencies in the optimal ratio across model architectures). However, I disagree with some of R2's criticisms, e.g. “the paper does not compare with the latest (2024 - 2025) polyp segmentation approaches”, which does not mention which approaches. The paper is quite borderline, but I do not find enough reasons to overturn R1 and R3’s shared view of acceptance.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

Two reviewers recommended acceptance after rebuttal, while one reviewer remained unconvinced and recommended rejection, citing limited novelty and practical relevance. However, given that:
- The key concerns were largely addressed in a thorough and clear rebuttal;
- The method introduces a technically novel contribution over existing generative models through the joint interpolation of both masks and semantic image labels;
- The paper is empirically validated across multiple datasets and backbone architectures;

I recommend acceptance of this paper, contingent on minor revisions. Specifically, I suggest that the authors:
- Clarify and detail the interpolation pairing mechanism and integration of background labels (as committed in rebuttal);
- Add or at least discuss comparisons with contemporaneous baselines such as Polyp-Gen;
- Improve reproducibility by making code and models available upon publication.

Semantic Interpolative Diffusion Model: Bridging the Interpolation to Masks and Colonoscopy Image Synthesis for Robust Generalization

Author(s):