Abstract
Polyp segmentation is the foundation of colonoscopic lesion screening, diagnosis, and therapy. However, the data size of images and annotations is limited. The latent diffusion model (LDM) has emerged as a powerful tool in synthesizing high-quality medical images with low computational costs. However, the challenges of boundary-aligned image-mask pairs and image realism remain unresolved, showing that (i) the spatial relationship between the boundaries is easily distorted in the latent space; (ii) the diversity of colors, shapes, and textures, along with low boundary contrast and textures similar to surrounding tissue, makes boundary distinction of the polyps difficult. This paper proposes Polyp-LDM, which encodes polyps and masks into the same latent space via a unified variational autoencoder (VAE) to align their boundaries. Furthermore, Polyp-LDM refines texture and lighting while preserving structure by fine-tuning the VAE decoder with data augmentation and applying a style cloning module to enhance image realism. Quantitative evaluations and a user preference study demonstrate that our method outperforms existing methods in image-mask pair generation. Moreover, segmentation models trained with augmented data generated by Polyp-LDM achieve the best performance on three public polyp datasets. The code is available at https://github.com/16rq/Polyp-LDM.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2379_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/16rq/Polyp-LDM
Link to the Dataset(s)
N/A
BibTex
@InProceedings{QiuRiy_Accurate_MICCAI2025,
author = { Qiu, Riyu and Xia, Kun and Gao, Feng and Yang, Shuting and Cai, Du and Wang, Jiacheng and Chen, Yinran and Wang, Liansheng},
title = { { Accurate Boundary Alignment and Realism Enhancement for Colonoscopic Polyp Image-Mask Pair Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15969},
month = {September},
pages = {33 -- 42}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a novel framework for generating image–mask pairs for polyp segmentation. The proposed method is built upon a Latent Diffusion Model and introduces a Shared Variational Autoencoder to enhance the consistency between generated images and their corresponding masks. Furthermore, Style Cloning is incorporated to improve the fidelity of the generated images. Experimental results demonstrate both the high quality of the generated images and the effectiveness of the proposed method in data augmentation for polyp segmentation tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
#1 The introduction of a Shared Variational Autoencoder to maintain consistency between images and masks is a reasonable approach, and its effectiveness is quantitatively demonstrated. #2 The figures are well-organized and easy to understand.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
#3 Is the comparative method “LDM” the same as “Ours w/o boundary and realism”? If so, could you provide an explanation as to why “w/o realism” yields worse performance than LDM in evaluation metrics other than Dice for the generated images?
#4 The details of the loss functions L_adv, L_reg, and D_\psi are not clearly described.
#5 According to a previous paper, Polyp-DDPM has been reported to outperform LDM. However, the results presented in this paper show the opposite trend. Could you explain the reason for this discrepancy?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Due to the inconsistency with a previous paper and the lack of a detailed description of the proposed method, I have assigned this score.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have addressed the reviewers’ comments well in the rebuttal. I hope the camera-ready version will clarify the remaining ambiguities in the proposed method.
Review #2
- Please describe the contribution of the paper
The paper’s main contribution is a novel method for generating aligned image-mask pairs for the polyp segmentation task. The authors address the common misalignment issue in pairs generated by latent diffusion models (LDMs) by introducing a unified VAE-based projection, which enforces consistent alignment between the generated images and masks. The authors also introduce a contrast augmentation technique based on style cloning in the VAE latent space.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The major strengths of the paper are:
- The use of multiple, distinct datasets for training, validation, and testing enhances the robustness of the results and builds greater confidence in the authors’ conclusions.
- The paper evaluates model performance using a range of quantitative metrics for both the generation and downstream segmentation tasks. Additionally, the inclusion of manual evaluations is consistent with best practices in generative modeling studies.
- The authors present relevant ablation experiments that clearly demonstrate the contribution of individual components of the proposed method, helping to elucidate which factors drive the observed performance gains.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the experimental design is generally thorough and covers a range of relevant results, a key comparison is missing—namely, with conventional data augmentation strategies. This omission weakens the empirical support for the proposed method. Specific concerns include:
a. Traditional data augmentation techniques for segmentation tasks—such as rotation, noise injection, color jittering, and histogram matching (which closely resembles the style cloning described in the paper)—can offer significant performance benefits. Although the authors mention using brightness and contrast augmentation during VAE-Decoder fine-tuning, these augmentations were not applied during segmentation model training. This inconsistency raises questions about fairness in the experimental setup.
b. Given the above, the experiments involving varying ratios of real and synthetic images (e.g., R2S1, R1S2) could have been replaced or removed with comparisons against segmentation models trained using standard augmentations. This would better isolate the effectiveness of the proposed method by directly contrasting it with common alternatives. Without such comparisons, it is difficult to assess whether the observed gains are due to the proposed generation pipeline or simply the result of having additional (augmented) data.
c. As this is a data generation paper, it would also be valuable to compare the proposed method with alternative generative approaches such as GANs or VAEs. This would help justify the use of Latent Diffusion Models (LDMs). While LDMs are known to be more diverse in high-data regimes, it is unclear whether they offer the same advantages when trained on small datasets like the one used here (~3,000 samples), where model convergence may be limited. Additionally, diffusion models have been shown to suffer from memorization on smaller datasets of the magnitude considered in the paper, so I seriously doubt that the proposed method is using the full benefit of diffusion models.
- Although the method is well-motivated, its presentation is at times unclear. After multiple readings, several questions remain unanswered:
a. What exactly does the “unified VAE” refer to? Does it mean that image and mask pairs are concatenated before being passed to the VAE?
b. If the inputs are not concatenated: How is the difference in channels handled (e.g., RGB image vs. single-channel mask)? Is the mask simply repeated across channels? If the same VAE is used for different inputs, how is consistency between image and mask ensured in the latent space?
This raises concerns about the effectiveness of the proposed “realism augmentation,” since feeding two domains (images and masks) into a shared VAE could distort the latent representation, potentially impacting the realism of the generated images.
c. If the inputs are concatenated:
How does the realism augmentation operate in this setting? Since the latent space now jointly represents image-mask pairs, how does the method ensure that meaningful color variations are applied without compromising the structural alignment of images, since such variations would distort the generated masks as well?
- Implementation Details missing:
a. The paper does not specify the dimensionality of the latent space, even though the input image size is mentioned. This is an important detail for reproducibility.
b. Minor grammatical issues should be addressed, including: (a) Inconsistent or incorrect capitalization of certain words. Line 3 in Introduction.
(b) The use of informal terms like “etc.”—which are typically avoided in academic writing and should be replaced with specific examples or omitted entirely. Page 2 line 1.
(c) Claims made in the paper should be properly supported with references or explicitly stated as empirical observations. For example, Page 2, Paragraph 2 contains several assertions without any citations or indication that these are based on the authors’ own empirical findings. While I assume these are drawn from the authors’ experiments, clearly stating this would improve transparency and credibility.
(d) Equation 3 and 4 have very minor difference from equation 1 and 3. To conserve space, they could have been removed. They do not add anything to help in understanding of the paper.
Although I find the idea interesting, several questions remain unanswered in the current manuscript. In the rebuttal, I would like the authors to focus on explaining their methodology and the reasoning behind it well.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
- Confusing explanation of the methods.
- Missing comparison with traditional data augmentation methods.
- Missing comparison with GAN-based models.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
- The method remains unclear to me, even after reviewing the rebuttal. In fact, it has raised further conceptual questions about the core mechanisms of the architecture. I believe the Methods section would benefit from a thorough rewrite, with clearer explanations and more detailed descriptions of each component. Missing details and a confusing description are pointed out by the other reviewers as well.
- Additionally, prior work in the literature has shown that diffusion models tend to memorize when trained on small datasets. This raises concerns about their suitability for data augmentation, which fundamentally relies on generating diverse samples, undermining the main purpose of this paper.
- Finally, I disagree with the authors’ claim that conventional augmentation techniques cannot be applied. In practice, simple masking-based augmentation strategies can effectively simulate contrast and lighting variations for correction. Omitting such comparisons is, in my view, a significant oversight.
Due to the above reasons, I do not believe this paper is ready to be accepted for a conference audience yet. This paper would benefit from another round of the review cycle.
Review #3
- Please describe the contribution of the paper
This work explores synthetic data generation for polyp segmentation. The authors propose a latent diffusion model-based approach. The novelty lies in the introduction of two modules: 1) polyp-mask boundary alignment, which is achieved by sharing latent space and initial noise for both the polyp and mask images, 2) enhanced realism of lighting, contrast, and texture of polyp images, by using augmentation-based finetuning of VAE decoder and style cloning from real polyp images. Experiments comparing synthetic and real polyp images’ distribution characteristics, and comparison with other diffusion models on multiple datasets, show good results.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper proposes two novel modules in the model architecture: 1) increasing the boundary alignment of polyp-mask pairs using a shared latent space (same VAE) and the same initial noise, and 2) improving the realism of synthetic images by fine-tuning the decoder with adaptive augmentation and style transfer from real polyps.
- The experimental evaluation design is comprehensive: an ablation study (with clinician quality scores), comparison with other diffusion models with varying amounts of synthetic data, and multiple datasets in the training and testing.
- Strong performance improvements are shown in all quantitative experiments.
- The synthetic images shown in the figures seem realistic and are able to generate fine details such as vasculature, and pit pattern texture on polyps.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major
- Can the authors please explain the motivation to separate polyp and background and independently adjust the brightness and contrast? Won’t this increase the contrast between the polyp and surrounding tissue, making it a ‘shortcut’ for the model learning the segmentation? Many of the illustrative synthetic images of proposed method in Figs. 2 & 3 seem to have polyps brighter than the surroundings. However, the major challenge of CADe is detecting polyps such as hyperplastic and sessile serrated, especially with white light imaging, where the background appears similar in color to the polyp, leading to an ambiguous boundary. Could the style cloning step reduce some of this increased contrast and harmonize the image?
- For the style cloning step, how is the shape of the polyp retained? Is structural information stored in Q, while style is stored in K & V? Is previous work establishing this available for transformers? Previous work [1,2] has used the disentangled features in GANs to transfer style without significantly changing the polyp shape.
Minor
- For Fig. 2, the authors could consider moving subfigure c to the top and a & b to the bottom, as c is a high-level figure and a & b are module details. Most readers look at figures from top to bottom, left to right.
- There is a large pool of prior work that has focused on synthetic polyp image generation from masks, or on increasing diversity in terms of lighting, style, etc., like the previously mentioned works. The authors could consider adding this area of prior work, but it is understandable if the page limit prevents doing so.
[1] Mathew, S., Nadeem, S., & Kaufman, A. (2022, September). CLTS-GAN: color-lighting-texture-specular reflection augmentation for colonoscopy. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 519-529). Cham: Springer Nature Switzerland. [2] Golhar, M. V., Bobrow, T. L., Ngamruengphong, S., & Durr, N. J. (2024). GAN inversion for data augmentation to improve colonoscopy lesion classification. IEEE Journal of Biomedical and Health Informatics.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Novel architecture, comprehensive set of experiments, strong quantitative results, good paper structure.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The paper proposes two modules for 1) increasing the boundary alignment between polyp masks and images via a shared VAE latent space 2) increasing the realism of the synthetic polyp images using a style cloning module. The integration of these modules in the stable diffusion pipeline is novel.
Quantitative results on the alignment between mask & polyp, and on the realism of synthetic polyps, showcase strong performance. A clinician user study also gave the method’s images a high score on realism.
The generated images improved the downstream task of polyp segmentation. A comprehensive evaluation involved different ratios of real to synthetic images and multiple public datasets, and the method was compared against other diffusion models in addition to ablation study. The proposed method performed well in most cases.
The authors promise to release the code.
Author Feedback
Thanks to the reviewers for the helpful feedback. We address the comments below and will update the camera-ready if accepted. 1. VAE (a) Input and consistency (R2) The VAE maps polyps and masks (tiled to 3 channels) into a latent space without concatenation. Visualizations show that, starting from a pre-trained VAE, this design enables accurate reconstruction of the two domains. Table 1 shows that using two VAEs (“LDM”) has slightly better realism but far worse spatial consistency than the unified VAE (“w/o realism”). This confirms that spatially aligned paired embeddings are learned by the unified VAE, which belongs to the boundary module, not the realism module. It also inspires our realism module, including realism augmentation, to boost realism while keeping boundary accuracy. As noted in 1(b), realism augmentation helps correct artifacts. Further, concatenation doesn’t work with our dual diffusion models.
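To make the shared-VAE design concrete, the following is an illustrative sketch (not the authors’ code): the binary mask is tiled to 3 channels so the same encoder handles both domains, and a toy average-pooling “encoder” stands in for the real VAE’s [3, 256, 256] → [4, 32, 32]-style downsampling.

```python
import numpy as np

def tile_mask(mask):
    """Tile a [1, H, W] binary mask to 3 channels so it matches the RGB image."""
    return np.repeat(mask, 3, axis=0)

def toy_encode(x, latent_channels=4, factor=8):
    """Stand-in for a shared VAE encoder: [3, H, W] -> [latent_channels, H/f, W/f].

    Real latent diffusion VAEs use learned convolutions; here we just
    average-pool 8x8 patches and mix channels with a fixed linear map,
    purely to show that one encoder serves both the image and the mask.
    """
    c, h, w = x.shape
    pooled = x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))
    proj = np.ones((latent_channels, c)) / c  # toy 3 -> 4 channel projection
    return np.einsum('lc,chw->lhw', proj, pooled)

image = np.random.rand(3, 256, 256)                       # RGB polyp image
mask = (np.random.rand(1, 256, 256) > 0.5).astype(float)  # binary mask

z_image = toy_encode(image)             # [4, 32, 32]
z_mask = toy_encode(tile_mask(mask))    # same latent grid, same "encoder"
print(z_image.shape, z_mask.shape)      # (4, 32, 32) (4, 32, 32)
```

Because both latents come from the same map, a spatial location in `z_image` corresponds to the same location in `z_mask`, which is the property the rebuttal credits for boundary alignment.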
(b) Motivation and role of augmentation (R2&R3) The motivation is to correct lighting/contrast artifacts (Fig. 1c) to improve realism, not to add diversity like standard augmentations. We separate polyps from backgrounds during augmentation, unlike standard whole-image transforms. Further, conventional augmentations were tried early but proved ineffective. Thus, VAE augmentation doesn’t unfairly help segmentation. For R3, the slightly higher brightness (Figs. 2&3) may stem from strong boundary guidance, unintentionally boosting contrast. Our style cloning mitigates it without altering structure.
(c) Loss (R1) L_adv, L_reg, and D_\psi are parts of the VAE loss. To save space, we omitted standard Stable Diffusion details. D_\psi is the discriminator with parameters \psi; it assigns high scores to real images and low scores to fake ones. L_adv is the adversarial loss for the fake image, defined as -log(1-D_\psi(·)). L_reg is a KL divergence that regularizes the latent space by aligning the posterior q(z|x)=N(u, sigma^2) with the standard normal prior N(0, I), noted as KL(N(u, sigma^2)||N(0, I)). We’ll add details.
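For reference, the two terms described in the rebuttal can be written out as follows (a reconstruction in standard KL-regularized autoencoder notation, not the paper’s exact formulas; the closed form of the KL term uses the diagonal-Gaussian assumption):

```latex
\mathcal{L}_{\mathrm{adv}} = -\log\!\big(1 - D_\psi(\hat{x})\big), \qquad
\mathcal{L}_{\mathrm{reg}}
  = \mathrm{KL}\!\left(\mathcal{N}(\mu,\sigma^{2})\,\big\|\,\mathcal{N}(0,I)\right)
  = \frac{1}{2}\sum_{i}\left(\mu_i^{2} + \sigma_i^{2} - \log\sigma_i^{2} - 1\right)
```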
2. Style cloning (R3) Shape is preserved in Q, while style is stored in K and V, consistent with prior transformer work: Song W, Jiang H, Yang Z, et al. Insert Anything: Image Insertion via In-Context Editing in DiT. arXiv preprint arXiv:2504.15009, 2025.
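The Q/K/V split above can be sketched as follows (a hypothetical illustration, not the authors’ implementation): queries come from the content (structure) tokens, while keys and values come from the style reference tokens, so each output token is a style-weighted mixture that keeps the content token’s spatial position.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def style_clone_attention(content_tokens, style_tokens):
    """Attention where structure sits in Q and style in K/V.

    content_tokens: [N, C] tokens of the generated image (queries)
    style_tokens:   [M, C] tokens of a real reference image (keys & values)
    Returns [N, C]: each content position re-rendered as a convex
    combination of style tokens, so layout is kept but appearance changes.
    """
    q, k, v = content_tokens, style_tokens, style_tokens
    scale = q.shape[-1] ** -0.5
    attn = softmax(q @ k.T * scale)   # [N, M] attention weights
    return attn @ v                   # [N, C] styled output

content = np.random.rand(1024, 64)   # e.g. a 32x32 latent flattened to tokens
style = np.random.rand(1024, 64)
out = style_clone_attention(content, style)
print(out.shape)  # (1024, 64)
```

In the full pipeline the tokens would be learned projections of latent features; this sketch omits the Q/K/V projection matrices to keep the mechanism itself visible.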
3. Comparison work (a) Polyp-DDPM vs. LDM (R1) The performance discrepancy may stem from data/setup differences. To our knowledge, Polyp-DDPM only beats LDM in its original paper [7]. Compared to [7], our multi-center dataset with more complex scenarios is more challenging and reflects real-world needs, explaining the reversed trend.
(b) “w/o realism” vs. LDM (R1) LDM is “Ours w/o boundary and realism”. “w/o realism” has only the boundary module, which improves boundaries (higher Dice) but compromises realism (the other Table 1 metrics). Accordingly, it inspired us to propose the realism module to enhance realism while preserving boundary accuracy.
(c) Augmentation comparison and downstream task (R2) Our pipeline had the biggest gain with equal added data (Table 2), proving its contribution to segmentation. We also compared traditional augmentations while training segmentation models and found that histogram matching and rotation had limited or inconsistent benefits, while jittering and noise injection often degraded performance. This shows our advantage in producing high-quality data. For fairness, see response 1(b).
(d) GANs/VAEs comparison (R2) We conducted experiments during baseline selection. GANs/VAEs (pix2pixHD, BicycleGAN) produced artifacts (FID ~163.80, KID ~0.21) but had better alignment (Dice ~62.33) than some diffusion models. Ours has high realism and accurate boundaries. Prior work [7] shows diffusion models beat GANs on small datasets (<2000 images). However, diffusion models often suffer from memorization, a point we’ll add to the discussion.
4. Writing (R2&R3) We’ll clarify the [3,256,256]→[4,32,32] encoding, update the Fig. 2 layout, and fix grammar and citations. The works noted by R3 are similar to our style cloning but require training, unlike ours. The citation will be included.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Two reviewers are in favour of the paper’s acceptance and the rebuttal has addressed their concerns. R2 has remaining concerns regarding the missing implementation and methodological details, and lack of comparison to conventional augmentation. The authors should add the missing details and add further discussions to address the remaining concerns.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A