Abstract

Computer-aided diagnosis (CAD) systems for skin lesion analysis reduce the costs and workload associated with the manual inspection of skin diseases. Nevertheless, the performance of deep learning (DL)-based CAD systems is constrained by the limited availability of labeled data, necessitating advanced dataset augmentation techniques. To address this limitation, we propose DiDGen, a novel method that employs Diffusion models (DMs) for Dermoscopic image Generation and lesion-mask pair synthesis. Specifically, we introduce DermPrompt, a new type of structured text prompt rich in clinical details annotated by large language models (LLMs), which helps DMs learn fine-grained visual representations. Additionally, we propose a new paradigm for lesion-mask pair synthesis: we incorporate a region-aware attention loss during finetuning to build semantic connections between text and visual representations, and then integrate test-time layout guidance with attention-based annotation to synthesize diverse and accurate lesion-mask pairs in a training-free manner. Extensive experiments demonstrate that our method improves the quality and diagnostic utility of generated dermoscopic images, thereby enhancing DL model performance in skin lesion classification and segmentation tasks. Our code is available at https://github.com/junjie-shentu/DiDGen.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4243_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/junjie-shentu/DiDGen

Link to the Dataset(s)

ISIC2018 dataset: https://challenge.isic-archive.com/data/#2018

BibTex

@InProceedings{SheJun_DiDGen_MICCAI2025,
        author = { Shentu, Junjie and Watson, Matthew and Al Moubayed, Noura},
        title = { { DiDGen: Diffusion-based Dual-task Synthesis for Dermoscopic Data Generation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents a diffusion-based framework for dermoscopic dataset augmentation targeting both image generation and lesion-mask pair synthesis. The key components of the method are:

    • A structured prompt generation technique leveraging Llama3, designed to embed clinically relevant details into the text guidance of Stable Diffusion.

    • Region-aware attention loss: a finetuning mechanism using placeholder tokens and cross-attention maps to align text tokens (“” and “”) with corresponding visual regions in the image.

    • Training-free lesion mask generation: the method introduces test-time layout control using cross- and self-attention map regularization, and mask extraction via thresholding of the fused attention maps.
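
    The mask-extraction step in the last bullet can be illustrated with a minimal sketch. It is not the authors' implementation: the fusion rule (propagating the lesion token's cross-attention map through the self-attention matrix), the min-max normalisation, the 0.5 threshold, and the function name `attention_to_mask` are all assumptions made for illustration.

```python
import numpy as np


def attention_to_mask(cross_attn, self_attn, threshold=0.5):
    """Turn attention maps into a binary lesion mask (illustrative only).

    cross_attn: (H, W) cross-attention map for the lesion token.
    self_attn:  (H*W, H*W) self-attention matrix over image positions.
    threshold:  hypothetical cut-off applied after min-max normalisation.
    """
    h, w = cross_attn.shape
    # Assumed fusion rule: spread the cross-attention scores along
    # self-attention affinities so similar pixels share a score.
    fused = (self_attn @ cross_attn.reshape(-1)).reshape(h, w)
    # Min-max normalise to [0, 1]; the epsilon avoids division by zero
    # when the fused map is constant.
    lo, hi = fused.min(), fused.max()
    fused = (fused - lo) / (hi - lo + 1e-8)
    # Threshold to obtain the binary mask.
    return (fused >= threshold).astype(np.uint8)


# Toy example: a 4x4 map with a bright 2x2 centre and identity
# self-attention, so fusion leaves the cross-attention map unchanged.
cross = np.zeros((4, 4))
cross[1:3, 1:3] = 1.0
mask = attention_to_mask(cross, np.eye(16))
```

    In a real pipeline the attention maps would come from the diffusion UNet and would typically be averaged over layers and timesteps before thresholding; that aggregation is omitted here.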

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Clinically inspired prompts: Using LLMs to generate detailed, structured prompts (DermPrompt) offers a semantically grounded way to guide diffusion models.

    • Lesion-mask pairing: The use of attention map regularization avoids retraining-heavy pipelines like Pix2PixHD or ControlNet, increasing practicality.

    • Ablation studies: The paper contains thoughtful ablations showing the individual benefits of method components.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Limited novelty in core architecture: Much of the method builds on known techniques, Stable Diffusion, classifier-free guidance, and attention map control, combined in a task-specific way. The combination is practical but not deeply novel algorithmically.

    • Significance of results: Improvements in downstream tasks (classification and segmentation) seem to be marginal. No statistical significance tests are reported to assess whether the gains are robust. This weakens the empirical claims.

    • Limited dataset: All experiments are on ISIC 2018. A second dataset (e.g., Derm7pt or PH2) would help evaluate generalization and robustness.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Including a second dataset would significantly strengthen the generalizability and clinical relevance of the method.

    • Consider releasing example prompts and attention maps to better illustrate how DermPrompt and the attention mechanisms contribute to the model’s performance; this could add value to the paper.

    • The segmentation experiments do not fully mirror the classification setup. Instead of expanding the training set with newly synthesized image-mask pairs, the method generates lesion-mask pairs based on existing masks. Could you please further clarify this design choice?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper proposes a technically sound method for dermoscopic dataset augmentation via diffusion models, key aspects require further validation:

    • The empirical improvements are modest and not backed by statistical significance testing.
    • The method is evaluated solely on a single dataset, limiting claims about generalization.
    • The method combines existing techniques in a task-specific way, offering practical value but limited architectural novelty.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The rebuttal addresses the main concerns. The authors provided statistical significance tests, clarified the segmentation setup and the use of training masks, and justified their choice of dataset. While the approach builds on existing techniques, the unified dual-task framework is contextually relevant. They have also committed to releasing their code, which strengthens the paper’s reproducibility. Based on these clarifications, I am now inclined to support acceptance.



Review #2

  • Please describe the contribution of the paper

    The paper proposed a novel diffusion model framework (DermDiff) that can simultaneously generate dermoscopic images and lesion-mask pairs, addressing limitations in existing data augmentation methods.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The use of structured text prompts (DermPrompt) generated by large language models (LLMs) enhances the diffusion model’s ability to learn fine-grained visual representations, reducing ambiguity in generated images.
    2. The region-aware attention loss establishes semantic connections between text and visual representations, improving the semantic consistency of generated images.
    3. A training-free pipeline is proposed to generate diverse and accurate lesion-mask pairs by combining test-time layout guidance and attention-based annotation, avoiding additional training overhead.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1. While DermPrompt leverages clinical details annotated by large language models, the paper does not clarify the source and quality of these details.
    2. The implementation details of the region-aware attention loss (e.g., initialization and optimization of P-Tokens) are insufficiently described.
    3. While the paper demonstrates the quality of generated images, it lacks validation of their clinical applicability, such as whether dermatologists were involved in assessing their diagnostic value. The paper does not analyze the differences in diagnostic features between generated and real images.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel diffusion model framework (DermDiff) has been proposed, which can simultaneously generate skin microscopy images and lesion masks, addressing the limitations of existing methods. However, insufficient implementation details and lack of clinical validation have affected the integrity of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have submitted the rebuttal and addressed some of the reviewers’ concerns, so I maintain my original decision.



Review #3

  • Please describe the contribution of the paper

    The authors propose DermDiff, a single-pass fine‑tuning of Stable Diffusion v2.1 that is subsequently used for two separate data‑augmentation tasks: (1) Dermoscopic image synthesis guided by a new “DermPrompt”, LLM‑generated, attribute‑rich textual captions. (2) Lesion-mask pair synthesis produced at test‑time via a training‑free pipeline that combines a region‑aware attention loss injected during fine‑tuning, layout guidance on cross‑ and self‑attention maps, and an attention‑based annotation step.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Overall well written and interesting read
    • The proposed method is a unified dual‑task framework, which finetunes once, then re‑uses the checkpoint for both image‑only and image + mask synthesis, which is elegant and resource‑aware.
    • The idea of exploiting cross‑token placeholders (“”) and layout guidance to bypass a dedicated pix2pix model is novel and could generalize to other organ‑mask domains.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • There is a risk of information leakage in the segmentation experiment. Specifically, the masks from the training set are used as guidance to generate synthetic training images, so improvements may simply mirror the original masks. The authors try to mitigate this by pointing to “minor differences”, but no quantification is given.
    • Can the authors provide more information about the attribute-aware prompting, i.e., how they use Llama‑3 to mine shape/color/structure attributes for each training image and where they rely on existing metadata?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • In section 2.3 there is a typo in the 2nd line “DermPropmt”
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a dual‑use of diffusion models that deserves attention.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    Even though the paper is interesting and discusses a relevant problem, the authors failed to convince me during the rebuttal. For example, the reviews questioned the clinical relevance and truthfulness of the generated images, which the authors justified solely on the basis of downstream performance; this does not address the concerns raised. Furthermore, questions regarding the exact procedure of certain steps of the methodology revealed additional complexities which have not been adequately described in the paper.




Author Feedback

We appreciate the reviewers' comments and respond to them below.

Common comment: Reproducibility: We will release our code to promote reproducibility.

Review 1, W1 & Review 3, W2: How to mine clinical details? Quality of details, and use of metadata. The clinical details in DermPrompt are annotated by a Llama 3.2 vision model. The model first generates descriptive attributes of the given image using the prompt illustrated in Fig. 3(a). The model is then reapplied to summarize them into 77 tokens to fit SD’s input capacity. Because these attributes are general visual features, Llama can reliably capture them. To verify quality, we randomly examined 200 samples and found a high degree of image-prompt agreement. For metadata, we inserted the diagnostic labels into the prompt (e.g., an image of melanoma on ) to enforce class-specific features for classification.

Individual issues:

Review 1:
W2: Details of the region-aware attention loss: P-Tokens "" and "" are initialized by standard tokens "skin" and "lesion". During training, P-Token embeddings remain fixed, and we train the UNet to learn visual features and to align its attention maps with the P-Tokens.
W3: Validate clinical applicability: We assessed the clinical applicability of synthetic data in downstream tasks. Tables 2 and 3 show robust accuracy gains provided by generated images. Due to space constraints, we omit a detailed difference analysis, but we plan to include case-specific comparisons and expert dermatologist evaluations in future work.

Review 2:
W1: Limited novelty: The novelty of our method lies in a unified task-specific framework tailored for dual-task dermoscopic image synthesis. Beyond the mask-image synthesis pipeline, the region-aware finetuning offers innovative solutions for guided medical image generation, and can be generalized to other domains using organ-mask data.
W2: Significance of results: The overall improvement is diluted by averaging across all samples, and we have not fully explored the impact of synthetic dataset size; we expect greater improvements with larger synthetic sets. In future work, we will provide a detailed analysis of improvements on specific data subsets (e.g., images in the MEL class; images poorly segmented by the original model), and investigate the scaling laws of synthetic data. Our method achieves consistent performance gains across multiple classifiers and segmentation models, underscoring its robustness. We performed paired t-tests on the classification and segmentation results, and the p-values were 1.35e-3 and 6.64e-29 (both < 0.05).
W3: Limited dataset: ISIC 2018 is a widely adopted public segmentation and classification benchmark suitable for our dual-task study. It enables fair comparison against baselines, and is larger and more diverse than Derm7pt and PH2. We acknowledge the advice and will include evaluations on other datasets.
ADD_COM3: Segmentation experiment setup: Our method can directly generate lesion-mask pairs without mask guidance, but we find that adding mask guidance via L_CA can improve the diversity of generated masks (see the ablation in Sec. 3.4), benefiting segmentation-model training.

Review 3:
W1: Risk of information leakage (info-leak): We note that info-leak primarily arises from the image translation model (e.g., Pix2PixHD): when given real training-set masks as input, it can reproduce images that closely mirror the originals, leaking real image–mask pairs (especially at the S1k scale, which uses a subset of the training data). Conditioned on real masks, our framework produces new image-mask pairs. The generated masks retain global alignment with the inputs but show local deviations (referred to as “minor differences”) due to SD's inherent variability, avoiding the info-leak induced by using real masks. Also, no test data is used for training, which avoids test-set info-leak. We acknowledge the advice and will quantify the mask differences (e.g., pairwise IoU) in future work.
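
The pairwise IoU mentioned above as a way to quantify mask differences could be computed along the following lines. This is a minimal sketch rather than the authors' evaluation code: the function names `mask_iou` and `mean_pairwise_iou`, and the convention that two empty masks have an IoU of 1.0, are assumptions made for illustration.

```python
import numpy as np


def mask_iou(a, b):
    """Intersection over union of two binary masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Assumed convention: two empty masks count as identical.
        return 1.0
    return float(np.logical_and(a, b).sum() / union)


def mean_pairwise_iou(guidance_masks, generated_masks):
    """Average IoU between each guidance mask and its generated mask."""
    scores = [mask_iou(g, s) for g, s in zip(guidance_masks, generated_masks)]
    return float(np.mean(scores))


# Toy example: the generated mask extends one column past the guidance mask.
guide = np.zeros((4, 4), dtype=np.uint8)
guide[0:2, 0:2] = 1          # 4 foreground pixels
generated = np.zeros((4, 4), dtype=np.uint8)
generated[0:2, 0:3] = 1      # 6 foreground pixels, 4 shared with guide
iou = mask_iou(guide, generated)  # intersection 4 / union 6
```

A mean IoU close to 1 would indicate that generated masks largely reproduce the guidance masks, while lower values would indicate larger local deviations.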




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


