Abstract

Skin images from real-world clinical practice are often limited, resulting in a shortage of training data for deep-learning models. While many studies have explored skin image synthesis, existing methods often generate low-quality images and lack control over the lesion’s location and type. To address these limitations, we present LF-VAR, a model that leverages quantified lesion measurement scores and lesion type labels to guide clinically relevant, controllable synthesis of skin images. It enables controlled skin synthesis with specific lesion characteristics based on language prompts. We train a multiscale lesion-focused Vector Quantised Variational Auto-Encoder (VQVAE) to encode images into discrete latent representations for structured tokenization. A Visual AutoRegressive (VAR) Transformer trained on these tokenized representations then performs image synthesis. Lesion measurements extracted from the lesion region and lesion-type labels are integrated as conditional embeddings to enhance synthesis fidelity. Our method achieves the best overall FID score (average 0.74) across seven lesion types, improving upon the previous state-of-the-art (SOTA) by 6.3%. The study highlights our controllable skin synthesis model’s effectiveness in generating high-fidelity, clinically relevant synthetic skin images. Our framework code is available at https://github.com/echosun1996/LF-VAR.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0807_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/echosun1996/LF-VAR

Link to the Dataset(s)

HAM10000 dataset: https://www.kaggle.com/datasets/kmader/skin-cancer-mnist-ham10000

ISIC2017 dataset: https://doi.org/10.48550/arXiv.1710.05006

Dermofit dataset: https://licensing.edinburgh-innovations.ed.ac.uk/product/dermofit-image-library

BibTex

@InProceedings{SunJia_Controllable_MICCAI2025,
        author = { Sun, Jiajun and Yu, Zhen and Yan, Siyuan and Ong, Jason J. and Ge, Zongyuan and Zhang, Lei},
        title = { { Controllable Skin Synthesis via Lesion-Focused Vector Autoregression Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15975},
        month = {September},
        pages = {128--138}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The work introduces LF‑VAR, a two‑stage framework for controllable synthesis of dermoscopic skin‑lesion images: (i) a multi‑scale lesion‑focused VQ‑VAE encodes each image into discrete tokens and (ii) a VAR Transformer autoregressively predicts the next‑scale tokens, conditioned on lesion‑type embeddings and a codebook of quantitative lesion‑measurement features. The approach achieves state‑of‑the‑art FID across seven lesion types and generalizes to inter‑class and cross‑dataset synthesis.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Timely problem & clear motivation – Exploring lightweight autoregressive models as an alternative to diffusion for dermatology is novel and addresses the high computational cost of diffusion models in clinical settings. The combination of lesion‑focused VQ‑VAE, measurement embeddings, and class‑specific codebooks is well‑motivated.

    Comprehensive evaluation – Intra‑class, inter‑class, and cross‑dataset experiments against five baselines show consistent FID/IS improvements.

    Potential clinical impact – Controllable synthesis of rare lesion types can help mitigate data imbalance and may support privacy‑preserving data sharing for downstream dermatology tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of downstream validation – FID and IS capture perceptual fidelity but not clinical utility. A downstream task (e.g., classifier training or segmentation fine‑tuning) or a visual evaluation with dermatologists, demonstrating that LF‑VAR images improve model performance relative to baselines, is missing.

    Codebook provenance unclear – The manuscript mentions a “class‑average measurement codebook” but does not specify how measurements are extracted, normalized, or updated. Provide a detailed description (feature list, units, extraction protocol) to ensure reproducibility.

    Segmentation dependency – Because the pipeline relies on lesion masks, performance may hinge on mask quality. In cases where segmentation masks are not available (e.g., clinical pictures captured by smartphones), the practical value of the method may be reduced.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript convincingly shows that LF‑VAR surpasses diffusion‑based and autoregressive baselines (e.g., Derm T2IM) in perceptual metrics, yet it fails to demonstrate that these high‑fidelity images actually improve downstream dermatology tasks. Because the primary motivation is to alleviate data scarcity for deep‑learning models, a classification experiment using the synthetic data—or at minimum a clinician evaluation—is essential to validate the practical benefit of the proposed controllable synthesis. Without such evidence, the real‑world value of the class‑specific measurement codebook and controllability claims remains speculative.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have successfully addressed my concerns.



Review #2

  • Please describe the contribution of the paper

    The authors propose a method for synthesizing skin lesion images with controllable attributes.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors’ method enables the generation of more controllable skin lesion images by incorporating both the lesion class and the corresponding lesion mask.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. While Table 1 includes both FID and IS scores, the proposed method does not achieve the best IS. It would be helpful if the authors could provide an interpretation or discussion regarding this discrepancy, particularly in relation to the quality and diversity of the generated images.
    2. To further validate the utility of the synthesized images, it would strengthen the paper to include results on downstream tasks—such as evaluating the performance of models trained with the synthetic skin lesion images.
    3. In Figure 4(a), it appears that in some cases the generated lesions are smaller than their corresponding masks, or their shape differs noticeably. Additional clarification or discussion on this mismatch between the lesion mask and the generated lesion appearance would be appreciated.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors’ method shows potential as a controllable framework for synthesizing skin lesion images.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    According to the guidelines, the authors were not allowed to mention any new experimental results. Furthermore, an ablation study would clarify the relative importance of the model components.



Review #3

  • Please describe the contribution of the paper

    The primary contribution of the paper is the innovative integration of an autoencoder and a transformer to generate skin lesion images. This hybrid approach uses text input to control key attributes such as lesion type and position, overcoming limitations of current methods that yield lower quality images and exhibit poor precision on lesion properties. The authors claim that by enabling users to specify these characteristics through descriptive text, the method produces high-fidelity images tailored to clinically relevant features, ultimately benefiting both research and diagnostic applications.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors present a method with several key strengths. First, they introduce an innovative integration of an autoencoder and a transformer, leveraging text input to precisely control lesion attributes such as type and position. This controlled synthesis enables the generation of high-resolution (512 × 512) images that accurately represent a wide range of disease categories. Moreover, the approach automatically quantifies lesion characteristics using PyRadiomics—capturing texture, shape, histogram, and clinical attributes—which are then employed by the VAR to improve lesion generation. The method clearly differentiates between normal skin and affected areas, ensuring that the background is preserved while the lesion is synthesized. The authors validate their approach extensively through three parts of evaluation: intra-class synthesis (demonstrating improved skin synthesis quality and artifact prevention on normal skin), inter-class comparisons (showing proper background preservation), and cross-dataset validation (achieving top or near-top scores across all metrics on unseen datasets). In each evaluation, the performance is compared with five other methods, and the study also identifies which components of the pipeline significantly boost performance. The use of a public dataset further ensures reproducibility, making the work highly relevant for both research and clinical applications.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper exhibits several notable weaknesses. One key limitation is its reliance on a pre-generated lesion mask as input; the authors claim that without an accurate mask, the VAR produces images with noisy backgrounds and unnatural transitions, thereby compromising image realism. Obtaining high-quality lesion masks is itself a challenging task (see “A survey on deep learning for skin lesion segmentation”), yet the paper does not address how to reliably generate these masks in a clinical setting. Furthermore, the evaluation compares the proposed method against five other approaches, but two of these methods involve non-conditional image generation and do not clarify how each disease type is generated—whether through training separate models on limited data or via a single model that ambiguously generates various lesion types. In practice, the generated images frequently display artifacts, and aside from occasional instances of melanoma or nevus, they fail to consistently reproduce clinically realistic lesions. Moreover, the diffusion-based method does not incorporate text input to guide the generation process, which further exacerbates discrepancies in image quality when compared with conditional approaches. Finally, the evaluation metrics employed, such as FID and IS, are limited in their ability to capture clinical realism (see “Rethinking FID: Towards a Better Evaluation Metric for Image Generation”); these metrics tend to be artificially low because only a segmented part of the image is modified, leaving large portions of normal skin untouched, and there is a notable absence of quantitative validation from a clinical expert’s perspective.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    In addition, we have an observation regarding the notation on Page 4, where the paper defines T₀ = [S, Fr]. The symbol “Fr” is not defined anywhere in the manuscript, and it is unclear whether it was meant to be “Fq” or some other term.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents innovative contributions through the integration of an autoencoder and a transformer, leveraging text input for controlled synthesis of high-resolution skin lesion images, there are significant issues that need to be addressed. Notably, the method’s reliance on pre-generated lesion masks, the limitations in its evaluation framework—especially the use of metrics like FID and IS that may not fully capture clinical realism—and some unclear comparisons with non-conditional methods raise concerns about its clinical applicability. If these weaknesses are adequately addressed in a rebuttal or revision, the paper’s contributions could be further strengthened.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors answered my concerns.




Author Feedback

We thank all reviewers for their valuable feedback. We address the major concerns as follows:

  1. Lack of Downstream Validation (All Reviewers): These experiments were conducted as part of our original study but were omitted from the submission due to space constraints. Following the established protocol by Philipp et al. (Nature Medicine), our downstream classification uses a ResNet-50 classifier for 7 lesion types. The baseline achieved a mean recall of 0.692. Adding a weighted random sampler improved it to 0.715. When using augmented training data with our synthetic images (balanced to 500 images per class), the recall increased to 0.771 (+0.079, +11.4% from baseline; +0.056, +7.8% from weighted sampling). This demonstrates the practical utility of our synthetic data for improving classification performance.
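  As a concrete reference for the metric used above, macro-averaged (mean) recall weights every lesion type equally regardless of class frequency, which is why it is a common choice for imbalanced dermatology datasets. The sketch below is illustrative only; the toy labels are not the paper’s data.

  ```python
  import numpy as np

  def mean_recall(y_true, y_pred, num_classes):
      """Macro-averaged recall: the unweighted mean of per-class recalls."""
      recalls = []
      for c in range(num_classes):
          mask = (y_true == c)
          if mask.sum() == 0:
              continue  # skip classes absent from the evaluation set
          recalls.append((y_pred[mask] == c).mean())
      return float(np.mean(recalls))

  # Illustrative 3-class example (per-class recalls: 0.5, 1.0, 0.5)
  y_true = np.array([0, 0, 1, 1, 2, 2])
  y_pred = np.array([0, 1, 1, 1, 2, 0])
  print(mean_recall(y_true, y_pred, 3))  # prints 0.6666666666666666
  ```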

  2. Dependency on Segmentation Masks and Practical Applicability (R2, R3): We clarify that the input mask is not a precise segmentation boundary, but rather serves as a rough guidance signal. The goal is to guide lesion synthesis toward the user-indicated region, avoiding lesion placement over background skin, rather than enforcing pixel-level mask fidelity. As correctly observed by R1 (Fig. 4), the synthesized lesions differ in shape and size from the input mask. This is intentional and desirable, as it allows for diverse yet contextually appropriate lesion generation. Our model explicitly weakens the boundary constraints through the lesion-focused loss (Eq. 2), encouraging natural transitions while still respecting the approximate location of the mask. Importantly, the use of input masks in our pipeline simulates human interaction. Users (e.g., clinicians) can flexibly define a region of interest on a skin image, and our model generates a new lesion within that region. This interactive inference paradigm supports real-world use cases where precise lesion masks may not be available, but approximate inputs (e.g., bounding boxes or rough sketches) can be provided. Therefore, rather than being a limitation, the mask-based input design enhances practical applicability by enabling controllable synthesis in settings with imperfect or human-defined region inputs, such as smartphone-captured images or clinical editing tools.

  3. FID/IS Discrepancy and Evaluation Metric Interpretation (R1, R3): Our evaluation approach aligns with established practices in medical image synthesis (Jiarong et al. MICCAI2021; Wenting et al. MICCAI2024). While we acknowledge that neither FID nor IS is a perfect proxy for clinical realism, FID remains the most widely accepted measure of perceptual fidelity in medical image synthesis. The FID/IS trade-off is well-documented in evaluation metrics literature, where higher diversity can artificially inflate IS scores while compromising fidelity (Yaniv et al. IJCV2021). As noted in Section 3.2, Derm T2IM’s higher IS results from increased diversity due to artifacts in normal skin regions, which artificially inflates the metric. Our method prioritizes fidelity (lowest FID) while maintaining reasonable diversity. Additionally, we address metric limitations through feature visualization analysis (Fig. 3) and downstream validation, which further strengthens our results. Lastly, in response to R3, we clarify that our model modifies not only the segmented lesion area but also regenerates the background, as evident in Fig. 4 and noted by R1.
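  For context, the FID discussed throughout compares Gaussian fits of Inception features from real and generated images; lower is better, which is why the “lowest FID” claim above corresponds to highest fidelity. The standard definition is:

  ```latex
  \mathrm{FID}
    = \lVert \mu_r - \mu_g \rVert_2^2
    + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
  ```

  where (μ_r, Σ_r) and (μ_g, Σ_g) are the mean and covariance of the Inception features of real and generated images, respectively.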

  4. Measurement Codebook Implementation and Reproducibility (R2): In Section 2.2, we extract lesion measurement scores (εr) using PyRadiomics (a Python package), encompassing 103 numerical features across shape, histogram, texture, and clinical categories. The encoding function Fq applies a linear projection followed by layer normalization and SiLU activation. For inter-class synthesis, we maintain a dynamic codebook of class-average measurements, storing the SiLU outputs and updating them via running averages during training. Section 3.1 provides implementation details. We will release the code.
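  A minimal sketch of the encoding path described above (linear projection, then layer normalization, then SiLU) together with a running-average class codebook. The embedding width, momentum value, weight initialization, and function names are assumptions for illustration, not the authors’ implementation.

  ```python
  import numpy as np

  rng = np.random.default_rng(0)

  D_IN, D_EMB = 103, 64  # 103 PyRadiomics features (per the rebuttal); D_EMB is assumed

  # Assumed parameters of the linear projection inside F_q
  W = rng.normal(scale=0.02, size=(D_IN, D_EMB))
  b = np.zeros(D_EMB)

  def encode_measurements(eps_r):
      """F_q as described above: linear projection -> layer norm -> SiLU."""
      h = eps_r @ W + b
      h = (h - h.mean(axis=-1, keepdims=True)) / np.sqrt(h.var(axis=-1, keepdims=True) + 1e-5)
      return h * (1.0 / (1.0 + np.exp(-h)))  # SiLU(x) = x * sigmoid(x)

  codebook = {}  # lesion type -> running average of encoded measurements

  def update_codebook(label, emb, momentum=0.99):
      """Running-average update of the per-class codebook (momentum value assumed)."""
      if label not in codebook:
          codebook[label] = emb.copy()
      else:
          codebook[label] = momentum * codebook[label] + (1.0 - momentum) * emb

  # Illustrative usage with a random measurement vector
  emb = encode_measurements(rng.normal(size=D_IN))
  update_codebook("melanoma", emb)
  print(codebook["melanoma"].shape)  # prints (64,)
  ```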




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Two reviewers were satisfied with the rebuttal. The authors should address the remaining concerns in the camera-ready version with additional discussion of the results discrepancy and shape mismatch.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While an ablation study would help clarify the relative importance of the model components, the authors have adequately addressed all other major concerns.


