Abstract

As the appearance of medical images is influenced by multiple underlying factors, generative models require rich attribute information beyond labels to produce realistic and diverse images. For instance, generating an image of a skin lesion with specific patterns demands descriptions that go beyond diagnosis, such as shape, size, texture, and color. However, such detailed descriptions are not always accessible. To address this, we explore a framework, termed Visual Attribute Prompts (VAP) Diffusion, that leverages external knowledge from pre-trained Multi-modal Large Language Models (MLLMs) to improve the quality and diversity of medical image generation. First, to derive descriptions from MLLMs without hallucination, we design a series of prompts following Chain-of-Thoughts for common medical imaging tasks, including dermatologic, colorectal, and chest X-ray images. Generated descriptions are utilized during training and stored across different categories. During testing, descriptions are randomly retrieved from the corresponding category for inference. Moreover, to make the generator robust to unseen combinations of descriptions at test time, we propose a Prototype Condition Mechanism that constrains test embeddings to remain similar to those from training. Experiments on three common types of medical imaging across four datasets verify the effectiveness of VAP-Diffusion.
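To illustrate the training-time storage and test-time retrieval described above, here is a minimal sketch of a class-specific prompt bank. The `ClassSpecificPromptBank` API, the class name, and the example description are hypothetical illustrations, not taken from the paper.

```python
import random
from collections import defaultdict

class ClassSpecificPromptBank:
    """Stores MLLM-generated attribute descriptions per category during
    training and serves a random one per category at inference."""

    def __init__(self) -> None:
        self.bank = defaultdict(list)  # category label -> descriptions

    def add(self, label: str, description: str) -> None:
        # Training time: cache the MLLM description under its category.
        self.bank[label].append(description)

    def sample(self, label: str) -> str:
        # Test time: randomly retrieve a stored description from the
        # corresponding category to condition the diffusion model.
        return random.choice(self.bank[label])

# Hypothetical usage: populate during training, then sample to generate.
bank = ClassSpecificPromptBank()
bank.add("melanoma", "asymmetric lesion, irregular border, varied pigment")
prompt = bank.sample("melanoma")
```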

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4892_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/YiBaiHP/VAP-Diffusion

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HuaPen_VAPDiffusion_MICCAI2025,
        author = {Huang, Peng and Fu, Junhu and Guo, Bowen and Li, Zeju and Wang, Yuanyuan and Guo, Yi},
        title = {{VAP-Diffusion: Enriching Descriptions with MLLMs for Enhanced Medical Image Generation}},
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper the authors propose a diffusion model for medical images that leverages image descriptions, as derived from pre-trained multi-modal LLMs, in order to improve the diversity and precision of the generated images. The authors employ a specific chain of prompts to generate the medical image descriptions, which are then used as input to the diffusion model, along with the input image and the class label. Their diffusion model is taken from [31] (reference in the paper). Moreover, the authors propose an additional “prototype condition mechanism” (PCM) in order to regularize the embeddings of training samples from the same class to remain close to each other. Empirical results illustrate certain performance benefits compared to related work.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Interesting use of pre-trained multi-modal LLMs to enhance quality of generative models for medical images.
    • Empirical results using FID, IS, Precision and Recall show improved empirical performance in image generation compared to other state-of-the-art methods
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • There are two novel components in the proposed framework: VAPS (use of prompt descriptions from multi-modal LLMs) and PCM (prototype condition mechanism; see Section 2.4 in the paper). The utility of the PCM component is not sufficiently supported by the presented empirical results. Looking at Tables 1 and 2, we can observe that the performance of the proposed framework “VAP-Diffusion” without the VAPS component is identical to the performance of U-ViT. This empirical result requires additional analysis (why are these results equal? why did the use of PCM produce exactly the same results?). Additional empirical analysis should be conducted to support the potential utility of the PCM component.
    • PCM is not explained in sufficient detail. For example, in the sentence “To stabilize the training process, we build a linear layer initialized with all zero which can gradually inject the multi-modal priors.”, what is the linear layer that is referred to? Is it the p_c? Moreover, what is the effect of different values of the \alpha term on empirical performance?
    • Why is the proposed framework compared only to StyleGAN in the “downstream tasks” section? Additional experiments that also include other frameworks (like the ones used in Section 3.1) should be performed.
    • The empirical results do not include several details that would allow for the reproduction of the experiments, such as the exact U-ViT architecture setup (in [31] there are several variations with different numbers of parameters) and the \alpha parameter in the PCM loss.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The utility of one of the two novel components (PCM) is not supported by the empirical results
    • Level of details provided in paper does not allow for the reproduction of empirical results
    • The empirical comparison for downstream tasks should be performed with additional methods, like the ones used in Section 3.1
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    This study introduces VAP-Diffusion, a framework that leverages Multi-modal Large Language Models (MLLMs) to generate realistic synthetic medical images using clinically relevant attributes. In general, medical image generation is challenging and even images with the same condition can vary significantly. Herein, leveraging synthetic samples is a cost-efficient approach to generate more data for downstream analyses. To address the challenge of creating realistic and diverse medical images, the authors use MLLMs to provide detailed visual attribute descriptions that go beyond basic class labels. Their approach includes three key components: Visual Attribute Prompt Strategy (VAPS), which extracts accurate descriptions from MLLMs through a Chain-of-Thoughts process; Class-Specific Prompt Bank (CSPB), which stores descriptions for retrieval during inference; and Prototype Condition Mechanism (PCM), which ensures robustness to unseen description combinations. Experiments across three medical imaging types (dermatologic, colorectal, and chest X-ray) demonstrate that VAP-Diffusion outperforms existing methods in generating realistic and diverse images, and also improves downstream classification performance by up to 11.9% compared to the baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written and easy to follow. The authors address a significant challenge in medical image generation by leveraging external knowledge from MLLMs to provide enriched attribute descriptions, which is a more cost-effective approach when data collection and annotation are limited.

    • The proposed framework shows consistent improvements across multiple medical imaging modalities and datasets, demonstrating its generalizability. While the generative quality metrics are not always the best compared to other approaches (except for FID), the approach is clearly better on the colonoscopy dataset, and the downstream performance supports the quality of the generated samples.

    • The details regarding the overall pipeline are sufficient. I equally appreciate the use of multi-step prompting to ensure robustness and avoid bias. In fact, what stands out most is the ability to handle free text and generalize to unseen inputs.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • While this study shows competitive results, the evaluation does not include a comprehensive analysis of how different MLLMs might affect the quality of the generated descriptions and, subsequently, the synthesized images. It would be beneficial to discuss or highlight this point, including whether this was considered in the early stage of the study.

    • It is unclear how clinically applicable the synthetic images generated by VAP-Diffusion are and whether expert radiologists assessed the images. Along this line, I am curious how this approach would transfer to more complex 3D medical imaging modalities like MRI or CT scans.

    • There is limited discussion on how the model handles rare or unusual medical presentations that might not be well-represented in the training data.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the study presents a novel approach to medical image generation with strong empirical results across multiple datasets. The initial hypothesis that enriched descriptions from MLLMs can improve medical image generation is well-supported by the experimental results. I am happy to provide an initial positive rating and open to hearing clarifications on the points raised.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I wish to thank the authors for clarifying the majority of concerns in the rebuttal. I have taken into consideration the points raised by my fellow reviewers and find the responses sufficient. I am of the view that this work is worth presenting at MICCAI in its current form. I equally appreciate the clarifications on the core issues I initially had regarding (i) how different MLLMs would affect the quality of descriptions, (ii) generalization, and (iii) unusual (rare) manifestations, which have been addressed in the response. Beyond these, the authors have made reasonable clarifications on other issues raised. Overall, I am happy to keep my rating of acceptance.



Review #3

  • Please describe the contribution of the paper
    • This paper addresses the challenge of generating realistic medical images, a task complicated by the inherent variability and complexity of medical data.
    • Rather than solely aligning overall distributions, the authors aim to match distributions conditioned on class-specific contexts.
    • To achieve this, they utilize multi-modal large language models (MLLMs) to enrich existing medical image descriptions, leveraging the reasoning capabilities of MLLMs to generate more informative textual annotations.
    • The key contribution of the paper lies in enhancing the accuracy of medical image generation by using semantically enriched textual annotations and conditioning the generation process on class-specific prompts.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well-written, with a clear and coherent narrative that makes it easy to follow. The logical organization of sections supports the reader’s understanding of both the motivation and methodology. The writing shows a good balance between technical depth and accessibility, and the figures are well-designed, effectively illustrating key concepts and results. While the “Preliminaries” section might be more intuitively titled “Foundation” or “Background,” this is a stylistic preference rather than a flaw.
    • The introduction of a class-specific prompt bank is a compelling idea, enabling greater intra-class variability through diverse textual descriptions.
    • The visual results demonstrate a striking degree of variation and realism, highlighting the effectiveness of the approach.
    • The paper shows a thoughtful integration of MLLMs for domain-specific text augmentation, which has potential implications for other medical domains as well.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The use of class-conditioned settings for guiding image generation is not a novel concept. Previous studies, particularly in surgical data, have already explored class-specific generation in even more granular ways. Moreover, one could argue that detailed scene descriptions alone may suffice for conditioning. Introducing explicit class labels adds complexity and may limit the flexibility and intuitiveness that prompt-based approaches typically offer. For instance, it is unclear at what level of abstraction the class labels should operate, e.g., medical domain, medical diagnosis, … What is the authors’ perspective on that?
    • The datasets used are limited to dermatology and chest X-ray images, which are relatively less complex than domains such as MRI or surgical imaging. This limits the generalizability of the findings.
    • Minor comment: When building upon a state-of-the-art model, the specific model should be explicitly named. Merely citing reference [31] does not provide sufficient clarity or transparency.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper showcases impressive results. However, since no code is released and the provided information is not sufficient to reproduce VAP-Diffusion, I encourage the authors to share the codebase with all the necessary data.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I appreciate the authors’ thorough rebuttal, which addresses the main concerns raised across the reviews.

    • The preliminary results on MRI/CT image generation provide additional evidence of the method’s broader applicability beyond dermatology and chest X-ray domains.
    • While I still find the role and necessity of explicit class conditioning somewhat debatable, I appreciate the authors’ discussion in the rebuttal. The explanation that conditioning should be aligned with the target application is both academically valuable and practically relevant. The argument remains somewhat ambiguous, yet opens the door for a broader discussion on what aspects of medical image generation should be fixed (e.g., condition labels) versus left flexible (e.g., prompt-driven variation).

    Given these clarifications and the promising early MRI results, I maintain my weak accept recommendation and encourage further exploration of conditioning strategies in future work.




Author Feedback

We thank all reviewers for their valuable feedback. Reviewers recognize the importance and novelty of our work (R1-3), as well as the thorough evaluation (R1), strong performance (R1, R3), and clarity (R1-3). Below, we address the main concerns with clarifications.

  1. R2 on the downstream tasks. We compared our method with StyleGAN, CBDM, LDM, DiT, and U-ViT on downstream tasks (e.g., ISIC2018, DenseNet, 1% data, mAUC: StyleGAN 0.913, CBDM 0.734, LDM 0.785, DiT 0.734, U-ViT 0.811, VAP-Diffusion 0.923). Due to page limits, we report the second-best approach (StyleGAN) alongside VAP-Diffusion in Table 3.
  2. R2 on the effectiveness of PCM. We need to clarify the definition in Table 2 to avoid misunderstanding. The PCM module only functions with the VAPS module and remains inactive in its absence, as it requires input features generated from text prompts (i.e., “w/o VAPS” should be identical to the original U-ViT). Our method is robust to different values of the \alpha term; we select 0.5 as it gives slightly better FID and IS.
  3. R1, R2 on details of PCM to handle unusual text. During inference, we first calculate class-wise similarities between the encoded features F_e and the class-specific prototypes p_c. We then produce a refined feature representation F_e’ by using these class-wise similarities as reweighting coefficients. The final features are obtained by adding the original features and Z(F_e’), where Z is a linear layer. In this way, PCM aligns test-time features with prototypical patterns learned during training, improving model robustness (see the first sketch after this list).
  4. R2 and R3 on the reproduction of VAP-Diffusion. We used U-ViT-S/2(deep) as our backbone and will explicitly name U-ViT in the camera-ready version. We will open-source this project with detailed hyperparameter settings.
  5. R1 on the influence of different MLLMs. In preliminary studies, we evaluated LLaVA, BLIP, QwenVL, InternVL, and GPT4o. QwenVL and InternVL produced accurate visual descriptions, matching GPT4o in capturing lesion features and rare backgrounds. However, GPT4o is limited by API constraints and higher costs. Due to limited domain-specific training data, other MLLMs might miss lesion features (e.g., color) and uncommon backgrounds (e.g., inflammation and scabbing). Given performance and deployment considerations, we recommend QwenVL and InternVL for image prompting.
  6. R1, R3 on the application to CT/MRI. VAP-Diffusion can be adapted for 3D CT/MRI by replacing U-ViT with a 3D ResUNet. 3D descriptions from MLLMs can be obtained by sequentially feeding MR or CT slices and aggregating the results (see the second sketch after this list). In fact, our early results demonstrate that VAP-Diffusion effectively synthesizes 2D CT/MRI scans. Specifically, it outperformed U-ViT baselines, achieving a superior FID of 35.691 (↓9.195) for 2D CT and 16.127 (↓12.287) for 2D MRI. Similarly, it delivered accuracy improvements in downstream tasks, with scores of 0.9435 (↑0.0329) for patch-wise nodule classification and 0.9275 (↑0.0445) for brain tumor classification.
  7. R3 on the use of the class-conditioned generation setting. Class-conditioned generation is widely studied, but it is not our focus. Instead, we investigate how to construct detailed prompts to aid complex medical image generation when only class labels and images are available. While scene descriptions help capture image content, they alone are insufficient due to the variability of medical imaging (“same (different) disease, different (same) manifestations”). Including disease-specific category information enables finer control over generation outputs, while the corresponding condition labels enhance downstream task performance. In our view, the choice of class labels for medical image generation should align with its ultimate goals: for disease classification, use disease categories; to improve cross-domain robustness, incorporate hospital IDs and gender labels; to enhance learning of rare visual features, employ attribute prompts. VAP-Diffusion provides a framework that could make all these scenarios feasible, mainly by prompting MLLMs.
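To make point 3 above concrete, here is a minimal PyTorch sketch of the prototype condition mechanism as described in the rebuttal. The cosine similarity, the softmax reweighting, and the learnable prototype parameterization are our assumptions for illustration, and the module and variable names are hypothetical; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeConditionMechanism(nn.Module):
    """Hedged sketch of PCM: class-wise similarities between encoded
    prompt features F_e and prototypes p_c reweight the prototypes,
    and a zero-initialized linear layer Z injects the result."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # Class-specific prototypes p_c (assumed learnable here; they
        # could instead be estimated from training features).
        self.prototypes = nn.Parameter(torch.randn(num_classes, dim))
        # Linear layer Z initialized with all zeros, so PCM starts as a
        # no-op residual and gradually injects the multi-modal priors.
        self.z = nn.Linear(dim, dim)
        nn.init.zeros_(self.z.weight)
        nn.init.zeros_(self.z.bias)

    def forward(self, f_e: torch.Tensor) -> torch.Tensor:
        # f_e: (batch, dim) encoded text-prompt features.
        sims = F.cosine_similarity(
            f_e.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )  # class-wise similarities, shape (batch, num_classes)
        weights = sims.softmax(dim=-1)
        # Refined representation F_e' as a similarity-weighted mix.
        f_refined = weights @ self.prototypes  # (batch, dim)
        # Final features: original features plus Z(F_e').
        return f_e + self.z(f_refined)
```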
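Similarly, for point 6, a hedged sketch of obtaining a 3D description by slice-wise captioning. Here `describe_slice` stands in for any MLLM captioning call (e.g., a QwenVL or InternVL wrapper) and is not an API from the paper, and plain concatenation is only one possible aggregation strategy.

```python
from typing import Callable, List
import numpy as np

def describe_volume(
    slices: List[np.ndarray],
    describe_slice: Callable[[np.ndarray], str],
) -> str:
    # Sequentially feed each MR/CT slice to the MLLM captioner.
    per_slice = [describe_slice(s) for s in slices]
    # Simplest aggregation: concatenate the slice descriptions; a
    # summarizing LLM pass could replace this join in practice.
    return " ".join(per_slice)
```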




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


