Abstract

Vision-Language Foundation Models (VLFMs) have shown tremendous performance gains in generating high-resolution, photorealistic natural images. While VLFMs show a rich understanding of semantic content across modalities, they often struggle with fine-grained alignment tasks that require precise correspondence between image regions and textual descriptions, a critical limitation in medical imaging, where accurate localization and detection of clinical features are essential for diagnosis and analysis. To address this issue, we propose a multi-stage architecture in which a pre-trained VLFM (e.g. Stable Diffusion) provides a cursory semantic understanding, while a reinforcement learning (RL) algorithm refines the alignment through an iterative process that optimizes for semantic context. The reward signal is designed to align the semantic information of the text with the synthesized images. Experiments on the public ISIC2019 skin lesion dataset demonstrate that the proposed method improves (a) the quality of the generated images, and (b) the alignment with the text prompt over the original fine-tuned Stable Diffusion baseline. We also show that the synthesized samples can be used to improve disease classifier performance for underrepresented subgroups through augmentation. Our code is accessible through the project website: https://parhamsaremi.github.io/rl4med-ddpo

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4771_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/parhamsaremi/rl4med-ddpo/

Link to the Dataset(s)

N/A

BibTex

@InProceedings{SarPar_RL4MedDDPO_MICCAI2025,
        author = { Saremi, Parham and Kumar, Amar and Mohamed, Mohamed and TehraniNasab, Zahra and Arbel, Tal},
        title = { { RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15963},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents RL4Med-DDPO, a novel framework for medical image generation that enhances semantic alignment between text prompts and generated images using reinforcement learning. By fine-tuning Stable Diffusion with a policy optimization strategy and an attribute alignment reward, the method improves control over image synthesis and reduces common artifacts. Experiments on the ISIC2019 skin dataset demonstrate improved image-text alignment and generation quality, with additional benefits for data augmentation in underrepresented subgroups.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper is well motivated and the research question is important for potential medical applications. The paper proposes the first policy optimization-based medical image generation model.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The work is generally good, but the experimental design has some limitations. The main application of the RL-based generative method should be generating customized datasets that contain dermoscopic images with or without specific artifacts. The value of the work should be evaluated on whether a generated dataset with these specific attributes can enhance classifier robustness to spurious correlations and distribution shifts. Without these experiments, the current setting cannot sufficiently demonstrate the value of the method. Also, Table 4 shows very marginal improvement on downstream classification. Further, as the authors work on dermoscopic image-based bias problems, related papers [1,2] about bias in dermatology should be discussed to provide some background information on this field, which can strengthen the main motivation of the paper. [1] Bissoto, A., Valle, E. and Avila, S. (2020) Debiasing skin lesion datasets and models? Not so fast. CVPR 2020. [2] Yan, S. et al. (2023) Towards trustable skin cancer diagnosis via rewriting model’s decision. CVPR 2023.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concern is that the experimental setting is not convincing to me and should be improved; the authors also fail to sufficiently discuss why the specific generative model application (i.e., generating specific artifacts and removing unwanted artifacts) is important. The authors should engage more with the existing literature in related fields.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The work is generally solid and offers valuable insights to the medical AI field. Enhancing the discussion of related topics such as bias in dermatology, spurious correlations, and distribution shifts would significantly strengthen the paper. I encourage the authors to incorporate these suggestions in the camera-ready version.



Review #2

  • Please describe the contribution of the paper

    This paper presents a multi-stage framework for text-guided image generation in the medical imaging domain, leveraging a vision-language foundation model (VLFM). Initially, a pre-trained VLFM offers a high-level semantic interpretation of the input text. This is followed by a reinforcement learning (RL) phase, which iteratively refines the alignment between textual semantics and generated images. The RL component employs a reward signal specifically designed to enhance semantic consistency between the text and the synthesized images. The authors demonstrate that the generated images can improve downstream disease classification performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The use of reinforcement learning to align text and generated images is an interesting extension.

    • Introduction of the Artifact Prevalence Rate (APR) as a new metric provides a clear method to evaluate the presence of target attributes in synthesized images.

    • The evaluation includes both qualitative results and quantitative assessments such as downstream classification, helping establish clinical relevance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The improvement in classification performance from synthesized data augmentation, as shown in Table 4, appears marginal, and its statistical significance is unclear.

    • The RL reward design lacks clarity on how prompts with fewer attributes are handled, especially in cases where the prompt does not include all six attributes used for evaluation.

    • The implementation details of the reinforcement learning process are insufficiently documented, making reproduction of results difficult.

    • The scope of evaluation is relatively narrow, limited to six artifacts; generalization to more complex scenarios or broader datasets is not demonstrated.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Questions to Authors:

    Q1. How does the reward formulation handle prompts that reference only a subset of attributes? Does it penalize images that do not include unrelated attributes?

    Q2. Could you clarify the implementation details of the RL stage? What libraries or toolkits were used, and what were the critical hyperparameters?

    Q3. For the fine-tuning of Stable Diffusion (baseline), was it done in a similar setup as your RL-based tuning? Could you elaborate on the process?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces a promising multi-stage framework that integrates reinforcement learning into a vision-language foundation model for medical image synthesis. Overall, the paper is well-structured and demonstrates a meaningful application. The Artifact Prevalence Rate is a useful contribution, and the downstream evaluation offers a potential insight into clinical utility. However, limited improvement in classification performance, a narrow scope of application, and lack of implementation details are notable concerns.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Given the authors’ clarification during the rebuttal, some concerns regarding scope and implementation have been reasonably addressed, which strengthens the case for the paper’s potential impact and relevance.



Review #3

  • Please describe the contribution of the paper

    This paper addresses controllable skin medical image generation by building on pre-trained Stable Diffusion and enhancing text-alignment in generation, such as aligning disease categories with artifact attributes, through Denoising Diffusion Policy Optimization (DDPO). Specifically, a pre-trained attribute classifier is employed to determine the reward, which measures the proportion of correctly predicted attributes in the generated images. Experiments on the ISIC2019 dataset demonstrate the effectiveness of the proposed method.
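    The reward described above can be illustrated with a minimal sketch: an attribute classifier inspects the generated image, and the reward is the fraction of prompt-specified attributes that the classifier correctly detects. The function name, the dictionary-based attribute encoding, and the handling of empty prompts below are illustrative assumptions, not the paper's actual implementation.

    ```python
    def attribute_alignment_reward(predicted_attrs: dict, prompt_attrs: list) -> float:
        """Hypothetical sketch of an attribute-alignment reward.

        predicted_attrs: classifier output, mapping attribute name -> bool
                         (True if the attribute was detected in the image).
        prompt_attrs:    attributes requested in the text prompt.

        Returns the proportion of prompt attributes the classifier
        detected (1.0 = all requested attributes present).
        """
        if not prompt_attrs:
            return 0.0  # assumption: no requested attributes yields zero reward
        detected = sum(1 for attr in prompt_attrs if predicted_attrs.get(attr, False))
        return detected / len(prompt_attrs)


    # Example: prompt asks for a ruler and hair; classifier only finds the ruler.
    reward = attribute_alignment_reward(
        predicted_attrs={"ruler": True, "hair": False},
        prompt_attrs=["ruler", "hair"],
    )
    print(reward)  # 0.5
    ```

    In a DDPO-style loop, this scalar would score each sampled image and weight the policy-gradient update on the diffusion model's denoising steps.
    
    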

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The proposed medical image generation method with specified attributes benefits long-tail disease classification by serving as an effective data augmentation technique.

    (2) The paper introduces the use of DDPO for fine-tuning diffusion-based image generation models, which improves performance by reinforcing image-text alignment, particularly for generating rare attribute images.

    (3) Experimental results demonstrate that the DDPO-tuned text-to-image (T2I) model outperforms the naively fine-tuned T2I model in terms of generation quality and downstream long-tail classification performance.

    (4) The paper is well-written.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    (1) Regarding the mentioned artifacts, such as rulers, in the naively fine-tuned Stable Diffusion (SD), it is likely that such artifacts appear due to their presence in the training data, as the diffusion model tends to learn the underlying data distribution. I recommend that the paper calculate relevant statistics for the real data to examine this issue more thoroughly. Additionally, based on the experiments, it appears that the fine-tuned SD already synthesizes diseases well, while the RL-guided SD shows improvements specifically in generating diseases with particular attributes. I suggest that the authors explicitly highlight this improvement in the Introduction. Furthermore, it would be beneficial to conduct experiments focusing solely on disease synthesis without attribute specifications.

    (2) Since the paper primarily leverages the existing DDPO technique, it should discuss the challenges and limitations of applying DDPO to medical image generation. Additionally, the role of the attribute classifier in the reward function should be explored further, including classifier models beyond EfficientNet, as the reward classifier plays a critical role in DDPO.

    (3) Experiments are conducted exclusively on skin medical images, so the generalization of the method to other medical applications and image modalities remains unexplored. To reflect the study’s scope more accurately, I suggest constraining the paper title to focus specifically on skin medical image generation.

    (4) The paper should provide more details about the training process, including important parameters such as the learning rate and the choice of sample number $M$. Besides, for the evaluation, the paper should also discuss what the LPIPS metric is meant to convey, as it measures the similarity between paired image data.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    For the notations in the DDPO formulation, such as states, it would be preferable to use $R$ (in mathematical notation) instead of plain text “R” for clarity.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper focuses on controllable skin medical image generation and proposes using DDPO to improve image-text alignment. While the method demonstrates promising results, the paper is currently limited in its application scope and lacks sufficient experimental analysis and discussion.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The paper proposes a new policy optimization-based method for controllable image generation. The rebuttal addresses some of my concerns, and I am inclined to recommend acceptance. I encourage the authors to further refine the paper to address the remaining issues.




Author Feedback

We thank the reviewers for their positive and insightful comments. We now address several questions raised by the reviewers:




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers were satisfied with the rebuttal. The authors should address the remaining concerns in the camera-ready version.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A
