Abstract

Recent advances in text-conditioned image generation with diffusion models have begun paving the way for new opportunities in the modern medical domain, in particular, generating Chest X-rays (CXRs) from diagnostic reports. Nonetheless, to further drive diffusion models to generate CXRs that faithfully reflect the complexity and diversity of real data, it has become evident that a nontrivial learning approach is needed. In light of this, we propose CXRL, a framework motivated by the potential of reinforcement learning (RL). Specifically, we integrate a policy gradient RL approach with multiple well-designed, CXR-domain-specific reward models. This approach guides the diffusion denoising trajectory, achieving precise CXR posture and pathological details. Here, considering the complex medical image environment, we present “RL with Comparative Feedback” (RLCF) for the reward mechanism: a human-like comparative evaluation that is known to be more effective and reliable in complex scenarios than direct evaluation. Our CXRL framework jointly optimizes learnable adaptive condition embeddings (ACE) and the image generator, enabling the model to produce more accurate CXRs of higher perceptual quality. Our extensive evaluation on the MIMIC-CXR-JPG dataset demonstrates the effectiveness of our RL-based tuning approach. Consequently, our CXRL generates pathologically realistic CXRs, establishing a new standard for generating CXRs with high fidelity to real-world clinical scenarios.
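
For readers who want a concrete picture of the policy-gradient recipe the abstract outlines, the sketch below shows a minimal REINFORCE-style update with a comparative (candidate-versus-reference) reward. Every function name, tensor shape, and the exact comparison rule here are illustrative assumptions, not the paper’s implementation.

import torch

# Hypothetical sketch: the denoising trajectory is treated as a sequence of
# actions, and a comparative reward weights the trajectory's log-probabilities.
# log_probs is assumed to have shape (batch, num_denoising_steps).

def comparative_reward(candidate_scores: torch.Tensor,
                       reference_scores: torch.Tensor) -> torch.Tensor:
    # A simplified reading of comparative feedback: judge each sample
    # relative to a reference rather than on an absolute scale.
    return candidate_scores - reference_scores

def policy_gradient_step(log_probs: torch.Tensor,
                         rewards: torch.Tensor,
                         optimizer: torch.optim.Optimizer) -> float:
    # REINFORCE: ascend E[(R - mean R) * sum_t log pi(a_t | s_t)].
    advantage = (rewards - rewards.mean()).detach()
    loss = -(advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()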

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0165_paper.pdf

SharedIt Link: https://rdcu.be/dV1Vf

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72384-1_6

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0165_supp.zip

Link to the Code Repository

https://github.com/MICV-yonsei/CXRL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Han_Advancing_MICCAI2024,
        author = { Han, Woojung and Kim, Chanyoung and Ju, Dayun and Shim, Yumin and Hwang, Seong Jae},
        title = { { Advancing Text-Driven Chest X-Ray Generation with Policy-Based Reinforcement Learning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {56--66}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents a novel approach that incorporates policy-based reinforcement learning (via three different feedback steps) for chest X-Ray image generation based on text prompts. Results were shown to improve on previous results, and ablation studies were conducted on a single dataset.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The results suggest that the novel approach of using policy-based reinforcement learning improves generated chest X-Ray images. The comparisons and ablation studies can provide useful information to those involved in developing similar approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper makes somewhat strong claims about the generated images. Ultimately, the lack of full bit-depth images for training would limit the quality of the output images. The manuscript should discuss this limitation, as well as the limitations of the study presented.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    I may have missed this in the paper, but I have the following conclusions on reproducibility: The paper does not state any open sourcing of the code; does not state what hardware was required to train the model; does not state the exact split of data used to train.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper’s results suggest improved performance, but the claims are rather strong, and the metrics listed could be described in more detail.

    Ultimately, it appears that a radiologist could easily distinguish a real CXR from one generated by a model trained on limited bit-depth images. This limits how the method’s output can be used.

    Some individual points:

    • What is the resolution and bit-depth of the output images?

    • Mention that an older version of MIMIC-CXR-JPG (Version 2.0) is used.

    • The AUROC classification metric and how it was computed across multiple findings could be more detailed or at least plotted to get a better sense of potential sensitivity/specificity.

    • The Medical Expert Assessment section appears to be ad hoc. It is not clear who the “radiology experts” are exactly. What hardware/software were they provided for image evaluation? Were the monitors calibrated? What training materials and practice sessions were provided prior to the study? What would be an example of a 2 versus a 3 on the scale for a report and images? How were readers calibrated for this scale?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper lacks details for reproducibility but does offer an interesting approach to improve X-Ray image generation along with some evidence of the improvement. Details and reproducibility are factors in this rating.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces CXRL, a novel framework that applies reinforcement learning (RL) with Comparative Feedback (RLCF) to the generation of Chest X-rays (CXRs) from diagnostic reports. The method enhances the precision of CXR posture alignment and the accuracy of pathological details, and it ensures consistency between input reports and generated CXRs. It presents an interesting approach to applying RL to text-conditioned medical image synthesis, particularly for CXRs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper successfully demonstrates the application of policy gradient reinforcement learning in refining the quality and clinical accuracy of generated medical images.  
    2. A key strength is the introduction of a novel reward feedback system that improves posture alignment, the accuracy of pathological details, and consistency between input reports and generated CXRs.
    3. The extensive evaluations and quantitative results on the MIMIC-CXR-JPG dataset validate the model’s efficacy in producing good-quality CXRs.
    4. The work is well-motivated, and the proposed approach outperformed the baselines.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    More information about the compared baselines is lacking and needs to be provided.

    Ethical considerations, particularly concerning the potential for misuse of synthetic medical image generation, should be thoroughly discussed.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors should make the code available for easy reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper is well-written and shows improvement over the baselines. The proposed reward feedback system, which improves posture alignment and the accuracy of pathological details and ensures consistency between input reports and generated CXRs, is a good contribution.

    The paper needs more information about the compared baselines.

    Including ethical implications or concerns with generating synthetic medical images is necessary.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Technical novelty and the results achieved. Also, the experiments suggest that the approach is effective.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This work proposes CXRL, a framework motivated by the potential of reinforcement learning (RL), to advance text-driven Chest X-ray generation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This study pioneers the application of RL to text-conditioned medical image synthesis, particularly for CXRs, focusing on detail refinement and input-condition control for clinical accuracy.
    2. They advance report-to-CXR generation with an RLCF-based reward mechanism, emphasizing posture alignment, pathology accuracy, and consistency between input reports and generated CXRs.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Does this work conduct a hyperparameter analysis? How did you choose the hyperparameters for the different terms in the final reward function?
    2. The results in Table 2b should also include the MS-SSIM score.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    see above

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The qualitative and quantitative evaluations have demonstrated the effectiveness of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for the constructive feedback and for valuing our paper. To address the major weaknesses identified by the reviewers, we would like to clarify any possible misunderstandings as follows:

R3: Distinguishing real CXRs from limited bit-depth generated CXRs. We visualize the generated CXR by normalizing the model’s output into an 8-bit image. However, since our model is trained in bf16, the model output has sufficient bit-depth to produce detailed CXRs. Furthermore, our goal is not necessarily to make real and generated CXRs indistinguishable. While future advances in medical image generation may require creating CXRs realistic enough for clinical use, where indistinguishability is crucial, our current intent is to use this technology for research and educational purposes, where such a requirement is not necessary.
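
For illustration only, the visualization step described above amounts to a standard min-max normalization; a minimal sketch (hypothetical code, not the authors’) could look like this:

import numpy as np

# Min-max normalize a floating-point model output (e.g., computed in bf16)
# into an 8-bit grayscale image for display.
def to_uint8(x: np.ndarray) -> np.ndarray:
    x = x.astype(np.float32)
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)  # guard against flat images
    return (x * 255.0).round().astype(np.uint8)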

R3: Reproducibility. We have included our code in the supplementary materials and plan to make it publicly available. This will assist future researchers in fully grasping and building upon our work.

R3: Misunderstanding of MIMIC-CXR-JPG dataset version. We utilize the MIMIC-CXR-JPG dataset for training and evaluating our model. This dataset is completely based on the latest version of the MIMIC-CXR database v2.0.0, offering JPG format files derived from DICOM images and structured labels extracted from free-text reports. Thus, we are employing the most up-to-date dataset available.

R3: Detailed explanation of multi-class AUROC. As detailed in Tab. 1(b) and Supp. Tab. 1, we calculated the total AUROC by averaging the AUROCs of the binary classifications for each of the 11 labels, where each per-label AUROC is the area under the ROC curve computed from the model probabilities and the ground-truth labels. This calculation follows the same evaluation metric as the baseline. Although sensitivity/specificity are not separately discussed due to space constraints, their values can be inferred from the AUROC presented.
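
For concreteness, the averaging described above corresponds to a macro-averaged AUROC. A minimal sketch using scikit-learn follows; the array names and shapes are assumptions, not the authors’ code.

import numpy as np
from sklearn.metrics import roc_auc_score

# Per-label binary AUROC averaged over the 11 findings. y_true and y_prob
# are assumed to be arrays of shape (n_images, 11): ground-truth labels
# and model probabilities, respectively.
def macro_auroc(y_true: np.ndarray, y_prob: np.ndarray) -> float:
    per_label = [
        roc_auc_score(y_true[:, k], y_prob[:, k])
        for k in range(y_true.shape[1])
        if len(np.unique(y_true[:, k])) > 1  # AUROC is undefined for one-class labels
    ]
    return float(np.mean(per_label))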

R3: Details of Medical Expert Assessment. “Radiology experts” with experience in CXR evaluated the images using three main criteria: Report Consistency, Image Completeness, and Factuality. Report Consistency measured alignment with the diagnostic report, Image Completeness assessed whether the image was fully displayed without cropping, and Factuality checked the image’s resemblance to an actual CXR. Our model was compared and rated against existing baselines simultaneously and fairly on the same hardware/software configuration, ensuring impartiality.

R4: More information about baselines. Due to space limitations, we were unable to describe in detail the two recent state-of-the-art report-to-CXR generation models used as baselines. First, RoentGen is a model fine-tuned on the MIMIC-CXR dataset using Stable Diffusion 1.5. The other baseline, LLM-CXR, generates CXRs from text-based radiology reports by instruction fine-tuning a pre-trained LLM, with images tokenized using VQ-GAN.

R4: Ethical concerns. Ethical concerns related to generated medical images are indeed crucial, and we fully understand and acknowledge the importance of addressing these issues comprehensively. Generated images are clearly marked for research or educational use only, not for clinical diagnosis. Moreover, when used solely for educational and experimental purposes, our generated images have the advantage of preventing patient privacy violations, since no actual patient images are used. However, we emphasize the need for strict marking to ensure these images are not used directly for diagnosis, and we are fully aware of this requirement.

R5: Hyperparameters for each feedback model in the final reward function. When choosing the hyperparameter for each feedback model, our focus was on ensuring that each model provided balanced feedback without dominating the others. In other words, we focused solely on scaling the values, which simplified the tuning process.
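
To illustrate the scaling-only combination described above, a minimal sketch follows; the weight values and reward names are hypothetical, not the paper’s actual settings.

# Weighted sum of per-model rewards; the weights act purely as scale factors,
# chosen so that no single feedback model dominates the total reward.
REWARD_WEIGHTS = {"posture": 1.0, "pathology": 1.0, "consistency": 1.0}

def total_reward(rewards):
    # rewards: dict mapping reward-model name -> scalar reward value
    return sum(REWARD_WEIGHTS[name] * value for name, value in rewards.items())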




Meta-Review

Meta-review not available, early accepted paper.


