Abstract
Automated radiology report generation (RRG) for routine 2D and 3D radiographs, such as X-ray and computed tomography (CT), has great potential to reduce the workload, variation, and errors of report writing and to facilitate patient care. Despite significant advancements in linguistic quality, existing methods may generate reports with hallucinated type I and II errors (false positives and false negatives), which limit clinical efficiency. To mitigate these hallucinations, we propose RRG-DPO, an innovative direct preference optimization procedure with a new loss term, both tailored for effective alignment with the preference for clinically accurate RRG. RRG-DPO retrieves a set of highly relevant reports closest to the preferred response (i.e., the ground-truth (GT) report) in a biomedical CLIP embedding space, and selects the one with the most significant abnormality conflicts with the GT as the dispreferred response. Besides being clinically relevant and abnormality-aware, this preference data curation process is cost-effective and scalable compared to using large language models for response sampling or evaluation. In addition, we note that apart from the abnormality-conflicting sentences, the remaining sentences of the dispreferred report can legitimately describe the radiograph of the preferred response in a clinically equivalent manner, despite variations in expression. Thus, RRG-DPO creates a sub-preferred report from the dispreferred one by deleting the abnormality-conflicting sentences, and promotes its likelihood with a new loss term. RRG-DPO is evaluated on both 2D X-ray and 3D CT data to align a wide range of RRG models. Experiments show that it boosts the clinical efficiency of all assessed models on six metrics: precision, recall, F1 score, RadGraph, RadCliQ, and RaTEScore, effectively reducing hallucinations. Further ablation studies show that our method outperforms DPO and DPOP, and that each of its components is helpful. Our code will be available.
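The curation procedure described in the abstract (retrieve the reports closest to the GT in an embedding space, then pick the one with the most abnormality conflicts) could be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: all function and variable names are hypothetical, the embeddings stand in for, e.g., BiomedVLP features, and the binary labels stand in for, e.g., CheXbert abnormality classifications.

```python
import numpy as np

def curate_dispreferred(gt_embedding, gt_labels, corpus_embeddings, corpus_labels, k=5):
    """Hypothetical sketch of the preference-data curation step.

    1. Retrieve the k reports closest to the ground-truth (GT) report in a
       shared embedding space (cosine similarity).
    2. Among those, pick the one whose abnormality labels conflict most with
       the GT labels, mimicking type I/II (false positive/negative) errors.
    """
    # Cosine similarity between the GT embedding and every corpus report.
    norms = np.linalg.norm(corpus_embeddings, axis=1) * np.linalg.norm(gt_embedding)
    sims = corpus_embeddings @ gt_embedding / norms

    # Indices of the k most similar (i.e., clinically relevant) reports.
    top_k = np.argsort(sims)[-k:]

    # Count label disagreements (1 = abnormality present, 0 = absent);
    # each mismatch is a false positive or false negative w.r.t. the GT.
    conflicts = (corpus_labels[top_k] != gt_labels).sum(axis=1)

    # Return the most abnormality-conflicting of the retrieved reports.
    return top_k[np.argmax(conflicts)]
```

Restricting the conflict search to the top-k neighborhood keeps the dispreferred response linguistically plausible for the image while maximizing clinical disagreement, which is the abstract's stated intuition.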
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1273_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/ccarliu/RRG-DPO
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LiuHon_RRGDPO_MICCAI2025,
author = { Liu, Hong and Wei, Dong and Xu, Zhe and Wu, Xian and Zheng, Yefeng and Wang, Liansheng},
title = { { RRG-DPO: Direct Preference Optimization for Clinically Accurate Radiology Report Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {554--564}
}
Reviews
Review #1
- Please describe the contribution of the paper
This work designed a novel alignment approach for radiology report generation (RRG): with a specially designed sampling method, DPO makes the model perform better in various scenarios.
- Clinically Relevant and Abnormality-Aware Data Curation: Proposes a cost-effective and scalable approach to curate preference data by retrieving clinically relevant reports and selecting dispreferred ones based on abnormality conflicts, mimicking hallucinated type I and II errors.
- Sub-Preference Optimization: Introduces a novel sub-preference optimization process by removing abnormality-conflicting sentences from dispreferred reports and promoting the likelihood of the resulting sub-preferred reports with a new loss term.
- Comprehensive Evaluation and SOTA Performance: Demonstrates improved clinical efficiency across multiple metrics (precision, recall, F1, RadGraph, RadCliQ, RaTEScore) on 2D X-ray and 3D CT datasets, achieving state-of-the-art results on the MIMIC-CXR dataset and outperforming DPO and DPOP.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- In the RRG scenario, this work enhances model performance by constructing suitable samples through retrieval, alignment, and optimization methods.
- Most experiments and metrics demonstrate the effectiveness of this method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- In evaluating the effectiveness of alignment methods, safety and harmfulness are crucial aspects. Does this work consider and optimize these factors?
- Will the related alignment data be open-sourced?
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The starting point of this work is very good. I think some additional experiments or data are needed to make it more complete. I will adjust the score based on the corresponding feedback.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I think the authors have resolved my confusion, and I have adjusted my rating.
Review #2
- Please describe the contribution of the paper
This paper presents a method for automated radiology report generation (RRG). In RRG, hallucinations often occur. To mitigate the hallucinations, this paper proposes RRG-DPO (direct preference optimization). This DPO procedure is tailored for effective alignment with the preference for clinically accurate RRG via a new loss term. The experimental results show that it boosts the clinical efficiency of all assessed models on six metrics: precision, recall, F1 score, RadGraph, RadCliQ, and RaTEScore, effectively reducing hallucinations.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The authors proposed RRG-DPO, which consists of clinically relevant dispreferred-response retrieval, abnormality-aware dispreferred-response selection, and sub-preference construction. These selection processes are optimized with the new loss function. The authors designed this process with a deep understanding of current RRG. Ablation studies are well demonstrated.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
One drawback is that this paper lacks an analysis of the generated results against the actual input images. A radiology report sometimes does not cover all findings in an image, and RRG may generate finding descriptions for areas that the GT report misses. A deeper look into the images is required.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
As mentioned in the weakness part, it is recommended to show pairs of images and text for a deeper discussion.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
LMMs are a very exciting area nowadays. The proposed method focuses on suppressing hallucination by introducing DPO. This is a good idea to introduce at the MICCAI conference.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper introduces RRG-DPO, a direct preference optimization method tailored specifically for automated radiology report generation (RRG). Existing RRG methods often generate clinically inaccurate reports due to hallucinations (false positives and negatives). To address this, the authors propose RRG-DPO, which employs a clinically relevant and abnormality-aware data curation process using embeddings and abnormality classifications. RRG-DPO retrieves clinically relevant reports similar to the ground truth and selects the one with the most significant abnormality discrepancies as the dispreferred response. It then creates a sub-preferred report by removing conflicting sentences and integrates it into a new loss function to promote clinically accurate reporting. Experiments conducted on both 2D X-ray (MIMIC-CXR) and 3D CT (CT-RATE) datasets demonstrate that RRG-DPO significantly improves clinical metrics (CheXbert classification, RadGraph, RadCliQ, and RaTEScore) across various RRG models, effectively mitigating hallucinations.
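For context, the standard DPO objective that RRG-DPO builds on is the well-known formulation (not taken from this paper; the paper's additional sub-preference loss term is not reproduced here):

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right],
\]

where \(x\) is the radiograph, \(y_w\) the preferred (GT) report, \(y_l\) the dispreferred report, \(\pi_\theta\) the model being aligned, \(\pi_{\mathrm{ref}}\) the frozen reference model, and \(\beta\) a temperature. Per the abstract, RRG-DPO additionally promotes the likelihood of the sub-preferred report obtained by deleting the abnormality-conflicting sentences from \(y_l\).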
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Clinically Oriented Preference Optimization: The authors carefully design a data curation process that emphasizes clinical relevance and abnormality awareness, differentiating it from conventional random sampling or general-purpose DPO methods. The sub-preference construction, which involves removing abnormality-conflicting sentences from dispreferred reports, is also interesting, as it preserves clinically accurate information, preventing the loss of meaningful content during optimization.
Strong and Comprehensive Evaluation: The paper provides thorough evaluations of RRG-DPO using a broad range of clinical metrics (CheXbert-classification, RadGraph, RadCliQ, and RaTEScore). The method’s effectiveness is consistently demonstrated across both 2D (X-ray) and 3D (CT) imaging modalities.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Dependence on Embeddings and Abnormality Classifier: The selection process for dispreferred reports relies heavily on BiomedVLP embeddings and CheXbert-based abnormality classifications. However, the authors do not adequately validate or discuss how variations in the performance of these components might affect the quality or reliability of the selected dispreferred responses.
Limited baseline models: The paper excludes certain non-LLM baselines (e.g., CXRMate [1]) and omits newer LLM-based methods (e.g., MAIRA-2 [2], CXR-LLaVA [3], LLaVA-Rad [4]), raising questions about how RRG-DPO can be effective on these models. [1] CXRMate: Longitudinal data and a semantic similarity reward for chest X-ray report generation. Nicolson et al., Informatics in Medicine Unlocked, 2024. [2] MAIRA-2: Grounded Radiology Report Generation. Bannur et al., arXiv, 2024. [3] CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. Lee et al., European Radiology, 2025. [4] LLaVA-Rad: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. Chaves et al., arXiv, 2024.
Limited Scope and Detail in Ablation Studies: The ablation study presented is restricted only to the EKAGen model on the MIMIC-CXR dataset. It is unclear why this model and dataset combination was exclusively selected. Additionally, the study lacks analysis of a scenario where only RRG-DPO (without other supportive components) is implemented, limiting insights into the individual contribution of the novel method itself.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Clarity in Figures: In Figure 1, the red color scheme is somewhat confusing, making it challenging to distinguish clearly between the preferred and dispreferred reports.
Abbreviation Issue: The abbreviation “DPOP” is introduced without clearly defining it initially. Clarify this abbreviation at its first occurrence.
Typos: “RGG-DPO” in Table 2
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
RRG-DPO is effective on many baseline models, but the justification of the methods and the experiment design need improvement.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
Although some issues (the missing key state-of-the-art baselines and the validation of RRG-DPO's individual contribution) remain unresolved, the authors have addressed the other significant concerns.
Author Feedback
We thank the reviewers for 1) appreciating our work’s novelty, effectiveness, clinical relevance, strong and comprehensive evaluation, and 2) the constructive comments.
R3Q1 Safety and harmfulness Our method enhances the clinical efficiency of radiology report generation (RRG), particularly by improving the recall and precision of abnormality descriptions. The improvements naturally contribute to the safety and mitigate the harmfulness of generated reports from the clinical perspective. Meanwhile, no additional harmful content, such as bias or hate, is introduced, as we use the original reports as preferred responses for alignment. So far as we are aware, there is no specialized benchmark for assessing the safety and harmfulness of RRG. We will evaluate these critical factors when such benchmarks become available.
R3Q2 Open-source alignment data We will open-source the alignment data with our code.
R5Q1 Lack of visual analysis We will include pairs of images and text, and corresponding discussion, in the extra 0.5 page of the final version. Like the reviewer, we also noticed some suspected false-positive descriptions that might in fact be omissions in the GT reports, which calls for refinements to public RRG benchmarks.
R6Q1 Dependence on embeddings and abnormality classifier We agree that the embeddings and abnormality classifier are critical to our method. In our preliminary experiments (not included in the initial submission due to the page limit) with EKAGen on MIMIC-CXR, we compared BiomedVLP against CXR-RePaiR (Endo et al., 2021) for embedding, and CheXbert against CheXpert (Irvin et al., 2019) for abnormality classification; our final model demonstrated superior performance compared to these alternatives. The performance discrepancies highlight the impact of these components' effectiveness. In the Conclusion, we have discussed using a more diverse and accurate classifier for better performance. We will expand the discussion to include the content above.
R6Q2 Limited baseline models Our current baselines include non-LLM (R2Gen, EKAGen, PromptMRG for 2D data; CT2Rep for 3D data) and LLM-based (R2GenGPT/3D-CT-GPT for 2D/3D data). We respectfully emphasize that this collection encompasses a wide variety of representative approaches to RRG: classical to recent (up to Oct. 2024), 2D and 3D, and non-LLM and LLM-based. However, we agree on the necessity of assessing our method on a broader spectrum of baselines to understand its generalizability, and thank the reviewer for bringing the more recent baselines to our attention. We will discuss these important works in this paper, and plan to apply our method to them in future work.
R6Q3 Ablation study restricted to EKAGen on MIMIC-CXR Due to its exploratory nature and the page limit, this paper conducts ablation studies with a single combination of model and dataset. MIMIC-CXR is the most impactful benchmark for RRG to date, and EKAGen was a recent SOTA method when this work was developed. In addition, it is fully open-sourced with both code and checkpoints available, facilitating experiments. A future extension of our paper could widen the scope of the ablation studies with more models and data.
R6Q4 Individual contribution of RRG-DPO According to preliminary experiments, implementing only RRG-DPO (without other supportive components) yields results better than rows (a) and (b) in Tab. 2, but worse than our complete method in row (f). These results indicate that: 1) built on the novel notion of sub-preference, RRG-DPO is effective even with random selection of dispreferred responses; and 2) our proposed clinically relevant retrieval and abnormality-aware selection can further boost RRG-DPO's efficacy. Initially, we did not include this ablation setting, considering the component-wise incremental presentation in Tab. 2. Given this insightful comment, we will add it if allowed.
R6Q5 Optional comments We will address the color scheme (Fig. 1), abbreviation definition (DPOP), and typo (“RGG-DPO”).
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers lean towards acceptance after reading the rebuttal. I also support this work.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers rated the paper at least as a weak accept, and the research idea was considered novel.