Abstract
The vision-language modeling capability of multi-modal large language models has attracted wide attention from the community. However, in the medical domain, radiology report generation using vision-language models still faces significant challenges due to the imbalanced data distribution caused by the numerous negated descriptions in radiology reports, as well as the coarse alignment between radiology reports and radiographs. In this paper, we propose a truthful radiology report generation framework, namely TRRG, based on stage-wise training for cross-modal disease clue injection into large language models. In the pre-training stage, contrastive learning is employed to enhance the visual encoder’s ability to perceive fine-grained disease details. In the fine-tuning stage, our proposed clue injection module significantly enhances the disease-oriented perception capability of the large language model by effectively incorporating robust zero-shot disease perception. Finally, through the cross-modal clue interaction module, our model effectively achieves multi-granular interaction between visual embeddings and an arbitrary number of disease clue embeddings. This significantly enhances the report generation capability and clinical effectiveness of multi-modal large language models in the field of radiology report generation. Experimental results demonstrate that our proposed pre-training and fine-tuning framework achieves state-of-the-art performance in radiology report generation on datasets such as IU-Xray and MIMIC-CXR. Further analysis indicates that our proposed method can effectively enhance the model’s ability to perceive diseases and improve its clinical effectiveness.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/5020_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WanYuh_TRRG_MICCAI2025,
author = { Wang, Yuhao and Sun, Yue and Tan, Tao and Hao, Chao and Cui, Yawen and Su, Xinqi and Xie, Weichen and Shen, Linlin and Yu, Zitong},
title = { { TRRG: Towards Truthful Radiology Report Generation With Cross-modal Disease Clue Enhanced Large Language Models } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper proposes a two-stage strategy for radiology report generation. In the first stage, coarse alignment between images and report sentences is performed via a CLIP-style loss. In the second stage, a disease clue injection module and a cross-modal clue interaction module are trained to make the vision features focus on specific diseases.
Main contributions of the paper are:
- Truthful radiology report generation from radiographs. The paper claims the method achieves fine-grained alignment between images and text using the disease clue injection module.
- Extensive experiments and comparisons against previous methods demonstrate higher language generation quality and clinical effectiveness.
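The "CLIP-style loss" the reviewer refers to is a symmetric contrastive (InfoNCE) objective over paired image and report embeddings. The following is a minimal NumPy sketch of that general mechanism, not the authors' actual implementation; the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def clip_style_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/report embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each side forms a matched pair.
    """
    # L2-normalise so the dot product becomes a cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature        # (N, N) similarity matrix
    labels = np.arange(len(logits))           # diagonal entries are positives

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimising this loss pulls matched image/report pairs together and pushes mismatched pairs apart, which is the "coarse alignment" performed in the first stage.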
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Main strengths of the paper are:
- Novel methodology to guide image features with disease clues to generate truthful radiology reports
- Extensive experiments and comparisons against other methods show better performance
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major weaknesses:
- Disease clue prompts focus on the common diseases in the dataset, so it is not clear whether the method performs well at report generation for less common or rare diseases.
- It is not clear what the rationale behind the Cross-Modal Clue Interaction Module is. What is the loss trying to achieve there?
- The value of the Disease Clue Injection and Cross-Modal Clue Interaction modules is not clear from the ablation studies, since in most cases they improve accuracy only by 0.01-0.04 points.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
NA
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall the methodology presented in the paper provides some novelty in aligning image features with disease features.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
The author proposes TRRG, a stage-wise framework for radiology report generation that enhances vision-language alignment through contrastive pretraining, disease clue injection, and a cross-modal clue interaction module. These components collectively improve disease-aware representation and report quality, achieving state-of-the-art results on IU-Xray and MIMIC-CXR in both language generation and clinical efficacy metrics.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper leverages contrastive learning and prompt-based supervision in a novel manner specifically designed for the radiology report generation task.
- The stage-wise approach is intuitive, modular, and validated through extensive ablation studies. The clue injection and interaction modules are well-motivated and enhance fine-grained disease perception.
- The model is evaluated on two standard benchmarks with consistent improvements across a variety of metrics, especially on the large-scale MIMIC-CXR dataset.
- The work targets a high-impact clinical task—automated report generation—which can reduce workload and errors in clinical radiology settings.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While effective, many components (e.g., prompt engineering, contrastive learning, cross-attention fusion) adapt existing ideas rather than introducing fundamentally new mechanisms.
- The approach is evaluated only on chest X-rays, which limits insight into how well it generalizes to CT, MRI, or other modalities.
- While metrics like BLEU and CheXBERT-derived scores are provided, no expert-based clinical evaluation is presented to judge practical utility or fluency of generated reports.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The author proposes a well-engineered, effective method for a clinically significant task and validates it thoroughly on large public datasets. However, its novelty lies more in integration of ideas than in methodological innovation. The lack of generalization to other modalities, and potential overfitting to disease-centric prompts, slightly reduce its broader impact.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
This paper presents a framework to improve the quality of radiology reports using large language models (LLMs). LLMs are currently of wide interest; this work utilizes image data and "clues" (e.g., patient data) to generate a report. It uses a cross-modal method to fuse the visual and "clue" embeddings, and is demonstrated on the IU-Xray and MIMIC-CXR datasets.
The ablation studies suggest that both components contribute to the overall performance. While neither task is novel on its own, the combination appears potentially valuable and differentiates this work from many approaches.
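The cross-modal fusion described here is, at its core, cross-attention from visual tokens to a variable number of clue embeddings. Below is a minimal single-head NumPy sketch of that generic mechanism; the function names and shapes are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: visual tokens attend to clue embeddings.

    queries:     (Nv, D) visual token embeddings
    keys_values: (Nc, D) disease clue embeddings (Nc may vary per image)
    Returns (Nv, D) clue-conditioned visual features.
    """
    d = queries.shape[-1]
    attn = softmax(queries @ keys_values.T / np.sqrt(d), axis=-1)  # (Nv, Nc)
    return attn @ keys_values
```

Because the attention matrix is computed per input, the same module handles an arbitrary number of clue embeddings, which is what allows the interaction to be "multi-granular".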
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The work incorporates federated learning and disease characterization.
The cross-modal approach seems novel, and interesting.
This leverages zero-shot disease recognition rather than relying solely on paired image-report data: they perform random sampling from the report. The method appears to use information about potential diseases present, rather than a simple presence/absence signal.
The paper makes a reasonable comparison with the state of the art ("R2Gen [4], R2GenCMN [3], PPKED [16], R2GenGPT [31], FGIRG [2], and R2GMMN [22]"), and achieves good scores on the BLEU and ROUGE language metrics.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The clinical effectiveness evaluation relies on a single metric (CheXpert).
It is unclear whether this will generalize to other, more diverse datasets.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The cross-modality approach is very interesting, and, combined with the number of metrics and the ablation study, it demonstrates the potential feasibility of this approach. Considering that the goal of many (though not all) MICCAI works is to automatically generate clinical reports, this seems like a relevant step forward, combining multiple approaches in the field.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
N/A
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
All three reviewers suggest accepting this paper. The proposed method is reasonable (e.g., aligning the image features with disease features) and the comparison is convincing. However, I suggest the authors release the code after acceptance.