Abstract
With the emergence of large-scale vision-language models (VLMs), it is now possible to produce realistic-looking radiology reports for chest X-ray images. However, their clinical translation has been hampered by factual errors and hallucinations in the generated descriptions at inference time. In this paper, we present a novel phrase-grounded fact-checking model (FC model) that detects errors in findings and their indicated locations in automatically generated chest radiology reports.
Specifically, we simulate report errors through a large synthetic dataset derived by perturbing findings and their locations in ground truth reports to form real and fake finding-location pairs with images. A new multi-label cross-modal contrastive regression network is then trained on this dataset. We present results demonstrating the robustness of our method in terms of finding veracity prediction accuracy and localization on multiple X-ray datasets. We also show its effectiveness for error detection in reports from SOTA report generators on multiple datasets, achieving a concordance correlation coefficient of 0.997 with ground truth-based verification, pointing to its utility during clinical inference in radiology workflows.
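As a toy illustration of the perturbation idea, a minimal Python sketch of forming a fake finding-location pair by swapping either the finding phrase or its stated location could look like the following (names such as `vocab_findings` and `vocab_locations` are hypothetical; the paper's actual perturbation rules are richer):

```python
import random

def make_fake_pair(finding, location, vocab_findings, vocab_locations):
    """Toy sketch: perturb a ground truth (finding, location) pair to create
    a fake sample with label E = 0; untouched pairs keep label E = 1."""
    if random.random() < 0.5:
        finding = random.choice([f for f in vocab_findings if f != finding])
    else:
        location = random.choice([l for l in vocab_locations if l != location])
    return finding, location, 0   # E = 0 marks the pair as fake
```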
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3526_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{MahRaz_Phrasegrounded_MICCAI2025,
author = { Mahmood, Razi and Machado-Reyes, Diego and Wu, Joy and Kaviani, Parisa and Wong, Ken C.L. and D’Souza, Niharika and Kalra, Mannudeep and Wang, Ge and Yan, Pingkun and Syeda-Mahmood, Tanveer},
title = { { Phrase-grounded Fact-checking for Automatically Generated Chest X-ray Reports } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {444--454}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a novel method for fact-checking radiology reports of chest X-rays. The authors construct a large-scale synthetic dataset of phrase-level findings and their corresponding locations, and develop a multi-label contrastive regression model that verifies the factual consistency of report statements with the associated image.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper tackles an important and relatively under-explored problem in medical vision—fact-checking automatically generated radiology reports. By formulating it as a phrase-level image-text consistency task, the work addresses a key challenge in improving the clinical trustworthiness of report generation systems.
- The proposed method shows strong and consistent performance across multiple datasets. In addition, the introduction of FCScore as an automatic evaluation metric is a valuable contribution, offering a scalable alternative to human judgment for assessing factual consistency.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Figure 1 aims to illustrate the overall pipeline, but the presentation is cluttered and difficult to follow. The naming of data flows as “training” and “inference” is particularly confusing, as they refer to data construction steps rather than actual model usage. Moreover, the diagram lacks visual clarity and logical structure, making it hard to grasp the sequential flow of data and the role of each component.
- The overall organization and presentation of the paper could be significantly improved. Several sections feel disjointed, and inconsistent formatting and dense layout reduce readability and clarity.
- While the proposed task formulation is meaningful, the technical novelty of the method itself is limited. The model largely relies on standard contrastive learning and regression techniques, with few algorithmic innovations beyond dataset construction.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(2) Reject — should be rejected, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper addresses an important problem and shows promising results, issues with clarity, presentation, and limited methodological novelty reduce its overall impact.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
This is a task-driven and well-designed study with modest technical novelty, but it offers practical value in medical image fact-checking and is reasonably suitable for acceptance.
Review #2
- Please describe the contribution of the paper
The authors propose a post-training fact checking method for generated radiology reports. Their method does not only identify errors but also grounds (localizes) them within the image.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The study addresses a topic of importance for deploying AI in real clinical settings.
- It utilizes multiple datasets and models:
- Training and evaluation are conducted using multiple datasets.
- Measurement of error detection across various state-of-the-art report generators.
- Detailed and plausible evaluation of individual components.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The writing of the paper and graphics require refinement. In particular, the main figure (Fig. 1) and the results section are challenging to read and follow. The order of supporting graphics and tables does not align with how the results are presented, hindering the clarity and coherence of the results section. Additionally:
- Some expressions lack a prior introduction, such as:
- Table 2, variable E.
- Page 4, line 3, variable $F_{iReal}$.
- Mathematical notation could be used more carefully.
- Curly brackets are typically used to represent sets, not lists.
- $D_i \in D = <I_i, R_i>$ could be rearranged for clarity as $D_i = <I_i, R_i> \in D$.
- There are some errors for correction.
- Table 2, the column header “Label <xy, w, h, E>” should be instead “Label <x, y, w, h, E>”.
- The dataset name is Chest ImaGenome, not ChestImagenome.
- The sentence “Let $z_i$ be the vision projection encoder output, and let $z_{f_{ij}}$ for each sample (…) are the real and fake labels per sample.” may be restructured for better readability, and the second verb should be in the infinitive.
- Presence of strong claims that might be revised. For example, the statement “Randomly drawing from this set ensures that a synthetic location generated for Fj is a valid location for some image in the dataset” cannot be guaranteed unless the dataset is registered or the approach accounts for the need for alignment in some manner.
- The results are based on a single metric, and their analysis and/or discussion could be more extensive. For example, comparisons to common clinical efficacy metrics are not included and would be interesting.
- While existing semantic metrics (clinical efficacy metrics) do not help with localizing errors, it would still be interesting to check how well the judgements of the proposed method (i.e., whether the generated report is factually correct) align with existing metrics.
- Metrics to include: GREEN[1], CheXbert-F1[2], RadGraph-F1 [3], RadCliQ [4]
- These metrics should at least be cited and discussed in the paper
- A quantitative or qualitative comparison would be even better.
- Please rate the clarity and organization of this paper
Poor
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
References
[1] Ostmeier, Sophie, et al. “Green: Generative radiology report evaluation and error notation.” arXiv preprint arXiv:2405.03595 (2024).
[2] Smit, Akshay, et al. “CheXbert: combining automatic labelers and expert annotations for accurate radiology report labeling using BERT.” arXiv preprint arXiv:2004.09167 (2020).
[3] Delbrouck, Jean-Benoit, et al. “Improving the factual correctness of radiology report generation with semantic rewards.” arXiv preprint arXiv:2210.12186 (2022).
[4] Yu, Feiyang, et al. “Evaluating progress in automatic chest x-ray radiology report generation.” Patterns 4.9 (2023).
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the method is well-motivated and would be very valuable for the field, the paper in its current form has many flaws and requires a thorough rewrite. Authors should address the flaws in the writing and presentation. Also, while not absolutely mandatory, we recommend adding more comparisons to existing metrics. However, we still see a lot of value in the method and, in the case of rejection, encourage the authors to submit an updated version to another conference or journal.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The method is well-motivated and would be very valuable for the field. While the paper in its current form is quite confusing (esp. Fig. 1), the authors promised to clarify details and improve the figure. Therefore, I would vote for acceptance, but want to emphasize the importance of improving Fig. 1 and the overall writing of the paper. Please give the paper a thorough proof-read.
Review #3
- Please describe the contribution of the paper
The paper presents a novel approach to fact-checking automatically generated chest X-ray reports by introducing a phrase-grounded fact-checking model (FC model) that evaluates both the correctness of clinical findings and their stated anatomical locations. To train this model, the authors construct a large-scale synthetic dataset comprising over 27 million image-report pairs, generated by systematically perturbing ground truth findings to simulate realistic report errors. The FC model uses a multi-label contrastive regression framework to jointly learn veracity classification and location prediction, enabling it to detect and localize incorrect statements in reports with high accuracy. Through extensive experiments across multiple datasets, the paper demonstrates that the model achieves strong performance in both tasks and correlates closely with ground truth-based evaluations, highlighting its potential as a reliable tool for clinical error detection during inference.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Addresses a real clinical need: The paper tackles an important and timely problem: factual errors and hallucinations in automated chest X-ray reports generated by vision-language models (VLMs). These errors limit the models’ use in clinical settings. The authors offer a solution that works at inference time, which is critical for deployment in real-world clinical workflows.
- Novel problem formulation with grounding: The method operates at a phrase level, checking each finding and its location. This finer granularity makes the model more precise and informative than earlier methods.
- Large, carefully designed synthetic dataset: A major strength is the construction of a 27-million-sample synthetic dataset by perturbing real reports. This lets the model learn the difference between true and false statements in a controlled way. The dataset is also promised to be open-sourced.
- Methodologically sound architecture: The authors propose a multi-label contrastive regression model that is trained using both supervised contrastive learning and regression objectives. It jointly predicts: 1) whether a finding is real or fake and 2) where in the image the finding appears. The model is well-integrated and uses CLIP-like encoders fine-tuned for the task.
- Strong experimental results with SOTA Comparisons: The model achieves high accuracy in real/fake classification and higher IOU for phrase grounding compared to other models. It outperforms both real/fake classifiers and pure grounding models across four datasets.
- High concordance with ground truth: The FC model’s predictions closely match evaluations based on actual ground truth (concordance correlation coefficient of 0.997), suggesting it could be a reliable surrogate when ground truth isn’t available.
- Thorough ablation studies: The paper evaluates different model variants and shows that their full proposed version (with contrastive + regression loss) performs best, which gives credibility to their design choices.
- Clear visualizations and examples: Sample outputs show how the model identifies incorrect findings and points out their incorrect locations. These visualizations help explain the model’s behavior clearly.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Dependence on preprocessing accuracy: The method heavily relies on preprocessing components like finding and location extractors from previous work. Although their accuracies are reported, the FC model’s performance ultimately inherits any errors or biases from these upstream steps. This dependency is not deeply analyzed or discussed in the paper.
- Synthetic training data may limit generalization: The model is trained solely on synthetically generated errors, rather than on real-world errors from clinical settings. While synthetic data allows for controlled supervision, it is unclear how well the model would generalize to the full range of naturally occurring mistakes in real AI-generated reports. The authors mention this in passing but do not explore it experimentally.
- Missing human evaluation or radiologist feedback: The paper does not include any qualitative user study or radiologist assessment to evaluate whether the flagged errors are clinically meaningful. While they are trying to reduce dependency on humans, it is still especially important since clinical decision-making is context-sensitive.
- All errors are treated equally in the current formulation. There is no discussion on whether some errors are more critical than others. A prioritization of clinically important errors could enhance the practical usefulness of the tool.
- Although the model is efficient enough for research, the paper gives no information about inference speed, memory usage, or integration into real-time systems. In clinical settings, where turnaround time is critical, this could be a concern.
- While Figure 1 is generally clear and informative, some changes are needed. Please consider adding a legend or labeling for the red arrows to clarify their purpose, and update the figure caption accordingly to improve the figure’s standalone interpretability. This will help readers better understand the workflow at a glance without having to refer back to the main text. Additionally, please consider clarifying T_{ij}, N_{ij}, and C_{ij} in Equation 1.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(6) Strong Accept — must be accepted due to excellence
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Recommendation: Strong Accept
This paper presents a well-motivated and technically sound approach to phrase-grounded fact-checking for chest X-ray reports, addressing a timely and clinically relevant problem with a novel model and large-scale synthetic dataset. The method is thoughtfully designed, shows strong performance across multiple datasets, and has potential for real-world impact. While there are a few limitations, these do not detract significantly from the overall quality. I recommend a strong accept and encourage the authors to consider the identified weaknesses as directions for future work.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I initially accepted it with minor comments. No change in decision.
Author Feedback
We thank the reviewers, who all recognize the clinical significance and novelty of our method. Responses: (R1,2,3): The dataset is already public on Hugging Face (link withheld due to anonymity).
(R1,2,3): Figure 1 will be re-drawn and explained in the final version as follows. The RED paths show the training workflow, which involves (a) finding localization, (b) synthetic data generation, and (c) FC model training. Step (a) extracts anatomical locations (L) from ground truth images (I) and findings (F) from their reports, and collates them to generate bounding boxes <x,y,w,h> for findings. In step (b), synthetic perturbations are applied to generate real/fake pairs <F,I,x,y,w,h,E>, where E is the real/fake label. In step (c), the FC model is trained using <F,I> as input and <x,y,w,h,E> as output. The GREEN paths indicate inference, with the FC model taking findings (Fa) extracted from automated reports and their image I as input, and predicting the output location <Lp,Ep>=<xp,yp,wp,hp,Ep>, where Lp=<xp,yp,wp,hp> is the bounding box and Ep is the predicted real/fake label. The ORANGE paths indicate the evaluation workflow, in which the findings in the automated report are localized as <Fa,La> using step (a) above and compared to the predicted findings-location <Fp,Lp> (Fp=Fa if Ep=1) using Equation 4.
R1 (FC model novelty): We extend contrastive learning to a new formulation that is fully SUPERVISED and simultaneously applicable to CROSS-MODAL and MULTI-LABEL settings, as opposed to CLIP, which is multimodal but not multi-label (strictly diagonal similarity matrix). We also differ from supervised contrastive learning, which is unimodal and single-label, and from standard contrastive learning, which is self-supervised/unlabeled and unimodal. While Equation 2 may look similar to a contrastive loss function, the difference between these formulations lies in the terms over which the summation occurs and in the normalization factors. Beyond the novel contrastive learning formulation, we also pose error correction in a regression framework, as reflected in the novel combined loss function in Equation 3. All of these, together with the novel training data used, contribute to the novelty of our FC model.
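To illustrate the distinction described above, here is a minimal, hypothetical PyTorch-style sketch of a supervised, cross-modal, multi-label contrastive term in which positives are defined by shared finding labels rather than only the diagonal. It is an illustration under assumed inputs (multi-hot labels, projected embeddings), not the paper's exact Equation 2 or 3.

```python
import torch
import torch.nn.functional as F

def cross_modal_supervised_contrastive(z_img, z_txt, labels, tau=0.07):
    """Illustrative sketch only (not the paper's exact Equation 2).
    z_img, z_txt: (N, d) projected image/text embeddings; labels: (N, K) multi-hot."""
    z_img = F.normalize(z_img, dim=1)
    z_txt = F.normalize(z_txt, dim=1)
    sim = z_img @ z_txt.t() / tau                          # (N, N) cross-modal similarities
    # Positives: image-text pairs sharing at least one label (multi-label supervision),
    # unlike CLIP's strictly diagonal positives.
    pos_mask = (labels.float() @ labels.float().t() > 0).float()
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-likelihood over each anchor's positives, then over anchors.
    loss = -(pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return loss.mean()

# A combined objective in the spirit of Equation 3 could then add a bounding-box
# regression term, e.g. (hypothetical weighting `lam`):
#   total = cross_modal_supervised_contrastive(z_img, z_txt, labels) \
#           + lam * F.smooth_l1_loss(bbox_pred, bbox_true)
```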
R2 (single metric used): The paper actually reports three metrics: accuracy, mIoU (Table 4), and FC score (Table 5). To capture the relative difference of the error metric between automated and ground truth (A,G) versus automated and FC-model-predicted (A,P), we ran six different scoring metrics (GREEN, RadGraph F1, RadCliQ, CheXbert, BLEU, and FC score) against 4 datasets and 6 reporting models, resulting in over 144 measurements (6 x 4 x 6). To avoid taking the focus away from the main message of Table 5, which indicated that the FC model could serve as a surrogate ground truth at inference, and given the space limitations of the paper, we reported results with the FC score, which was shown [Ref 11] to have higher sensitivity to errors. The trends were stable for all error measures, indicating small absolute differences between (A,G) and (A,P), and did not alter the conclusions. Specifically, the concordance correlation coefficients were 0.998 for GREEN (as it attends to location as well), 0.999 for RadGraph F1, and 0.998 for CheXbert. Hence we chose the FC score as a conservative estimate of our FC model's surrogate ground truth performance. This discussion and the references will be added to the paper; the arXiv version will have all results.
R2 (strong claims): Since the bounding boxes are in normalized coordinates relative to the image, and we pick among the valid finding locations, our claim is that the synthetic location picked will be valid for some image in the dataset.
R3: The FC model was applied to real-world AI-generated reports, as shown in Table 5. The model can address 72 clinically significant findings and their identity/location errors. The FC model has 151,934,726 parameters, which easily fits on a single GPU server or even in CPU RAM, with inference under 1 second.
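For reference, the concordance correlation coefficients quoted above (0.997-0.999) measure agreement between FC-model-based and ground truth-based scores; assuming Lin's standard definition is the one used, a minimal NumPy sketch of computing it for two score series is:

```python
import numpy as np

def concordance_correlation_coefficient(x, y):
    """Lin's concordance correlation coefficient between two score series,
    e.g. ground truth-based scores vs. FC-model-based scores."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()                   # population variances
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)
```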
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers agree that the method is well-motivated and that the paper addresses an important problem. All reviewers also note that the paper in its current form is somewhat difficult to follow. The authors have adequately addressed the reviewers' concerns, and all reviewers are happy to recommend acceptance. If the paper is accepted, we recommend the authors follow the reviewers' suggestions closely to improve the manuscript.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
A timely and practically impactful study that, despite modest technical novelty and presentation flaws, convincingly advances phrase-level fact-checking in radiology reports.