Abstract
Automated chest X-ray report generation has great potential to improve healthcare efficiency, but rigorous validation is essential for safe clinical adoption. Existing evaluation metrics focus mainly on report-level scores, failing to provide actionable insights for clinicians.
In this paper, we present SPEC-CXR (Safety-centered Performance Evaluation in Clinical Report for Chest X-Ray), an evaluation framework that integrates entity-level performance assessment with report-level error analysis using a large language model (LLM). In our approach, the LLM extracts and classifies entities—radiological findings and differential diagnoses—from both generated and reference reports based on a carefully curated entity set. Generated reports are then evaluated on entity presence, location, severity, and prior comparison, yielding structured outputs to calculate detailed entity-level scores (F1 for presence and accuracy for location, severity, and comparison).
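As a purely illustrative sketch (not the authors' released implementation), per-entity scores of this kind could be derived from such structured outputs roughly as follows; the record layout and field names here are assumptions:

```python
from collections import defaultdict

# Hypothetical structured records, one per (report, entity) pair, as an LLM
# evaluator might emit them: presence is a confusion-matrix label; the other
# attributes are judged correct/incorrect only when the entity is a true positive.
records = [
    {"entity": "pneumothorax", "presence": "TP", "location": True, "severity": True, "comparison": False},
    {"entity": "pneumothorax", "presence": "FN", "location": None, "severity": None, "comparison": None},
    {"entity": "pleural effusion", "presence": "FP", "location": None, "severity": None, "comparison": None},
]

def entity_level_scores(records):
    """Per-entity F1 for presence and accuracy for location/severity/comparison."""
    by_entity = defaultdict(list)
    for r in records:
        by_entity[r["entity"]].append(r)
    scores = {}
    for entity, rows in by_entity.items():
        tp = sum(r["presence"] == "TP" for r in rows)
        fp = sum(r["presence"] == "FP" for r in rows)
        fn = sum(r["presence"] == "FN" for r in rows)
        f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
        attrs = {}
        for attr in ("location", "severity", "comparison"):
            judged = [r[attr] for r in rows if r[attr] is not None]
            attrs[attr] = sum(judged) / len(judged) if judged else None
        scores[entity] = {"presence_f1": round(f1, 3), **attrs}
    return scores

print(entity_level_scores(records))
```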
Our entity-level evaluation shows 91.8% accuracy compared to human evaluation for presence detection and 0.777 Kendall’s tau-b correlation for report-level evaluation. Furthermore, our entity-level performance analysis uncovers critical limitations of current state-of-the-art report generation models across diverse entities, highlighting the urgent need for rigorous, safety-oriented evaluation metrics.
Our framework is publicly available and usable: https://github.com/lunit-io/spec-cxr.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3344_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/lunit-io/spec-cxr
Link to the Dataset(s)
N/A
BibTex
@InProceedings{LeeJun_SPECCXR_MICCAI2025,
author = { Lee, Jung Oh and Cho, Junwoo and Kim, Junha and Dillard, Laurent and Sonsbeek, Tom van and Setio, Arnaud A. A. and Lee, Hyeonsoo and Yoo, Donggeun and Kim, Taesoo},
title = { { SPEC-CXR: Advancing Clinical Safety through Entity-Level Performance Evaluation of Chest X-ray Report Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {594 -- 604}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper addresses a critical challenge in evaluating AI-generated chest X-ray (CXR) reports: the lack of clinically meaningful, entity-specific validation. Current metrics (e.g., BLEU, ROUGE) aggregate errors across entire reports, treating trivial formatting mistakes as equivalent to life-threatening diagnostic inaccuracies (e.g., misclassifying a malignant tumor as benign). This gap hinders safe clinical adoption, as existing methods fail to prioritize medical context or granular error analysis.
The authors propose a dual-level evaluation framework combining (1) report-level error analysis (holistic quality) and (2) entity-level assessment using Large Language Models (LLMs).
The authors’ idea behind using an LLM is to extract and classify clinical entities (findings/diagnoses) into structured JSON outputs, evaluating four attributes: presence (TP/FP/TN/FN), location, severity, and prior comparison.
The authors argued that this structured approach would standardize error detection and enhance reproducibility.
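To make the structured output described above concrete, here is a minimal sketch of the presence logic and of one per-entity record; the field names and schema are illustrative assumptions, not the paper's exact specification.

```python
import json

def presence_label(in_generated: bool, in_reference: bool) -> str:
    """Confusion-matrix label for a single entity in a generated/reference report pair."""
    if in_generated and in_reference:
        return "TP"
    if in_generated:
        return "FP"  # entity hallucinated in the generated report
    if in_reference:
        return "FN"  # entity missed by the generated report
    return "TN"      # entity correctly absent from both reports

# Hypothetical structured record for one entity (schema is illustrative only).
record = {
    "entity": "pleural effusion",
    "presence": presence_label(in_generated=True, in_reference=True),
    "location": "right",            # judged against the reference report
    "severity": "small",
    "prior_comparison": "increased",
}
print(json.dumps(record, indent=2))
```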
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper offers three main strengths:
- The framework’s use of predefined entity sets ensures consistent identification and classification of medical terms, minimizing the variability that often arises from human interpretation. This consistency is further reinforced through structured JSON outputs, which eliminate ambiguity and enforce a uniform format for evaluation. Such standardization is critical for reproducibility, enabling seamless integration with electronic health records and reducing errors in cross-institutional audits.
- Unlike traditional metrics like BLEU or ROUGE, which treat all errors equally, this framework breaks down inaccuracies into clinically meaningful categories. For instance, it distinguishes between high-stakes errors (e.g., misclassifying a “malignant lung nodule” as benign) and minor mistakes (e.g., a typo in measurement units). This granularity allows researchers to pinpoint weaknesses such as poor severity detection or location inaccuracies and to prioritize model improvements. By linking errors to specific clinical outcomes, the method bridges the gap between technical validation and real-world diagnostic safety.
- The framework’s automation transforms resource-heavy manual reviews into efficient, large-scale processes. This scalability is particularly impactful in clinical settings, where real-time validation of AI outputs ensures compliance with guidelines.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper does not integrate adaptive entity definitions, severity-weighted scoring, or clinician validation loops to align technical metrics with patient-care priorities. Its main limitations are:
- The reliance on predefined entity sets limits adaptability and inclusivity. For instance, emerging pathologies or rare conditions may be excluded from the predefined list, leading to incomplete error detection.
- The framework’s dependence on LLMs introduces risks of hallucinated outputs and biased evaluations. Training-data biases could further skew classifications, for example by underdiagnosing conditions prevalent in underrepresented populations. Moreover, error scoring treats all mistakes equally: a trivial typo can be weighted the same as a catastrophic misclassification, a scenario that masks critical risks in clinical settings.
- Finally, the framework evaluates entities in isolation and lacks contextual prioritization. For instance, a missed “pneumothorax” (requiring urgent intervention) is scored identically to a missed “mild atelectasis” (often incidental).
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
To strengthen the paper’s contribution and differentiation from existing work, consider the following revisions:
- Highlight how your framework’s structured, rule-based error detection (predefined entities + confusion matrices) differs from ER2Score’s (ER2Score: LLM-based Explainable and Customizable Metric for Assessing Radiology Reports with Reward-Control Loss) focus on explainability and customizable weighting.
- Compare with domain-specific metrics: note that while prior work assesses semantic similarity, the authors’ framework provides granular, attribute-specific error breakdowns (location, severity, etc.), which are critical for clinical feedback.
- Clarify the novelty of the contributions: combining report-level coherence with entity-level validation is not entirely new. The authors are invited to position the work as a pragmatic alternative to existing methods, optimized for specific clinical needs (e.g., compliance with reporting guidelines).
- Demonstrate that the proposed framework’s entity-level error rates correlate more strongly with clinician assessments than those of frameworks that aggregate scores. Showcase advantages in detecting severity/location errors, which generic similarity metrics might overlook.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper proposes a structured, entity-level framework to address the critical gap in evaluating AI-generated CXR reports, offering strengths like standardized JSON outputs, granular error categorization, and scalability. However, major weaknesses limit its readiness: (1) Overlap with existing works (e.g., ER2Score’s clinically meaningful metrics, RadGraphF1’s entity relations) without clear differentiation in novelty; (2) Rigid entity definitions and LLM biases/hallucinations risk misclassifying novel findings or critical errors (e.g., weighting malignant vs. benign misclassifications equally); (3) While the framework has potential, its incremental technical contribution and insufficient validation against clinical benchmarks (e.g., severity-weighted scoring, diverse datasets like RadEvalX) weaken its impact.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
After reconsideration, I acknowledge that the paper introduces a novel capability: entity-level, multi-attribute metrics that previous evaluation frameworks do not provide. This contribution advances the field of report-generation quality assessment by letting researchers quantify model performance on specific clinical concepts.
Although SPEC-CXR can already isolate errors by entity, it still lacks a weighting scheme that reflects clinical impact. For example, under the current scoring rules a missed tension pneumothorax counts exactly the same as a simple unit-spelling mistake. Introducing an importance-based layer would fix this mismatch, yet its absence does not detract from the framework’s core novelty and can reasonably be deferred to future work.
However, on LLM reliability, the authors’ reply is less persuasive: the reported 90% extraction accuracy for presence is encouraging, yet comparison performance remains weak and no bias analysis is provided. Strengthening those aspects would increase confidence in clinical use.
Review #2
- Please describe the contribution of the paper
- The authors propose a new evaluation framework named Safety-centered Performance Evaluation in Clinical Report for Chest X-Ray (SPEC-CXR) that integrates report-level error analysis with entity-level performance assessment using a large language model (LLM);
- The authors first present a comprehensive set of key entities including radiological findings and differential diagnoses, curated by expert radiologists;
- To enable more clinically relevant report-level evaluation, the authors define a structured output in JSON format based on the entity set above, and convert the original free-text reports into this structured form using large language models (LLMs);
- Based on the structured output described above, SPEC-CXR also enables better entity-level performance evaluation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The authors address a critical challenge: how to properly evaluate AI-generated medical reports. Traditional NLP-based metrics may overlook the clinical context, and entity-based metrics such as RadGraph (Jain et al), which do not use a predefined entity set, can result in variability in the extracted categories of entities across individual reports;
- To reduce these limitations of NLP-based metrics, the authors first present a radiologist-certified entity set and transform the free-text report into a structured form based on this entity set;
- With such an expert-curated entity set, the categories of entities are fixed, enabling better entity-level evaluation;
- According to Table 1, the proposed metrics align more closely with human evaluations compared to other metrics;
- According to Table 2, the presented entity set demonstrates better coverage of the MIMIC-CXR test set and aligns more closely with human evaluations, even though it does not have the largest number of categories compared to other proposed sets;
- The proposed metrics indeed show some limitations of current SOTA AI medical report generation methods.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The literature review in the introduction is a bit thin. Some related work on structured medical report generation / entity-based evaluation appears to be missing, such as Flexr (Keicher et al.) and Prior-RadGraphFormer (Xiong et al.), the latter of which specifically presents a fixed set of entity categories for entity-level evaluation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
For reproducibility of the paper, the anonymized link seems to be empty. It would be nice if the authors could consider citing a bit more relevant literature to strengthen the overall related works (introduction) section.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, the authors propose an effective evaluation metric for the critical task of properly assessing AI-generated medical reports. The experiments are robust, and the results are promising. However, the anonymized link appears to be empty, and some relevant related works seem to be missing.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors resolve my concerns.
Review #3
- Please describe the contribution of the paper
SPEC-CXR extends existing evaluation methodologies to produce entity-focused assessment of AI-generated radiology reports. The paper does a good job of introducing the limitations of other approaches to assess NLP generation, such as GREEN, RadGraph, and LLM-RadJudge. In contrast to these previous frameworks, SPEC-CXR evaluates specific clinical entities to provide insight into particular areas of strength or weakness in AI-generated reports. It also includes a more nuanced understanding of report accuracy by including attributes like presence, location, and severity.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The SPEC-CXR framework shows strong alignment with expert evaluations (Kendall’s tau-b), indicating effectiveness for clinical relevance.
- A comprehensive entity set is used, improving report coverage (81.9%) and correlation with human judgement.
- The framework uses Pydantic and JSON formatting to make sure the evaluations are interpretable and reproducible.
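Since the review mentions Pydantic and JSON formatting, here is a minimal sketch of how such a schema could be declared; the class and field names are assumptions for illustration, not the framework's actual models.

```python
from typing import Literal, Optional
from pydantic import BaseModel

class EntityAssessment(BaseModel):
    """One evaluated entity from a generated report (illustrative schema)."""
    entity: str
    presence: Literal["TP", "FP", "TN", "FN"]
    location_correct: Optional[bool] = None   # judged only for true-positive entities
    severity_correct: Optional[bool] = None
    comparison_correct: Optional[bool] = None

class ReportAssessment(BaseModel):
    study_id: str
    entities: list[EntityAssessment]

# Validating the LLM's JSON output against the schema rejects malformed fields.
raw = '{"study_id": "s1", "entities": [{"entity": "pneumothorax", "presence": "FN"}]}'
parsed = ReportAssessment.model_validate_json(raw)  # Pydantic v2 API
print(parsed.entities[0].presence)
```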
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The LLM evaluation accuracy for comparison attributes is low compared to other metrics like presence (75% vs 91%). It’s unclear whether or how model instruction could be improved.
- The generalizability of SPEC-CXR to other clinical settings could be better tested, i.e., for deployment in diverse real-world clinical environments (different report-writing styles, imaging protocols, etc.). See DOI: 10.1007/s00330-023-10235-9 for a review showing that 6% of deep learning studies in radiology performed external validation, and that models often exhibited degraded performance when applied to external data sets.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While the paper presents a well-motivated and technically sound framework for evaluating chest X-ray report generation, its validation is limited to only two datasets, making it unclear whether the approach is generalizable. Additionally, the reliance on LLMs without prospective clinical testing leaves uncertainty about real-world applicability and robustness across institutions.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We appreciate the reviewers’ constructive feedback. The reviewers find that using the predefined entity set makes the evaluation process consistent, reducing the variability that can be introduced by human interpretation (R1, R2). Also, creating a structured JSON output eliminates ambiguity and enhances reproducibility (R1, R3) while maintaining alignment with human evaluations (R2, R3). You can now find the code via the link in the abstract of our main paper. Before addressing individual comments, we would like to clarify that the main contribution of our work lies in providing entity-level evaluation, rather than report-level evaluation. While most prior methods rely on aggregated metrics or global error counts, they cannot directly support analyses such as “this model’s pneumothorax F1 score is 0.68”. SPEC-CXR enables fine-grained, per-entity assessment across multiple attributes, making it not only an evaluation metric but also a diagnostic tool for understanding failure modes of report generation models at the clinical-concept level.
(R1) Treats all mistakes equally: Our framework shifts focus from report-level assessments to granular entity-level evaluations, enabling users to pinpoint model weaknesses often overlooked in broader assessments. While report-level weights could be important, our primary contribution is demonstrating the ability to measure entity-level performance. Although applying importance-based weights to different entities in our scoring system would be straightforward, truly meaningful weighting would require consideration of clinical context (indication or reason for exams), which extends beyond our paper’s scope.
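As a hypothetical illustration of the importance-based weighting the rebuttal says would be straightforward to layer on top of the entity-level scores (the weights below are made up, not endorsed by the authors):

```python
# Hypothetical clinical-importance weights; real weights would require radiologist input
# and knowledge of the clinical context (indication, reason for exam).
ENTITY_WEIGHTS = {"pneumothorax": 3.0, "pleural effusion": 2.0, "atelectasis": 1.0}

def weighted_presence_f1(per_entity_f1, weights, default_weight=1.0):
    """Importance-weighted average of per-entity presence F1 scores."""
    total = sum(weights.get(e, default_weight) for e in per_entity_f1)
    return sum(f1 * weights.get(e, default_weight)
               for e, f1 in per_entity_f1.items()) / total

print(weighted_presence_f1({"pneumothorax": 0.68, "atelectasis": 0.85}, ENTITY_WEIGHTS))
```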
(R1) Entity set is static: We address the adaptability and inclusivity concerns about the predefined entity set through two key design features: (1) Our set includes “other” categories per anatomical region (e.g., “other mediastinal abnormality”) to capture diverse and rare findings not explicitly listed, ensuring they still contribute to our error detection framework. (2) Our architecture allows straightforward customization (e.g., adding “COVID-19 pneumonia” under “Infectious pulmonary disease”).
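A minimal sketch of how such a hierarchical, customizable entity set might be represented (the categories and members below are abbreviated assumptions, not the curated set from the paper):

```python
from typing import Optional

# Abbreviated, hypothetical entity hierarchy with per-region "other" buckets;
# the curated set in the paper is far larger and radiologist-certified.
ENTITY_SET = {
    "Infectious pulmonary disease": ["pneumonia", "tuberculosis", "COVID-19 pneumonia"],  # custom entry added
    "Mediastinum": ["cardiomegaly", "other mediastinal abnormality"],  # "other" bucket for rare findings
}

def category_of(entity: str) -> Optional[str]:
    """Return the anatomical/disease category an entity belongs to, if any."""
    for category, members in ENTITY_SET.items():
        if entity in members:
            return category
    return None

assert category_of("COVID-19 pneumonia") == "Infectious pulmonary disease"
```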
(R1) LLM hallucinations: We agree that hallucination is a problem not unique to LLMs but common to all machine learning models. Given the structured framework, we are not bound to a particular LLM, and the framework directly benefits from improvements made by the community.
(R1, R2) Lack of comparison with prior works and originality: While there is some conceptual overlap with existing methods that assess multiple entities in reports, our approach uniquely enables entity-level benchmarking by consolidating all diverse expressions of the same entity. This level of standardized, fine-grained evaluation is not addressed by existing alternatives. ER2Score provides categorized error counts, but it still cannot compute the detection performance of a specific chest abnormality (e.g., the F1 score for opacity), which is essential for real-world adoption of CAD systems in clinical scenarios.
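In the framework itself, this consolidation is performed by the LLM against the curated entity set; as a simplified analogy only, the idea resembles normalizing free-text mentions to canonical entities, e.g.:

```python
# Simplified analogy: a lookup table mapping report phrasings to canonical entities.
# The actual consolidation is done by the LLM, so these mappings are illustrative.
CANONICAL = {
    "collapsed lung": "pneumothorax",
    "pneumothoraces": "pneumothorax",
    "hazy opacification": "opacity",
    "increased opacity": "opacity",
}

def canonical_entity(mention: str) -> str:
    key = mention.lower().strip()
    return CANONICAL.get(key, key)

assert canonical_entity("Collapsed lung") == "pneumothorax"
```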
(R3) Accuracy on “comparison” evaluation is low: During manual review, we found that assessing comparison statements is more challenging due to linguistic ambiguity. For instance, when a report begins with “compared with 2024-06-23 CXR,” the LLM incorrectly flags all subsequent statements as comparisons, even those standing independently. The LLM also sometimes missed comparison-implying adjectives like “improved” or “worsened.” Nevertheless, we included these results because comparison represents a critical factor in radiology report evaluation literature. We believe sophisticated prompt tuning can mitigate the issue.
(R3) Limited validation: We agree that for adoption in diverse real-world clinical environments, SPEC-CXR requires broader validation and clinical studies to ensure generalizability. We leave such studies for future work and intend to clearly state this as a limitation in the revised manuscript.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers agree that the paper is well written and that the experiments are well structured, with promising results. The authors have also adequately addressed the reviewers’ concerns for acceptance.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A