Abstract
Medical image interpretation and report generation are essential for physicians to identify diseases and make assessments. Major efforts in image-to-report generation require heavy language model training, which still suffers from producing reports with factual errors. In this study, we present RadAlign to demonstrate that a concept-based vision-language model can improve both predictive accuracy and report factual correctness without extensive language model training. Our key innovation is aligning visual features with medical diagnostic criteria in a shared representation space. Such alignment introduces core knowledge supervision and creates interpretable intermediate diagnosis results for LLMs to refine report generation. We also propose a novel cross-modal retrieval mechanism to provide additional clinical context of history cases for enhancing report generation accuracy. This unified approach achieves superior disease classification on MIMIC-CXR (average AUC: 0.885) and enables accurate report generation (GREEN score: 0.678 vs. SOTA: 0.634). RadAlign also demonstrates exceptional generalization capabilities, outperforming SOTA foundation and specialized models on the external OpenI dataset (AUC: 0.923 vs. 0.836). Code is available at https://github.com/difeigu/RadAlign
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3308_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/difeigu/RadAlign
Link to the Dataset(s)
MIMIC-CXR dataset: https://physionet.org/content/mimic-cxr/2.1.0/
OpenI (Indiana) dataset: https://openi.nlm.nih.gov/
BibTex
@InProceedings{GuDif_RadAlign_MICCAI2025,
author = { Gu, Difei and Gao, Yunhe and Zhou, Yang and Zhou, Mu and Metaxas, Dimitris},
title = { { RadAlign: Advancing Radiology Report Generation with Vision-Language Concept Alignment } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15966},
month = {September},
pages = {486--496}
}
Reviews
Review #1
- Please describe the contribution of the paper
The paper's main contributions are: (1) a complex system that aligns visual and textual features, and (2) a cross-modal retrieval-augmented generation mechanism that further enhances the quality and reliability of the generated reports.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper has three major strengths: 1) it proposes a clinically inspired, domain-knowledge empowered framework that aligns visual and textual features for joint classification and report generation; 2) it introduces a cross-modal retrieval-augmented generation approach that enhances report quality and consistency by referencing similar historical cases; 3) the effectiveness of these methods is demonstrated through strong results in both the classification task and the report generation task.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper has a few notable weaknesses, primarily related to the experiments:
1) The influence of individual components, especially the cross-modal retrieval-augmented generation system, on report quality is not thoroughly investigated, making it difficult to assess the contribution of this part to the overall performance.
2) The generalizability experiment is not well designed, as only one baseline model (BioViL) is pretrained on the same MIMIC-CXR dataset as the proposed model, which could explain the low F1 scores (<0.1) when compared to BioViL, limiting the evaluation's fairness.
3) In Table 2, the proposed method's results (e.g., AUC in column CM and F1 in column CD) are bolded, but they are not the best compared to other models presented in the table, raising concerns about the significance of the improvements.
4) The novelty of the metric used (GREEN) in the report generation comparison is unclear, as it is not peer-reviewed, which raises concerns about its reliability and fairness, making it advisable to include traditional metrics like BLEU or ROUGE for a more comprehensive evaluation.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
My recommendation is based on a careful consideration of both the strong contributions and notable weaknesses of the paper. The major strengths include the novel and clinically relevant framework that aligns visual and textual features, the innovative cross-modal retrieval-augmented generation approach, and the strong performance demonstrated across benchmarks in both classification and report generation tasks. These aspects indicate significant potential for real-world clinical applications, which is a clear strength of the paper.
However, the paper also has notable weaknesses, particularly in the experimental design. The lack of thorough investigation into the contribution of individual components, the limited baseline comparison, and the questionable novelty of the evaluation metrics raise concerns about the robustness and fairness of the experiments. These issues diminish the overall reliability of the findings and make it difficult to fully assess the impact of the proposed methods.
Given these factors, my overall score reflects a balance between the paper’s strong, clinically motivated contributions and its experimental limitations, suggesting that while the work is promising, there is room for improvement in the evaluation and validation of the proposed methods.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I initially gave a weak reject due to concerns about the experimental design and evaluation clarity. However, after reviewing the rebuttal, I now recommend acceptance. The authors provided helpful clarifications regarding the generalizability experiment, including the rationale for comparing against both zero-shot and fine-tuned baselines. The paper as a whole presents a promising approach to the MICCAI community. I believe it offers sufficient value to warrant acceptance.
Review #2
- Please describe the contribution of the paper
The authors present RadAlign, a concept-based vision-language model capable of both predicting diseases and generating radiology reports from X-ray images. The main contribution is the alignment of visual features with medical diagnostic criteria in a shared representation space. This alignment reduces hallucinations in report generation and yields interpretable classification results.
The framework consists of three main components: (i) a pre-trained large language model (LLM) extracts diagnostic criteria and associated concept tokens from historical radiology reports; (ii) a classification model, based on a pre-trained vision-language model (ViL), predicts disease classes from input X-ray images using the criteria generated in (i); and (iii) a report generation module, where the outputs from (i) and (ii), along with historical reports containing similar concept tokens as those predicted for the query image, are used as prompts for a pre-trained LLM to produce a radiology report.
The proposed method compares very favorably to state-of-the-art approaches and demonstrates strong generalization capabilities.
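For orientation, the three components described in this answer can be read as a simple inference flow. The sketch below is a hypothetical, heavily simplified rendering of that flow in Python; all function names, data structures, and placeholder values are illustrative assumptions and not the authors' actual implementation or API:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CaseRecord:
    concept_scores: List[float]   # concept-similarity profile of a past case
    report: str                   # the corresponding historical report text

def extract_criteria(disease_names: List[str]) -> Dict[str, List[str]]:
    # (i) Stand-in for prompting a pre-trained LLM to list diagnostic
    # criteria / concept phrases per disease (placeholder strings here).
    return {d: [f"{d} criterion {i}" for i in range(3)] for d in disease_names}

def score_concepts(image, criteria: Dict[str, List[str]]) -> Dict[str, List[float]]:
    # (ii) Stand-in for the vision-language model: encode the image into
    # visual concept tokens and score their similarity to each criterion.
    return {d: [0.5 for _ in crits] for d, crits in criteria.items()}

def retrieve_similar_cases(concept_scores, case_bank: List[CaseRecord], k: int = 2) -> List[str]:
    # (iii, retrieval) Rank historical cases by closeness of their concept
    # profiles to the query's profile and return the top-k reports.
    query = [s for scores in concept_scores.values() for s in scores]
    def closeness(case: CaseRecord) -> float:
        return -sum((q - c) ** 2 for q, c in zip(query, case.concept_scores))
    ranked = sorted(case_bank, key=closeness, reverse=True)
    return [case.report for case in ranked[:k]]

def generate_report(predicted, criteria, retrieved) -> str:
    # (iii, generation) Stand-in for prompting a pre-trained LLM with the
    # predicted diseases, their criteria, and the retrieved reports.
    return (f"Predicted: {predicted}\nCriteria: {criteria}\n"
            f"Similar past reports: {retrieved}\n[LLM-drafted report here]")

# Toy end-to-end run with dummy data.
diseases = ["Cardiomegaly", "Pleural Effusion"]
criteria = extract_criteria(diseases)
scores = score_concepts(image=None, criteria=criteria)
predicted = [d for d, s in scores.items() if max(s) > 0.4]
bank = [CaseRecord([0.4] * 6, "Historical report A"),
        CaseRecord([0.9] * 6, "Historical report B")]
print(generate_report(predicted, criteria, retrieve_similar_cases(scores, bank)))
```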
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A creative and novel idea with a highly appealing framework that enforces alignment between visual features and medical diagnostic criteria. It enhances explainability and mitigates hallucinations, and is likely to generalize well to other applications. The results are impressive and competitive with state-of-the-art baselines. The method will likely be highly valuable to the community, especially if code and trained/prompted models are made publicly available.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper lacks ablation studies. Given the complexity of the proposed method, the authors should evaluate the contribution of each component to the overall performance. For example, how does the inclusion of "diagnostic criteria and associated concepts" affect classification accuracy compared to a standard vision-language model? How much do the "retrieved reports" contribute to the quality of the generated reports? The included "Ablation studies" in Section 4.2 appear to focus more on tuning two hyperparameters than on analyzing the importance of different components.
- The authors should also explain how their method is similar to and different from the compared methods, especially ChatCAD, which supports both classification and report generation. It is currently unclear what is novel in the proposed method compared to prior research. A clearer articulation of the similarities, differences, and specific contributions would strengthen the paper.
- It is also unclear from the text or from Fig. 1 whether the predicted visual concept tokens are fed into the report generation module. Ideally, they should be; otherwise, the report generator appears to rely only on the final class, general diagnostic criteria, and retrieved historical reports. That setup seems likely to hallucinate symptoms, as it only has access to the final class and general knowledge about how diseases affect different anatomical regions. If the predicted visual concept tokens are included, that would mitigate this risk, but if so, this needs to be clearly stated in the text and reflected in Fig. 1.
- Finally, the following conclusion from Section 4.2 is not well supported: "This differential scaling highlights how RadAlign's unified vision-language alignment effectively leverages enhanced LLM reasoning capabilities based on recognized medical concepts, while ChatCAD's multi-model pipeline lacks alignment, introducing inconsistencies that limit the benefits of more powerful LLMs." It is unclear how the authors can conclude that it is ChatCAD's (claimed) lack of alignment that limits the effectiveness of stronger LLMs. Additional evidence or justification is needed to support this assertion.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- Equations 2, 3, and 6: The similarity function sim is inconsistently formatted, appearing in both italic and upright fonts. This should be standardized for clarity.
- Table 2b: It would be helpful if the authors could comment on the compared method PCAM's relatively strong generalization on the external dataset, despite its weaker performance on the MIMIC-CXR dataset.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The framework achieves very impressive results compared to state-of-the-art methods and shows strong generalization capabilities. To me, the idea appears both novel and creative. I also believe the method is highly generalizable, and that both the core concept and the accompanying code and trained/prompted models could be very valuable to the research community.
What holds me back from a stronger recommendation is a lack of clarity regarding the model architecture, specifically, what inputs are fed to the report generator (as discussed under weaknesses). Additionally, the paper does not clearly articulate its main contributions compared to previous state-of-the-art approaches, particularly the method ChatCAD. Finally, each component of the framework should be thoroughly evaluated in an ablation study to better understand their individual contributions to the overall performance.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed most of my main concerns in the rebuttal. They clarified that visual concept tokens are not used as prompts to the LLM, which resolves my concern regarding input flow. They also provided component-wise performance contributions, which addresses my earlier point about the lack of a proper ablation study. The comparison with ChatCAD is now better motivated, though some claims, particularly around model misalignment and scaling, still feel somewhat speculative. That said, I would like to emphasize that it is not entirely clear from the rebuttal whether the authors intend to revise the manuscript to reflect these explanations. I strongly encourage the authors to incorporate these points directly into the camera-ready version to improve transparency and clarity for future readers. Based on the methodological strength and novelty, I am recommending acceptance.
Review #3
- Please describe the contribution of the paper
The paper introduces a cross-modal retrieval-augmented generation system that enhances report reliability by grounding predictions in similar historical cases. The key contributions include:
- A unified framework that bridges the gap between classification accuracy and detailed report generation through vision-language concept alignment
- A novel approach to medical report generation that mirrors a radiologist’s actual workflow, combining visual feature recognition with LLM-based reasoning
- Superior performance in both classification and report generation benchmarks
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Clinically-Inspired Architecture: RadAlign uniquely mimics the radiologist’s workflow (diagnosis using criteria -> reporting), making it more intuitive and potentially trustworthy than black-box models. It uses structured diagnostic criteria for concept-based diagnosis.
- Advanced Vision-Language Alignment: It employs fine-grained concept mapping, aligning learnable visual tokens with specific medical diagnostic criteria using a specialized contrastive loss, significantly outperforming general vision-language models (a schematic sketch of such an objective appears after this list).
- Interpretable Intermediate Representations: The model generates attention maps linked to visual concepts, highlighting relevant anatomical regions for specific conditions, thus providing transparency into its reasoning process and increasing clinical trust.
- Novel RAG Approach: It uses an innovative retrieval-augmented generation method, retrieving similar past cases based on visual concepts to ground report generation in clinical examples, reduce hallucinations, and provide evidence-based context.
- Strong Empirical Results: RadAlign demonstrates superior performance over state-of-the-art methods in both classification (AUC) and report generation (GREEN score) across datasets like MIMIC-CXR and OpenI, showing particularly strong generalization ability.
- Multiple LLM Compatibility: The framework is flexible and robust, proven compatible with various large language models (like ChatGPT, Claude, Llama), allowing it to leverage future LLM advancements.
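As noted in the "Advanced Vision-Language Alignment" point above, the core idea is contrasting learnable visual concept tokens against text embeddings of diagnostic criteria. Below is a minimal sketch of such an objective, assuming a PyTorch, InfoNCE-style formulation; the dimensions, temperature, and symmetric loss form are illustrative assumptions, not the paper's exact loss:

```python
import torch
import torch.nn.functional as F

def concept_alignment_loss(visual_tokens, criterion_embeds, temperature=0.07):
    """InfoNCE-style loss: each visual concept token is pulled toward its
    paired diagnostic-criterion embedding; other pairs act as negatives."""
    v = F.normalize(visual_tokens, dim=-1)       # (num_concepts, dim)
    t = F.normalize(criterion_embeds, dim=-1)    # (num_concepts, dim)
    logits = v @ t.T / temperature               # pairwise cosine similarities
    targets = torch.arange(v.size(0))            # i-th token matches i-th criterion
    # Symmetric loss over both matching directions (visual->text, text->visual).
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

# Toy example: 12 learnable concept tokens in a 256-dim shared space.
visual_tokens = torch.randn(12, 256, requires_grad=True)
criterion_embeds = torch.randn(12, 256)
loss = concept_alignment_loss(visual_tokens, criterion_embeds)
loss.backward()
print(float(loss))
```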
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Dependency on Expert Knowledge: RadAlign relies heavily on expert-provided diagnostic criteria, making it difficult to scale to new medical domains or imaging modalities without significant expert input. Alternative approaches require less expert knowledge engineering.
- Potential Bias in Diagnostic Criteria: The paper doesn't address potential biases within the extracted diagnostic criteria, which could stem from evolving medical knowledge, varying radiology practices, or biases present in the training data, affecting performance across different patient demographics.
- Computational Complexity Considerations: The paper lacks a thorough analysis of computational requirements (training time, inference latency, memory usage) compared to other approaches, hindering assessment of its practicality in resource-constrained clinical settings.
- Limited Clinical Evaluation: The paper lacks evaluation by practicing radiologists to assess clinical utility and report quality beyond automated metrics.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Despite its identified weaknesses, the paper is strongly recommended for acceptance because:
- Novel Clinical-Inspired Methodology: RadAlign mirrors the diagnostic process of radiologists, addressing a critical gap in current AI approaches for medical report generation.
- Superior Empirical Performance: It achieves state-of-the-art results in classification and report generation, demonstrating meaningful improvements.
- Interpretability: The model’s concept-based approach offers interpretable intermediate results, promoting clinical trust and adoption.
- Hallucination Mitigation: The retrieval-augmented generation effectively tackles hallucination, a serious issue in medical report generation.
- Real-World Potential: It shows strong generalization to external datasets and compatibility with various LLMs, suggesting robust real-world applicability.
While limitations like disease scope and dependency on expert knowledge exist, they are outweighed by the benefits of interpretability, clinical alignment, and hallucination mitigation. The paper significantly contributes to bridging the gap between predictive accuracy and reliable report generation in medical AI, influencing radiology report generation and medical image analysis more broadly.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their valuable comments. Below we respond to the major points regarding model efficiency, evaluation, and differences from prior methods.
On Approach: Text-based prompting is a key strength of our method, significantly reducing hallucinations through the fine-grained diagnostic criteria. We explored using visual concept tokens as prompts to the LLM (R1 q3) and found this ineffective, since the LLM cannot properly interpret these tokens without prior fine-tuning. Unlike standard fine-tuning approaches, our design is adaptable across domains (such as brain imaging) by leveraging the LLM's inherent concept generation capabilities (Sec 3.1) (R1 q2, R2 q1). RadAlign strongly differs from comparable methods (R1 q2) by introducing a single lightweight model integrating three key components: identifying fine-grained concepts, applying diagnostic criteria, and retrieving similar reports, all within a unified framework.
On Evaluation Metrics: We use the GREEN score (published at EMNLP 2024) because it is purpose-built to evaluate clinically relevant errors like a human verifier (Section 4.2) (R2 q4). Both BLEU and ROUGE-L are weak at discerning semantic differences. Given two statements, "pleural effusion is present" and "pleural effusion not present", where the first is the ground truth, BLEU/ROUGE-L assign both a score of 0.75/0.57, whereas the GREEN score assigns 1 to the first and 0 to the second. The GREEN score therefore aligns with our objective of evaluating correct semantic content rather than surface-level text similarity (R3 q4).
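For concreteness, here is a minimal sketch of a BLEU-1-style unigram-precision computation on these two statements; it illustrates the surface-overlap behavior and is not the exact BLEU/ROUGE-L implementation used for the reported numbers:

```python
def unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also appear in the reference
    (a BLEU-1-style surface-overlap score, without brevity penalty)."""
    cand, ref = candidate.split(), reference.split()
    ref_counts = {}
    for tok in ref:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    matches = 0
    for tok in cand:
        if ref_counts.get(tok, 0) > 0:
            matches += 1
            ref_counts[tok] -= 1
    return matches / len(cand)

reference = "pleural effusion is present"
contradiction = "pleural effusion not present"
# The contradictory statement still scores 0.75 because three of its four
# tokens overlap with the reference, despite the opposite clinical meaning.
print(unigram_precision(contradiction, reference))  # 0.75
```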
On ChatCAD's Limitation (R1 q4): ChatCAD's three prompt models (classifier, segmentor, and report generator) are trained separately with different learning objectives. These models are therefore "misaligned" and can produce conflicting information (e.g., the caption model highlights cardiomegaly while the segmentation model focuses on the lung diaphragm). In Table 1, RadAlign, as a unified architecture, improves substantially (0.648 -> 0.678) when moving from gpt4o-mini to gpt4o, while ChatCAD remains almost unchanged (0.633 -> 0.634).
On Experiment Setup (R3 q2): In Table 2, we deliberately included strong zero-shot baselines like CLIP (general) and biomedCLIP (medical). Despite its extensive training data, biomedCLIP still underperformed on external datasets like OpenI. This zero-shot testing setup fairly evaluates generalizability across the models. For direct comparison, we included BioViL, PCAM, ChatCAD, and LABO, which were fine-tuned on MIMIC-CXR.
On LLM Prompt Contributions: For the ablation studies (R1 q1, R3 q1), our records on gpt4o show the following improvements: disease class prediction alone yields a GREEN score of 0.609, adding diagnostic criteria improves it to 0.647, and adding the cross-modal RAG system brings the total to 0.678. Each component contributes meaningfully to overall system performance (~4% gained per component).
On Potential Bias: We implemented thorough quality control for the LLM-generated criteria by removing disease label contamination, eliminating language that associates conditions with demographic groups, and ensuring descriptions align with current medical understanding (R2 q2). We also conducted a cross-center evaluation on the MIMIC-CXR and OpenI datasets (Table 2) to assess model generalization across centers.
On Computational Efficiency (R2 q3): RadAlign achieves its efficiency through a lightweight architecture with only four trainable components (Fig 1). Training takes just under 6 hours on 8 RTX 8000 GPUs. At inference, RadAlign averages 31.4 ms with a memory allocation of 534.53 MB; for reference, a plain ResNet50 takes 36.5 ms with a memory allocation of 124.35 MB. Competing approaches require much longer training times due to their multiple specialized models and complex architectures.
As pointed out by R3, the bold font in Table 2a was a typo; the best results should be highlighted. We will correct this in the final version.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers agreed to accept the manuscript, and the rebuttal addressed most reviewers’ doubts.