Abstract
Despite significant advancements in adapting Large Language Models (LLMs) for radiology report generation (RRG), clinical adoption remains challenging due to difficulties in accurately mapping pathological and anatomical features to their corresponding text descriptions. Additionally, semantic-agnostic feature extraction further hampers the generation of accurate diagnostic reports. To address these challenges, we introduce Medical Concept Aligned Radiology Report Generation (MCA-RG), a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to enhance the report generation process. MCA-RG utilizes two curated concept banks: a pathology bank containing lesion-related knowledge, and an anatomy bank with anatomical descriptions. The visual features are aligned with these medical concepts and undergo tailored enhancement. We further propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features, coupled with a matching loss for pathological features to prioritize clinically relevant regions. Additionally, a feature gating mechanism is employed to filter out low-quality concept features. Finally, the visual features, each corresponding to an individual medical concept, are leveraged to guide the report generation process. Experiments on two public benchmarks (MIMIC-CXR and CheXpert Plus) demonstrate that MCA-RG achieves superior performance, highlighting its effectiveness in radiology report generation.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0297_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{XinQil_MCARG_MICCAI2025,
author = { Xing, Qilong and Song, Zikai and Zhang, Youjia and Feng, Na and Yu, Junqing and Yang, Wei},
title = { { MCA-RG: Enhancing LLMs with Medical Concept Alignment for Radiology Report Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15964},
month = {September},
pages = {381 -- 391}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a knowledge-driven framework that explicitly aligns visual features with distinct medical concepts to guide the report generation. The authors also propose an anatomy-based contrastive learning procedure to improve the generalization of anatomical features and a feature gating mechanism to filter out low-quality concept features. Extensive experiments have shown that the method proposed in this paper improves the quality of medical report generation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The method proposed in this paper is somewhat innovative and the experimental results show that the method is meaningful in improving the quality of the generated medical reports.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The authors use a Transformer architecture but do not provide its specific settings, nor discuss how its parameter count compares with the pre-trained model used. Furthermore, the text encoder may already be powerful enough to distinguish text features across these different domains, so are two Transformers necessary? Could a single shared Transformer suffice, and what is the improvement over parameter sharing? These questions were not validated in the ablation experiments. Overall, the authors fail to explain well why a two-branch structure is necessary, and do not explore its advantages and disadvantages over a single branch.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper proposes to align the LLM with medical concepts, which makes sense. The proposed method is somewhat innovative, but the authors provide insufficient evidence to show the motivation and significance of their use of a two-branch structure.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The response to Q1 explains the parameters and model, and the response to Q2 has a certain basis. Overall, it dispelled my doubts.
Review #2
- Please describe the contribution of the paper
The paper introduces Medical Concept Aligned Radiology Report Generation (MCA-RG), a framework enhancing radiology report generation (RRG), especially for chest X-rays, by explicitly aligning visual features extracted from medical images with specific medical concepts derived from pathological and anatomical entities. The approach integrates anatomy and pathology concept banks, anatomy-based contrastive learning, and a pathology-anatomy matching loss to prioritize clinically relevant regions. A feature gating mechanism further filters out low-quality concept features, enhancing the clinical relevance and accuracy of the reports generated by leveraging Large Language Models (LLMs).
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Novel Approach: MCA-RG presented a novel method for aligning medical concepts (anatomy and pathology) explicitly with visual features from radiology images. Unlike previous methods relying on conflated features, MCA-RG separately enhances anatomical and pathological features, making it a unique contribution to multimodal learning in chest X-ray report generation.
- Feature Gating as a Countermeasure for Noisy Attention: MCA-RG proposed a feature gating mechanism that scores attention maps, filtering out low-quality or uncertain concept embeddings. By selectively using only the strongest, highest-confidence concept features, the feature gating step refined the input to the language model, reducing noise and enhancing both the interpretability and reliability of the final report.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Potential Hallucination in GPT-4 Knowledge Enrichment: MCA-RG uses GPT-4 to expand each medical concept with additional explanations. While this can enrich the semantic content, large language models have been known to introduce factual errors or hallucinations. If MCA-RG is intended for clinical use, the paper should detail how the authors validated these expansions or how incorrect knowledge might have been filtered out.
- Limited Usage of Evaluation Metrics: The reported metrics in the paper focus on classical lexical scores (BLEU, ROUGE-L, METEOR) and a clinical efficacy metric introduced in the WarmStart paper, which used CheXpert labels. Although these are common benchmarks, they omit other potentially more clinically informative metrics such as RadGraph F1, RadCliQ [1], and RaTEScore [2].
- Incomplete SOTA Comparisons: The paper excludes certain non-LLM baselines (e.g., CXR-Mate [3]) and omits newer LLM-based methods (e.g., MAIRA-2 [4], CXR-LLaVA [5], LLaVA-Rad [6]), raising questions about how MCA-RG stacks up against the SOTA in RRG. While the authors do compare to MiniGPT-Med and MedDR, those models are not specifically tailored for chest X-ray report generation. If the authors are also interested in benchmarking against general-purpose medical models, it would be helpful to include additional models such as MedImageInsight [7], the latest generalist model that performs well in chest X-ray report generation. Including these models would help the reader see how effective MCA-RG is.
- Selective Inclusion of CheXpertPlus Results: Table 1 presents only a partial set of models for the CheXpertPlus dataset: some baselines evaluated on MIMIC-CXR are left out on CheXpertPlus. Table 2 likewise covers only a partial set of models. The paper gives no explanation for this partial selection of models.
[1] RadCliQ: Evaluating progress in automatic chest X-ray radiology report generation. Yu et al., Patterns (N Y), 2023.
[2] RaTEScore: A Metric for Radiology Report Generation. Zhao et al., EMNLP 2024.
[3] CXRMate: Longitudinal data and a semantic similarity reward for chest X-ray report generation. Nicolson et al., Informatics in Medicine Unlocked, 2024.
[4] MAIRA-2: Grounded Radiology Report Generation. Bannur et al., arXiv, 2024.
[5] CXR-LLaVA: a multimodal large language model for interpreting chest X-ray images. Lee et al., European Radiology, 2025.
[6] LLaVA-Rad: Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation. Chaves et al., arXiv, 2024.
[7] MedImageInsight: An open-source embedding model for general domain medical imaging. Codella et al., arXiv, 2024.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
The WarmStart paper references a model formally named CvT2DistilGPT2, yet the current submission simply calls it “WarmStart.” It would be clearer to acknowledge WarmStart as the paper or approach, while referring to its actual model by the original name CvT2DistilGPT2.
RadGraph-XL [1] is a more recent and comprehensive resource than the original RadGraph. The authors' decision to rely on RadGraph may hinge on version compatibility, the timing of their experiments, or the original RadGraph's simpler tag set. Still, it would strengthen the paper to address RadGraph-XL's improvements or at least discuss how MCA-RG might incorporate them in future work. [1] RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports. Delbrouck et al., ACL Findings 2024.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper introduces an interesting approach. However, both the experimental design and the results need improvement.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
Significant issues still persist. Primarily, key state-of-the-art methods (e.g., CXR-Mate, MAIRA-2, CXR-LLaVA, LLaVA-Rad) are still missing, and the rebuttal's justification regarding CXR-Mate, namely that it requires multiple images, is incorrect: CXR-Mate can be used with a single image. Also, the description of the GPT-4 validation was not detailed, and the feedback on clinical evaluation metrics (RadGraph, RadCliQ, RaTEScore) was also insufficient (only compared with a single model).
Review #3
- Please describe the contribution of the paper
- Proposed the MCA-RG image encoding framework to enhance the alignment between LLMs and chest X-rays with medical concepts.
The authors parse medical reports into (finding–anatomy–existence) triplets using RadGraph and construct a pathology concept bank and an anatomy concept bank. Based on these two concept banks, a Transformer Decoder is used to align image features (encoded with ResNet) and concept text features (encoded with Clinical BERT). The aligned features of both concept types are then fed into LLMs.
- Introduced anatomical contrastive learning to optimize the consistency of anatomical image feature representations. After decoding anatomical image features using the Transformer Decoder, the authors apply an MLP for secondary feature mapping. Positive and negative sample pairs are constructed to increase the similarity of the same anatomical structures across different patients, thereby eliminating inter-patient anatomical variability.
- Incorporated a pathology-anatomy matching loss to enhance the discriminability of image features under different concepts. Based on the existence attribute in the triplets, the authors calculate the similarity between pathological image features and multiple anatomical features, encouraging the model to focus more on anatomically relevant regions that are indicative of pathological findings.
- Introduced gated attention entropy to filter low-quality image features. In the feature selection stage, the authors compute the entropy of each image feature's attention and filter out features with insufficiently concentrated attention, thus removing low-quality representations.
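The entropy-based gating described in this contribution list could be sketched as follows. This is an illustrative reconstruction, not the authors' code: the threshold value, the normalization by maximum entropy, and the tensor shapes are assumptions for the sketch (the paper additionally learns a per-concept adjustment mapping before gating, omitted here).

```python
import torch

def entropy_gate(attn_maps: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Keep concept features whose attention is sufficiently concentrated.

    attn_maps: (num_concepts, num_patches); each row is a softmax attention
               distribution over image patches (rows sum to 1).
    Returns a boolean mask of shape (num_concepts,): True = keep the feature.
    """
    eps = 1e-8
    # Shannon entropy of each concept's attention distribution over patches.
    entropy = -(attn_maps * (attn_maps + eps).log()).sum(dim=-1)
    # Normalize by the maximum possible entropy (uniform attention), so the
    # threshold is independent of the number of patches (assumed convention).
    max_entropy = torch.log(torch.tensor(float(attn_maps.shape[-1])))
    normalized = entropy / max_entropy
    # Diffuse (high-entropy) attention is treated as low quality and filtered.
    return normalized < threshold

# Example: a focused and a fully diffuse attention map over 4 patches.
focused = torch.tensor([0.94, 0.02, 0.02, 0.02])
diffuse = torch.tensor([0.25, 0.25, 0.25, 0.25])
mask = entropy_gate(torch.stack([focused, diffuse]), threshold=0.9)  # [True, False]
```

The concern raised in the weaknesses below, that a large lesion might also produce dispersed attention, is exactly about where this threshold sits.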
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The alignment mechanism between medical images and pathology-anatomy concepts demonstrates strong novelty. Traditional MLLM encoders typically focus on global image features. In contrast, the authors decompose these complex global features into region-level features aligned with specific medical concepts. This approach enables the model to more effectively capture and understand abnormal regions within the images.
- Contrastive learning is employed to reduce anatomical variation across different patients. By designing contrastive learning for the same anatomical structures across different patients, the authors enhance the consistency of anatomical feature representations. This allows the model to better focus on pathological abnormalities while mitigating misleading signals caused by anatomical variability.
- A pathology-anatomy matching loss is used to emphasize pathology-relevant regions.
The authors introduce a pathology-anatomy matching loss that constrains pathological image features within their corresponding anatomical structures, thereby promoting more focused attention on pathological regions.
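The anatomy-based contrastive objective described in the strengths above could be sketched as an InfoNCE-style loss, where features of the same anatomical structure from two different patients form a positive pair and other structures act as negatives. This is a hedged illustration, not the authors' implementation: the temperature value, feature dimensions, and the cross-entropy formulation are assumptions.

```python
import torch
import torch.nn.functional as F

def anatomy_contrastive_loss(feats_a: torch.Tensor, feats_b: torch.Tensor,
                             temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss. Row i of feats_a and feats_b hold MLP-mapped features
    of the SAME anatomical structure taken from two different patients (a
    positive pair); all other rows in the batch serve as negatives.

    feats_a, feats_b: (num_structures, dim)
    """
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature        # pairwise cosine similarities, scaled
    targets = torch.arange(a.size(0))       # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy check: matched same-structure features yield a lower loss than mismatched.
torch.manual_seed(0)
feats = torch.randn(8, 32)
loss_matched = anatomy_contrastive_loss(feats, feats.clone())
loss_mismatched = anatomy_contrastive_loss(feats, feats.roll(1, dims=0))
```

Pulling same-structure features together across patients is what the reviewer credits with suppressing inter-patient anatomical variability.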
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Although the use of pathological and anatomical image features enables the LLM to more effectively localize lesions, information loss inevitably occurs during the encoding process compared to the original images. Since only two types of image features are input into the LLM, it remains uncertain how well the encoding process preserves detailed lesion information—such as the size and edge characteristics of lung nodules—which are crucial for accurate and descriptive report generation.
- Regarding the proposed gating mechanism, if the pathological region occupies a relatively large area within the anatomical structure—such as significant cardiomegaly, multiple pulmonary nodules, or large areas of pulmonary opacity due to pneumonia—it raises the concern that the attention may become dispersed across the large pathological area. As a result, the gating mechanism might filter out these image features due to perceived low attention concentration, potentially leading to the omission of key pathological characteristics in typical disease cases.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
I believe the experimental design of this paper is relatively well-constructed. Through ablation studies and comparative experiments, it demonstrates that the proposed method effectively enhances the alignment between LLMs and medical concepts in medical images. Current general-purpose or medical-specific LLMs/LVMs still exhibit significant deficiencies in the perception of medical concepts, so the method proposed in this work is worthy of reference for improving the perception capabilities of LVMs. However, due to the space limitations of the article, I still have some questions about certain details, which are crucial to understanding the value of this method and its potential for further extension. Therefore, I would like to accept the paper after clarifying these issues and ensuring there are no critical flaws:
- After extracting image features through the Transformer decoder, how is it ensured that important lesion details in the medical images are not excessively lost?
- For pathological features covering large lesion areas (such as diffuse lung opacities or multiple pulmonary nodules), will they be filtered out by the gating mechanism?
- In the pathology-anatomy matching loss, how is the dispersion of attention on pathological features avoided when computing similarity with anatomical regions—such that attention remains focused on lesions within the anatomical areas, rather than being attracted to the general anatomical features?
- In the anatomical contrastive learning component, how are samples selected to reduce the variability in anatomical structures across different patients?
- KiUT is a well-known alignment method based on medical concepts—should it also be included in the list of models used for comparison?
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
- The authors answered all my questions. However, Figure 4 only indicates that the model captures the key features of pneumonia; it cannot prove that the feature map retains more detailed information inside the lesion area. Also, no results show that the learnable extra tokens can capture detailed information. This is still a weakness, but the alignment method in this manuscript remains meaningful for the medical report generation task. The authors have answered the questions about the gating mechanism.
- I agree with R1's comment that the dual-branch Transformer structure has not been well explained and its superiority has not been proven; this should be enriched. However, regarding R3's comments, the NLG and CE metrics, which are widely used (R2Gen, KiUT, RadFM, ...), are sufficient to indicate the model's advantages.
I suggest “Accept” (with low confidence)
Author Feedback
We thank the reviewers for their constructive comments and appreciations of our strengths such as ‘the method is meaningful’ (R1), ‘demonstrates strong novelty’ (R2) and ‘a unique contribution to multimodal learning’ (R3). We address each reviewer’s concerns point by point:
To R1: Q1. Transformer: We use the DeTR module [5] (4M parameters) for both anatomical and pathological feature alignment, which consists of 4 transformer layers. We use Bio_ClinicalBERT (108M) as the pre-trained text encoder and the pretrained LLaMA model (7B) for final report generation. Q2. Two branches: we emphasize that text features may conflate medical concepts, while our goal is to capture accurate anatomical and pathological features. Our two-branch design separates these into distinct feature spaces, facilitating both enhanced feature representation and reduced semantic confusion. Ablation results show the dual-transformer setup outperforms the shared version (Macro F1: 0.335 vs. 0.317).
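The two-branch alignment the authors describe here could be sketched as two DETR-style transformer decoders with separate parameters, where concept text embeddings act as queries cross-attending over image patch features. This is an illustrative reconstruction under stated assumptions: the layer count (4) matches the rebuttal, but the hidden dimension, head count, and concept counts are made up for the sketch, and `ConceptAligner` is a hypothetical name.

```python
import torch
import torch.nn as nn

class ConceptAligner(nn.Module):
    """DETR-style decoder: concept embeddings (queries) attend over patches."""
    def __init__(self, dim: int = 256, num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, concept_embeds: torch.Tensor,
                patch_feats: torch.Tensor) -> torch.Tensor:
        # concept_embeds: (B, num_concepts, dim) text-side queries
        # patch_feats:    (B, num_patches, dim)  image-side memory
        return self.decoder(tgt=concept_embeds, memory=patch_feats)

# Separate parameters per branch, per the dual-transformer ablation above.
anatomy_branch = ConceptAligner()
pathology_branch = ConceptAligner()

img = torch.randn(2, 49, 256)          # e.g. a 7x7 patch grid, flattened
anat_queries = torch.randn(2, 10, 256) # anatomy concept bank embeddings
path_queries = torch.randn(2, 14, 256) # pathology concept bank embeddings
anat_feats = anatomy_branch(anat_queries, img)
path_feats = pathology_branch(path_queries, img)
```

Each output token stays tied to one named concept, which is what lets the downstream LLM be conditioned on per-concept visual features rather than a conflated global embedding.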
To R2: Q1. Information loss: Our feature enhancement ensures that tokens for selected medical concepts capture fine-grained visual information, as demonstrated by the attention maps in Fig. 4 showing accurate grounding of lesions. To further mitigate information loss, we have introduced extra learnable tokens to encode additional image details beyond the selected concepts. Q2. Gating for large lesions: Large-coverage lesions will not be filtered out by the gating mechanism, as an adjustment mapping for each medical concept (Eq. 6) is learned to refine entropy values before gating. According to our observations, low-quality features exhibit far more diffuse attention than large lesions. This mapping enables the model to retain meaningful features from large lesions while filtering out low-quality ones. Q3. Avoiding attention dispersion: The combined loss (Eq. 8) penalizes overemphasis through the pathology prediction term, and hence prevents overly focusing on anatomical regions. Q4. Sample selection in anatomical contrastive learning: Samples are randomly selected to minimize the effect of anatomical variability. Q5. Include KiUT for comparison: We will include KiUT for comparison in the revision.
To R3: Q1. Errors of GPT-4: The explanations from GPT-4 were reviewed and rectified by experienced clinicians. Q2. Limited metrics: For additional evaluation on MIMIC-CXR, our method outperforms WarmStart on RadGraph F1 (0.158 vs. 0.149), RadCliQv1 (1.367 vs. 1.391; lower is better), and RaTEScore (0.440 vs. 0.432). Q3. Insufficient comparisons: In Table 1, we include three non-LLM methods—R2Gen, R2GenCMN, and WarmStart. For recent LLM-based methods, we include AMC (ACL 2024), XrayGPT (BioNLP 2024), ORID (WACV 2025), and the widely referenced MiniGPT-Med (2024). MedDR is included following [18]. As MiniGPT-Med wasn't trained for report generation, we retrained it for fair comparison. While CXR-Mate uses multi-image inputs, MCA-RG achieves comparable performance with single images, demonstrating effective feature extraction. Though MAIRA-2, CXR-LLaVA, LLaVA-Rad, and MedImageInsight report stronger results, they depend on large datasets or detailed annotations often unavailable in clinical practice. Our method is annotation-efficient, lightweight, and readily transferable to other domains like abdominal X-rays. We will include comparisons and discussions of the suggested models in the revision. Q4. Inclusion of results: The partial comparisons are mainly due to space limits. For CheXpertPlus, we prioritized models with usable open-source code to enable fair re-evaluation. For Table 2, we selected methods that report both macro and micro scores for a complete comparison. Q5. Reference: We will correct the cited model name to CvT2DistilGPT2. Q6. Future work: MCA-RG accurately identifies pathologies and affected areas. In the future, we plan to extend it to abdominal and brain domains using the RadGraph-XL dataset. Q7. Reproducibility: We will release the code on GitHub.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
Please respond to reviewers, particularly on the comparison to the current methods based on SOTA foundational models since radiology report generation is an active field.
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
- The idea to leverage medical concepts with LLM makes sense to me.
- I am aware that R3 mentioned the paper did not compare with all the SOTA. In my opinion, we should not require the authors to compare with all the SOTA methods, especially those that are only on arXiv and have not been peer reviewed.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
Despite one dissenting review, the paper's overall merit warrants acceptance. The rebuttal addressed most technical concerns.