Abstract

Automatic medical report generation (MRG) holds considerable research value and has the potential to significantly alleviate the workload of radiologists. Recently, the rapid development of large language models (LLMs) has improved the effectiveness of MRG. However, numerous challenges still need to be addressed to achieve highly accurate medical reports. For instance, most existing methods struggle to interpret image details, lack relevant medical knowledge, and overlook fine-grained cross-modality alignment. To overcome these limitations, we propose a knowledge-guided vision-language alignment framework with contrastive learning and LLMs for medical report generation. The proposed method leverages visual representations, relevant medical knowledge, and enhanced features to generate accurate reports via the LLM-based decoder. To improve the integration of medical-related information, we introduce the Knowledge Injection Module, which enhances the model's feature representation capabilities while unlocking domain knowledge in LLMs. Inspired by contrastive learning, we design the Contrastive Alignment Module to align visual features with textual information. Additionally, the Cross-Modality Enhancement Module retrieves reports similar to the input images to boost diagnostic accuracy. We conduct extensive experiments on two popular benchmark datasets, IU X-Ray and MIMIC-CXR. The results demonstrate that our proposed method achieves promising performance compared with state-of-the-art frameworks.
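
A contrastive alignment module of this kind is typically instantiated as a CLIP-style symmetric InfoNCE objective. The sketch below is illustrative only, not the authors' implementation; the function name and temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_feats: torch.Tensor,
                               txt_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired image/report embeddings.

    img_feats, txt_feats: (B, D) outputs of the visual and text encoders;
    row i of each tensor belongs to the same case, and all other rows in
    the batch act as in-batch negatives.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> report
    loss_t2i = F.cross_entropy(logits.t(), targets)  # report -> image
    return (loss_i2t + loss_t2i) / 2
```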

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1713_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/syysha0k/KACL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ShaYuy_Contrastive_MICCAI2025,
        author = { Sha, Yuyang and Pan, Hongxin and Meng, Weiyu and Li, Kefeng},
        title = { { Contrastive Knowledge-Guided Large Language Models for Medical Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {110--119}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a framework for Medical Report Generation (MRG) that significantly improves the alignment between medical images and their corresponding textual descriptions using large language models (LLMs). The framework adopts a modular architecture that integrates the Knowledge Injection Module (KIM), the Contrastive Alignment Module (CAM), and the Cross-Modality Enhancement Module (CEM). The KIM enhances the model’s feature representation capabilities and enriches the LLM’s understanding of relevant clinical conditions, enabling contextual reasoning over interrelated medical entities. The CAM strengthens cross-modal alignment between visual and textual features, improving the semantic consistency of the generated reports. Meanwhile, the CEM mimics common clinical practice by retrieving the most similar past reports for a given input image and refining them through attention mechanisms to support the current report generation—particularly beneficial for ambiguous or rare cases. Experiments conducted on the IU X-Ray and MIMIC-CXR datasets show that KLA outperforms or matches state-of-the-art methods in most Natural Language Generation (NLG) metrics (e.g., BLEU, METEOR, ROUGE-L) and Clinical Efficacy (CE) metrics such as precision.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors propose a vision-language architecture that leverages LLaMA-3 and ViT for robust medical report generation. The framework incorporates a Knowledge Injection Module (KIM) to infuse domain knowledge from medical knowledge graphs, thereby enhancing contextual understanding. A Contrastive Alignment Module (CAM) is employed to improve the alignment between visual features and textual descriptions through contrastive learning. Additionally, the Cross-Modality Enhancement Module (CEM) leverages cross-modal retrieval to incorporate contextually similar reports. The model demonstrates high performance on the IU X-Ray and MIMIC-CXR benchmarks, achieving state-of-the-art results on key natural language generation (NLG) and clinical efficacy (CE) metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Despite its contributions, the paper also exhibits several limitations. GCN is used but never defined; I assume it means graph convolutional network. CLS is also used but not defined; I assume it means the classification (CLS) token. The approach lacks architectural novelty, as it heavily relies on existing pre-trained models such as ViT (Vision Transformer), CLIP, and LLaMA. Of course, this is not uncommon, since most LLM-based frameworks depend on pre-trained architectures. However, when comparing this work to the research in [1], it becomes difficult to clearly identify the technical contributions of this paper, as the proposed architecture is very similar to that in [1]. Please clearly state what fundamentally differentiates your approach from [1]. [1] Li, Y., Wang, Z., Liu, Y., Wang, L., Liu, L. and Zhou, L., 2024, October. KARGEN: Knowledge-enhanced automated radiology report generation using large language models. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 382-392). Cham: Springer Nature Switzerland. It is also unclear how a long-term memory mechanism for contextual reasoning across varied and temporally linked clinical cases is handled. Does querying a large external knowledge base via cross-modal retrieval, as done in CEM, not pose challenges in terms of speed, computational efficiency, or operational scalability, especially when considering deployment in clinical settings? Furthermore, knowledge graphs like those used in KIM require constant updates to remain clinically relevant and safe for real-world applications.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    The implementation details are clearly described, and the comparisons are thorough. It would be valuable if the code and prompt libraries were released.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The approach is very interesting as it leverages pre-trained LLMs to generate radiology reports. This is a trending topic and therefore requires more methodological rigor to add novelty and value to the approach, which currently seems somewhat lacking.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The concerns raised have been addressed by the authors. Despite some weaknesses, the paper deserves to be discussed at the conference.



Review #2

  • Please describe the contribution of the paper

    The paper presents KLA, an automatic report generation framework based on large language models that includes several components: a knowledge injection module (KIM), a contrastive alignment module (CAM), and a cross-modality enhancement module (CEM). KIM uses a knowledge graph of chest diagnoses to impart knowledge into KLA, CAM aligns report text and image features, and CEM uses an external database of reports and finds the most relevant ones, which are used as auxiliary information during training. The proposed method is tested against several sota methods using two public datasets, showing on-par or better performance.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Automatic report generation is desirable and an important topic.
    • The combination of KIM, CAM and CEM is seemingly novel, and improves model performance.
    • Comparisons to several sota baselines - the proposed model in general showing top results.
    • Ablation studies show the contribution of the different components KIM, CAM, and CEM.
    • Manuscript is well written.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • In the CAM module, the authors seemingly used a generic pre-trained BERT model for report text encoding, although there are several language models specifically pre-trained on the medical domain (e.g. CXR-BERT, CheXBERT) and medical VLMs that can be used for image+report pairs (e.g. BioVil, MedCLIP, BioMedCLIP).
    • It is unclear if and how the classes of the reports from the external database R_E are used. The top K most relevant reports are identified, but does the class (e.g. sick or healthy) come into play? If not, it seems it should be used to find the most relevant reports.
    • Table 2 would benefit from additional comparative information, such as the number of model parameters and FLOPs when including the different modules.
    • Section 3.4 (the ablation study) does not discuss the CEM module but should.
    • The paper does not include any discussion of negative pairs that are of the same class (e.g. a report and an image from different cases that are both healthy). Are they truly negative? How does that affect contrastive training in CAM? (One common mitigation is sketched below.)
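
    To illustrate the last point: assuming per-case class labels are available, a common mitigation is to mask same-class "false negatives" out of the contrastive denominator. A hypothetical sketch, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def label_masked_infonce(img_feats: torch.Tensor,
                         txt_feats: torch.Tensor,
                         labels: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE variant that ignores negatives sharing the anchor's class.

    labels: (B,) class per case, e.g. 0 = healthy, 1 = sick (illustrative).
    Same-class off-diagonal pairs are removed from the softmax so that a
    healthy report is never pushed away from a healthy image of a
    different case.
    """
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.t() / temperature                     # (B, B)
    same_class = labels.unsqueeze(0) == labels.unsqueeze(1)  # (B, B) bool
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    logits = logits.masked_fill(same_class & ~eye, float("-inf"))
    targets = torch.arange(len(labels), device=labels.device)
    return F.cross_entropy(logits, targets)                  # image -> report
```
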
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Please define the abbreviation KLA on first use, or clarify if it is simply a name.
    • Define abbreviations on first use only (e.g. KIM, CAM, CEM); avoid redefining them later.
    • The font size in Fig. 1 is unacceptably small. Please adjust.
    • Please use bold lettering in Table 2 to denote the top results.
    • Please add up or down arrows after the NLG and CE metrics in tables to indicate whether a high or low value is desirable.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The manuscript tackles an important and interesting topic of automatic report generation using novel ideas on how to inject medical knowledge and utilize existing reports. The paper is well written with sound and relevant experiments, sota comparisons, and ablation studies. However, there are some details in the method that need clarifications, and some analysis/discussion of results are missing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the main concerns of myself and the other reviewers, including that the language model used was indeed pretrained on medical data, and clarified the novelty of the modular components in their model. Some less important concerns and comments received the response that they would be investigated in future work.



Review #3

  • Please describe the contribution of the paper

    This work proposes an automatic medical report generation (MRG) framework based on a vision model, incorporating knowledge injection, a contrastive alignment module, and a cross-modality enhancement module. The approach demonstrates strong performance in generating medical reports compared to using visual information alone and achieves competitive results against various state-of-the-art methods for MRG. The integration of knowledge from a report database and a single report is innovative and well structured within the proposed pipeline.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The model effectively enhances visual features by leveraging textual representations derived from a knowledge graph, a report database, and a text prompt. The integration of these additional modules is innovative and contributes to improving medical report generation by providing richer contextual information for the visual features.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • It is unclear from Figure 1, and also from the text (e.g., regarding the ViT), which components of the model were fine-tuned and which were kept frozen. Clarifying this visually would complement the textual description and improve overall understanding. Additionally, Figure 1 would benefit from the inclusion of the two proposed loss functions to better illustrate how the final model was trained. It was also not clear what ground truth was used and how it was incorporated into the training process.
    • It is unclear how the TextPrompt is derived from the Contrastive Alignment Module (CAM), given that this module is described in the text as aligning visual representations with those extracted from the report (like CLIP). The role of CAM and its connection to the generation of the TextPrompt should be clarified.
    • Throughout the paper (conclusions, results, and even the highlights in the introduction section) the authors claim that their method outperforms the state-of-the-art. However, this claim is not strongly supported by the results. For instance, compared to BoostRRG, the performance gains are marginal (e.g., a difference of only 0.002), which is likely not statistically significant. Similarly, on the MIMIC dataset, the proposed method does not show a clear advantage over either BoostRRG or PromptMRG. In fact, for the CE metrics, the method is outperformed by PromptMRG. Therefore, while the proposed model may be considered competitive with the state-of-the-art, it would be more appropriate to present it as such, rather than claiming it clearly surpasses existing approaches. A discussion of the differences w.r.t. these methods should also be added.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Some minor remarks:

    • In the abstract the method is referred to as “CMK”, while throughout the paper it is called “KLA”. Additionally, the acronym “KLA” is never explicitly defined.

    • Several parts of the text are repetitive. For instance, the claim that five modules in the KLA model are proposed is mentioned at the end of the introduction, in the introduction highlights, and again at the beginning of Section 2.1.

    • The KIM could be better explained. Specifically, it would help to clarify what the nodes represent and how the node names are encoded before the cross-attention is applied (a hypothetical sketch follows these remarks).

    • It is unclear how the alpha parameter is used to weigh the losses. The value of this hyperparameter is not given in the Experimental Setup section, which would improve reproducibility (see the note after these remarks).

    • Section 3.3 states a ROUGE score of 0.383, while the corresponding table shows 0.385.
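
    Two of the remarks above are illustrated here. On the loss weighting: given the loss names in the author feedback (L_llm and L_cam), the combined objective is presumably a weighted sum of the form

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{llm}} + \alpha \, \mathcal{L}_{\text{cam}},
$$

    with alpha the hyperparameter whose value is not reported. On the KIM remark: below is a hypothetical sketch of cross-attention from visual tokens to encoded knowledge-graph node names; all names and dimensions are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class KnowledgeCrossAttention(nn.Module):
    """Hypothetical KIM-style block: visual tokens attend over embeddings of
    knowledge-graph node names (e.g., disease entities encoded by a text
    encoder). Dimensions and names are illustrative, not the authors' code."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens: torch.Tensor, node_embeds: torch.Tensor):
        # visual_tokens: (B, N, D); node_embeds: (M, D), shared across the batch
        kv = node_embeds.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        enriched, _ = self.attn(query=visual_tokens, key=kv, value=kv)
        return self.norm(visual_tokens + enriched)  # residual + norm
```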

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In addition to their methodological contributions, the authors position the paper as state-of-the-art by comparing it with other recent, similar approaches. Therefore, it is important to consider this work within the specific research area in which it was developed.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their valuable insights. The code will be available.

R1-1 Pre-trained BERT. In CAM, we adopted a BERT model pre-trained on medical-related datasets, but we did not highlight this in the manuscript. We will clarify the description of BERT. R1-2 R_E's information. In the CEM, R_E comprises the top-K reports most relevant to the input image, extracted from the external database. We mainly used the text-based content of these reports, including the individual's health status, disease type, diagnosis description, etc. Results show that CEM improves the accuracy and stability of the generated reports. R1-3 Parameters and FLOPs. In the future, we will explore how the parameters and FLOPs of the proposed modules affect model performance. R1-4 Discussion of CEM. Due to space limitations, we have only included some CEM results in Tab. 2. We will include more discussion of CEM in the next version and report detailed results on GitHub. R1-5 Discussion of negative pairs. In future work, we will investigate how positive and negative pairs affect model performance.

R2-1 Definition of GCN & CLS. GCN and CLS refer to graph convolutional network and the classification (CLS) token, respectively. R2-2 Lacks architectural novelty. LLMs were utilized as the decoder, which is not the main innovation of our work. Our main contributions are the three proposed modules: KIM, CAM, and CEM. R2-3 Difference from KARGEN (KAR). KAR is an excellent work that we discussed and cited in our paper (Ref-15). KAR proposed an MRG model based on a knowledge graph and feature fusion. In our work, we also used knowledge graphs to improve the model's performance. However, there are differences in the implementation details, such as how the knowledge graphs are defined. Additionally, several earlier studies found that knowledge graphs can improve model performance for report generation, such as KGAE (Liu, 2021 @ NeurIPS) and CGT (Li, 2022 @ CVPR), whereas KAR was published in 2024. Notably, the feature fusion proposed in KAR was not included in our work, and our proposed CAM and CEM were not reported in KAR. Therefore, we believe that our proposed method is not similar to KAR. R2-4 Long-term memory. In this work, we aim for the model to generate high-quality responses to input images. Long-term memory performance may depend heavily on the capacity of the employed LLMs and may not necessarily be linked to our proposed modules. We will conduct research on this topic next. R2-5 Speed & Deployment. The external information used by CEM can be pre-encoded and stored in vector databases, such as Faiss and Chroma, which offer very fast retrieval speeds, so model deployment and efficiency do not pose challenges. R2-6 KIM update. Our model is trained on MIMIC-CXR and IU X-Ray, which are widely used by many researchers, so the safety and medical relevance of the features extracted by KIM can be ensured.
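
For context on R2-5: pre-encoding the report database and querying it at inference time is indeed cheap with an exact inner-product index. A minimal Faiss sketch under assumed file names and dimensions (not the authors' released code):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Hypothetical file of pre-encoded report embeddings, shape (num_reports, d).
report_embeds = np.load("report_embeds.npy").astype("float32")
faiss.normalize_L2(report_embeds)                  # cosine similarity via inner product

index = faiss.IndexFlatIP(report_embeds.shape[1])  # exact inner-product index
index.add(report_embeds)                           # built once, offline

def retrieve_top_k(image_embed: np.ndarray, k: int = 5):
    """Return indices and scores of the k reports closest to an image embedding."""
    q = np.ascontiguousarray(image_embed, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    scores, ids = index.search(q, k)
    return ids[0], scores[0]
```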

R3-1 Fig. 1. We will modify Fig. 1, add markers for the trainable and frozen modules, and add the locations of the loss functions. Specifically, CLIP and the LLMs were frozen, while the ViT, BERT, cross-attention, and projection layers needed to be trained. L_cam is applied at the end of the CAM, while L_llm is applied to the output of the LLM decoder. The "Report" in the CAM refers to the ground truth, which is used to calculate L_llm against the model's output. L_cam is calculated from the relationship between the image-text pairs. We will modify the description of Fig. 1 to make it easier to understand. R3-2 TextPrompt. We apologize for this mistake. CAM and TextPrompt are two independent modules. To enhance visual appeal, we placed the TextPrompt behind the CAM, but there should be no arrow from the CAM to the TextPrompt. We will fix it in the next version. R3-3 Result description. The result description needs improvement. We will revise it and include more discussion of SOTA methods in the final version.
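
For concreteness, the trainable/frozen split described in R3-1 might be configured as follows; the submodule attribute names are assumptions, not the released code.

```python
import torch.nn as nn

def configure_trainable(model: nn.Module) -> None:
    """Freeze CLIP and the LLM decoder; train the ViT, BERT, cross-attention,
    and projection layers, per the author feedback (attribute names are
    illustrative and will differ in the actual repository)."""
    for p in model.parameters():
        p.requires_grad = True                      # default: trainable
    for frozen in (model.clip, model.llm_decoder):  # assumed submodules
        for p in frozen.parameters():
            p.requires_grad = False
    n = sum(p.numel() for p in model.parameters() if p.requires_grad)
    print(f"Trainable parameters: {n / 1e6:.1f}M")
```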

Other minor issues will be addressed in the final submission.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Three reviewers recommended acceptance after rebuttal, citing the paper’s modular framework for medical report generation and strong empirical results.

    Reviewer #1 highlighted the integration of KIM, CAM, and CEM as a novel approach that improves report generation. Strengths included competitive performance, clear ablations, and good writing. Concerns about module clarity, BERT vs. domain-specific encoders, and metric presentation were addressed in the rebuttal, leading to acceptance.

    Reviewer #2 praised the architecture’s use of LLaMA-3, ViT, and cross-modal retrieval. While noting the reliance on pre-trained components and similarity to prior work (e.g., Kargen), the reviewer found the method well-motivated and empirically strong. Clarifications on technical contributions and deployment feasibility resolved initial concerns.

    Reviewer #3 supported the framework’s use of external knowledge sources and contrastive alignment but noted unclear module interactions, minor inconsistencies, and marginal performance gains. The rebuttal clarified these points, and the reviewer upheld their acceptance.

    In sum, the paper was accepted for its well-structured, effective approach to clinical report generation, with remaining issues deemed minor and addressable.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors have sufficiently addressed the reviewer’s comments. Additionally, I recommend the authors to reduce the use of abbreviations (where possible). At current state, excessive use of abbrevations slightly disrupts the flow of reading.


