Abstract

Automated radiology report generation holds significant research value as it has the potential to alleviate the heavy burden of report writing for radiologists. Previous studies have incorporated diagnostic information through multi-label classification to assist in report generation. However, these methods treat visual and diagnostic information equally, overlooking the differing importance of the two when generating different types of words, which can lead to errors in the generated reports. We propose the Image-Tag Adapter framework (ITAdaptor), which dynamically balances the contributions of visual and diagnostic information in the decoder, ensuring both are fully utilized during the report generation process. The model introduces two novel modules: Cross-Modal Knowledge Enhancement (CMKE) and Image-Tag Adapter (ITA). CMKE leverages pre-trained CLIP to retrieve similar reports from a database, assisting in the diagnosis of query images by providing relevant disease information. ITA adaptively fuses the visual information from the input images with the diagnostic information from the disease tags to generate more accurate reports. For training, we propose a strategy combining reinforcement learning and knowledge distillation, optimizing iteratively to distill knowledge into the ITAdaptor. Extensive comparative experiments on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the effectiveness of our proposed approach.
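
As a rough sketch of the adaptive fusion idea described in the abstract (not the paper's exact formulation; all names and dimensions here are illustrative assumptions), a learned scalar gate can weight the tag context against the visual context at each decoding step:

    import torch
    import torch.nn as nn

    class AdaptiveFusionGate(nn.Module):
        """Illustrative stand-in for the ITA fusion step; not the authors' code."""
        def __init__(self, d_model: int = 512):
            super().__init__()
            # gamma_t is predicted from the decoder's current hidden state
            self.gate = nn.Linear(d_model, 1)

        def forward(self, h_t, v_ctx, t_ctx):
            # h_t: (B, d) decoder state; v_ctx/t_ctx: (B, d) visual/tag contexts
            gamma_t = torch.sigmoid(self.gate(h_t))          # (B, 1), in [0, 1]
            c_t = gamma_t * t_ctx + (1.0 - gamma_t) * v_ctx  # adaptive fusion
            return c_t, gamma_t

    gate = AdaptiveFusionGate()
    h, v, t = (torch.randn(2, 512) for _ in range(3))
    c_t, gamma_t = gate(h, v, t)  # per the rebuttal, c_t then goes through an
                                  # FC layer and a softmax over the vocabulary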

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1619_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{DinShu_ITAdaptor_MICCAI2025,
        author = { Ding, Shuaipeng and Fan, Mengnan and Li, Mingyong and Wang, Chao},
        title = { { ITAdaptor: Image-Tag Adapter Framework with Knowledge Enhancement for Radiology Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {358--368}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a framework for chest x-ray report generation. The framework includes two major modules: Cross-Modal Knowledge Enhancement (CMKE) and Image-Tag Adapter (ITA). CMKE employs pretrained CLIP to retrieve similar reports from the training database, and ITA introduces a disease classification branch and fuses classification tags using cross-attention. The proposed framework is evaluated on two public benchmarks.
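
    As a loose illustration of this retrieval step (the actual CMKE pipeline, CLIP checkpoint, and feature handling may differ), a top-k cosine-similarity lookup over pre-embedded reports might look like the sketch below; the embeddings are random placeholders, and k = 12 follows the authors' rebuttal:

        import torch
        import torch.nn.functional as F

        def retrieve_reports(query_emb, db_embs, reports, k=12):
            # query_emb: (d,) CLIP embedding of the query radiograph
            # db_embs:   (N, d) CLIP embeddings of the report database
            sims = F.cosine_similarity(query_emb.unsqueeze(0), db_embs, dim=-1)
            topk = sims.topk(min(k, len(reports))).indices
            return [reports[i] for i in topk]

        d, n = 512, 1000
        query = torch.randn(d)                    # stand-in for CLIP(image)
        db = torch.randn(n, d)                    # stand-in for CLIP(report)
        texts = [f"report {i}" for i in range(n)]
        print(retrieve_reports(query, db, texts, k=3))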

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The topic of chest x-ray report generation is of interest to the MICCAI community.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Experiments lack critical metrics. The authors claim that their framework improves disease reporting by design. Yet only NLG metrics, which measure language quality, are reported. Clinical efficacy (CE) metrics, such as precision and recall, are missing.
    • The training procedure is vague. It is difficult, if not impossible, to link the described training steps and losses to the proposed framework and modules. As presented, I do not understand how the model is trained.
    • The entire methodology is conceptually similar to [1]. Retrieval-based report generation is not new, either. The original contributions of this paper are unclear.
    • Please see additional comments for minor points.

    [1] Jin, H., Che, H., Lin, Y., et al.: PromptMRG: Diagnosis-driven prompts for medical report generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2607–2615 (2024)

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • Eqn. (8): How is c_t used?
    • Please consider examining \gamma_t to validate your design.
    • What is the time cost associated with the iterative teacher-student distillation?
    • Implementation: "10-5" and "10-4" are not proper scientific notation (presumably $10^{-5}$ and $10^{-4}$).
    • Fig. 2 and Fig. 3: texts are too small and thin to discern.
    • Fig. 3: What is the baseline method here?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • Lack of critical metrics.
    • Unclear methodology presentation.
    • Unclear contribution.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes ITAdaptor, a novel framework for automated radiology report generation that dynamically balances visual and diagnostic information using an Image-Tag Adapter (ITA) and a Cross-Modal Knowledge Enhancement (CMKE) module. It uses a three-stage training strategy combining reinforcement learning and knowledge distillation to optimize the model’s diagnostic accuracy and report quality.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    It is a good insight that the proposed CMKE uses pretrained CLIP to retrieve similar reports from a radiology report database to enhance image features and support disease classification. This adds a sort of RAG feature in the clinical report generation.

    The design of ITAdaptor is well-motivated, as it dynamically adjusts the weighting between visual features and diagnostic (disease-tag) features during report generation through an adaptive fusion gate. This mirrors real clinical reasoning, where abnormal findings rely more on diagnosis tags, while location and normal descriptions depend more on visual input.

    The three-stage training strategy is well designed, involving a pipeline of cross-entropy pretraining, reinforcement learning with BLEU-based rewards, and knowledge distillation from a self-trained teacher model.
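
    To make the three stages concrete, the sketch below shows plausible per-stage losses; the self-critical baseline in stage 2 and the distillation temperature in stage 3 are assumptions rather than the paper's exact objectives:

        import torch
        import torch.nn.functional as F

        def stage1_ce_loss(logits, targets):
            # Stage 1: cross-entropy pretraining for text generation.
            return F.cross_entropy(logits.flatten(0, 1), targets.flatten())

        def stage2_rl_loss(log_probs, sampled_bleu, greedy_bleu):
            # Stage 2: policy gradient with a BLEU reward; the greedy decode
            # serves as the baseline (self-critical style, assumed here).
            advantage = sampled_bleu - greedy_bleu              # (B,)
            return -(advantage.unsqueeze(1) * log_probs).mean()

        def stage3_kd_loss(student_logits, teacher_logits, tau=2.0):
            # Stage 3: distill the RL-tuned teacher into the pretrained
            # student via KL divergence over softened distributions.
            p_t = F.softmax(teacher_logits / tau, dim=-1)
            log_p_s = F.log_softmax(student_logits / tau, dim=-1)
            return F.kl_div(log_p_s, p_t, reduction="batchmean") * tau * tau

        B, T, V = 2, 8, 100
        logits, targets = torch.randn(B, T, V), torch.randint(0, V, (B, T))
        print(stage1_ce_loss(logits, targets).item(),
              stage2_rl_loss(-torch.rand(B, T), torch.rand(B), torch.rand(B)).item(),
              stage3_kd_loss(logits, torch.randn(B, T, V)).item())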

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The paper does not compare against recent multimodal foundation models like GLoRIA, or MedCLIP, which offer strong vision-language alignment and have been evaluated for radiology tasks.

    While CMKE uses CLIP to retrieve similar reports, there is no qualitative analysis of whether the retrieved reports are actually helpful or clinically relevant. How much retrieved text contributes to error correction or report accuracy?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The adaptive Image-Tag Adapter (ITA) and Cross-Modal Knowledge Enhancement (CMKE) components are practically motivated and demonstrate good performance across two standard benchmarks. The training strategy combining cross-entropy loss, reinforcement learning, and knowledge distillation is well-designed and contributes to overall improvements.

    However, several limitations still need to be addressed.

    1. Considering the existing popular pre-trained multimodal models such as GLoRIA and MedCLIP, including benchmarks against these methods would enhance the empirical rigor of the paper and better position its contribution within the rapidly evolving landscape of medical vision-language models.
    2. The paper does not provide an in-depth exploration of the usefulness of retrieved reports, nor does it analyze common failure cases or types of generation errors. Such insights would have strengthened the claims regarding improved diagnostic relevance and model reliability.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed the main concern in their rebuttal.



Review #3

  • Please describe the contribution of the paper

    The paper proposes ITAdaptor, a report generation framework that dynamically fuses visual features and diagnostic tags using an adaptive gating mechanism. It introduces two main components: (1) CMKE, which retrieves similar reports via CLIP to enhance disease classification, and (2) ITA, which balances visual and diagnostic cues during decoding. For better feature alignment, a three-stage training strategy (cross-entropy, reinforcement learning, and iteratively optimized knowledge distillation) further refines the model. Results show state-of-the-art performance on IU-Xray and MIMIC-CXR.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed adaptive fusion mechanism (ITA) for visual and diagnostic features is well motivated and directly addresses a standard limitation of existing methods, where all modalities are treated equally throughout generation. The Cross-Modal Knowledge Enhancement (CMKE) component effectively integrates semantically relevant prior reports, enriching the diagnostic feature space and improving disease-tag predictions.

    The architecture is modular and thoughtfully designed. The combination of retrieval-based knowledge injection, tag-guided generation, and reinforcement learning shows a strong understanding of the domain challenges.

    The model achieves state-of-the-art results on both the IU-Xray and MIMIC-CXR datasets. Improvements are particularly evident in BLEU-2/3/4 and ROUGE-L. Ablation studies are comprehensive: each major component (CMKE, ITA, training strategy) is evaluated independently and in combination on MIMIC-CXR, showing measurable and interpretable improvements at each step.

    The visual attention maps provide qualitative validation of the model’s interpretability and localization ability, confirming that the network focuses on semantically relevant regions of the image while generating text. Finally, the method is inspired by clinical workflows and models radiologists’ decision-making behavior, lending it intuitive plausibility and potential for clinical integration.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    No expert evaluation is conducted to assess the clinical correctness or usefulness of the generated reports. Radiology report generation is a high-stakes application where BLEU or ROUGE alone cannot ensure safety or utility.

    The pipeline’s overall computational complexity is not addressed. Given that the model requires a retrieval phase and involves multiple optimization steps (including reinforcement learning and knowledge distillation), deploying or training it in resource-limited environments may be challenging. A discussion of training and inference costs would improve the paper’s applicability.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents a solid and well-motivated method for medical report generation, combining CLIP-based retrieval and adaptive fusion in a clinically intuitive way. It achieves state-of-the-art results and is supported by thorough analysis. While some components are based on prior ideas, their combination and application to this problem are novel and compelling. The work is methodologically sound and relevant to the MICCAI community.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I maintain my original recommendation to accept the paper. The work offers a solid contribution, with a well-motivated methodology and convincing results. The authors have adequately addressed the raised concerns.




Author Feedback

We sincerely appreciate the reviewers’ valuable comments. We will make our code publicly available upon the paper’s acceptance.

Response to R1
Q1: Other pretrained multimodal models were not considered.
A1: We agree that it is meaningful to include comparisons with other pretrained multimodal models. However, our experiments indicate that the model’s performance is not sensitive to the choice of pretrained model. This is because radiology reports typically follow specific writing patterns, so similar samples are theoretically very frequent.
Q2: The role of the retrieved reports was not analyzed in depth.
A2: We agree that this is necessary for the interpretability of the module. In the revised version, we will supplement a comparison before and after introducing the retrieved reports, including: 1) the alignment score (the portion of radiograph-report pairs whose feature cosine similarity exceeds 0.5 after min-max normalization); and 2) disease classification accuracy.

Response to R2
Q1: The clinical efficacy (CE) metrics are missing.
A1: We will incorporate CE metrics in the revised version to make our approach more compelling.
Q2: Discussion of training and inference time.
A2: 1) We retrieve text data for each query image once before training rather than performing retrieval in every epoch; to balance performance and efficiency, we set the retrieval count to 12. 2) Compared to single-stage training, the increase in training time does not exceed 30%, because our approach does not restart training from scratch like traditional knowledge distillation but instead continues training on top of pretrained weights. 3) RL and KD are used only during training, so inference time remains identical to that of the original network.

Response to R3
Q1: The CE metrics are missing.
A1: We will incorporate CE metrics in the revised version to make our approach more compelling.
Q2: The training procedure is unclear.
A2: We apologize for any confusion and will provide more details in the final version. The training strategy follows a progressive three-stage pipeline: 1) Stage 1: the model is first trained with cross-entropy loss to learn basic text generation and multi-label classification capabilities. 2) Stage 2: building on the pretrained model, reinforcement learning is employed to further improve the quality of the generated sequences. 3) Stage 3: unlike conventional KD (which retrains from scratch), we use the pretrained model as the student and the RL-tuned model as the teacher, preserving knowledge while avoiding costly retraining. Additionally, we introduce a dynamic teacher-student update mechanism.
Q3: The contributions are unclear.
A3: While both our method and [1] improve disease classification accuracy by incorporating retrieved knowledge, the key contribution of our proposed ITAdaptor lies in dynamically adjusting the weight balance between visual features and diagnostic features (top-5 disease tags) in the decoder according to the type of word being generated. We argue that abnormal findings rely more on diagnostic tags, while location descriptions and normal observations rely more on visual inputs. In contrast, [1] simply concatenates the visual features with 14+4 prompt tokens and feeds them into the decoder without such dynamic weighting.

Minor points: 1) c_t is finally projected onto the vocabulary distribution through a fully connected layer and a softmax function. 2) The additional time cost of the iterative teacher-student distillation is less than 30% of the original training cost. 3) We will correct the improper notation and revise the figures in the revised version to ensure clarity. 4) The baseline method shown in Fig. 3 is described in the Implementation Details section.

[1] Jin, H., Che, H., Lin, Y., et al.: PromptMRG: Diagnosis-driven prompts for medical report generation. In: Proceedings of the AAAI Conference on Artificial Intelligence, 38(3), 2607–2615 (2024)
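
For reference, the alignment score proposed in the rebuttal (the portion of radiograph-report pairs whose feature cosine similarity exceeds 0.5 after min-max normalization) could be computed as in this minimal sketch; feature extraction is assumed to happen elsewhere:

    import torch
    import torch.nn.functional as F

    def alignment_score(img_feats, txt_feats, thresh=0.5):
        # img_feats, txt_feats: (N, d) paired radiograph/report features
        sims = F.cosine_similarity(img_feats, txt_feats, dim=-1)       # (N,)
        sims = (sims - sims.min()) / (sims.max() - sims.min() + 1e-8)  # min-max
        return (sims > thresh).float().mean().item()

    print(alignment_score(torch.randn(100, 512), torch.randn(100, 512)))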




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


