Abstract

Medical report generation has made notable progress, but most studies focus on chest X-rays, leaving CT report generation largely underexplored. This task poses unique challenges, including sparse diseased regions due to high-dimensional volumes, imbalanced distributions of normal and abnormal samples leading to biased predictions, and excessive template sentences that may obscure critical findings. Recently, large language models (LLMs) have demonstrated strong instruction-following capabilities, producing reliable outputs when guided by well-designed prompts, which provides a promising approach to address these issues. To this end, we propose Dia-LLaMA, a framework adapted from LLaMA2-7B for CT report generation with diagnostic guidance prompts. To enhance the focus on diseased areas, we introduce a disease-aware attention module to capture disease-specific information. Furthermore, we propose a disease prototype memory bank to capture common disease patterns, providing a reliable reference during diagnosis. Experiments on a large-scale chest CT report dataset demonstrated that our method outperforms previous approaches, achieving state-of-the-art results in both clinical efficacy and natural language generation metrics. The code is available at https://github.com/zhi-xuan-chen/Dia-LLaMA.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3319_paper.pdf

SharedIt Link: https://rdcu.be/eHwWf

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-04981-0_14

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/zhi-xuan-chen/Dia-LLaMA

Link to the Dataset(s)

CTRG Chest dataset: https://github.com/tangyuhao2016/CTRG

BibTex

@InProceedings{CheZhi_DiaLLaMA_MICCAI2025,
        author = { Chen, Zhixuan AND Luo, Luyang AND Bie, Yequan AND Chen, Hao},
        title = { { Dia-LLaMA: Towards Large Language Model-driven CT Report Generation } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {141--151}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper introduces Dia-LLaMA, a novel framework that adapts LLaMA2-7B for CT report generation by incorporating diagnostic information as guidance prompts. The key contributions include:

- A disease-aware attention (DAA) module that enhances the perception of local diseased regions in CT volumes by effectively capturing disease-level features.
- A disease prototype memory bank (DPM) that captures common representations of normal and abnormal samples across different diseases, providing reliable reference for disease diagnosis.
- A diagnostic text prompt (DTP) approach that emphasizes critical disease information by embedding it into prompts for the LLM.
- A comprehensive solution to address three key challenges in CT report generation:
  - Difficulty in capturing sparse diseased areas in high-dimensional CT volumes.
  - Data imbalance between normal and abnormal cases resulting in biased diagnoses.
  - Template-dominated reporting that may overwhelm critical abnormality information.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- Novel application area: The paper addresses CT report generation, which is significantly underexplored compared to chest X-ray report generation. This is a valuable contribution as CT volumes present unique challenges due to their high-dimensional nature and sparse abnormalities.
- Innovative technical approach: The disease-aware attention module provides a more targeted approach to identify abnormalities in CT volumes by capturing disease-specific information, improving upon previous methods that used pooled patch features for disease diagnosis.
- Addressing data imbalance: The disease prototype memory bank effectively mitigates the challenge of data imbalance by providing common disease representations and using contrastive learning to ensure distinctiveness between normal and abnormal samples. The paper demonstrates improved F1 scores, particularly for diseases with fewer abnormal samples.
- Strong empirical results: The method achieves state-of-the-art performance in both clinical efficacy metrics (precision improved by 4.5%, recall by 7.2%, and F1 score by 7.8%) and most natural language generation metrics (BLEU-1 by 7.2%, BLEU-4 by 20%, and METEOR by 4.3%) compared to existing approaches.
- Comprehensive ablation studies: The paper includes thorough experiments to demonstrate the contribution of each component (DAA, DPM, DTP) to the overall framework. The studies validate that each module addresses specific challenges in CT report generation.
- Clinical significance: Automated CT report generation has high practical value, as it can significantly reduce the workload of radiologists who must examine numerous CT slices and provide comprehensive summaries. The method effectively captures critical abnormality information, which is essential for clinical decision-making.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

- Limited dataset evaluation: The evaluation is performed on only one dataset (CTRG-Chest-548K with 1,804 CT-report pairs), which might limit the generalizability of the results. Validation on multiple diverse datasets would strengthen the claims about the framework’s effectiveness.
- Restriction to chest CT: The paper focuses only on chest CT reports, while acknowledging this limitation in the conclusion. CT imaging is used across many body regions with varying characteristics, and the generalizability of the approach to other types of CT scans (e.g., brain, abdomen) remains unaddressed.
- Comparison with recent multimodal medical LLMs: While the paper compares with various methods, comparison with recent specialized multimodal LLMs for medical imaging like MAIRA-1 (Hyland et al., 2023) or Med-PaLM would further strengthen the evaluation. Current research in this rapidly evolving field has produced models specifically designed for radiology report generation.
- Computational requirements: The model was trained using two RTX 3090 GPUs for about 16 hours, which indicates relatively high computational requirements. The paper doesn’t address potential optimization for resource-constrained environments, which might limit clinical adoption.
- Limited discussion on ethical and regulatory aspects: The paper doesn’t extensively discuss the ethical implications, privacy concerns, or regulatory requirements related to automated medical report generation. These are critical considerations for real-world deployment in healthcare settings.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While the paper has limitations, such as evaluation on a single dataset and focus on chest CT only, these do not significantly diminish the value of the contribution. The authors acknowledge some of these limitations in their conclusion and indicate plans for future work to extend the framework to other radiology modalities.

In summary, this paper represents a significant advancement in the field of medical report generation, introducing novel components that effectively address key challenges in CT report generation. The substantial performance improvements over existing methods and the practical significance of the work make it a valuable contribution to the research community.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #2

Please describe the contribution of the paper

This paper introduces Dia-LLaMA, a novel framework designed to improve automated CT report generation by addressing two core challenges: disease sparsity and class imbalance. The method integrates a disease-aware attention mechanism and a prototype memory bank to enhance disease-specific representation learning. It also pioneers the integration of LLaMA2-7B into this domain, exploring its underutilized potential in medical report generation. Empirical results on the CTRG-Chest-548K dataset demonstrate that Dia-LLaMA consistently outperforms existing approaches in both clinical relevance and natural language quality.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

- This work is the first to apply an LLM (LLaMA2-7B) in the context of volumetric chest CT data, which is inherently more complex than 2D medical images.
- The disease prototype bank serves as an effective mechanism to distinguish normal from abnormal findings, helping the model generalize better to rare conditions while offering a degree of interpretability.
- Demonstrates SOTA performance on both NLG and CE metrics.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

- The model uses a fixed set of 8 diseases, selected due to data frequency constraints. This restricts the model’s adaptability to unseen diseases and different datasets.
- Diagnostic prompts are rigid (“The {disease} is [present/absent]”), limiting the flexibility and richness of LLM-driven language generation, especially in ambiguous cases.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Though the idea is quite simple and straightforward, it is really well-motivated and effectively tailored for the unique challenges of CT report generation. The paper demonstrates a strong understanding of the domain-specific constraints in chest CTs, such as sparse abnormalities, data imbalance, and redundant template language in reports—issues that are often overlooked in chest X-ray-focused literature.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper
This paper introduces Dia-LLaMA, a framework leveraging the LLaMA2-7B large language model (LLM) for automated CT report generation. Addressing three key challenges in CT report generation—sparse diseased regions in high-dimensional volumes, data imbalance between normal/abnormal cases, and template-driven sentences overshadowing critical abnormalities—the authors propose:
- A disease-aware attention (DAA) module to extract disease-specific features from CT volumes, enhancing focus on sparse pathological regions.
- A disease prototype memory bank (DPM) to store common representations of normal/abnormal diseases, updated via contrastive loss to mitigate data imbalance.
- Diagnostic text prompts (DTP) to convert diagnostic results into structured instructions for the LLM, ensuring coherent reports with emphasized abnormalities.
  Experiments on a large chest CT dataset demonstrate state-of-the-art performance in clinical efficacy (CE) and natural language generation (NLG) metrics.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The work is among the first to tailor LLMs for CT report generation, introducing domain-specific modules (DAA, DPM) to address CT’s unique challenges (sparse lesions, 3D complexity), bridging the gap between general LLMs and clinical radiology needs.
2. The DPM leverages contrastive learning to distinguish rare abnormal cases from dominant normal samples, while DAA enhances fine-grained feature extraction for sparse lesions in 3D volumes, significantly improving diagnostic accuracy.
3. Thorough ablation studies (Table 2), prompt-type comparisons (Table 3), and qualitative analysis (Figure 3) validate the necessity of each component. The use of both CE (diagnostic accuracy) and NLG (linguistic quality) metrics provides a holistic evaluation.
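
For readers unfamiliar with the CE metrics mentioned above, they are usually obtained by comparing disease labels extracted from the generated and reference reports. The following is only a minimal sketch, assuming binary per-disease labels and micro-averaging; the averaging scheme actually used in the paper is not stated in this review, and the function name and toy labels are hypothetical.

```python
from typing import Sequence, Tuple


def micro_prf(
    y_true: Sequence[Sequence[int]],
    y_pred: Sequence[Sequence[int]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over all (report, disease) pairs."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if p == 1 and t == 1:
                tp += 1
            elif p == 1 and t == 0:
                fp += 1
            elif p == 0 and t == 1:
                fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (assumption): 3 reports, 4 hypothetical disease labels each.
reference = [[1, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
generated = [[1, 0, 1, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
precision, recall, f1 = micro_prf(reference, generated)
```

Micro-averaging pools all disease decisions before computing the ratios, so frequent diseases dominate the score; a macro average over diseases would weight rare diseases more heavily.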
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Major Weaknesses
1. The study uses CheXbert, pretrained on chest X-rays (CXR), to extract disease labels for CT reports. While the authors note “similarity in content between chest CT and CXR reports,” they do not validate whether CheXbert’s CXR-trained labels are reliable for CT-specific pathologies (e.g., lung nodules with unique CT features). The rationale for selecting 8 diseases from CheXbert’s original 14 (excluding “too rare” ones) is unclear, risking label bias for CT-specific abnormalities.
2. The method does not specify how normal/abnormal prototypes (P_0, P_1) are initialized (e.g., random, data-driven, or medical prior-based), which is critical for their effectiveness in capturing disease representations.
3. The ViT3D encoder processes CT volumes resized to 256×256×64, but the method does not clarify how 3D volumes are sliced into patch features (e.g., number of slices per volume, slice spacing, or 3D patch extraction strategy).
4. Details of LoRA fine-tuning for LLaMA2-7B (e.g., rank, dropout rate) and ZeRO stage 2 configuration are missing, hindering reproducibility.
Minor Weaknesses
1. In Table 3, the definitions of “Token prompt” (special tokens like <POS-l>) and “Feature prompt” (direct prototype input) are not provided in the method section, weakening interpretability of results.
2. Figure 3 provides only one qualitative example of report generation. Additional cases (e.g., rare diseases, complex lesions) with expert-annotated comparisons would better demonstrate clinical utility. The color-coding (green/red) for correctness lacks a defined criterion (e.g., alignment with radiologist reports).
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

See the weaknesses above.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We sincerely thank all reviewers for their time and thoughtful feedback. We greatly appreciate the constructive comments and insightful suggestions, which have helped us better reflect on our work and its broader implications. Below, we provide brief clarifications and remarks in response to the main concerns raised.

Evaluation on chest CT dataset: We acknowledge this limitation, as also stated in our conclusion. We appreciate the suggestion and will continue to explore the generalizability of our framework on diverse medical datasets in future work.

Comparison with multimodal medical LLMs: Thank you for the suggestion. We agree that including recent multimodal medical LLMs like MAIRA-1 and Med-PaLM would enhance the evaluation. We plan to incorporate a more comprehensive comparison with such models in future work.

Fixed diagnostic prompt template: We used a fixed prompt template primarily to validate the feasibility and effectiveness of incorporating diagnostic prompts into report generation. We agree that a more flexible, user-friendly prompt paradigm could enhance interaction and application value, and we aim to explore this direction in follow-up work.

Use of CheXbert-trained labels for CT data: We recognize the concern regarding label quality. As noted, we used CheXbert for its strong performance and widespread use in chest X-ray labeling. Given the semantic overlap in thoracic disease descriptions across chest X-rays and CT reports, we found it to perform reasonably well for our task. Nevertheless, we agree that CT-specific labeling tools would be ideal and plan to incorporate such models in future studies.

Initialization of disease prototypes (P₀, P₁): Thank you for this insightful comment. In our current implementation, the prototypes are initialized with a standard normal distribution N(0,1) and refined through contrastive learning guided by ground-truth diagnostic labels. In the future, we plan to incorporate disease-relevant priors, such as initializing prototypes using the mean features of labeled normal and abnormal samples, to further enhance performance.
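
The two initialization strategies mentioned in this response can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: it assumes one feature vector per sample and disease, uses plain Python lists instead of tensors, and all function names and the feature dimension are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Vector = List[float]
# Key: (disease index, state), where state 0 = normal and 1 = abnormal.
PrototypeBank = Dict[Tuple[int, int], Vector]


def init_random(num_diseases: int, dim: int, seed: int = 0) -> PrototypeBank:
    """Draw every prototype entry from a standard normal distribution N(0, 1)."""
    rng = random.Random(seed)
    return {
        (d, s): [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for d in range(num_diseases)
        for s in (0, 1)
    }


def init_from_means(
    features: List[List[Vector]],  # features[i][d]: sample i, disease d (assumed layout)
    labels: List[List[int]],       # labels[i][d] in {0, 1}
    fallback: PrototypeBank,
) -> PrototypeBank:
    """Set each prototype to the mean feature of the samples carrying that label."""
    bank = dict(fallback)
    num_diseases = len(labels[0])
    for d in range(num_diseases):
        for s in (0, 1):
            group = [features[i][d] for i in range(len(labels)) if labels[i][d] == s]
            if group:  # keep the random fallback when a class has no samples
                dim = len(group[0])
                bank[(d, s)] = [sum(v[k] for v in group) / len(group) for k in range(dim)]
    return bank


# 8 diseases as in the paper; feature dimension 4 is a placeholder for brevity.
bank = init_random(num_diseases=8, dim=4)
```

The mean-based variant falls back to the random prototype whenever a disease has no samples of a given state, which matters for the rare abnormal classes the memory bank is meant to help with.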

3D patch extraction strategy: We utilize a 3D Vision Transformer (ViT) encoder to extract patch features from the resized CT volumes (256×256×64). This encoder applies a 3D patch extraction strategy, dividing the input volume into multiple 3D patches, each of which is embedded into a patch feature.

LoRA and ZeRO configuration details: We appreciate the request for implementation specifics. In our setup, the LoRA configuration uses a rank of 8, a LoRA alpha of 32, and a dropout rate of 0.1. For the ZeRO Stage-2 optimization setup, due to space limitations, we will include the full configuration details in our official code repository to support reproducibility.
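
As a rough guide to what these LoRA values imply, the sketch below records the three stated hyperparameters and derives the standard LoRA scaling factor (alpha / rank) and the number of trainable parameters added per adapted linear layer. The class and method names are hypothetical, and which LLaMA2-7B layers are adapted is not stated in the response, so the per-projection figure is only an example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSettings:
    rank: int = 8         # from the rebuttal
    alpha: int = 32       # from the rebuttal
    dropout: float = 0.1  # from the rebuttal

    @property
    def scaling(self) -> float:
        """Factor applied to the low-rank update: alpha / rank."""
        return self.alpha / self.rank

    def added_params(self, d_in: int, d_out: int) -> int:
        """Trainable parameters added to one (d_out x d_in) linear layer:
        matrix A is (rank x d_in) and matrix B is (d_out x rank)."""
        return self.rank * (d_in + d_out)


settings = LoraSettings()
hidden = 4096  # LLaMA2-7B hidden size
per_projection = settings.added_params(hidden, hidden)
```

With these settings the low-rank update is scaled by 4.0, and each adapted 4096×4096 projection gains 65,536 trainable parameters, compared with roughly 16.8 million in the frozen weight itself.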

Definitions of different prompt types: Thank you for pointing this out. In our setting, the “text prompt” refers to the default form, where diagnostic results are expressed as textual phrases. The “token prompt” introduces additional special tokens for the LLM to represent diagnostic outcomes. The “feature prompt” incorporates diagnostic information by selecting the corresponding disease prototypes based on the predicted diagnostic results, which are then fed into the LLM alongside the visual features. The key difference among them lies in how diagnostic information is conveyed to the LLM.
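
The difference between the three prompt types can be illustrated with a small sketch. The sentence template follows the form quoted by Review #2 (“The {disease} is [present/absent]”); the special-token naming, the prototype-bank layout, the disease names, and all function names are assumptions made for illustration and may differ from the released code.

```python
from typing import Dict, List, Tuple


def text_prompt(diagnosis: Dict[str, bool]) -> str:
    """Express each diagnostic result as a short sentence."""
    return " ".join(
        f"The {disease} is {'present' if positive else 'absent'}."
        for disease, positive in diagnosis.items()
    )


def token_prompt(diagnosis: Dict[str, bool]) -> List[str]:
    """Express each result as one special token (hypothetical naming scheme)."""
    return [
        f"<{'POS' if positive else 'NEG'}-{index}>"
        for index, positive in enumerate(diagnosis.values())
    ]


def feature_prompt(
    bank: Dict[Tuple[int, int], List[float]],
    diagnosis: Dict[str, bool],
) -> List[List[float]]:
    """Select the prototype matching each predicted state (0 = normal, 1 = abnormal)."""
    return [
        bank[(index, int(positive))]
        for index, positive in enumerate(diagnosis.values())
    ]


# Hypothetical predictions for two example diseases.
predicted = {"pleural effusion": True, "cardiomegaly": False}
print(text_prompt(predicted))
print(token_prompt(predicted))
```

The text variant stays in the LLM's existing vocabulary, while the token and feature variants require the model to learn the meaning of new tokens or embeddings during fine-tuning.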

Qualitative examples: Beyond the example provided in the main paper, two additional qualitative examples are included in our supplementary material. All examples are drawn from the test set of the dataset. The green/red color annotations reflect alignment (or lack thereof) with ground-truth radiologist reports, which we use as a reference standard.

Once again, we sincerely thank the reviewers for their thoughtful comments, constructive suggestions, and the positive recognition of our work. We deeply appreciate your support and look forward to incorporating these insights into our future research.

Meta-Review

Meta-review #1

Your recommendation

Provisional Accept
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A

Dia-LLaMA: Towards Large Language Model-driven CT Report Generation