Abstract
Multimodal Large Language Models (MLLMs) have emerged as a promising way to automate Radiology Report Generation (RRG). In this work, we systematically investigate the design space of 3D MLLMs, including visual input representation, projectors, Large Language Models (LLMs), and fine-tuning techniques for 3D CT report generation. We also introduce two knowledge-based report augmentation methods that improve performance on the GREEN score by up to 10%, achieving 2nd place in the MICCAI 2024 AMOS-MM challenge. Our results on the 1,687 cases from the AMOS-MM dataset show that RRG is largely independent of the size of the LLM under the same training protocol. We also show that a larger volume size does not always improve performance if the original ViT was pre-trained on a smaller volume size. Lastly, we show that using a segmentation mask along with the CT volume improves performance. The code is publicly available at https://github.com/bowang-lab/AMOS-MM-Solution.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2261_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/bowang-lab/AMOS-MM-Solution
Link to the Dataset(s)
https://era-ai-biomed.github.io/amos/dataset.html#download
BibTex
@InProceedings{BahMoh_Exploring_MICCAI2025,
author = { Baharoon, Mohammed and Ma, Jun and Fang, Congyu and Toma, Augustin and Wang, Bo},
title = { { Exploring the Design Space of 3D MLLMs for CT Report Generation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15965},
month = {September},
pages = {240--250}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper investigates the task of CT report generation using 3D multimodal large language models (MLLMs), with a focus on systematic exploration of architectural designs and training strategies. The authors validate their proposed framework on the AMOS-MM dataset from the MICCAI 2024 challenge. Additionally, two knowledge-based report augmentation techniques are proposed to enhance generation performance.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper presents a relatively systematic study of architectural combinations and training strategies for 3D MLLMs in the context of CT report generation. This provides valuable engineering insights and practical guidance for the community.
- Two innovative knowledge-based report augmentation techniques are introduced, which contribute to improved automatic evaluation metrics in the CT report generation task.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The proposed knowledge-based augmentation methods appear to concatenate additional content to the model-generated reports. While this may improve automatic evaluation metrics, the resulting reports may deviate from real-world clinical report formats. If human evaluation were introduced, it remains uncertain whether these methods would lead to improved or degraded practical report quality.
- The conclusion that “the task is LLM-independent” (as suggested in Table 1) is not sufficiently supported. The study compares different LLMs without controlling for model series or scale, which weakens the claim’s rigor. More controlled experiments (e.g., varying model sizes within the same family) would be necessary to draw robust conclusions.
- The study is conducted on a single dataset (AMOS-MM), which has a limited number of training samples. This limitation reduces the generalizability of the experimental conclusions. Notably, the paper states in the abstract and conclusion: “Our results show that RRG is largely independent of the size of LLMs and freezing the LLMs performs better than fine-tuning or using parameter efficient techniques.” Such strong claims, based on limited evidence, may mislead the community and should be presented with greater caution.
- Several aspects of the writing and experimental reporting lack clarity. For example, the use of segmentation masks in the Additional Experiments section is not well explained. Table 4 provides results for individual knowledge-based augmentation methods, but does not show their combined performance, even though Figure 2 appears to illustrate a combined method (BQ + Naive Normality).
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper explores an important problem and presents insightful engineering practices for constructing 3D MLLMs for CT report generation. However, the work suffers from several key limitations: the augmentation strategy may not align with clinical reporting norms; some empirical claims—particularly regarding LLM independence and parameter tuning—are too strong given the limited experimental scope; and the study lacks breadth in dataset validation. While the paper offers some useful observations, the current version risks overgeneralizing from insufficient evidence, which could mislead readers. I therefore lean toward rejection.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Reject
- [Post rebuttal] Please justify your final decision from above.
The authors’ rebuttal addressed some of my concerns, and I acknowledge their efforts in clarifying certain aspects of the work. However, I still believe the paper falls short in terms of novelty and practical value. The study investigates the design space of 3D MLLMs on a relatively small-scale dataset, primarily exploring combinations of existing components without introducing fundamentally new methodologies. The two proposed knowledge-based report augmentation methods also raise concerns. Based on the illustrations in the paper, the augmented reports appear to be formed by directly appending newly generated content to the original model outputs. This heuristic approach seems primarily aimed at improving the GREEN metric rather than generating clinically coherent reports. In practice, such concatenated reports may be misleading or even harmful—particularly if the BQ module produces content that contradicts the original model output, potentially resulting in internally inconsistent final reports. Given these concerns, I do not find the work suitable for publication in its current form.
Review #2
- Please describe the contribution of the paper
This paper explores the design space of 3D Multimodal Large Language Models (MLLMs) for CT radiology report generation, analyzing key components such as visual input representations, projectors, language models, and fine-tuning strategies. A key finding is that freezing the LLM yields better performance than parameter-efficient or full fine-tuning. The authors also introduce two knowledge-based report augmentation methods that significantly boost the GREEN score by up to 10%.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper systematically investigates key architectural components of 3D MLLMs for CT report generation, including visual input representations, projectors, LLMs, and fine-tuning methods. This modular approach provides clear insights into what matters most for model performance.
- It reveals that freezing the LLM leads to better performance than both parameter-efficient and full fine-tuning (see the sketch after this list). This finding has significant implications for efficiency and robustness in medical AI applications.
- The introduction of two knowledge-based augmentation methods improves report completeness and boosts the GREEN score by up to 10%, demonstrating strong practical value.
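As a concrete illustration of the frozen-LLM training setup this strength refers to, here is a minimal PyTorch sketch. The module names (vision_encoder, projector, llm) and the optimizer settings are hypothetical placeholders, not the authors' actual code.

```python
import torch
import torch.nn as nn

# Hypothetical MLLM wrapper: a vision encoder and projector feeding an LLM.
# Module names are illustrative; the paper's actual architecture may differ.
class MLLM(nn.Module):
    def __init__(self, vision_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.projector = projector
        self.llm = llm

def freeze_llm(model: MLLM) -> list[nn.Parameter]:
    """Freeze the language model and return only the remaining trainable parameters."""
    for p in model.llm.parameters():
        p.requires_grad = False  # gradients flow through, but weights stay fixed
    return [p for p in model.parameters() if p.requires_grad]

# Usage (assuming `model` is an MLLM instance):
# optimizer = torch.optim.AdamW(freeze_llm(model), lr=1e-4)
```

Only the vision encoder and projector are then updated, which is what makes the frozen-LLM strategy attractive in compute-limited medical settings.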
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The paper does not include examples of generated reports. Qualitative outputs would help assess clinical relevance, fluency, and potential hallucinations.
- The reported performance improvements are not supported by statistical testing (e.g., confidence intervals or significance tests), which limits the ability to judge whether the differences are meaningful or due to variance.
- The Binary-based Questioning method depends on a triplet model, but the paper lacks details on its architecture, training procedure, and performance, which limits reproducibility and understanding of its contribution.
- The augmentation strategies, especially Naive Normality, significantly improve GREEN but reduce other metrics like BLEU and METEOR. This raises concerns that the improvement might be driven by metric-specific tricks rather than genuine clinical quality.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Please see the strengths and weaknesses.
- Reviewer confidence
Somewhat confident (2)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The main contribution of the paper titled “Exploring the Design Space of 3D MLLMs for CT Report Generation” is a comprehensive exploration and evaluation of architectural and training choices for 3D Multimodal Large Language Models (MLLMs) in the context of automated radiology report generation (RRG) from CT scans.
Here are the key contributions in detail:
- Modular Exploration of 3D MLLM Design Space: The authors systematically investigate the design space of MLLMs for 3D CT report generation, decomposing the architecture into four core components:
  - Visual input representation
  - Projection mechanisms
  - Large Language Models (LLMs)
  - Fine-tuning strategies
  This modular approach enables a better understanding of the influence of each component on performance.
- Introduction of Knowledge-Based Report Augmentation Techniques: They propose two post-processing strategies to improve report completeness. Binary-based Questioning (BQ) uses a triplet model to verify the presence or absence of common clinical findings; Naive Normality Augmentation automatically adds standard normal findings for organs not mentioned in the initial report. These methods significantly boost performance on the GREEN metric (up to +10%).
- Strong Empirical Results on a Benchmark Dataset: The approach achieved 2nd place in the MICCAI 2024 AMOS-MM challenge, demonstrating competitive performance across multiple evaluation metrics (GREEN, RaTEScore, BLEU, ROUGE, METEOR).
- Surprising Findings About Model Scaling and Training: Freezing the LLMs outperformed full fine-tuning and parameter-efficient methods (LoRA, DoRA). Increasing image resolution degraded performance if the ViT was pre-trained at a lower resolution. Model performance was largely independent of LLM size for this specific task.
The paper offers a practical, well-validated, and modular approach to optimizing 3D MLLMs for radiology report generation, while also providing insights that challenge some prevailing assumptions in model scaling and tuning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper's strengths lie in its comprehensive methodological investigation, clinically motivated and effective augmentations, robust empirical validation, and a general-purpose framework that opens the door to broader research and deployment in 3D medical imaging contexts. Here is a detailed list of the major strengths, along with justifications:
- Systematic and Modular Exploration of the 3D MLLM Design Space: The paper presents a comprehensive and modular investigation of the key components that constitute 3D Multimodal Large Language Models (MLLMs) for CT report generation: visual input representation, projectors, LLM backbones, and fine-tuning strategies. Most prior work focuses on end-to-end performance without deeply analyzing the effect of each design decision. This decoupled evaluation offers practical insights for researchers and developers aiming to build more efficient MLLMs.
- Introduction of Knowledge-Based Report Augmentation Methods: The authors propose two novel post-hoc augmentation methods to enhance report completeness (see the sketch after this list):
  - Binary-based Questioning (BQ) using a trained triplet model.
  - Naive Normality Augmentation for inserting common normal findings.
  - These methods are lightweight, model-agnostic, and can be plugged into any RRG pipeline.
  - They significantly improve the GREEN score (e.g., from 0.366 to 0.470), demonstrating their practical utility.
  - The augmentation is clinically motivated, aiming to reduce the omission of normal findings, a common issue in generated reports.
- Demonstration of Clinical Feasibility and Challenge Validation: The model achieved 2nd place in the MICCAI 2024 AMOS-MM challenge, evaluated on a hidden test set. This validates the approach on a real-world, standardized benchmark in a competitive setting. It also shows generalizability across different organs (chest, abdomen, pelvis) and across various metrics (GREEN, RaTEScore, BLEU, ROUGE, METEOR).
- Surprising and Insightful Empirical Findings: Freezing LLMs yields better performance than fine-tuning or using parameter-efficient techniques like LoRA or DoRA. Increasing input resolution can harm performance if misaligned with the ViT's pretraining resolution. Performance is mostly LLM-size agnostic (2B to 8B), contrary to common assumptions. These findings challenge conventional practices and can help optimize future MLLM systems with lower compute cost and greater interpretability.
- Use of 3D CT Data in MLLM Frameworks: The work extends prior 2D-based MLLM approaches (like LLaVA) to 3D CT volumes using adaptations such as AnyResolution and spatial pooling techniques. Handling volumetric medical data is computationally intensive and underexplored in the MLLM field. Their approach allows for effective integration of high-resolution 3D imaging into text generation workflows.
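To make the Naive Normality idea concrete, here is a minimal sketch of the augmentation as the review describes it: append a templated normal finding for every organ the generated report does not mention. The organ keywords and template sentences are illustrative assumptions, not the authors' actual templates.

```python
# Hedged sketch of Naive Normality Augmentation. The keyword matching and
# templates below are hypothetical; the paper's implementation may differ.
NORMAL_TEMPLATES = {
    "liver": "The liver is normal in size and shape.",
    "spleen": "The spleen is unremarkable.",
    "kidney": "Both kidneys are normal in size, shape, and position.",
}

def naive_normality_augment(report: str) -> str:
    """Append a normal-finding sentence for every organ absent from the report."""
    text = report.lower()
    additions = [s for organ, s in NORMAL_TEMPLATES.items() if organ not in text]
    return " ".join([report.rstrip()] + additions)

print(naive_normality_augment("The liver surface is smooth."))
# -> original sentence plus the spleen and kidney templates
```

This also makes the reviewers' concern tangible: the appended sentences are asserted by template rather than verified against the image, which is what BQ's additional model inferences are meant to address.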
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
The paper lacks architectural novelty, building mainly on existing MLLM frameworks like LLaVA and M3D. The proposed report augmentation methods rely on heuristics and external tools (e.g., GPT), which may limit generalizability. Clinical validation is absent, with no expert radiologist evaluation to assess real-world utility. Experiments are limited to a single dataset from a narrow geographic region, raising concerns about generalization. Finally, improvements on the GREEN score do not consistently translate to gains in other metrics or clinical relevance.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
This paper provides a valuable and systematic exploration of the design space for 3D MLLMs in CT report generation, supported by strong empirical results and a competitive placement in the MICCAI 2024 AMOS-MM challenge. Its modular framework and novel report augmentation techniques (Binary-based Questioning and Naive Normality) offer practical insights and measurable gains on the GREEN score. However, the work is incremental in terms of architectural novelty, relies on heuristics, and lacks clinical expert validation. Despite these limitations, the paper makes a solid technical contribution to a timely and underexplored area, justifying a weak accept for its empirical rigor and potential for impact.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers for their valuable insights and suggestions.

Q1: Concerns about the limited improvements on some metrics (BLEU/METEOR) and real-world clinical improvement (R1-3). BLEU, ROUGE, and METEOR are lexical metrics that measure surface-level similarity but do not account for clinical accuracy; we report them only to follow common practice. For example, the following two findings are clinically the same: Pred: "The liver surface is smooth, with coordinated size and proportion of each lobe." GT: "The liver is normal in size and shape, with coordinated proportion of liver lobes." Yet the lexical similarity scores are BLEU: 0.44, ROUGE: 0.59, METEOR: 0.55, whereas GREEN gives a score of 1, reflecting the same clinical meaning. Our methods improve all clinical metrics (GREEN/RaTEScore). We have made this clearer in the final version by adding a header in each table to specify metric types. Additionally, we respectfully argue that our BQ method corresponds to clinical improvements because it ensures that no important finding is missed by performing additional inferences for common findings, guaranteeing that the final report is complete. This is aligned with clinical formats, where templates are followed to report findings about each specific area. Naïve normality ensures that all normal findings are explicitly mentioned.

Q2: Experiments are limited to a single dataset (R1, R3). The employed AMOS-MM dataset is a well-recognized benchmark with a validation set of 400 cases from two different hospitals, which has enough statistical power to draw conclusions. To the best of our knowledge, the only 3D CT report generation dataset larger than AMOS-MM is CT-RATE, which has 25,692 chest CT scans, and we are computationally limited for that. We have made our code public, and it can be quickly adapted to other datasets. Moreover, it is important to highlight that AMOS-MM is more diverse than CT-RATE in terms of CT regions, since it includes reports for the pelvis, chest, and abdomen, while CT-RATE is mainly for the chest.

Q3: Comparing different LLMs without controlling for model series or scale weakens the claim of "LLM-independent" (R1). Our claim is that the task is LLM-independent in terms of series, and our results support that: LLMs from different series that perform differently on established NLP benchmarks all perform the same for report generation. Moreover, to more objectively measure the impact of scale, we ran experiments with larger Phi-3 versions (Small: 7B, Medium: 14B) and reached the same conclusion. This will be added to the final version.

Q4: Improvements are not supported by statistical testing (R2). The reported improvements are based on 3 different organs across 2 clinical metrics. We further ran a paired t-test on GREEN, which resulted in p < 0.01, confirming that our results are statistically significant. This is added to the final version.

Q5: Some parts lack clarity or detail (R1, R2). In the final version, we provide more details on how we used segmentation masks in the "Additional Experiments" section, and more details about the training pipeline for our triplet model. Regarding R1-W4's statement, the third row in Table 4 is the combined performance; we clarify this in the table caption in the final version.

Q6: The proposed report augmentations rely on heuristics and external tools like GPT (R3). Our heuristics for BQ (the questions we ask the triplet model) will all be public and are developed to be general and adaptable. We used external tools only in BQ, where GPT 3o-mini is used only to transform findings into triplets, and this task can be done by any open-source LLM. The actual triplet model is Phi-3-mini.

Q7: The paper does not include examples of generated reports (R2). We will add them to the final version. The qualitative results show that our report augmentation methods produce reports that are more complete and catch findings that would otherwise be missed.
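For context on Q1, the following sketch shows how such lexical metrics can be computed on the rebuttal's example pair using nltk and rouge-score. Exact numbers depend on tokenization and library versions, so this illustrates the kind of computation rather than the authors' evaluation code.

```python
import nltk  # one-time setup: nltk.download("wordnet")
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer  # pip install rouge-score

pred = "The liver surface is smooth, with coordinated size and proportion of each lobe."
gt = "The liver is normal in size and shape, with coordinated proportion of liver lobes."

# Naive whitespace tokenization; real pipelines typically strip punctuation too.
pred_tok, gt_tok = pred.lower().split(), gt.lower().split()
bleu = sentence_bleu([gt_tok], pred_tok, smoothing_function=SmoothingFunction().method1)
meteor = meteor_score([gt_tok], pred_tok)  # nltk >= 3.6 expects pre-tokenized input
rouge_l = rouge_scorer.RougeScorer(["rougeL"]).score(gt, pred)["rougeL"].fmeasure

# Two clinically equivalent sentences still fall well short of 1.0 on all three.
print(f"BLEU: {bleu:.2f}  ROUGE-L: {rouge_l:.2f}  METEOR: {meteor:.2f}")
```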
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Reject
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The reviewers’ recommendations are mixed, and I lean toward rejection. The knowledge augmentation is quite heuristic and can potentially introduce bias and even misinformation into the final reports. It would be better designed as a retrieval-augmented generation (RAG) approach.
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A