Abstract

Automated radiology report generation (RRG) aims to produce detailed textual reports from clinical imaging, such as computed tomography (CT) scans, to improve the accuracy and efficiency of diagnosis and provision of management advice. RRG is complicated by two key challenges: (1) inherent complexity in extracting relevant information from imaging data under resource constraints, and (2) difficulty in objectively evaluating discrepancies between model-generated and expert-written reports. To address these challenges, we propose $\mu^2$LLM, a $\underline{\textbf{mu}}$ltiscale $\underline{\textbf{mu}}$ltimodal large language model for RRG tasks. The novel ${\mu}^2$Tokenizer, as an intermediate layer, integrates multimodal features from the multiscale visual tokenizer and the text tokenizer; report generation quality is then enhanced through direct preference optimization (DPO), guided by GREEN-RedLlama. Experimental results on four large CT image-report medical datasets demonstrate that our method outperforms existing approaches, highlighting the potential of our fine-tuned $\mu^2$LLMs on limited data for RRG tasks. All code, data, and models will be publicly available in our official repository: \url{https://github.com/Siyou-Li/u2Tokenizer}.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0308_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Siyou-Li/u2Tokenizer

Link to the Dataset(s)

CT-RATE-Chinese dataset: https://huggingface.co/datasets/SiyouLi/CT-RATE-Chinese

CT-RATE-Mini dataset: https://huggingface.co/datasets/SiyouLi/CT-RATE-Mini

BibTex

@InProceedings{LiSiy_µ2_MICCAI2025,
        author = { Li, Siyou and Qin, Pengyao and Wu, Huanan and Nie, Dong and Thirunavukarasu, Arun J. and Yu, Juntao and Zhang, Le},
        title = { { µ2 Tokenizer: Differentiable Multi-Scale Multi-Modal Tokenizer for Radiology Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15964},
        month = {September},
        pages = {2--11}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    1. This paper proposes the ${\mu}^2$Tokenizer, which dynamically extracts visual features from 3D medical images and enables the construction of ${\mu}^2$LLM, a multimodal large language model tailored for CT report generation.
    2. It designs a DPO algorithm optimized with the GREEN metric, which further enhances the performance of the 3D MLLM on the CT report generation task.
  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper effectively leverages the similarity between video data and 3D medical data to adapt and improve LinVT for the CT report generation task.
    2. Extensive experiments are conducted on multiple datasets, demonstrating the reliability and robustness of the proposed method.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The motivation stated in the abstract—that it is difficult to objectively evaluate reports—seems loosely connected to the proposed methodology. The paper appears to assume that GREEN is already a suitable evaluation metric and therefore applies DPO optimization based on it, which may weaken the original motivation.
    2. The contribution in the introduction claims to achieve dynamic CT feature extraction via question prompting. However, this approach seems less meaningful in the context of report generation, where the diversity of questions is limited. It would be more appropriate to explore this method in medical VQA tasks, such as the VQA subset in CT-RATE.
    3. Although the paper emphasizes multi-scale processing, this is not reflected in the experiments. In Implementation Details, all inputs are resized and cropped to a fixed dimension of 8×32×256×256, which contradicts the motivation for multi-scale modeling.
    4. The paper still uses a relatively basic form of Relative Position Encoding (RPE), without comparison to more advanced techniques like RoPE, which have become standard in recent works.
    5. The descriptions of the DTS and DMTP modules are unclear. It would be beneficial to explicitly illustrate these components in the model architecture figure.
    6. In Experiment - Baselines and Evaluation Metrics, line 2, “LLMs” should be corrected to “MLLMs”.
    7. In Table 2, the ablation study lacks clarity. It is not specified whether the modules are added incrementally or in different combinations. The current set of comparisons is insufficient. It is recommended to include more combinations to better demonstrate the contribution of each component.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Consistent with the major weaknesses section.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    During the rebuttal process, the authors still failed to convincingly demonstrate their claimed “dynamic CT feature extraction via question prompting” and “multi-scale processing.” Moreover, their explanation regarding the incompatibility of RoPE with the model remains unconvincing.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a well-defined pipeline for the fusion of visual and language tokens. The paper leverages the existing LinVT method as a key component within the proposed fusion pipeline. Specifically, it adapts and integrates the LinVT architecture as the core fusion module to perform the cross-modal token interaction. The paper enhances the LinVT architecture through three key customizations: it replaces absolute positional embeddings with relative positional encodings to better capture spatial relationships; it introduces a differentiable soft weighted sum instead of hard top-k selection for smoother end-to-end training and more nuanced token importance; and it implements a dynamic pooling mechanism, moving away from a fixed size to allow for content-adaptive aggregation of selected tokens.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Building upon the efficient LinVT backbone, this paper introduces several well-motivated customizations, including relative positional encoding, a differentiable soft top-k selection, and dynamic pooling, to enhance performance on the target task. The paper is clearly written and easy to understand, presenting a well-defined pipeline that incorporates a state-of-the-art optimization scheme. The authors provide good evidence for their contributions through comprehensive evaluations, including rigorous comparisons against existing methods and ablation studies that demonstrate the effectiveness of each proposed modification.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Despite highlighting the potential for distortion and loss of anatomical detail due to resizing in CT images, the proposed method employs the same resizing and cropping scheme to achieve a standard input dimension. This apparent contradiction requires further justification regarding why the chosen standard dimension and resizing strategy minimize the very issues raised.

    Additionally, while individual ablation studies demonstrate the impact of each customization, incremental ablations, where modifications are added sequentially, would provide a more nuanced understanding of their interactions. Such analysis could reveal potential counteracting effects between different components and offer a clearer picture of their combined contribution to the overall performance.

    Finally, the variable performance gains across different evaluation metrics make it challenging to definitively assess the individual contribution and significance of each proposed component. A more in-depth analysis, or a more consistent pattern of improvement across all metrics, would help solidify the impact of each customization.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents two primary contributions: the design of a comprehensive pipeline encompassing image processing, tokenizer fusion, and DPO optimization; and the customization of the LinVT module. While the technical novelty of individual components may be incremental, building upon existing established works, the overall effectiveness and completeness of the proposed approach are noteworthy. Considering the promising results and the cohesive integration of these elements, I am inclined towards acceptance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed some of my questions. After going through all reviews, I keep my initial rating of weak accept.



Review #3

  • Please describe the contribution of the paper

    The authors introduce mu^2Tokenizer, an intermediate processing layer for jointly understanding the CT volume and the queried question. They also introduce a policy-based learning method for fine-tuning an LLM, based on direct preference optimization.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The mu^2Tokenizer presents an interesting approach to integrating the question directly into the CT embeddings. Notably, the patch-wise CT contains many tokens, and this mechanism can offer greater focus.
    2) The justification for differentiable token selection is well-written and explains the need for “soft” top tokens.
    3) The reinforcement learning mechanism with GREEN is a good application of current language-generation methods to improve model performance, and the justification as a policy is warranted (i.e., current metrics do not capture medical nuance).
    4) The authors compared their model against several state-of-the-art vision-language models across multiple datasets, demonstrating that their method outperformed the others.
    5) The ablations were extensive and highlighted the different contributions towards the performance of the authors’ final model.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) Relative positional encoding is an interesting concept; however, based on my interpretation of the methods, it was implemented with additional parameters, specifically by incorporating a learnable bias term. Please correct me if I misinterpreted. If, however, the authors truly used a relative positional encoding like RoPE, then some aspects need to be clarified. ViT breaks images into patches and flattens them into a sequence. This flattening disrupts the spatial locality inherent in image features (Tian, 2024). The motivation for relative positional encoding should be better explained and rationalized.
    2) Please refrain from using the word “significantly” if no statistical test was completed (i.e., page 7: “significantly surpassing”).
    3) Table 2 includes BLUE; do you mean BLEU? In addition, this metric is not included in Table 1. There are also some other typos, like “shwon” on page 3.

    Work Cited: Tian, Keyu, et al. “Visual autoregressive modeling: Scalable image generation via next-scale prediction.” Advances in neural information processing systems 37 (2024): 84839-84865.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    1) Figure 1 would benefit from adding the T, K, H, W, and N; this would help guide the reader better.
    2) On page 4, you mentioned an attention score; is this scaled attention?
    3) Tables 1 and 2 can benefit from reorganizing the metrics. ROUGE-1, METEOR and BERTScore are the traditional evaluation metrics, and GREEN is the more semantic one. It is unclear why GREEN is in the middle; reorganizing the metrics meaningfully would benefit the reader.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper presents an engaging application of concepts from pure text generation to vision-language generation. This, alongside the proposed intermediate mu^2Tokenizer, significantly enhances the methodology. The authors’ results further corroborate this. Clarifying certain sections of the paper would elevate it.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The paper introduces a promising approach to CT report generation. GREEN is a great way to incorporate a different loss into a training method.

    While certain aspects of the experimental evaluation and ablations could be improved, the overall contribution is intriguing and relevant to the field. The authors addressed many of my concerns in the rebuttal, although the question regarding the use of RoPE remains insufficiently clarified. Nevertheless, their commitment to releasing the codebase is appreciated and will help verify these details after submission.




Author Feedback

We thank all reviewers (R1, R2, R3) for their constructive feedback and valuable suggestions. We are encouraged by the positive comments acknowledging our contributions in integrating multi-scale visual information with language guidance via the µ²Tokenizer, and the novel application of DPO optimized by GREEN. We address the key concerns below:

(1) R1 - Motivation and use of GREEN for optimization appears disconnected.

GREEN is a clinically motivated metric that is trained to align with human experts’ preferences. While a fully objective evaluation remains difficult, using the best available metric in the training loop is a pragmatic approach to improve report quality. In other words, our original motivation is not weakened by this choice; rather, it drives us to embrace a domain-specific metric to guide the model when human-like judgment is hard to encode. We will revise the paper to make this clear.
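
For illustration only, the idea of using a report-quality score inside the training loop can be sketched as follows. This is a minimal sketch and not the authors' implementation: the helper names (`make_preference_pair`, `dpo_loss`), the choice of best-versus-worst pairing, and `beta=0.1` are all assumptions, and the scores stand in for whatever a metric such as GREEN would return. The loss is the standard DPO objective applied to sequence log-probabilities from the policy and a frozen reference model.

```python
import math

def make_preference_pair(candidates, scores):
    """Return (chosen, rejected): the highest- and lowest-scoring candidate reports.

    `scores` are hypothetical metric values (higher is better), one per candidate.
    """
    ranked = sorted(zip(scores, candidates), key=lambda sc: sc[0])
    return ranked[-1][1], ranked[0][1]

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Usage: three sampled reports with made-up scores.
chosen, rejected = make_preference_pair(
    ["report A", "report B", "report C"], [0.2, 0.9, 0.5]
)
loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

The loss shrinks as the policy assigns relatively more probability to the chosen report than the reference model does, so the metric only influences training through which report is labelled as preferred.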

(2) R1 - Limited applicability of prompting in report generation.

Although prompts in RRG are often generic (e.g., “Generate a report”), our framework treats them as meaningful context to guide image feature extraction. By incorporating the question into the tokenization process, the model learns to emphasize relevant anatomical regions—such as focusing on the abdomen when prompted about abdominal findings. This improves image-text alignment and enables more clinically coherent report generation, as detailed in Section 2.1.

(3) R1&R2 - Multi-scale modeling vs fixed input resizing.

While we resize CT volumes to a fixed 8×32×256×256 shape for training efficiency, our multi-scale modeling refers to how features are processed internally—not input resolution. The µ²Tokenizer can handle varying slice counts and applies dynamic multi-scale pooling (DMTP) using different kernel sizes (e.g., 1×1, 2×2, 4×4), with learned weights for scale selection. This enables the model to retain multi-scale representation despite fixed input dimensions.
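
To make the distinction between input resolution and internal multi-scale processing concrete, the following is a minimal sketch of pooling one fixed-size feature grid at kernel sizes 1, 2 and 4 and mixing the results with softmax-normalized scale weights. It is not the authors' DMTP module: the function names, the use of average pooling, the nearest-neighbour broadcast back to the original grid, and the scalar (rather than vector) features are all assumptions made to keep the example small.

```python
import math

def avg_pool2d(grid, k):
    """Non-overlapping k x k average pooling; grid height and width must be divisible by k."""
    h, w = len(grid), len(grid[0])
    return [[sum(grid[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w, k)]
            for i in range(0, h, k)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def multiscale_pool(grid, scale_logits, kernels=(1, 2, 4)):
    """Mix pooled views of `grid` using softmax weights over scales (one logit per kernel)."""
    weights = softmax(scale_logits)
    h, w = len(grid), len(grid[0])
    out = [[0.0] * w for _ in range(h)]
    for weight, k in zip(weights, kernels):
        pooled = avg_pool2d(grid, k)
        for i in range(h):
            for j in range(w):
                # Each position reads the pooled cell that covers it at this scale.
                out[i][j] += weight * pooled[i // k][j // k]
    return out

# Usage: a 4x4 grid with equal logits gives an even blend of fine and coarse views.
grid = [[float(4 * i + j) for j in range(4)] for i in range(4)]
mixed = multiscale_pool(grid, [0.0, 0.0, 0.0])
```

In a trained model the scale logits would be learned parameters or predicted from the input, so the input shape can stay fixed while the effective receptive field varies.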

(4) R1&R3 - Clarity of architecture and RPE mechanism.

We agree that RoPE could potentially offer advantages. We did not include RoPE in the current implementation primarily due to compatibility with our base architectures and a focus on validating the new tokenizer modules first. In the revised paper, we will make it clear that more sophisticated positional encoding schemes could be substituted. We will also clarify why we chose RPE: it directly addressed the issue of capturing local 3D structure in our initial experiments, and was straightforward to integrate into the LinVT-based framework. If the camera-ready process permits, we will consider running a comparative experiment with RoPE to quantify any gains in our setting.
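
For readers following the RPE versus RoPE discussion, the sketch below shows one common form of relative position encoding: a learnable bias, indexed by the clipped offset between query and key positions, added to the attention logits. Whether this matches the paper's RPE is an assumption based on Reviewer #3's reading; the function name, the 1D token layout, and the use of scaled dot-product attention are illustrative choices. RoPE, by contrast, rotates the query and key vectors themselves and adds no bias table.

```python
import math

def attention_with_relative_bias(q, k, bias_table, max_dist):
    """Attention weights with an additive relative-position bias.

    q, k: lists of equal-length vectors, one per token position.
    bias_table: 2 * max_dist + 1 values (learnable in a real model), indexed by
    the query-to-key offset clipped to [-max_dist, max_dist].
    """
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        logits = []
        for j, kj in enumerate(k):
            offset = max(-max_dist, min(max_dist, j - i))
            score = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            logits.append(score + bias_table[offset + max_dist])
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Usage: identical content vectors, so only the bias decides where attention goes.
tokens = [[0.0, 0.0]] * 3
local = attention_with_relative_bias(tokens, tokens, [-50.0, 0.0, -50.0], max_dist=1)
```

Because the bias depends only on the offset, the same table applies at every position, which is what lets such a scheme encode local structure independently of absolute location.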

(5) R1&R2 - Ablation study clarity.

In Table 2, we first demonstrate the individual contributions of each component by adding them to the baseline (+RPE, +DTS, +DMTP). Subsequently, all three components were combined (µ2LLM-1B(SFT)), and finally, we further trained the model using DPO. We admit that the description of the ablation is insufficient and confusing. We initially prepared two versions: the current addition version and the subtraction version (w/o). The confusion arises because we accidentally included a sentence from the subtraction version (“incorporating DTS … GREEN Score by up to 0.2 points.”). We will revise the wording and will include both the addition and subtraction versions in the final paper to provide a more comprehensive view.

(6) R2&R3 - Metric inconsistency and typos.

We corrected the BLEU spelling, and reorganized Tables 1–2 to group metrics meaningfully (traditional vs LLM-based). Typos such as “shwon” have also been fixed.

(7) R1&R2 - Reproducibility.

We will release our code, model weights, and data preprocessing scripts upon acceptance, enabling full reproducibility.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Three reviewers evaluated the paper, resulting in an overall acceptance. Reviewer #1 acknowledged the contributions of the ${\mu}^2$Tokenizer and DPO with the GREEN metric but maintained a reject recommendation, citing unresolved concerns about the validity of the dynamic feature extraction and multi-scale processing claims, lack of comparison with advanced RPE methods, and clarity issues in module descriptions and ablations. Reviewer #2 gave a weak accept, highlighting a well-structured pipeline with meaningful customizations to LinVT and comprehensive evaluations, though noting inconsistencies in metric improvements and suggesting incremental ablations and resizing justification. Reviewer #3 recommended accept, recognizing the novelty of the query-integrated tokenizer and policy-based optimization, while suggesting clarifications on RPE and minor edits. The rebuttal partially addressed concerns, and the accepting reviewers found the overall contributions and empirical results sufficient for acceptance.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


