Abstract

The increasing complexity of medical imaging data underscores the necessity for multimodal intelligent systems capable of integrating diverse data representations for comprehensive and precise analysis. In the domain of 3D CT scans, the generation of accurate and clinically meaningful medical reports requires both volumetric contextual information and the fine-grained spatial details inherent in 2D slices. To address this challenge, we propose a framework that integrates a pretrained 2D self-supervised learning encoder, initially trained on CT scan slices, with a 3D aggregator. By combining the rich, high-resolution information from 2D slices with the spatial coherence of 3D volumetric data, our approach leverages the complementary strengths of both representations. Experimental results demonstrate that our method outperforms existing baseline approaches in both report generation and multiple-choice question answering, highlighting the critical role of multidimensional feature integration. This work underscores the transformative potential of multimodal intelligent systems in bridging complex imaging data with practical clinical insights, ultimately improving radiological diagnostics and patient care.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2618_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/serag-ai/SAMF

Link to the Dataset(s)

N/A

BibTex

@InProceedings{HosAbd_From_MICCAI2025,
        author = { Hosseini, Abdullah and Ibrahim, Ahmed and Serag, Ahmed},
        title = { { From Slices to Volumes: Multi-Scale Fusion of 2D and 3D Features for CT Scan Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {273 -- 282}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper addresses the challenge of accurate report generation from 3D CT scans by proposing a framework that leverages both 3D volumetric information and the rich, high-resolution details from 2D slices. Extensive experiments were conducted to demonstrate the effectiveness of the proposed method in both radiology report generation and multiple-choice question answering. A key contribution is the Slice-Attentive Multi-Modal Fusion (SAMF) module, which enables effective multidimensional and multimodal feature integration.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Clinically Relevant Motivation: While many vision-language models have been proposed, how to effectively mimic the radiologist’s workflow for report generation remains underexplored. This paper takes a meaningful step in that direction by combining 2D and 3D methods, offering valuable insights for future research.
    2. Well-Structured Paper: The paper is well-organized, presenting detailed descriptions of the proposed SAMF and Ao2D modules, the datasets used, and extensive results from ablation studies and comparisons with baseline methods.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Low reproducibility: no code or demo model can be accessed. The reviewer recommends that the authors open-source the code (or sample code) to demonstrate reproducibility.
    2. For the SAMF: a) How do you ensure positional consistency between the 2D and 3D features during projection? For example, how is spatial/temporal correspondence maintained when fusing slice-level and volume-level features? b) Why was a shared projection dimension d_proj chosen, and how sensitive is the system to this hyperparameter? c) What is the structure of the prompt fed into the SAMF fusion module? Any example can be provided?
    3. For Ao2D: how are text tokens aligned with 2D slice embeddings? Since, in my understanding, no supervision exists between words and slices, how should the softmax over tokens be interpreted?
    4. 3D Aggregator Architecture: Can you elaborate on the architecture used for the 3D aggregator? Is it a transformer-based temporal encoder, 3D CNN, or something else?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novelty, paper completeness and reproducibility

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper proposes a framework for radiology report generation from 3D CT scans that merges both 2D slice-based and volumetric representations. Specifically, a 2D encoder (pretrained via self-supervised learning on CT slices extracted from three anatomical planes) generates slice-level embeddings, which are then aggregated by a “3D aggregator” to capture volumetric context. The “Slice-Attentive Multi-Modal Fusion” (SAMF) fuses the 3D aggregator output, slice features, and textual prompts/tokens into a single representation for a large language model. Additionally, “Attention Over 2D Slices” (Ao2D), further enhances text-to-slice alignment. Experiments on the CT-RATE dataset show improvements in lexical-based evaluation metrics (BLEU, ROUGE-L, METEOR, BERT-F1) for radiology report generation, as well as performance gains in multiple-choice VQA.
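
To make this pipeline easier to picture, the following is a minimal PyTorch sketch of slice-to-volume fusion in this style: per-slice embeddings from a 2D encoder are projected into a shared space together with text tokens, a simple pooling step stands in for the 3D aggregator, and a single cross-attention layer plays the role of Ao2D-style token-to-slice weighting. Module names, dimensions, and the pooling choice are assumptions for illustration, not the paper's SAMF/Ao2D implementation.

```python
# Minimal sketch: project 2D slice features, a pooled 3D summary, and text
# tokens into a shared space, then fuse them with cross-attention.
# All names, dimensions, and the mean-pooling "aggregator" are assumptions.
import torch
import torch.nn as nn


class SliceToVolumeFusion(nn.Module):
    def __init__(self, d_slice=768, d_text=768, d_proj=512, n_heads=8):
        super().__init__()
        # A shared projection dimension keeps 2D, 3D, and text features
        # in one embedding space before fusion.
        self.proj_2d = nn.Linear(d_slice, d_proj)
        self.proj_3d = nn.Linear(d_slice, d_proj)
        self.proj_txt = nn.Linear(d_text, d_proj)
        # Cross-attention: text tokens query the volume + slice features,
        # giving an Ao2D-style soft alignment between tokens and slices.
        self.cross_attn = nn.MultiheadAttention(d_proj, n_heads, batch_first=True)

    def forward(self, slice_feats, text_tokens):
        # slice_feats: (B, S, d_slice) per-slice embeddings from a 2D encoder.
        # text_tokens: (B, T, d_text) embedded prompt tokens.
        z2d = self.proj_2d(slice_feats)                            # (B, S, d_proj)
        z3d = self.proj_3d(slice_feats.mean(dim=1, keepdim=True))  # (B, 1, d_proj)
        txt = self.proj_txt(text_tokens)                           # (B, T, d_proj)
        visual = torch.cat([z3d, z2d], dim=1)                      # (B, 1+S, d_proj)
        fused, attn = self.cross_attn(txt, visual, visual)
        # `fused` would be passed on to the language model; `attn` exposes
        # token-to-slice relevance weights.
        return fused, attn


model = SliceToVolumeFusion()
fused, attn = model(torch.randn(2, 64, 768), torch.randn(2, 16, 768))
print(fused.shape, attn.shape)  # torch.Size([2, 16, 512]) torch.Size([2, 16, 65])
```
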

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Fusion of 2D and 3D Perspectives: By combining slice-level detail with the 3D aggregator’s volumetric context, the framework more closely resembles how radiologists analyze scans, effectively uniting local slice detail with an overarching 3D viewpoint.

    Effectiveness of Ao2D in Report Generation and VQA: The Ao2D module demonstrates a performance boost in both tasks, report generation and VQA. This underscores the versatility and potential utility of aligning 2D slices with text.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of Clinically Focused Evaluation Metrics: The paper reports BLEU, ROUGE, METEOR, and BERT-F1 for the generated reports, but omits clinical-relevance metrics such as RadGraph-XL [1], RaTEScore [2], and GREEN [3]. Thus, the results do not establish the model's ability to generate clinically accurate reports.

    Incomplete Results Reported: Although BLEU-1 and BLEU-4 are presented, the rationale for highlighting these two metrics more prominently than others (e.g., BLEU-2, BLEU-3) is not clarified. Additionally, some baseline models do not have all metrics reported, limiting direct comparisons.

    VQA Results Not Comprehensive: The multiple-choice VQA evaluation does not provide results for CT-Chat's accuracy, precision, recall, or F1, making it difficult to benchmark improvements. Furthermore, the paper does not compare against other established medical multimodal LLMs (e.g., LLaVA-Med [4] and MedImageInsight [5]), limiting the comprehensiveness of the results.

    No Discussion of Model Sizes: There is no mention of parameter counts or resource usage, so it is hard to assess the model's efficiency or feasibility for deployment. Without this information, it is also unclear whether performance gains may simply come from larger model capacity.

    [1] RadGraph-XL: A Large-Scale Expert-Annotated Dataset for Entity and Relation Extraction from Radiology Reports. Delbrouck et al., ACL Findings 2024.
    [2] RaTEScore: A Metric for Radiology Report Generation. Zhao et al., EMNLP 2024.
    [3] GREEN: Generative Radiology Report Evaluation and Error Notation. Ostmeier et al., EMNLP Findings 2024.
    [4] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day. Li et al., NeurIPS 2023.
    [5] MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. Codella et al., arXiv 2024.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    What is the size of the pretraining dataset (how many CT scans)? Typo on page 6: "besst" should be "best".

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the work is interesting, the experimental design and discussion need improvement to convincingly demonstrate the effectiveness of the approach.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper
    1. A comprehensive framework designed to support both CT report generation and visual question answering.

    2. A pretrained 2D self-supervised learning encoder, initially trained on individual CT scan slices, is integrated with a 3D feature aggregator to capture volumetric context.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The framework maximizes the complementary strengths of both 2D and 3D representations through a 3D aggregator, which preserves volumetric and temporal relationships between adjacent CT slices.

    2. A warm-up training strategy is employed, where the model is first trained in a self-supervised manner to retain the temporal coherence between embeddings of adjacent slices.

    3. The proposed method demonstrates strong performance across multiple benchmarks, highlighting its effectiveness.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. In Equations (3) and (4), the variables F_{2D} and F_{3D} are not clearly introduced. They may correspond to 2D and 3D image features, but earlier sections use z_{2D} and z_{3D} to represent such features. This inconsistency could lead to confusion and should be clarified.

    2. The warm-up training strategy is an interesting idea; however, the implementation details are lacking. The authors mention the use of the InfoNCE loss but do not explain further how the warm-up phase is conducted or how it contributes to model performance.

    3. The purpose of Table 3 is not entirely clear. It is uncertain why the effect of freezing different model components is being studied, and the rationale behind these experimental setups is not well explained. Furthermore, in the Ao2D and SSL sections, fewer cases are analyzed when Ao2D is not used, making the comparison less comprehensive.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    refer to strengths and weaknesses

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We sincerely thank the reviewers for their professional suggestions and for their appreciation of our work. Our answers to the questions (Q) from the reviewers (R) are as follows:

R1-Q1: The project code is now accessible at github.com/serag-ai/SAMF.

R1-Q2: (a) In the SAMF framework, positional consistency between 2D and 3D features is implicitly preserved through attention-driven dynamic alignment. This attention mechanism enables soft positional alignment by weighting 2D slices according to their relevance to the aggregated 3D representation, effectively suppressing spatially discordant slices while enhancing features that contribute to better 3D understanding. (b) A shared d_proj ensures that 3D, 2D, and textual features reside within the same embedding space. (c) The text prompt is tokenized and encoded into embeddings, which are then fused with the visual features through attention. To further support our approach, we plan to conduct additional experiments in future work.

R1-Q3: The Ao2D module aligns text tokens with 2D slice embeddings through cross-modal attention, where text and visual features are projected into a shared latent space and interact via learned attention weights. The softmax over tokens dynamically computes relevance scores based on feature similarity.

R1-Q4: The 3D aggregator is a lightweight, spatial-pooling-based architecture. It processes the stack of 2D-encoded sagittal slices by pooling features across the slice dimension.

R2-Q1: In our previous work, we experimented with the GREEN score but observed inherent biases favoring shorter texts, which limited its reliability for our evaluation. For this study, we prioritized content verification using the Llama Score to better assess the generated reports. We appreciate this insight and will incorporate a discussion of these limitations in the final version.

R2-Q2,Q3: As stated in the manuscript, several of the models used for comparison in our study are relatively recent and have only been released as preprints. This has presented challenges in terms of reproducibility, as many of these works do not provide public code or pretrained weights. To ensure a consistent evaluation across all models, we selected metrics that are commonly reported across the majority of prior work. Regarding the absence of certain metrics, the same reasoning applies. We excluded comparisons with MedImageInsight and LLaVA-Med, as they target different tasks and lack support for 3D volumetric data, respectively.

R2-Q4: Thank you for your feedback on this. We use Phi-3v (4B) as our baseline VLM with ViT-B/16 as our slice-based encoder. The final fine-tuning stage required approximately 16 hours on a single NVIDIA A100 GPU with 80 GB of memory. We will include more detailed explanations in the final version of the paper.

R3-Q1: Thanks – we will correct both in the final version.

R3-Q2: Because our main contribution focuses on merging 2D slice-based and volumetric representations, we chose not to emphasize this aspect. However, we hypothesized that warming up the aggregator before final fine-tuning may enhance performance by avoiding a random weight initialization at fine-tuning time and providing the aggregator with preliminary insights into CT scans. The warm-up phase, conducted after pretraining the slice-based encoder but before final fine-tuning, employs the InfoNCE loss to maximize mutual information across different views (sagittal, axial, coronal) of a single CT scan. As shown in Table 3 (rows 2 and 6), this SSL warm-up yields a slight improvement in model performance.
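
To make the warm-up objective concrete, the following is a minimal PyTorch sketch of a symmetric InfoNCE loss over paired view embeddings (e.g., axial vs. sagittal aggregates of the same scans). The function name, temperature, and batch handling are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of an InfoNCE warm-up objective between two aggregated views
# (e.g., axial vs. sagittal) of the same CT scans. Names and the temperature
# value are assumptions for illustration only.
import torch
import torch.nn.functional as F


def info_nce(z_view_a, z_view_b, temperature=0.07):
    """z_view_a, z_view_b: (B, d) aggregator embeddings of two views of the
    same B scans; positives share a row index, all other rows act as negatives."""
    a = F.normalize(z_view_a, dim=-1)
    b = F.normalize(z_view_b, dim=-1)
    logits = a @ b.t() / temperature          # (B, B) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric contrastive loss: each view predicts its paired scan.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Example: warm up the aggregator on matched-view embeddings.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```
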
R3-Q3: The purpose of freezing different components was twofold: (1) to isolate the informational contribution of each component to the final prediction, and (2) to empirically evaluate its individual roles in the VLM. By freezing specific modules during training, we can assess how performance changes when relying solely on other components; for instance, freezing the slice encoder reveals the aggregator’s standalone representation quality.
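
As a small practical note on how such freezing is typically realized, here is a minimal PyTorch sketch of disabling gradient updates for one component; the helper name and module are hypothetical.

```python
# Minimal sketch of freezing one component to isolate another's contribution
# during fine-tuning. The module below is a hypothetical stand-in.
import torch.nn as nn


def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter of `module`."""
    for p in module.parameters():
        p.requires_grad_(False)
    module.eval()  # also fix any dropout/normalization statistics


# Example: freeze a slice encoder so only the aggregator and fusion layers
# receive gradients, revealing the aggregator's standalone representation quality.
encoder = nn.Sequential(nn.Conv2d(1, 16, 3), nn.ReLU())
freeze(encoder)
assert all(not p.requires_grad for p in encoder.parameters())
```
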




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


