Abstract

Medical imaging plays a crucial role in diagnosis, with radiology reports serving as vital documentation. Automating report generation has emerged as a critical need to alleviate the workload of radiologists. While machine learning has facilitated report generation for 2D medical imaging, extending this to 3D has remained unexplored due to computational complexity and data scarcity. We introduce the first method to generate radiology reports for 3D medical imaging, specifically targeting chest CT volumes. Given the absence of comparable methods, we establish a baseline using an advanced 3D vision encoder from medical imaging to demonstrate the effectiveness of our method, which leverages a novel auto-regressive causal transformer. Furthermore, recognizing the benefits of leveraging information from previous visits, we augment CT2Rep with a cross-attention-based multi-modal fusion module and hierarchical memory, enabling the incorporation of longitudinal multimodal data.



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2185_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2185_supp.pdf

Link to the Code Repository

https://github.com/ibrahimethemhamamci/CT2Rep

Link to the Dataset(s)

https://huggingface.co/datasets/ibrahimhamamci/CT-RATE

BibTex

@InProceedings{Ham_CT2Rep_MICCAI2024,
        author = { Hamamci, Ibrahim Ethem and Er, Sezgin and Menze, Bjoern},
        title = { { CT2Rep: Automated Radiology Report Generation for 3D Medical Imaging } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces the first method for generating radiology reports for 3D medical imaging, focusing on chest CT volumes. It establishes a baseline using an advanced 3D vision encoder and introduces a novel auto-regressive causal transformer. Additionally, it enhances the method with a cross-attention-based multi-modal fusion module and hierarchical memory to incorporate longitudinal multimodal data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A report generation framework for 3D medical imaging. Most recent multimodal works come from the computer vision field and cannot be directly applied to 3D medical images.
    2. A training set of 20,000 patients and a validation set of 1,314 patients.
    3. Hierarchical memory to incorporate longitudinal multimodal data. This is in line with clinical scenarios and is rarely studied in related work.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    No comparison with SOTA multimodal frameworks.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. No comparison with SOTA multimodal frameworks. I understand that there is no directly similar work on report generation for 3D medical images. However, there are many multimodal methods for natural or 2D medical images [1]. LLaVA-Med achieves great performance in X-ray report generation; I think the authors could plug their 3D vision encoder into LLaVA-Med as a strong baseline. Also, some researchers have adapted it to video inputs [2], which may be similar to adapting it to 3D medical images. I think it is important to compare with at least one recent work to show the effectiveness of the proposed method.

    [1] Li, Chunyuan, et al. “LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day.” Advances in Neural Information Processing Systems 36 (2024).

    [2] Lin, Bin, et al. “Video-LLaVA: Learning united visual representation by alignment before projection.” arXiv preprint arXiv:2311.10122 (2023).

    2. The F1 scores of the CE metrics seem low, which indicates poor diagnostic performance.

    3. Is it possible to release the dataset of 25,701 non-contrast 3D chest CT volumes and corresponding reports? I think it would greatly advance research in this field.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    No comparison with SOTA multimodal frameworks.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this paper, the authors present a novel CT-to-report model. The proposed model is extended so that it can also take the previous report and CT volume as input. The proposed model and the extended model are evaluated on a large-scale CT report dataset. Quantitative and qualitative results are presented, indicating that the proposed methods can effectively generate reports from the input CT images.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The model proposed in the paper generates reports from CT images, which is rarely researched in related works.
    2. An intuitive model is proposed to encode the CT volume and generate the report. Relational memory and memory-driven conditional layer normalization enhance the decoding process.
    3. The proposed method incorporates the previous CT images and reports to enhance the report, which is meaningful in clinical routines.
    4. The method is evaluated on a large-scale CT dataset.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors claim their method to be the first CT2Rep; however, another CT-to-report method, also named CT2Rep, was already proposed in 2022 (https://www.mdpi.com/2075-4426/12/3/417).
    2. The explanation of the models is sloppy; the same statements appear multiple times in Sections 1 and 2.
    3. In Fig. 1, does X1:12 denote the distinct patches as in subsection 2.1? If so, the spatial size should be 12×24×24 (also in Section 2.1), and why does it look like whole CT slices?
    4. In the 3D vision feature extractor paragraph, what are the temporal patches? The temporal patch size p_t is defined but never used in the paper.
    5. In the Transformer encoder and decoder paragraph, the authors indicate that they use a conventional/traditional Transformer network. Do they mean the Vision Transformer as in (https://arxiv.org/abs/2010.11929)?
    6. The dataset needs more elaboration: Is the dataset private/public? How are the reports generated?
    7. In Section 3.1, the authors state that ‘CT-Net is the first and only model for 3D chest CT volumes classification’, which is not true.
    8. In Section 3.2, the authors use CT-Net output features as input to the 3D volume transformer; what exactly is this transformer?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The description of the dataset used in the paper is unclear. Since multiple large models are involved, training the proposed method will be time-consuming.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The naming of ‘longitudinal data’ is confusing, as ‘longitudinal’ usually indicates a spatial location.
    2. Please refine the explanations and reduce the repetitive sentences.
    3. Be careful with the definitive statements (the first… the only… the first and only) in the paper and use citations to support such statements when possible.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is novel and the presented task is rarely researched. However, some expected explanations are missing from the paper, and thus a rebuttal is needed.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The concerns in my comments are well addressed. This study is based on a large-scale private dataset and is thus also important, because there is no existing public dataset.



Review #3

  • Please describe the contribution of the paper

    This work proposes a novel and interesting radiology report generation method for 3D medical images, targeting chest CT volumes. For the first time, this work shows effectiveness on a longitudinal dataset with a cross-attention-based multi-modal fusion module and a hierarchical memory-driven decoder. The proposed novel auto-regressive causal transformer can be beneficial for future radiology report generation tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This work introduces the first method to generate radiology reports for 3D medical imaging and shows promising performance.
    2. The paper is clearly written and easy to follow.
    3. Evaluation is fair and solid for validating the proposed method.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors introduce an auto-regressive causal transformer for encoding the visual features, similar to Ibrahim Ethem Hamamci et al., “GenerateCT: Text-Conditional Generation of 3D Chest CT Volumes”. However, for 3D feature encoding, a causal transformer is not needed the way it is in generation tasks.
    2. In the ablation studies, the authors need to compare with 2D report generation results when claiming the effectiveness of the 3D structure.
    3. In the evaluation section, the authors could introduce a SOTA LLM for evaluating the generated reports, as in other VLM evaluations.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Refer to weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Major concerns remain regarding the reproducibility of the paper. The authors do not commit to releasing the source code in the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Dear Reviewers, we appreciate your constructive feedback. Please find our clarifications below:

[R1, R3, R5] Reproducibility, codebase, dataset: We understand the concerns regarding our dataset and reproducibility. From a hospital’s electronic health records, we exported 3D chest CT volumes and corresponding radiology reports written by radiologists in daily clinical practice, receiving ethical approval for both the CT2Rep project and the open-sourcing of our training data. To maintain anonymity, we did not include specific dataset details in our submission. If accepted, we will provide the ethical approval code, data details, an open-source link to the dataset, and our codebase.

[R1] Comparison with LLaVA: We acknowledge the constructive feedback on the lack of comparison with multimodal methods like LLaVA. However, LLaVA-Med uses a VQA dataset for training, and generating such a dataset is beyond our scope. Moreover, LLaVA employs a pretrained CLIP with frozen weights, whereas training a 3D CLIP on our dataset has not been finalized. Additionally, our 3D tokenization results in 4,096 tokens, compared to LLaVA’s 50 tokens, highlighting methodological disparities. Integrating a 3D encoder into LLaVA therefore introduces complexities. Nevertheless, we recognize the value of such a comparison and are working on adapting LLaVA for 3D medical images.

[R1] F1 Score: We agree that the F1 scores in our paper are not yet high enough for clinical application. However, it is important to note that this is the first paper addressing report generation for 3D medical images; our primary goal was to establish a foundational baseline.

[R3] Use of causal transformer: The causal transformer models attention between slices, capturing inter-slice dependencies, which is clinically significant as certain pathologies span varying axial locations. This captures essential 3D structural information. We removed the decoder and do not use auto-regressive or MaskGIT training, which are beneficial for generation tasks.
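To make this concrete, the following is a minimal PyTorch sketch of causal self-attention over slice-group tokens, under assumed sizes (hidden dimension, head count, and depth are illustrative, not the authors’ reported configuration):

    import torch
    import torch.nn as nn

    # Hypothetical sketch: 20 slice-group tokens, each allowed to attend only
    # to itself and to earlier groups along the axial direction.
    num_groups, dim = 20, 512  # assumed token count and feature size
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    encoder = nn.TransformerEncoder(layer, num_layers=4)

    # Upper-triangular -inf mask: position i cannot attend to positions j > i.
    causal_mask = torch.triu(
        torch.full((num_groups, num_groups), float("-inf")), diagonal=1
    )

    slice_tokens = torch.randn(2, num_groups, dim)     # (batch, groups, features)
    encoded = encoder(slice_tokens, mask=causal_mask)  # -> (2, 20, 512)

The upper-triangular mask ensures each axial position conditions only on preceding slices, which is how inter-slice dependencies can be captured without a decoder.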

[R3] Comparison with 2D methods: Our radiology reports are written for whole 3D volumes, with pathologies residing on different slices. This variability makes random layer selection and 2D report generation ineffective.

[R3] Evaluation with LLMs: We appreciate the insightful suggestion. Inspired by this idea, we evaluated similarities between generated and ground-truth reports using ChatGPT-4 Turbo. We will include a comprehensive analysis in the supplementary material.

[R5] Name similarity: We became aware of this after submission. However, that work differs significantly from ours, as it is not a direct 3D CT-to-report method: it extracts 113 semantic features using Radiomics and SISN and employs a neural network to fill predefined templates. This approach does not generate true reports and processes only tumor-related features, without addressing the wide range of pathologies in CT reports. Our method generates comprehensive radiology reports directly from 3D volumes, covering a broader spectrum of pathologies. That said, we acknowledge we could have chosen a different name.

[R5] Fig. 1 and temporal patches: X1:12 represents the first 12 slices patched along x and y dimensions with a patch size of 24, creating 400 patches per slice. For 240 slices, this results in 20 sets of 400 patches. These are reshaped to form a tensor (batch_size×20×400) for the first transformer, capturing attention within each slice. The tensor is then reshaped to (batch_size×400×20) for the second transformer to capture attention across slices, effectively modeling 3D spatial relationships. The temporal patch size refers to treating 12 slices as a single patch.
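As a sanity check of this arithmetic, the sketch below reproduces the shapes described, assuming a 240×480×480 input volume (raw voxel patches stand in for the embedded features):

    import torch

    # Shape walk-through of the patching described above, assuming a
    # 240 x 480 x 480 volume.
    B = 2
    volume = torch.randn(B, 240, 480, 480)

    # 240 slices / 12 slices per temporal patch = 20 groups;
    # (480 / 24) * (480 / 24) = 400 in-plane patches per group.
    x = volume.reshape(B, 20, 12, 20, 24, 20, 24)  # split z, y, x into patches
    x = x.permute(0, 1, 3, 5, 2, 4, 6)             # (B, 20, 20, 20, 12, 24, 24)
    tokens = x.reshape(B, 20, 400, 12 * 24 * 24)   # (B, groups, patches, voxels)

    # First transformer: attention among the 400 patches within each group.
    within = tokens.reshape(B * 20, 400, -1)
    # Second transformer: attention among the 20 groups at each patch location.
    across = tokens.permute(0, 2, 1, 3).reshape(B * 400, 20, -1)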

[R5] Our network: The transformer encoder uses two subsequent transformers with conventional multi-head attention. While similar to the Vision Transformer, it does not utilize a CLS token. The 3D volume transformer refers to this transformer encoder. The transformer decoder is a language transformer adapted from R2Gen, with modifications such as memory-driven conditional layer normalization.
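For concreteness, a minimal sketch of such a two-stage encoder (attention within each slice group, then across groups, no CLS token) might look as follows; all dimensions are illustrative assumptions, not the released implementation:

    import torch
    import torch.nn as nn

    def make_encoder(dim=512, heads=8, depth=2):
        # Assumed sizes; conventional multi-head attention, no CLS token.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        return nn.TransformerEncoder(layer, num_layers=depth)

    class TwoStageEncoder(nn.Module):
        def __init__(self, dim=512):
            super().__init__()
            self.within_group = make_encoder(dim)   # among the 400 patches of a group
            self.across_groups = make_encoder(dim)  # among the 20 slice groups

        def forward(self, x):                       # x: (B, groups, patches, dim)
            b, g, p, d = x.shape
            x = self.within_group(x.reshape(b * g, p, d)).reshape(b, g, p, d)
            x = x.permute(0, 2, 1, 3).reshape(b * p, g, d)
            x = self.across_groups(x).reshape(b, p, g, d).permute(0, 2, 1, 3)
            return x                                # (B, groups, patches, dim)

    features = TwoStageEncoder()(torch.randn(2, 20, 400, 512))

Factoring attention this way keeps the attention matrices at 400×400 and 20×20 rather than a single matrix over all 8,000 patch tokens.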




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal, two reviewers recommended acceptance. Considering the significant value of CT in the medical field, this paper is exceptionally meaningful. Additionally, given the major differences between CT images and natural images or videos, I find the demand for the authors to compare their work with models like LLAVA a bit harsh. However, I do suggest that the authors compare their work with more baselines. I also have a concern regarding computational costs. The paper mentions that all experiments were conducted using a single A100 for one week, which seems a bit unusual to me (perhaps I misunderstood). I recommend that the authors provide a more detailed breakdown of the computational costs, including time, memory, and model size.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The shared concern in the review is the reproducibility of the study due to the use of a private dataset. The authors committed in the rebuttal to release the code and dataset once the paper is accepted. After the rebuttal, the paper received “Accept,” “Weak Accept,” and “Weak Reject” decisions, with R1 (Weak Reject) not providing a final decision. After reading the paper and considering all the reviews, I suggest an “Accept.”

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


