Abstract

Large language models (LLMs) have demonstrated potential across various tasks, including vision-language applications like chest X-ray (XR) report generation (RG) in healthcare. Recent RG approaches focus on optimizing model performance for a single dataset with a single XR modality, often neglecting the critical area of computed tomography (CT) report generation. The challenge is compounded by medical datasets being isolated across different centers, making comprehensive collection difficult. Furthermore, LLMs trained on datasets sequentially can experience catastrophic forgetting. In this paper, we move beyond conventional approaches of training on a single dataset, and focus on improving the overall performance on sequentially collected multi-center datasets. We incorporate four datasets with diverse languages and image modalities for the experiments. Our approach utilizes a minimal number of task-specific learnable weights within an LLM-based RG method for each domain, maintaining the majority of weights frozen to avoid forgetting. Utilizing LLMs’ multilingual generalizability, we align models and facilitate knowledge sharing through a multi-label supervised contrastive loss within the LLM hidden space. We design a 2D-3D adapter for the image encoder to transfer from XR to CT RG tasks. A CT disease graph is established for transferring knowledge from XR to CT RG tasks, using CT’s most relevant XR disease class centers in a triplet loss. Extensive experiments validate our design.
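
To make the loss design concrete, below is a minimal PyTorch-style sketch of how a multi-label supervised contrastive loss over pooled LLM hidden states could look. Since no code repository is provided, the pooling into one vector per study, the Jaccard-style positive weighting, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multilabel_supcon_loss(hidden, labels, temperature=0.07):
    """Hedged sketch of a multi-label supervised contrastive loss.

    hidden: (B, D) pooled LLM hidden states, one vector per study (assumed pooling).
    labels: (B, C) multi-hot disease labels; studies sharing diseases are treated
            as positives, weighted here by label-set overlap (an assumption).
    """
    z = F.normalize(hidden, dim=-1)
    sim = z @ z.t() / temperature                          # (B, B) similarity logits
    labels = labels.float()
    inter = labels @ labels.t()                            # shared-disease counts
    union = labels.sum(1, keepdim=True) + labels.sum(1) - inter
    pos_w = inter / union.clamp(min=1e-6)                  # Jaccard-style positive weights
    pos_w.fill_diagonal_(0)                                # a sample is not its own positive

    self_mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    log_prob = sim - torch.logsumexp(sim.masked_fill(self_mask, float("-inf")),
                                     dim=1, keepdim=True)
    loss = -(pos_w * log_prob).sum(1) / pos_w.sum(1).clamp(min=1e-6)
    return loss.mean()
```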

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0254_paper.pdf

SharedIt Link: https://rdcu.be/dV17g

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_17

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0254_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

https://physionet.org/content/mimic-cxr/2.0.0/
https://openi.nlm.nih.gov/faq#collection

BibTex

@InProceedings{Sun_Continually_MICCAI2024,
        author = { Sun, Yihua and Khor, Hee Guan and Wang, Yuanzheng and Wang, Zhuhao and Zhao, Hongliang and Zhang, Yu and Ma, Longfei and Zheng, Zhuozhao and Liao, Hongen},
        title = { { Continually Tuning a Large Language Model for Multi-domain Radiology Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {177 -- 187}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper aims to improve the overall performance of an LLM-based model for report generation trained on a sequence of datasets. It also shows how to transfer 2D XR image-based report generation to 3D CT report generation.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper shows how to transfer 2D XR image-based report generation to 3D CT report generation. The strategy is to freeze the majority of weights to avoid forgetting.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. It is not clear how forgetting can be avoided by simply freezing some weights, since changing even a minimal number of weights may cause the model to behave very differently.
    2. The clinical settings are not aligned with practice: the input image sizes are too small, and the general evaluation metrics in Tables 1 and 2 may not reflect the results discussed in Fig. 3.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This reviewer suggests the authors pay more attention to aligning the developed model with clinical practice.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method and the experiment settings as discussed in the weakness part.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Although the rebuttal is not fully convincing and still far from clinical practice, I think this work is interesting.



Review #2

  • Please describe the contribution of the paper

    This paper addresses the continual learning problem in radiology report generation. Parameter-efficient tuning techniques, such as prompt tuning and adapters, are used to learn knowledge from each specific domain. A multi-label supervised contrastive loss is used to facilitate knowledge sharing.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Continual learning in radiology report generation is quite novel.
    2. The use of a multi-label supervised contrastive loss seems effective and original for this task.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. From D_xc to D_ctc, a specific disease graph is needed, which increases the difficulty of generalizing to other domains.

    2. Using prompt tuning and adapters for continual learning is quite common [1, 2, 3]. However, I should mention that this work combines these two methods and extends them to RG tasks, which increases its novelty.
    3. The compared methods are a bit old: only ProgPrompt is from the last two years, whereas EWC (2017) and DER (2020) are older.

    [1] Gao, Qiankun, et al. “A unified continual learning framework with general parameter-efficient tuning.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
    [2] Ermis, Beyza, et al. “Memory efficient continual learning with transformers.” Advances in Neural Information Processing Systems 35 (2022): 10629-10642.
    [3] Zhang, Wentao, et al. “Adapter learning in pretrained feature extractor for continual learning of diseases.” International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer Nature Switzerland, 2023.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    D_xc and D_ctc are two private datasets.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The compared methods may underperform because they have fewer tunable parameters, especially ProgPrompt. Listing the number of tunable parameters for all compared methods would be a good idea.
    2. MIMIC-CXR (D_xe1 in the paper) has 5.2k test samples, while the other three test sets have 1.2k, 1.1k, and 0.9k samples. Simply adding the report scores together may not be fair; listing the results for each dataset would be a fairer way to compare performance.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Continual learning in radiology report generation is an important task. However, the performance of the compared methods should be presented more clearly. Two of the datasets are private, and a specific disease graph is needed, which results in generalization difficulty.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The rebuttal does not clearly indicate whether the performance will be affected when the graph is unavailable, or whether there is an automated graph generation method that would improve its generalization.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a new LLM-based method for continually learning radiology report generation (RG). The method can be adapted to different types of multi-center, multi-domain data and achieves good performance compared with other SOTA models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The following points are interesting: a) The authors compose prompts from input image features (f_p), trainable visual tokens (f_v), and a trainable domain token (f_t). The LLM parameters are frozen while the other parts are trained to optimize the downstream loss.

    b) To compute f_v, a Transformer self-attention module is also proposed, which, as shown in the appendix, helps improve performance.

    c) Contrastive learning techniques are integrated into the latent embeddings of the LLM to further improve model robustness.

    d) The method obtains the best performance on 3 of the 4 downstream tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper has the following weaknesses: a) The choice of which parts of the model are trained or frozen in the continual learning setting is quite heuristic. For example, the projection layer P (yellow in Fig. 1) is trained on the first dataset, frozen on the next one, and trained again on later ones, which in general makes it hard to decide whether to train or freeze a component given a new dataset. Is there any underlying guidance to address this?

    b) The paper mentions the forgetting issue of the LLM when it learns a new task, and then proposes to freeze the LLM weights and train only some other modules. However, in the experimental settings the downstream tasks are essentially the same; only the input data differ. This raises the concern of whether a frozen LLM is a good option or a fine-tuned LLM would be better. In short, the authors should conduct experiments with both trained (with adapter) and frozen LLMs to see the impact on the final results.

    c) The CIDEr performance is not as good as that of the other three. Can the authors explain the reason for this phenomenon?

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Because the authors use an LLM in their model, they should provide the code to help reproduce the best reported performance; otherwise, it is very hard to rebuild from scratch.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned in Section 6, the authors should

    (a) conduct experiments with both frozen and trained LLMs in their application. Given the same tasks, it is possible that a properly fine-tuned LLM (with adapter) could further improve model performance.

    (b) Regarding the ‘lightweight’ 2D-3D adapter for transferring from XR to CT, the authors could leverage current strong models trained on large-scale medical images, such as [1], to obtain stronger and more robust feature representations (invariant across MRI, CT, etc.).

    (c) I have a question about computing the log-likelihood output in Eq. (1). How can the output layers of an LLM, which is originally designed to generate a sequence of tokens, be adjusted to produce classification scores? Which layer did you add or adjust to produce a single prediction score for each input?

    [1] LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching, NeurIPS 2023.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    In general, I think this is good work with important applications for continual learning across data collected from multi-site hospitals or across multi-modal data, although my remaining concern is the frozen LLM, as discussed above.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thanks to the authors for the feedback. I will keep my previous score, mostly due to the unclear rationale for freezing the LLM during continual learning.




Author Feedback

We sincerely thank the reviewers for their valuable comments and suggestions. We appreciate the encouraging comments like “Continual learning (CL) in the report generation (RG) task is novel and important. Contrastive loss in LLM hidden space is novel and interesting” (R1&R3), and “move the RG task from 2D XR to 3D CT is interesting” (R4).

Q(R4): How do we avoid forgetting? A: We employ task-specific weights to differentiate between tasks, a strategy that has proven effective in CL and avoids forgetting (ICLR[27], MICCAI[a], TIP[b]). Specifically, we learn minimal “distinct and task-specific” weights for “q, f_t, P, LN” for each dataset/domain.
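
A minimal sketch of the per-domain parameterization described above, assuming a PyTorch-style implementation: only a small bundle of task-specific weights (queries q, domain token f_t, projection P, LayerNorm) is registered per dataset, while the shared LLM backbone stays frozen. The module names, shapes, and exact components kept per domain are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn

class DomainSpecificWeights(nn.Module):
    """One small bundle of task-specific weights per domain; the LLM is frozen."""
    def __init__(self, domains, hidden_dim, n_query_tokens=32):
        super().__init__()
        self.proj_P = nn.ModuleDict({d: nn.Linear(hidden_dim, hidden_dim) for d in domains})
        self.ln = nn.ModuleDict({d: nn.LayerNorm(hidden_dim) for d in domains})
        self.domain_token = nn.ParameterDict(   # f_t
            {d: nn.Parameter(torch.zeros(1, hidden_dim)) for d in domains})
        self.queries = nn.ParameterDict(        # q
            {d: nn.Parameter(torch.randn(n_query_tokens, hidden_dim) * 0.02) for d in domains})

    def forward(self, image_feat, domain):
        # Project image features into the (frozen) LLM embedding space for this domain.
        visual_tokens = self.ln[domain](self.proj_P[domain](image_feat))
        return visual_tokens, self.queries[domain], self.domain_token[domain]

# Only the current domain's bundle is trained; the LLM backbone stays frozen:
# for p in llm.parameters():
#     p.requires_grad_(False)
```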

Q(R4): The input image size and evaluation metrics? A: The commonly used image size in RG for XR and CT slices is 224*224 [c,d], while BLEU, ROUGE, and CIDEr are prevalent evaluation metrics for RG [c,d]. To ensure comparability with prior studies, we maintain consistency in both input shape and the metrics. Our experiments (Table+Fig.) can demonstrate the effectiveness of our method, confirmed by a radiologist.

Q(R4): How does our method align with clinical settings? A: We focus on CL for RG, aligning with clinical settings where multi-domain data at scale can only be acquired sequentially (not simultaneously) for LLM. We tailor the CL pathway to match real clinical applications (from online to private, from XR to CT), and meet the clinical needs of RG for both XR (using 1 or 2 inputs) and CT.

Q(R3): Guidelines for the tuned parameters? A: Reuse P for similar tasks (both English+XR). For dissimilar tasks (English to Chinese, XR to CT), tuning P is necessary.

Q(R3): Fine-tune LLM on RG task. A: We aim to optimize the overall performance on multi-domain RG tasks (English&Chinese, XR&CT). Tuning the multi-lingual LLM on the 1st dataset D_XE1 causes overfitting in an English setting, deteriorating the subsequent performance. Keeping the multi-lingual LLM frozen maintains its generalizability as a backbone for CL on multi-domain RG tasks, and reduces computational costs.

Q(R3): Log-likelihood in Eq.(1)? A: Generating subsequent tokens amounts to predicting the indices of words in a given vocabulary, as in classification. The LLM outputs probability vectors over the possible word indices, and the log-likelihood is optimized via the equivalent cross-entropy loss.
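
As a small illustration of this point (assuming a standard causal LM that returns per-position vocabulary logits; not the authors' code), the report log-likelihood in Eq. (1) is simply the sum of per-token log-probabilities, which is the negative of the usual token-level cross-entropy, so no extra classification head is needed:

```python
import torch
import torch.nn.functional as F

def report_log_likelihood(logits, target_ids, pad_id=0):
    """logits: (B, T, V) next-token scores from the LLM;
    target_ids: (B, T) ground-truth report tokens (already shifted by one)."""
    log_probs = F.log_softmax(logits, dim=-1)                             # (B, T, V)
    tok_ll = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)   # (B, T)
    mask = (target_ids != pad_id).float()
    ll = (tok_ll * mask).sum(-1)                                          # per-report log-likelihood
    # Sanity check: maximizing ll is equivalent to minimizing cross-entropy.
    ce = F.cross_entropy(logits.flatten(0, 1), target_ids.flatten(),
                         ignore_index=pad_id, reduction="sum")
    assert torch.allclose(-ll.sum(), ce, atol=1e-3)
    return ll
```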

Q(R1): Baseline methods. A: To the best of our knowledge, there is no prior work on CL in RG for comparison. DER and EWC are commonly used baselines. ProgPrompt, identified as the leading baseline in recent ICLR [e] and arXiv [f] works (Apr. 2024), is a competitive option for comparison.

Q(R1): Generalizability of our method using the disease graph. A: In the lungs, our method can learn new modalities (e.g. PET/MRI) by using the domain token and pulling similar findings toward the disease class centers of CT (e.g. tumor). If no similar findings exist, L^SC allows free exploration of the disease class centers in the feature space while keeping distance from other classes. In a broader medical context for CL, our method can map disease graphs used in screening to detailed exams with additional labels (e.g. XR to CT, fundus photo to OCT).
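
For concreteness, a hedged sketch of the class-center triplet term implied here (the distance measure, margin, and how the most relevant XR center is selected from the disease graph are assumptions; the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def center_triplet_loss(feat, pos_center, neg_centers, margin=0.5):
    """feat: (D,) hidden feature of a CT finding;
    pos_center: (D,) class center of the most relevant XR disease (from the graph);
    neg_centers: (K, D) class centers of the other diseases."""
    d_pos = 1 - F.cosine_similarity(feat, pos_center, dim=0)
    d_neg = 1 - F.cosine_similarity(feat.unsqueeze(0), neg_centers, dim=1)  # (K,)
    # Pull toward the matched XR center while keeping a margin from other centers.
    return F.relu(d_pos - d_neg + margin).mean()
```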

Q(R1): Detailed info. A: The number of trainable parameters in ProgPrompt matches ours, as they use an MLP layer for prompt reparameterization. The other methods use many more trainable parameters by tuning the encoder E. Details on each dataset will be included in the supplementary material.

[a] Adapter learning in pre-trained feature extractor for continual learning of diseases, MICCAI, 2023
[b] Task-Specific Normalization for Continual Learning of Blind Image Quality Models, TIP, 2024
[c] Radiology report generation with a learned knowledge base and multi-modal alignment, MIA, 2023
[d] Medical-VLBERT: Medical Visual Language BERT for COVID-19 CT Report Generation With Alternate Learning, TNNLS, 2021
[e] Scalable Language Model with Generalized Continual Learning, ICLR, 2024
[f] Q-Tuning: Queue-based Prompt Tuning for Lifelong Few-shot Language Learning, 2024




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Satisfactory rebuttal

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Satisfactory rebuttal



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers agree that the problem addressed in this paper is an important task in radiology report generation, with potential applications in continuous learning across diverse data sources from multi-site hospitals or multi-modal data. However, they raise concerns regarding performance comparison, the use of private datasets, and the potential for misalignment in clinical settings. The authors’ rebuttal addresses most of these concerns, but responses regarding the effect of performance without the graph, the graph generation method, and the reasoning behind freezing the LLM during continual learning remain unsatisfactory.

    While minor concerns remain, I find the problem statement to be relevant and the proposed solution to exhibit a certain level of novelty. The rebuttal has adequately addressed most of the concerns raised by the reviewers. Therefore, I am inclined to accept this paper and provide the MICCAI community an opportunity to further discuss the clinical usefulness of this work.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



