Abstract

We propose a general pipeline to automate the extraction of labels from radiology reports using large language models, which we validate on spinal MRI reports. The efficacy of our method is measured on two distinct conditions: spinal cancer and stenosis. Using open-source models, our method surpasses GPT-4 on a held-out set of reports. Furthermore, we show that the extracted labels can be used to train an imaging model to classify the identified conditions in the accompanying MR scans. Both the cancer and stenosis classifiers trained using automated labels achieve comparable performance to models trained using scans manually annotated by clinicians.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1510_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1510_supp.pdf

Link to the Code Repository

https://github.com/robinyjpark/AutoLabelClassifier

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Par_Automated_MICCAI2024,
        author = { Park, Robin Y. and Windsor, Rhydian and Jamaludin, Amir and Zisserman, Andrew},
        title = { { Automated Spinal MRI Labelling from Reports Using a Large Language Model } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this work, the authors propose an LLM-based pipeline for labeling medical reports. First, prompts are designed to generate a condition-focused summary of each report. Second, the LLM is asked to answer a binary (yes/no) question about the condition based on that summary. Extensive experiments indicate the effectiveness of the proposed method.
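    For concreteness, here is a minimal sketch of such a summarize-then-answer pipeline, assuming a local instruction-tuned model served via Hugging Face transformers; the model name and prompt wording are illustrative placeholders, not the paper's actual configuration:

    ```python
    # Sketch of the two-stage summarize-then-answer labeling pipeline.
    # Assumption: a local chat model available via Hugging Face transformers;
    # model name and prompts are illustrative only, not the authors' own.
    from transformers import pipeline

    generator = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",  # placeholder local model
        device_map="auto",
    )

    def label_report(report: str, condition: str) -> bool:
        # Stage 1: generate a condition-focused summary of the free-text report.
        summary = generator(
            f"Summarize the findings in this radiology report that are relevant "
            f"to {condition}:\n\n{report}\n\nSummary:",
            max_new_tokens=128,
            return_full_text=False,
        )[0]["generated_text"]

        # Stage 2: ask a binary question, answered from the summary alone.
        answer = generator(
            f"Summary: {summary}\n\nBased only on the summary above, does the "
            f"patient have {condition}? Answer 'yes' or 'no'.",
            max_new_tokens=4,
            return_full_text=False,
        )[0]["generated_text"]
        return "yes" in answer.lower()
    ```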

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The proposed method achieves performance close to that of the GPT-4 model.

    2) The overall paper is clearly presented and easy to follow.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The major concern is the lack of novelty. Specifically, the proposed pipeline is constructed from various existing techniques. First, the idea of facilitating QA tasks via text summarization has been studied in existing works, such as:

    [a] All You May Need for VQA are Image Captions, in NAACL 2022.

    [b] From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models, in CVPR 2023.

    In these vision-language QA works, detailed captions are first generated, and the answers are then obtained from those captions, which is similar to this paper.

    Second, the fine-tuning method is LoRA, which is taken directly from existing work.

    2) The effectiveness of the proposed labeling pipeline is validated on image classification, so the overall task can be regarded as a vision-language analysis study. However, comparisons with and discussion of related methods are missing, such as:

    [c] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day, in NeurIPS 2023.

    3) According to Table 3, the performance gain over directly using GPT-4 is marginal (<2%).

    4) Table 5 only presents classification results using labels generated by the proposed method. It is unclear how much performance would change if the image classification models were trained with labels from the other methods mentioned in Table 4.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please address the concerns in the weakness section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There are several concerns regarding this work, including the lack of novelty, the lack of comparison with related vision-language models, and the limited performance gain compared with GPT-4.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of the paper is a general pipeline for analysing radiology reports using large language models (LLMs). Each report is labelled as a positive or negative case for a specific health condition, and the extracted labels can then be used to train a classifier for the medical images associated with those reports. Experimental results show that this weakly-supervised approach achieves performance comparable to models trained on a large amount of data labelled by medical experts.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The strengths of the paper include:

    1) Clear motivation: The lack of labelled data slows progress for many researchers in this area. Using LLMs to analyse reports and extract labels for a specific condition is a smart idea.

    2) Solid methodology: Working with LLMs poses many challenges due to the extremely large size of the models. The authors have clearly explained the methodology and the rationale behind the design, as well as the alternative solutions they considered.

    3) Encouraging results: The experimental results demonstrate the effectiveness of the proposed method and such a weakly-supervised approach can potentially benefit other related problems as the pipeline is quite generic.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper is well-written in general and most areas are quite strong. The only weakness to me is the evaluation on CancerData in Table 5. It would be interesting to see the performance of previous methods, if any, from the literature.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    As the authors will release the source code, the pipeline should be reproducible. Having said that, using and fine-tuning LLMs requires substantial computation, which may demand high-end machines to run such a model.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed methodology is solid and well-motivated in general. I would suggest the authors consider whether any additional related work can be included in the experimental results on MRI classification to further highlight the effectiveness of the proposed method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well-motivated, with a strong methodology and encouraging experimental results to support it.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I thank the authors for their effort in addressing the questions. I read all reviews and the rebuttal, and believe more clarification on the novelty could have further improved the quality of the work. I understand the authors highlight that the application is novel, but at the same time the technical challenge is lower if there is limited novelty on the LLM side. As a result, the score is slightly lowered, although I am still positive about this work.



Review #3

  • Please describe the contribution of the paper

    A methodology to derive labels from medical reports using LLMs is introduced, reducing the need for manual labelling of medical image data. The resulting automatically labelled data is used to train machine learning classifiers that demonstrate good performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main contribution lies in a proof-of-principle demonstration to label medical data automatically from reports to derive datasets to train machine learning models. Using local open-source language models shows the feasibility of the approach towards training classifiers with good performance.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper describes a general methodology that may be hard to reproduce exactly due to lack of access to the data, the specific fine-tuned models, and details of the code and models used.

    The verification of the labels is limited to a smaller, manually annotated subset of the data. While reasonable within the constraints of such a project, one has to rely on this smaller dataset being sufficient to validate the labels used for the much larger dataset on which classification is evaluated.

    The comparison of classifier models is limited, but that is perhaps not the point of the paper.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    As neither the data nor the specific models are available, it will be quite hard to reproduce the results from the paper. Nevertheless, I expect the general methodology to be relatively easy to reproduce on similar problems.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    How was the subset for calibrating and testing the labels chosen? It seems important that the general performance of the labeller for later classification training can be ascertained from this data, compared to having a fully manually labelled dataset (which is of course expensive). A discussion of the impact of label uncertainty on the classifier, and of possible remedies to reduce this risk, would be very useful.

    In particular, the perfect EER/AUROC for Z-SFT on stenosis needs more consideration, also given the above limitation. It seems unlikely to hold for the full dataset.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall the paper demonstrates a methodology worth discussing at MICCAI. The limitation on the uncertainty in the auto-generated labels is of course a concern, but hard to fully test. Nevertheless, the methodology is worth discussing and raising awareness towards paving the way to improving medical datasets for machine learning.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The paper should be accepted due to the interesting general application, showing that it can actually work for the specific case considered. My main concern about the subset validation vs. the total set is partially addressed by the response. I am still somewhat concerned that validation may be hard to trust without much more data, but this is well beyond scope and the paper can serve as a strong basis to explore this further.




Author Feedback

We thank the reviewers for their comments and suggestions for strengthening the paper.

R1: We agree that it would be beneficial to compare against other related works, specifically noting that a baseline would be useful to evaluate the model’s performance on cancer detection. Since submission, a newly published paper [1] on detecting bone metastasis in the thoracolumbar spine reported a per-slice F1-score of 0.72 on axial CT scans, which we surpass (0.74), achieving a similar F1-score to orthopaedic residents (0.73).

R3: We would like to clarify that the manually labeled sets were a randomly chosen subset of the full data, which we split into testing and calibration sets using stratified sampling. We currently force the pipeline to output a positive or negative label without uncertainty. Since we run our LLM locally, we can obtain token probabilities from it to estimate uncertainty, which we will explore in future work.
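A minimal sketch of how such uncertainty could be estimated from a locally hosted model, assuming direct access to next-token logits via transformers; the model name and the single-token treatment of "yes"/"no" are assumptions, not the paper's setup:

```python
# Sketch: estimate label uncertainty from yes/no next-token probabilities.
# Assumes a local causal LM via transformers; the model name is a
# placeholder and "yes"/"no" are assumed to encode as single tokens.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

def yes_probability(prompt: str) -> float:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # logits for the next token
    yes_id = tokenizer.encode("yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("no", add_special_tokens=False)[0]
    # Renormalise over just the {yes, no} pair to get a calibrated-ish score.
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # P(yes)
```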

R4, W1: “Method is simply a combination of existing techniques”. We do not claim that LLM summarisation or LoRA are novelties of our work; rather, our application of them is novel. We do not believe that R4’s references are directly comparable to our work, as they are vision-language studies that take both text and images as inputs. R4 states that text summarisation has been studied in references [a] and [b], but these focus on generating questions from single-sentence captions of natural images rather than summarising long-form text. In contrast, we demonstrate that open-ended text summarisation can be used to extract structured information from multi-paragraph radiological reports that contain specialised jargon and negation rules atypical in natural language (see Figures 2-3 in the Supplementary Material).
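For reference, a minimal sketch of attaching LoRA adapters with the peft library; the base model, target modules, and rank are illustrative choices, not the paper's configuration:

```python
# Sketch: LoRA fine-tuning setup via peft. The base model, rank, and
# target modules below are illustrative defaults, not the paper's values.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.2", device_map="auto"  # placeholder
)
lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # typical attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter weights train
```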

R4, W2: “Lack of comparison to existing large vision-language models”. This is an interesting future direction, but it appears to be out of scope for this paper. Our aim was to demonstrate a proof-of-concept that vision models trained with LLM-generated pseudo-labels can achieve performance comparable to models trained with human annotations, rather than to create a new visual question answering (VQA) system. Reference [c] is a medical vision-language model in the same subject area as our submission, and we will cite and discuss it in the camera-ready. While it is an interesting study demonstrating multimodal VQA capabilities, it is distinct from our work in that it takes both image and text as inputs and aims to be a standalone visual chatbot.

R4, W3: “Gains in labeling performance were minimal against GPT-4”. GPT-4 is a closed-source model that requires payment per token and the upload of sensitive medical data to remote servers for processing. Despite marginal performance gains, our method is more privacy-preserving and allows for local inference at negligible cost.

R4, W4: “Table 5 only presents classification results using labels generated by own method.” We would like to clarify that only the first and fourth rows of Table 5 report results of models fully trained using our labels. Row 2 performs inference using SpineNetV2, which is trained using human annotations, and row 3 uses SpineNetV2 to extract encodings and train an SVM using our report-generated labels (see the sketch below). We treat rows 2-3 as baselines for our stenosis classification method (row 4). We will make this clearer in the table in the camera-ready version.
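For illustration, a minimal sketch of the row-3 style baseline: a linear SVM fitted on frozen image encodings with report-derived pseudo-labels. The arrays below are random placeholders standing in for SpineNetV2 encodings and pipeline outputs, not the paper's data:

```python
# Sketch: linear SVM on frozen image encodings with pseudo-labels.
# The encodings and labels are random placeholders for illustration.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
encodings = rng.normal(size=(1000, 512))        # stand-in backbone features
pseudo_labels = rng.integers(0, 2, size=1000)   # stand-in LLM-derived labels

X_train, X_test, y_train, y_test = train_test_split(
    encodings, pseudo_labels, test_size=0.2, random_state=0
)
clf = SVC(kernel="linear", probability=True).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```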

[1] Motohashi, M., Funauchi, Y., Adachi, T., Fujioka, T., Otaka, N., Kamiko, Y., Okada, T., Tateishi, U., Okawa, A., Yoshii, T., Sato, S.: A New Deep Learning Algorithm for Detecting Spinal Metastases on Computed Tomography Images. Spine 49(6), 390-397 (2024).




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper shows a simple but useful application of LLMs. Current LLMs are pretty good at answering yes/no questions for major disease classes like cancer or stenosis. In the example given, there is even a hint through terms such as “stenosis”; but even if there isn’t, GPT-4 is pretty good at saying yes or no. This, together with the fact that the performance improvement is marginal, reduces the value of the paper. Researchers are already using these methods for labeling datasets. If this were demonstrated on a large number of findings and on nuanced versions of findings, for example what kind of stenosis, it would be more interesting.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Although MR1 questioned the reviewers’ decision, considering the principle that a meta-reviewer does not serve as an additional reviewer, I am inclined to accept the decision of 2 out of 3 reviewers. While R4 raised well-founded concerns and provided a high-quality review, I tend to agree with the author’s rebuttal. I believe the practical value of the paper is also commendable. However, I do not have direct research experience with LLMs.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The impact of this work mostly lies in “demonstrating LLM can summarize multi-paragraph radiological reports that contain specialised jargon and negation rules atypical in natural language”. The technical novelty is very limited; however, the clinical potential shown by this validation could be inspirational.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


