Abstract

We propose MRScore, an innovative automatic evaluation metric tailored to radiology report generation. Traditional natural language generation (NLG) metrics such as BLEU are inadequate for accurately assessing reports, particularly those generated by Large Language Models (LLMs); our experiments provide systematic evidence of these inadequacies. To address this challenge, we developed, in collaboration with radiologists, a framework that guides LLMs in evaluating radiology reports according to standard human report evaluation procedures. Used as a prompt, this framework keeps the LLMs’ output closely aligned with human analysis. We then paired the LLM-generated data into accepted and rejected samples to build a human-labeled dataset, and trained the MRScore model on it as a reward model. MRScore demonstrates a higher correlation with human judgments and superior performance in model selection compared with traditional metrics. Our code is available on GitHub at: https://github.com/yunyiliu/MRScore.
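To make the training recipe concrete, below is a minimal sketch of pairwise reward-model training in the spirit the abstract describes: a scalar-output scorer trained so that accepted reports outrank rejected ones. The gpt2 backbone, hyperparameters, ranking loss, and toy report texts are all illustrative assumptions, not the authors' implementation; see the GitHub repository for the actual code.

```python
# Minimal sketch of pairwise reward-model training on accepted/rejected
# report pairs. Backbone, loss form, and hyperparameters are assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder backbone, not necessarily MRScore's

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=1)
model.config.pad_token_id = tokenizer.pad_token_id

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def score(reports):
    """Return one scalar reward per report string."""
    batch = tokenizer(reports, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    return model(**batch).logits.squeeze(-1)

def training_step(accepted, rejected):
    """Pairwise ranking loss: push score(accepted) above score(rejected)."""
    loss = -F.logsigmoid(score(accepted) - score(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy accepted/rejected pair (hypothetical report texts).
accepted = ["The heart size is normal. Lungs are clear. No acute abnormality."]
rejected = ["Heart normal normal. lungs no no clear abnormality seen."]
print(training_step(accepted, rejected))
```

At inference time, the trained scorer's scalar output would serve as the quality score for a candidate report.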

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1151_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/yunyiliu/MRScore

Link to the Dataset(s)

https://github.com/yunyiliu/MRScore

BibTex

@InProceedings{Liu_MRScore_MICCAI2024,
        author = { Liu, Yunyi and Wang, Zhanyu and Li, Yingshu and Liang, Xinyu and Liu, Lingqiao and Wang, Lei and Zhou, Luping},
        title = { { MRScore: Evaluating Medical Report with LLM-based Reward System } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors proposed MRScore, an innovative automatic evaluation metric for radiology report generation. They developed a framework to guide LLMs in evaluating radiology reports, and the data generated by LLMs was used to train the MRScore model. MRScore demonstrated a higher correlation with human judgments compared to traditional metrics.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The paper addressed a real issue: it is difficult to judge the quality of reports generated by LLMs, especially using traditional metrics; 2) The paper utilized ChatGPT to generate/label training data automatically, a good way to easily generate a large amount of data without intensive manual labor; 3) The paper compared the proposed MRScore to other common scoring methods and showed superior performance.
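    To illustrate the kind of automatic grading this strength refers to, a single LLM grading call might look like the sketch below; the rubric wording, model choice, and 0-10 scale are hypothetical placeholders, not the paper's actual prompts.

```python
# Hypothetical sketch of automatic report grading with an LLM. The prompt
# wording, model name, and scoring scale are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

GRADING_PROMPT = """You are a radiologist grading a generated chest X-ray report
against a reference report. Score the generated report from 0 (unusable) to
10 (clinically equivalent), penalizing missed findings, false findings, and
incorrect severity.

Reference report:
{reference}

Generated report:
{candidate}

Return only the integer score."""

def grade_report(reference: str, candidate: str) -> int:
    """Ask the LLM for one integer quality score for a report pair."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": GRADING_PROMPT.format(reference=reference,
                                                    candidate=candidate)}],
    )
    return int(response.choices[0].message.content.strip())
```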

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper could use many more concrete examples. I will give two examples here: 1) A very important part of the paper was to use LLMs to create/grade a training dataset with reports of various qualities. However, without any concrete example of how this was done (for example, what prompts were used and what a high vs. low quality report looked like), it was difficult to judge the merit of this approach; 2) The proposed MRScore was better correlated with human judgment compared to other NLP scoring metrics. It would be interesting to see some examples of reports that MRScore was able to judge correctly while the other metrics failed to do so. I would also like to see some intuitive explanation of why MRScore was better at judging reports.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    As mentioned above, since this paper makes heavy use of LLMs such as ChatGPT, it would be great if the authors could add more details and examples so that readers can better understand the paper and its merits. For example: what kind of prompts were given to ChatGPT to guide it through the report grading process, how reports were divided into high/medium/low quality, and how the proposed MRScore performed better than other metrics on a few report samples.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I found the application very interesting (providing a new scoring method to judge reports generated by LLMs). At the same time, I don’t find the methodology very complex/highly innovative.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The abstract introduces MRScore, an innovative automatic evaluation metric designed specifically for assessing radiology reports, especially those generated by Large Language Models (LLMs). It highlights the limitations of traditional NLG metrics like BLEU in accurately evaluating such reports and provides experimental evidence within the paper to support this claim.

    To address these challenges, the authors developed a unique framework in collaboration with radiologists, aligned with standard human report evaluation procedures. This framework aims to guide LLMs so that their evaluations closely resemble human analysis, enhancing the quality and relevance of the generated assessments.

    A key aspect of MRScore’s development involved utilizing data generated by LLMs to create a human-labeled dataset consisting of accept and reject samples. This dataset was then used to train the MRScore model as a reward model, allowing it to assess and score the quality of generated reports more effectively.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. MRScore demonstrates a higher correlation with human judgments and outperforms traditional NLG metrics in model selection tasks. This highlights MRScore’s potential as a more reliable and human-aligned evaluation metric for assessing the quality of radiology reports generated by advanced language models.

    2. Overall, MRScore represents a significant advancement in the field of automatic evaluation metrics for radiology reports, offering a promising solution to the limitations of existing NLG metrics in this context.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The paper is not well written.
    2. There are many spelling mistakes.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper is not well written.

    1. There are many spelling mistakes.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. MRScore demonstrates a higher correlation with human judgments and outperforms traditional NLG metrics in model selection tasks. This highlights MRScore’s potential as a more reliable and human-aligned evaluation metric for assessing the quality of radiology reports generated by advanced language models.

    2. Overall, MRScore represents a significant advancement in the field of automatic evaluation metrics for radiology reports, offering a promising solution to the limitations of existing NLG metrics in this context.

    3. The paper is not well written.
    4. There are many spelling mistakes.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    After refinement, the paper can be accepted.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors created a new model and evaluated it against traditional metrics; MRScore has shown a stronger relationship with human judgments and better results in model selection. The authors will also share the code.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This study introduces MRScore, an innovative metric designed for evaluating automated radiology report generation. The model is developed in collaboration with professional radiologists, which strengthens the applicability of the method in real clinical scenarios.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors explained their method clearly. However, they should have discussed the superiority of their method over Gemma 7B, which has the closest p-value and correlation, also at 6.4 million trainable parameters.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors should have described the generated dataset and its complexity relative to the original dataset. They have mentioned that they will put the code on GitHub; they should also have released the generated data.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should include a comparison of their model with Gemma 7B. They should include their paired dataset. The authors should discuss Figure 3 in more detail.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I found it very well organized and original, and the authors gave attention to making it smooth and understandable. It is clinically needed and should be adopted in other generative LLM research in radiology. The authors should also explain how they integrate the new error-based system into their scoring; the integration of the new error types into the system is not easy to understand.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available; this paper was early-accepted.


