Abstract

Automated generation of radiology reports from chest X-rays has the potential to substantially reduce the workload of radiologists. Recent advances in deep learning-based report generation have achieved significant results, benefiting from the incorporation of medical knowledge. However, incorporating additional knowledge or constraints into existing models often requires either altering network structures or task-specific fine-tuning. In this paper, we propose an energy-based controllable report generation method, named ECRG. Specifically, our method directly utilizes diverse off-the-shelf medical expert models or knowledge to design energy functions, which are integrated into pre-trained report generation models during the inference stage, without any alterations to the network structure or fine-tuning. We also propose an acceleration algorithm to improve the efficiency of sampling the complex multi-modal distribution of report generation. ECRG is model-agnostic and can be readily applied to other pre-trained report generation models. Two cases are presented on the design of energy functions tailored to medical expert systems and knowledge. Experiments on the widely used Chest ImaGenome v1.0.0 and MIMIC-CXR datasets demonstrate the effectiveness of the proposed approach.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0765_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Hou_EnergyBased_MICCAI2024,
        author = { Hou, Zeyi and Yan, Ruixin and Yan, Ziye and Lang, Ning and Zhou, Xiuzhuang},
        title = { { Energy-Based Controllable Radiology Report Generation with Medical Knowledge } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces a novel method for generating radiology reports using a combination of local and global features from chest X-ray (CXR) images, employing a linear combination of Energy-based Language Models (ELM) and global comprehension techniques. This method ensures the quality of generated reports by incorporating anatomical region-based prior knowledge to mitigate issues like text degradation and information loss prevalent in other models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Novel Formulation: The integration of ELM with global comprehension techniques is novel. This dual approach allows the model to consider both detailed local information and broader context, which is crucial for medical report generation.
    2. Original Data Use: The method leverages pre-trained expert systems in a unique way by using energy functions designed around attributes of these systems, rather than relying on structural modifications or extensive retraining.
    3. Demonstration of Clinical Feasibility: The paper presents a convincing demonstration of the model’s clinical applicability, showing how it can generate more accurate and comprehensive reports than existing methods.
    4. Strong Evaluation: The evaluation methodology is robust, comparing the model against several benchmarks and demonstrating improvements in generating clinically relevant text.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Innovation on Data Use: While the use of pre-trained systems is presented as novel, the concept of leveraging existing models for new applications has been explored in prior studies like Zhang et al. (2020) and Wang et al. (2022), which might reduce the perceived novelty.
    2. Lack of Detailed Algorithmic Transparency: The paper does not sufficiently detail the algorithmic adjustments made to the underlying probability distributions, which could affect the reproducibility and understanding of the model’s improvements.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    • Consider expanding the methodology section with more detailed explanations or step-by-step breakdowns of the algorithms and their modifications.
    • Include pseudo-code or more comprehensive diagrams to illustrate how the local and global features are integrated.
    • Provide a deeper comparison with existing methods, possibly through additional experiments or a broader range of datasets.
    • Discuss potential limitations or biases introduced by the reliance on pre-trained systems and how they might be addressed.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The major factors influencing the score are the innovative integration of local and global features and the clinical applicability of the method. However, the lack of detailed algorithmic transparency and the need for a more robust demonstration of novelty compared to existing literature slightly lower the overall score.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the thorough response from the authors. Although they have provided more information on the novelty and methodology, particularly regarding the expert models, the overall score is slightly impacted by the continued lack of detailed algorithmic transparency and the need for a stronger demonstration of novelty in comparison to existing literature.



Review #2

  • Please describe the contribution of the paper

    The authors propose an extension to emerging deep learning methods for automated generation of radiology reports from chest X-rays. The contribution is an energy-based controllable report generation method (ECRG). Energy functions are designed and integrated into report generation models during the inference stage to add more information without fine-tuning or alterations to the network structure. The method is model-agnostic and can be readily used with other pre-trained report generation models.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Figure 1 is a great overview figure which helps set up your contributions and architecture
    • There is a lot of technical detail and a lot to digest in this paper but the methods section is clearly written. All sections are easy to follow and build upon one another well, they are also well documented.
    • The basis and idea of this manuscript is great, retraining or adapting to every specific domain is a major hurdle. Having a way to leverage expertise in different scenarios without having to make major changes to the model itself would make major LLMs more approachable and usable.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • In the introduction, you present two state-of-the-art models, both trained in different ways for X-ray report generation, R2Gen and PPKED; in your experiments, however, you choose to use the RGRG model. It would be great if you presented that model earlier along with the other ones.
    • There is no reasoning stated for choosing RGRG other than that the method is agnostic and therefore any model can be chosen. It would be great to have another model shown in your results to prove and support that the method is agnostic and improves other models as well as the selected one.
    • More detail is needed around the expert model Pexpt: you refer to these in Section 2.1 and in Figure 1 as pre-trained black-box expert systems but never explicitly define which one was used for the results. I assume that different expert models can lead to different performance, as they are involved in your E(x) calculations.
    • Relating to the expert models, there needs to be more discussion about them in general. In the abstract it is said that the method “utilizes diverse off-the-shelf medical expert models”. More explanation is needed of what that landscape looks like, and of the requirements and limitations of these expert models that will either make this work successful or limit it. Expertise is often the main hurdle in translating regular work to the clinical domain, so what is the availability of such models? If the authors feel this was already covered, please be clearer, or perhaps create a subheading for it; it may have gotten lost or been unclear when the expert models were discussed.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • Where is the code for creating these energy functions? Will it be available?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Section 3.2: “As for the CE metrics that measure the diagnostic accuracy of generated reports.” Something seems to be missing from this sentence.
    • Section 2.2, “Fusion of Anatomical Region-based Prior Knowledge”: in this section you refer to your results figure (Figure 2), which is pages away in the results section. The methods should not refer to the results.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    There is really great technical detail in this paper and it is communicated in a way that can be followed. The idea of making language models more domain-specific without retraining and restructuring existing models is a great addition to the literature and would indeed help solve some of the domain-specific problems and hurdles that medical AI faces. However, I think the limitation of still relying on what are described as “off-the-shelf expert models” needs to be better discussed and explored in the manuscript. It is one of the fundamental parts of the work and might face some of the same issues the work is trying to solve; if the authors describe and explore that area more, the manuscript will be well rounded.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the detailed response that the authors gave to our questions. I appreciate the detail about the expert models and I hope this detail makes it into the paper, because I still feel that there is a lot of talking around them in the original manuscript. Knowing how to get the same outcomes (either by training the expert models oneself or by pulling the same pre-trained models as the authors) is important for reproducibility. I still feel that simple details are lacking, such as: are they open-sourced? They are pre-trained, as the authors say, but who trained them, and what were the methodology details? I see that R4 also felt the need for better comparison with other models, but I also understand that this work requires a lot of explanation and detail and there are confines to the manuscript length.



Review #3

  • Please describe the contribution of the paper

    The authors propose an energy-based controllable report generation method that uses energy functions which can be integrated into existing report generation models without changes to the network architecture. These energy functions can be obtained from medical expert models or knowledge. The authors also use an algorithm based on flat histogram simulation to improve report generation efficiency.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The problem statement is promising and the problem formulation is novel: the authors have chosen automatic report generation, a challenging task that is yet to be widely explored.
    2. The model architecture has been explained clearly.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The energy functions could have been explained better; for instance, what loss function is used for the estimation of Eglobal?
    2. The final stage depends on accurate region detection. Can the authors explain the method’s reliability on this step?
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The loss functions and energy functions are quite complex and require a set of parameters whose values could have been provided.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In Figure 1, what does expert system 1 and 2 indicate?
    2. The metrics for ECRG (full) are lower than for ECRG (multi); can the authors explain this better?
    3. Adding a few misclassified/incorrectly predicted sequences would be useful.
    4. Statistical evaluation needs to be performed for ablation study and comparisons.
    5. On page 1, the authors have listed a few existing methods; why not compare against those methods too? An explanation or comparison would be good. Also refer to the points mentioned in the weaknesses.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have chosen a challenging problem statement and provided reliable results to establish the proof of concept. Further experimentation, comparison with the state of the art, and statistical evaluation are required.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank all reviewers for their constructive comments. Due to strict space limitations, we respond to their main concerns below.

Reviewer#1: Q1: More discussion about the expert models in general. A1: In general, each expert model scores a desired attribute of the generated reports. These attributes can be medical diagnosis information, disease distribution, pathological relationships, etc. We view the product of these medical expert models as a probabilistic energy model, allowing a flexible combination of various heterogeneous attributes and refining the report generation distribution during the inference phase. To ensure applicability, a pre-trained expert model in the proposed ECRG framework must accurately reflect, through its score, the degree to which the relevant medical attributes are satisfied; this score is used to calculate the energy function in Equation 2 and thus affects the quality of the generated report. Regarding the limitations of ECRG: first, the performance of the pre-trained model is a significant factor affecting the final report generation results, which also applies to other methods using pre-trained systems. Additionally, although the ECRG framework can flexibly combine various heterogeneous energy functions through linear combination, different energy functions may inhibit each other. As shown in Table 1, the ECRG (full) model may be slightly worse than models assisted by only one expert model on some metrics. Addressing these problems warrants in-depth research and discussion in the future.

Reviewer#3: Q2: What is the innovation of the proposed method compared to previous studies using pre-trained systems? A2: In essence, our main contribution is the proposal of an Energy-based Controllable Radiology Report Generation (ECRG) framework. Compared with other report generation methods using existing models, the energy-based framework has the following characteristics and advantages: First, each pre-trained expert model only needs to calculate a scalar score as an energy value to reflect the satisfaction of a specific attribute. Second, the product of these medical expert systems is viewed as a probabilistic energy model, allowing a flexible combination of various heterogeneous attributes without designing specialized structures. Finally, ECRG circumvents the training process and refines the report generation distribution during the inference stage, requiring no task-specific fine-tuning or gradient optimization. In addition, we also propose an acceleration algorithm to improve the efficiency of sampling the complex multi-modal distribution of report generation. Regarding prior studies, Zhang et al. (2020) modeled medical knowledge using a graph convolutional neural network, and Wang et al. (2022) introduced a medical concepts generation network to predict fine-grained semantic concepts. These methods require designing specific network structures for each situation and necessitate retraining on large-scale medical datasets, resulting in poor flexibility. Thus, our approach represents a novel technical solution for controllable report generation and provides a new perspective on addressing the challenges of this topic.
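The inference-time refinement described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy scorer, the weights, and the candidate reports are placeholders standing in for the actual expert systems and the pre-trained generator, and the sketch shows only the reranking view of reweighting candidates by exp(-E(x)).

```python
# Hedged sketch: each expert returns a scalar score for one attribute, the
# scores are combined linearly into an energy E(x), and candidate reports from
# a pre-trained generator are reweighted by exp(-E(x)) at inference time.
# The scorer, weights, and candidates below are illustrative placeholders.

def toy_diagnosis_expert(report: str) -> float:
    """Placeholder expert: fraction of required findings mentioned in the report."""
    required = ("cardiomegaly", "effusion")
    hits = sum(term in report.lower() for term in required)
    return hits / len(required)  # in [0, 1]; higher = attribute better satisfied

def energy(report, experts, weights):
    """Linear combination of per-attribute energies: E(x) = sum_i w_i * (1 - score_i(x))."""
    return sum(w * (1.0 - f(report)) for f, w in zip(experts, weights))

def rerank(candidates, lm_logprobs, experts, weights):
    """Order candidates by log p_LM(x) - E(x), i.e. by p_LM(x) * exp(-E(x)), best first."""
    scored = sorted(
        ((lp - energy(x, experts, weights), x) for x, lp in zip(candidates, lm_logprobs)),
        reverse=True,
    )
    return [x for _, x in scored]

candidates = [
    "Mild cardiomegaly with a small right pleural effusion.",
    "No acute cardiopulmonary findings.",
]
lm_logprobs = [-2.0, -1.9]  # the base generator slightly prefers the second report
ranked = rerank(candidates, lm_logprobs, [toy_diagnosis_expert], [1.0])
# the expert energy flips the ranking toward the report covering the required findings
```

Because the combination is a plain weighted sum, heterogeneous experts can be mixed without any specialized structure, which is also why, as noted in A1, individual energy terms can inhibit each other.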

Reviewer#4: Q3: Further experimentation and comparison. A3: We select one of the state-of-the-art methods, RGRG, as the baseline model and present two examples illustrating the use of pre-trained expert models and medical prior knowledge to design energy functions. Ablation results demonstrate the effectiveness of the energy functions designed in these two cases on the baseline model. In fact, the baseline model RGRG achieves better performance than the existing methods listed on page 1. Due to strict space limitations, we are very willing to conduct comprehensive experimental evaluations in the supplementary materials or in future work, including comparisons with more existing methods and a broader range of datasets.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All three reviewers have seen the merit of this paper in terms of novelty of formulation but have still argued that the algorithmic details could have been better. The rebuttal was also seen as convincing by all the reviewers.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I concur with Meta-Reviewer #4 about the paper’s lack of in-depth evaluation and agree with Reviewer #4 on the need for comprehensive statistical analysis. The paper’s experimental scope is too narrow, only comparing with the RGRG model, and should include a broader range of state-of-the-art models to validate its contributions. Additionally, the paper lacks detailed algorithmic transparency, particularly regarding the expert models used in the experiments, which are mentioned but not specified. Expanding on these details will significantly enhance the clarity and impact of the paper.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Paper has enough novelty and results to justify an accept



