Abstract

Automatic radiology report generation is a challenging task that seeks to produce comprehensive, semantically consistent, and detailed descriptions from radiographs (e.g., X-rays), alleviating the heavy workload of radiologists. Previous work explored introducing diagnostic information through multi-label classification. However, such methods can only provide a binary positive or negative classification result, leading to the omission of critical information regarding disease severity. We propose a Graph-driven Momentum Distillation (GMoD) approach to guide the model in actively perceiving the apparent disease severity implicitly conveyed in each radiograph. The proposed GMoD introduces two novel modules: a Graph-based Topic Classifier (GTC) and a Momentum Topic-Signal Distiller (MTD). Specifically, the GTC combines symptoms and lung diseases to build topic maps and focuses on potential connections between them. The MTD constrains the GTC to focus on the confidence of each disease being negative or positive by constructing pseudo labels, and then uses the multi-label classification results to help the model perceive joint features and generate a more accurate report. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate that our GMoD outperforms state-of-the-art methods. Our code is available at https://github.com/xzp9999/GMoD-mian.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1733_paper.pdf

SharedIt Link: https://rdcu.be/dV17H

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72086-4_28

Supplementary Material: N/A

Link to the Code Repository

https://github.com/xzp9999/GMoD-mian

Link to the Dataset(s)

For IU X-Ray: https://drive.google.com/file/d/1c0BXEuDy8Cmm2jfN0YYGkQxFZd2ZIoLg/view?usp=sharing

For MIMIC-CXR: https://physionet.org/content/mimic-cxr/2.0.0/

BibTex

@InProceedings{Xia_GMoD_MICCAI2024,
        author = { Xiang, ZhiPeng and Cui, ShaoGuo and Shang, CaoZhi and Jiang, Jingfeng and Zhang, Liqiang},
        title = { { GMoD: Graph-driven Momentum Distillation Framework with Active Perception of Disease Severity for Radiology Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        pages = {295--305}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    In this paper, a new automatic report generation method is proposed. The two main contributions are: 1) It provides information about the degree or severity of symptoms, thanks to momentum distillation loss constraints that focus the classifier on the confidence of each positive or negative finding. 2) The model is able to capture the potential relationship between pathological features and symptoms, thanks to the use of graph attention mechanisms.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) The state-of-the-art section clearly traces the evolution of automatic report generation methods and the main methodological milestones. It then outlines the challenges to be solved and points out the paper's main contributions to these challenges, although it mainly cites conference papers. 2) The contributions are clear and well stated. 3) It includes an ablation study to analyse the effectiveness of each module.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The performance analysis and comparison with other methods employ natural language generation metrics. However, it would also be good to use metrics that measure clinical efficacy (https://doi.org/10.1016/j.media.2022.102510). 2) The methods cited from the literature are public-repository papers (have they been peer-reviewed?) or conference papers, albeit from top conferences. Examples of possible journal papers for comparison: Yang et al., MIA 2022 (80) and MIA 2023 (86). 3) In the evaluation of the second module, MTD, it is not clear how severity is evaluated. In the Case Study, in one case the occurrence of the word “mild” is highlighted. It would be good if the correct occurrence of words referring to severity were measured. Thus, despite the ablation study, the effectiveness of the second module is not completely proven.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It employs two publicly available databases to analyse the performance and to compare the proposed method with other algorithms in the literature. The manuscript extensively describes the experimental setting: model, hyperparameters, and the metrics used to analyse the performance.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Possible improvements of the paper are: 1) In the performance analysis and comparison with other methods, it would also be good to use metrics that measure clinical efficacy (https://doi.org/10.1016/j.media.2022.102510). 2) The methods cited from the literature are public-repository papers (have they been peer-reviewed?) or conference papers, albeit from top conferences. The same applies to the methods used for comparison. A justification that these methods are those with the best performance in the literature should be added. 3) A method to evaluate the performance of the second module in assessing the severity grade should be included.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The challenges the paper intends to meet are clear, and the proposed method for reaching this goal is novel. However, there are certain shortcomings in the evaluation. In particular, the effectiveness of the second module in assessing the severity grade is not completely proven. Secondly, only natural language generation metrics are used to evaluate the methods. Thirdly, are the methods chosen for comparison those with the best performance in the literature?

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The main contribution of this paper is a novel Graph-driven Momentum Distillation (GMoD) framework for improving automatic radiology report generation from radiographic images. The key innovations are:

    (1) A Graph-based Topic Classifier (GTC) that uses graph attention to capture the relationships between symptoms and diseases from a constructed topic graph, providing more informative visual representations.

    (2) A Momentum Topic-Signal Distiller (MTD) that generates pseudo-labels based on the model’s confidence scores and ground truth, enabling the model to better perceive and describe varying degrees of disease severity instead of just binary presence/absence.

    (3) Integrating the GTC and MTD modules into an encoder-decoder architecture that outperforms previous state-of-the-art methods on public radiology report generation benchmarks by better leveraging diagnostic knowledge and perceptions of disease severity.
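    The graph-attention aggregation attributed to the GTC in (1) can be illustrated with a toy sketch. This is a minimal, single-head, GAT-style aggregation over a small topic graph; the function names, the additive scoring form, and the toy topics are illustrative assumptions, not the paper's exact formulation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def graph_attention(features, adjacency, attn_weight):
    """One head of additive graph attention over a topic graph.

    features: dict node -> feature vector (list of floats)
    adjacency: dict node -> list of neighbour nodes (including self)
    attn_weight: vector scoring the concatenated pair [h_i, h_j]
    Returns updated node features as attention-weighted sums of neighbours.
    """
    def score(hi, hj):
        concat = hi + hj
        return sum(w * x for w, x in zip(attn_weight, concat))

    out = {}
    for i, hi in features.items():
        nbrs = adjacency[i]
        alphas = softmax([score(hi, features[j]) for j in nbrs])
        dim = len(hi)
        out[i] = [sum(a * features[j][d] for a, j in zip(alphas, nbrs))
                  for d in range(dim)]
    return out

# Toy topic graph: an "opacity" symptom node attends to a "pneumonia"
# disease node, so its updated feature mixes in disease evidence.
feats = {"opacity": [1.0, 0.0], "pneumonia": [0.0, 1.0]}
adj = {"opacity": ["opacity", "pneumonia"], "pneumonia": ["pneumonia"]}
updated = graph_attention(feats, adj, [0.0, 0.0, 0.0, 0.0])
```

    With a zero attention vector, each node simply averages its neighbourhood; a learned `attn_weight` would instead emphasise the most relevant symptom-disease connections.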

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The paper proposes a novel framework called GMoD that effectively integrates domain knowledge about diseases and symptoms into radiology report generation in a structured way using graph representations. This knowledge-aware approach shows quantitative improvements over previous methods on benchmark datasets.

    (2) The paper makes two important technical contributions - the Graph-based Topic Classifier (GTC) and the Momentum Topic-Signal Distiller (MTD) module. The GTC uses graph attention to capture potential connections between symptoms and diseases, providing more informative visual representations. The MTD uses momentum distillation to guide the model’s perception of different degrees of disease severity instead of just binary presence/absence.

    (3) The case studies provide good qualitative examples showcasing how the model can detect and describe subtle observations about disease severity better than previous approaches.

    (4) The paper is well-written, with a clear structure and refined presentation.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) I have several concerns about the experiment settings and results, could the authors help me check the following concerns? (a) Do the authors conduct repeated experiments on the two utilized datasets? Usually, we require the authors to conduct repeated experiments with different random states or different sample splits, and then report the average metrics and standard deviation. However, I do not see any statements addressing this concern. The lack of repeated experiments greatly impacts the paper’s technical soundness. (b) Why did you set different baseline methods for comparison between the two datasets? I see that the TopDown method is compared for the MIMIC-CXR dataset, but is ignored for the IU-Xray dataset. Why is this the case? (c) In Table 1, GMoD could not surpass the compared baseline models in many evaluation metrics for the IU-Xray dataset. The authors need to explain why, as this affects your technical soundness.

    (2) A key limitation is the reliance on an external knowledge base or ontology of disease topics and their relationships to construct the topic graph. This requirement severely limits the generalizability of the approach to other domains lacking such well-defined ontologies. The authors could discuss strategies to automatically learn these relationships from data when ontologies are unavailable. Besides, could the authors use some strategies to verify the topic embeddings generated in the first stage, e.g., visualization or a human check? This would greatly increase the interpretability and technical soundness.

    (3) The evaluation is based solely on common text generation metrics like BLEU, METEOR, etc. While useful, these metrics do not directly measure the clinical accuracy, completeness, and utility of the generated reports. The authors could consider including human evaluation by radiologists to assess the practical usefulness of their approach.

    (4) Small mistake: On page 3, the authors write “top k most similar features”, while on page 7, they write “top K”. These should be unified.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the listed strengths and weaknesses above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please see the pointed concerns of the experiments (in question 6 - weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes graph-driven momentum distillation for radiology report generation. It uses a graph-based topic classifier to build topic maps and designs a momentum topic-signal distiller to assist feature extraction for report generation. The method has been tested on the IU-Xray and MIMIC-CXR datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The issue the paper focuses on is sound. (2) The idea of creating non-one-hot pseudo-labels instead of binary classification is interesting, as it can avoid extreme categorization and provide confidence levels for the classifications. (3) Ablation studies are conducted to demonstrate the effect of different components of the framework.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) The paper is somewhat difficult to understand, especially regarding why certain modules are effective. For example, why does using momentum and self-distillation help the model grasp the severity of each disease and symptom? (2) The margin of improvement in Table 1 is small. For example, the proposed method and KiUT differ by ~0.005 on BLEU.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    (1) As mentioned above, a more direct and clear description of the motivation behind the method design is necessary. (2) It is suggested to add a statistical significance analysis of the results. (3) Opening the code online would contribute to the development of the community.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper designs an interesting method to address a clinically significant problem. However, there are some weaknesses as mentioned above.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Reviewer #1 We would like to thank the reviewers for their favorable comments on our work. Questions 3 (1) The reviewer mentioned that we used natural language processing evaluation metrics and that using metrics to measure clinical efficacy would make our method more convincing. We will consider incorporating new evaluation metrics in our future work. (2) The reviewer pointed out that the methods mentioned in the paper are mostly from public repositories or conference proceedings (even though they are top conferences) and that journal papers could be considered. We have added references to journal papers, which further enhances the quality and credibility of our paper. (3) On why the MTD module works: the MTD module is used to assist the GTC module in jointly assessing severity. When conducting disease classification tasks, one-hot labels can only represent the two extreme cases of the presence and absence of a disease in the corresponding radiograph. For example, two radiographs, one showing early-stage pneumothorax and the other late-stage pneumothorax, may exhibit varying severities in the image. Using one-hot labels would classify both as simply having pneumothorax, thereby losing valuable information. Therefore, we use the MTD module to construct a non-one-hot probability distribution as a pseudo label to capture and focus on these differences. The pseudo label is a weighted sum of the true labels and the predictions from the MTD module (see Eqn. 6). The more severe the disease presented in the radiograph, the higher the confidence of the MTD module for that disease; this is the effect achieved by learning through pseudo-label constraints.
Since the model cannot accurately determine the presence of a disease in the early stages of learning, using only the MTD module's predictions as pseudo labels may lead the model to learn too much incorrect information. Therefore, we incorporate the ground-truth labels, enabling the model to learn the severity of each disease while ensuring accurate disease classification.
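The pseudo-label construction described in the rebuttal can be sketched as follows. This is a minimal illustration only: the function names, the EMA momentum value, and the mixing weight `alpha` are assumptions, since the rebuttal states only that the pseudo label is a weighted sum of the ground truth and the momentum model's prediction (Eqn. 6):

```python
def ema_update(student_param, teacher_param, momentum=0.995):
    """Exponential moving average: the momentum (teacher) model slowly tracks
    the student. The value 0.995 is an assumed typical choice."""
    return momentum * teacher_param + (1 - momentum) * student_param

def pseudo_label(momentum_prob, true_label, alpha=0.4):
    """Non-one-hot target: weighted sum of the momentum model's confidence
    and the one-hot ground truth. alpha is a hypothetical weight."""
    return alpha * momentum_prob + (1 - alpha) * true_label

# Two positive cases with different apparent severity yield different soft
# targets, whereas plain one-hot labels would both be exactly 1.0:
mild = pseudo_label(0.60, 1.0)    # early-stage finding, lower confidence
severe = pseudo_label(0.95, 1.0)  # late-stage finding, higher confidence
```

Because the soft target keeps the ground-truth term, a positive case never drifts toward a negative label early in training, while the momentum term still encodes graded confidence that one-hot labels discard.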

Reviewer #3 We thank the reviewers for their recognition of our work. Questions 3 (1) (a) The reviewer pointed out that we did not repeat the experiments multiple times to report average values. We mainly did this to maintain consistency with the comparison papers. (b) The original TopDown paper did not present results on the datasets we used. We reproduced the results from reference [13], which only provided TopDown results on the MIMIC-CXR dataset; therefore, we only compared against it on MIMIC-CXR. (c) The reviewer mentioned that our method did not surpass the comparison models on some metrics on the IU-Xray dataset. The main reason is that the IU-Xray dataset is relatively small and designed for specific tasks, differing in data distribution and task complexity from the larger, more comprehensive MIMIC-CXR dataset. We believe that achieving good results on the MIMIC-CXR dataset is more representative. (2)(3) The automated construction of topic graphs and the addition of manual evaluation, as mentioned by the reviewer, are valuable and meaningful research directions. We will strive to investigate these aspects further in our future work. (4) We have fixed the writing problem.

Reviewer #4 We would like to thank the reviewers for their positive comments on our work. Questions 3 (1) The discussion of why the MTD module is effective can be found in our response to Reviewer #1, point (3). (2) Although there is only a slight improvement in the BLEU-1 metric, the improvement in the BLEU-4 metric is significant. BLEU-1 mainly measures the generation of individual words, while BLEU-4 considers the matching of four-grams. Since diagnostic reports are generally longer than natural text, the BLEU-4 metric is more informative, as it more comprehensively demonstrates the improvement in the generation capabilities of our proposed method.




Meta-Review

Meta-review not available, early accepted paper.


