Abstract

Medical report generation (MRG) has great clinical potential, as it could relieve radiologists of the heavy workload of report writing. One of the core challenges in MRG is establishing accurate cross-modal semantic alignment between radiology images and their corresponding reports. Toward this goal, previous methods have made great efforts, progressing from case-level alignment to more fine-grained region-level alignment. Although achieving promising results, they (1) either perform implicit alignment through end-to-end training or rely heavily on extra manual annotations and pre-training tools; and (2) neglect to leverage high-level inter-subject semantic (e.g., disease) alignment. In this paper, we present Hierarchical Semantic Alignment (HSA) for MRG in a unified game-theory-based framework, which achieves semantic alignment at multiple levels. To address the first issue, we treat image regions and report words as binary game players and value possible alignments between them, thus achieving explicit and adaptive region-level alignment in a self-supervised manner. To address the second issue, we treat images, reports, and diseases as ternary game players, which enforces cross-modal cluster assignment consistency at the disease level. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our proposed HSA over various state-of-the-art methods.
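
As a rough, illustrative aside (not part of the paper): the region-level idea above, image patches and report words as players of a cooperative game whose pairwise payoff reflects embedding similarity, can be sketched minimally as follows. The cosine payoff, the temperature softmax, and all names here are assumptions for illustration only; the paper's actual game-theoretic valuation may differ substantially.

# Minimal sketch of region-level soft alignment -- NOT the authors' method.
# Assumptions: patch/word embeddings are given; the "payoff" between a
# patch and a word is cosine similarity; soft alignment weights come from
# a temperature softmax (a cheap stand-in for a cooperative-game valuation).
import numpy as np

def cosine_payoff(patches: np.ndarray, words: np.ndarray) -> np.ndarray:
    """Pairwise payoff V[i, j] between patch i and word j."""
    p = patches / np.linalg.norm(patches, axis=1, keepdims=True)
    w = words / np.linalg.norm(words, axis=1, keepdims=True)
    return p @ w.T  # shape: (num_patches, num_words)

def soft_alignment(payoff: np.ndarray, tau: float = 0.07) -> np.ndarray:
    """Average of patch-to-word and word-to-patch softmax attention."""
    def softmax(x: np.ndarray, axis: int) -> np.ndarray:
        z = x / tau
        z = z - z.max(axis=axis, keepdims=True)  # numerical stability
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)
    return 0.5 * (softmax(payoff, axis=1) + softmax(payoff, axis=0))

rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 256))  # e.g., 7x7 grid of ViT patch embeddings
words = rng.normal(size=(30, 256))    # report token embeddings
align = soft_alignment(cosine_payoff(patches, words))
print(align.shape)  # (49, 30): one soft alignment weight per patch-word pair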

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1475_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1475_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Zhu_Multivariate_MICCAI2024,
        author = { Zhu, Zhihong and Cheng, Xuxin and Zhang, Yunyan and Chen, Zhaorun and Long, Qingqing and Li, Hongxiang and Huang, Zhiqi and Wu, Xian and Zheng, Yefeng},
        title = { { Multivariate Cooperative Game for Image-Report Pairs: Hierarchical Semantic Alignment for Medical Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a game-theoretic approach to medical report generation. They hypothesize that current SOTA models lack alignment and, therefore, propose a disease-level alignment. Strong comparisons with other methods are well documented.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The introduction of game theory to medical report generation is an interesting direction that the authors pursue.
    • The comparison experiments are well documented and show strong statistically significant improvements over other methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Figure 1 shows a bounding box on the region level. But I don’t see any mention of it or how it was created during the training process. Please explain how the bounding box was created and how that is incorporated into the patches for the region-level alignment.
    • Is the choice of the disease prototype a hyperparameter? If so, will MIMIC have 14 prototypes to capture the disease-level alignment?
    • From Figure 2, it is unclear how the proposed inter-subject interaction is achieved for the disease-level prototype. The inputs take only a single CXR and the corresponding report during training. From what I understand, inter-subject is between subjects and not intra-subject (within patient data). How is this achieved for the disease-level alignment?
    • The CE loss is not mentioned anywhere in the paper, but it has been used in the diagram and equation 5. How is that used, or what does it stand for?
    • It would help if the authors also delineated how inference with this model is run.
    • While the experimental comparisons with the other competing methods are well documented, one crucial ablation study is missing. The authors only compare with the others while fixing a pre-trained ViT and SciBERT as their image and report encoders, respectively. However, readers would appreciate knowing what would happen with much more powerful encoders that have strong visual-text alignment, such as a CLIP model. How would the model perform, and moreover, can the game-theory additions of CE, RA, and CA still help? How much would it improve over the BASE in Table 2a if a better-aligned CLIP model were used?
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    My comments have been included in the weaknesses section. Please refer to the section above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The methodology section lacks clarity, and experimental details, including ablation studies, are missing.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a framework that utilizes game theory to construct hierarchical semantic alignment for medical report generation. The proposed framework efficiently achieves region-level, disease-level, and case-level cross-modal semantic alignment, thereby further enhancing the model performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) This work integrates game theory into medical report generation, achieving hierarchical semantic alignment (HSA).

    (2) The HSA is unsupervised, requiring no labor-intensive annotation, which is beneficial in clinical practice.

    (3) The model achieves state-of-the-art performance on the IU-Xray and MIMIC-CXR benchmark datasets.

    (4) The authors provide detailed explanations of the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    First, the IU-Xray dataset was randomly split into training, validation, and testing sets with a ratio of 7:1:2 in the reference paper you mentioned in the manuscript. Therefore, your training, validation, and testing sets may not be the same as those used by the comparison methods. Do you think it’s reasonable to compare with state-of-the-art methods in this way to demonstrate the effectiveness of the proposed method? I believe the authors should apply the methods to the same dataset split, or at least validate some of the state-of-the-art models that achieved results similar to those listed in the paper.

    Second, if the authors only conducted one experiment, how can they obtain the p-value based on statistical analysis?

    Third, the tables don’t look good without top and bottom borders.

    Fourth, the authors didn’t mention whether they plan to release the code in the future.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the main weaknesses of the paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Mainly based on the main weaknesses and the main strengths of the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors addressed most of my comments. For the first comment, the authors stated, ‘We adopted the same setting used in references [14, 6, 37] to split the IU-Xray for a fair comparison, which has been widely utilized by MRG baselines.’ However, I believe the authors should apply the methods to the same dataset, or at least validate some of the state-of-the-art models that achieved results similar to those listed in the paper. I changed my score mainly because all evaluation metrics on the IU-Xray data are better than the baselines, with some even higher by 3%-4%.



Review #3

  • Please describe the contribution of the paper

    The authors propose an approach to radiology report generation on chest X-rays using a novel alignment method. They propose to align image, report, and disease features on three levels (disease-, case-, and region-level) using a game-theory-based method, treating the alignment as a binary or ternary cooperative game. They evaluate their method on two common report generation datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. A novel and interesting approach using a formulation based on game theory. The formulation seems sound and may also be interesting for other alignment approaches, indicating a high relevance of the proposed work to the community
    2. Several ablation studies on the relevance of model components and additional qualitative evaluation
    3. Statistical significance testing is used to show the relevance of reported results + average over 5 runs is reported
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper misses many important and strong baselines on report generation. This includes prominent large models like MAIRA-1 [1] and Med-PaLM M [2], but also smaller models like PromptMRG [3], RaDialog [4], METransformer [5], ITA [6], M2 Trans [7], and CvT-212DistilGPT2 [8]:

    [1] Hyland, Stephanie L., et al. “MAIRA-1: A specialised large multimodal model for radiology report generation.” arXiv preprint arXiv:2311.13668 (2023).
    [2] Tu, T., Azizi, S., Driess, D., et al. “Towards generalist biomedical AI.” NEJM AI 1(3) (2024).
    [3] Jin, Haibo, et al. “PromptMRG: Diagnosis-driven prompts for medical report generation.” AAAI (2024).
    [4] Pellegrini, Chantal, et al. “RaDialog: A large vision-language model for radiology report generation and conversational assistance.” arXiv preprint arXiv:2311.18681 (2023).
    [5] Wang, Zhanyu, et al. “METransformer: Radiology report generation by transformer with multiple learnable expert tokens.” CVPR (2023).
    [6] Wang, L., Ning, M., Lu, D., Wei, D., Zheng, Y., Chen, J. “An inclusive task-aware framework for radiology report generation.” MICCAI (2022).
    [7] Miura, Y., Zhang, Y., Tsai, E., Langlotz, C., Jurafsky, D. “Improving factual completeness and consistency of image-to-text radiology report generation.” NAACL (2021).
    [8] Nicolson, A., Dowling, J., Koopman, B. “Improving chest X-ray report generation by leveraging warm starting.” Artificial Intelligence in Medicine 144 (2023).

    2. Low performance compared to recent methods on MIMIC-CXR (which can be considered the standard benchmark for report generation and on which most baselines are available), while claiming state-of-the-art performance. Some examples include:
      • BLEU-1: MAIRA-1: 39.2, CvT-212DistilGPT2: 39.2, METransformer: 38.6; compared to the reported 38.6
      • BLEU-4: MAIRA-1: 14.2, CvT-212DistilGPT2: 12.4, METransformer: 12.4; compared to the reported 12.0
      • METEOR: MAIRA-1: 33.3, CvT-212DistilGPT2: 15.3, METransformer: 15.2; compared to the reported 16.3
      • ROUGE-L: MAIRA-1: 28.9, CvT-212DistilGPT2: 28.5, METransformer: 29.1; compared to the reported 28.8
      • CIDEr: CvT-212DistilGPT2: 36.1 , METransformer: 36.2; compared to the reported 28.7
      • CE F1: CvT-212DistilGPT2: 38.4; PromptMRG: 47.6; compared to the reported 37.9
    3. Reported only a subset of CE metrics, complicating comparison with some baselines. Many recent papers report micro- or macro-averaged scores, while the authors only report example-level metrics. Additionally, it is not clearly stated which type of averaging is used; example-level averaging can only be assumed based on the results reported for the baselines. This limits a fair comparison with current models.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Some training details are provided in the supplementary material, but more details would be useful (e.g., number of epochs, learning-rate scheduler).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • As described in the weaknesses, many recent baselines are missing. These should be added to the main table (Tab. 1), or it should be discussed why they are omitted. While a comparison to very large models (like the Med-PaLM M models) may not be fair, these still have to be mentioned.
    • All claims of the proposed method being superior to the state of the art have to be removed, at least for the MIMIC-CXR dataset. It should be argued why the proposed method is still useful and worth publishing despite being outperformed by many methods on the standard benchmark MIMIC-CXR. The authors may still highlight the strong results on the IU-Xray dataset; maybe there is a reason why the method performs strongly on the small dataset while its performance on the larger dataset is not as strong compared to baselines. Is the method perhaps especially useful in small-data regimes?
    • The abstract seems very long. Consider shortening it.
    • Fig. 1 is very small and hard to read
    • Tab. 2 is too small and hard to read. Also the captions of Tabs. 1 and 2 are placed below, while they should be placed above the tables.
    • In Eq. (2), the parentheses of exp are too small. Similar issues can be observed in Eqs. (3)-(5).
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The method seems interesting and novel. It may also be interesting for other alignment approaches, indicating a high relevance of the proposed work to the community; it is definitely worth publishing.
    • Some important baselines are missing, and the method’s results should be reinterpreted in this context, updating the claims of the work accordingly.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    My points have been addressed/clarified in the rebuttal. After adding and discussing the relevant baselines (as promised in the rebuttal), I think the work is worth publishing as the approach seems very interesting and promising.




Author Feedback

We are glad that the novelty (all reviewers), the effectiveness (R#1 and R#5), the comprehensive experiments (all reviewers), and the application and broader impacts (R#1 and R#3) are appreciated. We are pretty excited that R#3 commented that our work is “definitely worth publishing”. In the following, we clarify the reviewers’ concerns and will incorporate improvements in the revision.

Response to Reviewer #1
Q1: Data split of the IU-Xray dataset.
A1: We adopted the same setting used in references [14, 6, 37] to split IU-Xray for a fair comparison, which has been widely utilized by MRG baselines.
Q2: Experimental setting.
A2: As mentioned in Section 4.1, we reported the average results of five runs with different random seeds.
Q3: Presentation improvements of tables.
A3: The tables have been updated to include top and bottom borders for improved appearance.
Q4: Code accessibility.
A4: As mentioned in the last sentence of Section 4.1, we will release our code upon publication.

Response to Reviewer #3
Q1 & Q2: Missing baselines on report generation and low performance on MIMIC-CXR.
A1 & A2: Thank you for your insightful references. We will certainly discuss and cite the works you have highlighted. Specifically, larger models often utilize more advanced checkpoints (e.g., MAIRA-1 fine-tuned Vicuna-7B), while smaller models such as PromptMRG and RaDialog also leverage additional data based on Large Language Models (LLMs). Due to space constraints, we are unable to delve into every baseline in detail. However, our work primarily aims to reduce annotation dependency, making it an unsupervised approach, which is particularly advantageous in clinical settings. In response to your suggestion, we have also compared our method with the omitted references. Despite these methods using relatively advanced checkpoints, our method demonstrated only slightly inferior performance on the BLEU-1, METEOR, and ROUGE-L metrics. We will revise all statements regarding performance on the MIMIC-CXR dataset accordingly.
Q3: Macro, micro, or example-based F1 scores?
A3: In this work, we utilized the example-based macro F1 score following [6] for a fair comparison, and the results are cited from [6, 15]. We will clarify these details in the revision.
Q4: Suggestions on presentation.
A4: We have shortened the abstract and adjusted the size of the figures and tables to enhance readability.

Response to Reviewer #5
Q1: Bounding box in Figure 1.
A1: The bounding box in Figure 1 is solely for illustration; the proposed model does not rely on labeled bounding boxes. For region-level alignment, we use image patches to align with report tokens.
Q2: Is the choice of the disease prototype a hyperparameter?
A2: Yes, you are right. We will clarify this in the revision.
Q3: Inter-subject interaction in Figure 2.
A3: Your understanding is correct. For simplicity, we depicted only a single input pair in Figure 2. Disease-level alignment projects all input pairs within a mini-batch into a Cross-modal Disease Space for calculating the game loss. We will reorganize the right part of Figure 2 accordingly.
Q4: CE loss in Equation 5.
A4: We described the CE (cross-entropy) loss in Section 2.1, Background of MRG; it maximizes pθ(Y|I) by minimizing the negative log-likelihood loss. We will include the formal formula in the revision.
Q5: The inference of the model.
A5: As mentioned in the last paragraph of Section 3, HSA is only active during training and is removed during inference. Consequently, the inference process is identical to that of using only the backbone model.
Q6: Results on CLIP-like models.
A6: We carried out supplementary experiments utilizing the MedCLIP backbone. The results for BL-4, MTR, RG-L, and CDr with MedCLIP were 0.174, 0.198, 0.380, and 0.541, respectively. When using HSA with MedCLIP, the results improved to 0.216, 0.235, 0.427, and 0.602, respectively. These findings suggest that HSA can consistently enhance performance, even when paired with more advanced backbones.
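
As a point of reference for Q4/A4 above (an assumption about notation, since the paper's Eq. (5) is not reproduced on this page): the standard autoregressive cross-entropy objective for report generation, with $Y = (y_1, \ldots, y_T)$ the report and $I$ the image, is

    \mathcal{L}_{\mathrm{CE}} = -\log p_{\theta}(Y \mid I) = -\sum_{t=1}^{T} \log p_{\theta}\left( y_t \mid y_{<t},\, I \right),

i.e., maximizing $p_{\theta}(Y \mid I)$ by minimizing the summed negative log-likelihood of each report token given its predecessors and the image.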




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After carefully reviewing the paper and its experimental results, and given the positive comments, the innovative approach, and significant potential impact of the proposed method in reducing the dependency on annotated data in medical image report generation, the AC recommends acceptance. However, this recommendation comes with the stipulation that the authors address the concerns regarding dataset splitting, include the new relevant baselines, and enhance the presentation quality of tables and figures in their final manuscript.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers liked the game theory approach as a novelty which I agree is a good angle to present in this oft-attempted report generation problem. However, the response to the comparison with relevant baselines which the reviewers asked was rather weak seeming to indicate that they will report on those experiments in the final paper. The appropriate response would have been as to why they weren’t done, or what the conclusion from them was in terms of specific numbers if it was done but not reported in the paper.




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Having read all reviewer comments and rebuttal, I concur that the paper should be accepted - there is some novelty coupled with extensive experiments and analyses on IU-Xray and MIMIC-CXR benchmark datasets showcasing good results.



