Abstract

Existing mainstream approaches follow the encoder-decoder paradigm for generating radiology reports. They focus on improving the network structure of encoders and decoders, which leads to two shortcomings: overlooking the modality gap and ignoring report content constraints. In this paper, we propose Textual Inversion and Self-supervised Refinement (TISR) to address these two issues. Specifically, textual inversion projects text and images into the same space by representing images as pseudo words, eliminating the cross-modal gap. Subsequently, self-supervised refinement refines these pseudo words through contrastive loss computation between images and texts, enhancing the fidelity of generated reports to images. Notably, TISR is orthogonal to most existing methods and can be applied in a plug-and-play manner. We conduct experiments on two widely used public datasets and achieve significant improvements on various baselines, which demonstrates the effectiveness and generalization of TISR. The code will be available soon.
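
For readers unfamiliar with the two components, the PyTorch sketch below illustrates the general idea described in the abstract: a mapping network turns image features into pseudo-word embeddings (textual inversion), and a symmetric contrastive loss pulls each image toward its own pseudo-words (self-supervised refinement). All names, dimensions, and architectural choices here (TISRSketch, the mean pooling, the CLIP-style InfoNCE form) are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TISRSketch(nn.Module):
    """Hypothetical head: textual inversion (image -> K pseudo-words) plus a
    projection of image features into the same text space for the contrastive loss."""

    def __init__(self, img_dim=2048, txt_dim=512, num_pseudo_words=4):
        super().__init__()
        self.k = num_pseudo_words
        # textual inversion: pooled image features -> K pseudo-word embeddings
        self.inversion = nn.Sequential(
            nn.Linear(img_dim, txt_dim * num_pseudo_words),
            nn.ReLU(),
            nn.Linear(txt_dim * num_pseudo_words, txt_dim * num_pseudo_words),
        )
        # project image features into the text space for the contrastive objective
        self.img_proj = nn.Linear(img_dim, txt_dim)

    def forward(self, img_feats):                             # img_feats: (B, img_dim)
        pseudo = self.inversion(img_feats)                    # (B, K * txt_dim)
        pseudo = pseudo.view(img_feats.size(0), self.k, -1)   # (B, K, txt_dim)
        return pseudo, self.img_proj(img_feats)               # pseudo-words, projected image


def refinement_contrastive_loss(img_emb, pseudo_words, temperature=0.07):
    """Symmetric (CLIP-style) InfoNCE between images and their pseudo-word summaries."""
    txt = F.normalize(pseudo_words.mean(dim=1), dim=-1)       # pool pseudo-words: (B, D)
    img = F.normalize(img_emb, dim=-1)                        # (B, D)
    logits = img @ txt.t() / temperature                      # (B, B) cosine-similarity matrix
    targets = torch.arange(img.size(0), device=img.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

In training, such an auxiliary loss would simply be added to the standard report-generation loss; per the author feedback below, TISR is used only during training, so inference is unchanged.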

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1810_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Luo_Textual_MICCAI2024,
        author = { Luo, Yuanjiang and Li, Hongxiang and Wu, Xuan and Cao, Meng and Huang, Xiaoshuang and Zhu, Zhihong and Liao, Peixi and Chen, Hu and Zhang, Yi},
        title = { { Textual Inversion and Self-supervised Refinement for Radiology Report Generation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The study proposes a novel approach to enhance the performance of radiology report generation systems. It incorporates a textual inversion module to bridge the modality gap between image and text. This module employs a technique to convert image features into textual features, generating pseudo-words that encapsulate visual and linguistic characteristics. These pseudo-words reduce the modality gap by serving as an intermediary representation. Furthermore, the proposed method leverages a self-supervised learning strategy that minimizes the contrastive loss between image and pseudo-word features. By optimizing this contrastive loss, the model is guided to generate reports that are more accurate and faithful to the input image.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1- Rather than modifying the architecture, this approach aims to bridge the gap between different modalities (text and images) by employing pseudo-words and self-supervised refinement techniques. This contribution is independent of architectural changes and can therefore be applied to various methods and enhance their performance. 2 - The proposed method demonstrates superior results across multiple evaluation metrics on two distinct datasets, outperforming other approaches.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1- The proposed method section could benefit from additional clarity. While some network architectures are borrowed from other papers, it would be helpful to provide more information about the functions $f_e$, $f_d$, and $f_l$, or at least mention the dimensions of their inputs and outputs. 2- On page 4, it is stated that $O’$ is computed by decoding $P”$, but the decoding process is not explained. It is unclear whether the same text decoder used previously is employed or if a different approach is taken. If the same text decoder is used, clarification on how it is applied would be beneficial. 3- The ablation study results presented in Table 2 are unclear. According to the proposed method, the self-supervised refinement module operates on the output of the textual inversion module (i.e., the pseudo-words). However, if one module depends on the other, it is unclear how they can be evaluated separately.

    • Some of the baseline results are slightly worse than those reported in the original papers, which raises some concern about whether the baseline methods were implemented correctly.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The details of implementation are not provided. For instance, no information is given about the architecture of the multi-layer perceptron described in equation (6). Since the code has not been made available yet, even though the authors have said they will release it soon, there are some concerns about whether the results can be easily replicated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1- The method section needs to have more in-depth explanations. Specifically, additional details should be provided about the architectures of the components $f_e$, $f_d$, and $f_l$, such as the input and output dimensions of each. 2- The self-supervised refinement section requires further clarification. For example, it should explicitly explain how the output $O’$ is computed. The current description, which states that $O’$ is obtained by decoding $P’’$, is ambiguous and needs to be elaborated on to clearly describe the decoding process. 3- More information is needed regarding the ablation studies presented in Table 2. It is unclear how the individual contributions of textual inversion and self-supervised refinement are separately evaluated and quantified.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the paper introduces an innovative technique to enhance the quality of report generation, diverging from previous approaches that focused on architectural modifications. Due to this novelty, I consider the paper to be suitable for acceptance. Nonetheless, certain sections lack clarity and detail, and the explanation of the proposed method could be improved for better comprehension.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper bridges the modality gap by transforming visual features into the linguistic space through textual inversion. The self-supervised refinement module then searches for text representations close to the image content by minimizing a contrastive loss.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The textual inversion module can generate pseudo-words to project text and images into the same space and eliminate the modality gap.
    2. The article uses contrastive learning during training, which can also reduce the modality gap between text and images.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of clarity in the description of the methods, especially in the Textual Inversion section.
    2. Too few comparison experiments; at minimum, some newer methods should be compared (e.g., methods that use large models).
    3. The experimental details section should explain the specific settings of the Image Encoder, Textual Inversion, and Text Decoder.
    4. Insufficient settings for the ablation experiments.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Make the description of the methods clearer, especially in the Textual Inversion section.
    2. Compare the proposed method with other methods that use large models.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper makes some contributions to report generation.

    1. It bridges the modality gap by transforming visual features into the linguistic space through textual inversion.
    2. The self-supervised refinement module searches for text representations close to the image content by minimizing a contrastive loss.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors propose Textual Inversion and Self-supervised Refinement (TISR) to bridge the gap between visual and textual encoded features, addressing the cross-modality mismatch that exists between textual and image data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel idea: The authors propose TISR, which minimizes a contrastive loss between text and image representations using self-supervised learning. This can help reduce the expensive annotation workload.

    Rigorous experiments: The authors compared their proposed idea against three baseline models on two datasets. Results show that TISR consistently improves all models on both datasets. Furthermore, the authors also present thorough ablation experiments, which are beneficial for understanding the effectiveness of each component.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Marginal improvement in the results: Although all models improve after applying TISR, the gains are small, ranging from 0.0009 to 0.027 (Table 1). How much additional computation time is needed for such an improvement?

    Lack of clarity in the experimental setup: How is clinical efficacy computed? How is the CheXbert ground truth obtained?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors mention that the code will be available soon.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The idea of proposing a way to bridge the gap between textual and image representations is interesting. I personally like the idea very much, and the writing is very clear, which makes the paper easy to read.

    However, I would really like to see how much computational cost is needed for this marginal increase. Moreover, it would also be interesting to know the authors' thoughts on how to further improve performance.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The writing is clear and easy to follow. The authors explain the idea very well and provide all the necessary information. The idea of bridging the gap between text and images using textual inversion and self-supervised refinement is novel.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely thank the reviewers for their professional suggestions and appreciation of our work. All reviewers agree that this paper is novel, well-written, and valuable. We give answers (A) to all questions (Q) below.

Reviewer#1:
Q1: Marginal improvement in the results. Thanks! 1) As observed in previous studies [3,13,15], this task yields stable metrics with relatively small improvements. 2) Although some metrics show small improvements, our model achieves significant enhancements across most metrics, demonstrating the overall effectiveness of TISR.
Q2: Computation cost of TISR. 1) The training time varies across baselines and datasets, but it increases by no more than 30% compared to the original time. 2) TISR is integrated solely during training, so the inference time remains identical to that of the original network.
Q3: Clarity of the experimental setup. 1) Clinical efficacy is evaluated using precision, recall, and F1, which are obtained by comparing the labels extracted from the generated reports to those from the ground truth, allowing us to assess the model’s performance in identifying important medical conditions. 2) CheXbert [26] is widely used as an automated deep-learning-based chest radiology report labeler that produces labels for 14 medical observations [3,4,5,13,15,17,22,27]. By inputting the generated reports and the ground truth into CheXbert, we can extract labels for the 14 important medical conditions from each report.

Reviewer#3:
Q1: Clarity and explanation of $f_e$, $f_d$, and $f_l$. Sorry for the oversight. $f_l$ is a linear layer that maps the outputs of the decoder to the vocabulary size, with input dimension equal to the dimension of the decoder output and output dimension equal to the vocabulary size. $f_e$ and $f_d$ can differ according to the baselines [10,16,27,29,30]. In this paper, $f_e$ is a ResNet101 pretrained on ImageNet, and $f_d$ varies depending on the baseline used [3,4,5]. We will add these details in the final version.
Q2: Computation of $O’$. We sincerely apologize for our negligence. $O’$ is computed by decoding $P”$ and $I’$. We will correct this in the final version.
Q3: Decoding process. 1) In Eq. (1), $O$ is obtained by decoding the text features $T$ and image features $I$. Similarly, $O’$ is obtained by decoding the refined text features $P”$ and image features $I’$. Thus, the same decoder can be used for both. 2) Additionally, using the same decoder allows the network to share parameters, which benefits optimization and improves efficiency during inference.
Q4: How the components are separately evaluated. 1) The validity of textual inversion is assessed by calculating the contrastive loss between $I$ and $T$ in the self-supervised refinement. 2) The validity of self-supervised refinement is assessed by calculating the contrastive loss between $I’$ and $P$.
Q5: Implementation details and code availability. Sorry for our negligence. The MLP in Eq. (6) contains two linear layers and a ReLU. We will include this detail in the final version. The code will be released soon!
Q6: Implementation of the baselines. All baseline results were reproduced using the official code provided by the authors on GitHub, without any modifications. Most of the metrics we obtained are close to those reported in the original papers [4,5].
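
To make the architectural details stated above (Reviewer#3, Q1 and Q5) concrete, here is a minimal PyTorch sketch of $f_l$ (a single linear layer from the decoder hidden size to the vocabulary size) and of the Eq. (6) MLP (two linear layers with a ReLU in between). The hidden width and vocabulary size are placeholder values, since only the layer count and activation were stated in the rebuttal.

```python
import torch.nn as nn

hidden_dim, vocab_size = 512, 4000   # placeholder sizes, not taken from the paper

# f_l: maps decoder outputs to vocabulary logits
f_l = nn.Linear(hidden_dim, vocab_size)

# MLP of Eq. (6): two linear layers with a ReLU in between
# (the hidden width is assumed equal to the input width)
mlp = nn.Sequential(
    nn.Linear(hidden_dim, hidden_dim),
    nn.ReLU(),
    nn.Linear(hidden_dim, hidden_dim),
)
```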
Reviewer#4:
Q1: Explanation of methods. We apologize for any confusion caused and will address this in the final version by providing more details, such as input and output dimensions, module structures, explanations of equations, and more.
Q2: Comparison with newer methods. Thanks for your suggestion! We conducted experiments on CvT2DistilGPT2 [22], which uses GPT-2, and observed performance enhancement.
Q3: Experimental details. We apologize for the oversight and will add further details in the final version. For the specific settings of the image encoder and text decoder, please refer to R3 Q1.
Q4: Insufficient settings for ablation experiments. We will add more explanations and settings for the ablation experiments in the final version.
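
As a complement to the clinical-efficacy protocol described above (Reviewer#1, Q3), the sketch below shows one common way to compute precision, recall, and F1 from CheXbert label matrices for the generated and ground-truth reports. The micro-averaging over all 14 observations is an assumption on our part; the paper may use a different averaging scheme.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support


def clinical_efficacy(gt_labels, gen_labels):
    """gt_labels, gen_labels: (N_reports, 14) binary arrays of CheXbert observations."""
    p, r, f1, _ = precision_recall_fscore_support(
        np.asarray(gt_labels).ravel(),   # flatten across reports and observations
        np.asarray(gen_labels).ravel(),
        average="binary",
        zero_division=0,
    )
    return {"precision": p, "recall": r, "f1": f1}
```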




Meta-Review

Meta-review not available; the paper was early accepted.


