Abstract

Multimodal pre-training, which learns medical visual representations from paired medical reports, has demonstrated its potential in the medical domain. However, many pre-training tasks require extra annotations from clinicians, and most fail to explicitly guide the model to learn the desired features of different pathologies. In this paper, we utilize Visual Question Answering (VQA) for multimodal pre-training to guide the framework toward targeted pathological features. We leverage descriptions in medical reports to design multi-granular question-answer pairs associated with different diseases, which assist the framework in pre-training without requiring extra annotations from experts. We also propose a novel pre-training framework with a quasi-textual feature transformer, a module designed to transform visual features into a quasi-textual space closer to the textual domain via a contrastive learning strategy. This narrows the vision-language gap and facilitates modality alignment. Our framework is applied to four downstream tasks (report generation, classification, segmentation, and detection) across five datasets. Extensive experiments demonstrate the superiority of our framework compared to other state-of-the-art methods. Our code is available at https://github.com/MoramiSu/QFT.
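To make the quasi-textual transformer idea concrete, below is a minimal, hypothetical sketch of a QFormer-style module in which learnable query tokens cross-attend to patch-level visual features and emit "quasi-textual" tokens intended to lie closer to the text domain. This is not the released implementation: the class name, dimensions, and single-block design are illustrative assumptions.

```python
# Minimal, hypothetical sketch of a QFormer-style quasi-textual feature transformer.
# All names, sizes, and the single-block design are illustrative assumptions.
import torch
import torch.nn as nn

class QuasiTextualTransformer(nn.Module):
    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, visual_tokens):                         # (B, N_vis, dim) image-encoder output
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        attended, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        x = self.norm1(q + attended)                          # residual over the query tokens
        return self.norm2(x + self.ffn(x))                    # (B, num_queries, dim) quasi-textual tokens

# Toy usage: 196 ViT-style patch tokens per image, batch of 4.
quasi = QuasiTextualTransformer()(torch.randn(4, 196, 768))   # -> torch.Size([4, 32, 768])
```

In the paper's framework, tokens of this kind are aligned with the report's text features via contrastive learning and are used to answer report-derived questions.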

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0889_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/0889_supp.pdf

Link to the Code Repository

https://github.com/MoramiSu/QFT

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Su_Design_MICCAI2024,
        author = { Su, Tongkun and Li, Jun and Zhang, Xi and Jin, Haibo and Chen, Hao and Wang, Qiong and Lv, Faqin and Zhao, Baoliang and Hu, Ying},
        title = { { Design as Desired: Utilizing Visual Question Answering for Multimodal Pre-training } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15004},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors aim to propose a multimodal pre-training method using VQA tasks.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors aim to tackle an important problem in ML for Healthcare.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. I do not agree with the authors’ claimed contribution ‘To the best of our knowledge, we are the first to utilize VQA for multimodal pre-training in the medical field …’. For example, [1,2] have already done very similar studies, and at a much larger scale (more datasets, more tasks).

    2. Standard deviation is not reported.

    [1] Towards Generalist Foundation Model for Radiology by Leveraging Web-scale 2D&3D Medical Data.
    [2] Med-Flamingo: A Multimodal Medical Few-shot Learner.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    NA

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It would be very useful to think about how the paper situates itself among the many recent medical foundation models.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Reject — should be rejected, independent of rebuttal (2)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors try to tackle an important problem. However, the claimed contribution is not true in my view.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I have read the authors’ rebuttal and the other reviewers’ comments. While I appreciate the authors’ effort in preparing the rebuttal, I am not convinced by their explanation of how the paper differs from medical foundation models: 1. a medical foundation model can also be fine-tuned, as the authors do in this paper, for various downstream tasks; 2 & 3. I agree that the dataset and the specific training method used could be different, but this is not a contribution. To demonstrate a contribution, the authors need to show superiority over prior works that also used VQA. Overall, I disagree that this is the first utilization of VQA for multimodal pre-training in the medical field, as claimed in the paper.



Review #2

  • Please describe the contribution of the paper

    The authors propose a mechanism whereby they use Visual Question Answering (VQA) to perform multimodal pre-training, with the aim that the learned representations can be used in a number of downstream tasks. They are the first to do this for the purpose of guiding the framework to focus specifically on targeted pathological features, which are often small in ultrasound images, without needing labels from clinicians on the pathologies present in the US images. The pathologies only need to be mentioned in the clinical reports, and the authors prepare different questions based on the pathologies mentioned in the reports.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Very topical content
    • The introduction of the QFormer-based QFT module is particularly interesting.
    • The paper tries to address, algorithmically, a significant problem in medical imaging: the lack of clinician-provided labels.
    • The paper compares the method with 5 other methods on two datasets for the report generation task. Comprehensive.
    • The paper does not need annotations and yet is able to learn features specific to different pathologies, which constitutes a great contribution.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • More needs to be emphasised about how QFT differs from QFormer.
    • A clinician has not examined, looked at, or commented on the quality of the resulting Questions and Answers, which would have been a nice sanity check to include.
    • It is currently not clear enough how QCL is different from CL.
    • There needs to be more of an intuitive explanation as to why QFT is helpful. Why does QFT manage to narrow the modality gap and facilitate modality alignment? Also, how do we know for certain that such alignment is happening, and what about QFT makes this possible?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    For Table 2, it would have been nice to see a case where CL and VQA are used but without QCL, to properly compare the usefulness of including QCL, which, based on my reading, is the primary contribution of this paper.

    • I would advise kindly rewriting the first sentence (“we take the lead”) of the conclusion, since as it stands, it slightly implies that this is the first paper to have used VQA with pre-training in all domains, modalities, and use-cases.
    • In the Framework Setup, regarding fine-grained VQA, how do we guarantee that fine-grained text does not mistakenly become associated with the entire image, rather than specifically with the small region or part of the image associated with the pathological information of interest?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please include clearer explanations of how QFT differs from other pre-training approaches that make use of contrastive learning.
    • It would be great to see the paper answer the question, why has no one done this before, as in trying to use VQA for multi-modal pretraining to guide the framework to focus on specific features.
    • Please expand if possible the section on how the Q-A pairs are prepared.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Good work, but clearer and more detailed information is needed on QFT and QCL, which are the key contributions of the paper.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    No further comments beyond what was raised previously. We hope the final paper, if accepted, takes the comments into account, especially with regard to clarifications.



Review #3

  • Please describe the contribution of the paper
    1. The authors propose a multimodal pre-training method based on paired ultrasound images and medical reports. By applying a VQA model, the proposed model can focus on specific pathological features during pre-training on image-text pairs;
    2. Specifically, the authors design a quasi-textual feature space to help with bimodal feature alignment, in addition to using conventional contrastive learning for image-text pairs;
    3. Experimental results demonstrate the superiority of the proposed method on four downstream tasks, including report generation, classification, detection and segmentation.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The main strength of this work is the design of a VQA-based image-text alignment loss. In order to bridge the gap between image and text features, the authors propose to use quasi-textual features as additional supervisory signals for bimodal feature alignment. Specifically, they first extract multi-granular QA pairs from the raw medical reports, and then feed the quasi-textual features together with the question embeddings into a VQA model to generate the answer; this process is optimized with a language modeling loss. The quasi-textual features are then used to compute a contrastive loss against the global textual features.
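To make the answer-generation step summarized above more concrete, here is a hedged sketch of how quasi-textual tokens could condition a text decoder trained with a language modeling loss on report-derived question-answer pairs. It is illustrative only, not the authors' code: the toy decoder, vocabulary size, and tensor shapes are assumptions.

```python
# Illustrative-only sketch (not the authors' implementation) of the VQA pre-training
# step: quasi-textual tokens from the QFT module serve as the memory of a small text
# decoder that answers a report-derived question, trained with a language-modeling
# (next-token cross-entropy) loss. Vocabulary size and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAnswerGenerator(nn.Module):
    def __init__(self, vocab_size=30522, dim=768, heads=8, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, token_ids, quasi_tokens):
        # Causal mask so position t only attends to positions <= t (teacher forcing).
        T = token_ids.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf"), device=token_ids.device), diagonal=1)
        h = self.decoder(self.embed(token_ids), quasi_tokens, tgt_mask=causal)
        return self.lm_head(h)                                   # (B, T, vocab_size) logits

gen = ToyAnswerGenerator()
qa_ids = torch.randint(0, 30522, (2, 16))      # tokenized "question + answer" sequence
quasi = torch.randn(2, 32, 768)                # quasi-textual tokens from the QFT module
logits = gen(qa_ids[:, :-1], quasi)            # predict each next token
lm_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), qa_ids[:, 1:].reshape(-1))
```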

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Figure 2 demonstrates the main idea of the paper, but it does not make clear which module parameters are trainable, nor does it reflect the role of the text generator.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    no.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The model performs well on all four downstream tasks. The results on the three visual-recognition-related tasks are presented well, both visually and numerically. For the report generation task, however, it would be better to display some example cases, because the text evaluation metrics are designed for natural language and may not be well suited to medical reports.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The main idea is relatively new;
    2. It includes some engineering tricks, such as the feature buffer and the initialization method;
    3. Comprehensive experiments and large performance gains over other methods.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We sincerely appreciate the constructive feedback from all three reviewers, who acknowledge that our paper addresses significant issues in the medical field (R1, R3), introduces novel methods (R1, R4), and provides comprehensive experiments (R4). In this paper, we employ VQA as the primary task for representation learning and utilize two contrastive learning losses to facilitate modality alignment.

Differences between QFT and QFormer (R1): Thanks for your interest in our model. There are two differences: 1) We use a separate text encoder for better encoding rather than sharing weights with the adapter. 2) QFT is initialized with the same pre-trained weights as the text generator rather than with a pre-trained BERT. We believe this helps QFT convert visual information into features that the text generator can understand.

Differences between QCL and CL (R1): CL uses global features to represent the text and the image, each consisting of a single visual or textual token. QCL instead uses the output of QFT rather than the global visual feature. These features contain multiple visual tokens, providing fine-grained information for answer generation.

Why QCL works (R1): Through CL, paired images and texts are mapped to the same regions. This brings the distributions of the two modalities closer to each other, thereby narrowing the modality gap and promoting alignment. This approach is widely applied in image generation; for instance, DALL-E uses CLIP to encode text, enhancing the textual understanding of the image generator. We proceed in a similar manner: we utilize QCL, a modification of CL, to narrow the distribution gap between images and reports. This is beneficial for the language model's use of visual features, since the language model is pre-trained on a text distribution. After removing QCL, report generation performance shows an obvious decline (five percent in BLEU-4). We believe this is because the QCL loss is calculated on the quasi-textual features, which are used to generate answers, whereas CL only affects global features, which do not directly participate in answer generation.

Why there is less research exploring VQA for pre-training (R1): Building a targeted VQA dataset is cost-prohibitive, as it requires doctors' annotations, and public VQA datasets contain diverse questions, making it difficult for the model to focus on the targeted aspect. We construct our VQA dataset from medical reports. Medical reports offer accurate interpretations of medical images, ensuring the VQA dataset's correctness without doctors' involvement, and we can design the questions based on what is important in the clinic.

Existence of similar works (R3): Thanks for pointing out works similar to ours; we are sorry about the confusion. In the two papers mentioned in the review, VQA is used to train a question-answering system, whereas we use VQA to learn a generalized medical representation. Our paper differs in the following aspects: 1) After pre-training, the pre-trained weights of our model are transferred and fine-tuned to enhance performance on various tasks, including object detection and segmentation, rather than adapted through instruction tuning. 2) We construct a targeted VQA dataset based on medical reports rather than utilizing public datasets. 3) We design our model in a contrastive manner to facilitate modality alignment, rather than directly conducting language modeling.

Trainable parameters of the model and the use of the text generator (R4): Thanks for the point. No parameters are frozen during training. The text generator is used to generate both answers and reports.

We are grateful to the reviewers for their comments. Due to the length limits of the paper/rebuttal and the restrictions on adding experiments, we will further explore more details in future work, such as QA-pair details (R1), standard deviations (R3), and a doctor evaluation of the generated reports (R4).
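As an illustration of the CL-versus-QCL distinction explained in the rebuttal, the following hedged sketch contrasts a loss on single global image and text vectors with the same loss applied to the (pooled) multi-token QFT output. The tensor names, shapes, mean pooling, and InfoNCE form are assumptions rather than the authors' exact formulation.

```python
# Hedged sketch (not the released code) of the CL-versus-QCL distinction: plain CL
# contrasts single global image/text vectors, while QCL contrasts the multi-token
# QFT output (mean-pooled here) with the global report feature.
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings a[i] <-> b[i]."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

B, dim = 8, 768
global_img = torch.randn(B, dim)            # one global visual token per image (used by CL)
global_txt = torch.randn(B, dim)            # one global textual token per report
quasi_tokens = torch.randn(B, 32, dim)      # multi-token QFT output (used by QCL)

cl_loss = info_nce(global_img, global_txt)                     # global-to-global alignment
qcl_loss = info_nce(quasi_tokens.mean(dim=1), global_txt)      # quasi-textual-to-text alignment
```

Because the QCL loss acts on the same quasi-textual tokens that condition answer generation, gradients from the alignment objective directly shape the features the text generator consumes, which is the rationale the rebuttal gives for QCL's effect on report generation.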




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The authors propose a multimodal pre-training method using paired ultrasound images and medical reports. By incorporating a VQA model, the approach focuses on specific pathological features during the pre-training of image-text pairs. However, I disagree with the claim that this is the first utilization of VQA for multimodal pre-training in the medical field.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviews highlight several strengths of the paper, including its innovative use of VQA for targeted feature learning in medical images, the introduction of the QFormer-based QFT module, and comprehensive experimental validation. However, the reviews also raise critical issues, primarily concerning the novelty of the contributions relative to previous works that have employed VQA for multimodal pre-training. The introduction of a novel dataset and methodological innovations such as QFT and QCL are good contributions. Yet, although the authors cite differences from previous works, the specificity of these differences could be better articulated. The AC strongly recommends improving these claims and providing the rationale (e.g., why contrastive modality alignment rather than directly conducting language modeling is useful) in the final version.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



