Abstract

Medical visual question answering (Med-VQA) aims to answer medical questions about a given medical image. Current methods are all designed to answer a single question with its image, yet medical diagnoses rest on multiple factors, so questions related to the same image should be answered together. This paper proposes a novel multi-question learning method to capture the correlation among questions. Notably, for one image, all related questions are predicted simultaneously. For images that already have some questions answered, the answered questions can be used as prompts for better diagnosis. Further, to deal with erroneous prompts, an entropy-based prompt pruning algorithm is designed, and a shuffle-based augmentation algorithm is designed to make the model less sensitive to the order of the input questions. In the experiments, a patient-level accuracy is designed to compare the reliability of the models and reflect the effectiveness of our multi-question learning for Med-VQA. The results show that our method, built on top of recent state-of-the-art Med-VQA models, improves overall accuracy by 3.77% and 4.24% on VQA-RAD and SLAKE, respectively, and patient-level accuracy by 6.90% and 15.63%. The code is available at: https://github.com/shanziSZ/MMQL.
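As an illustration of the setting described in the abstract: each image carries several questions, and any already-answered ones are prepended as prompt-QAs. The input template below is our assumption for illustration only (the abstract does not specify the actual format), so the function and field names are hypothetical:

```python
def build_input(questions, prompt_qas):
    """Concatenate prompt-QAs and unanswered questions into one model input.

    prompt_qas: list of (question, known_answer) pairs already answered
    for this image; questions: the remaining unanswered questions.
    Assumption: QA pairs and questions are joined as plain text; the
    paper's actual template may differ.
    """
    parts = [f"Q: {q} A: {a}" for q, a in prompt_qas]
    parts += [f"Q: {q} A: ?" for q in questions]
    return " ".join(parts)

print(build_input(
    questions=["Is there a fracture?", "Which side is abnormal?"],
    prompt_qas=[("What modality is this?", "X-Ray")],
))
# Q: What modality is this? A: X-Ray Q: Is there a fracture? A: ? ...
```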

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1159_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1159_supp.pdf

Link to the Code Repository

https://github.com/shanziSZ/MMQL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Che_MMQL_MICCAI2024,
        author = { Chen, Qishen and Bian, Minjie and Xu, Huahu},
        title = { { MMQL: Multi-Question Learning for Medical Visual Question Answering } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15005},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces Medical Multi-Question Learning (MMQL) to the Medical Visual Question Answering (Med-VQA) field, enabling joint training on multiple related questions per image. This approach enhances model performance by incorporating novel techniques like shuffle-based augmentation and entropy-based prompt pruning. Additionally, it presents a new patient-level evaluation metric, showing superior results over existing methods on public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Firstly, it pioneers the introduction of Medical Multi-Question Learning to the Med-VQA field, a novel approach that markedly enhances the model’s ability to process multiple related questions per image, thereby greatly improving diagnostic capabilities in complex medical scenarios. Secondly, it incorporates cutting-edge techniques such as shuffle-based augmentation and entropy-based prompt pruning, which effectively increase the model’s adaptability and reduce error propagation, ensuring more reliable outputs. Thirdly, the paper introduces a new patient-level evaluation metric specifically designed for Med-VQA, providing a more accurate measure of the model’s effectiveness in real-world clinical settings. These innovations collectively push the boundaries of what’s possible in medical question answering systems.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of Reproducible Code: Although the paper is the first to apply Multi-Question Learning to the Med-VQA field, the novelty largely hinges on the application of MQL from general VQA to the medical VQA. A critical component for validating such innovative claims is the availability of reproducible code, which seems to be absent in this case as the paper only provides a non-functional link. This omission casts doubt on the replicability of the reported significant improvements, undermining the paper’s credibility.
    2. Questionable Superiority Claims: The paper asserts that its model outperforms SOTA Med-VQA models. However, it appears that the MedVInT-TD model described in the paper ‘PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering’ achieves better results on the VQA-RAD dataset than those reported in this study. This suggests that the claim of having the best-performing model may be incorrect, which could mislead readers regarding the paper’s impact and innovation level.
    3. Metrics Formula: Given that the paper emphasizes the introduction of a patient-level accuracy metric as a significant contribution, it would be beneficial for the authors to also present other standard metrics, such as simple accuracy, using the notation described in the text. This would help readers better understand the differences and relative advantages of the new metric compared to traditional ones, providing a clearer context for evaluating the paper’s contributions.
    4. Insufficient Methodological Detail: The methodological presentation within the paper could be improved. The framework diagram provided lacks essential details, which may hinder readers’ understanding of the model’s architecture and operational nuances. A reorganization or enhancement of Figure 1 to include these critical details could significantly improve the clarity and effectiveness of the presentation, aiding readers in fully grasping how the proposed system works.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I appreciate the innovative design of the Medical Multi Question Learning (MMQL) method and the significant improvements reported in the results. However, my main concern lies in the lack of available code, which casts doubt on the reproducibility of these results. If the authors could provide accessible and executable code, it would greatly strengthen the paper’s credibility. Additionally, the comparison made in the paper does not seem comprehensive, as there are existing models that perform better, suggesting that the claim of this model being state-of-the-art (SOTA) might be overstated. These issues combined led to my decision, which could be reconsidered if addressed effectively in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors provided a very detailed rebuttal that effectively addressed the important concerns. Even though the code is not available, they explained the reason and responded usefully to the other reviewers’ questions, which influenced my decision positively.



Review #2

  • Please describe the contribution of the paper

    This work considers an interesting setting in which multiple questions need to be answered for a single medical image. To address this task, this paper proposes a multi-question learning method that exploits the interdependence of the multiple questions to improve the accuracy of answers. An experimental study is conducted on two benchmark datasets to demonstrate its efficacy.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This work addresses a new setting for medical visual question answering. The proposed method considers the characteristics of this setting and proposes strategies to exploit the correlation among questions and mitigate the impact of wrongly answered questions on the subsequent ones.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    This work could be further improved on the following points.

    1. In the presence of multiple questions, capturing their correlation will be an effective way to improve the quality of answers. This work could better highlight how the proposed method captures and utilizes this correlation in the Introduction and Conclusion.

    2. Based on the definition of the problem in Section 2.1, it seems this work deals with the “closed-ended” setting of VQA (i.e., as a C-class classification problem). If that is the case, can the proposed method be extended to deal with the “open-ended” setting of VQA?

    3. A shuffle-based augmentation is proposed in Section 2.4, which randomly changes the sequence of questions. However, will this lead to any adverse effects? After all, the order in which multiple questions are raised and answered can sometimes be important. (A minimal sketch of such an augmentation is given after this list.)

    4. It is appreciated that a false-prompt pruning method is developed to handle the inaccuracy of prompt-QA. However, for the two models f_w and f_w/o, how they are trained and how they are used could be better explained.

    5. The inferior performance of the proposed MMQL with respect to M-Mixup in Fig. 3 could be better explained. For example, what are the rare samples? Can the proposed MMQL also augment such samples?
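    Regarding point 3 above: the paper describes the augmentation only as randomly permuting the question order, so a minimal sketch under that reading (the function name and data layout are ours, not the paper's) might look like this:

    ```python
    import random

    def shuffle_augment(qa_pairs, rng=random.Random(0)):
        """Return a randomly permuted copy of an image's QA pairs.

        qa_pairs: list of (question, answer) tuples for one image.
        Assumption: the paper's Shuffle-Aug permutes the order in which
        questions are concatenated into the model input; the exact input
        construction is not specified here, so this is only a sketch.
        """
        shuffled = qa_pairs[:]   # copy so the original order is kept
        rng.shuffle(shuffled)    # a fresh random permutation per call
        return shuffled

    qas = [("Is there a fracture?", "no"),
           ("Which organ is shown?", "lung"),
           ("Is the lung healthy?", "yes")]
    print(shuffle_augment(qas))
    ```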

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    None.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See “main weaknesses of the paper.”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work proposes a novel setting for medical visual question answering and a technically sound method to address it. The experimental study shows the promising performance of this work compared with other related methods.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I have read the comments from the peer reviewers. The rebuttal provided by the authors is informative, addressing most of the issues well. Considering the technical novelty and the experimental demonstration of the efficacy of the proposed work, the rating of Weak Accept is kept.



Review #3

  • Please describe the contribution of the paper

    Introduction of MMQL Approach: The paper introduces the Medical Multi-Question Learning (MMQL) approach, which integrates multi-question learning into the field of Medical Visual Question Answering (Med-VQA). This innovative method involves jointly training medical questions associated with a single image, leading to significant enhancements in Med-VQA models.

    Innovative MQL Module: The MQL module in MMQL accommodates scenarios with no-answer and prompt-QA availability, effectively utilizing external information to improve diagnostic accuracy. Additionally, a Shuffle-based augmentation algorithm is introduced to reduce the sensitivity of question sequences, enhancing the robustness of the model.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper’s innovative MQL module effectively harnesses external information, such as prompts and answers, to enhance the performance of the Med-VQA model. By accommodating no-answer scenarios and utilizing prompt-QA availability, the model can draw insights from existing QA pairs and improve the overall diagnostic process. This is one of the major contributions of this paper. The patient-level evaluation method is also a strong aspect of the paper, demonstrating a focus on clinical feasibility and real-world applicability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper may benefit from a more in-depth discussion on the generalizability of the MMQL approach across different medical imaging modalities and clinical scenarios. While the paper introduces entropy-based prompt pruning methods to address false question-answer instances, a more detailed exploration of error analysis and the potential sources of inaccuracies in the MMQL model could provide valuable insights for further refinement.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The calibration error could be discussed in further detail. Providing more explanation of the patient-level accuracy in Figure 2 would be great.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The innovation of the methodology. The thorough analysis and comparison of the methodology.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thanks to the reviewers for your valuable comments on our work. We address the concerns as follows:

Methodological detail and available code (R1.Q4, R4.Q1, R4.Q4)
1. We have already prepared a GitHub repository to share our code publicly, but the rebuttal guidelines prevent us from providing external links. Therefore, we will make the repository accessible immediately following the acceptance notification.
2. To enhance clarity in our methodology, we plan to revise Fig. 1. It will be split into two separate figures: one showing samples and the other detailing the method. The latter will add a flowchart including Shuffle-Aug and Error-Prompt Pruning. For Error-Prompt Pruning, f_w is our proposed model, which utilizes prompt-QAs in its input text, covering both answered and unanswered questions. In contrast, f_w/o is trained and tested under identical conditions without the inclusion of prompt-QAs.
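One plausible reading of the Error-Prompt Pruning described here (the abstract calls it entropy-based) is that a prompt is kept only when it does not raise the entropy of the answer distribution relative to the prompt-free model f_w/o. The sketch below follows that reading; the thresholding rule, call signatures, and tensor shapes are our assumptions, not the paper's actual API:

```python
import torch

def prune_error_prompts(f_w, f_wo, image, question, prompt_qas):
    """Decide whether to keep prompt-QAs, using prediction entropy.

    f_w : model trained WITH prompt-QAs in its input text.
    f_wo: model trained WITHOUT prompt-QAs, otherwise identical.
    Assumption: a prompt is pruned when it increases the entropy of
    the predicted answer distribution; the exact rule in the paper
    may differ, so this is only an illustrative sketch.
    """
    def entropy(logits):
        p = torch.softmax(logits, dim=-1)
        return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=-1)

    logits_w = f_w(image, question, prompt_qas)   # [num_classes]
    logits_wo = f_wo(image, question)             # [num_classes]

    if entropy(logits_w) <= entropy(logits_wo):
        return logits_w                           # prompt looks reliable
    return f_w(image, question, prompt_qas=[])    # prune the prompt
```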

Questionable superiority claims (R4.Q2)
1. We have thoroughly reevaluated MMQL in comparison to MedVInT-TD, acknowledging it as significant work in the field. We will include MedVInT-TD in our references and integrate its results into Tab. 1. However, it is important to note that MedVInT-TD benefits from pre-training on the PMC-VQA dataset (177k samples), while our method does not utilize such extensive pre-training. This significant difference suggests that a direct comparison between MMQL and MedVInT-TD might not provide a fully equitable assessment.

Generalization on modalities, scenarios, and downstream tasks (R1.Q2, R3.Q1)
1. MMQL leverages all relevant questions associated with a medical image in a single forward pass, enhancing the model’s ability to recognize inter-question correlations. Its applicability is driven by the inherent one-to-many relationship between images and questions, regardless of modality, scenario, or downstream task.
2. Our study utilizes VQA-RAD and SLAKE, which encompass 3 primary image modalities (X-Ray, CT, and MRI) and 3 anatomical locations (lung, abdomen, and brain). These modalities and locations are representative of the broader range covered by other datasets such as VQA-Med (2018-2020) and RadVisDial. Additionally, our work aligns with previous works (refs. 3, 4, 7, 12, 19), which also focus on 1-2 datasets.
3. Previous studies (“Open-ended medical visual question answering through prefix tuning of language models”) have characterized the open-ended setting as text-generation QA. In this scenario, modifications would be required to the final MLP layer, which would be replaced with a decoder. Additionally, adjustments would be necessary in Error-Prompt Pruning to incorporate metrics such as perplexity, which better suit the text-generation context.

Error analysis (R1.Q5, R3.Q2)
1. We have detailed 3 typical cases in the supplementary material. We observed primary issues such as inconsistent answers (Xmlab105) and language bias (Xmlab469). Erroneous prompt-QAs can lead to cascading mistakes, prompting us to introduce Error-Prompt Pruning.
2. Rare cases in Section 3.5 mainly involve language bias, particularly for questions with imbalanced answer distributions. For example, the 9th question in Xmlab469 has two potential answers, one of which is significantly more common. Mix-up helps augment these less common answers, a strategy not employed by our method. This likely explains why our method ranks second in terms of the ECE/MCE metrics.

Other suggestions (R1.Q1, R4.Q3, R1.Q3)
1. We plan to enhance both the Introduction and Conclusion sections by incorporating examples that clearly illustrate our methods and their significance.
2. The extra half page in the camera-ready version can be used to include the formula for vanilla accuracy in Section 3.1.
3. Ordering questions is a great suggestion, but it needs extra labeling, and most datasets do not meet this requirement. Thus, training on questions without ordering is our first option. In this scenario, Shuffle-Aug shows its effectiveness (Tab. 2) and can prevent over-fitting.
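On point 2, a hedged sketch of how the vanilla and patient-level accuracies might be written side by side; the notation (N images, question set Q_i per image, ground-truth answer a_q, prediction \hat{a}_q, indicator 1[.]) is our reconstruction, not the paper's:

```latex
% N: number of images; Q_i: questions for image i; 1[.]: indicator.
% Patient-level accuracy credits an image only when ALL of its
% questions are answered correctly (our reading of the metric).
\[
\mathrm{Acc}_{\text{vanilla}}
  = \frac{\sum_{i=1}^{N} \sum_{q \in Q_i} \mathbb{1}[\hat{a}_q = a_q]}
         {\sum_{i=1}^{N} |Q_i|},
\qquad
\mathrm{Acc}_{\text{patient}}
  = \frac{1}{N} \sum_{i=1}^{N} \prod_{q \in Q_i} \mathbb{1}[\hat{a}_q = a_q].
\]
```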




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper addresses a multi-question learning setting in medical VQA, in which multiple questions associated with a single image are trained jointly. As all reviewers acknowledged, this paper provides new knowledge to the Med-VQA field, justifying its technical contributions. On the other hand, it is a bit disappointing that the model was only validated on closed-ended questions but not open-ended questions, as most conventional medical VQA methods are. Therefore, my feeling is mixed. Considering the technical contributions to a new medical VQA setting, I tend to accept.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers provided an overall positive evaluation; therefore, the paper should be accepted.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A
