Abstract

Visual Question Answering (VQA) within the surgical domain, utilizing Large Language Models (LLMs), offers a distinct opportunity to improve intra-operative decision-making and facilitate intuitive surgeon-AI interaction. However, the development of LLMs for surgical VQA is hindered by the scarcity of diverse and extensive datasets with complex reasoning tasks. Moreover, contextual fusion of the image and text modalities remains an open research challenge due to the inherent differences between these two types of information and the complexity involved in aligning them. This paper introduces PitVQA, a novel dataset specifically designed for VQA in endonasal pituitary surgery, and PitVQA-Net, an adaptation of GPT2 with a novel image-grounded text embedding for surgical VQA. PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions. PitVQA-Net consists of a novel image-grounded text embedding that projects image and text features into a shared embedding space, and a GPT2 backbone with an excitation block classification head that generates contextually relevant answers within the complex domain of endonasal pituitary surgery. Our image-grounded text embedding leverages joint embedding, cross-attention and contextual representation to understand the contextual relationship between questions and surgical images. We demonstrate the effectiveness of PitVQA-Net on both the PitVQA and the publicly available EndoVis18-VQA datasets, achieving improvements in balanced accuracy of 8% and 9% over the most recent baselines, respectively.
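
The abstract reports results in balanced accuracy. As a minimal sketch of how that metric is commonly defined (the mean of per-class recall), the following may help; it is not taken from the PitVQA code, and the function name and the example answer labels are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # number of correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical answer labels (not from the dataset): one frequent class, two rare ones.
y_true = ["forceps", "forceps", "forceps", "forceps", "suction", "drill"]
y_pred = ["forceps", "forceps", "forceps", "forceps", "forceps", "drill"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"plain accuracy:    {plain_accuracy:.3f}")
print(f"balanced accuracy: {balanced_accuracy(y_true, y_pred):.3f}")
```

In this toy case the rare "suction" class is always missed, so balanced accuracy falls well below plain accuracy. This is one reason the metric suits answer distributions with strong class imbalance.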

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3403_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/mobarakol/PitVQA

Link to the Dataset(s)

https://github.com/mobarakol/PitVQA

BibTex

@InProceedings{He_PitVQA_MICCAI2024,
        author = { He, Runlong and Xu, Mengya and Das, Adrito and Khan, Danyal Z. and Bano, Sophia and Marcus, Hani J. and Stoyanov, Danail and Clarkson, Matthew J. and Islam, Mobarakol},
        title = { { PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper builds a specialized dataset focused on VQA in the context of endonasal pituitary surgery.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    (1) The specialized dataset focused on VQA in the context of endonasal pituitary surgery would be helpful to the surgical video research community. (2) The dataset will be open-sourced.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (1) Apart from the surgical scenario, the proposed dataset is similar to existing ones. The dataset would be more valuable if the questions were more closely tied to the specific surgical scenario, namely pituitary surgery. (2) Adding localization of targets and instruments would increase the usability of this dataset. (3) The questions in this dataset remain within the same scope as those of previous surgical VQA datasets, which undermines the novelty of the dataset.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The proposed dataset generally follows previous surgical VQA datasets, with a larger data scale and a different surgical scenario. I would suggest that the authors not just follow previous works, but think more about how to design more meaningful questions in different circumstances.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I rate this paper as Weak Accept because it provides a valuable dataset for the surgical video research community. However, it is generally similar to existing ones, especially regarding the designed questions.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    The paper introduces “PitVQA,” a novel dataset specifically tailored for visual question answering (VQA) in the context of endonasal pituitary surgery. It also presents “PitVQA-Net,” which incorporates a new image-grounded text embedding and a gated-attention excitation block within a modified GPT2 architecture to address the VQA challenges in surgical settings. This model aims to enhance intra-operative decision-making through improved surgeon-AI interaction.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel Dataset. The creation of a specialized dataset (PitVQA) that includes 25 videos with extensive annotations for pituitary surgery is a significant contribution. This dataset fills a gap in the availability of VQA resources tailored to complex surgical procedures.

    Innovative Model Design. The novel image-grounded text embedding strategy employed in PitVQA-Net enhances the contextual alignment between the visual content and textual queries, which is pivotal for effective VQA systems in surgical settings.

    Demonstrated Clinical Feasibility. The paper effectively demonstrates the clinical feasibility of the proposed model by testing it on both the new PitVQA dataset and the existing EndoVis18-VQA dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Limited Dataset Diversity. Although the PitVQA dataset is a valuable addition, its focus on a single type of surgery (endonasal pituitary surgery) may limit the generalizability of the findings.

    Lack of Comparative Baseline Models. The paper must conduct a more extensive comparison with a broader range of existing VQA models, which are mentioned in the paper, to benchmark the proposed method’s performance comprehensively.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors have promised to make the code and dataset available, which is a strong plus for reproducibility. However, the paper’s reproducibility would be further enhanced by more detailed implementation notes and hyperparameter settings.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1. Consider expanding the dataset to include more varied surgical procedures to test the robustness and adaptability of PitVQA-Net across different surgical contexts.

    2. Provide detailed comparisons with a wider range of VQA models to better position your contributions within the landscape of surgical VQA research.

    3. Enhance the explanation of the model’s components, particularly the novel embedding and excitation block.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I would rate this paper a 4 out of 6. The major factors influencing my rating are the innovative dataset and modeling approach that address a clear need in surgical VQA, backed by strong empirical results. However, this work has major issues in terms of comparative analysis.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors introduce a dataset and LLM-based method for visual question answering in endonasal pituitary surgery.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors introduce a novel dataset.

    The method is novel and an interesting contribution towards surgical scene understanding. The method outperforms the state-of-the-art in visual question answering.

    The presentation of the paper is very clear and well written.

    The authors perform a thorough comparison with the SOTA and an ablation study to show the effectiveness of their individual design decisions in method development.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Image-grounded Text Embedding seems to be directly adopted from BLIP which should be more clearly indicated in the paper.

    Squeeze-and-Excitation is an effective, yet common modification to improve the performance of DL methods.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The contributions at the end of the introduction section need to be reformulated as proper English sentences.

    What is the exact difference between contributions 2 and 3, as listed at the end of the introduction section, regarding the “novel image-text embedding”?

    Table 1: what is the unit of the “average length”?

    Results: “It appears that most of the models are capable of accurately recognizing surgical steps, whereas the identification of instruments with localization reasoning mostly fails. However, PitVQA-Net demonstrates robust prediction across various types of question answering in both datasets.” - The qualitative results shown in this paper are not sufficient to support this claim. I suggest removing this sentence or reformulating it.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper shows excellent results, an interesting method and is well written and structured. The evaluations are thorough and support the claims of the authors. The code is publicly available to facilitate further research in this direction. The clinical application is relevant and the contribution of the work is clear.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


