Abstract

Recent advancements in Computer Assisted Diagnosis have shown promising performance in medical imaging tasks, particularly in chest X-ray analysis. However, the interaction between these models and radiologists has been primarily limited to input images. This work proposes a novel approach to enhance human-computer interaction in chest X-ray analysis using Vision-Language Models (VLMs) augmented with radiologists’ attention by incorporating eye gaze data alongside textual prompts. Our approach leverages heatmaps generated from eye gaze data, overlaying them onto medical images to highlight areas of intense radiologist focus during chest X-ray evaluation. We evaluate this methodology on tasks such as visual question answering, chest X-ray report automation, error detection, and differential diagnosis. Our results demonstrate that the inclusion of eye gaze information significantly enhances the accuracy of chest X-ray analysis. The impact of eye gaze on fine-tuning was also confirmed, as the fine-tuned model outperformed other medical VLMs in all tasks except visual question answering. This work highlights the potential of leveraging both the VLM’s capabilities and the radiologist’s domain knowledge to improve AI models in medical imaging, paving a new way for Computer Assisted Diagnosis with human-centred AI.
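For intuition, here is a minimal sketch (not the authors’ released code) of the overlay idea the abstract describes: fixation points are accumulated into a density map, blurred into a heatmap, and alpha-blended in red over the chest X-ray. The function name, fixation format, and blending parameters are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from PIL import Image

def overlay_gaze_heatmap(cxr: Image.Image, fixations, sigma=25, alpha=0.4):
    """Blend a red gaze heatmap over a chest X-ray.

    fixations: iterable of (x, y) pixel coordinates of gaze points.
    """
    w, h = cxr.size
    density = np.zeros((h, w), dtype=np.float32)
    for x, y in fixations:
        xi, yi = int(x), int(y)
        if 0 <= yi < h and 0 <= xi < w:
            density[yi, xi] += 1.0          # accumulate fixation counts
    heat = gaussian_filter(density, sigma=sigma)
    heat /= heat.max() + 1e-8               # normalise to [0, 1]
    base = np.asarray(cxr.convert("RGB"), dtype=np.float32)
    red = np.zeros_like(base)
    red[..., 0] = 255.0                     # pure red overlay colour
    w_heat = alpha * heat[..., None]        # per-pixel blend weight
    blended = (1.0 - w_heat) * base + w_heat * red
    return Image.fromarray(blended.astype(np.uint8))
```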

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3261_paper.pdf

SharedIt Link: https://rdcu.be/dV1Vr

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72384-1_18

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3261_supp.pdf

Link to the Code Repository

https://github.com/knowlab/CXR_VLM_EyeGaze

Link to the Dataset(s)

https://physionet.org/content/mimic-eye-multimodal-datasets/1.0.0/

BibTex

@InProceedings{Kim_Enhancing_MICCAI2024,
        author = { Kim, Yunsoo and Wu, Jinge and Abdulle, Yusuf and Gao, Yue and Wu, Honghan},
        title = { { Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15003},
        month = {October},
        pages = {184 -- 194}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper is the first to discuss how to incorporate eye gaze patterns into vision-language models to improve report generation accuracy and VQA.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This is the first time eye gaze patterns have been used with a vision-language model, which is necessary and meaningful.

    This paper conducts four different types of tasks to study how eye gaze patterns can affect language generation, which is good. This study could be a good starting point for incorporating eye gaze patterns into vision-language models.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The experiments are based on a single model; the authors should also try other VLMs.

    No code is released; open-sourcing the code and model would be a great contribution to this field.

    From the results provided, the benefit of eye gaze is not obvious, especially in report generation (GEN and SUM). For some of the models, the performance drop with eye gaze is large. This reads more like trial and error than a rigorous study; in other words, the results are not consistent enough to support the findings. Also, some results have exactly the same value with and without gaze incorporation, which makes the reliability of the results questionable.

    Only quantitative results are given; the authors should add at least one generated report or VQA sample to show the differences before and after gaze incorporation.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No code is released; open-sourcing the code and model would be a great contribution to this field.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The experiments are all based on fine-tuning the LLaVA model; there are other VLMs, and even chest X-ray-specific VLMs such as ChestXAgent.

    Why are many of the results the same for both No G and G in Table 3, for example LLaVA-v1.5 FT on ERR, DDx, and VQA?

    If the authors could display one generated report or VQA sample, it would be easier to see the help of the gaze pattern. Also, relating the language sample to the chest X-ray and eye gaze pattern would be great.

    For the eye gaze incorporation study, few related works are listed. I suggest adding the following references and discussing eye gaze-related studies more in the introduction: “Mining Gaze for Contrastive Learning toward Computer-Assisted Diagnosis”; “GazeGNN: A Gaze-Guided Graph Neural Network for Chest X-Ray Classification”; “Follow My Eye: Using Gaze to Supervise Computer-Aided Diagnosis”; “GazeRadar: A Gaze and Radiomics-Guided Disease Localization Framework”.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    As mentioned above

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Reject — should be rejected, independent of rebuttal (2)

  • [Post rebuttal] Please justify your decision

    I do not think the authors answered my questions well or resolved my concerns.

    First, the performance is not consistent, and the authors did not answer this question in the rebuttal. From my perspective, the experimental results cannot support the conclusion that gaze helps LLM-based models: for some of the models, the performance drop with eye gaze is large. Moreover, the experiments are all based on a single model, LLaVA; although the authors explain that different backbones can be chosen within LLaVA, the improvements look too random to me, which is not enough to prove their findings.

    Second, a more serious problem is that some of the results have exactly the same value with and without gaze, which raises concerns about the reliability of the results. However, the authors completely ignored this question, so I am not sure the results can be trusted.



Review #2

  • Please describe the contribution of the paper

    The paper explores an intriguing idea of leveraging doctors’ gaze data to enhance the performance of VLMs. It overlays a heatmap created from the gaze data on top of the original CXR input.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The idea of incorporating gaze data into VLMs is commendable, particularly for medical VLM applications, where one of the key issues lies in achieving fine-grained correspondence between text and images. Often, the model does not clearly understand which part of the image corresponds to each piece of the text. The authors of this paper make a preliminary attempt by directly using gaze data as part of the input. It would be even better if the gaze data could be more organically integrated into the training process of VLMs.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experiments in the paper are somewhat rudimentary and do not offer many solid insights. For example, the Error Detection task presents limited clinical significance because clinically occurring errors are likely much more challenging to discern than randomly inserted errors. Additionally, the results do not solidly demonstrate the effectiveness of using gaze data. It may be due to computational constraints that the authors were unable to conduct more in-depth experiments.
    2. Some experimental details in the paper are rather vague, such as how the gaze heatmap is input into the VLM. Is it directly rendered onto the original CXR using RGB, or are the tokens from the CXR and the gaze heatmap concatenated together? The method of integration could significantly impact performance, and I’m concerned about this. Additionally, the authors mention fine-tuning a model “without the raw CXR images”. Does this mean only the gaze heatmap is input without the CXR? This approach seems somewhat unreasonable.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to Q6.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper introduces an interesting concept; however, both the experimental design and the results require further refinement.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper proposes a framework for integrating eye-gaze of radiologists with Vision Language Models (VLM) for clinical applications. The effectiveness of the proposed approach is evaluated on 4 different clinical applications.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The model is compared with multiple variants of LLaVA models.
    2. The paper is well-organized.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Although the method proposes integrating eye gaze with VLMs, its novelty is limited.
    2. In Table 3, the proposed method is compared with several variants of LLaVA. However, a better comparison would also show results for some VLMs trained on medical data, for example LLaVA-Med.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    In Table 3, the performance with G is not better than with No G for several tasks (such as ERR and DDx), which raises the question of whether gaze is useful for these tasks.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In Table 3, the performance with G is not better than with No G for several tasks (such as ERR and DDx), which raises the question of whether gaze is useful for these tasks.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed work discusses a novel way of integrating radiologists’ gaze patterns with VLMs. However, the weak comparison reported in the work (no comparison with medical VLMs) constitutes its main weakness. Overall, the strengths outweigh the weaknesses.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments and clarified other aspects, so I maintain my previous evaluation.



Review #4

  • Please describe the contribution of the paper

    This study presents an intriguing approach to augment human-computer interaction in chest X-ray analysis through Vision-Language Models (VLMs) enriched with radiologists’ attention, incorporating eye gaze data alongside textual prompts. The method utilizes heatmaps generated from eye gaze data, superimposing them onto medical images to emphasize areas of intense radiologist focus during chest X-ray assessment. The author evaluates this approach across various tasks, including visual question answering, chest X-ray report automation, error detection, and differential diagnosis. The observed results demonstrate a significant enhancement in the accuracy of chest X-ray analysis with the inclusion of eye gaze information. Moreover, the impact of eye gaze on fine-tuning is affirmed, as it surpasses other medical VLMs in all tasks except visual question answering.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper harnesses heatmaps derived from eye gaze patterns observed during image interpretation. By integrating these heatmaps, it introduces an additional layer of insight to the Vision-Language Model (VLM), signifying progress in human-centered AI for medical image analysis. Moreover, the evaluation tasks, encompassing Report Automation (GEN and SUM), Error Detection (ERR), Differential Diagnosis (DDx), and Visual Question Answering (VQA), offer a comprehensive framework for critically analyzing this proposal. Additionally, the authors employ strategic techniques such as low-rank adaptation (LoRA), the DeepSpeed zero-redundancy optimizer (ZeRO3), and flash attention to significantly reduce memory consumption (see the sketch below). Despite the large dataset used for training and evaluation, these techniques effectively mitigate extensive memory usage.
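    For readers unfamiliar with these techniques, the sketch below shows how LoRA and flash attention are typically wired together with the Hugging Face `transformers` and `peft` libraries when fine-tuning a LLaVA-style model. The checkpoint id, LoRA hyperparameters, and target modules are placeholders, not the authors’ actual configuration.

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load the base VLM with flash attention enabled (placeholder checkpoint).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)

# Wrap the attention projections with low-rank adapters so that only a
# small fraction of parameters is trained (placeholder hyperparameters).
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()

# DeepSpeed ZeRO-3 is usually enabled separately, via a JSON config
# passed to the trainer, e.g. {"zero_optimization": {"stage": 3}}.
```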

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The author contends that larger models like LLaVA-v1.5-13B do not perform well on the given task because of their higher parameter counts. However, this assertion may be overly simplistic, as model performance is typically influenced by various factors, such as how the parameters are fine-tuned. It remains unclear why the author arrived at such conclusions without detailing attempts to improve such models. The general understanding is that larger models entail higher computational costs but also yield better accuracy, so the author needs to justify findings that seem to contradict this commonly accepted principle.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The author ought to clarify which other model parameters contributed to the underperformance of large models like LLaVA-v1.5-13B; such information is crucial for establishing the novelty of this work. The author should also provide a detailed ablation study to validate the contributions of the model components.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposal presented in this study offers an insightful opportunity to advance the use of large language models to enhance computer-aided diagnosis and understanding. The author has provided clear and relevant details of the work conducted and its significance.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The rebuttal responses from the authors provide comprehensive detail about the main concerns raised about this paper. I recommend acceptance (#5) because of the significance of its contribution to LLM-based vision-language models and the clarity of the proposed work.




Author Feedback

We thank the reviewers for their time and suggestions. We begin our response with feedback common to multiple reviewers.

Questionable effectiveness of the gaze data: This work aims to close a specific gap in the literature: how, and whether, human experts’ eye gaze data can facilitate chest X-ray (CXR) analysis with LLM-based VLMs. In our results, integrating human expertise improved model performance on some downstream tasks, e.g., differential diagnosis (DDx), but less so on others. These observations imply that eye gaze data does not always improve VLMs, which is a valuable finding not previously reported in the literature. Both the positive and the negative observations from our results should stimulate future work on improving the utilisation and understanding of eye gaze data with VLMs.

Open source: For the camera-ready version, we will provide a link to the code and the dataset.

Reviewer #3

Randomly inserted errors for the Error Detection task: It is worth clarifying that all of the introduced errors are clinically significant; randomness is used only in picking which errors to introduce. Specifically, with the help of radiologists, we identified the top 33 most important phenotypes that must be reported if present in the corresponding CXR. When more than one such phenotype exists, we randomly choose one and introduce an error based on it. We will clarify this selection approach in the camera-ready version.

Vague experimental details: Regarding the integration of eye gaze, we rendered the gaze data directly onto the original CXR using red dots. “Without the raw CXR images” is misleading and incorrect; it is a typo and should read a model trained “with the raw CXR images.” We will correct this for the camera-ready version.
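As a concrete illustration of this clarification, a minimal sketch of rendering fixation points as red dots onto the CXR might look as follows; the dot radius and the fixation format are assumptions, not details from the paper.

```python
from PIL import Image, ImageDraw

def draw_gaze_dots(cxr: Image.Image, fixations, radius=4):
    """Render each (x, y) fixation as a filled red dot on the CXR."""
    img = cxr.convert("RGB").copy()
    draw = ImageDraw.Draw(img)
    for x, y in fixations:
        draw.ellipse(
            [x - radius, y - radius, x + radius, y + radius],
            fill=(255, 0, 0),
        )
    return img
```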

Reviewer #4

Limited novelty: The novelty of our work is that it is the first comprehensive study of LLM-based VLMs handling eye-gaze information for CXR analysis. By using an LLM, we extended the VLM’s application to four real-world clinical applications. To the best of our knowledge, this is the first work in the literature on this particular topic.

No VLMs trained on medical data (e.g., LLaVA-Med) tested: This is an inaccurate assessment of our work, as we already included LLaVA-Med as well as CXR-LLaVA, which was trained with MIMIC-CXR data. The results are shown in Table 3.

Reviewer #5

Limited VLMs tested: We tested 10 models, two of which are models we trained ourselves. Although they are LLaVA-based models or use LLaVA’s training approach, each has a different backbone LLM and vision encoder, or makes different use of medically relevant data in the training phase. This ablation study of LLaVA-variant models gave us a clearer understanding of the effects of medical-domain (or even CXR-specific) training, model size, and fine-tuning with eye gaze information.

Model response samples: We will add samples of the responses to the supplementary information of the camera-ready version.

Related work: Due to the page limit, the related work section is limited to two previous works. We will include some of the suggested works; note, however, that those models do not leverage the text modality, which restricts their usage in clinical applications.

Reviewer #6

“Larger models do not perform well due to their higher model parameters” is a simplistic assertion: In our manuscript, we did not assert that the larger model performed poorly because of its larger parameter count. We only said that more parameters do not always guarantee a performance improvement (“This result suggests that adding more parameters does not directly translate to performance improvements”). We hope this response clarifies the misinterpretation of our findings about the larger model. In general, we agree with the reviewer that in-depth analyses focusing on larger models’ poor performance would be a very interesting future research direction.

Thank you for your time and consideration.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Reviewers positively noted the originality and potential clinical utility of the approach. The authors addressed most of the issues raised by clarifying the experimental methodologies and the significance of the findings. It is strongly recommended that the final version clarifies the relative utility of gaze and its limitations as pointed by R5.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Enthusiasm for this work derives from its subject matter, which is of general interest to the community, rather than any especially compelling methodological development or results. Indeed, several significant concerns are raised, but I agree with the majority vote to accept as this work should generate healthy discussion.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


