Abstract

Gaze estimation is pivotal in human scene comprehension tasks, particularly in medical diagnostic analysis. Eye-tracking technology facilitates the recording of physicians’ ocular movements during image interpretation, thereby elucidating their visual attention patterns and information-processing strategies. In this paper, we first define the context-aware gaze estimation problem in medical radiology report settings. To understand the attention allocation and cognitive behavior of radiologists during the medical image interpretation process, we propose a context-aware Gaze EstiMation (GEM) network that utilizes eye gaze data collected from radiologists to simulate their visual search behavior patterns throughout the image interpretation process. It consists of a context-awareness module, visual behavior graph construction, and visual behavior matching. Within the context-awareness module, we achieve intricate multimodal registration by establishing connections between medical reports and images. Subsequently, for a more accurate simulation of genuine visual search behavior patterns, we introduce a visual behavior graph structure, capturing such behavior through high-order relationships (edges) between gaze points (nodes). To maintain the authenticity of visual behavior, we devise a visual behavior-matching approach, adjusting the high-order relationships between gaze points by matching the graphs constructed from real and estimated gaze points. Extensive experiments on four publicly available datasets demonstrate the superiority of GEM over existing methods and its strong generalizability, which also provides a new direction for the effective utilization of diverse modalities in medical image interpretation and enhances the interpretability of models in the field of medical imaging. https://github.com/Tiger-SN/GEM

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1951_paper.pdf

SharedIt Link: https://rdcu.be/dVZiN

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72378-0_49

Supplementary Material: N/A

Link to the Code Repository

https://github.com/Tiger-SN/GEM

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Liu_GEM_MICCAI2024,
        author = { Liu, Shaonan and Chen, Wenting and Liu, Jie and Luo, Xiaoling and Shen, Linlin},
        title = { { GEM: Context-Aware Gaze EstiMation with Visual Search Behavior Matching for Chest Radiograph } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        pages = {525 -- 535}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes context-aware Gaze EstiMation (GEM), which predicts the eye gaze of radiologists from the medical image and the lesion name. The paper also proposes VBMatch, which attempts to align the estimated and ground-truth gaze points by matching their graphs.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The Visual Behavior Graph Construction (VBGC) and Visual Behavior Matching (VBMatch) modules are novel.
    2. Apart from a few concerns in the Experiments and Results section (Section 3), the Introduction (Section 1) and Method (Section 2) sections are well-written.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The quantitative comparisons in Table 1 show that the performance of GEM is close to the baselines. Hence, the statistical significance of the results comes into question. It would be better if the authors reported p-values for these results. The same concern applies to Table 2.
    2. The authors report quantitative results for the MIMIC-Eye dataset and qualitative results for the OpenI, MS-CXR, and AIforCOVID datasets. This is not clear in subsection 3.1. The justification for not reporting quantitative results on the three datasets is also not given, which makes the experimentation details incomplete and weak.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors are recommended to address Weakness points 1 and 2. Minor: Why are alpha and beta selected to be 1 and 0.1? What is the reason for overweighting the MSE loss?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work proposes a novel method for context aware gaze estimation using VBMatch that helps in making the predicted gaze point more accurate. However, the main weakness of the paper is the lack of quantitative evaluation on the 3 datasets.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments, and hence I upgrade my review.



Review #2

  • Please describe the contribution of the paper

    The paper provides a multi-modal gaze estimation technique based on visual search behavior for chest radiographs. Specifically, it uses eye gaze data to interpret visual search patterns of radiologists and coregisters with medical report information. Additional modules based on graph structures and matching techniques are used to estimate the gaze points.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Experiments conducted on four publicly available datasets show the generalizability of the proposed model.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Several aspects of the methodology are not very clearly explained, and the motivation for including the different modules is not clearly justified.
    • The main drawback of the proposed method is that, although several additional modules are included in the model for extracting meaningful information, the improvement over the state of the art is not statistically significant.
    • Except for PCK@0.3, the improvement in all the metrics is minimal.
    • The ablation study is missing a model that has no context-aware module and only the VBMatch module.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the main weaknesses mentioned above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the work is interesting, the results of the proposed model are not very significant.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This study shows a solution to estimate the eye gaze pattern for chest x-ray images without the need of scene and head image. They propose a context-aware method that harnesses the capabilities of both image and text to estimate eye gaze. Subsequently, it generates a graph of predicted keypoints, aligning or matching them with the graph of ground truth keypoints.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors utilize the power of image and text to learn the eye gaze pattern, which is really interesting and interpretable. The proposed visual behavior matching technique is also very innovative. The experiments are adequate, with multiple datasets and evaluation metrics.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    There are only two comparison methods, and they are not designed specifically for gaze estimation.

    The improvement in accuracy seems small compared to the other methods.

    Potential lack of thorough literature review of gaze estimation in medical imaging.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    For the graph construction part, the authors first use a gaze mask to crop the correlation map and then construct the graph. One question: does each 6x6 patch in the gaze mask represent one node? Why do the authors choose a graph for the matching instead of other feature alignment methods? Would other feature alignment methods help here?

    For eye gaze use in medical imaging, gaze is normally displayed as a mix of sparse and dense patterns on the image: over a pathology region, gaze forms a dense cluster, while over other regions it is sparser. In this study, can this kind of pattern be modeled during gaze estimation?

    I suppose the predicted gaze points are more a kind of localization information, similar to a bounding box, than real gaze points (denser or sparser over the whole image, as mentioned). In that case, why do we still need gaze point estimation rather than directly predicting a bounding box? We originally use eye gaze because it contains expert knowledge and is a by-product of the radiologist’s work, so radiologists no longer need to draw bounding boxes. Now that the gaze points are estimated like a bounding box, what is the advantage of gaze points over other kinds of fine-grained information?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I like this paper, but I wish the authors would think about the last question in my detailed comments: what is the advantage of gaze points over other fine-grained information such as bounding boxes or segmentation masks?

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    Regarding my question about the advantage of gaze points, I do not think the authors’ feedback fully answered it. What I asked is: what is the advantage of predicted gaze points over other fine-grained information such as bounding boxes? Note that the gaze points here are predicted, not obtained directly from radiologists, so they cannot purely represent radiologists’ diagnostic processes. If the goal is to learn a model whose predicted gaze points represent radiologists’ diagnostic processes as closely as possible, then why not directly learn an attention map or a bounding box, since they contain the same information and are easier to learn? The authors also said that the sequence of gaze points over time can provide more useful knowledge, but I assume that, based on this paper, the order of gaze points over time cannot be recovered, and it would be very hard to learn because each radiologist has their own style of reading images. In conclusion, although this paper gives a solution for predicting gaze points (with limited accuracy improvement), the authors do not explain well why this work is meaningful.




Author Feedback

We appreciate valuable comments from reviewers and will consider them in the final manuscript.

R5Q1: Clarity of method and motivation A1: Our GEM network aims to understand radiologists’ attention allocation and visual search behavior patterns during image interpretation, providing interpretability and insights into diagnosis. Since eye gaze can reveal the relation between reports and CXR images, we introduce a context-aware module to generate a correlation map that highlights the regions of the CXR image most related to the report. To simulate radiologists’ visual search behaviors and decision-making processes during image interpretation, we devise a VBGC to capture the eye gaze patterns as graphs, and a VBMatch to preserve real eye gaze patterns via graph matching. We will clarify the method and motivations in the revised manuscript.

R5Q2Q3,R6Q1,R7Q2: t-test for Table 1,2 A2: We previously computed p-values for the results in Tables 1 and 2. The p-values across all metrics are less than 0.05, suggesting the statistical robustness of our GEM over other methods. These results will be included in Tables 1 and 2 in our revised manuscript.
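As a reader’s aid, the paired t-test the authors refer to can be sketched as follows. This is an illustration only: the per-image scores below are made up, and the paper’s actual sample sizes and values are not reproduced here.

```python
import math
import statistics

# Hypothetical per-image PCK@0.3 scores for GEM and a baseline,
# paired over the same test images (values are invented for illustration).
gem = [0.82, 0.79, 0.85, 0.88, 0.76, 0.81, 0.84, 0.80]
base = [0.78, 0.77, 0.80, 0.84, 0.75, 0.78, 0.80, 0.77]

diffs = [g - b for g, b in zip(gem, base)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)           # sample std of the paired differences
t_stat = mean_d / (sd_d / math.sqrt(n))  # paired t statistic, df = n - 1

# Two-sided critical value for alpha = 0.05 at df = 7 is about 2.365,
# so |t| above that corresponds to p < 0.05.
significant = abs(t_stat) > 2.365
print(f"t = {t_stat:.2f}, significant at 0.05: {significant}")
```

In practice one would use a library routine (e.g. a paired t-test from a statistics package) to obtain the exact p-value rather than comparing against a tabulated critical value.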

R5Q4: Ablation study w/ VBMatch only A3: GEM without the context-aware module but with VBMatch improves the baseline by 5.2 in PCK@0.3, indicating its effectiveness.
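For readers unfamiliar with the PCK metric used above: PCK@0.3 counts a predicted point as correct if it lies within 0.3 of a normalization length from its ground-truth point. The sketch below uses one common definition; the exact normalization used in the paper is an assumption, and the coordinates are invented.

```python
import math

def pck(pred, gt, threshold=0.3, norm=1.0):
    """Percentage of Correct Keypoints: the fraction of predicted points
    within `threshold * norm` of their ground-truth points. `norm` is a
    normalization length (e.g. the image diagonal); the paper's exact
    choice is an assumption here."""
    correct = sum(
        math.dist(p, g) <= threshold * norm for p, g in zip(pred, gt)
    )
    return correct / len(gt)

# Toy example with coordinates normalized to [0, 1]:
pred = [(0.10, 0.10), (0.50, 0.52), (0.90, 0.40)]
gt   = [(0.12, 0.11), (0.50, 0.50), (0.55, 0.40)]
print(pck(pred, gt, threshold=0.3))  # 2 of 3 points fall within the threshold
```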

R6Q2: Experiments on CXR datasets A4: Since the OpenI, MS-CXR, and AIforCOVID datasets do not include eye gaze data, we only perform quantitative experiments on the MIMIC-Eye dataset, which includes eye gaze data. To qualitatively evaluate the other datasets, we selected five CXR images from each and asked a radiologist to provide gaze points per image. Quantitative experiments were not performed on these datasets due to the limited labeled data. We will include this experimental setting in our revised manuscript.

R6Q3: Weights of MSE and CE losses A5: The hyperparameters alpha and beta were determined through grid search, with the results indicating that the optimal values are 1 and 0.1.

R7Q1: Compared methods for gaze estimation A6: Our literature review found no existing gaze estimation frameworks that predict gaze points on images with given texts. Existing methods either predict gaze solely from input images or use both images and head pose data, but do not incorporate text information. Our experiments showed that our method outperforms these approaches across all metrics, indicating its superiority in gaze estimation. We will update these results in Table 1 in the revised manuscript.

R7Q3: Literature review of gaze estimation in medical imaging A7: In medical imaging, gaze estimation is divided into two groups: 1) using the medical image as input, and 2) using eye region images for gaze prediction. Gaze is used to guide different medical image diagnosis tasks. We will include this literature review in the revised manuscript.

R7Q4: About graph construction A8: 1) Yes, each 6x6 patch represents one node. 2) The edges of a graph can effectively represent high-order relationships between nodes. The arrangement of visual search patterns in gaze points is indicative of the relationships between these points, making graphs particularly suitable for this task; other feature alignment methods would be inadequate substitutes.
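The node/edge idea in the reply above can be sketched as follows. This is a deliberate simplification: the paper’s VBGC builds nodes from 6x6 patches of a correlation map and VBMatch uses a learned graph-matching objective, neither of which is reproduced here; the distance-based edges and L1 mismatch below are illustrative stand-ins.

```python
import math

def behavior_graph(points):
    """Build a fully connected graph over gaze points: nodes are the
    points, edge weights are pairwise Euclidean distances (a simplified
    stand-in for the paper's learned high-order relations)."""
    n = len(points)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])
            adj[i][j] = adj[j][i] = d
    return adj

def graph_mismatch(adj_a, adj_b):
    """Sum of absolute edge-weight differences between two graphs with a
    known node correspondence -- a crude analogue of a matching loss."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(adj_a, adj_b)
        for a, b in zip(row_a, row_b)
    )

# Real vs. estimated gaze points (coordinates normalized to [0, 1]):
real = [(0.2, 0.3), (0.5, 0.5), (0.8, 0.4)]
est = [(0.25, 0.3), (0.5, 0.55), (0.8, 0.45)]
loss = graph_mismatch(behavior_graph(real), behavior_graph(est))
print(f"graph mismatch: {loss:.3f}")
```

Minimizing such a mismatch pushes the estimated points to reproduce the relative arrangement of the real gaze points, not just their absolute positions, which is the intuition behind matching graphs rather than individual coordinates.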

R7Q5: Capability of modeling pathology and other region pattern. A9: Our model can capture this pattern. We will add a new figure in the revised manuscript to illustrate the sparsity of eye gaze for pathology and other regions.

R7Q6: Advantage of gaze points A10: Gaze points provide two main advantages. 1) Their distribution and density can reveal the radiologist’s attention patterns, areas of interest, and the relative importance of different regions within the image. 2) The sequence of gaze points over time provides insights into the radiologist’s scan path and the order in which different regions were analyzed. As these advantages cannot be obtained from other sources, gaze points are valuable for analyzing radiologists’ diagnostic processes.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper introduces a novel multi-modal gaze estimation method that leverages eye gaze data to interpret radiologists’ visual search behaviors, correlating these with medical report data through newly introduced graph structures and matching techniques. Reviewers appreciated the method’s innovation and its demonstrated applicability across multiple datasets. However, concerns were raised about the clarity of the methodology, the justification for the inclusion of certain modules, and the marginal performance improvements over state-of-the-art methods. The rebuttal has addressed most concerns, particularly by explaining the statistical significance of improvements and detailing methodological components. Nonetheless, it remains necessary to further clarify or provide empirical evidence on how predicted gaze points, despite not being directly sourced from radiologists, can offer distinct advantages over other forms of image annotation such as bounding boxes or attention maps, especially in capturing the diagnostic reasoning process in a clinically relevant manner.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    There are mixed reviews (2 weak accept, 1 weak reject). Although the reviewers raised some concerns, the authors addressed the major issues during the rebuttal. The paper introduces a novel multimodal gaze estimation method based on visual search behavior for chest radiography analysis. The reviewers agree that the method is novel. It looks valuable to discuss this paper at MICCAI. I would suggest acceptance of the paper.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



