Abstract

Deep neural networks have demonstrated remarkable performance in medical image analysis. However, their susceptibility to spurious correlations due to shortcut learning raises concerns about network interpretability and reliability. Furthermore, shortcut learning is exacerbated in medical contexts, where disease indicators are often subtle and sparse. In this paper, we propose a novel gaze-directed Vision GNN (called GD-ViG) that leverages the visual patterns of radiologists from gaze as expert knowledge, directing the network toward disease-relevant regions and thereby mitigating shortcut learning. GD-ViG consists of a gaze map generator (GMG) and a gaze-directed classifier (GDC). Combining the global modelling ability of GNNs with the locality of CNNs, the GMG generates the gaze map based on radiologists’ visual patterns. Notably, it eliminates the need for real gaze data during inference, enhancing the network’s practical applicability. Utilizing gaze as expert knowledge, the GDC directs the construction of graph structures by incorporating both feature distances and gaze distances, enabling the network to focus on disease-relevant foregrounds, thereby avoiding shortcut learning and improving the network’s interpretability. Experiments on two public medical image datasets demonstrate that GD-ViG outperforms state-of-the-art methods and effectively mitigates shortcut learning. Our code is available at https://github.com/SX-SS/GD-ViG.
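
As an editorial aid, here is a minimal sketch of the two-module pipeline the abstract describes: the GMG predicts a gaze map from the image alone, and the GDC classifies the image under that predicted attention. All module names, layer choices, and shapes are illustrative assumptions, not the authors' released implementation (see the code repository linked below for that).

    # Hedged sketch of the GD-ViG pipeline; toy layers stand in for the
    # paper's GNN/CNN blocks, and all names here are assumptions.
    import torch
    import torch.nn as nn

    class GazeMapGenerator(nn.Module):
        """Stand-in for the GMG: predicts a gaze map from the image alone,
        so no real eye-tracking data is needed at inference time."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.net(x)  # (B, 1, H, W) predicted gaze map in [0, 1]

    class GazeDirectedClassifier(nn.Module):
        """Stand-in for the GDC: classifies the image with the predicted
        gaze map as guidance (the graph construction guided by feature and
        gaze distances is sketched further down this page)."""
        def __init__(self, num_classes=2):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Linear(8, num_classes)

        def forward(self, x, gaze_map):
            # Toy use of the gaze map: reweight pixels by predicted attention.
            return self.head(self.backbone(x * gaze_map))

    x = torch.randn(4, 1, 224, 224)          # a batch of chest X-rays
    gmg, gdc = GazeMapGenerator(), GazeDirectedClassifier()
    logits = gdc(x, gmg(x))                  # no real gaze needed at inference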

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1797_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1797_supp.pdf

Link to the Code Repository

https://github.com/SX-SS/GD-ViG

Link to the Dataset(s)

https://www.kaggle.com/competitions/siim-acr-pneumothorax-segmentation
https://github.com/HazyResearch/observational
https://physionet.org/content/egd-cxr/1.0.0/
https://physionet.org/content/mimic-cxr/2.0.0/

BibTex

@InProceedings{Wu_Gazedirected_MICCAI2024,
        author = { Wu, Shaoxuan and Zhang, Xiao and Wang, Bin and Jin, Zhuo and Li, Hansheng and Feng, Jun},
        title = { { Gaze-directed Vision GNN for Mitigating Shortcut Learning in Medical Image } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15001},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work proposes GD-ViG, a gaze-directed Vision GNN that integrates radiologists’ eye-gaze patterns into neural networks. The method comprises two modules, namely the Gaze Map Generator (GMG) and the Gaze-Directed Classifier (GDC), and is evaluated on two public datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The gaze map generation is a useful approach, as eye gaze is not required during inference. In addition, the gaze-directed graph construction layer (GDGC) is a novel method that builds graphs from feature and gaze maps.
    2. The quantitative comparisons are extensive and are carried out on two public datasets.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The idea of constructing graphs from gaze has been proposed in GazeGNN, and the use of a gaze-guided framework to mitigate shortcut learning has been proposed in EG-ViT. The paper lacks a clear explanation of how the proposed method differs from these methods rather than being a mere combination of the previously mentioned techniques.
    2. The quantitative comparisons shown in Table 1 are close to the baselines. It would be better if the authors reported p-values (as in Subsection 3.3).
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors are recommended to address Weakness points 1 and 2.
    2. In Figure 2, it would be better if the authors showed arrows/bounding boxes in the image column.
    3. In Figure 3, the quantitative values of the distances could be shown. This would add more insight and make the figure more informative.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The work proposes a useful method to generate gaze maps and perform gaze-guided disease classification. The method outperforms several baselines.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors addressed my comments, and hence I upgrade my review.



Review #2

  • Please describe the contribution of the paper

    The paper proposes a fascinating approach to leveraging radiologists’ gaze through the proposed gaze-directed Vision GNN (GD-ViG). The aim is to redirect the model’s attention toward disease-relevant regions, mitigating shortcut learning. GD-ViG consists of two components: a gaze map generator and a gaze-directed classifier. This interesting method also aims to improve model interpretability.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • An original way to use gaze data with a GNN.
    • Well-written paper; clear and organized.
    • They compared with other methods that use gaze in different ways.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper repeatedly suggests, as a major contribution, that the proposed method is unique in generating gaze maps for unseen medical images at test time/inference. However, other papers have done something similar (not needing real gaze data at test time but generating gaze maps instead), though not with GNNs. Some of them are below:
    https://ieeexplore.ieee.org/abstract/document/8363851
    https://link.springer.com/chapter/10.1007/978-3-030-00928-1_98
    https://www.sciencedirect.com/science/article/pii/S1361841522002584
    https://www.sciencedirect.com/science/article/pii/S1361841520301262#bib0010

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Please include a definition of shortcut learning somewhere in the paper, and explain how it differs from overfitting. Please do more to explain the difference between feature distance and gaze distance. In Sec. 3.4, is the argument that using one of the distances alone will lead to irrelevant regions being focused on, whereas using both will lead to relevant regions? In the gaze-directed classifier, it is not clear enough why we need K neighbors. What is the philosophy behind the choice of the value of the balance coefficient?

    • Please explain Fig. 3 more: what is the difference between (a) and (b)? Some of the patches are eliminated (red) and then somehow returned (blue); this is somewhat unclear. The figure probably needs more explanation to be clearer. Do the red and blue patches mean different things than the red and blue dots? If possible, please make the figure able to stand alone, independent of the paragraph in Section 3.4.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please try to answer the question of how the proposed method compares with other gaze-based methods that generate a gaze map at inference.
    • Please include explanations in the paper that highlight its uniqueness compared to other work that generates gaze maps at test time, to help the paper stand out more.
    • Please improve the clarity of Fig. 1. Currently, the colors of Down-Sampling and Up-Sampling look too similar. Please show more clearly where the GNN blocks are in the GMG so that the reader can notice them. If possible, have the GDC show that the maps are downsampled before they are fed into / inserted at each set of blocks/layers.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A good, interesting paper. Using gaze with GNNs is unique. However, the paper does need to address some of the other methods that generate gaze maps at inference, and reevaluate whether that can still stand as a unique contribution of the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors helped clarify the difference with their approach, though I do hope they update how they phrase things in the current version of the paper. At the moment, it implies that without a GNN a model cannot generate gaze maps and eliminate the need for real gaze at inference, even though models such as M-SEN are capable of this as well, if perhaps not at quite the same level of performance.



Review #3

  • Please describe the contribution of the paper

    This paper presents GD-ViG, a Vision Graph Neural Network (GNN) that uses radiologists’ gaze patterns to guide disease-relevant region detection. It includes a Gaze Map Generator (GMG) and a Gaze-Directed Classifier (GDC), combining GNNs’ global modeling with CNNs’ locality. GMG creates gaze maps without real gaze data, enhancing practicality. GDC constructs graph structures using gaze information, improving interpretability and avoiding shortcut learning. Experimental results on medical image datasets demonstrate the model’s superior performance over existing methods.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper is well written and well organized.

    2. The paper improves the state of the art in radiology grade classification by a significant margin.

    3. The visualizations are good (e.g., Fig. 2 and Fig. 3) and help the reader better understand the merits of the model and the contributions of this work.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The figure and table captions need to be more descriptive.

    2. A comparison of the gaze prediction subnetwork with existing saliency prediction models, such as DeepGaze II and EML-NET saliency, is missing.

    3. Why graph modeling is used for finding feature linkage has not been well explained. Transformer layers (e.g., ViT) also model patch feature similarity. How does graph modeling compare with the use of self-attention layers in transformers to model patch similarity?

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the rebuttal I would like the authors to address the concerns I raised to further improve my rating.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper is well written and the performance improvement is significant compared to the existing models. Thus I lean towards acceptance provided the concerns I raised are addressed in the rebuttal.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The rebuttal clarified my queries. I vote for the paper acceptance.




Author Feedback

We thank all the reviewers for their efforts and insightful comments. We address their concerns below.

Q1: Novelty and Uniqueness of Our Method (R5) Our GD-ViG simulates the diagnostic process of doctors by constructing effective graph representations using generated gaze maps. The method spatially correlates different lesion areas, prompting the network to focus more on these disease-related regions and mitigating shortcut learning, as shown in Fig.3. In contrast, GazeGNN and EG-ViT can only utilize ground-truth gaze maps as simple auxiliary information (e.g., additional channels or feature masks) and therefore cannot intuitively reveal the model’s decision-making process. This characteristic significantly enhances the interpretability and reliability of our method in clinical applications. We will highlight the novelty of our method in the final version.

Q2: Comparative Analysis with Gaze Data-Utilizing Methods (R3 & R4) TSEN [1] and M-SEN [2] utilize a GAN or bi-CLSTM for gaze generation and detection. However, they are constrained by the locality of CNNs, making it challenging to consider lesion areas at different spatial positions simultaneously. Our method overcomes this limitation by effectively generating gaze maps (Fig.S2) and using these maps to aggregate lesion areas that are beneficial for diagnosis (Fig.3). This approach aligns with the diagnostic workflow of doctors, thereby enhancing interpretability. Experimental results show that our method outperforms M-SEN on the SIIM-ACR and EGD-CXR datasets (Acc 87.20 vs. 84.80; 85.05 vs. 78.50) and the saliency model EML-Net [3] (Acc 87.20 vs. 85.20; 85.05 vs. 77.57). We have publicly released the data and code and will provide a detailed analysis of the experimental results in the final version.

Q3: Explanation for Using Graph Model Instead of ViT (R4) While both the graph model and ViT involve dividing images into patches for feature extraction, the graph model has the advantage of explicitly aggregating distinct spatial lesion areas that represent various disease states. ViT relies on implicit learning of these spatial relationships, potentially limiting performance enhancement and network interpretability.
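
A hedged illustration of this contrast (shapes only, not the paper's code): the graph route builds an explicit, sparse neighbour set per patch, while self-attention relates all patches densely and implicitly. The max-relative aggregation follows the scheme used in ViG; everything else here is an assumption.

    import torch
    import torch.nn.functional as F

    feats = torch.randn(196, 64)   # 14x14 image patches, 64-dim features

    # (a) Graph route (ViG-style): each patch explicitly aggregates its k
    # nearest neighbours in feature space -- a sparse, inspectable edge set
    # (note the nearest "neighbour" is the node itself, at distance 0).
    k = 9
    dist = torch.cdist(feats, feats)           # (196, 196) pairwise distances
    idx = dist.topk(k, largest=False).indices  # explicit neighbour indices
    neighbours = feats[idx]                    # (196, k, 64)
    # Max-relative aggregation (MRConv), as in ViG:
    graph_out = torch.cat(
        [feats, (neighbours - feats.unsqueeze(1)).max(dim=1).values], dim=-1)

    # (b) Transformer route: dense softmax attention learns such relations
    # implicitly; every patch attends to every other patch.
    attn = F.softmax(feats @ feats.t() / feats.shape[-1] ** 0.5, dim=-1)
    attn_out = attn @ feats                    # (196, 64)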

Q4: Clarification of Experimental Parameters (R3 & R5) The graph convolutional layer aggregates information from a node and its k nearest neighbors to improve feature representation. In alignment with ViG [4], we set the hyperparameter k to 9. The balance coefficient is set to 1 because we consider the optimization of the generator and the classifier equally important, and network performance is best at this value. Moreover, paired t-tests between our method and the baseline methods yield p-values below 0.05 on the Acc, AUC, and F1 metrics.
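
Read concretely, Q4 implies a two-term objective in which the balance coefficient weights the gaze-generation loss against the classification loss. A hedged sketch follows; the specific loss terms (cross-entropy and MSE) are assumptions, not necessarily the paper's exact choices.

    import torch
    import torch.nn.functional as F

    def total_loss(logits, labels, pred_gaze, real_gaze, balance=1.0):
        cls_loss = F.cross_entropy(logits, labels)    # disease classification
        gaze_loss = F.mse_loss(pred_gaze, real_gaze)  # gaze-map generation
        return cls_loss + balance * gaze_loss         # balance = 1: equal weight

    logits = torch.randn(4, 2)
    labels = torch.randint(0, 2, (4,))
    pred, real = torch.rand(4, 1, 56, 56), torch.rand(4, 1, 56, 56)
    loss = total_loss(logits, labels, pred, real)     # balance defaults to 1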

Q5: Further Explanation of the Figure (R3, R4 & R5) Fig.3 shows two graph construction methods: (a) feature distance-based and (b) gaze distance-based. Feature distance quantifies the disparity between two regions in the feature space, while gaze distance reflects the variance in doctors’ attention to those regions. In (a) and (b), red patches represent nodes unrelated to diagnosis that are linked to the central node but removed in the revised graph structure after the distances are merged. Conversely, blue patches denote nodes related to diagnosis that are connected to the central node in (a) or (b) and correctly preserved after merging. The final paper will provide a more comprehensive explanation of Fig.3 with quantitative distance values. Fig.1 and Fig.2 will also be revised according to the comments.
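
The distance-merging idea behind Fig.3 can be made concrete with a short sketch: a neighbour is kept only if it is close in the fused metric, so a patch must be both feature-similar and similarly attended by radiologists to stay connected. The simple additive fusion and the function name below are illustrative assumptions.

    import torch

    def gaze_directed_knn(feats, gaze, k=9):
        """feats: (N, C) patch features; gaze: (N,) per-patch gaze intensity."""
        feat_dist = torch.cdist(feats, feats)                      # feature space
        gaze_dist = (gaze.unsqueeze(0) - gaze.unsqueeze(1)).abs()  # attention space
        merged = feat_dist + gaze_dist                             # fused distance
        # k nearest neighbours under the fused metric (self included at 0):
        return merged.topk(k, largest=False).indices               # (N, k)

    feats, gaze = torch.randn(196, 64), torch.rand(196)
    edges = gaze_directed_knn(feats, gaze)  # Fig.3's red nodes drop out here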

Q6: Definition of Shortcut Learning (R3) Shortcut learning [5] refers to a model prioritizing simple but task-irrelevant patterns in the data, harming generalizability and dependability; for example, a pneumothorax classifier that keys on inserted chest drains rather than the lesion itself has learned a shortcut. Overfitting, by contrast, occurs when a model fits the training dataset too closely and fails to generalize to new, unseen data.

[1] doi:10.1016/j.media.2020.101762 (TSEN)
[2] doi:10.1007/978-3-030-00928-1_98 (M-SEN)
[3] doi:10.1016/j.imavis.2020.103887 (EML-Net)
[4] doi:10.5555/3600270.3600873 (ViG)
[5] doi:10.1038/s42256-020-00257-z (shortcut learning)




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    R4 and R5 have raised their scores after the rebuttal. I agree with the consensus.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


