Abstract

With the rise in respiratory diseases, the workload on radiologists is increasing, leading to a higher risk of diagnostic errors. One approach to improving diagnostic processes is to reduce the frequency of cognitive and perceptual errors made by humans. This study aims to predict radiologists’ diagnostic errors during chest X-ray interpretation using eye-tracking technology. We propose a novel method that combines human attention, derived from the locations of gaze fixation points, with attention from transformer neural networks. The resulting attention maps are combined with segmentations of anatomical structures, including the lungs, clavicles, hila, heart, mediastinum, and esophagus, restricting the analysis to regions potentially relevant for thoracic disease diagnosis. Attention maps are computed for each gaze fixation point, creating a longitudinal path that represents the X-ray reading process. Finally, we apply Gated Recurrent Units (GRUs) to learn from the longitudinal attention maps and statistical gaze features to predict potential X-ray diagnostic errors. The proposed methodology was validated on 4,000 chest X-ray readings performed by four radiologists. The model achieved an error detection accuracy of $0.79$, measured as the area under the receiver operating characteristic (ROC) curve. The code is available at https://github.com/annshorn/TEGRU
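The core idea of the abstract (per-fixation attention maps restricted to anatomical structures, stacked into a longitudinal sequence) can be illustrated with a minimal sketch. The Gaussian fixation model, the circular toy mask, and all parameter values below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def fixation_attention_map(fix_xy, shape, sigma=20.0):
    """Gaussian attention map centred on one gaze fixation point (assumed model)."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    x, y = fix_xy
    g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return g / g.max()

def longitudinal_anatomical_attention(fixations, anatomy_mask, sigma=20.0):
    """Stack per-fixation maps restricted to anatomical regions -> (T, H, W)."""
    maps = [fixation_attention_map(f, anatomy_mask.shape, sigma) * anatomy_mask
            for f in fixations]
    return np.stack(maps)

# Toy example: 64x64 image, circular stand-in "lung" mask, three fixations.
h = w = 64
ys, xs = np.mgrid[0:h, 0:w]
mask = (((xs - 32) ** 2 + (ys - 32) ** 2) < 20 ** 2).astype(float)
fixations = [(30, 30), (40, 28), (10, 10)]  # last fixation falls outside the mask
seq = longitudinal_anatomical_attention(fixations, mask, sigma=5.0)
print(seq.shape)                     # (3, 64, 64): one map per fixation
print(seq[2].max() < seq[0].max())   # attention outside the anatomy is suppressed
```

Each time step of `seq` would then be fed to the downstream sequence model; in the paper the attention comes from a transformer rather than a fixed Gaussian.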

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3879_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/annshorn/TEGRU

Link to the Dataset(s)

N/A

BibTex

@InProceedings{AniAnn_Longitudinal_MICCAI2025,
        author = { Anikina, Anna and Ibragimova, Diliara and Mustafaev, Tamerlan and Mello-Thoms, Claudia and Ibragimov, Bulat},
        title = { { Longitudinal anatomical attention maps for recognizing diagnostic errors from radiologists’ eye movements } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15966},
        month = {September},
        pages = {317--327}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors combine human attention derived from gaze fixations with anatomical attention maps generated by a transformer network, restricted to clinically relevant anatomical regions. They model the longitudinal sequence of gaze and image features with a combination of Vision Transformers and Gated Recurrent Units (GRUs) to capture the temporal reading process.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Adequate data and reasonable validation: The study is based on 4,000 chest X-ray readings from four radiologists, providing a sufficiently large dataset for training and evaluation. The validation process is carefully designed and appropriate for the task.

    Use of longitudinal modeling: Applying a longitudinal approach to model the sequence of radiologists’ gaze and attention during image interpretation is a logical and well-justified choice for predicting diagnostic errors, effectively capturing the temporal dynamics of the reading process.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Lack of detailed ablation studies: The paper does not provide sufficient ablation experiments to separately analyze the contribution of each key component, such as anatomical attention mapping and longitudinal GRU modeling. In particular, while GRU is used for sequence modeling, there are many alternative methods, including Transformer-based architectures, that are now more commonly adopted. GRU’s main advantage is simplicity, but a comparison with more recent methods would better support the design choices.

    Unclear real-world clinical applicability: Although the proposed method demonstrates strong performance on the collected dataset, it remains unclear how well it would integrate into real-world clinical workflows, and whether it would generalize to different radiologists, institutions, or eye-tracking settings.

    Limited comparison to broader baselines (I am not sure): The paper mainly compares against previous gaze-based or feature-based models. However, it does not benchmark against strong modern chest X-ray classifiers that, while not using gaze data, could potentially detect diagnostic errors indirectly. Including such baselines could provide a clearer picture of the added value of gaze-based approaches.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I do not fully understand the task of the paper; I only understand the method.

  • Reviewer confidence

    Not confident (1)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    My concerns about the ablation study and broader baselines were not addressed.



Review #2

  • Please describe the contribution of the paper

    This work aims to detect radiologist errors in reading chest x-rays. It uses the location of the gaze fixation points and the image information merged in a system combining Vision Transformer and GRU networks.

    The Vision Transformer uses patches filtered by a mask corresponding to the region of interest (lungs) instead of a regular grid of patches. The sequence of fixation points creates a sequence of features describing the behavior of the practitioner. A cumulative mask quantifies the amount of information seen by the reader at each time step, weighting each position. The features coming from the ViT, the mask, and the fixations are merged in a GRU. A weighted pooling layer makes the final decision.
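    The ViT-features → GRU → weighted-pooling pipeline described above can be sketched in minimal numpy form. The weight names (`Wz`, `Uz`, …), dimensions, softmax pooling, and random stand-in features are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, p):
    """Single GRU update (update gate z, reset gate r, candidate state n)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
    n = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1.0 - z) * n + z * h

def predict_error(feature_seq, p, w_pool, w_out):
    """Run the GRU over per-fixation features, then attention-weighted pooling."""
    h = np.zeros(p["Uz"].shape[0])
    states = []
    for x in feature_seq:
        h = gru_step(x, h, p)
        states.append(h)
    H = np.stack(states)             # (T, hidden): one state per fixation
    scores = H @ w_pool              # one scalar score per time step
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()             # softmax pooling weights over time
    pooled = alpha @ H               # weighted average of hidden states
    return sigmoid(pooled @ w_out)   # probability of a diagnostic error

d_in, d_h, T = 8, 16, 5              # toy dimensions
p = {k: 0.1 * rng.standard_normal((d_h, d_in if k[0] == "W" else d_h))
     for k in ["Wz", "Wr", "Wh", "Uz", "Ur", "Uh"]}
w_pool, w_out = rng.standard_normal(d_h), rng.standard_normal(d_h)
seq = rng.standard_normal((T, d_in))  # stand-in ViT features per fixation
prob = predict_error(seq, p, w_pool, w_out)
print(0.0 < prob < 1.0)              # prints True: a valid probability
```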

    The system is evaluated on public datasets merged into a larger one (1,000 images). Each chest X-ray was viewed by 4 radiologists to record their behavior.

    Results show that the use of eye-tracking improves error detection compared to existing image-only solutions.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper is well written, especially the methodology gives a detailed description of the proposed model.
    • Subsampling the patches before the ViT is very interesting. The use of CPD to register the patches is also original.
    • The experimental part is consistent. It seems that the authors have implemented several different approaches to compare with their proposal.
    • The figures are appropriate and helpful.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The description of the model is quite dense and hard to follow, especially the use of the mask $M^{(t)}$ in Sections 2.4 and 2.5. Perhaps a step is missing that explains how this mask is used in the ViT. Figures 1 and 2 help to understand the idea, but some ambiguity remains.
    • The pooling layer (without trainable weights) after the GRU seems quite simple for making the final decision. Perhaps a more complex strategy would allow better merging of the information. For example, a bidirectional GRU followed by global max pooling might be interesting (and perhaps fewer layers would be necessary).
    • This work creates new data, and it would be very interesting for the community to share it.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    For future work, it would be interesting to see some masks and see if they can be a source of information to explain the mistakes.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This work is original, with a lot of work for data collection and several implementations for comparison.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This paper will be useful. There is little work combining eye-tracking and expert behavior to improve the final diagnostic decision.



Review #3

  • Please describe the contribution of the paper

    The authors present a framework that combines longitudinal eye-tracking data and anatomically-guided visual transformers to predict radiologists’ diagnostic errors during chest X-ray interpretation. Attention maps are aligned with anatomical structures and encoded gaze (fixated regions). The approach is developed using 826 training cases and tested using 134 cases, with redundant gaze & dictation data from four radiologists.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The method is sound and fusing time-series radiologist gaze data with anatomical patches is an interesting concept. Quantitatively and intuitively, data encoded in the longitudinal data representation leads to better diagnostic error prediction performance.

    2. The data covers a wide range of pathologies across multiple datasets. Gaze data collected from four radiologists with diverse experience range is also a valuable data point.

    3. The discussion section and the findings are interesting: the observation that fewer fixations tend to lead to fewer errors is notable, and the method encodes this type of knowledge effectively using the ViT + GRU framework.

    4. Resulting performance suggests the proposed algorithm works as intended, leading to better diagnostic error detection performance.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Definition of diagnostic error is too coarse: uncovering more granular error modes such as error in presence/location/severity will make this paper more complete.

    2. Ablation is weak: the proposed system contains many parts such as patching, CPD, attention masking, GRU fusion etc. While some ablations are done, it is not comprehensive. However, the longitudinal encoding seems to be the main contribution of the paper and ablation study for this point is clearly provided.

    3. Minor, but the motivation in the introduction does not fully align with the study design. For example, the authors state “radiologists do more than interpret images, they also interact with patients and colleagues,” but the reader study is designed to isolate radiologists while reading. Also, while the authors emphasize the significance of adopting AI models in real clinical practice, the feasibility of wearing eye-tracking sensors during X-ray interpretation in real-world workflows remains unclear and unjustified. It may actually take longer to read if technology like this is deployed in real workflows, but this would be beyond the scope of this conference.

    4. More granular analysis using radiologist’s experience level (since the data is collected from a wide range of expertise, 3-30 years) and the model’s diagnostic error prediction performance would have made this paper stronger.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think this is a well-executed paper for MICCAI: it is technically sound, solves a meaningful problem, and uncovers interesting insights about the longitudinal representation of gaze and its use for predicting radiologists’ diagnostic errors. There are clear limitations of the study, such as the feasibility of impacting real-world workflows and insufficient granularity in the analysis and in the definition of diagnostic errors. However, the scope of the paper fits MICCAI and the work can lead to higher-impact future studies, hence I lean toward accept.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    Authors adequately addressed my criticisms in the rebuttal and suggested reasonable action plan to edit the paper.




Author Feedback

Dear Reviewers, Area and Program Chairs,

We thank you for your overall positive opinion and constructive feedback. In line with the rebuttal guidelines, we do not propose any new experiments; all suggested manuscript changes are intended to improve clarity and presentation:

  1. Motivation for using eye-tracking: Reviewer 3 asked for a fundamental clarification on why eye-tracking is needed to detect radiologists’ errors from gaze movements instead of directly comparing radiologists’ decisions to automated AI diagnoses. We acknowledge that this motivation may not be sufficiently explained:
    • Explainability: Many diagnostic errors occur because radiologists simply overlook a finding entirely. Eye-tracking technology directly detects what was viewed and what was missed. This allows the error detection framework to identify potential errors due to a lack of visual attention, which cannot be revealed by AI-based autodiagnosis.
    • Efficiency: Eye-tracking can detect errors immediately after the reader finishes looking at the X-ray image. If AI-based autodiagnosis is used for error detection, the system needs to wait for the radiologist to completely communicate their decision using dictation or typing. Only after this can potential errors be analyzed.
    • Effective use of AI: In the standard workflow using AI-based autodiagnosis, the radiologist first needs to provide the diagnosis in the form of clinical notes or dictations. These sources are then processed with NLP and audio analysis before the human diagnosis can be extracted and evaluated. This introduces errors that compound with those from the AI-based autodiagnosis.
    • Clarification in the manuscript: We will clarify this motivation explicitly in the Introduction.
  2. Real-world clinical implementation: Reviewers raised concerns about the real-world integration of eye-tracking in radiology. We want to emphasize that modern eye-trackers are compatible with routine clinical workflows. Screen-mounted eye-trackers, paired with any screen recording software, can be integrated with existing PACS systems with minimal setup. Wearable eye-tracking glasses require virtually no preparation.
    • Clarification in the manuscript: Literature references on practical eye-tracking integration will be added.
  3. Reproducibility: We agree with the Reviewers that the framework replication will be facilitated via data and code availability. We have created an open-access GitHub repository that includes detailed instructions for data preparation and framework implementation to support full reproducibility of our results. We agreed with some radiologists that part of their eye-movement data will be made publicly available in the repository. However, we did not obtain permission to release the complete database of eye movements and diagnostic dictations. Following the rebuttal guidelines prohibiting external links, we will provide access to this repository after the final decision.
    • Clarification in the manuscript: The GitHub link will be added.
  4. Ablation experiments: Two Reviewers suggested that ablation studies could enrich the manuscript. Due to the rebuttal requirements, we cannot propose any additional experiments. However, we would like to point out the ablation and comparison experiments mentioned in the original manuscript:
    • RNN vs. Transformer Architectures: In TransGATConv [16], the authors compared different transformer architectures for eye-tracking analysis. Due to space constraints, we selected and presented only the best architecture from [16] for comparison against our RNN-based solution.
    • Images and Patch Configurations: The comparison between anatomical and grid-based patches is presented in Table 1.
    • Geometric transformation: In the Discussion, we mentioned the framework’s performance changes under input X-ray transformation.
    • Clarification in the manuscript: This can be moved to the Results for improved clarity.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper covers a less crowded area, and it is desirable to see papers in this space. All reviewers have seen the potential and interest in this work.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    I have read the manuscript, review comments, and rebuttal letter. There are mixed reviews: two rejects and one accept. All reviewers think this work addresses the interesting and important task of detecting diagnostic errors, and that the data were carefully designed and collected. However, R3 pointed out ablation-study and clinical-applicability issues. In a real-world environment, radiologists might not focus only on the screen to read the reports, but also on printed CT images. Besides, it is hard to control the gaze patterns of radiologists. This meta-reviewer decides to reject, owing to the ablation studies and real-world clinical applicability.


