Abstract

Upper endoscopy is the preferred method for detecting early-stage gastrointestinal diseases and plays a crucial role in managing gastric cancer. Quality assessment has been a recurring concern in clinical research, particularly regarding the time specialists spend examining different anatomical sites. While current guidelines emphasize thorough inspection and documentation to minimize blind spots, adherence remains low due to the lack of second readers. State-of-the-art automatic approaches audit single frames or fixed temporal windows, with limited performance in real applications. This paper introduces the Multi-Scale Sequence Informative (MSSI) module, a Transformer-based attention mechanism that audits video sequences across multiple temporal scales. The proposed approach estimates the time spent exploring different organs and regions of the stomach. The method processes 15 to 196 tokens (1 to 13 seconds) with a sliding window, building up a mosaic of sampled frames. Each frame is encoded with a pre-trained endoscopy embedding, which feeds a Vision Transformer to capture short-, mid-, and long-range dependencies. The approach is evaluated with 233 endoscopic procedures (∼1.6 million frames), demonstrating a close alignment between estimated procedural times and expert-validated standards. It achieved 92.03% macro precision in organ classification and 89.34% in distinguishing 23 specific views of different stomach sites, a total of 27 classes to audit, showing real potential for use in clinical scenarios. Our code is available at https://github.com/Cimalab-unal/EndoAudit.git.
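
As a rough illustration of the auditing idea in the abstract, the sketch below turns a sequence of per-frame class labels into per-class time estimates. It is not the MSSI module: a plain majority vote over a sliding window stands in for the Transformer, and the function names (`smooth_labels`, `time_per_class`), the label strings, and the frame rate of 15 frames per second (inferred from "15 to 196 tokens (1 to 13 seconds)") are all assumptions made for this example.

```python
from collections import Counter

# Assumption: 15 tokens correspond to about 1 second, as the abstract's figures suggest.
FPS = 15


def smooth_labels(labels, window=15):
    """Majority-vote each frame over a centered sliding window.

    This is a simple stand-in for temporal modeling, not the paper's method.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        lo = max(0, i - half)
        hi = min(len(labels), i + half + 1)
        # most_common(1) returns the label with the highest count in the window
        smoothed.append(Counter(labels[lo:hi]).most_common(1)[0][0])
    return smoothed


def time_per_class(labels, fps=FPS):
    """Seconds attributed to each class, counting one frame as 1/fps seconds."""
    counts = Counter(labels)
    return {cls: n / fps for cls, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical per-frame predictions with one isolated misclassified frame.
    frames = ["esophagus"] * 30 + ["stomach"] + ["esophagus"] * 14 + ["stomach"] * 45
    print(time_per_class(smooth_labels(frames)))
```

Smoothing before counting keeps isolated misclassified frames from being reported as time spent in the wrong region, which is one reason a temporal window can matter for this kind of audit.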

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3297_paper.pdf

SharedIt Link: https://rdcu.be/eHw5O

SpringerLink (DOI): https://doi.org/10.1007/978-3-032-05141-7_4

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/Cimalab-unal/EndoAudit.git

Link to the Dataset(s)

GastroHUN dataset: https://doi.org/10.6084/m9.figshare.27308133

BibTex

@InProceedings{BraDie_Automated_MICCAI2025,
        author = { Bravo, Diego AND Ruano, Josué AND Gómez, Martín AND Gónzalez, Fabio A. AND Romero, Eduardo},
        title = { { Automated Auditing of Upper Endoscopy Procedure Times: A Temporal Multiclass Analysis } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},
        pages = {34 -- 43}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper proposes an organ classification system with multi-frame input for monitoring examination duration during upper gastrointestinal endoscopy. The system is built on a temporal attention module: sequences of video frames are fed into the model to identify organs and anatomical regions.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

#1 The figures are well-organized and easy to understand.

#2 It is noteworthy that the study tackles the supervision of surgical procedures, a matter of great significance in medical education.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

#3 Given that multi-frame-based classification models have been widely explored, it is difficult to assert the novelty of this work solely based on the incorporation of such an approach.

#4 Although the proposed system may have practical value, it is imperative to clearly articulate its advantages over existing systems. Merely emphasizing its support for multi-frame input is not sufficient to substantiate the contribution of this work.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(2) Reject — should be rejected, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Given the limited technical novelty of the proposed approach, I have decided to assign this score.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

Accept
[Post rebuttal] Please justify your final decision from above.

While the introduction section in this paper suggested that the temporal modeling and embedding strategies were the main contributions, the rebuttal clarified that the key contribution lies in the automatic auditing of the time spent for endoscopy. Accordingly, I have updated my evaluation of the paper.

Review #2

Please describe the contribution of the paper

The paper introduces a novel Multi-Scale Sequence Informative (MSSI) framework that leverages Transformer-based temporal modeling for auditing upper endoscopy procedure times across different anatomical regions. By integrating CNN-based spatial embeddings with a Vision Transformer, the method effectively captures both short- and long-range temporal dependencies in video sequences. Evaluated on a large public dataset, the proposed approach achieves high precision in classifying organs and stomach sites, offering clinically relevant quality indicators for real-world endoscopic audits.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

(1) The proposed MSSI framework combines a CNN with a multi-scale ViT to effectively capture the anatomical regions and their order in the upper gastrointestinal endoscopy video in the space-time domain. Compared with traditional single-scale modeling methods, MSSI can simultaneously focus on local tissue features and global organ transition information, which improves the accuracy of anatomical site recognition in video clips and gives it strong modeling capability and practicality.

(2) The authors evaluated the two subtasks of endoscopic organ classification and fine-grained classification of gastric locations, demonstrating the stability of MSSI at different granularity levels. At the same time, through ablation studies of various input lengths (8, 16, and 32 frames), the performance differences of the model under different time windows were revealed, verifying the effectiveness of its multi-scale design. This comprehensive experimental setting enhances the credibility of the method.

(3) Current quality control processes rely heavily on manual video review, which is time-consuming and subjective. The paper offers a scalable, objective solution that can be readily integrated into post-procedural audits or real-time feedback systems, thereby supporting consistent documentation, training, and quality assurance in clinical endoscopy workflows.
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

(1) Although the authors show the performance of the method in different settings, there is a lack of comparison with the current SOTA time series modeling methods, especially some medical-specific methods.

(2) The paper lacks visual analysis (such as Grad-CAM) to explain why the model predicts a specific organ or site, which limits its credibility and feasibility in real clinical settings.

(3) The paper lacks specific analysis of blurred images, motion blur, illumination changes, and blurred anatomical region boundaries, which are common challenges in upper gastrointestinal endoscopy. In addition, there is a lack of experiments on cross-device generalisation capabilities.
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(4) Weak Accept — could be accepted, dependent on rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper presents a practically valuable application of Transformer-based temporal modeling for endoscopy audit, addressing a relevant and underexplored clinical problem. While the methodological novelty is moderate, the proposed framework is well-engineered and evaluated on a real-world dataset with strong results. However, the lack of interpretability analysis and limited baseline comparisons weaken the overall impact.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Review #3

Please describe the contribution of the paper

This paper presents a novel AI-based framework for automated auditing of upper endoscopy (EGD) procedures, focusing on temporal analysis of video sequences to estimate how long different anatomical regions are examined during the procedure. The authors introduce a Multi-Scale Sequence Informative (MSSI) module—a Transformer-based temporal attention mechanism that uses CNN-derived frame embeddings to capture short-, mid-, and long-range dependencies over 1 to 13 seconds of video.
Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Innovative use of Transformers in Endoscopy - The method leverages a Vision Transformer (ViT) for temporal modeling, capturing longer context than traditional LSTM or CNN-based methods used in prior work.
- High classification accuracy - The results on organ and site detection outperform prior methods, especially with longer temporal windows (9–13 seconds).
- Clinical relevance - The model outputs interpretable indicators, such as time spent in specific gastric regions, aligning with real-world quality benchmarks.
- Public dataset and open-source code - Use of GastroHUN (a well-annotated, open dataset) and the provision of code increases reproducibility and applicability.
- Classifies 22 stomach sites, which is finer granularity than most ViT applications
Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- Generalizability concerns - The model was trained and evaluated on a single dataset from one endoscope brand. Its performance on other brands or clinical settings remains uncertain (as correctly highlighted by the authors).
- Unclear failure cases - The paper doesn’t detail where or why the model fails: insights into misclassifications or uncertainty could inform future improvements.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission has provided an anonymized link to the source code, dataset, or any other dependencies.
Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

N/A
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

(5) Accept — should be accepted, independent of rebuttal
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

While prior works use ViTs for recognition in surgical and endoscopic workflows, this paper pivots the focus to procedural auditing, introduces granular temporal monitoring, and provides a transparent, scalable framework on a public benchmark—all of which are underrepresented in the current literature.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

N/A
[Post rebuttal] Please justify your final decision from above.

N/A

Author Feedback

We sincerely appreciate your positive feedback on our work, including its novelty (“is novel”-R1; “highly innovative”-R2; “is of novelty”-R3) and validation (“consistent improvements”-R1; “clear performance improvements”-R2; “a good way to evaluate”-R3). We also thank you for taking the time to read our paper and provide constructive suggestions. We will address your concerns as follows:

R1:

1) Regarding hyperparameter sensitivity, CAIN performs stably as M ranges from 1000 to 1500 and beta ranges from 0.8 to 0.95, with the best performance at beta=0.9, which is therefore selected. 2) CAIN’s performance (Fig.4c) drops notably when the scale of causal modeling is reduced, despite the presence of patient-level information, clearly confirming the effectiveness of causal modeling. 3) *Cosine similarity effectively identifies confounding variables. This has been validated by comparing it with other measures (e.g., Manhattan distance and Pearson correlation). *Confounder modeling addresses complexity across scanners and patient anatomies, achieved through dynamic updates during case learning. 4) The recent clinical study (Faro, JACC, 2023) has demonstrated associations among coronary branches. CAIN’s patient-level design enables rational prediction based on these findings, supporting clinical credibility. 5) The term ‘Do’ refers to the do-expression (Wang, CVPR, 2020), which denotes the intervention operation.

R2:

*We will incorporate your suggestions to provide more detailed results and analyses in the code-link, and further optimize our method in the extended study. *Although CAIN offers clear advantages, it still relies on centerline extraction, which is similar to that used in previous methods, making it susceptible to errors during the extraction process. *We plan to explore vector database technology with larger confounder banks and design a more robust mechanism to eliminate the above dependency.

R3:

1) Although both CTA and CPRs require feature extraction, the total number of extractions needed to predict all branches for a patient is comparable to that of SOTAs and does not affect inference speed. The confounder bank is constructed during training and thus does not impact inference efficiency. 2) The shift-window mechanism used in patient-level representation effectively models the continuity and structure of the vessels across patches, while the hierarchical attention mechanism enhances relevant feature extraction. 3) *F_cpr, C_at, and patient-level prediction are removed together, as they collectively constitute the patient-level pipeline. *The necessity of patient-level features is evidenced by a significant performance drop (Fig.4c) as the number of branches involved in the intervention approaches one. 4) Tab.1 reports the mean and variance across different levels of disease severity to reflect overall performance. A more detailed report will be available via the code-link. 5) Sensitivity is computed in the same way as Recall (Tab.1). 6) *j in L_loc and L_char denotes the index of e_qry and the lesion category, respectively, in each loss term. *The symbols for features will be further annotated in the final version. 7) The causal graph (Fig.1b) aligns with CAIN’s architecture: artery-level representation and prediction correspond to X->Y_a, while patient-level prediction, based on patient-level representation and artery-level prediction, correspond to X->Y_p and Y_a->Y_p. The causal intervention removes spurious associations related to D. 8) *As noted on Page 2, Line 27~31, clinical studies [2,11] indicate that confounders originate from both external factors (e.g., data heterogeneity) and internal factors (e.g., lesion-independent information). Center-based validation and qualitative results demonstrate CAIN’s effectiveness in handling both types. 
*Features most relevant to the current patient are beneficial to the predictions and are adaptively matched through a dynamic update strategy and causal intervention.

Meta-Review

Meta-review #1

Your recommendation

Invite for Rebuttal
If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

N/A
After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Reject
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

For this CAI application paper, the reviewers converged towards acceptance. Despite what seems to be limited novelty in the design, the paper has raised interest from the reviewers from an application perspective by addressing the relevant but under-explored auditing problem, supported by good experimental results on a public dataset. Reviewers have also recognized the method’s potential to assist today’s manual quality control process. Some weaknesses remain. In particular, while there is a SOTA comparison against auditing methods, there is none against other time series modeling methods. Although the authors claim that no previous work has reported comparable information, choosing a meaningful baseline outside of the specific experimental setup of the paper would improve impact.

Meta-review #3

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A

Automated Auditing of Upper Endoscopy Procedure Times: A Temporal Multiclass Analysis

Author(s):