Abstract

Camera localization in endoscopy videos plays a fundamental role in enabling precise diagnosis and effective treatment planning for patients with Inflammatory Bowel Disease (IBD). Precise frame-level classification, however, depends on long-range temporal dynamics, ranging from hundreds to tens of thousands of frames per video, challenging current neural network approaches. To address this, we propose EndoFormer, a frame-level classification model that leverages long-range temporal information for anatomic segment classification in gastrointestinal endoscopy videos. EndoFormer combines a Foundation Model block, judicious video-level augmentations, and a Transformer classifier for frame-level classification while maintaining a small memory footprint. Experiments on 4160 endoscopy videos from four clinical trials and over 61 million frames demonstrate that EndoFormer achieves an AUC of 0.929, significantly improving on state-of-the-art models for anatomic segment classification. These results highlight the potential for adopting EndoFormer in endoscopy video analysis applications that require long-range temporal dynamics for precise frame-level predictions.
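
The abstract reports a single AUC figure without saying how it is aggregated over anatomic segments. As a rough illustration only, the sketch below assumes a macro-averaged one-vs-rest AUC over per-frame class probabilities, computed with the rank-sum (Mann-Whitney) formula. The function names `binary_auc` and `macro_ovr_auc` and the toy inputs are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive frame outscores a
    random negative frame (ties count as one half)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        # Find the run of tied scores starting at sorted position i.
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank for the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative frame")

    # Mann-Whitney U statistic of the positives, normalised to [0, 1].
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def macro_ovr_auc(probs: List[Sequence[float]], labels: Sequence[int],
                  n_classes: int) -> float:
    """Assumed aggregation: unweighted mean of one-vs-rest AUCs per class."""
    aucs = []
    for c in range(n_classes):
        class_scores = [p[c] for p in probs]
        class_labels = [1 if y == c else 0 for y in labels]
        aucs.append(binary_auc(class_scores, class_labels))
    return sum(aucs) / n_classes


if __name__ == "__main__":
    # Toy per-frame probabilities for three hypothetical segments.
    frame_probs = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1],
                   [0.1, 0.2, 0.7], [0.6, 0.3, 0.1]]
    frame_labels = [0, 1, 2, 0]
    print(macro_ovr_auc(frame_probs, frame_labels, n_classes=3))
```

A macro average weights every segment equally regardless of how many frames it contains; a frame-weighted average would be another plausible reading of the reported figure.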

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3604_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3604_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Mob_Harnessing_MICCAI2024,
        author = { Mobadersany, Pooya and Parmar, Chaitanya and Damasceno, Pablo F. and Fadnavis, Shreyas and Chaitanya, Krishna and Li, Shilong and Schwab, Evan and Xiao, Jaclyn and Surace, Lindsey and Mansi, Tommaso and Cula, Gabriela Oana and Ghanem, Louis R. and Standish, Kristopher},
        title = { { Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript “Harnessing Temporal Information for Precise Frame-Level Predictions in Endoscopy Videos” presents a pipeline for performing frame-level classification in endoscopy videos using (i) foundation model for frame embedding, (ii) a data augmentation strategy tailored on endoscopic videos, (iii) a downstream transformer for classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Large dataset for model training and testing
    • Relevant topic
    • Easy to follow manuscript
    • Informative figures
    • Ablation study and statistical analysis are performed
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I do not see a strong methodological innovation in the three blocks of the proposed pipeline. The novelty behind the video-level augmenter is minor. The description of the transformer is also rather concise. It would be nice to understand how the three blocks of the proposed pipeline were adapted to the specific case study. I don’t see how the competitors were chosen. The discussion reads more like a summary of the qualitative results.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    More information on dataset and training settings should be provided. I also believe the description of the algorithms should be improved (especially the transformer).

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The main issue, in my opinion, is the average methodological innovation. I do not see how the foundation model and the transformer were tuned to the dataset and clinical problem under investigation. More space could be given to the method description, for example by condensing the conclusions. Figure 5 is rather qualitative; using quantitative metrics is suggested. A MICCAI paper would require a stronger discussion, highlighting the main benefits and weaknesses of the proposed strategy.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the dataset under investigation is large (and may pave the way for the translation of the developed methodology into clinical practice), the methodology itself is of average innovation. The description of the methods should be improved (e.g., more details on the choice of the architectures should be provided).

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper presents EndoFormer, a model for frame-level classification of anatomic sections that makes use of long-range temporal information in gastrointestinal endoscopy videos. The proposed approach combines three components. The first is a Foundation Model block (a ViT pretrained using the DINOv2 mechanism). The second, which is the main contribution (or at least the one with the biggest impact), is a module that makes well-informed video-level augmentations (inspired by domain-specific knowledge from such types of procedures). Finally, EndoFormer makes use of a downstream transformer for segment classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Sufficient introduction and motivation sections; the topic is an interesting area of research. The formulation seems adequate and the authors' main idea seems to be the video-level augmenter, which they show to work quite well using ablation studies and a statistical test.

    2. The proposal is described in great detail, with the training and configurations of the models and the ablation studies described thoroughly. The comparison with the state of the art is done carefully: the authors have re-implemented the compared methods to make the comparison as fair as possible.

    3. The foundations for the method are presented in great detail in a formalized manner, and the paper provides sufficient experiments to assess the validity of the proposed approach. But a few things could be better explained (see comment on question 7).

    4. The experimental design is good, showing a careful analysis to validate the proposal and several ablation studies to demonstrate how the various design choices affect the results on the selected datasets.

    5. The authors discuss the method's limitations and avenues for future research.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. One of the criticisms I can make of the paper is that the datasets are very poorly discussed. We don’t know what kind of procedure the authors are analyzing or even the classes.

    2. The video augmentations need to be explained a little better. Also, how costly are these augmentations in terms of training time?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors seem to be using private datasets. It could be interesting to see if the pre-trained model and some of the data is made available.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    I really liked this paper. It’s easy to read and ticks most of the boxes for this kind of paper, as I mentioned in my answer to the strengths question. It would be nice to see the t-SNE plots for some of the other compared methods and model configurations. Finally, it would be nice to discuss in which instances the current model fails and how to overcome such issues.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    For me, this paper is clearly written and organized, performs several experiments and ablation studies, and re-implements the compared methods for a fair comparison. The formulation is quite interesting as well.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper introduces EndoFormer, a frame-level classification model that leverages long-range temporal information for anatomic segment classification in gastrointestinal endoscopy videos. EndoFormer is trained on a larger dataset than previous works. The authors conducted extensive experiments on a dataset comprising 4,160 endoscopy videos from four clinical trials, encompassing over 61 million frames. These experiments demonstrate that EndoFormer improves upon state-of-the-art models for anatomic segment classification.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Integration of Methods on Large Dataset: The authors efficiently combined multiple approaches to train and test on the largest dataset in this field, achieving results that surpass existing models.
    2. Detailed Ablation Studies: The authors conducted thorough ablation experiments to validate the effectiveness of different model components for endoscopy video frame classification.
    3. Future Application Potential: The method shows significant potential for adaptation and relevance across various endoscopic tasks, suggesting its broad applicability in future research applications.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Testing on In-House Dataset Only: The authors tested their model only on in-house datasets. Extending testing to established external datasets like EndoMapper [1] would enhance the model’s credibility and generalizability.

    [1] Azagra, Pablo, et al. “Endomapper dataset of complete calibrated endoscopy procedures.” Scientific Data 10.1 (2023): 671.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Minor Concerns and Questions:

    1. There appears to be a discrepancy in the standard deviations reported in Table 1 and Table S1 compared to other figures. Please check for possible decimal placement errors.
    2. Could the Video-level Augmenter interfere with the classification at the boundaries between different anatomic segments?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a well-organized structure, strong motivation, and promising performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


