Abstract

Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. Thus, they struggle to simultaneously capture the short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address these issues, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame Encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. Furthermore, we employ a long-term Transformer encoder over the frame embeddings to further enhance long-term reasoning. MuST achieves higher performance than previous state-of-the-art methods on three different public benchmarks.
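
A minimal sketch, in Python, of the multi-scale sampling described above. This is not the authors' implementation; the function name, window size, and strides are illustrative assumptions only.

    # Illustrative sketch only: build sub-sequences at increasing strides around a
    # frame of interest (keyframe). Window size and strides are assumed values.
    def multi_scale_indices(keyframe_idx, num_frames, window=16, strides=(1, 2, 4, 8)):
        """Return one list of frame indices per temporal scale, each centered on the
        keyframe and clamped to the valid frame range."""
        scales = []
        for s in strides:
            half = (window // 2) * s
            idxs = range(keyframe_idx - half, keyframe_idx + half, s)
            scales.append([min(max(i, 0), num_frames - 1) for i in idxs])
        return scales

    # Example: four sub-sequences, all centered on frame 1000 of a 3000-frame video.
    for stride, idxs in zip((1, 2, 4, 8), multi_scale_indices(1000, 3000)):
        print(stride, idxs[0], idxs[-1], len(idxs))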

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3730_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3730_supp.pdf

Link to the Code Repository

https://github.com/BCV-Uniandes/MuST

Link to the Dataset(s)

https://github.com/BCV-Uniandes/GraSP https://www.synapse.org/Synapse:syn18824884/wiki/591922 https://www.synapse.org/Synapse:syn21776936/wiki/601700



BibTex

@InProceedings{Pér_MuST_MICCAI2024,
        author = { Pérez, Alejandra and Rodríguez, Santiago and Ayobi, Nicolás and Aparicio, Nicolás and Dessevres, Eugénie and Arbeláez, Pablo},
        title = { { MuST: Multi-Scale Transformers for Surgical Phase Recognition } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper proposes Multi-Scale Transformers for Surgical Phase Recognition (MuST), a method designed to enhance understanding of surgical procedures. It presents an architecture that performs well across datasets and includes qualitative examples. However, some aspects could be improved, such as the clarity of keyframe selection and the lack of testing on the Cholec80 dataset. Further exploration of the comparison to SKiT and clarification of processing times are recommended. Limited novelty is noted, as similar concepts have previously been explored in the literature.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper features a robust related work section that covers crucial developmental steps in the field, providing context for the proposed approach.
    • The architecture demonstrates strong results across multiple datasets, showcasing the effectiveness of the proposed method.
    • The inclusion of qualitative examples is helpful for understanding the practical application of the proposed Multi-Scale Transformers for Surgical Phase Recognition (MuST).
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The absence of testing on the Cholec80 dataset, which is a standard benchmark for surgical phase recognition, raises questions.
    • The concept of keyframes lacks clarity. It is unclear if every frame eventually becomes a keyframe or if they are sparsely selected.
    • A comparison to SKiT would be beneficial, as the concepts appear similar.
    • The use of a temporal backbone, including SlowFast, has been explored in prior literature and adds limited novelty.
    • The claim of robustness is not adequately supported by additional experiments, such as testing on artificial samples.
    • The discrepancy between offline (16 minutes) and online (100 seconds) processing times is not explained. A processing time of 8 minutes might be expected for offline, so the reason for the shorter online time is unclear.
    • The comparison to TeCNO is inaccurate; TeCNO actually performs hierarchical modeling, not sequential modeling in the temporal domain.
    • Limited novelty: The paper lacks significant novelty in its approach, as similar concepts and methods have been explored in prior literature.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Expanding on the comparison to SKiT would enhance the paper, as both approaches utilize keyframes to capture short- and mid/long-term temporality.
    • Additional experiments would be required to support the claim of robustness, and clarification of the processing-time discrepancy would increase the quality of the paper.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The concept of Multi-Scale Transformers for Surgical Phase Recognition (MuST) is promising and shows effective results. However, the absence of experiments on the Cholec80 dataset and the lack of clarity regarding keyframes are notable drawbacks. Including a comparison to SKiT and further exploring the similarities and differences with that approach would strengthen the paper.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    I want to clarify that I did not intend for the authors to run additional experiments, as this is not permitted during the rebuttal stage.

    Nonetheless, this lack of comparison remains a weakness in my view, which the authors cannot address in this rebuttal.

    The concept of keyframes was clarified, although I would not call it a keyframe, since every frame eventually becomes a keyframe and there is no distinct “key” property. I would prefer calling it a processing or central frame.

    The discrepancy between offline (16 minutes) and online (100 seconds) processing times was not clarified.

    While the authors have addressed some of my concerns, the evaluation remains insufficient.



Review #2

  • Please describe the contribution of the paper

    (1) The introduction of a multi-sequence pyramid and a Multi-Temporal Attention Module for hierarchical analysis of temporal windows in surgical videos, and (2) a temporal consistency module that uses Transformer attention to enhance consistency across wide temporal segments.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-organized and easy to follow.
    2. The motivation is straightforward: multi-scale information is important in surgical videos.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Multi-scale information and consistency have already been explored in SAHC, yet this paper does not compare its differences with SAHC.
    2. After reading the paper, I am still not clear why the second contribution is named the Temporal Consistency Module. What does ‘consistency’ reflect? It seems to simply use attention layers to capture relations, similar to the Multi-Temporal Attention Module.
    3. The experiments compare against only three state-of-the-art methods; more recent works are not compared.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Just to provide more explanation for their proposed two attention modules, and more comparison experiments. See Weakness.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    See Weakness

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The authors have acknowledged that they will include comparisons with related work in the final version. However, the current evaluation of the method is insufficient, and I suggest adding more datasets to verify its effectiveness. I am inclined to recommend a weak rejection.



Review #3

  • Please describe the contribution of the paper

    The paper presents a two-stage method consisting of a multi-term frame encoder with cross-attention and residual self-attention and temporal consistency module for surgical phase recognition.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Good literature review on phase recognition.
    • Good comparison with SOTA methods under both online and offline settings.
    • The proposed method is clear and concise.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • From Table 1, on the GraSP and MISAW datasets, the improvement of the proposed model over the second-best one is somewhat limited. Standard deviations and a hypothesis test need to be added to justify whether the improvement is significant.
    • The proposed method involves a two-stage training process and receives multiple sequences with different sampling rates as input. Although the performance is improved, there is a concern about computational complexity. It would be helpful if the authors could provide some analysis/discussion of computational complexity, such as FLOPs, training time, testing time, and model parameters.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    See above.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall the paper is well-designed and the method is kind of interesting. Further results need to be added to justify the model’s performance.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Overall, this paper is well written. Although the computational cost appears to be three times higher than that of Trans-SVNet, which could be a main weakness and might make the comparison seem unfair, I would recommend a weak acceptance due to the promising methodology and performance.




Author Feedback

We thank the reviewers for recognizing our model’s effectiveness across multiple datasets (R1, R4), our paper’s clarity and organization (R1, R3, R4), and our comprehensive literature review (R1, R4).

The reviewers have also raised some concerns that we will now address explicitly:

MuST on Cholec80 Benchmark (R1): We appreciate R1’s suggestion to add Cholec80, but as required by the program chairs, we cannot include new results during the rebuttal phase. Nevertheless, MuST’s performance is highly competitive on this benchmark, and we will include it in future work. Our paper validates MuST’s contributions on three datasets with high variability in surgical phase annotations and transitions. In the context of laparoscopic cholecystectomies, we chose HeiChole for its greater challenge compared to Cholec80, which has limited variability in its annotations.

Differences between SKiT & MuST (R1): Unlike MuST, SKiT (citation [17] in our paper) uses a ViT-based backbone that lacks temporal context, extracting features from individual frames. SKiT incorporates temporal information by building short- and long-term feature sequences, processed through transformer layers. In contrast, we sample 4 sub-sequences of a surgical video at increasing sampling rates to create a multi-scale pyramid, centering all sequences on the frame of interest (keyframe). This approach captures short-, mid-, and long-term information. Our video backbone extracts spatio-temporal features from each sub-sequence, followed by cross-attention between the 4 resulting embeddings. Whereas SKiT only has one feature per frame, we extract 4 distinct features. Finally, we merge them into a single, rich context embedding representing each keyframe.
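
As a rough illustration of the cross-attention fusion described in this response, consider the sketch below. It is not the released MuST code; the embedding dimension, number of heads, and the choice of the shortest-term embedding as the query are assumptions made for the example.

    # Assumed sketch of fusing the 4 per-scale embeddings with cross-attention.
    import torch
    import torch.nn as nn

    class MultiScaleFusion(nn.Module):
        def __init__(self, dim=512, heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, scale_feats):
            # scale_feats: (batch, num_scales, dim), one embedding per sub-sequence.
            query = scale_feats[:, :1, :]        # shortest-term scale as the query (assumption)
            fused, _ = self.attn(query, scale_feats, scale_feats)
            return self.norm(fused.squeeze(1))   # (batch, dim): one context embedding per keyframe

    feats = torch.randn(2, 4, 512)               # 4 per-scale embeddings for 2 keyframes
    print(MultiScaleFusion()(feats).shape)       # torch.Size([2, 512])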

Keyframe concept (R1): Our keyframe concept follows citation [26] in our paper, defining the keyframe as the main frame in an input time window for phase prediction (middle for offline, last for online). We use all frames sampled at 1 fps as keyframes.
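
A tiny snippet, reflecting our reading of this definition rather than official code, to make the offline/online distinction concrete:

    # Keyframe = middle frame of the window offline, last frame online (per the rebuttal).
    def keyframe_index(window_indices, online=False):
        return window_indices[-1] if online else window_indices[len(window_indices) // 2]

    print(keyframe_index(list(range(16))))               # offline -> 8
    print(keyframe_index(list(range(16)), online=True))  # online  -> 15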

Robustness statement (R1): Our robustness claim refers to the adaptability of our multi-scale modeling in recognizing phases of variable durations. We acknowledge R1’s concern about potential ambiguity and will clarify this in the final paper.

Technical Novelty (R1): Our technical contribution is not the use of a temporal backbone. Instead, our novelty lies in employing attention mechanisms to combine spatio-temporal features from a single backbone (unlike SlowFast) at 4 different frame rates, effectively capturing and relating information across multiple spatio-temporal scales.

Differences between SAHC & MuST (R3): Like SKiT, SAHC (citation [7] in our paper) uses a frame-wise backbone to extract spatial features without temporal context, then aggregates them into a feature sequence and applies 1D convolutions to create a temporal feature pyramid. In contrast, our method builds a temporal pyramid by sampling the input sequence at 4 different rates and feeding these into our video backbone, enriching both spatial and temporal contexts. We appreciate R3’s comments and will include this comparison in the final paper.

Temporal Consistency Module (R3): Our Temporal Consistency Module enforces consistency by capturing an even wider temporal context than our temporal pyramid, which yields finer and more consistent segment predictions.
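
A hedged sketch of what such a long-term encoder over keyframe embeddings could look like; the layer count, dimensions, and number of phase classes are assumptions, not the paper's configuration.

    # Assumed sketch: a Transformer encoder over a sequence of keyframe embeddings,
    # followed by a per-keyframe phase classifier.
    import torch
    import torch.nn as nn

    dim = 512
    layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
    temporal_encoder = nn.TransformerEncoder(layer, num_layers=2)
    classifier = nn.Linear(dim, 7)                    # e.g. 7 phases (dataset-dependent)

    keyframe_embeddings = torch.randn(1, 300, dim)    # 300 keyframes at 1 fps (~5 min of video)
    refined = temporal_encoder(keyframe_embeddings)   # wider temporal context across keyframes
    phase_logits = classifier(refined)                # per-keyframe phase predictions
    print(phase_logits.shape)                         # torch.Size([1, 300, 7])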

Experimental Results (R3): As stated in our paper, we compare against recent state-of-the-art methods that have publicly available code, which allows training and evaluation on other datasets. Although recent works such as LoViT and SKiT exist, they do not provide the public code necessary for comprehensive benchmarking.

Standard deviation (R4): We evaluate both GraSP and MISAW on their test sets, so there is no standard deviation to report. However, for the ablations, we perform cross-validation and report the standard deviations.

Computational complexity (R4): Our model has 64.9M parameters and 303.16G FLOPs. We will include this in the final paper.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Authors did not well address reviewers’ concerns; some critical issues are unclarified, especially insufficient evaluation

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    Authors did not well address reviewers’ concerns; some critical issues are unclarified, especially insufficient evaluation



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers have highlighted several key issues regarding the novelty, evaluation, and clarity, which can not be fully addressed by the rebuttal.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The reviewers have highlighted several key issues regarding the novelty, evaluation, and clarity, which can not be fully addressed by the rebuttal.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper received mixed reviews and the criticism relates to the insufficient evaluation of the approach on public data. This meta reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the paper includes methodological contributions that address interdependencies in surgical sequences and translational contributions to the field of surgical phase recognition. Validation is provided on multiple public benchmarks.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    The paper received mixed reviews and the criticism relates to the insufficient evaluation of the approach on public data. This meta reviewer argues that the paper makes a valuable contribution despite its limitations. In particular, the paper includes methodological contributions that address interdependencies in surgical sequences and translational contributions to the field of surgical phase recognition. Validation is provided on multiple public benchmarks.


