Abstract

Existing state-of-the-art methods for surgical phase recognition either rely on the extraction of spatial-temporal features at a short-range temporal resolution or adopt the sequential extraction of the spatial and temporal features across the entire temporal resolution. However, these methods have limitations in modeling spatial-temporal dependency and addressing spatial-temporal redundancy: 1) These methods fail to effectively model spatial-temporal dependency, due to the lack of long-range information or joint spatial-temporal modeling. 2) These methods utilize dense spatial features across the entire temporal resolution, resulting in significant spatial-temporal redundancy. In this paper, we propose the Surgical Transformer (Surgformer) to address the issues of spatial-temporal modeling and redundancy in an end-to-end manner, which employs divided spatial-temporal attention and takes a limited set of sparse frames as input. Moreover, we propose a novel Hierarchical Temporal Attention (HTA) to capture both global and local information within varied temporal resolutions from a target frame-centric perspective. Distinct from conventional temporal attention that primarily emphasizes dense long-range similarity, HTA not only captures long-term information but also considers local latent consistency among informative frames. HTA then employs pyramid feature aggregation to effectively utilize temporal information across diverse temporal resolutions, thereby enhancing the overall temporal representation. Extensive experiments on two challenging benchmark datasets verify that our proposed Surgformer performs favorably against the state-of-the-art methods. The code is released at https://github.com/isyangshu/Surgformer.
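
The target frame-centric idea behind HTA can be illustrated with a small sketch. It is not the authors' implementation. It assumes the target frame is the last frame of the sparse sequence, that the "varied temporal resolutions" are nested windows ending at the target frame, and that the pyramid aggregation is a plain average. The function names and window sizes are hypothetical.

```python
import math


def attend(query, keys):
    """Scaled dot-product attention of one query over a list of key vectors.

    Keys double as values, so the result is a weighted average of `keys`.
    """
    dim = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in keys]
    peak = max(scores)  # subtract the maximum for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * key[i] for w, key in zip(weights, keys)) / total for i in range(dim)]


def hierarchical_temporal_attention(frames, window_sizes=(16, 8, 4)):
    """Attend from the target (last) frame over nested windows, then average.

    Assumption: each window holds the most recent `size` frames, so larger
    windows carry long-range context and smaller ones carry local context.
    """
    target = frames[-1]
    outputs = [attend(target, frames[-size:]) for size in window_sizes]
    dim = len(target)
    # Assumption: pyramid aggregation is approximated by a simple mean.
    return [sum(out[i] for out in outputs) / len(outputs) for i in range(dim)]


# Toy input: 16 sparse frames, each a 2-dimensional feature vector.
frames = [[float(t), 1.0] for t in range(16)]
fused = hierarchical_temporal_attention(frames)
print(len(fused))
```

The largest window contributes long-range context and the smallest one local context. The model described in the paper learns its attention projections and aggregation, which this sketch omits.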

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1220_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1220_supp.pdf

Link to the Code Repository

https://github.com/isyangshu/Surgformer

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yan_Surgformer_MICCAI2024,
        author = { Yang, Shu and Luo, Luyang and Wang, Qiong and Chen, Hao},
        title = { { Surgformer: Surgical Transformer with Hierarchical Temporal Attention for Surgical Phase Recognition } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes a new method named Surgformer for surgical phase recognition. Specifically, it first designs a hierarchical temporal attention mechanism to model the multi-range temporal relationships. After that, divided spatial-temporal attention is utilized to capture the spatial-temporal characteristics of videos. Extensive experiments demonstrate the superior performance of this method.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The experimental results achieve state-of-the-art (SOTA) performance.
    2. The HTA module is proposed to address the challenges of capturing global and local information within varied temporal resolutions.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The method adopts a decrease in the input frame to address the challenges of temporal redundancy. However, this down-sample operation has been a common pre-process in video recognition for many years. This method may need more novelty.
    2. Surgformer applies hierarchical temporal attention and aggregated spatial attention sequentially. However, the authors need to demonstrate the effectiveness of this order through experiments. The authors should evaluate the order of temporal and spatial attention.
    3. The paper needs to analyze the hierarchical temporal attention component in the ablation study. Meanwhile, the paper does not describe the differences between the Surgformer and TimeSformer.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please see the weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The method lacks novelty; some operations and components have been demonstrated effectively in the video recognition community, but this method applies them to surgical phase recognition.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper proposes a surgical phase recognition model, Surgformer, which combines spatial-temporal attention and pyramid feature aggregation. The proposed method uses a Hierarchical Temporal Attention module to capture both global and local information with a limited set of sparse frames as input, and an Aggregated Spatial Attention module to learn the relative weight of selected frames. Superior or comparable results are achieved on two public benchmarks, Cholec80 and AutoLaparo.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Hierarchical Temporal Attention focuses on a classic question in surgical phase recognition. For a centered frame, the balance between long-term and short-term information is challenging and delicate. The authors capture the relationship explicitly within varied temporal resolutions by considering local latent consistency among informative frames, and aggregating features across diverse temporal resolutions hierarchically.
    2) The design of spatially enhanced tokens is interesting. Most prior works only use class information in supervision. Introducing class tokens as input to the decoder for prediction is worth exploring.
    3) The experiments are rather sufficient. Good experimental results are achieved on the datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The technical novelty of integrating long-term and short-term spatial-temporal information by a hierarchical structure is incremental. For example, the paper “Prompt-enhanced hierarchical transformer elevating cardiopulmonary resuscitation instruction via temporal action segmentation” adopts a similar idea but is not cited as reference. 2) Several details need to be declared or added.

    • In Table 2, the method of AVT is compared under unrelaxed evaluation but not under relaxed evaluation, which is confusing.
    • For visualization results in Figure 4, no comparisons of other methods or analysis of the result is provided.
    • For an end-to-end model with attention, the complexity and inference time can be added to better evaluate the model.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1) I would suggest the authors research more about the literature, especially papers about long-term and short-term information and the idea of hierarchical aggregation, to further refine the reference. 2) Some details can be declared and the analysis can be added.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The effectiveness and overall framework are fine and above average, while some details need further declaration. Also, the availability of code is unknown.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes Hierarchical Temporal Attention as an effective approach to model the spatio-temporal relations in surgical videos. The proposed Surgformer is validated on 2 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Nice visuals explaining the concept of the method very well.
    2. Validated on 2 datasets.
    3. Thorough ablation study.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The paper explicitly claims two shortcomings of the SOTA, namely a lack of long-range information and significant redundancy. But neither claim is explicitly validated by empirical evidence. All we get from Tables 2 and 3 is somewhat marginal improvements on overall metrics. This improvement is neither statistically significant, nor does it address the claims about the SOTA made in the abstract and introduction.
    2. I feel the font size in all 3 tables is excessively small. Is it because of some manipulation? If yes, then you have violated the MICCAI instructions.
  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    For the AutoLaparo dataset, the authors should provide the exact video IDs of their training, validation, and testing splits.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The way acronyms are defined differs from how they appear in the text and tables, which makes the connection hard to follow. For example, (T=16 and R=4) in Implementation Details or 8X4 in Table 1.
    2. It’s a bit unusual to have the ablation table before the comparison. Also, don’t use red to highlight improvements.
    3. For a method-driven paper, the method seems a bit of a haphazard mishmash of ideas without a clear connection. Maybe the authors can rewrite it to connect the technical contribution with the exact issue of surgical phase recognition.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Despite this being an interesting idea, I think the minimal improvements (often less than 1%), along with potential font size manipulation, are a significant issue. If the authors can address these in the rebuttal, I’ll be happy to reconsider my rating.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #4

  • Please describe the contribution of the paper

    This paper proposes a method for surgical phase recognition by aggregating local spatial information and long term spatial-temporal information based on transformers.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The study proposes a novel idea for combining information extracted from the target frame neighboring frames and long-term global information via an attention mechanism. The assumptions and choices in the model design are supported with improved results.
    2. Sufficient evaluation is made by using four common evaluation metrics (accuracy, precision, recall, Jaccard) and two different datasets. The presented results are consistent with the assumptions of the authors. Performance comparisons with state-of-the-art models are given.
    3. Ablation study tests the effectiveness of two important model components, namely HTA and KCA.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The introduction and related work sections could be more informative.

    2. In Equation 1, X_t is a function of X^cls; however, in Section 2.2 X_temporal is obtained from X directly. Similarly, in Equation 2, X_st is a function of X_t (X_temporal), but in Section 2.3 X_s (X_spatial) is computed from X. Figure 3 also follows the descriptions in the sections.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors provided a detailed description of the model and values of important parameters, however, due to the complexity of the model, the same results might not be reproducible.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. In general, the language could be improved/simplified for better reading. For example, the sentence ‘sparsification strategy with a frame rate R, which means sampling one frame from every R frames.’ is more difficult to follow than simply using the term downsampling.

    2. Limitations of the proposed method are not given. A short discussion could give better insight into the capacity of the network to the reader. For example, the number of GPUs and frames hints at a large memory/computation usage.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors made reasonable claims about short and long-term spatial features and justified their claims with improved results. The evaluation is performed in two datasets and results are compared with state-of-the-art models. Used metrics and their computation are in line with the mentioned models.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We would like to thank the reviewers for their thorough review of our work. We are delighted that the reviewers found our idea novel and interesting (R #4,7,8), our experiments sufficient (R #4,7,8), our ablation study thorough (R #4,7,8), and our experimental results convincing (R #4,5,7). The related code will be made available.

To Review #4 (1) We will add more content about the two distinct paradigms and SOTA methods in the Introduction. (2) In Eq.(1), we use X_cls as the input of HTA only for concise notation. We do not use the class token in HTA; we use it only in Spatial Attention. (3) We will improve/simplify our language for better readability, and add limitations in the Conclusion.

To Review #5 (1) Although decreasing the number of input frames to address temporal redundancy is a mature technique, our main focus is on designing components tailored to sparse frame sequences, specifically the proposed HTA and KCA. (2) TimeSformer has already investigated the order of temporal and spatial attention, and we follow the same paradigm. (3) In Tab.1, we implement a straightforward baseline by employing TimeSformer, termed “Baseline w/ MA”, and integrate our proposed HTA into the baseline, termed “Surgformer w/ MA”. The overall performance of the baseline degrades as the sequence length increases, while the HTA variant improves significantly with increased length, which demonstrates the effectiveness of HTA in learning more discriminative features. TimeSformer takes short video clips as input to learn global temporal features and predicts an action category for each clip, whereas surgical phase recognition requires fine-grained frame-level analysis. Given the temporal ambiguities between different phases in surgical videos, Surgformer incorporates both long-term and short-term temporal information to recognize these phases accurately.

To Review #7 (1) We will add more literature, especially papers about long-term and short-term information and hierarchical aggregation. We will cite and discuss the paper “Prompt-enhanced Hierarchical Transformer … via Temporal Action Segmentation”. It employs dilated temporal convolution layers to obtain hierarchical temporal information, while we propose HTA to capture both long-term and short-term information within varied temporal resolutions from a target frame-centric perspective. (2) All the comparison results in Tab.2 are taken from SKiT for fair comparison, and SKiT only provides the unrelaxed results of AVT. For Fig.4, we will add an analysis of the provided results. We will also add a discussion of complexity and inference time.

To Review #8 (1) In the Appendix, we release the results obtained from training the model with diverse frame rates while maintaining a fixed length of 16. As the frame rate increases, the sampled sequence covers a larger receptive field to capture long-range information, and the similarity between adjacent frames gradually decreases, which leads to a substantial performance gain in the initial stages. Similarly, we fix the frame rate at 4 and vary the sequence length to obtain long-range information, which also leads to a substantial performance gain in the initial stages. We will add a more detailed analysis. (2) We will fix content that may be confusing, such as mismatches between how acronyms are defined and how they appear in the text and tables. We will also move the ablation table after the comparison table and remove the red highlighting in the ablation table. We will revise the methodology section with more explanation and analysis for better readability. (3) For AutoLaparo, we use the official splits (https://autolaparo.github.io/). Compared with Cholec80, AutoLaparo is more difficult and requires stronger spatial-temporal information. Surgformer outperforms the best-performing method, SKiT, with gains of 2.8% and 6.8% in terms of Acc and Jac. (4) We apologize for the confusion about the font size; we will make sure the final version meets the requirements.
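
The relationship between frame rate, sequence length, and receptive field discussed above can be made concrete with a short sketch. It assumes the sampled sequence ends at the target frame and that indices before the start of the video are clamped to 0. Both are illustrative assumptions and the function names are hypothetical, not taken from the released code.

```python
def sample_sparse_indices(target, length=16, rate=4):
    """Return `length` frame indices ending at `target`, spaced `rate` apart.

    Assumption: indices that would fall before the start of the video
    are clamped to frame 0.
    """
    if length < 1 or rate < 1:
        raise ValueError("length and rate must be positive")
    offsets = range(length - 1, -1, -1)  # length-1, ..., 1, 0
    return [max(target - k * rate, 0) for k in offsets]


def temporal_span(length=16, rate=4):
    """Number of original frames covered by the sampled sequence."""
    return (length - 1) * rate + 1


indices = sample_sparse_indices(target=100, length=16, rate=4)
print(indices[0], indices[-1])  # 40 100
print(temporal_span(16, 4))     # 61
print(temporal_span(16, 8))     # 121
```

With the length fixed at 16, doubling the frame rate from 4 to 8 roughly doubles the number of original frames the sequence spans, which matches the larger receptive field described in the rebuttal.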




Meta-Review

Meta-review not available, early accepted paper.


