Abstract

Automatic surgical phase recognition plays an essential role in developing advanced, context-aware, computer-assisted intervention systems. Knowledge distillation is an effective framework to transfer knowledge from a teacher network to a student network, and it has been used to solve the challenging surgical phase recognition task. A key to successful knowledge distillation is to learn a better teacher network. To this end, we propose a novel label-guided teacher network for knowledge distillation. Specifically, our teacher network takes both video frames and ground-truth labels as input. Instead of only using labels to supervise the final predictions, we additionally introduce two types of label guidance to learn a better teacher: 1) we propose label embedding-frame feature cross-attention transformer blocks for feature enhancement; and 2) we propose to use label information to sample positive features (from the same phase) and negative features (from different phases) in a supervised contrastive learning framework to learn better feature embeddings. Then, by maximizing feature similarity, the knowledge learnt by our teacher network is effectively distilled into a student network. At the inference stage, the distilled student network can perform accurate surgical phase recognition taking only video frames as input. Comprehensive experiments are conducted on two laparoscopic cholecystectomy video datasets to validate the proposed method, offering an accuracy of 93.3±5.8% on the Cholec80 dataset and an accuracy of 91.6±9.1% on the M2cai16 dataset.
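The two label-guided ingredients described in the abstract — a supervised contrastive loss that uses phase labels to pick positives/negatives, and a distillation loss that aligns student features with teacher features — can be sketched as follows. This is an illustrative, stdlib-only sketch under assumed conventions (SupCon-style formulation, cosine-similarity distillation); it is not the authors' implementation, and the function names and toy setup are assumptions.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    n = math.sqrt(_dot(v, v)) or 1.0
    return [x / n for x in v]

def supervised_contrastive_loss(features, labels, tau=0.07):
    """SupCon-style loss: for each anchor, pull features with the same
    phase label together and push features from other phases apart."""
    feats = [_normalize(f) for f in features]
    n = len(feats)
    total, count = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Denominator sums similarities to all other samples in the batch.
        denom = sum(math.exp(_dot(feats[i], feats[k]) / tau)
                    for k in range(n) if k != i)
        for j in positives:
            total += -math.log(math.exp(_dot(feats[i], feats[j]) / tau) / denom)
            count += 1
    return total / max(count, 1)

def feature_distillation_loss(student_feats, teacher_feats):
    """Distill by maximizing cosine similarity between student and teacher
    features, i.e. minimizing 1 - cos(s, t) per frame."""
    loss = 0.0
    for s, t in zip(student_feats, teacher_feats):
        loss += 1.0 - _dot(_normalize(s), _normalize(t))
    return loss / len(student_feats)
```

On a toy batch, a feature space where same-phase features cluster yields a lower contrastive loss than one where they do not, and the distillation loss vanishes when student and teacher features are collinear.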

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1997_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gua_Labelguided_MICCAI2024,
        author = { Guan, Jiale and Zou, Xiaoyang and Tao, Rong and Zheng, Guoyan},
        title = { { Label-guided Teacher for Surgical Phase Recognition via Knowledge Distillation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce a novel approach to knowledge distillation from a teacher to a student model, employing contrastive learning in the feature space. They utilize the SwinV2 model for extracting image/video features, achieving surprisingly strong performance. Additionally, integrating knowledge distillation and contrastive learning further enhances the model’s performance by 3%.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The concept of optimizing embeddings with supervised contrastive learning (SCL) presents an interesting avenue for shaping the feature space.
    • The ablation analysis is presented clearly, aiding in understanding the impact of different components on model performance.
    • Results indicate improvements in performance, suggesting the potential effectiveness of the proposed approach in enhancing surgical phase recognition accuracy.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The necessity of knowledge distillation is questionable given that ground-truth labels are absent at inference, since supervised learning can already infer information where labels are unknown. Related work on knowledge distillation for surgical phase recognition is missing, and the significant performance gap between Swin v1 and Swin v2 models raises concerns. Global ambiguities in surgical frames are challenging, and knowledge distillation may be less suited to resolving them than temporal modeling. The performance of the teacher and student models should be compared, the consistent usage of SCL in the student model should be clarified, and it should be stated whether the student model also uses a cross-entropy loss.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • “Considering that ground-truth labels are not available at inference stage, we propose a knowledge distillation framework to transfer the label-guided knowledge learnt by the teacher network to a student network without needing any label information.” Supervised learning is always label-guided and can be used to infer information where labels are unknown, so the motivation of the knowledge distillation is questionable. As soon as a label is used as input, it becomes a problem for inference, but that is not the case in this paradigm. So, for me, the question remains: what is actually distilled? Using the teacher in the inference step should be absolutely possible, so there is no need for distillation.
    • Related work on knowledge distillation for surgical phase recognition is missing (e.g., https://arxiv.org/abs/1812.00033 and work on federated learning).
    • Other models use Swin v1 with an accuracy of 82.16% (https://arxiv.org/pdf/2203.09230.pdf), whereas this study uses Swin v2 and achieves an accuracy of 90.9%. This is a significant difference, and it is unclear whether it is plausible given that the Swin v2 authors report improvements of up to 1.5% accuracy on ImageNet. The results seem too good. It would be important to understand the metrics, e.g., averaging over all frames versus averaging over all frames of one video and then over all videos.
    • A black frame can appear in almost all phases of the surgery, but it contains no visible differences to another black frame. These global ambiguities are hard to untangle and should be addressed with long temporal models. This is also a problem for contrastive learning, because visually similar frames belong to different classes. “As shown in Fig. 3, it is obvious that the baseline model generates mis-recognized predictions while our method recognizes these challenging frames correctly, showing the significance of the label-guided teacher network with knowledge distillation.” I think this conclusion is inaccurate: to resolve this kind of problem, the neighbourhood of a frame has to be considered, so temporality is the key, not knowledge distillation.
    • What is the performance of the teacher? Usually the best-performing model is the teacher, not the student; this should be shown.
    • SCL brings marginal improvements, but there is no obvious reason why SCL shouldn’t always be used in the student.
    • Clarify whether the student also uses a cross-entropy loss.

    Suggestions for improvement:

    • Clarify whether VFE and LFCT/FST are trained together or through two-stage training, as this is currently unclear.
    • Ensure consistency in the use of abbreviations like SCL, LFCT, FST, and VFE to improve readability.
    • Specify the use of SCL in the baseline model more clearly, and explain why it could not be used at inference.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the concepts of knowledge distillation and contrastive learning are not new to surgical workflow recognition, the paper raises questions about what knowledge is distilled and why a teacher-student differentiation is necessary. An alternative approach of pretraining with contrastive learning and fine-tuning with cross-entropy is suggested, eliminating the need for a complex teacher-student architecture.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The clarification of the distillation step helped me better understand that part of the work, which was my primary concern. I appreciate the explanation that Swin V2 also uses the temporal model, as well as the insights into the teacher model’s performance.

    My main concerns were effectively addressed in the rebuttal. While I still have minor reservations, I find the overall concept of the work interesting.



Review #2

  • Please describe the contribution of the paper

    This study uses knowledge distillation, contrastive learning, and label-guidance techniques with attention-based networks to solve the surgical phase recognition problem on the Cholec80 and M2cai16 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The authors proposed a novel approach by combining teacher-student networks to distill knowledge with supervised contrastive learning for surgical phase recognition. They reported promising results with their method.

    2. The authors tested their model on two publicly available datasets and used sufficient evaluation metrics. The reported results are consistent with their claims.

    3. Authors gave detailed explanations of their proposed model and background information.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The reference works in the introduction could be more informative and extensive. Similarly, some of the reference works used for comparison in Table 1 and Figure 2 no longer represent the state of the art. For example, EndoNet (2016), PhaseNet (2016), SV-RCNet (2017), and OHFM (2019) are fairly simple models compared to recent methods in SPR. The transformer-based Opera or SKiT (which reported slightly better results), and especially the referenced similar work ARST, could be added to the comparison instead.

    2. In the ablation study, the effectiveness of the LFCT block is investigated. However, when tested without LFCT block, the entire teacher model and knowledge distillation are excluded. I am wondering if replacing attention-based LFCT with a simpler block, e.g. a convolution or linear layer-based block, wouldn’t emphasize the effectiveness of the design choices in LFCT?

    3. In Figure 3 qualitative results are given with challenging frames in which the proposed method correctly estimates phases while the baseline makes wrong estimations. Although the performance of the proposed model is better than the baseline in Table 2, the figure itself might not be sufficient to claim that the proposed model can consistently classify challenging frames correctly.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Authors can perform better literature review for the comparison of their method and introduction section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors proposed a novel method to solve the SPR task and supported their claims with improved results. They tested their approach with two common public datasets and provided sufficient tests/metrics.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    The authors have adequately addressed my comments regarding the comparison with state-of-the-art methods, clarifications, and the efficacy of their approach. Furthermore, they have committed to sharing their source code.



Review #3

  • Please describe the contribution of the paper
    • They propose a knowledge distillation framework for surgical phase recognition, including a novel label-guided teacher network and a student network.
    • They introduce a label embedding-frame feature cross-attention transformer and a supervised contrastive learning framework to learn a better label-guided teacher network, which effectively improves the correlation between feature embeddings and ground-truth labels.
    • Extensive experiments are conducted on two publicly available video datasets to validate the effectiveness of their method.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • They propose a novel label-guided teacher network within a knowledge distillation framework for surgical phase recognition.
    • They introduce a label embedding-frame feature cross-attention transformer and a supervised contrastive learning framework to learn a better label-guided teacher network, which effectively improves the correlation between feature embeddings and ground-truth labels.
    • Experiments are conducted on two publicly available video datasets (Cholec80 and M2cai16). The proposed method shows the best performance in 7 out of 8 metrics compared to existing methods.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Some descriptions are unclear (see the comments section for details).

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Some descriptions are unclear as follows.

    (1) In Section 2.1, it says “We first trained a SwinV2-B as the VFE to extract D dimensional frame-wise visual features from all video frames for further temporal modeling”. What ground-truth labels do the authors use for this training?

    (2) Section 2.3 describes the training procedure. Which is the correct procedure? (a) VFE 50 epochs -> Teacher 50 epochs -> Student 50 epochs. (b) VFE 50 epochs -> (Teacher -> Student) 50 epochs.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main factor is the novelty of the proposed method. They propose a novel label-guided teacher network within a knowledge distillation framework for surgical phase recognition. And they introduce a label embedding-frame feature cross-attention transformer and a supervised contrastive learning framework to learn a better label-guided teacher network, which effectively improves the correlation between feature embeddings and ground-truth labels.

    The other factor is that the proposed method shows the best performance in 7 out of 8 metrics compared to existing methods, which are evaluated using two public datasets.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors’ response to the main concern is appropriate.




Author Feedback

We thank all reviewers for their comments.

R3 motivation of knowledge distillation (KD). What is distilled? It was pointed out in [14] that most existing methods used a frame-level loss which did not fully leverage the underlying semantic information and dependency in the output space, leading to suboptimal performance. We argue that the ground-truth (GT) labels can not only be used to supervise network training but also provide semantic structure hints for the intermediate layers. Thus, we designed a new KD mechanism which exploits label structures when training the phase recognition network. This was done by first measuring the correlation between a sequence of lower-level frame features and that of higher-level label semantics among phases, and then by temporally reassembling the informative label embeddings for dynamic adaptation. It turns out that the student in such a KD mechanism learns significantly better than when learning alone in a conventional supervised learning scenario, as shown by our ablation study results in Table 2. By designing a KD training procedure to transfer structured label information, we are able to capture the underlying semantic information of a surgery and, as a result, boost performance at the test stage.

R3 temporal modeling (TM) in our method and clarification of performance of SwinV2. We would like to point out that the KD mechanism in our method was implemented on top of TM. In both teacher and student networks, the key element to achieve TM is our self-attention layer (SAL) as shown in Fig. 1-(e). The input to SAL is a sequence of frame features or label embeddings. SAL is designed to aggregate a certain range of temporal information for feature enhancement. In our paper, we empirically chose the range to be 500 frames. Please note that our baseline model also included TM. The accuracy (acc) of 90.9% was not from SwinV2 but rather from our baseline model (SwinV2 + TM). Acc is computed at video level, while other metrics are at phase level.

R4 comparison with state-of-the-art (SOTA) methods. For fairness, we used the official data split and the relaxed metrics proposed in the M2cai challenge to compare with SOTA methods. However, not all SOTA methods use the official data split or metrics. For example, Opera used a different data split, while ARST did not use relaxed metrics. This is why we did not compare with Opera and ARST. Our method achieved an accuracy equivalent to SKiT (93.4% for SKiT vs. 93.3% for ours).

R1&R3 clarification on training procedure. Our method is trained in 3 stages. First, we train the VFE for 100 epochs with the L_ce loss, taking phase annotations as GT labels. Then, we train the teacher network for another 50 epochs with L_all (Eq. (3)). Finally, we further train the student network for 50 epochs with only L_dis (Eq. (4)).
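The three-stage schedule above can be sketched as a simple epoch-to-stage mapping. This is a hypothetical illustration of the schedule as described in the rebuttal, not the authors' code; the function and loss names are assumptions.

```python
# Hypothetical sketch of the three-stage schedule from the rebuttal:
# stage name, number of epochs, losses optimized in that stage.
STAGES = [
    ("VFE", 100, ["L_ce"]),      # visual feature extractor, cross-entropy on phase labels
    ("teacher", 50, ["L_all"]),  # teacher network, combined loss of Eq. (3)
    ("student", 50, ["L_dis"]),  # student network, distillation loss of Eq. (4) only
]

def active_losses(epoch):
    """Map a global epoch index to (stage name, losses optimized)."""
    start = 0
    for name, n_epochs, losses in STAGES:
        if epoch < start + n_epochs:
            return name, losses
        start += n_epochs
    raise ValueError("epoch beyond schedule")
```

For instance, epoch 0 falls in the VFE stage, epoch 100 starts teacher training, and epochs 150-199 train only the student with the distillation loss.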

R1, R3, R4 reproducibility We will release our PyTorch source code.

R3&R4 related works Due to page limitation, we focus on more relevant related works. We will cite the two papers suggested by R3&R4.

R3 why not use SCL in the student model? We tried this option, but it did not generate satisfactory results. By incorporating SCL in the student model at the third-stage training, we obtained an acc of 92.4%. In contrast, our method achieved an acc of 93.3%. We argue that adding SCL may lead to a trade-off between SCL and KD, resulting in sub-optimal results.

R3 performance of teacher model By taking GT labels as input to LFCT blocks, the teacher model obtained an acc of 97.8% at test stage.

R4 efficacy of our method We agree with the reviewer that Fig. 3 itself is not enough. Additional results in Fig. 2 show a consistent improvement of our method over baseline when evaluated on complete videos.

R4 designing LFCT with a simpler block. As explained above, the attention mechanism in LFCT was designed to explore the correlation between lower-level frame features and higher-level label semantics among phases. With a convolution or linear layer-based block, this may not be achievable.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


