Abstract

Surgical workflow analysis poses significant challenges due to complex imaging conditions, annotation ambiguities, and the large number of classes in tasks such as action recognition. Self-distillation (SD) has emerged as a promising technique to address these challenges by leveraging soft labels, but little is known about how to optimize the quality of these labels for surgical scene analysis. In this work, we thoroughly investigate this issue. First, we show that the quality of soft labels is highly sensitive to several design choices and that relying on a single top-performing teacher selected based on validation performance often leads to suboptimal results. Second, as a key technical innovation, we introduce a multi-teacher distillation strategy that ensembles checkpoints across seeds and epochs within a training phase where soft labels maintain an optimal balance: neither underconfident nor overconfident. By ensembling at the teacher level rather than the student level, our approach reduces computational overhead during inference. Finally, we validate our approach on three benchmark datasets, where it demonstrates consistent improvements over existing SD methods. Notably, our method sets a new state-of-the-art (SOTA) performance on the CholecTriplet benchmark, achieving a mean Average Precision (mAP) of 43.1% with real-time inference, thereby establishing a new standard for surgical video analysis in challenging and ambiguous environments. Code is available at https://github.com/IMSY-DKFZ/self-distilled-swin.
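A minimal sketch of the core mechanism described above: averaging the soft labels of several teacher checkpoints (e.g., drawn from different seeds and epochs) and blending them with the ground-truth targets in the student loss. This is an illustration, not the authors' released implementation; the sigmoid/BCE formulation reflects the multi-label nature of triplet recognition, and the weighting factor alpha is an assumption.

```python
import torch
import torch.nn.functional as F

def ensemble_soft_labels(teachers, images):
    """Average the sigmoid outputs of several teacher checkpoints.
    Triplet recognition is multi-label, hence per-class sigmoid
    rather than a softmax over classes."""
    with torch.no_grad():
        probs = torch.stack([torch.sigmoid(t(images)) for t in teachers])
    return probs.mean(dim=0)  # shape: (batch, num_classes)

def student_loss(student_logits, hard_targets, soft_targets, alpha=0.5):
    """Blend BCE on ground-truth labels with BCE on the ensembled soft
    labels; alpha is an assumed weighting, not a value from the paper."""
    hard = F.binary_cross_entropy_with_logits(student_logits, hard_targets)
    soft = F.binary_cross_entropy_with_logits(student_logits, soft_targets)
    return alpha * hard + (1.0 - alpha) * soft
```

Because the averaging happens at the teacher level, the student deployed at inference time remains a single network, which is what keeps inference real-time.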

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1323_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/IMSY-DKFZ/self-distilled-swin

Link to the Dataset(s)

CholecT50: https://github.com/CAMMA-public/cholect45
SARAS-ESAD: https://saras-esad.grand-challenge.org/
HeicholeActivity: https://link.springer.com/article/10.1007/s00464-024-10958-w

BibTex

@InProceedings{YamAmi_Smarter_MICCAI2025,
        author = { Yamlahi, Amine and Kalinowski, Piotr and Godau, Patrick and Younis, Rayan and Wagner, Martin and Müller, Beat and Maier-Hein, Lena},
        title = { { Smarter Self-Distillation: Optimizing the Teacher for Surgical Video Applications } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {524 -- 533}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces an ensemble approach for self-distillation, where the teachers of the ensemble are selected from checkpoints in an intermediate training state (not too early, to avoid high uncertainty, and not too late in the training process, to avoid overconfidence). The soft labels of the teachers are combined to generate the training label for the student model. The optimal point for the teachers in the ensemble is selected considering the student’s performance in cross-validation.
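    A hedged sketch of the selection loop described above, with hypothetical helpers: for each candidate teacher checkpoint, a student is distilled and scored by cross-validation, and the checkpoint yielding the best student is kept.

```python
def select_teacher_checkpoint(checkpoints, distill_student, cross_val_score):
    """Pick the teacher checkpoint whose distilled student scores best in
    cross-validation. Both arguments are hypothetical helpers: one trains
    a student from a checkpoint's soft labels, the other returns e.g. mAP."""
    scores = {ckpt: cross_val_score(distill_student(teacher=ckpt))
              for ckpt in checkpoints}
    return max(scores, key=scores.get)
```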

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper shows the effects that initialization, seeds, and hardware can have on the student model’s performance during a self-distillation approach. While this effect has been observed previously in supervised settings, the paper shows that it can affect the selection of the teacher model for self-distillation.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Some of the weaknesses of the work can be in the novelty of the presented work, particularly:

    • I understand that the model employs an ensemble of multiple teachers trained in a separate stage, and then a second stage uses the outputs of these teachers to train the student model using the aggregation of the teachers’ labels. This method is an ensemble approach that is employed to train a student model, and work outside the medical domain, such as [1], seems to employ similar ideas (having an ensemble of the same architecture, with variations in the training process) in a knowledge distillation approach. I would recommend reviewing the contribution in the context of this (and other similar) work.

    • Similarly, to my understanding, a self-distillation method employs the exact same architecture in the teacher and the student. However, in the case of the proposed method, even though each model of the ensemble follows the same architecture as the student, the final, simpler single-model student is trained with a more complex multi-model ensemble. Would this be more in the domain of knowledge distillation rather than self-distillation?

    • The current optimal teacher search requires training the student under multiple versions or checkpoints of the teacher (as the teacher trains) to select the teacher that leads to the best-performing student. How much is the training cost (training time increase to get the final model) compared with the performance gains with respect to a traditional self-distillation approach?

    [1] Morocutti, Tobias, et al. “Creating a Good Teacher for Knowledge Distillation in Acoustic Scene Classification.” Detection and Classification of Acoustic Scenes and Events 2023.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Similar ideas might be already considered in the knowledge distillation domain that employ an ensemble of models as teachers (first point in weaknesses).

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #2

  • Please describe the contribution of the paper

    The authors propose a novel strategy for surgical action triplet recognition, leveraging a new self-distillation (SD) framework. The key innovation lies in the selection of soft labels: the authors argue that the quality of soft labels in SD is highly influenced by the choice of the teacher model and several design factors, including the training epoch at which the teacher is selected, as well as random seeds and hardware variations. They introduce three main contributions:
    • Teacher Selection Strategy: they propose selecting teacher checkpoints from a “practical region,” defined as the point during training where the student (rather than the teacher) achieves the highest cross-validation performance.
    • Multi-Teacher Strategy: they introduce a framework where multiple teachers guide a single student. This allows the aggregation of soft labels from teachers trained under diverse conditions, thus mitigating the sensitivity of SD to specific design choices.
    • Temporal Decoder: they incorporate a temporal decoder to enrich the best-performing student model with temporal context.
    The paper concludes with a comprehensive ablation study that demonstrates the superior performance of the proposed approach on the surgical action triplet recognition task.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The major strength of the paper lies in its novel methodological approach to selecting soft labels for self-distillation. The authors demonstrate that choosing the teacher model with the best validation performance often results in suboptimal soft labels for the distillation process. To address this, they propose selecting teacher checkpoints from a region where the corresponding student achieves the highest performance, rather than relying solely on teacher metrics. This first step is complemented by a second strategy: combining multiple teachers—each trained under different conditions—to aggregate diverse soft labels for a single student, thereby enhancing robustness and improving overall performance. The proposed approach is particularly compelling, as it offers a generalizable solution for improving self-distillation, independent of the specific application domain.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The major weakness of the paper is the lack of a quantitative and reproducible method for identifying the practical region used for teacher selection, an essential component of the proposed approach. As far as I understand, the paper does not provide sufficient details on how this region is determined, beyond stating that it corresponds to the points at which the student achieves the best validation performance. This lack of clarity makes it difficult to assess the robustness and reproducibility of the method. In addition, the figures and experimental results could be presented in greater detail to support and clarify the findings.

  • Please rate the clarity and organization of this paper

    Satisfactory. The content is clearly presented and the methodology is well-explained, making the paper easy to follow and understand.

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The points raised in the paper are interesting and could positively contribute to the advancement of self-distillation techniques across domains. However, despite demonstrating superior performance, the proposed solution lacks a quantitative method for identifying key components, particularly the teacher selection process, thus limiting its reproducibility.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes a strategy for self-distillation for surgical workflow analysis that utilizes multiple teachers, following an analysis that revealed high variability in the quality of soft labels. This in-depth analysis shows that different training conditions (different GPUs, different learning rates, …) have an influence on the quality of soft labels. The proposed strategy works in two stages: 1. Multiple teacher models are trained in an ensemble with varying configurations to generate soft labels, which are aggregated through averaging to train the student model (spatial learning). 2. Temporal learning: sequences are generated from the student’s embeddings of consecutive frames, and a transformer is used to generate final predictions for those sequences.
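    A minimal sketch of the second (temporal) stage as described above, under stated assumptions: the class name, embedding size, depth, and class count (100 triplet classes in CholecT50) are not given here, and the transformer is realized with a stock PyTorch encoder.

```python
import torch
import torch.nn as nn

class TemporalDecoder(nn.Module):
    """Illustrative stage-2 module: a transformer over the student's
    per-frame embeddings from consecutive frames (names/sizes assumed)."""
    def __init__(self, embed_dim=768, num_classes=100, depth=2, heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, frame_embeddings):
        # frame_embeddings: (batch, seq_len, embed_dim), produced by the
        # trained student backbone on consecutive frames
        context = self.encoder(frame_embeddings)
        return self.head(context)  # (batch, seq_len, num_classes) logits
```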

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • In-depth analysis of the relationship between training conditions and performance of teachers in terms of soft label quality.
    • The manuscript is well-organized. The experiments are well-designed and follow a structure based on research questions, which makes the train of thoughts easy to follow.
    • Extensive evaluations across different epochs/initializations/hardware (RQ1), teacher selection strategies (RQ2), and datasets (RQ3).
    • The paper is clear about its limitations.
    • Practical, informed guideline for picking a method that beats the precision of SOTA on OOD data, leveraging a multi-teacher approach.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • Hard to reproduce, no code provided. However, data are public and data handling is described well. It should be possible, though laborious, to reproduce the code. Results cannot be reproduced this way.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Thorough design of experiments, interesting in-depth analysis that motivates methodology (which beats SOTA thanks to design choices that were made), manuscript is well-written and comprehensive.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank the reviewers for the early acceptance of our work, their feedback, and helpful comments to further improve our work. In the final version of the manuscript, we will clarify the following points:

Teachers ensemble novelty: (R1) Our primary contribution is the novel strategy for selecting optimal teachers for ensembles. This differentiates our work from existing Knowledge Distillation (KD) ensemble methods that typically use fully converged teachers in their ensemble. Paper [1] explores teacher ensembles but doesn’t address our key focus: the relationship between teacher training stage and distillation effectiveness. Our work shows that conventional selection of fully converged teachers often leads to suboptimal knowledge transfer. These aspects—optimizing soft labels and identifying the ideal teacher training stage—address common gaps in the literature. Additionally, our approach to ensembling is more targeted. While paper [1] seeks complementary teachers, we propose ensembles that capture: (1) intra-teacher knowledge through our Intermediate-Epochs-Ensemble (IEE) as soft labels evolve throughout training, and (2) inter-teacher knowledge (similar to [1]) by introducing diversity through random seed initialization. We will point to future work exploring additional diversity sources like architectural variations [1], augmentations, and learning rates.

Knowledge Distillation (KD) vs Self-distillation (SD) in the case of multiple teachers: (R1) While our Intermediate Teacher (IT) and Intermediate Epochs Ensemble (IEE) variants use a single teacher with multiple checkpoints, the IESE variant trained with different seeds could arguably fall into the KD category. Nevertheless, we opted to use the term SD because we believe KD mechanisms differ fundamentally when applied to different architectural configurations.
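To make the variants above concrete, here is a hedged sketch of how the teacher sets could be assembled; the checkpoint-loading helper and seed list are hypothetical, and the epoch-5 / epochs-2-8 values come from the action-recognition setting reported further below.

```python
def collect_teachers(variant, load_ckpt, epochs=range(2, 9), seeds=(0, 1, 2)):
    """Assemble teacher checkpoints for the IT / IEE / IESE variants.
    load_ckpt(seed, epoch) is a hypothetical helper returning a model."""
    if variant == "IT":    # Intermediate Teacher: one checkpoint, one seed
        return [load_ckpt(seed=seeds[0], epoch=5)]
    if variant == "IEE":   # Intermediate Epochs Ensemble: one seed, several epochs
        return [load_ckpt(seed=seeds[0], epoch=e) for e in epochs]
    if variant == "IESE":  # several seeds, several epochs
        return [load_ckpt(seed=s, epoch=e) for s in seeds for e in epochs]
    raise ValueError(f"unknown variant: {variant}")
```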

Training time increase compared to performance gains: (R1) We will add quantitative information on compute resources required for all the proposed versions.

Open access to code: (R1 & R2) As promised in the original submission, we will provide access to the code that can be used to reproduce our results by linking the relevant repository in the abstract.

Lack of quantitative method for identifying the practical region: (R3) We will clarify that the practical region follows a consistent pattern across datasets: (1) soft label evolution always follows the same trajectory (underfitting → optimal → overfitting), and (2) the training loss approaching zero reliably indicates the beginning of the overfitting region. In our action recognition experiments (20 epochs), we consistently found the best teacher performance at epoch 5, with the “practical region” spanning epochs 2-8. Similar patterns appeared across external datasets. Future improvements: we acknowledge opportunities to strengthen this methodology, such as subsampling epochs, formalizing training-loss-based boundary detection, and exploring PEFT methods to accelerate region identification. We will update our methods section to include a better definition of the “practical region”.
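A sketch of how the training-loss-based boundary detection mentioned above could be formalized; the loss floor and warm-up cutoff are assumed values, not thresholds from the paper.

```python
def practical_region(train_losses, loss_floor=0.05, warmup_epochs=2):
    """Return the 1-indexed epochs in the assumed 'practical region':
    after an initial underfitting phase and before the training loss
    approaches zero (the overfitting indicator described above)."""
    overfit_start = len(train_losses) + 1
    for epoch, loss in enumerate(train_losses, start=1):
        if loss < loss_floor:
            overfit_start = epoch
            break
    return [e for e in range(1, len(train_losses) + 1)
            if warmup_epochs <= e < overfit_start]
```

For a 20-epoch run whose training loss first drops below the floor at epoch 9, this returns epochs 2-8, matching the region reported above.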

References: [1]: https://arxiv.org/abs/2503.11363




Meta-Review

Meta-review #1

  • Your recommendation

    Provisional Accept

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A


