Abstract

Active learning (AL) can reduce annotation costs in surgical video analysis while maintaining model performance. However, traditional AL methods, developed for images or short video clips, are suboptimal for surgical step recognition due to inter-step dependencies within long, untrimmed surgical videos. These methods typically select individual frames or clips for labeling, which is ineffective for surgical videos where annotators require the context of the entire video for annotation. To address this, we propose StepAL, an active learning framework designed for full video selection in surgical step recognition. StepAL integrates a step-aware feature representation, which leverages pseudo-labels to capture the distribution of predicted steps within each video, with an entropy-weighted clustering strategy. This combination prioritizes videos that are both uncertain and exhibit diverse step compositions for annotation. Experiments on two cataract surgery datasets (Cataract-1k and Cataract-101) demonstrate that StepAL consistently outperforms existing active learning approaches, achieving higher accuracy in step recognition with fewer labeled videos. StepAL offers an effective approach for efficient surgical video analysis, reducing the annotation burden in developing computer-assisted surgical systems.



Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3578_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{ShaNis_StepAL_MICCAI2025,
        author = { Shah, Nisarg A. and Safaei, Bardia and Sikder, Shameema and Vedula, S. Swaroop and Patel, Vishal M.},
        title = { { StepAL: Step-aware Active Learning for Cataract Surgical Videos } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15968},
        month = {September},
        pages = {555--565}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes an active learning framework, StepAL, for long video surgical step recognition, aiming to reduce annotation costs while maintaining model performance. Traditional active learning methods, such as selection strategies based on single frames or short video clips, often fail to account for the temporal dependencies between surgical steps and the contextual information of the entire video, making them difficult to apply to long surgical videos. StepAL addresses this issue through Step-aware Feature Representation (SFR) and Entropy-weighted Clustering (EWC).

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The StepAL framework introduced in this paper is the first to combine step-aware feature representation with entropy-weighted clustering, directly addressing the core challenges of long video surgical step recognition, such as step dependencies and the need for global context in annotations. By capturing step distribution differences with pseudo-labeling, it overcomes the diversity loss caused by feature averaging in traditional methods, representing a significant advancement in the field.
    2. The paper emphasizes the practicality of full-video annotations (rather than segment-based annotations), aligning with the real-world annotation process where surgeons need to globally review the video. This approach avoids the context loss caused by local annotations in traditional methods, making it highly valuable for practical applications.
    3. Extensive experiments were conducted on two publicly available cataract surgery datasets, with multi-metric and multi-baseline comparisons. The results show that StepAL outperforms existing methods even during the first annotation cycle (R=1).
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The formulas are missing numbering, such as the first two formulas in Section 2.1 (Step-aware Feature Representation). While this is a minor issue, it is worth addressing for clarity and reference purposes.
    2. The paper mentions that pseudo-labels may be inaccurate during the early stages of active learning (“particularly in early AL cycles”), which directly affects the quality of Step-aware Feature Representation (SFR). The core of SFR relies on classifying segment features into predicted steps based on pseudo-labels. However, if the pseudo-labels have a high error rate (e.g., due to poor initial model performance), the feature representation may incorrectly aggregate segments from different steps, leading to bias in subsequent clustering selection. For example, if a step is not correctly predicted (i.e., the step is not labeled), the authors use global average features to fill in, which introduces noise, especially if the step actually exists but is not identified. This design may reduce the model’s sensitivity to the differences between steps, thus affecting clustering performance.
    3. Entropy-weighted Clustering (EWC) relies on video-level entropy, specifically the average entropy of segments. However, this averaging operation may obscure the uncertainty differences between steps within the video. For instance, if a video contains 10 steps, and 2 steps have highly uncertain predictions (high entropy) while the remaining 8 steps are predicted accurately (low entropy), the overall average entropy could be low, potentially underestimating the importance of the video. In contrast, segment-weighted entropy might be more effective in capturing such differences.
    4. The experiments were only validated on two cataract surgery datasets and did not cover other types of surgeries (e.g., laparoscopic surgery or orthopedic surgery). The step complexity, duration, and visual feature differences across various surgical scenarios are significant, limiting the generalizability of the approach. Additionally, the initial labeling data amount was set to 10% (equivalent to just 2.5 videos in Cataract-1k). If the initial model performance is poor, the quality of pseudo-labels could collapse, rendering subsequent selection ineffective, but the paper does not discuss this potential issue.
    5. The paper lacks a theoretical explanation of the effectiveness of the individual components. In particular, it would be valuable to explain whether SFR, by simply averaging globally, retains subtle but significant differences between the various steps in the surgical process.
    6. The ablation experiments only compared component combinations and did not analyze the impact of key hyperparameters, such as the number of clusters for weighted KMeans clustering. This omission limits the understanding of how hyperparameter choices affect the overall performance.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    1. Improve the numbering and referencing of equations to enhance the precision and rigor of the text.
    2. Include preliminary validation on other types of surgical videos or discuss the potential generalization limitations of StepAL.
    3. Provide a detailed analysis of the impact of the pseudo-label update strategy on the results to prevent early error accumulation.
    4. Offer a sensitivity analysis of hyperparameters (e.g., the basis for choosing the number of clusters) to guide parameter adjustments in practical applications.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the paper presents innovative ideas in the field of active learning for surgical videos, such as the combination of step-aware feature representation and entropy-weighted clustering, the current limitations, particularly in terms of pseudo-label reliability, experimental generalization, and weak theoretical analysis, affect the comprehensiveness and persuasiveness of the proposed method.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    This work is straightforward and has strong potential for easy adoption in clinical settings. Although its generalizability and robustness require further evaluation, it is still valuable to present at MICCAI for discussion.



Review #2

  • Please describe the contribution of the paper

    The authors propose a strategy to sample cases from the unlabeled set in an informed manner that would boost model performance the most. Specifically, they propose an entropy-weighted clustering strategy to pick cases that are both uncertain and diverse.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The work has strong potential for adoption in clinical settings. Video data has become enormous; annotation is time-consuming, and selecting all videos for annotation is not feasible. Identifying which videos require annotation to help model performance is essential, and this work attempts to address that issue. The methodology is simple and easy to adopt.
  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weak evaluation: the method is only benchmarked on cataract datasets, while other public benchmark datasets exist (AutoLaparo and Cholec80 for phases, MultiBypass140 for both phases and steps). Weak baseline comparison: there are more recent works addressing this problem, for example, Two-Stage Active Learning for Efficient Temporal Action Segmentation. Its first stage identifies unlabeled videos to annotate, and its second stage selects frames within those videos (the second stage may not be relevant to this work, but a comparison could be made against the first stage).

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
    • It appears that at R=1, i.e., when labeled data is scarcest, this method is the most beneficial, which is a major advantage. I think this should be stated more clearly in words, since the value of "R=1" alone is not readily understood. In Fig. 2, the performance appears to match Coreset at R=2,3 but improves at R=1,4. Can the authors comment on this?
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Requires more evaluation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper proposes StepAL, an active learning framework designed for long, multi-step surgical videos, which effectively reduces annotation cost through a full-video selection mechanism while improving surgical step recognition performance. It includes Step-aware Feature Representation, which utilizes pseudo-labels to capture the distribution of surgical steps in each video so as to better preserve inter-step dependencies, combined with Entropy-weighted Clustering, which optimizes both uncertainty and diversity in video selection to ensure that the selected videos are both informative and representative. Experiments verify the effectiveness and superiority of the proposed method.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-organized and easy to follow.

    2. The combination of SFR and EWC is logically sound, enabling better utilization of fine-grained step-level information while effectively balancing uncertainty and step diversity.

    3. Extensive experiments on real-world datasets demonstrate the effectiveness and superiority of StepAL.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Pseudo-labeling is employed to construct step-aware feature representations, but the instability of pseudo-labeling in the early active learning loops has not been thoroughly analyzed (e.g., pseudo-labeling may be less accurate when model performance is low).

    2. While the methodology highlights the dependencies between surgical steps, it does not explicitly address how StepAL manages long or complex dependencies between steps.

    3. Inference time and computational resource consumption are critical in real-world active learning applications, yet the paper does not provide an analysis of StepAL's inference efficiency.

    4. The performance of models trained using fully supervised learning was not evaluated.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Please see weaknesses.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although the proposed method demonstrates effectiveness on video datasets, it lacks certain relevant analyses. Refer to the Weaknesses section for further details.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The author addresses my concerns and I have no further questions at this time.




Author Feedback

  1. Step-aware Feature Representation (SFR) (Mechanism, Theory) (R1, R2): SFR preserves step nuances, unlike global averaging. It computes per-video average features (centroids) for each predicted step from the associated clips. These C L2-normalized centroids form a C×D "bag of prototypes" encoding the video's step composition. K-Means then groups videos by all-step similarity (akin to Fisher vectors), unlike single pooled embeddings that lose crucial distinctions. Filling missing steps with global averages is rare (<2% of segments) and used only for completeness. The ablation (Table 2) validates this: StepAL 71.69% Acc. vs. ME-KMeans 68.07%.
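  As a rough illustrative sketch (not the authors' code), the "bag of prototypes" described above can be written as: per-step centroids of clip features under the pseudo-labels, each L2-normalized, concatenated into a C×D vector, with never-predicted steps falling back to the global average feature. Function and variable names here are our own.

```python
import numpy as np

def step_aware_representation(clip_feats, pseudo_labels, num_steps):
    """One video's step-aware "bag of prototypes" (illustrative sketch).

    clip_feats: (N, D) array of clip features for one video.
    pseudo_labels: (N,) array of predicted step indices in [0, num_steps).
    Returns a flattened (num_steps * D,) vector of L2-normalized per-step
    centroids; steps never predicted fall back to the global average
    feature (per the rebuttal, a rare case used only for completeness).
    """
    global_avg = clip_feats.mean(axis=0)  # fallback for missing steps
    prototypes = []
    for c in range(num_steps):
        mask = pseudo_labels == c
        proto = clip_feats[mask].mean(axis=0) if mask.any() else global_avg
        prototypes.append(proto / (np.linalg.norm(proto) + 1e-8))
    return np.concatenate(prototypes)
```

  Clustering these vectors then compares videos by the similarity of all their step prototypes at once, rather than by a single pooled embedding.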

  2. Early-cycle pseudo-label noise (R1, R2, R3): Early-cycle pseudo-labels (Sec. 2) can be noisy; however, our model (R=0, 10% data) achieves ~45% accuracy, a strong starting point. These labels only guide SFR for selection (not direct training), limiting their impact. Experimentally, SFR outperforms global-feature methods by 5.3% (StepAL vs. ME-KMeans, Table 2). At R=1, StepAL boosts Cataract-1k accuracy by 4.66% over the next-best method (Table 1), confirming the advantage of pseudo-label features.

  3. Entropy-weighted Clustering (R1): Mean entropy targets videos that are uncertain across all steps, making them stronger candidates than those with only a few doubtful segments (which the model often self-corrects). This method lifts the Entropy baseline to 67.03% vs. 64.17% for maximum segment entropy (Table 2 update planned). Finer-grained uncertainty is an interesting avenue for future work.

  4. Hyperparameter Clarification and Cold Start (R1): K=b (the budget) selects 1 video/cluster, linking K to annotation cost without extra tuning. Cold-start risk is acknowledged but empirically negligible in our setting. The 10% seed (~3 videos ≈ 1.25k clips) yields 45% accuracy, matching the standard initial performance in the AL literature (Coreset, CoreGCN). StepAL's joint uncertainty + diversity picks expand coverage efficiently, raising accuracy to 71% at R=1, showing the pseudo-labels' reliability. Datasets with poorer initial performance can start with a larger pool; we will note this.
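  A minimal sketch of the entropy-weighted selection described in points 3–4 (again not the authors' code): K-Means with K = budget over the step-aware video representations, where each video's contribution to its cluster centroid is weighted by its mean prediction entropy, followed by one pick per cluster. The per-cluster tie-break (taking the most uncertain member) is our assumption, not a detail stated in the rebuttal.

```python
import numpy as np

def entropy_weighted_selection(reps, mean_entropy, budget, iters=50, seed=0):
    """Entropy-weighted K-Means video selection (illustrative sketch).

    reps: (V, C*D) step-aware video representations.
    mean_entropy: (V,) per-video mean prediction entropy (positive).
    Returns `budget` video indices, one per cluster.
    """
    rng = np.random.default_rng(seed)
    centers = reps[rng.choice(len(reps), size=budget, replace=False)].copy()
    labels = np.zeros(len(reps), dtype=int)
    for _ in range(iters):  # Lloyd's algorithm with entropy-weighted means
        dists = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for k in range(budget):
            members = labels == k
            if members.any():
                centers[k] = np.average(reps[members], axis=0,
                                        weights=mean_entropy[members])
    # assumed tie-break: pick the most uncertain video in each cluster
    picks = [int(np.flatnonzero(labels == k)[
                 np.argmax(mean_entropy[labels == k])])
             for k in range(budget) if (labels == k).any()]
    return sorted(picks)
```

  Setting K equal to the budget ties the number of clusters directly to annotation cost, which is why no separate cluster-count tuning is needed.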

  5. Dataset Evaluation (R1, R3): To ensure impactful contributions to cataract surgery workflow analysis, StepAL was rigorously evaluated on Cataract-1k (2021-23) and C-101 (2017)—the field’s established public benchmarks (Ghamsarian et al. 2024, Shah et al. 2023). The datasets have substantial variations (era, count, resolution, FPS, duration, step definition [13 vs 10], anno. density); C-101 is simpler (Sec. 3). StepAL’s consistent SOTA performance (both settings, Table 1) shows robustness. While our work is relevant for cataract surgery, we believe the extension of our approach to other surgery and general computer vision datasets is an interesting future direction.

  6. Method Evaluation (R3): TSAL's 2nd stage (frame selection) is irrelevant to our full-video task. Its 1st stage selects videos whose action ordering is dissimilar to the labeled set, whereas StepAL chooses diverse, uncertain videos, since the high-level action ordering in surgical videos is largely fixed. Adapting TSAL is therefore non-trivial within rebuttal constraints. Baselines (CoreGCN [CVPR], ME-KMeans ablation [Table 2], similar to TSAL's) show StepAL's superior performance. We will update the paper to mention TSAL.

  7. Coreset Performance (R3): StepAL leads Coreset at R=1 (+9.24%) by identifying better step diversity/uncertainty (SFR+EWC). While Coreset narrows this gap via naive diverse sampling (R=2-3; StepAL +0.5%), StepAL again improves significantly at R=4 (+3.94%, while Coreset plateaus), avoiding saturation as the budget increases for an efficient path to the oracle.

  8. Inference cost & Fully-supervised Performance (R2): Our runtime (1.14s/cycle) matches diversity methods (KMeans 0.91s, Coreset 1.03s) while offering better performance (Table 1). Since AL's goal is to reduce costly human annotation, its efficiency gains (Table 1, Fig. 2: higher accuracy, fewer labels) outweigh the negligible compute cost of sample selection. Fully supervised oracles (Sec. 3: C-1k 92.0%, C-101 89.5%) serve as upper bounds.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    This paper proposes StepAL, a novel active learning framework specifically designed for long surgical step recognition videos. It aims to significantly reduce annotation costs while maintaining model performance through Step-aware Feature Representation and Entropy-weighted Clustering. Although one reviewer did not provide explicit post-rebuttal feedback, the other reviewers found the authors' responses satisfactory. The method is highlighted for its straightforwardness and strong potential for practical adoption in clinical settings, making it a valuable contribution to surgical video analysis. Therefore, I recommend accepting it.


