Abstract

Consensus amongst researchers and industry points to a lack of large, representative annotated datasets as the biggest obstacle to progress in the field of surgical data science. Advances in Self-Supervised Learning (SSL) offer a solution, reducing the dependence on large labeled datasets by providing task-agnostic initializations. However, the robustness of current self-supervised learning methods to domain shifts remains unclear, limiting our understanding of their utility for leveraging diverse sources of surgical data. Shifting the focus from methods to data, we demonstrate that the downstream value of SSL-based initializations is intricately intertwined with the composition of pre-training datasets. These results underscore an important gap that needs to be filled as we scale self-supervised approaches toward building general-purpose “foundation models” that enable diverse use-cases within the surgical domain. Through several stages of controlled experimentation, we develop recommendations for pre-training dataset composition, evidenced through over 300 experiments spanning 20 pre-training datasets, 9 surgical procedures, 7 centers (hospitals), 3 labeled-data settings, 3 downstream tasks, and multiple runs. Using the approaches described here, we outperform state-of-the-art pre-trainings on two public benchmarks for phase recognition: up to 2.2% on Cholec80 and 5.1% on AutoLaparo.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1998_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1998_supp.pdf

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Ala_Jumpstarting_MICCAI2024,
        author = { Alapatt, Deepak and Murali, Aditya and Srivastav, Vinkle and AI4SafeChole Consortium and Mascagni, Pietro and Padoy, Nicolas},
        title = { { Jumpstarting Surgical Computer Vision } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This work highlights the importance of SSL-based initialization for pre-training in surgical computer vision tasks. Through detailed experiments, the paper proposes a pre-training strategy that can improve model performance and can potentially be scaled to other areas.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper raises a common but under-explored question about how the choice of pre-training data affects downstream computer vision tasks. It focuses on the data rather than the method, which is inspiring and can be scaled to other areas.
    2. This work conducts controlled and systematic experiments to demonstrate the effect of different data sources on performance across multiple downstream tasks. The experiments consist of 4 stages, covering a baseline and different combinations of data sources, procedures, and locations.
    3. This work provides a solid and detailed discussion of the impact of different pre-training strategies on model performance. It uses quantitative details to demonstrate the advantages of using clinical data for pre-training and to relate data scale to model capacity.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. Lack of novelty. The paper mainly focuses on an experimental comparison of pre-training data rather than on theoretical novelty in pre-training itself. It could be improved if the paper proposed a systematic data pre-training method with the potential to scale to other tasks.
    2. The factors considered in the experimental analysis (procedures, source centers) are too narrow to provide insight for other computer vision scenarios. The experiments explore the impact of different procedures, source centers, and their combinations, which makes sense for phase recognition and CVS assessment. However, these combinations may not be meaningful for other surgical tasks such as tissue segmentation or instrument detection. It would be interesting if the pre-trained model could show better generalizability to other downstream surgical tasks with the current dataset; otherwise, this work should propose a feasible method to transfer its pre-training strategy to other downstream tasks.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    It would be easier to reproduce if the authors provided model weights for the different pre-training data configurations.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Although this work reports very detailed experiments and discusses the resulting findings on the dataset, these findings do not establish a specific strategy that can be extended to other areas. The authors need to comment on their proposals and on how their findings can bring insights to those areas.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concern is that this paper lacks a feasible pre-training strategy proposal. Although the authors conducted very detailed experiments and discussed their findings on the dataset, these findings do not establish a specific strategy that can be extended to other areas. The authors need to comment on their proposals.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Although the methodological innovation is limited, the authors argue that their systematic data-driven experiments help explore foundation models for surgical applications. The authors also illustrate the potential generalizability of their approach.



Review #2

  • Please describe the contribution of the paper

    The authors evaluate the influence of domain shifts from pre-training to finetuning on performance in the realm of endoscopic video analysis. Fixing the self-supervised learning (SSL) method (MoCo v2 to pre-train a ResNet-50) and the downstream tasks (phase recognition and critical view of safety (CVS) prediction), they demonstrate that both the procedure type(s) and the data origin(s) (location/hospital) used for SSL have a critical effect on downstream task performance, at least when working with common dataset sizes.
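    A minimal sketch (not the authors' actual code) of the pipeline described above, assuming a PyTorch/torchvision setup: an SSL-pretrained (e.g. MoCo v2) ResNet-50 backbone checkpoint initializes a classifier that is then fine-tuned per downstream task. The checkpoint key layout ("state_dict", "encoder_q." prefix) and the number of classes are assumptions for illustration only.

    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    def init_classifier_from_ssl(ckpt_path: str, num_classes: int = 7) -> nn.Module:
        """Build a ResNet-50 classifier initialized from an SSL-pretrained backbone."""
        model = resnet50(weights=None)
        model.fc = nn.Linear(model.fc.in_features, num_classes)  # e.g. 7 phases (assumed)

        state = torch.load(ckpt_path, map_location="cpu")["state_dict"]  # key name assumed
        # MoCo-style checkpoints often prefix backbone weights with "encoder_q."
        # (an assumption here); keep those weights and drop the SSL projection head.
        backbone = {k.removeprefix("encoder_q."): v for k, v in state.items()
                    if k.startswith("encoder_q.") and not k.startswith("encoder_q.fc")}
        model.load_state_dict(backbone, strict=False)
        return model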

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper tackles a relevant question given the rise of medical foundation models. In contrast to mainstream work, it covers data-centric aspects.

    • Comprehensive experiments with data from multiple hospitals and various procedures were performed.

    • Interesting insights with respect to performance of SSL methods when compared to the traditional ImageNet baseline.

    • The paper is well-written and well-structured and features nice illustrations.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The experiments are not extensive enough to support the claims that authors are making
      • Limited sample sizes that do not substantially support any claims of generalizability. Results could potentially look very different for different dataset sizes.
      • Only one SSL method (MoCo v2) was evaluated for most of the experiments.
      • Only two downstream tasks, which are very selective and, as the authors state, conceptually similar, were chosen for the finetuning evaluation; thus the conclusions derived may only pertain to such tasks or settings.
    2. Having access to only a handful of videos for fine-tuning is in my opinion not a reasonable assumption. Hence, part of the validation strategy does not reflect a clinically realistic scenario.

    3. Reporting on the method’s performance using one metric per downstream task, namely mAP for CVS assessment and F1-score for phase recognition, is not sufficient, because no performance metric can in isolation capture all relevant aspects (see https://www.nature.com/articles/s41592-023-02151-z).

    4. Some results are contradictory. In stage 2, the results suggest that pre-training on a procedure relevant to the downstream task improves performance while this was not confirmed in stage 4. Overall, it is hard to draw generalizable conclusions.

    5. The uncertainty analysis does not account for the variability of performance across videos or the confidence in the reported mean performance.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Along the lines of facilitating the transparency and reproducibility of the presented approaches and results, the authors have included a statement that their pre-trained initializations, along with the released code, will be made available upon acceptance. On top of that, I would consider information on how performance metric values were summarized, including reporting of the variability in method performance (by means of confidence intervals and standard deviations), to be of high importance. From an infrastructural perspective, reporting on the computational resources required for the conducted experiments seems highly relevant for the reproducibility of the presented results.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Major

    1. Metric Selection: Motivated by previous work, you have considered one performance metric per task, namely mAP for CVS prediction and the F1-score for phase recognition. These should be complemented for a more comprehensive analysis. For example, detection metrics such as mAP do not take temporal consistency into account. Furthermore, when using the F1-score, details on metric aggregation should be provided. Further reading:
      • Luiten, Jonathon, et al. “Hota: A higher order metric for evaluating multi-object tracking.” International journal of computer vision 129 (2021): 548-578.
      • Mao, Huizi, Xiaodong Yang, and William J. Dally. “A delay metric for video object detection: What average precision fails to tell.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
      • Sobti, Anupam, et al. “VmAP: A Fair Metric for Video Object Detection.” Proceedings of the 29th ACM International Conference on Multimedia. 2021.
      • Dergachyova, Olga, et al. “Automatic data-driven real-time segmentation and recognition of surgical workflow.” International journal of computer assisted radiology and surgery 11 (2016): 1081-1089.
      • Funke, Isabel, Dominik Rivoir, and Stefanie Speidel. “Metrics matter in surgical phase recognition.” arXiv preprint arXiv:2305.13961 (2023).
    2. Stage 2 and Stage 4 descriptions: Could you please clarify how confounding that could potentially be attributed to differences in instrumentation or workflow can be mitigated by having the same number of cases per procedure?

    3. Temporal consistency: Considering the performance metrics reported in the results table (mAP and F1-score), it appears that the temporal order of the frames included in the video data that have been used for the analyses has not been taken into account. Since the decision to focus on those two metrics is motivated by the previous bulk of work, could you please explain how you would incorporate the temporal aspect of the data when not restricting your assessment on those two metrics?
      • Supplementary Material, Table 3: Could you please clarify how the reported SSL hyperparameters were chosen?

    4. Stage 4: Could you please clarify precisely how you identified the optimal scaling strategy? Would you consider the optimal scaling strategy you identified to be generalizable? Also, the results of Stage 4 should be investigated or discussed further. The fact that the pre-training on Laparo425 without LC is not inferior to the pre-training without LC on Cholec80 seems to be somewhat at odds with the assertion that procedure-specific pre-training improves results.

    Minor

    • Please introduce all acronyms in abstract and main text (e.g. SSL, LC)
    • Typos: Please correct typos (e.g. “mitgiating”, “surical”, “of a related surgical procedures”)
    • I would recommend consistently using “foundation” (not “foundational”).
    • Table 3: Please improve readability by making columns wider.
    • 2.3: “This pre-training procedure results in a trained ResNet-50 backbone, which we use to initialize a ResNet-50 classifier that we finetune separately for each downstream task while varying the number of labeled videos.” It is not clear to me how the finetuning was done for the CVS task, especially as the term “Linear Finetuning” is used in the Suppl. (see the sketch after this list for the distinction in question).
    • 2.4: “To enable this analysis, we separate Laparo420, which is a collection of various surgical videos, into 8 different subsets by procedure type (listed in Figure 1)”. From Figure 1 alone, it is clear that these datasets are then used for pretraining, but the text does not seem to specify them extensively.
    • Page 8: Please use “MultiCholec2024” instead of “MultiChole2024”.
    • There is no mention of the compute budget used for the various experiments. Could you please add some information related to the computational resources required for your conducted experiments?
    • 2.2 Downstream Tasks: Can you please replace “std” with “standard deviation” and also include this in the caption of Table 2 and Table 3.
    • Table 2, Table 3 captions: Please update the caption, so that it explicitly explains what the results included in the tables correspond to.
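    A minimal sketch of the distinction raised in the 2.3 comment above, assuming a standard PyTorch setup: “linear finetuning” freezes the backbone and trains only the classification head, whereas full finetuning updates all parameters. Which variant the paper actually uses for CVS is precisely what the comment asks; the output sizes below (3 CVS criteria, 7 phases) are assumptions for illustration.

    import torch.nn as nn
    from torchvision.models import resnet50

    def build_classifier(num_outputs: int, linear_only: bool) -> nn.Module:
        model = resnet50(weights=None)  # backbone weights would come from SSL pre-training
        model.fc = nn.Linear(model.fc.in_features, num_outputs)
        if linear_only:
            for name, p in model.named_parameters():
                p.requires_grad = name.startswith("fc.")  # train only the linear head
        return model

    cvs_model = build_classifier(num_outputs=3, linear_only=True)     # “linear finetuning”
    phase_model = build_classifier(num_outputs=7, linear_only=False)  # full finetuning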
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I like the data-centric approach to making progress in the field of surgical foundation models. The weakest point is the restricted setting with respect to the number of SSL methods, data set sizes and downstream tasks, which makes it challenging to draw broad conclusions from the work.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    In contrast to my colleagues I don’t think that the lack of (technical) novelty is a key issue of the paper. Comprehensive validation studies can reveal exciting insights that are highly relevant. However, I still think that the experiments don’t allow for drawing broad conclusions, which is why I don’t have a strong opinion on the paper.

    In my opinion, the authors should really change the title because the contribution does not justify such a broad claim.



Review #3

  • Please describe the contribution of the paper

    The paper addresses the lack of large annotated datasets in surgical data science and explores the use of self-supervised learning (SSL) to reduce the dependency on labeled data. Through over 300 experiments, the study investigates the impact of various factors on pre-training efficacy, such as dataset composition, procedure types and procedure centers. Pre-training on relevant procedures significantly boosts performance, highlighting the importance of procedure-specific initializations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) This study occasionally outperforms state-of-the-art approaches on two well-established public benchmarks in surgical data science. It demonstrates the effectiveness of SSL-based initializations in enhancing both phase recognition and CVS assessment tasks.
    2) By focusing on data rather than just methods, the paper contributes to building general-purpose ‘foundational models’ that can be applied across diverse surgical domains, showcasing advancements in SSL design for surgical computer vision tasks.
    3) The study offers recommendations for optimizing pre-training datasets, providing practical guidance on how to leverage SSL approaches to improve performance in surgical data science tasks.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1) The scientific novelty of this paper is limited; the main contribution is limited to manipulating/preparing datasets for SSL in surgical data science.
    2) The discussion does not adequately address how initialization with ImageNet pre-training differs from surgery-based pre-trained models when using a reasonably sized labeled dataset. The results are generally comparable.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Curating multi-procedure and multi-center datasets for self-supervised pre-training may need specialized methodological design. Further analysis is needed to clarify whether a large model is necessary to learn feature representations effectively and to boost downstream task performance.

    Typos: “Mitgiating” → “mitigating”

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Although this study has limited scientific novelty, the extensive number of experiments and robustness of the analysis contribute to the development of general-purpose ‘foundational models’ that can be applied across diverse surgical domains.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I appreciate the author’s efforts in addressing my comments. While I still consider the technical novelty to be limited and feel that a constructive conclusion for all the experiments is missing, this study addresses an important segment of any deep learning pipeline: data manipulation. The extensive data-driven experiments contribute to the development of a foundational model for the surgical domain.




Author Feedback

We are grateful to the (R)eviewers for their considered and informed reviews as well as their constructive comments to better this work. We categorize and address the principal concerns below:

Generalizability: To balance the range of settings and the robustness of results, we made several design choices while selecting the final 315 experiments summarized in this manuscript. We appreciate the reviewers’ concerns about the applicability of our findings beyond the explored settings, including other metrics, SSL methods, tasks, and sample sizes. Wherever possible, these tradeoffs were informed by experimental findings (e.g. whether there were informative changes in trends using other metrics) and on previous literature (e.g. MoCo’s comparable performance to other SSL methodologies on a range of relevant downstream tasks including anatomy segmentation and instrument presence detection [17]). Still, we share these concerns and have tried to emphasize this by tempering our claims and being transparent with our results (through the release of checkpoints and code). Extensions of this work would build on this extensive (noted by R1,3,5) foundation to explore and bolster the recommendations made.

Novelty & Value: Note that this work is positioned as an application of SoTA methodology (i.e. SSL) to a new problem (i.e. leveraging diverse surgical data), addressing the application track of MICCAI. As such, while not methodologically innovative, we strongly argue that our work does present scientific novelty of value to the MICCAI readership. While previous studies have demonstrated the value of scaling SSL methodology to use diverse surgical data, this work is the first to systematically explore the impact that dataset composition can have on performance. This is relevant as it highlights important limitations in previous work (including at MICCAI, e.g. where only procedures significantly represented in the pretraining datasets were tested [6,22]) and provides indications for improvement. Broadly, as noted by R1,3, it provides practical insights into how to leverage diverse (and accessible) sources of data, particularly in low-label settings (e.g. feasibility studies, rapid prototyping, and few-shot adaptation) and in the shift toward unified foundation models.

Metrics: We would like to clarify that we calculate the F1-score for surgical phase recognition by averaging across videos, to reflect the variability between videos. We thank R1 for highlighting the need for clarity, as this has previously caused confusion in the literature. Due to space constraints, we will provide additional metrics (balanced accuracy for CVS; accuracy, precision, and recall for phase recognition) on GitHub.
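A hedged sketch of the video-wise aggregation described above, not the authors' released code: the F1-score is computed per video and then averaged, with the standard deviation reported across videos. Function names and inputs are illustrative assumptions.

    import numpy as np
    from sklearn.metrics import f1_score

    def video_averaged_f1(labels_per_video, preds_per_video, average="macro"):
        """Mean and standard deviation of per-video F1-scores (rather than one pooled F1)."""
        scores = [f1_score(y_true, y_pred, average=average)
                  for y_true, y_pred in zip(labels_per_video, preds_per_video)]
        return float(np.mean(scores)), float(np.std(scores))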

Clarification on experimental setup for Stages 2 and 3: As R1 noted, fixing the number of videos in Stages 2 and 3 doesn’t address workflow variations within and across centers and procedures. Instead, having a large number of cases per procedure/center and many procedures/centers helped us represent this variability. We will revise the text for clarity.

Clarification on Stage 2 and 4 results: R1 points out that the relatively modest boost between Laparo420 with and without LC in Stage 4 may contradict the claim that procedure-specific initializations improve results. First, we would like to note that this is in line with the modest increase in cholecystectomy pre-training representation, from 0 to ~10%. Note that the two pure-cholecystectomy pre-trainings perform markedly better in almost every setting. Finally, we are not advocating that other procedures should not be used, but rather that using them may not be trivial, as illustrated by the wide range of performance boosts for different procedures in Stage 2.

For brevity, we have omitted minor corrections that we will address in the manuscript such as typos, selection criteria for hyperparameters, and the inclusion of computational budgets.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


