Abstract

Surgical AI often involves multiple tasks within a single procedure, such as phase recognition or assessing the Critical View of Safety in laparoscopic cholecystectomy. Traditional models, built for one task at a time, lack flexibility and require a separate model for each task. To address this, we introduce MML-SurgAdapt, a unified multi-task framework that adapts Vision-Language Models (VLMs), specifically CLIP, to handle diverse surgical tasks through natural language supervision. A key challenge in multi-task learning is the presence of partial annotations when integrating different tasks. To overcome this, we employ Single Positive Multi-Label (SPML) learning, which traditionally reduces the annotation burden by training models with only one positive label per instance. Our framework extends this approach to integrate data from multiple surgical tasks within a single procedure, enabling effective learning despite incomplete or noisy annotations. We demonstrate the effectiveness of our model on a combined dataset consisting of Cholec80, Endoscapes2023, and CholecT50, utilizing custom prompts. Extensive evaluation shows that MML-SurgAdapt performs comparably to task-specific benchmarks, with the added advantage of handling noisy annotations. It also outperforms existing SPML frameworks on this task. By reducing the required labels by 23%, our approach offers a more scalable and efficient labeling process, significantly easing the annotation burden on clinicians. To our knowledge, this is the first application of SPML to integrate data from multiple surgical tasks, presenting a novel and generalizable solution for multi-task learning in surgical computer vision. Implementation is available at: https://github.com/CAMMA-public/MML-SurgAdapt.
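
To make the setup concrete, here is a minimal sketch of how a CLIP-style model can score a single laparoscopic frame against natural-language label prompts and produce independent multi-label scores. It assumes the Hugging Face openai/clip-vit-base-patch32 checkpoint, illustrative prompts, and a placeholder file name; the actual MML-SurgAdapt model (custom prompts over the combined 110-label space, a GCN head, and SPML training with Hill loss) is not reproduced here.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Hypothetical prompts spanning the three tasks (phases, CVS criteria, triplets).
    prompts = [
        "a photo of the calot triangle dissection phase",
        "a photo where the cystic duct is clearly exposed",
        "a photo of a grasper retracting the gallbladder",
    ]

    image = Image.open("frame.png")  # placeholder path to one laparoscopic video frame
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)

    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])

    # Cosine similarities serve as per-label logits; a sigmoid yields independent
    # multi-label scores instead of a softmax over mutually exclusive classes.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = torch.sigmoid((img_emb @ txt_emb.T) * model.logit_scale.exp())
    print(scores)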

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4230_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/CAMMA-public/MML-SurgAdapt

Link to the Dataset(s)

Cholec80 dataset: https://github.com/CAMMA-public/TF-Cholec80 Endoscapes dataset: https://github.com/CAMMA-public/Endoscapes CholecT50 dataset: https://github.com/CAMMA-public/cholect50

BibTex

@InProceedings{WalSoh_Adaptation_MICCAI2025,
        author = { Walimbe, Soham and Baby, Britty and Srivastav, Vinkle and Padoy, Nicolas},
        title = { { Adaptation of Multi-modal Representation Models for Multi-task Surgical Computer Vision } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15970},
        month = {September},

}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors propose a framework that adapts a pre-trained vision-language model (CLIP) for multi-task surgical video analysis using natural language supervision. Instead of training separate models for each task, the proposed approach combines phase recognition, CVS assessment, and action triplet recognition into one model. To handle missing labels in the dataset, the authors introduce single positive multi-label (SPML) learning with Hill loss, which reduces the need for annotations by 23%. The model performs well compared to both task-specific and other multi-task methods, showing strong potential for more scalable and efficient surgical computer vision solutions.
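
    The sketch below illustrates the single-positive target construction referred to here: one positive label is drawn per image from its task-specific ground truth (as the authors clarify in their rebuttal) and kept fixed, while all remaining labels, including those of the other tasks, are treated as negatives. The assumed 7-phase / 3-CVS-criteria / 100-triplet split of the 110-label space and all names are illustrative, not details from the released code.

        import numpy as np

        # Hypothetical index ranges of each task within the unified 110-label space.
        TASK_SLICES = {"phase": slice(0, 7), "cvs": slice(7, 10), "triplet": slice(10, 110)}

        def sample_single_positive(task_targets, task, rng):
            """task_targets: (110,) multi-hot ground truth available for this frame's task.
            Returns a target with exactly one positive; every other entry, including the
            unobserved labels of the other tasks, is simply treated as 0 (negative)."""
            spml_target = np.zeros_like(task_targets)
            offset = TASK_SLICES[task].start
            positives = np.flatnonzero(task_targets[TASK_SLICES[task]]) + offset
            if positives.size > 0:
                spml_target[rng.choice(positives)] = 1  # drawn once before training, then fixed
            return spml_target

        rng = np.random.default_rng(0)
        gt = np.zeros(110)
        gt[[12, 47]] = 1  # e.g. a CholecT50 frame annotated with two action triplets
        print(np.flatnonzero(sample_single_positive(gt, "triplet", rng)))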

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) Relevant contribution: The paper addresses the important and complex challenge of multi-task learning in surgical video understanding, a domain where integrating diverse label types (phases, actions, and anatomical views) is both necessary and underexplored.
    2) Effective use of weak supervision: The adoption of the SPML framework is well-justified for the surgical context, especially because dense annotations are costly and time-consuming. This setup could allow for more practical and scalable model training.
    3) Modeling label relationships via GCN: Incorporating graph convolutional networks to capture inter-label dependencies adds depth to the feature space, improving over simpler multi-label approaches that treat labels as independent.
    4) Thorough experimental evaluation: The paper uses strong baselines and compares multiple loss functions across different supervision setups, providing a solid empirical foundation for the claims posed in the paper.
    5) Strong performance with reduced supervision: The proposed model outperforms existing baselines and performs on par with or better than fully supervised, task-specific models, despite using 23% fewer labels, demonstrating both effectiveness and efficiency in low-annotation settings.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weaknesses:
    1) Ambiguity in the SPML setup: It is somewhat unclear how the single positive label per image is selected. Is it randomly chosen once before training or re-sampled every epoch? Moreover, given that each dataset is focused on a specific task, is there a risk that randomly selecting one label per image introduces bias, especially with imbalanced label distributions across tasks? Could the authors clarify how label selection is handled and whether any balancing strategy is used to mitigate potential bias?
    2) Unclear handling of noisy labels and missing annotations: The paper states that missing labels are treated as negatives, which can introduce false negatives. While Hill loss reweights negatives, were any additional strategies explored, such as label smoothing, uncertainty modeling, or masking, to deal with this noise? Could the authors elaborate on how robust Hill loss is to such noise and whether alternatives were considered?
    3) Sensitivity to prompt wording: CLIP can be sensitive to prompt phrasing, yet the paper does not detail whether different prompt formulations were tested. Have the authors conducted any ablation or sensitivity analysis on prompt design to ensure robustness? How were the final prompt styles chosen?
    4) CLIP pretraining mismatch with the surgical domain: CLIP was trained on general data found on the internet, with little to no surgical or laparoscopic content. Other models, such as BioCLIP, MedCLIP, or PMC-CLIP, have domain-specific pretraining on biomedical texts and may be more suitable. Have the authors considered using or comparing against such medically adapted models? If not, could they justify the use of CLIP and discuss potential limitations in transferring to the surgical domain?
    5) Lack of discussion on class imbalance: The datasets contain a highly imbalanced label space, especially with the 100 triplet classes. How does the model handle rare triplets or underrepresented labels? Are there any mechanisms (e.g. sampling or weighting) to prevent collapse to majority classes?

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    To improve the reproducibility of the paper, the following should be clarified:
    1) Label management across datasets: The paper refers to a combined label space of 110, but it is unclear how labels are actually managed during training or testing. Are all labels treated as a unified set regardless of task? Are test-time predictions restricted to task-relevant labels? How are missing labels treated across datasets?
    2) Construction and limitations of the label graph: Since CLIP's text embeddings are not optimized for surgical semantics, the label graph used by the GCN may not reflect true task-specific dependencies. How is the graph sparsified? Is the adjacency matrix learned or fixed during training? Have the authors explored dynamically updating the graph or fine-tuning the text encoder? Could performance improve if these components were adapted to the domain?
    3) Which exact pretrained weights were used?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed work addresses a pressing challenge in the field of surgical computer vision: how to scale surgical understanding across tasks and datasets with minimal annotation effort, whereas most solutions in the field focus on a single task even though these tasks are actually co-dependent. The integration of weak supervision, natural language prompts, and a graph-based label correlation module is well-motivated and shows promising empirical results across three diverse surgical tasks. While there are some weaknesses, particularly around reproducibility (lack of clarity on certain aspects of the setup), there are no fundamental flaws in the method, and most comments are addressable through improved discussion and clarification; I believe the paper could be significantly strengthened with minor revisions.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    My recommendation is to accept the paper because the authors have addressed most of the limitations pointed out by the reviewers. In general, the paper proposes a well-motivated and carefully developed SPML learning framework for surgical AI applications, which is practically relevant to the field and could generalize to other tasks and domains. While the evaluation is limited to one procedure and class imbalance is not fully addressed, these limitations are acknowledged and do not invalidate the key contributions of the work. Certain limitations remain, such as the justification for using general CLIP over domain-specific variants (which is only tested empirically), class imbalance, and the details around the use of the GCN; I therefore strongly urge the authors to include a discussion of these limitations, mentioning any empirical results, in the final version of the paper.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a CLIP-based vision-language model, MML-SurgAdapt, for multi-task learning (the tasks being surgical phase recognition, critical view of safety assessment, and action triplet recognition) in the context of laparoscopic cholecystectomy. To train this model, the authors utilize single-positive multi-label (SPML) learning to deal with the fact that images from the combined datasets typically have labels for only one of the tasks in the multi-task setting. The authors compare their model against multiple vision-only, task-specific, and multi-task SPML models, and find that MML-SurgAdapt performs competitively with other models while requiring 23% less annotated data.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    Novel way of using a CLIP-based vision-language model (VLM) in a multi-task setting. Extensive and rigorous evaluation against competing methods. Strong, competitive results compared to said methods.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Weak/insufficient justification for the inclusion of the Graph Convolutional Network (GCN) module: it is not clear how the addition of this complex component is motivated.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Novel problem formulation with competitive classification metrics while requiring less data than competing methods; no major flaws.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank all reviewers for their valuable feedback and helpful suggestions. We address all questions below and will clarify any ambiguities in the camera-ready version. We plan to publicly release our code after acceptance.

R1 noted ambiguity in selecting single positive labels and possible bias from label imbalance. We select one positive label per image at the start of training via uniform sampling from task-specific ground truth and keep it fixed. The datasets share the procedure but differ in labels, and this sampling may cause task-level imbalance. No balancing strategy was applied; this is beyond the scope of this work.

R1 asked about strategies for handling missing labels treated as negatives and the robustness of Hill loss. We evaluated several loss functions and pseudo-labeling methods, selecting the best-performing ones (Hill, SPLC, WAN) for further experiments. Hill loss is robust because it down-weights highly confident positive predictions even when the labels are negative, helping reduce noise from false negatives.

R1 requested elaboration on the selection of the final prompts and a sensitivity analysis of prompt formulations. We experimented with various prompt styles, including variations in label descriptions and formatting, and observed minimal differences in model performance after training. Given this low sensitivity, we adopted a prompt format similar to the widely used “a photo of a [CLS]” for consistency.

R1 requested justification for using CLIP despite its general-domain pretraining and comparisons to domain-specific models such as BioCLIP or MedCLIP. Initial tests showed that domain-specific models (e.g. BioClinicalBERT) underperformed compared to CLIP, so we chose CLIP for its empirical strength. We acknowledge that domain mismatch may affect transferability, both under noisy supervision and during graph construction, and leave the exploration of medical CLIP variants to future work.

R1 asked about handling class imbalance given the highly imbalanced label space. Our current setup does not explicitly address class imbalance; preliminary experiments with class weighting yielded inconclusive results and were therefore omitted. We acknowledge this as a limitation and plan to explore it further in future work.

Regarding label management (R1), all labels are treated as a unified set during training. Each dataset provides ground truth for its own task, with the other labels treated as missing. One positive label is sampled per image from the available ground truth. At test time, images are grouped by dataset and evaluated using only task-relevant labels.
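
To illustrate the down-weighting behaviour described above, the sketch below uses the commonly cited form of the Hill-loss negative term, (λ − p)·p², from Zhang et al., "Simple and Robust Loss Design for Multi-Label Learning with Missing Labels". λ = 1.5 is an assumed, typical value, positives use plain BCE for brevity instead of the original focal-margin term, and none of this is taken from the authors' released implementation.

    import torch

    def hill_loss(logits, targets, lambda_neg=1.5):
        """logits: (B, C) raw scores; targets: (B, C) with 1 = observed positive and
        0 = assumed negative (which, in the SPML setting, includes missing labels)."""
        p = torch.sigmoid(logits)
        pos_term = -targets * torch.log(p.clamp_min(1e-8))        # plain BCE on observed positives
        neg_term = (1.0 - targets) * (lambda_neg - p) * p.pow(2)  # hill-shaped gradient: small near p=0 and p=1
        return (pos_term + neg_term).mean()

    # Toy batch: 2 frames, 4 labels, exactly one positive per frame (SPML targets).
    logits = torch.randn(2, 4)
    targets = torch.tensor([[1., 0., 0., 0.], [0., 0., 1., 0.]])
    print(hill_loss(logits, targets))

Because the gradient of (λ − p)·p² vanishes as p approaches 1, a "negative" label on which the model is already confidently positive (a likely false negative from a missing annotation) contributes little to the update.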

Concerns regarding the GCN module (R1, R2, R3): The GCN was included to model inter-label dependencies and support structured predictions across tasks under noisy labels. Its adjacency matrix, based on cosine similarity over Word2Vec embeddings, is sparsified via top-K filtering and kept fixed during training. The text encoder is fine-tuned jointly, contributing to the performance gains. While the gains are modest in ablations, the utility of the GCN is expected to grow with greater label diversity or stronger domain-specific pretraining.
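
As a concrete illustration of the graph construction just described (cosine similarity over label word embeddings, top-K sparsification, then fixed), here is a minimal sketch. The embeddings are random stand-ins for Word2Vec vectors, and K, the row normalization, and the function name are assumptions rather than details from the paper.

    import torch
    import torch.nn.functional as F

    def build_label_adjacency(label_embeddings, k=5):
        """label_embeddings: (C, D) one word-embedding vector per label."""
        emb = F.normalize(label_embeddings, dim=-1)
        sim = emb @ emb.T                                      # (C, C) cosine similarities
        topk_idx = sim.topk(k=k, dim=-1).indices               # keep the K most similar labels per row
        adj = torch.zeros_like(sim).scatter_(-1, topk_idx, 1.0)
        return adj / adj.sum(dim=-1, keepdim=True)             # row-normalized, kept fixed during training

    word2vec_like = torch.randn(110, 300)       # stand-in for 300-d Word2Vec vectors of the 110 labels
    A = build_label_adjacency(word2vec_like)    # this fixed matrix would then be fed to the GCN head
    print(A.shape)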

R3 questioned the realism of SPML in surgery. SPML allows the integration of multiple tasks for one procedure at lower annotation cost, letting clinicians label only within their expertise and making data collection economical and practical. Despite a single positive label per image, the model learns effectively.

R3 noted limited generalization, as validation was on one procedure. Validation was performed on laparoscopic cholecystectomy, and strong results here mark a key milestone. “Generalizable” here refers to task-agnostic learning within a specific procedure. We plan to validate our method on additional procedures in the future.

R3 also raised fairness concerns about the baseline comparisons. To clarify, all baselines were trained on the reduced split for a fair comparison. SOTA results on the full dataset are cited only as a reference to contextualize our performance.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A


