Abstract
The Critical View of Safety (CVS) is crucial for safe laparoscopic cholecystectomy, yet assessing the CVS criteria remains a complex and challenging task, even for experts. Traditional models for CVS recognition depend on vision-only models trained with costly, labor-intensive spatial annotations. This study investigates how text can be harnessed as a powerful tool for both training and inference in multi-modal surgical foundation models to automate CVS recognition. Unlike many existing multi-modal models, which are primarily adapted for multi-class classification, CVS recognition requires a multi-label framework. Zero-shot evaluation of existing multi-modal surgical models shows a significant performance gap for this task. To address this, we propose CVS-AdaptNet, a multi-label adaptation strategy that enhances fine-grained binary classification across multiple labels by aligning image embeddings with textual descriptions of each CVS criterion using positive and negative prompts. By adapting PeskaVLP, a state-of-the-art surgical foundation model, on the Endoscapes-CVS201 dataset, CVS-AdaptNet achieves 57.6 mAP, improving over the ResNet50 image-only baseline (51.5 mAP) by 6 points. Our results show that CVS-AdaptNet's multi-label, multi-modal framework, enhanced by textual prompts, boosts CVS recognition over image-only methods. We also propose text-specific inference methods that help analyse the image-text alignment. While further work is needed to match state-of-the-art spatial-annotation-based methods, this approach highlights the potential of adapting generalist models to specialized surgical tasks. Code: https://github.com/CAMMA-public/CVS-AdaptNet
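For intuition, the following is a minimal sketch of this prompt-matching inference, assuming a CLIP-style image/text encoder pair; the encoder interfaces and prompt handling are illustrative, not the released implementation:

    import torch
    import torch.nn.functional as F

    def cvs_scores(image, criteria_prompts, image_encoder, text_encoder):
        # Encode the frame once and L2-normalize, as in CLIP-style models.
        img = F.normalize(image_encoder(image), dim=-1)                        # (1, d)
        scores = {}
        for criterion, (pos_prompt, neg_prompt) in criteria_prompts.items():
            # One positive and one negative description of the criterion.
            txt = F.normalize(text_encoder([pos_prompt, neg_prompt]), dim=-1)  # (2, d)
            sims = img @ txt.T                                                 # cosine similarities
            # A softmax over {positive, negative} yields an independent
            # per-criterion probability, i.e. multi-label rather than multi-class.
            scores[criterion] = sims.softmax(dim=-1)[0, 0].item()
        return scores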
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/4116_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/CAMMA-public/CVS-AdaptNet
Link to the Dataset(s)
The Endoscapes Dataset for Surgical Scene Segmentation, Object Detection, and Critical View of Safety Assessment: https://github.com/CAMMA-public/Endoscapes
BibTex
@InProceedings{BabBri_Multimodal_MICCAI2025,
author = { Baby, Britty and Srivastav, Vinkle and Jain, Pooja P. and Yuan, Kun and Mascagni, Pietro and Padoy, Nicolas},
title = {{Multi-modal Representations for Fine-grained Multi-label Critical View of Safety Recognition}},
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15970},
month = {September},
pages = {427--436}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes a novel method that adapts multi-modal foundation models for fine-grained surgical tasks, in this case, critical view of safety (CVS) recognition. Unlike traditional methods that rely on pixel-wise spatial annotations and graph-based models, the proposed approach uses positive and negative natural language prompts to guide image-text alignment, which removes the need for costly segmentation. The results show that the method outperforms image-only baselines and approaches the results of segmentation-dependent methods, offering a more scalable and annotation-free alternative for surgical AI tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Strengths: 1) Clear motivation and gap: The paper identifies a key bottleneck in current CVS recognition methods (dependence on segmentation) and addresses it with a well-reasoned and motivated alternative using text-based prompts and contrastive learning. 2) Novel framing: Reformulating the CVS recognition task as a multi-label problem using natural language prompts is a clever adaptation, considering the complexity of the task. 3) Prompt engineering with LLMs: The use of both positive and negative prompts, enhanced by LLM-generated paraphrasing, enriches the textual input space; combined with the ablation studies on prompt types and diversity, it yields many meaningful insights. 4) Thorough evaluation: The authors include zero-shot and fine-tuned performance comparisons, ablations on prompt design, multiple inference strategies, and comparisons across various vision-text encoders, which allows for a comprehensive assessment. 5) Diverse inference strategies: The inclusion of three well-motivated types of inference (standard, contrastive, and multi-class) showcases the method's flexibility and adaptability. 6) Strong baseline comparisons: Results demonstrate that the approach outperforms image-only baselines and approaches the performance of segmentation-based methods without requiring manual spatial annotation, which holds a lot of promise for future work in this direction.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Weaknesses: 1) Clarification of novelty: While this work adapts existing components, it is not entirely clear what is novel vs. repurposed. The combination of all modules is moderately novel in the surgical domain, particularly on the CVS task, but the paper would benefit from a clearer differentiation. 2) Loss function: The authors use KL divergence rather than binary cross-entropy, but this choice is not fully justified. The claim that KL "encourages sharper separation" or "better alignment of multi-modal features" is plausible but lacks empirical support (e.g., an ablation or citations). It is also unclear why KL was chosen over more standard contrastive losses such as InfoNCE or soft cross-entropy. 3) Prompt sampling: It is not specified whether prompts are sampled randomly or deterministically during training. In particular, in Eq. 2, are the same prompts reused across batches, or are random ones sampled each time? This could affect regularization and model robustness, so the authors should discuss it. 4) Quality of negative prompts: The authors do not mention whether the LLM-generated negative prompts were manually reviewed or filtered. This raises concerns about potential false negatives (for instance, stating that something is not present when it is ambiguously visible in harder examples), which could hurt training. 5) Single-dataset evaluation: All experiments are conducted on a single dataset. Have the authors planned, or at least run, a small pilot experiment on cross-dataset generalization, or at least performed split-patient testing? The authors should discuss this (at least within the scope of future work) to support the claims of generalizability and annotation-free scalability. In fact, although the method is framed as general, many design choices (e.g., the prompt structure) are tailored to the CVS task; a brief discussion of how it could transfer to other surgical multi-label tasks (or tasks in general) would help clarify generalizability. 6) Training duration: The authors should include training time or convergence speed; this would be useful for comparing the practical efficiency of the method. 7) Limited comparison to alternative prompt-based methods: While the paper compares image-only and image+text methods, it does not benchmark against other recent prompt-based or few-shot adaptation methods in vision-language modeling (for example, prompt tuning, adapter modules, …). 8) Lack of uncertainty estimation or calibration: It would be beneficial if the authors assessed the model's confidence calibration or its ability to handle uncertain/ambiguous cases.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The core idea of the paper is solid and much needed, especially in applying multi-modal approaches to reduce the annotation burden in a surgical task. The methodological contributions are well motivated and interestingly combined, even though they build on existing components. The evaluation is also quite thorough and demonstrates meaningful gains over strong baselines. However, the paper is not yet fully acceptable for the following reasons: 1) The novelty is somewhat incremental; the authors could clarify the contribution better. 2) Some details and clarifications are missing, which currently affects reproducibility. 3) The single-dataset evaluation limits claims of generalizability. If these are addressed, the paper could be accepted.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The authors have adequately addressed all of the major questions and limitations pointed out in the review. While some limitations, such as the single dataset and the lack of comparison to prompt tuning and adapters, were addressed only by acknowledging them, the proposed work is still an interesting and important approach to reformulating CVS recognition as a multi-label prompt-matching task and could strongly benefit the field. I strongly suggest that the authors address all of the limitations in the final camera-ready version of the paper, particularly the single dataset, the lack of uncertainty estimation, the handling of class imbalance, and the justification of architectural choices (such as using CLIP and its potential limitations in transferability).
Review #2
- Please describe the contribution of the paper
In this paper, the authors propose CVS-AdaptNet, a framework that uses visual and textual information for Critical View of Safety (CVS) assessment without segmentation labels (in contrast to the current state of the art). The framework aligns textual features for the various CVS assessment criteria (such as clarity of exposure of different anatomical structures) with visual features of the input image in a contrastive manner. During inference, an image is compared for similarity with different CVS criteria (in positive and negative formulations) to assess whether a given criterion is satisfied. The paper demonstrates that this framework outperforms a classification-only baseline but still lags behind methods that rely on anatomical segmentation labels.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
Novel formulation and approach: it is a very interesting view of the CVS assessment problem, utilizing textual cues together with pre-existing language-vision models to establish whether a given CVS criterion is satisfied. Extensive and rigorous evaluations: comparison with multiple image-only models, evaluation of multiple inference strategies and multiple language-vision models.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Weak results: the authors admit that the image-only ResNet50-MoCov2 model performs comparably to CVS-AdaptNet (text+vision). It is also not described how "heavy" the utilized foundation models are; that information would help put the performance metrics into context.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(5) Accept — should be accepted, independent of rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
While this paper has not demonstrated significant improvements over the image-only baseline, the novel problem formulation is very interesting for subsequent research.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We thank the reviewers for their constructive comments and for recognising the contribution of CVS-AdaptNet as a novel method adapting multi-modal foundation models to fine-grained surgical tasks (R1, R2), outperforming image-only baselines (R1, R2, R3), and enabling a scalable alternative to annotation-heavy segmentation methods (R1).
Novel vs. repurposed / Incremental novelty (R1, R3): While CLIP adaptation and prompt tuning with positive/negative prompts are known, our contribution is in formulating fine-grained CVS recognition as a multi-label prompt-based task to address clinician subjectivity. CVS involves nuanced descriptions rather than single-word class names (e.g., "cat") and is not well suited to standard zero-shot models. The novelty lies in the careful design elements: diverse LLM-generated positive and negative prompts per criterion, a KL divergence loss to model label ambiguity, and a multi-inference strategy for robustness. We provide ablations offering insights into how prompt choices handle subjectivity.
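For illustration, the prompt pools have the following shape, with several LLM-generated paraphrases per criterion (the strings below are illustrative paraphrases of the three CVS criteria, not the released templates):

    CRITERIA_PROMPTS = {
        "two_structures": {
            "positive": [
                "the cystic duct and the cystic artery are clearly seen entering the gallbladder",
                "exactly two tubular structures are visible connecting to the gallbladder",
            ],
            "negative": [
                "the cystic duct and the cystic artery cannot be clearly identified",
                "the structures entering the gallbladder are not distinguishable",
            ],
        },
        "hepatocystic_triangle": {
            "positive": ["the hepatocystic triangle is cleared of fat and fibrous tissue"],
            "negative": ["the hepatocystic triangle is still covered by fat and fibrous tissue"],
        },
        "cystic_plate": {
            "positive": ["the lower part of the gallbladder is dissected off the cystic plate"],
            "negative": ["the gallbladder has not been separated from the cystic plate"],
        },
    }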
KL divergence instead of BCE / InfoNCE (R1, R3): CVS labels are subjective (κ = 0.38) and form a soft distribution. Similarly, an image may match multiple prompt descriptions, unlike the hard 0/1 labels of standard classification. BCE assumes binary independence, and InfoNCE's 1:N contrast is unsuitable when multiple prompts may describe a single image and multiple images in a batch can match the same criterion. We experimented with a BCE loss and found KL divergence to consistently outperform it, likely because it enables soft target matching between the image and prompt distributions.
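A minimal sketch of this soft-target objective (the temperature and soft-target values below are illustrative assumptions, not the exact settings used):

    import torch
    import torch.nn.functional as F

    def kl_prompt_loss(img_emb, pos_txt, neg_txt, labels, tau=0.07, soft=0.9):
        # img_emb: (B, d) normalized image embeddings for one criterion;
        # pos_txt / neg_txt: (d,) normalized embeddings of the sampled prompts;
        # labels: (B,) binary criterion labels.
        logits = torch.stack([img_emb @ pos_txt, img_emb @ neg_txt], dim=-1) / tau  # (B, 2)
        # Soft targets instead of hard 0/1 (e.g. [0.9, 0.1] when the criterion
        # is met), reflecting the subjectivity of CVS labels.
        on = torch.tensor([soft, 1.0 - soft], device=img_emb.device)
        off = torch.tensor([1.0 - soft, soft], device=img_emb.device)
        target = torch.where(labels.unsqueeze(-1).bool(), on, off)                  # (B, 2)
        # KL divergence between the predicted and soft target distributions.
        return F.kl_div(logits.log_softmax(dim=-1), target, reduction="batchmean")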
Single dataset limiting generalizability (R1, R3): We acknowledge the limitation of using a single dataset, as few CVS datasets exist. However, our approach is designed to generalize to subjective, multi-label surgical recognition tasks, where dense annotations are expensive. We will explicitly discuss this in the conclusion.
Prompt sampling, Quality of negative prompts (R1), Reproducibility (R1, R3): During training, prompts are randomly sampled per batch from the LLM-generated set to enhance robustness and reduce overfitting. Section 2 “Training” clarifies this; we will make it explicit. All prompt templates (training/inference) will be released publicly. A clinician manually reviewed prompts to ensure clinical accuracy, especially for negatives.
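A minimal sketch of this per-batch sampling, assuming prompt pools structured as in the CRITERIA_PROMPTS example above (the sampling granularity, one prompt per criterion per step, is our assumption):

    import random

    def sample_prompts(criteria_prompts, rng=random):
        # Draw a fresh positive and negative description for every criterion at
        # each training step, so the text side varies across batches.
        return {criterion: (rng.choice(pool["positive"]), rng.choice(pool["negative"]))
                for criterion, pool in criteria_prompts.items()}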
Comparison to prompt tuning or adapters (R1, R3): Our goal was to assess how surgical foundation models can be adapted with minimal architectural changes. DualCoOP and MMA underperformed the image-only baselines; due to architectural mismatches, these results were not detailed.
Implementation details / Confidence calibration (R1): Training took ~4 hours (20 epochs on a single RTX A5500 GPU); we will include this. Due to space constraints, not all metrics were included, but a qualitative analysis of ambiguous cases was performed.
Comparison to ResNet50-MoCov2 (R2): ResNet50-MoCov2 benefits from task-specific pre-training on cholecystectomy data, while our model uses a vision encoder pretrained on surgical data from the web. The comparable performance highlights the strength of adaptation via prompt-based supervision.
Model Size (R2): Details on model size and components are included in the implementation section.
On temporal reasoning (R3): While surgery is inherently temporal, the CVS criteria can be visually assessed from individual frames. Temporal cues may enhance performance and are a potential extension, but our task is defined frame-wise.
Comparison to LG-CVS (R3): LG-CVS benefits from dense spatial annotations, graphs, and a multi-stage design, which are costly. Our method improves over vision-only baselines without spatial annotations, highlighting its scalability and practical value.
Frozen CLIP + linear classifier ablation (R3): This setup underperformed the trainable variants; the results were excluded from the ablations due to space.
We confirm that the code and prompt templates will be released for full reproducibility.
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The clinical application of the paper is important, and the authors address most of the issues raised by the reviewers in the rebuttal. Further suggestions for the paper: 1) report statistical values (mean, std, and p-values) for Tables 1 and 2; 2) Figure 1 needs a more detailed caption. The authors ought to address these in the camera-ready version.