Abstract

Natural language could play an important role in developing generalist surgical models by providing a broad source of supervision from raw texts. This flexible form of supervision can enable the model’s transferability across datasets and tasks as natural language can be used to reference learned visual concepts or describe new ones. In this work, we present HecVL, a novel hierarchical video-language pretraining approach for building a generalist surgical model. Specifically, we construct a hierarchical video-text paired dataset by pairing surgical lecture videos with three hierarchical levels of texts: at clip-level, atomic actions using transcribed audio texts; at phase-level, conceptual text summaries; and at video-level, overall abstract text of the surgical procedure. Then, we propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies using a single model. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model. Thanks to the injected textual semantics, we demonstrate that the HecVL approach can enable zero-shot surgical phase recognition without any human annotation. Furthermore, we show that the same HecVL model for surgical phase recognition can be transferred across different surgical procedures and medical centers. The source code will be made available at https://github.com/CAMMA-public/HecVL.
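
To make the fine-to-coarse objective concrete, below is a minimal sketch of the contrastive setup described in the abstract, written in PyTorch. It assumes a shared video backbone and a shared text backbone with one projection head per hierarchy level (so each level gets its own embedding space) and a symmetric InfoNCE loss per level; all class and function names are illustrative and not taken from the released code.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    LEVELS = ["clip", "phase", "video"]

    class HierarchicalDualEncoder(nn.Module):
        """Shared backbones with level-specific projection heads (illustrative sketch)."""
        def __init__(self, video_backbone, text_backbone, feat_dim=768, embed_dim=256):
            super().__init__()
            self.video_backbone = video_backbone   # clip frames -> (B, feat_dim)
            self.text_backbone = text_backbone     # token ids   -> (B, feat_dim)
            # Separate heads per level keep the three embedding spaces disentangled.
            self.video_proj = nn.ModuleDict({l: nn.Linear(feat_dim, embed_dim) for l in LEVELS})
            self.text_proj = nn.ModuleDict({l: nn.Linear(feat_dim, embed_dim) for l in LEVELS})

        def encode(self, video, text_tokens, level):
            v = F.normalize(self.video_proj[level](self.video_backbone(video)), dim=-1)
            t = F.normalize(self.text_proj[level](self.text_backbone(text_tokens)), dim=-1)
            return v, t

    def info_nce(v, t, temperature=0.07):
        """Symmetric contrastive loss over a batch of paired video/text embeddings."""
        logits = v @ t.t() / temperature
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

A training loop following the paper's alternating strategy would then draw batches from one level per step (clip, then phase, then video) and apply info_nce to the embeddings produced by that level's projection heads.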

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1025_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1025_supp.pdf

Link to the Code Repository

https://github.com/CAMMA-public/HecVL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Yua_HecVL_MICCAI2024,
        author = { Yuan, Kun and Srivastav, Vinkle and Navab, Nassir and Padoy, Nicolas},
        title = { { HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The manuscript studies the impact of using natural language video-text pairs on the task of surgical phase recognition. The work expands on the work presented in SurgVLP. In SurgVLP, a new dataset combining video and text was presented, and contrastive learning on the video-text data was suggested. In this study, the authors break the text down into three hierarchical levels and then suggest a novel fine-to-coarse contrastive learning framework. The impact of this new framework is tested on the task of surgical phase recognition using 3 datasets.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The two main novel aspects of this study are: 1) the breakdown of the text into three hierarchical levels; 2) the fine-to-coarse contrastive learning framework.

    SOTA results are presented on this specific task.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    Surgical data is known to have a hierarchical structure and, in that sense, the suggested method “makes sense” and sounds promising. However, in this study the proof was limited to one specific task, which holds only one hierarchical level.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The algorithm is clear; it is not fully clear how the data was parsed.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The paper discusses clip-level, phase-level, and video-level texts. From what I could understand, the clip-level texts were taken from SurgVLP. The video-level texts were part of the metadata; I assume these were also present in the SurgVLP dataset (yet not used by them). It is not clear to me what the source of the phase-level texts is. Overall, from what I could see, the study heavily builds on SurgVLP (the contrastive learning was also suggested by them). This should be made clear in the introduction. I think the authors should add an “our contribution” paragraph outlining their main contributions. In addition, if they have any thoughts and directions on the impact of their hierarchical approach on other hierarchical tasks in surgery, they might consider adding them to the discussion. In the future, additional tasks should be examined to assess the impact of this method.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The study presents a novel approach to video-text analysis of surgical data. In addition, they reach SOTA results. Nevertheless, many concepts have been presented in SurgVLP and only one task was evaluated.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    1. Hierarchical Dataset Creation: HecVL pairs surgical videos with texts at three levels—clip, phase, and video—to enhance the model’s understanding across different granularities.

    2. Fine-to-Coarse Contrastive Learning Framework: It introduces a unique learning framework that separates embedding spaces for each text-video hierarchy, enabling the model to encode diverse surgical concepts effectively.
    3. Zero-Shot Phase Recognition: HecVL enables the model to recognize surgical phases without prior specific training, demonstrating its transferability across different surgeries and settings.
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper is well-written and easy to follow.
    2. This paper focuses on an interesting insight: learning fine-grained video-language representations for surgical videos.
    3. The proposed dataset is somewhat beneficial to the field of surgical videos.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The novelty is limited. The method of this paper seems similar to “Ashutosh K, Girdhar R, Torresani L, et al. HierVL: Learning hierarchical video-language embeddings[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 23066-23078.”
    2. Ablation studies for the Action+Abstract and Phase+Abstract levels are missing from Table 3.
    3. No visualization, e.g., t-SNE, of the learned embeddings for the different levels of visual-text pairs.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. Please clarify the differences from HierVL: Learning hierarchical video-language embeddings.
    2. A visualization, e.g., t-SNE, of the learned embeddings is also required to verify the effect of the proposed method compared with single-level ones.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The main concern is the method of the paper, which is similar to HierVL: Learning hierarchical video-language embeddings. Please clarify their differences. Furthermore, a visualization, e.g., t-SNE, of the learned embeddings is also required to verify the effect of the proposed method compared with single-level ones.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The authors suggested a novel approach to zero-shot surgical phase recognition by training an encoder on three different hierarchical layers and disentangling the embedding spaces associated with each layer. This method effectively establishes zero-shot surgical phase recognition and it could be transferred across other similar surgical procedures.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1. The authors suggested a fine-to-coarse contrastive learning framework that learns features at three levels. The framework enables zero-shot surgical phase recognition and can be applied to similar surgical procedures.
    2. The authors expanded the SVL dataset by adding additional video-text pairs, enhancing the robustness and utility of the dataset.
    3. The authors provide detailed descriptions of the training process and conduct comprehensive ablation studies to validate the effectiveness of the proposed methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. For the “alternating training strategy” mentioned in Section 2.4 on the training pipeline, its impact is not evident from the ablation studies. Why not use a simultaneous training approach, which might better balance the overall optimization?
    2. The role of the Agg function is unclear. Why was average pooling chosen?
    3. Based on the description of the dataset in Section 3.1, it appears that the SVL dataset already includes all necessary videos and the corresponding text for each frame. How were the phase-level texts and video-level abstracts obtained? Were they part of the original dataset, or derived from the per-frame texts through some summary extraction method?
    4. The results from the ablation study suggest that HecVL and the single-level approaches do not differ greatly in effectiveness. To better demonstrate the benefits of multiple embedding spaces, it would be useful to provide additional results for actions and phases under a single embedding space, to differentiate whether the improvements are due to the addition of abstract information or to the separate embedding spaces.
    5. The authors should provide a more detailed explanation of how the results in Tables 1 and 2 are computed. It is assumed that the zero-shot learning approach takes a new surgical video as input and automatically classifies each frame into stages such as Preparation, Calot Triangle Dissection, etc. Does this enable a detailed understanding of the surgical procedures in the video, such as specific operations? If not, this could be an avenue to extend the research, given the extensive use of textual information.
    6. It seems that SurgVLP also has capabilities for triplet recognition and text-based video retrieval. Can the model perform these tasks as well?
    7. Relevant experiments should be included in the ablation study to strengthen the justification of the fine-to-coarse contrastive learning strategy and the superiority of using the three stages of training directly.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors should provide the detailed code and the dataset used in order to facilitate better reproducibility. In addition, the authors should describe the training time.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The method may be used for different tasks, as described in the introduction. It would be beneficial to include examples of these tasks, such as triplet recognition and text-based video retrieval as highlighted in SurgVLP. Demonstrating these tasks could significantly enhance the perceived efficacy of the learning method.
    2. It would strengthen the draft if an additional experiment in the ablation study could be conducted without using the alternating training strategy. This would provide more convincing evidence of the strategy’s importance.
    3. Including results for action+phase under a single model in the ablation study would better illustrate the advantages of learning across the three layers.
    4. Adding detailed information about how video-text pairs for phases and videos were obtained for the datasets would more convincingly demonstrate the authors’ contributions in the field.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, the draft has some impact on the field. It demonstrates superior performance in phase recognition compared to previous efforts such as SurgVLP and makes some contribution to the dataset. There is a lack of detail on the potential applications of the model, which does not align with the broad applicability across multiple tasks that the authors suggest. There is some novelty; however, further experiments to showcase the innovative aspects of the approach would strengthen the paper significantly.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank the reviewers for the early acceptance of our work, for their feedback, and for their insightful comments to further improve it. In the following, we address the reviewers’ main comments.

Phase-level texts and video-level abstract texts – R#1&R#4

In this work, we use the same pretraining videos as the SVL dataset. In addition to the narration texts from SVL, we curate the coarser-level phase and abstract texts using the metadata of the videos obtained from the e-learning platform. It should be noted that not all videos include narration, phase, and abstract texts at the same time. Out of the 1326 videos in the SVL dataset, we obtain 1007 videos with narration and phase texts, and 920 videos with narration, phase, and abstract texts. Our contribution is to curate these hierarchical texts and effectively use them for surgical video-language pretraining.

Hierarchical downstream tasks – R#1

Thank you for pointing this out; hierarchical video-language pretraining can improve downstream tasks at different hierarchical levels. For example, HecVL can significantly improve text-based video retrieval when the textual query describes a longer video, e.g., retrieving the whole video based on its abstract text. HecVL also shows slightly better results on fine-grained tasks, e.g., action triplet recognition, because the clip-level pretraining is the same as in SurgVLP. We report surgical phase recognition because it is a phase-level task that benefits the most from the hierarchical pretraining.

t-SNE – R#3

We conducted the t-SNE visualization but cannot include it due to the page limit. We observe that the modality gap is narrowed during the hierarchical pretraining, and that the gap is smaller than in prior works. The modality gap, a geometric phenomenon of multi-modal models, hampers transferability to cross-modal tasks, e.g., image captioning. This visualization shows that HecVL can enable better performance on vision-and-language downstream tasks.
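
As a side note on how such a modality-gap measurement could be reproduced, the sketch below uses one common definition from the multi-modal representation literature: the Euclidean distance between the centroids of the L2-normalized video and text embeddings. The function name and interface are illustrative assumptions, not part of the paper.

    import torch
    import torch.nn.functional as F

    def modality_gap(video_embeds: torch.Tensor, text_embeds: torch.Tensor) -> float:
        """Distance between the centroids of normalized video and text embeddings."""
        v = F.normalize(video_embeds, dim=-1).mean(dim=0)   # (D,) video centroid
        t = F.normalize(text_embeds, dim=-1).mean(dim=0)    # (D,) text centroid
        return torch.norm(v - t).item()

A smaller value indicates that the video and text embeddings occupy closer regions of the shared space, which corresponds to the narrowing described above.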

Scene understanding from pretraining – R#4

Tables 1 and 2 show the zero-shot surgical phase recognition results without any fine-tuning on the downstream datasets. We achieve this by decomposing the phase labels into basic concepts and constructing textual prompts, as shown in the supplementary material. This shows that the model understands the specific surgical operations rather than only high-level abstract phase definitions. It should be noted that while the zero-shot performance of HecVL shows a degree of surgical scene understanding, there is still a large performance gap compared to fully supervised methods. Therefore, two avenues can be considered in the future: improving the pretraining process by incorporating external surgical knowledge, and developing a data-efficient approach to adapt the pretrained model to specific surgical scene understanding tasks.
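
To illustrate the zero-shot protocol described above, here is a minimal sketch: each phase label is turned into a textual prompt, prompts and video clips are embedded with the pretrained encoders at the phase level, and each clip is assigned the phase whose prompt is most similar. The prompt wordings, the encode_text/encode_video interfaces, and the tokenizer call are illustrative assumptions, not the paper's exact prompts or API.

    import torch
    import torch.nn.functional as F

    # Illustrative prompts; the paper's actual prompt decomposition is given in its supplementary.
    PHASE_PROMPTS = {
        "Preparation": "the surgeon prepares the operating field and inserts the trocars",
        "Calot Triangle Dissection": "the surgeon dissects the calot triangle",
        # ... one prompt per remaining phase
    }

    @torch.no_grad()
    def zero_shot_phase(model, tokenizer, clip, level="phase"):
        labels = list(PHASE_PROMPTS.keys())
        text_emb = F.normalize(model.encode_text(tokenizer(list(PHASE_PROMPTS.values())), level), dim=-1)
        clip_emb = F.normalize(model.encode_video(clip, level), dim=-1)  # (1, D)
        scores = clip_emb @ text_emb.t()          # cosine similarity to each phase prompt
        return labels[scores.argmax(dim=-1).item()]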




Meta-Review

Meta-review not available, early accepted paper.


