Abstract

Surgical triplet recognition aims to identify instruments, verbs, and targets in a single video frame while establishing associations among these components. Because the task has a severely imbalanced class distribution, precisely identifying tail classes becomes a critical challenge. To cope with this issue, existing methods leverage knowledge distillation to facilitate tail triplet recognition. However, these methods overlook the low inter-triplet feature variance, which diminishes the model’s confidence in identifying classes. As a technique for learning discriminative features across instances, contrastive learning (CL) shows great potential for identifying triplets. Under this imbalanced class distribution, however, directly applying CL presents two problems: 1) multiple activities in one image expose instance feature learning to interference from other classes, and 2) the limited training samples of tail classes may lead to inadequate semantic capture. In this paper, we propose a tail-enhanced representation learning (TERL) method to address these problems. TERL employs a disentangle module to acquire instance-level features in a single image. From these disentangled instances, those belonging to tail classes are selected for CL, which captures discriminative features with the help of a global memory bank. During CL, we further apply semantic enhancement to each tail class. This generates component class prototypes based on the global bank, thus providing additional component information to tail classes. We evaluate TERL on the 5-fold cross-validation split of the CholecT45 dataset. The experimental results consistently demonstrate the superiority of TERL over state-of-the-art methods.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0026_paper.pdf

SharedIt Link: https://rdcu.be/dV59i

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72120-5_64

Supplementary Material: N/A

Link to the Code Repository

https://github.com/CIAM-Group/ComputerVision_Codes/tree/main/TERL

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Gui_TailEnhanced_MICCAI2024,
        author = { Gui, Shuangchun and Wang, Zhenkun},
        title = { { Tail-Enhanced Representation Learning for Surgical Triplet Recognition } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        pages = {689--699}
}

Reviews

Review #1

Please describe the contribution of the paper
1. The authors proposed a transformer-based method to perform triplet recognition.
2. The method exploits the low inter-triplet class variance, applies contrastive learning between triplet classes, and adds semantic enhancement with a prototype-based loss.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper is relatively easy to follow.
2. The idea of focusing on improving tail class recognition is a straightforward attempt to enhance the overall triplet recognition performance, as most models perform poorly on tail classes.
3. The use of contrastive learning between tail and other triplet classes is an interesting approach that can benefit recognition of tail triplet classes and possibly other triplet classes.
4. The method shows improved performance over existing methods in the literature.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. How are these 3 tail triplet classes selected? As the dataset is public, there must be other triplet classes that have fewer instances than the selected tail classes.
2. In section 2.3, the authors mention the use of CL between positive pairs (embeddings from the same tail triplet class) and negatives from different classes. It is not clear whether these negative classes are from the tail triplet pool or other remaining triplets minus the tail triplets.
3. In section 2.4, the authors proposed PBSE to maximize similarity between tail embeddings (that are triplets) and corresponding component class prototypes (that are I, V or T). The rationale behind this is not clear as the embeddings are only from 3 tail classes.
Please rate the clarity and organization of this paper

Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
1. Referring to point 1 in the main weaknesses, the authors should provide reasoning for the choice of tail classes, which is vital to the whole premise of the paper. A few additional inputs would help justify the selection of tail classes:
a. A comparison of instance counts across all triplets can highlight the actual collection of tail triplet classes.
b. Can a random selection of tail triplet classes still benefit the triplet recognition performance?
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Reject — could be rejected, dependent on rebuttal (3)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The paper attempts to improve triplet recognition by focusing on tail classes which is novel.

The results seem to show the importance of PBSE and ILCL modules.

The authors should clarify more on the points discussed, especially the choice of triplet classes as tail. Therefore, I recommend weak reject.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Weak Accept — could be accepted, dependent on rebuttal (4)
[Post rebuttal] Please justify your decision

The authors provided an apt rebuttal for the questions raised. The clarification on the head and tail class selection has simplified the proposed framework, and it is logical to incorporate contrastive learning on triplet classes to capture semantics beneficial for triplet recognition. However, calling all other triplets with fewer than 10,000 instances “tail” is not entirely reasonable, as some triplets may have, for example, ~4,000 instances, which might be enough in some contexts.

Review #2

Please describe the contribution of the paper

The proposed method aims to improve the accuracy of recognizing tail instances in Surgical Triplet Recognition, where the quantities for each task (instrument, verb, target, IVT) are limited, making recognition challenging. To address this, the method utilizes instance-level contrastive learning and prototype-based semantic enhancement to enhance representation, thereby improving performance.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

The paper effectively improves performance in triplet recognition by addressing the challenging tail tasks using contrastive learning and instance-level representation enhancement methods. Defining prototypes based on a memory bank for the tail class in the model and comparing them with the output is an innovative process that reinforces representation for the tail class. The method also expands beyond simply learning about triplets by dividing them into closely related sub-groups and extending into a multi-task concept, achieving mutual complementarity. Additionally, the proposed methods are thoroughly evaluated through ablation studies. Overall, the paper is well-structured, logically developed, and the written sentences are easy to read.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Explanations appear to be missing regarding the use of a shared backbone network in Figure 2 - (a), as well as details about the inputs x and x’. In the ablation study, alpha and beta appear to be varied in parallel; I’m curious whether experiments were conducted in which the two took different values. Additionally, the rationale behind selecting the tail classes 17, 19, and 60 is not sufficiently explained.

In the paper, a method for enhancing the performance of tail classes was proposed; however, there is no performance comparison specifically for the tail class. Since IVT involves motion, temporal information is also an important factor. Are there reasons why only 2D methods were used without utilizing this information?
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

If more detailed information about the multi-head classifier, including layer configurations, is provided, it should be relatively straightforward to reproduce.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

It seems like the experiments in Figure 4 with varying alpha and beta values were conducted only with alpha and beta set to the same values. I wonder if there were experiments conducted with different values for alpha and beta.

Regarding the minor typos:

In section 3.1, there appears to be an error in the sentence discussing the data split of 40-5-5. In Table 1, the phrase “in the he ~” should likely be corrected to “in the ~”.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

Overall, the content is well-written, detailing methods used to overcome tail classes and achieve performance superior to SOTA approaches.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Weak Accept — could be accepted, dependent on rebuttal (4)
[Post rebuttal] Please justify your decision

The authors provided appropriate rebuttals to the questions. However, tail class classification according to the number of instances seems to lack evidence.

Review #3

Please describe the contribution of the paper

The authors propose a solution for improved learning of underrepresented classes during surgical action triplet recognition. The method consists of a “disentangle module” to learn instance level features and contrastive learning on these features for a better representation of rare classes.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

Validation: The paper is clearly structured and provides detailed validation of their method, such as sensitivity analysis and ablations.

Methodology: The components of the method seem well tailored to the problem at hand and well reasoned.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

Writing: There are some spelling and grammar mistakes.

Validation: An additional dataset would support the argument that the proposed method is not only strong on CholecT45, but generally on tail distributed datasets of surgical action triplets. The availability of such datasets is, however, strongly limited.

Dataset: The authors do not describe which classes get selected as tail classes beyond naming their index. Why these three classes were selected, and how this selection (why not 10 classes, etc.) influences the performance, should definitely be elaborated on. Are these tail classes still present in all splits of all folds? An additional ablation on the number of classes used as tail classes would be very insightful.

Methodology: The demonstrated performance increase over the previous SOTA seems barely significant: on AP_IVT in Table 1, there is only an improvement for the ensembled version, and it is just 0.5 at a standard deviation of 2.4, compared to MT4MTL-KD.

Reproducibility: It would be beneficial for reproducibility if the source code was provided.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
Do you have any additional comments regarding the paper’s reproducibility?

see above, please provide the source code if possible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

See above, especially the section on weaknesses. Please clarify the selection of the tail classes and, if possible, provide additional information on how this selection influences the performance of the overall method. An evaluation of the performance on tail classes vs. other classes for this method vs. SOTA and ablated variants would also be very insightful.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The method seems well constructed and reasoned and brings a slight performance improvement. The validation could be improved to provide more specific analysis on the tail classes.
Reviewer confidence

Confident but not absolutely certain (3)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

Weak Accept — could be accepted, dependent on rebuttal (4)
[Post rebuttal] Please justify your decision

The authors have addressed most comments in their rebuttal adequately. Some aspects remain unclear unfortunately, especially the influence of which triplet classes get sorted into head and tail. The authors here should have tested their criteria. Still, my recommendation remains weak accept.

Author Feedback

We thank the reviewers for their valuable feedback. Overall, the reviewers consider this paper to be innovative (R1), interesting (R3), and well-structured (R1&R4), having effectively improved performance (R1&R3) and a thorough evaluation (R1&R4). We have carefully considered the raised concerns and will clarify them as follows:

[Q1-R1&R3&R4] Tail class selection. We apologize for the typo in referring to triplet class indexes 17, 19, and 60 as tail classes. We intended to express that 17, 19, and 60 are head classes, whereas the remaining classes are considered as the tail. The head class is selected if the category has more than 10,000 samples.
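
The head/tail rule stated in this answer can be sketched as a simple count threshold. This is a minimal illustration, not the authors' code: the function name `split_head_tail` and the toy label counts are hypothetical, and only the 10,000-sample threshold comes from the rebuttal.

```python
from collections import Counter

# From the rebuttal: a class is "head" if it has more than 10,000 samples.
HEAD_THRESHOLD = 10_000


def split_head_tail(labels, num_classes, threshold=HEAD_THRESHOLD):
    """Return (head, tail) lists of class indices from per-instance labels."""
    counts = Counter(labels)  # classes that never occur count as 0
    head = [c for c in range(num_classes) if counts[c] > threshold]
    tail = [c for c in range(num_classes) if counts[c] <= threshold]
    return head, tail


# Toy labels with made-up counts (NOT the real CholecT45 statistics).
toy_labels = [0] * 12_000 + [1] * 4_000 + [2] * 30
head, tail = split_head_tail(toy_labels, num_classes=4)
print("head:", head, "tail:", tail)
```

With a hard threshold like this, a class with roughly 4,000 instances and a class with 30 instances both land in the tail, which is the point Reviewer #1 raises after the rebuttal.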

[Q2-R1] The input details of ILCL. As shown in Fig. 1(a), x and x’ are the input frame and its augmented version, respectively. In ILCL, they are passed through two networks, each of which consists of a feature encoder and the proposed MHC.

[Q3-R1] Performance on tail classes. In Fig. 1, we visualized the tail class confidence scores of the SOTA method and of our method. This demonstrates that our model is more confident in recognizing the tail class <bipolar, coagulate, liver>.

[Q4-R1] The impact of loss balance coefficients. On the left of Fig. 4, the blue line represents experiments conducted with α ∈ {0, 0.25, 0.5, 0.75, 1, 1.25, 1.5} and β = 1, while the orange line represents experiments with α = 1 and β ∈ {0, 0.25, 0.5, 0.75, 1, 1.25, 1.5}.
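
The two one-at-a-time sweeps described in this answer can be enumerated as follows. The coefficient values and the fixed value of 1 are taken from the rebuttal; the helper name `one_at_a_time_sweeps` and the joint grid are illustrative assumptions, and the joint grid is shown only for comparison, not as something the authors report running.

```python
from itertools import product

# Coefficient values quoted in the rebuttal for Fig. 4 (left).
SWEEP_VALUES = [0, 0.25, 0.5, 0.75, 1, 1.25, 1.5]


def one_at_a_time_sweeps(values=SWEEP_VALUES, fixed=1):
    """Build (alpha, beta) pairs: vary one coefficient while the other is fixed."""
    alpha_sweep = [(a, fixed) for a in values]  # blue line: alpha varies, beta = 1
    beta_sweep = [(fixed, b) for b in values]   # orange line: alpha = 1, beta varies
    return alpha_sweep, beta_sweep


alpha_sweep, beta_sweep = one_at_a_time_sweeps()
unique_configs = set(alpha_sweep) | set(beta_sweep)

# Hypothetical full grid over both coefficients, for comparison only.
joint_grid = list(product(SWEEP_VALUES, SWEEP_VALUES))

print(len(unique_configs), "configurations in the two sweeps")
print(len(joint_grid), "configurations in a full joint grid")
```

The two sweeps share the point (1, 1), so they cover 13 distinct settings, whereas a joint grid over the same values would need 49.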

[Q5-R1] The utilization of temporal information. At the beginning of Section 2, we mentioned that temporal modeling is performed after TERL. Compared to 3D networks, this strategy is more efficient in long-term dependency modeling [1][2].

[Q6-R1] The layer configuration of MHC. In Fig. 2(b) and Section 2.2, we mentioned that each classifier within MHC is a 1 × 1 convolution layer. As for reproducibility, we will release the code once the paper is accepted.

[Q7-R3] The goal of ILCL. ILCL aims to minimize the similarities of negative pairs (i.e., embeddings from different tail triplet classes) while maximizing similarities of positive pairs (i.e., embeddings from the same tail triplet class).
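
The objective described here, pulling same-class embeddings together and pushing different-class embeddings apart, can be illustrated with a generic supervised InfoNCE-style loss. This sketch is a stand-in for the idea and is not the ILCL loss from the paper: the function names, the temperature of 0.1, the toy embeddings, and the list used as a memory bank are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(query, query_label, bank, temperature=0.1):
    """Generic supervised InfoNCE-style loss for one query embedding.

    bank: list of (embedding, label) pairs standing in for a memory bank.
    Entries sharing the query's label are positives; the rest are negatives.
    """
    logits = [cosine(query, emb) / temperature for emb, _ in bank]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_denom = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    pos_log_probs = [
        x - log_denom for x, (_, label) in zip(logits, bank) if label == query_label
    ]
    if not pos_log_probs:
        return 0.0  # no positive for this query in the bank
    return -sum(pos_log_probs) / len(pos_log_probs)


# Toy 2-D embeddings for two made-up tail classes (labels 0 and 1).
bank = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1)]
query = [1.0, 0.05]
loss_match = contrastive_loss(query, 0, bank)     # query lies near class 0
loss_mismatch = contrastive_loss(query, 1, bank)  # same query, labeled class 1
print(loss_match < loss_mismatch)
```

The loss is small when the query sits close to bank entries of its own class and large when its nearest neighbours belong to another class, which is the behaviour the answer describes for positive and negative pairs.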

[Q8-R3] The rationale behind PBSE. Sorry for the mistake made (see Q1). We have 97 tail classes rather than 3. PBSE embeddings are derived from these 97 tail classes. Some of these tail classes share the same triplet component (i.e., instrument, verb, or target). The component semantics can be learned from various tails. For example, for <grasper, dissect, cystic plate> enhancement, the semantics of cystic plate can be further learned from <scissors, cut, cystic plate> and <bipolar, coagulate, cystic plate>.
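
The sharing of component semantics across tail triplets can be sketched by pooling bank embeddings per component. The triplet names below are the ones used in this answer; the toy embeddings, the function name `component_prototypes`, and the choice of a plain mean as the prototype are assumptions, since the page does not specify how prototypes are computed.

```python
SLOTS = ("instrument", "verb", "target")


def component_prototypes(bank):
    """Average bank embeddings per component value (ASSUMED: prototype = mean).

    bank: list of (embedding, (instrument, verb, target)) pairs.
    Returns a dict mapping (slot, name) to the mean embedding.
    """
    sums, counts = {}, {}
    for emb, triplet in bank:
        for slot, name in zip(SLOTS, triplet):
            key = (slot, name)
            if key not in sums:
                sums[key] = [0.0] * len(emb)
                counts[key] = 0
            sums[key] = [s + e for s, e in zip(sums[key], emb)]
            counts[key] += 1
    return {key: [s / counts[key] for s in vec] for key, vec in sums.items()}


# Made-up 2-D embeddings for the three tail triplets named in the rebuttal.
bank = [
    ([1.0, 0.0], ("grasper", "dissect", "cystic plate")),
    ([0.0, 1.0], ("scissors", "cut", "cystic plate")),
    ([1.0, 1.0], ("bipolar", "coagulate", "cystic plate")),
]
protos = component_prototypes(bank)
print(protos[("target", "cystic plate")])  # pooled over all three triplets
print(protos[("instrument", "grasper")])   # seen in only one triplet
```

Here the "cystic plate" prototype draws on three different tail triplets, while each instrument prototype rests on a single one, which mirrors the argument that a rare triplet can borrow evidence for a component from other tails that share it.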

[Q9-R4] Evaluation on an additional dataset. Our approach does not leverage surgical priors or class information from CholecT45. Consequently, we believe it can be applied to general tail-distributed datasets of surgical action triplets. We will consider this in future work, once a new public dataset is available.

[Q10-R4] Comparison with the SOTA method. There seems to be a misunderstanding about our method’s improvement over the SOTA method. According to Table 1, our results actually show a 1.5% improvement, rather than the 0.5% mentioned by the reviewer. Moreover, as shown in Fig. 1, our method achieves higher confidence scores in recognizing tail classes, demonstrating its superiority in tail enhancement.

[1] Czempiel, T., et al.: Tecno: Surgical phase recognition with multi-stage temporal convolutional networks. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 343–352. Springer (2020)

[2] Yi, F., et al.: Not end-to-end: Explore multi-stage architecture for online surgical phase recognition. In: Proceedings of the Asian Conference on Computer Vision. pp. 2613–2628 (2022)

Meta-Review

Meta-review #1

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

N/A

Meta-review #2

After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

Accept
Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

N/A
What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

N/A

Tail-Enhanced Representation Learning for Surgical Triplet Recognition

Author(s):