Abstract

Temporal Action Segmentation (TAS) of a surgical video is an important first step for a variety of video analysis tasks such as skill assessment, surgical assistance, and robotic surgery. Limited data availability due to costly acquisition and annotation makes data augmentation imperative in such a scenario. However, extending directly from image-augmentation strategies, most video augmentation techniques disturb the optical flow information while generating an augmented sample, which creates difficulty in training. In this paper, we propose a simple-yet-efficient, flow-consistent, video-specific data augmentation technique suitable for TAS under scarce data conditions; to the best of our knowledge, this is the first augmentation designed for data-scarce TAS in surgical scenarios. We observe that TAS errors commonly occur at action boundaries due to their scarcity in the datasets. Hence, we propose a novel strategy that generates pseudo-action boundaries without affecting the optical flow elsewhere. Further, we propose a sample-hardness-inspired curriculum in which we first train the model on easy samples with only a single label observed in the temporal window. Additionally, we contribute the first-ever non-robotic Neuro-endoscopic Trainee Simulator (NETS) dataset for the task of TAS. We validate our approach on the proposed NETS dataset, along with the publicly available JIGSAWS and Cholec T-50 datasets. Compared to using no data augmentation, our technique yields average improvements in edit score of 7.89%, 5.53%, and 2.80% on the three datasets, respectively. The reported numbers are improvements averaged over 9 state-of-the-art (SOTA) action segmentation models using two different temporal feature extractors (I3D and VideoMAE). On average, the proposed technique outperforms the best-performing SOTA data augmentation technique by 3.94%, thus setting a new SOTA for action segmentation on each of these datasets. https://aineurosurgery.github.io/VideoCutMix
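The core idea described in the abstract (splice a leading run of frames from one single-action window into another, so exactly one pseudo-action boundary appears while the optical flow inside each retained run stays intact) can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' released code; the window shape, cut-point sampling, and soft-target mixing (`video_cutmix`, `win_a`, `win_b`) are assumptions.

```python
import numpy as np

def video_cutmix(win_a, lab_a, win_b, lab_b, rng=None):
    """Illustrative pseudo-boundary augmentation sketch.

    win_a, win_b: (T, D) per-frame feature windows taken from two
    different single-action clips; lab_a, lab_b: their class ids.
    Replacing only a leading run of frames keeps the flow inside each
    retained run intact and introduces exactly one new boundary.
    """
    if rng is None:
        rng = np.random.default_rng()
    T = win_a.shape[0]
    cut = int(rng.integers(1, T))  # boundary position, anywhere in 1..T-1
    # first `cut` frames from clip B, remaining frames from clip A
    frames = np.concatenate([win_b[:cut], win_a[cut:]], axis=0)
    labels = np.where(np.arange(T) < cut, lab_b, lab_a)
    # soft target for a window-level loss: each action weighted by the
    # fraction of frames it contributes
    target = {lab_b: cut / T, lab_a: (T - cut) / T}
    return frames, labels, target
```

A window augmented this way carries two labels (an "AMFR" in the paper's terminology), whereas unmodified single-action windows stay easy; the curriculum trains on the easy single-label windows first.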

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3401_paper.pdf

SharedIt Link: https://rdcu.be/dV5z0

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72089-5_68

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3401_supp.pdf

Link to the Code Repository

https://github.com/AINeurosurgery/VideoCutMix

Link to the Dataset(s)

https://aineurosurgery.github.io/VideoCutMix

BibTex

@InProceedings{Dha_VideoCutMix_MICCAI2024,
        author = { Dhanakshirur, Rohan Raju and Tyagi, Mrinal and Baby, Britty and Suri, Ashish and Kalra, Prem and Arora, Chetan},
        title = { { VideoCutMix: Temporal Segmentation of Surgical Videos in Scarce Data Scenarios } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {725--735}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper introduces “VideoCutMix”, a video-specific data augmentation technique for Temporal Action Segmentation (TAS) tailored to scarce data scenarios. Unlike previous augmentation techniques, VideoCutMix focuses on generating pseudo-action boundaries, where TAS errors often occur. It does so without disturbing the video optical flow and thus preserves the training distribution. The paper also proposes a novel dataset, NETS, for the TAS task and a training strategy divided into 4 steps in which the model is iteratively trained on data of increasing difficulty. The authors have tested their method on 3 datasets and 9 segmentation models and observed the superiority of their approach in most cases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1 - The paper is well structured and easy to understand; the figures are well presented and help the reader digest the information (specifically Figure 3).
    2 - The paper has extensive experimental results with 9 different architectures and 3 datasets, as well as 2 different evaluation metrics in the main paper and more in the supplementary materials.
    3 - The paper introduces a novel dataset for the task of TAS, which could advance future research in this area.
    4 - The authors observed that in data-scarce scenarios most TAS errors come from the boundaries between actions, and leveraged this observation to design an augmentation that focuses on increasing the number of action boundaries.
    5 - The method section is clear and the algorithms are explained well enough for any reader from the field to reproduce the method.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1 - TAS in data-constrained settings: while the authors mention that 176 is a small number, can they provide perspective on the usual number of videos (and duration per video) for the task of TAS? For example, how big is the Breakfast dataset in comparison?
    2 - Results section: the results section is relatively short (less than 2 pages excluding the conclusion) and could perhaps be extended by moving the details of the NETS dataset to the supplementary material instead.
    3 - Main results: it is unclear under which conditions the baselines are computed. While the abstract mentions that the improvements are obtained by comparing VideoCutMix against no augmentation, do the baselines in Table 1 correspond to no augmentation? Do the authors use weak augmentations (static warping, rotation, …) for their proposed method in Table 1? If so, it would be unfair not to apply these weak augmentations to the baselines as well.
    4 - Curriculum learning: is it possible to apply CL to other augmentation techniques? If so, has this been done when comparing the results in Table 2? Since CL involves several training steps, was the model trained for more epochs/iterations when using the proposed augmentation versus the baselines in Table 1? Likewise, when comparing against SoTA augmentations in Table 2, did the training use a fair number of iterations between using CL and not using CL?
    5 - Ablation: what is the baseline row in Table S4? Does it correspond to the case with no augmentation? If so, the numbers do not match the ones in Table S2, as they should if I understood correctly. For example, in Table S4 the Edit score of mGRU is 68.36 for the baseline, while in Table S2 the baseline Edit score for mGRU is 72.95.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    No.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    1 - Comparison against SoTA augmentations: In Table 2, is it possible to include the No Augmentation baseline so that we can see how each augmentation method performs against it?

    2 - SoTA augmentations against No Augmentation: It seems most of the SoTA augmentation techniques degrade model performance, and in most scenarios No Augmentation simply performs better (as can be seen in Figure 4 or by comparing the values between Table 2 and Table S2). The authors mentioned that previous augmentations disturb the training distribution, which could explain the better performance of not using any augmentation in scarce scenarios. In this case, is there any previous augmentation technique that is relevant to the task of TAS, or is VideoCutMix the only augmentation that works so far for scarce TAS? Did previous work perform TAS on the C-50 and JIGSAWS datasets? If so, which augmentations did they use? If none of the previous augmentations works for scarce TAS, I think it is important to state this point explicitly in the results section. The authors can then present VideoCutMix as the augmentation that solves this problem.

    3 - Performance in non-scarce scenarios: while the paper demonstrates the effectiveness of VideoCutMix in scarce scenarios, have the authors checked how this augmentation compares against other augmentations in non-scarce scenarios? For instance, the authors test on Breakfast (50%) comparing VideoCutMix against no augmentation, but how does VideoCutMix perform against SoTA augmentations? If it does not work, it would still be interesting for the authors to mention this point as a limitation.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    While the method proposed by the authors effectively improves TAS results in scarce scenarios, along with the previously outlined weaknesses, my main concerns lie with the ablation study in Table S4 and the unclear training settings around the main results.
    1 - For the training settings, it is unclear whether the comparison between the proposed method and the baselines is fair, since the authors did not specify which aspects of the proposed method and the baselines are compared (weaknesses 3 and 4).
    2 - For the ablation study, I would like to know the correct value for the baselines in Table S4. In the case of mGRU, the authors report a baseline Edit score of 68.36; using VideoCutMix, this Edit score increases to 71.98, and finally, using Curriculum Learning, they achieve the final score of 74.29. However, in Table S2 the same baseline Edit score is 72.95, which would imply that VideoCutMix first degrades the performance and Curriculum Learning then recovers it.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    The authors have addressed the open points to some degree.



Review #2

  • Please describe the contribution of the paper

    The authors present several contributions in the context of temporal action segmentation (TAS). First, they introduce a new dataset for non-robotic neurosurgery, the neuro-endoscopic trainee simulator (NETS) dataset, which is the first public dataset for this domain. Second, they propose a new video augmentation technique specific to TAS tasks (VideoCutMix). Finally, a new training approach for applying the proposed data augmentation technique to TAS is introduced. Comprehensive experiments on several surgical datasets suggest that their proposed augmentation yields better improvements than previous state-of-the-art (SOA) augmentations on multiple SOA models for TAS.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The new dataset is the first large dataset of its kind and appears to be well-curated.

    • With a new video augmentation method, the authors address a relevant challenge, recently identified to be a “white spot” in the field of biomedical image analysis (https://openaccess.thecvf.com/content/CVPR2023/html/Eisenmann_Why_Is_the_Winner_the_Best_CVPR_2023_paper.html).

    • Their augmentation method for TAS overcomes the issue of translating augmentation techniques from image to video analysis. While (many) existing SOA augmentations originate from the image domain and might have limited utility in video analysis, the proposed augmentation is used specifically for TAS and/or other video analysis tasks.

    • The proposed augmentation method together with the applied training curriculum substantially improves the performance of TAS models without data augmentation on multiple SOA TAS models on two existing surgical datasets and the newly introduced surgical dataset. The proposed method also appears to outperform previous SOA augmentations on a surgical dataset.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. VideoCutMix data augmentation technique:
      • The authors claim that “most video augmentation techniques disturb the optical flow” but provide no quantitative evidence for this.
      • They state that replacing the first or last frames with frames from other videos ensures that the optical flow is consistent inside the sample. However, they do not discuss the fact that the optical flow at action transitions is not preserved.

    Curriculum learning mechanism: The proposed training strategy appears to be quite specific to the proposed augmentation, as it creates many more Augmented Multi-label Feature Representations (AMFRs) than Augmented Unilabel Feature Representations (AUFRs). Did the authors apply the same training curriculum for the other SOA augmentations? If so, this could be a reason why the proposed augmentation outperforms the others. If not, details of the training of the TAS models with existing SOA augmentations, and a justification for comparing these trained models with the fine-tuned models from the proposed learning mechanism, are missing.

    NETS dataset: One of the main weaknesses of the manuscript is the lack of information on the dataset, such as camera or lighting details, the average number of scenes per video, labeling instructions, or inter-rater variability between annotators. The lack of information makes it hard to judge the quality and diversity of the dataset. For example, if the same box-based trainer and the same camera/lighting setup were used, the diversity of the video content would be limited.

    Validation:

    • Confidence intervals and/or variability in the result tables are currently missing (see: https://www.nature.com/articles/s41746-022-00592-y).
    • Metrics are not justified and potentially not suitable. While the Segmental Edit Score is a good metric to check the order of predicted actions, it comes with some flaws (e.g., small mistakes may be heavily penalized).
    • The effect of the dataset size is confirmed on a non-surgical dataset, which may show the generalizability of the results but is counterintuitive, as the method was introduced for a surgical application. Why did the authors not confirm the hypothesis with a surgical dataset?
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The authors claim improved performance by using their proposed video augmentation together with a new training curriculum for scarce surgical datasets. In case the authors do not publish their code, details related to training are missing, such as the number of training steps/epochs for the fine-tuning phases. In addition, details related to the dataset are currently missing, which makes it difficult to understand how the dataset was built.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Based on the weaknesses described above, the following suggestions could help improve the manuscript:

    • More details of the NETS dataset should be added to assess the quality and diversity of the dataset, such as camera details, lighting details, the average number of scenes in a video, labeling instructions, and inter-rater variability between annotators. In addition, the SOA segmentation models used are not described in the text. A short description or statement of which methods were chosen and why should be added.

    • In addition to the reported metrics, the authors may consider adding additional metrics for temporal assessment and to avoid pitfalls, e.g. the ones presented in https://dumas.ccsd.cnrs.fr/TIMC-IMAG-GMCAO/hal-01299344.

    • Instead of validating the proposed augmentation on subsets of the breakfast dataset, the augmentation should be validated on subsets of (at least) one of the surgical datasets (NETS, JIGSAWS, Cholec T-50). While the used surgical datasets are already smaller than the breakfast dataset, such an experiment would strongly support the primary claim of the paper that the proposed augmentation is beneficial for TAS on surgical videos in scarce data scenarios.

    • Details regarding the test/external validation sets should be added, especially the number of videos/temporal windows, subjects, and class balance, to further raise the credibility of the results.

    • The temporal window is rather small (9 frames). I would suggest increasing the window size and comparing the effect of changing the window size.

    • Confidence intervals should be added

    • Justifications for the reported metrics should also be added.

    • Related work and limitations of the paper should be added, including a short description of the SOA models used for TAS.

    • Do the authors have an explanation of why the general results are much lower for the C-T50 dataset?

    Feedback for improving the clarity and organization of the paper:

    • The structure of the introduction is a bit confusing to me: An introduction typically starts with a broad introduction to the application, introduces challenges, and describes the contributions based on them. In this paper, the authors directly start with a quite technical description.

    • The author’s contributions should be described more concisely while the problems, background, and challenges should be explained in more detail. For example, while the abstract mentions why new pseudo-action boundaries are needed, the issue is not explained in the introduction.

    • A discussion on why other augmentation methods were not explored (e.g. PixMix, VideoReverse,…) would be helpful.

    The following minor details should also be addressed to improve the clarity of the manuscript:

    • The figures are sometimes hard to read due to small font sizes and the fonts are often stretched.

    Figure 1:

    • The figure is currently never referenced in the text (every figure and table should be referenced at least once).
    • “(d) showcases the performance against various state-of-the-art (SOTA) augmentation techniques […]”
    • Which performance metric is shown? What examples are shown and for which datasets?

    Figure 2:

    • A legend is missing for (b) explaining the blue and red colors.
    • There is a typo in the caption: “shows a sample frames for each of the classes, pick, move, release, and background”

    Figure 4:

    • The caption should be expanded to include more details, e.g., what are the four examples standing for and which actions are represented by the colors?
    • Spelling errors should be fixed, such as “The results of the various SOTA […] is shown in Table 1” (it should be p7)

    • Table 2 is currently referenced as Table 3.

    • In the abstract, the authors claim that “[t]he proposed technique outperforms the best-performing SOTA data augmentation technique by 3.94%”, but in the results section, “the proposed architecture outperforms the current SOTA augmentation technique by 1.8%”. The numbers do not match. If one of the numbers is stated incorrectly, this should be fixed.

    • Additionally, I am unsure how the number 1.8% is inferred from the presented table 2/3 and what is defined as the “current SOTA augmentation technique”. This is not clear from table 2/3 and should be clarified.

    • The authors state: “We also modify the target probability vector for computing the cross-entropy loss with the predicted vector.” Please clarify the benefit of this change.

    • The authors mention “significant performance” several times although no statistical tests were performed.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors address the relevant topic of surgical video data augmentation, which has - so far - received almost no attention in the literature. According to the experiments, the proposed method might become a valuable tool in the surgical data science community.

    I put “weak accept” rather than “accept” because the missing details make it hard to thoroughly and reliably judge the paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Although not all my concerns with respect to the validation have been resolved, I decided to choose accept - one reason being the release of a new unique data set, which is a strong contribution by itself.



Review #3

  • Please describe the contribution of the paper

    The proposed framework introduces augmentation and curriculum learning techniques to enhance the performance of Temporal Action Segmentation (TAS) and demonstrates improved results. Additionally, a new neuro-endoscopic trainee simulator dataset is proposed.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The proposed approach validates its effectiveness by comparing results before and after applying the augmentation and the hardness-specific curriculum (transitioning from single action labels to multi-label, and using augmented samples) across various datasets and models. This comparison demonstrates the superiority of the proposed method. A new neuro-endoscopic trainee simulator dataset is also proposed.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    It seems that the contribution of the proposed dataset (NETS) is relatively weak. Similar datasets like JIGSAWS and PETRAW already exist, so what distinguishes NETS from these datasets?

    Regarding setting a temporal window of delta units before/after a specific time t, why was the temporal window not designed to include frames from 2·delta units before t?

    In Table 1, it appears that some of the bold-highlighted results indicating improvement are incorrectly applied.

    I wonder why there are no results in Table S2 showing the utilization of 100% of the dataset size in the Effect of Dataset Size section.

    I’m curious whether the comparison before and after applying curriculum learning used the same number of training iterations for paired comparisons.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    It would be beneficial to provide an explanation of Temporal Action Segmentation (TAS) at the beginning of the paper, along with the theoretical basis and reasons for introducing augmentation, which is the core contribution of the proposed method. Proposing an additional dataset, while commendable, may not be critical to the overall flow and consistency of the content.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Regarding the paper title and contributions, the experiments performed are well done, but there are some parts that seem less relevant, such as the proposed dataset. Also, the explanatory power of the introduction is somewhat lacking.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Although the rebuttal did not resolve all my questions, the decision was made because of the strength of the new dataset proposal and methodology.




Author Feedback

We thank reviewers for insightful comments. Reviewers appreciated our efforts in writing the paper (R1, R3, R4), figures (R1), developing the dataset (R1, R3), VideoCutMix architecture (R1, R3, R4), & exhaustive experimentation (R1). We will address all issues raised in the final version. Here, we answer a few important questions.

(R1) Datasize: Breakfast vs JIGSAWS: Breakfast has 1712 videos (77 hours); JIGSAWS has 58 minutes of video.

(R1,R4) Baseline results with weak augmentation? …parameters? All baseline results use weak augmentation with same parameters for both baseline & our models.

(R1,R3) CL on other SOTA augmentation? Other SOTA augmentations do not generate AUFR & AMFR, so we can’t apply our CL to them. Table S4 shows most improvement comes from VideoCutMix.

(R1,R4) More training for Proposed model with CL? No. All surgical models (Baseline; SOTA aug: no CL; our model: with CL) were trained for 25 epochs each (Table S1). The proposed model was trained for 5 epochs each on UFR & MFR, then 8 epochs on AMFR & 7 on AUFR.

(R1) mGRU I3D baseline: 68.36 or 72.95? Sincere apologies for the typo. The edit score is 68.36 and accuracy is 72.95. We shall correct it in the table & accordingly update the numbers.

(R1) Is VideoCutMix the only augmentation for scarce TAS? Previous TAS-augmentation work on C-50 & JIGSAWS? Most SOTA video augmentations are proposed for action recognition, and to the best of our knowledge, VideoCutMix is the first augmentation for data-scarce TAS in surgical scenarios. No other works report augmentation numbers on C-50 & JIGSAWS. We shall highlight this in the paper.

(R1) Compare SOTA augmentation on Breakfast (50%) dataset. VideoCutMix is slightly better than the best SOTA augmentation. We can’t give results as per MICCAI rebuttal guidelines.

(R3) Quantitative evidence for SOTA aug disturbing flow. On average, augmented videos of Randmix have an 80.9% deviation in optical flow from the base videos of the JIGSAWS dataset, TubeMix: 67.54%, Framemix: 48%, & proposed technique: 6.25%. We shall add this info in the introduction.

(R3) Optical flow at action transitions is not preserved: VideoCutMix doesn’t preserve optical flow at boundaries but creates pseudo boundaries. Please refer to contribution 2, the last para in the introduction section.

(R3, R4) Lack of Information about the dataset: 70 trainee neurosurgeons from 14 hospitals across 3 countries performed tasks over 5 years on 6 box-trainers in working hours with minimal lighting variation. The dataset includes data from 12 cameras (2 per box-trainer). This is the first neurosurgeon-specific, non-robotic dataset. We will add these details.

(R3) Report other metrics: Though the proposed model is much better in terms of precision, recall, & accuracy (eg: ~11% in recall for NETS), we reported Edit & F1 scores due to space constraints, following [1,2,7,9,21,23].

(R3) Effect of datasize on C-50: Surgical TAS datasets are generally small. Though we observed a similar trend in the improvement of the edit score on 25%, 50% & 100% of the C-50 data (~12% to ~5%), we did not add this to the table because 25% of C-50 corresponds to only 12 videos, raising concerns about statistical significance.

(R3,R4) Choice of the temporal window. Selected through ablation study. Next best size = 7

(R3) Why are numbers on C-50 low? It is a cholecystectomy surgery dataset recorded at 1 FPS with only around 180 frames per class, per video. High data variance & low FPS result in low accuracy.

(R3) Overall improvement: 3.84% or 1.8%? 3.84% is the average improvement across all datasets & architectures. 1.8% is the least improvement in the Edit score on the JIGSAWS-KNT dataset with I3D. We will rephrase this in the manuscript to avoid confusion.

(R3) Why modify the target probability vector? Post-augmentation, frames for TFR come from different classes, so we modify the target prob vector accordingly.

We apologise for the typos in text & numbers, and shall correct them in the final version.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A


