Abstract

Few-shot video object segmentation aims to reduce annotation costs; however, existing methods still require abundant dense frame annotations for training, which are scarce in the medical domain. We investigate an extremely low-data regime that utilizes annotations from only a few video frames and leverages existing labeled images to minimize costly video annotations. Specifically, we propose a two-phase framework. First, we learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, we introduce a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames. Constraints are also enforced between the image model and relearning model at both feature and prediction levels. Experiments demonstrate the superiority of our approach over state-of-the-art few-shot segmentation methods. Our model bridges the gap between abundant annotated medical images and scarce, sparsely labeled medical videos to achieve strong video segmentation performance in this low-data regime. Code is available at https://github.com/MedAITech/RAB.
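To make the two-phase idea concrete, the following is a minimal sketch of what one spatiotemporal consistency relearning step could look like, assuming PyTorch-style models that return both features and mask logits. All names, the loss weights, and the choice of MSE are illustrative assumptions, not the authors' actual implementation.

    import torch
    import torch.nn.functional as F

    def relearning_loss(video_model, image_model, frame_t, frame_t1,
                        lam_temporal=1.0, lam_distill=0.1):
        # Hypothetical interface: each model returns (features, mask logits).
        feat_t, pred_t = video_model(frame_t)
        feat_t1, pred_t1 = video_model(frame_t1)

        # Temporal consistency: consecutive frames should yield similar masks.
        temporal = F.mse_loss(pred_t.sigmoid(), pred_t1.sigmoid())

        # Constraints against the frozen image model at both the feature
        # and prediction levels, as described in the abstract.
        with torch.no_grad():
            feat_img, pred_img = image_model(frame_t)
        distill = (F.mse_loss(feat_t, feat_img)
                   + F.mse_loss(pred_t.sigmoid(), pred_img.sigmoid()))

        return lam_temporal * temporal + lam_distill * distill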

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3024_paper.pdf

SharedIt Link: https://rdcu.be/dY6fU

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72390-2_26

Supplementary Material: N/A

Link to the Code Repository

https://github.com/MedAITech/RAB

Link to the Dataset(s)

https://github.com/haifangong/TRFE-Net-for-thyroid-nodule-segmentation

https://www.kaggle.com/datasets/aysendegerli/hmcqu-dataset

https://polyp.grand-challenge.org/AsuMayo/

https://www.kaggle.com/datasets/aryashah2k/breast-ultrasound-images-dataset

https://github.com/cv516Buaa/MMOTU_DS2Net

BibTex

@InProceedings{Zhe_Reducing_MICCAI2024,
        author = { Zheng, Zixuan and Shi, Yilei and Li, Chunlei and Hu, Jingliang and Zhu, Xiao Xiang and Mou, Lichao},
        title = { { Reducing Annotation Burden: Exploiting Image Knowledge for Few-Shot Medical Video Object Segmentation via Spatiotemporal Consistency Relearning } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        pages = {272--282}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The contributions of the paper are as follows: (a) The authors leverage labeled images to pre-train the object segmentation network so that its performance can be transferred to learning on videos (which have comparatively few annotations) in a self-supervised learning protocol. (b) The authors also propose “Cross-Resolution Feature Fusion” to fuse the learned features and the generated pseudo masks, enhancing segmentation performance. (c) A spatiotemporal consistency module is introduced in the final learning phase with videos, enforcing consistency in the predicted latent embedding space. (d) The pretrained image-based segmentation model is used as a pseudo-teacher that guides the learning of the video-based segmentation model along with the attention unit.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The authors have proposed to use labeled medical images, along with their known object segmentation masks, to pre-train a network that can then be used to segment objects in videos with little or no annotation. To this end, the authors propose “Cross-Resolution Feature Fusion” to enhance the feature learning capability of the model, along with feature tracking by enforcing spatiotemporal consistency across the time domain. The authors provide experiments substantiating their hypothesis across a number of different datasets.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    (a) A fundamental weakness of the paper is that the authors have not provided any justification or motivation for why the few-shot learning setup is used. (b) The authors have not taken into account the domain mismatch between the images and videos used to train the networks. Even though natural images have been used, this does not guarantee that the domain gap is absent. Moreover, the network pretrained on images appears to be fixed during training of the video-based object segmentation network; errors computed in Eqns. (4) and (5) can therefore accumulate over the training process, which may lead to model collapse. Training both segmentation models in a teacher-student setup, or accounting for the domain shift between images and videos, would have provided a stronger baseline against which to compare the results, and this is missing from the paper. (c) The feature enhancement “Cross-Resolution Feature Fusion” module parallels the self/cross-attention modules prevalent in transformer models, yet the authors show no experiments comparing against them either.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors need to provide a proper justification for why a few-shot learning framework was used as the training protocol with annotated labeled images; this justification would definitely increase the impact of the proposed work. Furthermore, the authors may look into papers that attempt to reduce the domain gap between the image and video domains and incorporate them into their work, which may improve its generalizability. Comparisons against these methods would also help clarify the overall impact of the work. It would be really impactful if the authors could provide experimental results on other modalities such as CT and MRI.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    (a) The authors need to provide the basic motivation for why the “few-shot” pre-training setup was chosen over a standard training protocol. What added benefit few-shot learning brings is not clear, and the authors need to provide an ablation experiment to demonstrate it. (b) The domain gap between the image and video domains is not taken into account. Experiments with a mean-teacher learning setup would provide a useful baseline against which to compare the proposed work. (c) Ablation experiments validating the importance of “Cross-Resolution Feature Fusion” are absent from the paper. The authors should also justify, in the rebuttal, why “Cross-Resolution Feature Fusion” was chosen over a standard self/cross-attention layer.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    In this study, the authors propose a two-phase framework. The main task is to learn a few-shot segmentation model using labeled images. Subsequently, to improve performance without full supervision, the paper also introduces a spatiotemporal consistency relearning approach on medical videos that enforces consistency between consecutive frames.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper’s use of few-shot learning for segmentation is innovative and meaningful.
    2. The author conducted experiments on multiple datasets, which lends credibility to the results and demonstrates good performance.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    1. The method lacks novelty.

    2. The authors’ claim that this is the first work to train on images and then segment videos can be misleading to readers. Videos are inherently composed of images with temporal relationships. Training on images compensates for the lack of video data but is not the optimal method; currently, video segmentation training still primarily relies on image data.
  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    1. The authors should provide a brief explanation of the comparison methods used; this will help readers better understand them.
    2. The comparison results only showcase the authors’ method without presenting those of the competing methods, making the comparisons less intuitive.
    3. The authors’ approach of training on image datasets and validating on video datasets is commendable. However, there are differences between the two, and the authors should clarify whether the results are from the same dataset. If publicly available datasets were used, it would be beneficial to briefly discuss whether other methods have been evaluated on the same datasets.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper’s use of few-shot learning for segmentation is innovative and meaningful.
    2. The author conducted experiments on multiple datasets, which lends credibility to the results and demonstrates good performance.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    In this study, the authors propose a two-phase framework. The main task is to learn a few-shot segmentation model using labeled images. However, the method lacks novelty.



Review #3

  • Please describe the contribution of the paper

    The paper proposes a novel method to reduce the need for expensive video annotations by using a minimal-data scenario that utilizes annotations from only a few video frames and pre-existing labeled images. The authors suggest a two-step process. First, train a few-shot segmentation model using labeled images. Then, to improve performance without comprehensive supervision, implement a spatiotemporal consistency relearning method on medical videos that ensures consistency across successive frames. Constraints are also imposed between the image model and the relearning model at both the feature and prediction levels to enhance performance.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes few-shot medical video object segmentation using image datasets, without requiring full video annotations. This approach can greatly reduce the time required for manual video labeling.

    2. In this paper, the author proposes a novel method that utilizes spatiotemporal consistency as an additional cue to improve segmentations within a self-supervised framework.

    3. The proposed method has been extensively tested on the HMC-QU dataset and the ASU-Mayo Clinic Colonoscopy Video (ASU-Mayo) dataset, and the results show that it outperforms existing state-of-the-art few-shot segmentation methods.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The authors did not compare complexity, training time, FLOPs, throughput, and FPS with the baseline methods.

    2. The author has not presented any theoretical justification for the superior performance of the proposed method over the baseline methods.

    3. To ensure reproducibility of the results, it is highly recommended to share the code in a public repository such as GitHub. Without access to the code, reproducing the results may be difficult.

    4. The writing style and structure of the paper could be enhanced to make it more effective, specifically the experiments section.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not provide sufficient information for reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    To ensure reproducibility of the results, it is highly recommended to share the code in a public repository such as GitHub. Without access to the code, reproducing the results may be difficult.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Please refer to the “main weaknesses of the paper” section.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Please refer to the “main weaknesses of the paper” section for more details.

    1. The authors did not compare complexity, training time, flops, throughput, and FPS with the baseline methods.
    2. No code provided to reproduce the results.
    3. The writing style and structure of the paper could be enhanced.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    Thank you to the authors for the detailed response. I am not completely convinced by the answers provided by the authors, but the method they proposed, and the experimental results reported in the paper are good. I’m keeping the same score and I am recommending weak acceptance.




Author Feedback

We sincerely thank the reviewers for providing constructive feedback.

Code (R1&R3&R6) We promise to make our code publicly available.

Reviewer #1 Q1 Novelty Our work explores the underexplored area of FEW-SHOT medical video object segmentation. The novelty lies in leveraging existing abundant annotated images and few video frames through a few-shot learning paradigm, along with the proposed spatiotemporal consistency relearning method to bridge the gap between image and video domains. Our approach achieves SOTA results on multiple benchmarks.

Q2 Misleading statement We reword it as follows:

We explore few-shot medical video object segmentation using image datasets, without reliance on full video annotations.

Q3 Explanation of comparison methods This will be added in Sec 3.2.

Q4 Comparisons are less intuitive. Due to page limit, we did not show visualization results of competing methods. These will be included in the final version of the paper.

Q5 Experiments on the same dataset? Yes. The image and video datasets have no overlap in data or categories. All models used the same train/test sets.

Reviewer #3 Q1 Why the few-shot learning setup? For the target (video) domain, our use case involves few-shot segmentation, where the model segments subsequent frames based solely on the annotation of the first frame. Consequently, in the source (image) domain, we have to employ the same few-shot learning strategy.
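To illustrate the protocol described above: at test time, the model receives the annotated first frame as its support and segments every subsequent frame as a query. A hypothetical sketch follows; the model interface and the 0.5 threshold are assumptions, not taken from the paper's code.

    import torch

    @torch.no_grad()
    def segment_video(fewshot_model, frames, first_frame_mask):
        # The single annotated first frame is the support; all later frames
        # are queries segmented from that one annotation alone.
        support_img, support_mask = frames[0], first_frame_mask
        masks = [support_mask]
        for query in frames[1:]:
            logits = fewshot_model(support_img, support_mask, query)
            masks.append((logits.sigmoid() > 0.5).float())
        return masks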

Q2 Domain mismatch was not considered. With full respect, we have taken this into account. The proposed spatiotemporal consistency relearning addresses this issue by leveraging the temporal continuity prior. In addition, the problem of category mismatch between images and videos is solved by the few-shot learning paradigm.

Q3 Model collapse No model collapse was observed in our experiments. However, ablation studies (cf. Tab. 3) show that the loss in Eq. (5) is crucial; removing it causes model collapse.

Q4 Teacher-student learning setup EMA is commonly used in teacher-student learning for updating model weights, but it requires the two models to have the same architecture. Our approach introduces new modules in the second phase, resulting in a different architecture. Thus, it cannot be applied as a baseline in our case.
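For context, the EMA update referred to here averages parameters position by position, which presupposes an identical parameter list in both models. A minimal sketch (names are hypothetical) makes the constraint visible:

    import torch

    @torch.no_grad()
    def ema_update(teacher, student, momentum=0.999):
        # zip() pairs parameters by position, so this only makes sense when
        # teacher and student share the same architecture; the second-phase
        # model's extra modules break that correspondence.
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)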

Q5 Look into papers attempting to reduce the domain gap. Our literature review found no prior work aligning with our concept. F. Perazzi et al. (CVPR’17) cannot handle scenarios with mismatched training and test data categories. The domain agent network (DAN) requires both image and video data for training, unlike our model.

Q6 Cross-resolution feature fusion vs. self/cross attention This technique is widely adopted and is not our novel contribution. Hence, no ablation study was conducted. We chose it for its simplicity and effectiveness, as attention modules tend to have more parameters with limited gains for few-shot segmentation (ASGNet, CVPR’21).
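For readers unfamiliar with the technique, a generic cross-resolution fusion pattern looks like the sketch below: low-resolution features are upsampled to the high-resolution grid and merged with a 1x1 convolution. This illustrates the general idea only and is not a reconstruction of the module used in the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossResolutionFusion(nn.Module):
        def __init__(self, c_low, c_high, c_out):
            super().__init__()
            # A 1x1 convolution merges the concatenated feature maps.
            self.proj = nn.Conv2d(c_low + c_high, c_out, kernel_size=1)

        def forward(self, feat_low, feat_high):
            # Upsample low-resolution features to the high-resolution grid.
            up = F.interpolate(feat_low, size=feat_high.shape[-2:],
                               mode="bilinear", align_corners=False)
            return self.proj(torch.cat([up, feat_high], dim=1))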

Q7 Experiments on CT and MRI Thanks for the advice. While our current project focuses on medical video segmentation, we acknowledge that extending our approach to 3D medical images is a promising direction worth exploring in future work.

Reviewer #6 Q1 Complexity Our model achieves ~78 fps, comparable to the baselines. We will analyze and discuss model complexity.

Q2 Theoretical justification Sorry for missing this part.

The shift from medical images to videos leads to performance degradation when deploying image models on videos. The proposed relearning method enables these models to adapt to test video characteristics, mitigating this domain shift. By fine-tuning a subset of parameters, coupled with newly introduced ones, on a few labeled frames from the test data, the models can better capture the underlying distribution of the test data, improving performance.
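A minimal sketch of the partial fine-tuning described above: freeze the pretrained weights and train only the newly introduced modules. The module key names, optimizer, and learning rate are assumptions for illustration, not the authors' settings.

    import torch

    def setup_relearning(video_model, new_module_keys=("temporal", "fusion")):
        # Freeze all pretrained weights, then unfreeze only parameters that
        # belong to newly introduced modules (key names are hypothetical),
        # so relearning adapts a small subset to the test video distribution.
        for name, param in video_model.named_parameters():
            param.requires_grad = any(k in name for k in new_module_keys)
        trainable = [p for p in video_model.parameters() if p.requires_grad]
        return torch.optim.Adam(trainable, lr=1e-4)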

Q3 Enhance paper writing style and structure. Thanks for the suggestions. To this end, we will include analysis regarding model complexity and provide additional theoretical justification.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Reject

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers’ opinions were divided, with two weak accepts and one weak reject. I believe the novelty is limited since many existing methods already predict the next frame in a video based on the previous frame’s GT. Reviewer 1, who gave a weak accept, also questioned the novelty. The authors need to compare their work with more recent few-shot learning techniques that consider spatiotemporal consistency.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers and ACs raise some valid concerns that should be addressed in the final version. However, the merits outweigh the limitations and the paper was considered to make a valuable contribution to the conference.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



