Abstract
In-context learning (ICL) has shown promise for generalizing to new visual tasks using a few examples, but current methods are limited. They typically rely on a rigid gridding strategy that restricts the number and resolution of context images. We propose Temporal, a novel approach that overcomes these limitations by reformulating visual ICL as a video object segmentation (VOS) problem. This VOS-based approach naturally handles a variable number of full-resolution context images. To automatically select the most relevant context for a given query, we introduce a prompt retriever pretrained on videos using a time-contrastive objective. This objective learns from the temporal coherence of video, using adjacent frames as positive examples (i.e., useful context images) and distant frames as negatives. For image segmentation, our retriever builds a pseudo-video by prepending the retrieved context images to the query image, which is then processed by the VOS model. For video segmentation, the retriever identifies keyframes, our ICL pipeline generates their masks, and these masks are propagated through the video. On the MICCAI FLARE 2022 challenge, Temporal significantly outperforms baselines, achieving a Dice score of 90.95% for image segmentation (+10.64%) and 92.45% for video segmentation (+14.88%).
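To make the pretraining objective concrete, below is a minimal sketch of a multi-positive time-contrastive loss as described in the abstract: frames within a small temporal window of an anchor are treated as positives (useful context), and all other frames in the batch act as negatives. The function name, the `window` and `temperature` hyperparameters, and the toy shapes are illustrative assumptions, not taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(embeddings, frame_idx, window=2, temperature=0.1):
    """Multi-positive time-contrastive loss over frames of one video (sketch).

    embeddings: (N, D) frame embeddings from the retriever.
    frame_idx:  (N,) temporal indices of the frames in the video.
    Frames within `window` of an anchor are positives; all other
    non-self frames in the batch are negatives.
    """
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    n = z.size(0)
    diag = torch.eye(n, dtype=torch.bool)
    dist = (frame_idx[:, None] - frame_idx[None, :]).abs()
    pos = (dist <= window) & ~diag                      # multi-positive mask

    # log-softmax over all non-self pairs, averaged over each anchor's positives
    logits = sim.masked_fill(diag, float('-inf'))
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    loss = -(log_prob * pos).sum(1) / pos.sum(1).clamp(min=1)
    return loss.mean()

# toy usage: 8 frames of one video, 16-dim embeddings
z = torch.randn(8, 16)
idx = torch.arange(8)
print(time_contrastive_loss(z, idx).item())
```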
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3228_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
https://github.com/aswahd/temporal
Link to the Dataset(s)
https://drive.google.com/drive/u/2/folders/1XPEijJrCzLskw7i49zMXZ-u2545bqi-l
BibTex
@InProceedings{WahAss_TimeContrastive_MICCAI2025,
author = { Wahd, Assefa and Jaremko, Jacob and Hareendranathan, Abhilash},
title = { { Time-Contrastive Pretraining for In-Context Image and Video Segmentation } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15960},
month = {September},
pages = {633--643}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper introduces a self-supervised prompt retriever for visual in-context learning.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The paper introduces a multipositive time-contrastive pretraining method tailored for prompt retrieval.
- Experiment results on MICCAI FLARE 2022 demonstrate the effectiveness of the proposed approach.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- The overall novelty is somewhat limited. The idea of constructing a VOS (Video Object Segmentation) sequence using multiple images and inferring the final frame via VOS techniques has already been explored in Medical SAM2.
- The formulation of video segmentation inference is unclear. If each frame x_i requires constructing a sequence for inference, similar to the approach used for a test image, the computational cost for processing an entire video would be substantial.
- Section 3.1 claims that Temporal automatically identifies the target organs. However, this is questionable: as a retriever, it is unclear how Temporal performs segmentation directly. Furthermore, since the approach builds on SAM2, which depends on prompt-based inputs, the claim of achieving automatic VOS is confusing and potentially misleading.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper contains several significant issues, as outlined in the major weaknesses, and its overall quality currently does not meet the standards required for acceptance.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #2
- Please describe the contribution of the paper
This paper proposes “Temporal”, a novel in-context learning (ICL) prompting method for medical image/video segmentation. Instead of modeling the image prompts as a grid, the proposed method models the prompting information as prepended frames of a video and uses a pre-trained video foundation model to segment every frame. In addition, the method uses a retrieval model trained on temporal relationships within a video rather than a common retriever such as CLIP. The proposed method outperforms existing ICL methods and the vanilla foundation model by a considerable gap.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The proposed method is very intuitive and convincing. Modeling the medical image segmentation prompts as a video is a natural solution. The new retrieval model, pre-trained on temporal relationships, further enhances the video ICL capability of the model. The prepended frames serve as context information and help improve the final prediction. The experimental results further validate the proposed design, demonstrating a notable and convincing performance improvement over the baselines.
- The diversity-aware context selection is also very inspiring. Instead of simply maximizing similarity to the query, the proposed method also minimizes the intra-context similarity, which forces the context images to provide different information. It is also proven effective in the experiments (a sketch of this selection strategy follows this list).
- The paper is well-written and easy to follow.
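The diversity-aware selection praised above can be read as a greedy trade-off between query similarity and intra-context redundancy, in the spirit of maximal marginal relevance. The sketch below is an assumption about how such a selector could look; the `alpha` weight and all names are illustrative, not the authors' implementation.

```python
import numpy as np

def select_context(query_emb, pool_embs, k=4, alpha=0.7):
    """Greedy diversity-aware retrieval (maximal-marginal-relevance style).

    query_emb: (D,) embedding of the query image.
    pool_embs: (M, D) embeddings of candidate context images.
    Each step picks the candidate that is similar to the query but
    dissimilar to the context images already chosen.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sim_q = p @ q                      # similarity of each candidate to the query
    chosen = []
    for _ in range(k):
        if chosen:
            redundancy = (p @ p[chosen].T).max(axis=1)
        else:
            redundancy = np.zeros(len(p))
        score = alpha * sim_q - (1 - alpha) * redundancy
        score[chosen] = -np.inf       # never pick the same image twice
        chosen.append(int(score.argmax()))
    return chosen

# toy usage: 100 candidates, 32-dim embeddings
rng = np.random.default_rng(0)
print(select_context(rng.standard_normal(32), rng.standard_normal((100, 32))))
```

With `alpha = 1.0` this reduces to plain nearest-neighbour retrieval; lowering `alpha` trades query similarity for context diversity.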
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- One of the major concerns about this paper is the limited evaluation. Although the evaluation on the FLARE 2022 dataset is very promising, it would be better to have additional evaluations on different datasets and against different baselines, for example, comparing with a supervised method to illustrate the difference between supervised methods and fine-tuned ICL methods. Still, this is not a big problem considering the limited space.
- Another small issue is the fairness of the evaluation. The best results reported here use a fine-tuned VOS model, which is naturally much stronger than the other baselines. This fine-tuned version should be compared against baselines that also use a fine-tuned VOS model.
- In Figure 3, the first three columns of the proposed method seem to be a zoomed-in view, which differs from the other baselines. The reviewer wonders why these images are different, or whether it is just a mistake.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, this is a promising paper with clear motivation, sufficient novelty, and acceptable experimental evaluation. The weaknesses in the experiments do not undermine the core contribution of the paper. Thus, I recommend weak acceptance of this paper.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Review #3
- Please describe the contribution of the paper
The paper introduces Temporal, a novel self-supervised learning framework designed for in-context learning in medical image and video segmentation.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- Unlike traditional grid-based ICL methods that limit the number and resolution of context images, Temporal supports a variable number of full-resolution context images, which is crucial for medical imaging where fine details matter.
- Temporal seamlessly extends from image to video segmentation by treating video segmentation as a keyframe-based propagation task, where masks predicted for keyframes are propagated bidirectionally across the sequence.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- As multi-positive time-contrastive pre-training is claimed as a main contribution of this paper, comparative experiments on this design (e.g., single-positive vs. multi-positive) should be discussed.
- Another question, related to W1, concerns the effect of the temporal window size.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The authors claimed to release the source code and/or dataset upon acceptance of the submission.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- During fine-tuning, did you train all the network parameters or only a partial model (e.g., the decoder)?
- How many pseudo-videos are generated to fine-tune the VOS model?
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, this paper presents a clear solution to ICL-based image/video segmentation. I give an initial score of weak accept.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We sincerely thank the reviewers and the Area Chair for their dedicated time and feedback on our manuscript. We have carefully considered all comments and provide our responses below, detailing how we have addressed the key points raised during the review process. We believe these revisions and clarifications further strengthen our paper.
Responses to Reviewer #1
Q1: Novelty and relation to Medical SAM2
We acknowledge Medical SAM2’s application to VOS. However, its novelty centers on a diversity-aware memory bank and, crucially, it relies on human-provided initial prompts. In contrast, our method introduces a fundamental shift: it automatically sources prompts from the training dataset itself, enabling fully automatic inference. This automated prompt generation, which frames promptable VOS as visual In-Context Learning (ICL) and vice versa, is a key novel contribution of our work.
Q2: Clarity of video segmentation inference and computational cost
We agree that frame-by-frame image-based inference would be too costly. As detailed in Section 2.3 (and Fig. 1b), our video segmentation is designed to be efficient:
- Keyframe Selection: We select a small number (K, typically 10-20) of keyframes based on representativeness, diversity, and confidence (see the sketch after this list).
- Keyframe Inference: Image-based inference (Fig. 1a) is applied only to these K keyframes.
- Mask Propagation: Masks from keyframes are propagated to other frames via standard VOS techniques. We will ensure this multi-step process is more clearly articulated in the revised manuscript to address any confusion.
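As a schematic of this three-step pipeline, the sketch below wires together a keyframe selector, the image-level ICL step, and a VOS propagator. Only the diversity criterion is sketched here (the paper also scores representativeness and confidence), and all component names and the stand-in lambdas are hypothetical, not the released code.

```python
import numpy as np

def segment_video(frames, embed, segment_image, propagate, k=10):
    """Keyframe-based video segmentation, mirroring the three steps above.

    frames:        list of video frames (e.g. numpy arrays).
    embed:         frame -> embedding, used to score keyframe diversity.
    segment_image: frame -> mask, the image-level ICL pipeline (Fig. 1a).
    propagate:     (frames, {idx: mask}) -> per-frame masks via a VOS model.
    """
    # 1) Keyframe selection: spread K keyframes over distinct-looking frames
    #    with a farthest-point heuristic (diversity criterion only).
    embs = np.stack([embed(f) for f in frames])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    keys = [0]
    while len(keys) < min(k, len(frames)):
        redundancy = (embs @ embs[keys].T).max(axis=1)
        redundancy[keys] = np.inf
        keys.append(int(redundancy.argmin()))

    # 2) Keyframe inference: run the image ICL pipeline only on the K keyframes.
    key_masks = {i: segment_image(frames[i]) for i in keys}

    # 3) Mask propagation: standard VOS propagation fills in the other frames.
    return propagate(frames, key_masks)

# toy usage with stand-in components (nearest-keyframe "propagation")
frames = [np.random.rand(8, 8) for _ in range(30)]
masks = segment_video(
    frames,
    embed=lambda f: f.ravel(),
    segment_image=lambda f: f > 0.5,
    propagate=lambda fs, km: [km[min(km, key=lambda i: abs(i - t))]
                              for t in range(len(fs))],
)
print(len(masks))
```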
Q3: Regarding automated inference
Temporal achieves automatic segmentation by prepending informative images and their ground truth masks (as prompts) from the training set to a test image. This leverages the training data to automatically generate necessary prompts, allowing the VOS model (SAM2) to operate without manual guidance at inference.
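A minimal sketch of this automatic prompting, assuming a generic `vos_model` interface (SAM2's actual API differs): the retrieved context images and their ground-truth masks form the prefix of a pseudo-video, and the mask the VOS model carries onto the final frame is the query's segmentation.

```python
def segment_with_context(query_image, retriever, train_set, vos_model, k=4):
    """Fully automatic ICL inference via a pseudo-video (sketch).

    train_set: list of (image, ground_truth_mask) pairs.
    retriever: (query, candidate_images, k) -> indices of useful context images.
    vos_model: (frames, prompt_masks_for_prefix) -> per-frame masks.
    """
    # retrieve k context images and their ground-truth masks from training data
    idx = retriever(query_image, [img for img, _ in train_set], k)
    context = [train_set[i] for i in idx]

    # pseudo-video: context frames first, query frame last
    frames = [img for img, _ in context] + [query_image]
    prompt_masks = [mask for _, mask in context]  # masks only for the prefix

    # the mask "propagated" onto the final frame is the query's segmentation
    all_masks = vos_model(frames, prompt_masks)
    return all_masks[-1]
```

No manual prompt is needed at any point: the training set supplies both the context images and their masks.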
Responses to Reviewer #2
Q1: Regarding evaluation
While a direct comparison with supervised methods would indeed provide a more comprehensive overview of segmentation techniques, the primary focus of our current work is to advance In-Context Learning (ICL) methodologies, particularly for foundation models. This focus on improving existing ICL approaches is a common scope for papers centered on foundation model capabilities. However, we agree that including supervised methods in future comparative analyses would be valuable, and we will consider this for extended evaluations.
Q2: Comparison with fine-tuned ICL methods
Our rationale for the comparison was:
- Gridding ICL methods: Their non-fine-tuned performance was exceptionally low (~5% Dice, Sec. 3.1), making a fine-tuned comparison less meaningful.
- VOS-based ICL methods: We found no existing works with directly comparable fine-tuning strategies for this task.
Responses to Reviewer #3
Q1: Fine-tuning
Yes, we fine-tuned all parameters. Our primary objective with fine-tuning was to showcase its effectiveness in bridging the domain gap and further enhancing the performance of our ICL-based segmentation approach, rather than focusing on parameter-efficient fine-tuning (PEFT) techniques. We acknowledge that methods like LoRA or other PEFT strategies could potentially offer further improvements or efficiencies, and this remains an interesting avenue for future exploration. We will clarify this in the paper.
Meta-Review
Meta-review #1
- Your recommendation
Provisional Accept
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A