Abstract

Deep learning models have shown remarkable performance in medical video object segmentation. However, addressing the cross-center domain shift is crucial for achieving consistent performance across different medical facilities. Emerging Source-Free Active Domain Adaptation (SFADA) techniques can enhance the performance of target-domain segmentation models while ensuring data privacy and security. Current approaches primarily focus on image-level tasks and emphasize intra-frame pixel correlations, but they overlook temporal correlations, which restricts their performance in video frame recommendation. Consequently, this paper proposes the first video-level SFADA method and evaluates it on video polyp segmentation across different data centers. Specifically, a Spatial-Temporal Active Recommendation (STAR) strategy is devised to recommend a few highly valuable frames for annotation by comprehensively evaluating object spatial correlation and temporal movement density across video frames, and a Passive Phase Correction (PPC) module is proposed to suppress noisy source disruptions from the remaining unlabeled data during the fine-tuning stage. Experimental results demonstrate that, with a tiny quantity of annotation, our method significantly improves performance over the lower bound and outperforms existing SOTA methods, which is valuable for practical clinical deployment.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/0725_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiJia_SourceFree_MICCAI2025,
        author = { Li, Jialu and Wang, Hongqiu and Wang, Weiming and Qin, Jing and Wang, Qiong and Zhu, Lei},
        title = { { Source-Free Active Domain Adaptation for Efficient Medical Video Polyp Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15969},
        month = {September},
        pages = {498 -- 508}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper introduces a video-level SFADA method and evaluates it on video polyp segmentation across multiple domains. It comprises a strategy for identifying valuable video frames to annotate and a module for suppressing negative components. Furthermore, the paper constructs a comprehensive multi-center video polyp segmentation dataset.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    1) This paper proposes the first SFADA framework for medical video object segmentation. 2) It organizes the first multi-center video polyp segmentation dataset to support research on this topic.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    1) In Eq. (1), the authors state that “G denotes the number of prediction results for each frame generated by the diffusion model.” However, the “diffusion model” is mentioned only once and lacks a detailed explanation. In Eq. (5), please explain why the sigmoid operation is applied to the phase rather than to the learnable matrix. In Eq. (6), the formula regards X hat as the input of the IFFT operation; please correct it.
    2) In Section 2.4, the authors claim that “WΦ can suppress negative phase components and emphasize valuable components related to the target domain”, but this lacks evidence to support it.
    3) The ablation study presented in Table 4 is insufficient; please conduct a more comprehensive analysis of all possible combinations. Furthermore, in Section 3.3, the authors say that “M3 is constructed by adding the K-order difference calculation based on M2.” The marginal performance gain of M3 over M2 suggests that the proposed K-order difference calculation is ineffective.
    4) In Table 5, all metrics at the 10% ratio (excluding Dice) are identical to those at the 5% ratio. Please verify the results.
    5) In Tables 2 and 3, the compared method FSM is designed only for source-free domain adaptation; please explain how it is employed under the SFADA setup.
    6) The utilized backbone STM is not designed for video polyp segmentation, while there have been many studies in this field. Please consider using a stronger backbone as the baseline.

  • Please rate the clarity and organization of this paper

    Poor

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (2) Reject — should be rejected, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I have to reject this paper due to the errors in formulas, experiment results, and textual expression. For more details, please refer to the major weaknesses.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Reject

  • [Post rebuttal] Please justify your final decision from above.

    I have to reject this paper due to various serious errors acknowledged by the authors.



Review #2

  • Please describe the contribution of the paper
    1. This paper introduces a new multi-centre video polyp segmentation dataset by reorganising existing open-source resources.

    2. This paper introduces a new source-free active domain adaptation approach for medical video object segmentation.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. This paper adapts source-free active domain adaptation to video polyp segmentation, which avoids annotation for the source data.

    2. The experimental results show that the proposed method outperformed existing baselines on two centre tasks.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. The paper compares the proposed method mainly with methods from the medical imaging domain, some of which are not designed for polyp segmentation. A broader technical analysis, including comparisons with general source-free active domain adaptation methods for video, would highlight the value of the proposed approach.

    2. Figure 1 is crowded and difficult to follow and understand. It should be redesigned.

    3. The paper claimed that image-focused methods suffer from sub-optimal frame selection and redundant spatial-temporal representations, which the proposed method can address. However, these two concepts are not clearly defined. The experimental section lacks analyses or comparisons to demonstrate how the proposed method resolves them.

    4. Tables 4 and 5 are reversed.

    5. The paper does not clearly specify which encoder and decoder are used for videos.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The experimental results are promising. The new dataset would be useful. The method performs well on video polyp segmentation and shows potential for extension to other clinical domains.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors have addressed most of my concerns, while I remain concerned about the method’s reusability and robustness, as it has only been evaluated on a single self-built dataset.



Review #3

  • Please describe the contribution of the paper

    This paper introduces the first SFADA framework for medical video object segmentation, with a well-designed STAR module for active frame selection based on spatial-temporal uncertainty and a PPC module for robust learning from unlabeled data. The proposed multi-center dataset (MCVPS) is a valuable addition, and the method shows strong performance across benchmarks, making the contributions both novel and practically relevant.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. Strong motivation rooted in data privacy: Given the sensitivity of medical imaging data, the authors’ focus on domain adaptation and limited supervision is highly relevant. The proposed SFADA framework aligns well with real-world constraints in clinical data sharing, making the problem setup both timely and impactful.

    2. Valuable dataset contribution: The introduction of the MCVPS dataset addresses a critical gap in generalizability and cross-domain evaluation. If released publicly, it would provide a strong foundation for further research in robust polyp segmentation and medical video understanding.

    3. Methodological novelty: The STAR module presents a novel spatial-temporal active recommendation strategy that selects uncertain frames based on object motion and spatial coherence—a unique and effective way to guide annotation efforts in videos. Additionally, the PPC module provides a passive correction mechanism that enhances learning from unlabeled data by mitigating noisy pseudo-label propagation. This dual strategy of active and passive learning within the SFADA framework is well-motivated.

    4. The method explanation is easy to follow.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    1. Both STAR and PPC introduce computationally expensive operations—e.g., clustering over video frames, K-order reliability, FFT/IFFT. There is no discussion on runtime or scalability, which is important for practical use in clinical pipelines.
    2. PPC’s use of frequency-domain phase correction is interesting, but it’s not fully justified why this is preferable to, say, attention-based correction in the spatial domain.
    3. Unclear behavior for different scene types: what happens if motion between frames is minimal (e.g., in static camera views)? Will STAR still identify meaningful differences?
    4. The work shows promising segmentation performance, but it lacks any discussion or experiments on clinical utility or impact—such as how much annotation time is saved or whether the selected frames actually align with clinician preferences.
    5. All the experiments are conducted on the proposed dataset MC-VPS (which could introduce bias). Is the method also adaptable to natural-image datasets such as Cityscapes?
    6. While the ablation study shows gradual improvements, the interaction between the STAR and PPC modules is not deeply analyzed. It’s unclear whether STAR alone is the main driver of improvement or if PPC contributes significantly.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Dataset contribution; the polyp segmentation application is novel for medical video segmentation.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The motivation and clinical application are meaningful. The authors conducted an ablation study that demonstrates the effectiveness of the added components. However, I am not fully confident in the literature coverage in this area. As noted by Reviewer 3, the novelty appears to be somewhat overstated. Additionally, it remains unclear whether there is a trade-off between backbone selection and computational efficiency, which warrants further investigation.




Author Feedback

We appreciate the positive comments on our work (R1, R2) and the first benchmark dataset contribution to medical video SFADA (R1, R2, R3).

Reviewer#1 Q1: Thanks. To the best of our knowledge, no prior video SFADA methods exist. We thus compare with a general SFADA method (Detective, CVPR24). Our Dice, Jaccard, and Sα improve by 1.95%, 3.84%, and 2.17% over Detective, validating our effectiveness for video polyp SFADA. Q2: Thanks. We have carefully redesigned Fig. 1 to enhance its readability. Q3: We sincerely apologize for the writing confusion. The intended meaning is: image-focused SFADA methods emphasize intra-frame pixel correlations but overlook temporal correlations, restricting their performance in temporal video frame recommendation. Q4: Thanks for your careful review. We have corrected the captions of Tables 4 and 5. Q5: Thanks. We use the popular Segformer’s official encoder-decoder.

Reviewer#2 Q1: Thanks. SFADA is an offline task that recommends valuable frames for clinicians. Moreover, STAR recommends frames faster (20.8 FPS) than clinicians can annotate each image. Hence, our method will not affect practical use. Q2: Thanks. With the learnable weight WΦ, PPC can adjust the contributions of low-frequency features (structure, background) and high-frequency features (texture, noise). We have added references in our paper. The ablation experiments between M3 (spatial) and Ours (frequency) evidence this. Q3: Thanks. Although this scenario weakens the temporal relationship, the proposed Cascaded Convincing Prediction and Passive Phase Correction still work for recommending valuable frames. Q4: Thanks. STAR saves 95% of the annotation workload for clinicians and achieves performance close to full annotation (Table 1). Q5: Yes. Our method is adaptable to natural scenarios. Q6: Thanks. We have conducted experiments to evaluate each module’s main contribution. When solely using PPC, the Dice, Jaccard, Sα, and E scores decrease by 2.17%, 3.69%, 2.63%, and 3.6%. Hence, STAR contributes more to our method.
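As described in the rebuttal, PPC weights the phase spectrum of a feature map with a sigmoid-activated learnable matrix WΦ, keeps the amplitude spectrum, and reconstructs the feature via the inverse FFT. The following is a minimal NumPy sketch of that idea only; the function name, the shape of `w_phi`, and the 2D real-valued input are our assumptions, not the paper’s actual implementation:

```python
import numpy as np

def passive_phase_correction(x, w_phi):
    """Sketch of the PPC idea: gate the phase spectrum with sigmoid(W_phi),
    keep the amplitude, and reconstruct with the inverse FFT.
    `x` is a real 2D feature map; `w_phi` has the same shape (assumed)."""
    X = np.fft.fft2(x)                          # forward FFT of the feature map
    amplitude = np.abs(X)                       # amplitude spectrum (kept as-is)
    phase = np.angle(X)                         # phase spectrum
    gate = 1.0 / (1.0 + np.exp(-w_phi))         # sigmoid applied to the learnable weights
    corrected_phase = gate * phase              # suppress noisy phase components
    X_hat = amplitude * np.exp(1j * corrected_phase)  # recombine amplitude and phase
    return np.real(np.fft.ifft2(X_hat))         # inverse FFT back to the spatial domain
```

When `w_phi` is driven strongly positive, the sigmoid gate approaches 1 and the input is reconstructed almost unchanged; smaller weights progressively suppress the corresponding phase components, which matches the stated role of WΦ.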

Reviewer#3 Q1: Thanks for your careful review. 1) The “diffusion model” is based on the popular Segformer backbone with a DDIM process (similar to the backbone in TBGDiff, MM24). 2) We apologize for these writing errors; the sigmoid operation is applied to the phase feature. We have corrected the errors in Eqs. 5 and 6. Q2: WΦ with the sigmoid operation can adjust the contributions of low-frequency features (structure, background) and high-frequency features (texture, noise). (Ref: Fourier Transform: The Behavior of the Image in the Frequency Domain). Q3: Thanks for the insights. 1) We have followed your advice and conducted more comprehensive ablation experiments. 2) Sorry for the confusion. Yes: M2 and M3 are parallel ablation settings rather than progressive ones. M3 does not introduce any new modules or contributions; it aims to validate how different Spatial-Temporal Reliability (R) representations affect frame selection, not the difference operation alone. Key difference: M2 directly uses R, while M3 applies difference operations to analyze temporal fluctuations. In Center A, M3 improves Dice/Jaccard by 0.91%/0.52% over M2, proving that R’s temporal analysis filters more representative frames and underscoring STAR’s necessity for video SFADA. Q4: We sincerely apologize for the copy-paste error. We have corrected the table after re-verifying all the records. Q5: We compare with the official FSM as an SFDA comparison. Since we focus on the cross-domain scenario, we compare with both SFDA and SFADA methods for a comprehensive comparison, following the previous SFADA paper STDR-TMI24. Q6: Thanks. Advanced backbones can improve performance. However, the purpose of SFADA is to recommend valuable video frames for clinical annotation and reduce the workload. Thus, we follow prior SFADA works that use a common backbone (STDR-TMI, CUP-MICCAI, and UGTST-MICCAI use UNet for image SFDA) as the baseline to isolate our method’s effectiveness. We will clarify this in our paper.
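As we read the Q3 exchange above, the K-order difference in ablation setting M3 amounts to applying the first-order difference operator K times to the per-frame Spatial-Temporal Reliability sequence R, so that temporal fluctuations rather than raw reliability values drive frame selection. A minimal sketch under that reading (the function name and signature are hypothetical, not from the paper):

```python
import numpy as np

def k_order_difference(reliability, k):
    """Apply the difference operator K times to a per-frame reliability
    sequence, exposing temporal fluctuations between adjacent frames."""
    seq = np.asarray(reliability, dtype=float)
    for _ in range(k):
        seq = np.diff(seq)  # each pass shortens the sequence by one element
    return seq
```

For example, for the sequence `[1, 2, 4, 7, 11]`, one difference pass yields `[1, 2, 3, 4]` and a second pass yields `[1, 1, 1]`, so frames where the higher-order differences are large correspond to abrupt reliability changes.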




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The paper proposes the first SFADA framework for medical video segmentation and releases a multi-centre polyp dataset (MC-VPS). Its STAR module actively selects the most informative frames, while PPC refines unlabeled clips via frequency-domain phase correction, cutting annotation to 5–10% yet boosting Dice/Jaccard/Sα/E over SFDA/SFADA baselines.

    The initial ratings were 4 (WA), 4 (WA), 2 (R). After rebuttal, two reviewers moved to Accept; one stayed at Reject, but only on presentation issues. The rebuttal fixes notation/table errors, adds runtime data (≈21 fps selection; negligible overhead offline), and details PPC’s frequency weighting. Remaining concerns are editorial, not methodological.

    Given the clear performance gains, the valuable dataset and two firm Accepts, the AC recommends accept (poster). Camera-ready: include timing figures, corrected equations/tables and an expanded related-work section on recent diffusion-based UAD and natural-scene SFADA.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A
