Abstract

Medical image segmentation is crucial for accurate diagnosis and effective treatment planning. However, in cross-domain semi-supervised segmentation, the scarcity of labeled data often leads to suboptimal performance and poor generalization across diverse medical imaging domains. Moreover, pseudo-labels generated from unlabeled data are inherently noisy, introducing confirmation bias that destabilizes training and hinders the model’s ability to accurately capture complex anatomical structures. To address these challenges, we propose HARP: Harmonization and Adaptive Refinement of Pseudo-Labels for Cross-Domain Medical Image Segmentation, a framework designed to enhance segmentation performance by integrating two novel modules: the Adaptive Pseudo-label Selection (APS) module and the Cross-Domain Harmonization (CDH) module. The APS module ensures the quality and reliability of pseudo-labels by using a confidence-based filtering mechanism and an iterative refinement strategy. The CDH module uses matrix decomposition to harmonize differences across medical imaging modalities, enhancing data diversity while preserving domain-specific features and improving the model’s adaptability to varying imaging protocols for robust performance across diverse medical datasets. Extensive experiments on three medical datasets demonstrate the effectiveness of HARP, achieving significant improvements across multiple evaluation metrics. The source code is available at https://github.com/lbllyl/HARP.
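For intuition only, the following minimal sketch illustrates the kind of SVD-based mixing the CDH module builds on: the singular values of two single-channel images from different domains are blended while one image's singular vectors, and thus its spatial structure, are kept. The function name, the mixing coefficient `lam`, and the per-image SVD are illustrative assumptions, not the released implementation.

```python
import numpy as np

def svd_mixup(img_a: np.ndarray, img_b: np.ndarray, lam: float = 0.5) -> np.ndarray:
    """Illustrative SVD-based mixup of two single-channel images of equal size.

    Keeps the singular vectors (spatial structure) of img_a and blends its
    singular values with those of img_b, which mostly carry global
    intensity/contrast statistics. Sketch of the general idea only, not the
    authors' CDH implementation.
    """
    u_a, s_a, vt_a = np.linalg.svd(img_a.astype(np.float64), full_matrices=False)
    _,   s_b, _    = np.linalg.svd(img_b.astype(np.float64), full_matrices=False)
    s_mix = lam * s_a + (1.0 - lam) * s_b        # blend spectra across domains
    mixed = (u_a * s_mix) @ vt_a                 # reconstruct with img_a's structure
    return np.clip(mixed, img_a.min(), img_a.max())

# Usage (hypothetical): harmonized = svd_mixup(source_slice, target_slice, lam=0.7)
```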

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/1952_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/lbllyl/HARP

Link to the Dataset(s)

N/A

BibTex

@InProceedings{LiuYul_HARP_MICCAI2025,
        author = { Liu, Yulong and Ye, Wenqing and Liu, Hui and Chen, Ziyi and Li, Peilin and Xu, Ronald X. and Sun, Mingzhai},
        title = { { HARP: Harmonization and Adaptive Refinement of Pseudo-Labels for Cross-Domain Medical Image Segmentation } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15965},
        month = {September},
        pages = {316--326}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper proposes HARP: Harmonization and Adaptive Refinement of Pseudo-Labels for Cross-Domain Medical Image Segmentation, a framework designed to enhance segmentation performance by integrating two novel modules: the Adaptive Pseudo-label Selection (APS) module and the Cross-Domain Harmonization (CDH) module.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The work is compared with recent state-of-the-art methods on three datasets, and the quantitative results show promising performance in the segmentation tasks.

    The novelty is clear: the Adaptive Pseudo-label Selection (APS) module improves pseudo-labels through confidence filtering and iterative refinement, while the Cross-Domain Harmonization (CDH) module reduces domain gaps by aligning features using Singular Value Decomposition (SVD). This is indeed crucial for addressing the lack of labeled data in the medical image computing (MIC) domain.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    Although the proposed method shows good improvement in quantitative results, the paper lacks qualitative results demonstrating effective segmentation on the downstream task; these are only mentioned in text in the last paragraph before the conclusion. It would be better if the authors could compare with the state of the art in this context.

    Some clarifications should be added: 1) What does a 5% labeled budget mean? 2) How is the threshold for confidence scores decided? 3) What is the computational time of the overall processing?

    Minor: the heading “Overall Pipeline.” should not end with a full stop.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (3) Weak Reject — could be rejected, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The quantitative results are convincing, but the paper lacks qualitative results to prove that the segmentation works well.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors addressed the points of confusion raised in the review.



Review #2

  • Please describe the contribution of the paper

    This paper filters and refines pseudo-labels by leveraging agreement between local and global models, using a confidence score based on Intersection-over-Union and Fréchet distance. K-means clustering on the model confidence scores is then used to find appropriate thresholds separating high, medium, and low confidence. Cross-Domain Harmonization (CDH) bridges domain gaps via Singular Value Decomposition (SVD)-based mixup of images from different domains, creating harmonized samples that enhance domain generalization while preserving semantic content, effectively increasing data diversity and reducing distributional discrepancies.
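For concreteness, one plausible sketch of such an agreement-based confidence score is given below, assuming the intersection term is a plain IoU between the local and global pseudo-labels and the Fréchet term is a discrete Fréchet distance between their boundary contours, normalized by the image diagonal; the exact weighting in the paper's Equation (2) is not reproduced here, and all function names are illustrative.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union > 0 else 1.0

def discrete_frechet(p: np.ndarray, q: np.ndarray) -> float:
    """Discrete Fréchet distance between two boundary polylines (N x 2, M x 2)."""
    n, m = len(p), len(q)
    d = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)  # pairwise distances
    ca = np.full((n, m), np.inf)
    ca[0, 0] = d[0, 0]
    for i in range(1, n):
        ca[i, 0] = max(ca[i - 1, 0], d[i, 0])
    for j in range(1, m):
        ca[0, j] = max(ca[0, j - 1], d[0, j])
    for i in range(1, n):
        for j in range(1, m):
            ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d[i, j])
    return float(ca[-1, -1])

def confidence(mask_local, mask_global, contour_local, contour_global, diag_len):
    """Hypothetical agreement score: high IoU and low normalized Fréchet distance."""
    s = 1.0 - discrete_frechet(contour_local, contour_global) / diag_len
    return 0.5 * (iou(mask_local, mask_global) + max(s, 0.0))
```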

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. It is novel to use an SVD-based mixup technique to enhance generalization while preserving domain-specific semantics.

    2. HARP consistently outperforms both semi-supervised learning (SSL) and active learning (AL) baselines across three diverse datasets: Fundus, Prostate, and M&MS.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    There is no related work section. There are many existing methods that use model confidence to improve pseudo-label quality; this paper should compare with them in an ablation study or at least discuss the differences. In addition, using SVD to bridge the domain gap is interesting, but an ablation study should be conducted comparing the SVD-based approach with alternatives such as Fourier-transform-based methods.

  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The idea of using SVD-based mixup to mitigate the domain gap is interesting and novel, but the lack of comparison with existing methods weakens this paper.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    The paper presents an approach for semi-supervised, cross-domain image segmentation. The method combines a so-called “adaptive pseudo-label selection module”, selecting samples with low confidence scores, with a “cross-domain harmonization module”, using singular value decomposition to harmonize domains. The method is demonstrated on fundus, prostate, and heart segmentation datasets. The proposed approach outperforms several active learning and semi-supervised learning baselines.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The fairly extensive set of experiments suggests that the proposed method performs better than the four baselines. The ablation study shows that both of the proposed components contribute to the performance.

    • The combination of active learning and confidence-based sampling is interesting, especially in this cross-domain context. Using the local and global models to obtain the confidence estimates seems to be a nice approach.

    • The authors promise to share their code.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
    • The presentation of the method (Section 2) is quite detailed, but incomplete. For example, the Fréchet distance (Equation 2) uses “non-decreasing reparameterizations of the pseudo-labels”, but what these are is not defined. On page 5, the clusters are suddenly combined with undefined “thresholds” and an “annotation budget”. This is not very reproducible.

    • The choices of the methods could be supported by more arguments. For example, why use a clustering method on the 1-D confidence scores, instead of computing the thresholds directly? Why should we expect to find exactly three clusters? And if we already have an “Intersection score”, why do we also need a “similarity measure” to evaluate “spatial alignment” (page 4). Isn’t that the same?

    • The components of the proposed approach are somewhat independent, but this is not fully reflected in the ablation study. Do we really need the local/global models to do the active learning part? There are other, possibly easier criteria to select samples. And vice versa, we could use local/global models without the active learning part. The use of pseudo labels is also somewhat independent of the active learning and local/global models. It would have been interesting to investigate these components on a more individual basis.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    Some suggestions for improving the paper:

    • Figure 2:

    Figure 2: It is not entirely clear to me what we see here. The images in rows a, b, and c look fairly similar. The first row supposedly shows the labels, but what are they? The very thin circles are hard to see. For row c, what is the “increased diversity”? I see some minor differences in intensity, but are they more diverse? Are the “inherited masks” mentioned in the caption the same as the “labels” from the labeled data?

    Figure 2: Shouldn’t we expect to see images from different domains?

    • Page #4:
    • F(ŷ_d, ŷ_u)

    Possible typo: There’s a hyphen before F(y_d, y_u). Is this on purpose?

    where α and β are continuous, non-decreasing reparameterizations of the pseudolabels, and d is the distance metric, L is the image diagonal length, used to normalize the Fréchet distance

    Equation (2) could be explained better. How does the reparameterization work? What is t?

    The similarity measure S(ŷ_d, ŷ_u) evaluates their spatial alignment, ranging from 0 to 1, where higher values denote greater similarity.

    What does this mean, exactly? Doesn’t the intersection score already evaluate the spatial alignment of the segmentations? If they aren’t aligned, the intersection score should be lower. The Fréchet distance should be defined and explained in more detail.

    we employ the k-means clustering algorithm, grouping the scores into high, medium, and low confidence

    Why use clustering instead of a threshold? The confidence score is a 1-D score. Why would we expect to find exactly three clusters?

    • Page #5:

    Here, C1, C2, and C3 represent the clusters for high, medium, and low confidence, respectively. Pseudo-labels with confidence scores above the threshold T = max(T_large, 1 − T_small) are retained as reliable data.

    How do these thresholds relate to the three clusters? The terminology is somewhat confusing: first there are the three numbered clusters C1, C2, C3, which then turn out to be high, medium, and low (I assume they are renumbered to make this work?), and now we suddenly introduce thresholds T, T_large and T_small.

    Where do T_large and T_small come from?

    the annotation budget for each domain is set N to B_k

    Where does B_k come from? What does it represent? What is k, even? (The only k defined so far was used in k-means clustering, but I assume that is not related to the budget.)

    By prioritizing low-confidence samples, the module ensures that the most challenging and informative examples are included in the training process, optimizing the use of limited annotation resources.

    This is an active learning question, for which there is a large amount of existing work. Are low-confidence samples indeed the most informative? And shouldn’t this be an iterative process, since the confidence might change after each new labeled sample?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (5) Accept — should be accepted, independent of rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is an interesting paper with a nice combination of methods. The evaluation suggests that it works. The paper is quite readable but missing some details. The architecture is slightly complicated for my taste: there are many components, some of which are only loosely related and could easily be removed or replaced with something else.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    I’d like to thank the authors for their responses. I am still satisfied with the paper, and I think the promised clarifications will lead to an improved version.




Author Feedback

First, we express our gratitude to all reviewers for the suggestions. We address the comments as follows:

R1.1) Section 2’s method is incomplete? T_large is the average of C1’s minimum confidence and C2’s maximum confidence, while T_small is the average of C2’s minimum confidence and C3’s maximum confidence. This ensures high-quality pseudo-labels under varying confidence levels without manual tuning. We will clarify these terms and formulas in the final version.
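For concreteness, a minimal sketch of this clustering-based thresholding is shown below, assuming scikit-learn's k-means on the 1-D confidence scores and the retention rule T = max(T_large, 1 − T_small) described in R2.2; the cluster ordering and the function name are illustrative assumptions, not the released implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_reliable(scores: np.ndarray) -> np.ndarray:
    """Cluster 1-D confidence scores into high/medium/low groups and derive the
    retention threshold T = max(T_large, 1 - T_small), as described in R1.1.
    Returns a boolean mask over the pseudo-labels. Illustrative sketch only."""
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(scores.reshape(-1, 1))
    # Order clusters by mean score: C1 = high, C2 = medium, C3 = low confidence.
    order = np.argsort([-scores[labels == k].mean() for k in range(3)])
    c1, c2, c3 = (scores[labels == order[i]] for i in range(3))
    t_large = 0.5 * (c1.min() + c2.max())   # boundary between high and medium
    t_small = 0.5 * (c2.min() + c3.max())   # boundary between medium and low
    t = max(t_large, 1.0 - t_small)
    return scores >= t
```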

R1.2) The choices of the methods? We tested various confidence score methods (fixed thresholds, top-n sampling) and found clustering-based thresholding adapts to diverse dataset confidence distributions, eliminating manual tuning. Three clusters (high-confidence, low-confidence, uncertain) balance pseudo-label quality: two clusters risk retaining poor labels, while more than three complicate the process. The similarity metric is used to filter out small “spatial noise” around pseudo-labels, such as small and isolated regions that were found harmful to training in experiments. We will include this in the ablation experiment.

R1.3) The components are independent. As noted in the paper’s conclusion, we designed these modules to be plug-and-play with complementary components for synergistic benefits. Their applicability extends to cross-domain and downstream tasks. We appreciate your suggestion and will further validate each module’s effectiveness.

R1.4) Suggestions and potential typos. For Figure 2, domain distinctions and label/mask descriptions will be clarified in the final version. In Equation (2) on Page 4, the missing explanation of reparameterization and variable t will be supplemented. Regarding the notation B_k, we regret the confusion and will clarify that k denotes different domains with B as the total annotation budget. Thanks for your careful review.

R1.5) An iterative process? We initially implemented an iterative process but found it yielded similar results to non-iterative methods while introducing extra computational costs from per-iteration filtering. Thus, we abandoned iteration and adopted a post-convergence judgment method instead.

R2.1) What does it mean by 5% labelled budget? This refers to the manual annotation proportion in an active learning setting. Testing budgets from 1% to 10% showed that 1% gave weaker performance due to insufficient labeling, while 10% incurred higher annotation costs; 5% was therefore chosen as a balanced compromise. Baselines used the same 5% ratio for fairness, and the ratio is adaptable in practice. Qualitative results demonstrating effective segmentation will be added.

R2.2) How threshold for confidence scores is decided? As clarified in R1.1, T_large and T_small are calculated as averages, with T = max(T_large, 1 − T_small) for optimal pseudo-labels without manual tuning. This typo will be corrected.

R2.3) The computational time? Data synthesis takes dozens of minutes, and total training time is on the order of hours, which we consider acceptable. Synthesized data can be saved and reused without repeated processing, including for downstream tasks, reducing overhead.

R2.4) Remove the full stop. This typo will be corrected. Thanks for your careful review.

R3.1) Missing related work. Existing confidence-based methods ([1], [2]) face challenges like fixed-threshold inefficiency and noise interference. Our method integrates semi-supervised learning and active learning with dynamic threshold adaptation and noise-robust data synthesis to boost performance under poor data quality. Compared to Fourier-based methods [3], SVD captures high-dimensional semantic features for cross-domain fusion and editing, while Fourier focuses on low-level frequencies. We will add related work and SVD vs. Fourier ablation experiments.
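For reference, the Fourier-based alternative [3] swaps the low-frequency amplitude spectrum of a source image with that of a target image while keeping the source phase. A minimal sketch follows; the band-size parameter beta and the function name are illustrative, and the snippet is not part of the authors' code.

```python
import numpy as np

def fda_amplitude_swap(src: np.ndarray, trg: np.ndarray, beta: float = 0.05) -> np.ndarray:
    """FDA-style transfer: replace the low-frequency amplitude of `src` with that
    of `trg`, keeping the phase of `src` (Yang, FDA, 2020). Sketch for
    single-channel images of equal size; `beta` sets the swapped band."""
    fft_src = np.fft.fftshift(np.fft.fft2(src.astype(np.float64)))
    fft_trg = np.fft.fftshift(np.fft.fft2(trg.astype(np.float64)))
    amp_src, pha_src = np.abs(fft_src), np.angle(fft_src)
    amp_trg = np.abs(fft_trg)

    h, w = src.shape
    b = int(np.floor(min(h, w) * beta))
    ch, cw = h // 2, w // 2
    # Swap only the centered low-frequency band of the amplitude spectrum.
    amp_src[ch - b:ch + b, cw - b:cw + b] = amp_trg[ch - b:ch + b, cw - b:cw + b]

    mixed = amp_src * np.exp(1j * pha_src)
    out = np.fft.ifft2(np.fft.ifftshift(mixed))
    return np.real(out)
```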

[1] Yoon, Enhancing source-free domain adaptive object detection with low-confidence pseudo label distillation, 2024. [2] Huang, Divide and adapt: Active domain adaptation via customized learning, 2023. [3] Yang, FDA: Fourier domain adaptation for semantic segmentation, 2020.




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    N/A

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    After the rebuttal phase, all three reviewers reached a consensus in recognizing the contribution of this paper and supporting its acceptance. The authors are encouraged to revise the paper by addressing the reviewers’ comments and incorporating clarifications provided during the rebuttal to further improve its quality.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers appreciate the interesting technical design and the extensive experimental validation. Therefore, an acceptance is given.


