Abstract
Multi-site brain MRI heterogeneity caused by differences in scanner field strengths, acquisition protocols, and software versions poses a significant challenge for consistent analysis. Image-level harmonization, leveraging advanced learning methods, has attracted increasing attention. However, existing methods often rely on paired data (e.g., human traveling phantoms) for training, which are not always available. Some methods perform MRI harmonization by transferring target-style features to source images but require explicitly learning disentangled image styles (e.g., contrast) via encoder-decoder networks, which increases computational complexity. This paper presents an unpaired MRI harmonization (UMH) framework based on a new image style-guided diffusion model. UMH operates in two stages: (1) a coarse harmonizer that aligns multi-site MRIs to a unified domain via a conditional latent diffusion model while preserving anatomical content; and (2) a fine harmonizer that adapts coarsely harmonized images to a specific target using style embeddings derived from a pre-trained Contrastive Language-Image Pre-training (CLIP) encoder, which captures semantic style differences between the original MRIs and their coarsely-aligned counterparts, eliminating the need for paired data. By leveraging rich semantic style representations of CLIP, UMH avoids learning image styles explicitly, thereby reducing computation costs. We evaluate UMH on 4,123 MRIs from three distinct multi-site datasets, with results suggesting its superiority over several state-of-the-art (SOTA) methods across image-level comparison, downstream classification, and brain tissue segmentation tasks.
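To make the two-stage design described above easier to picture, here is a minimal, hypothetical PyTorch sketch of the stage-1 flow. The module names, shapes, 2D toy inputs, and the crude sampling loop are all illustrative assumptions, not the authors' architecture (which, per the rebuttal, uses a 3D autoencoder, a latent-diffusion UNet, and DDIM sampling).

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Toy stand-in for the latent autoencoder (the paper uses a 3D one)."""
    def __init__(self, ch=8):
        super().__init__()
        self.enc = nn.Conv2d(1, ch, 3, stride=2, padding=1)           # image -> latent
        self.dec = nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1)  # latent -> image

    def encode(self, x):
        return self.enc(x)

    def decode(self, z):
        return self.dec(z)

class TinyDenoiser(nn.Module):
    """Toy stand-in for the conditional latent-diffusion UNet."""
    def __init__(self, ch=8):
        super().__init__()
        self.net = nn.Conv2d(2 * ch, ch, 3, padding=1)  # concat(noisy latent, condition)

    def forward(self, z_t, cond):
        return self.net(torch.cat([z_t, cond], dim=1))

def coarse_harmonize(x, ae, denoiser, steps=4):
    """Stage 1 (sketch): denoise a latent conditioned on the source image's
    latent, so anatomy is preserved while site style is pushed toward a
    unified domain. The loop below is a crude placeholder for DDIM sampling."""
    cond = ae.encode(x)
    z = torch.randn_like(cond)
    for _ in range(steps):
        z = z - 0.5 * denoiser(z, cond)
    return ae.decode(z)

if __name__ == "__main__":
    ae, denoiser = TinyAutoencoder(), TinyDenoiser()
    x = torch.randn(1, 1, 64, 64)  # toy 2D "slice"; real inputs are 3D MRIs
    y_coarse = coarse_harmonize(x, ae, denoiser)
    # Stage 2 would then adapt y_coarse toward a specific target site using
    # CLIP-derived style embeddings, as described in the abstract.
    print(y_coarse.shape)  # torch.Size([1, 1, 64, 64])
```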
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/3175_paper.pdf
SharedIt Link: Not yet available
SpringerLink (DOI): Not yet available
Supplementary Material: Not Submitted
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{WuMen_Unpaired_MICCAI2025,
author = { Wu, Mengqi and Yu, Minhui and Lin, Weili and Yap, Pew-Thian and Liu, Mingxia},
title = { { Unpaired Multi-Site Brain MRI Harmonization with Image Style-Guided Latent Diffusion } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
year = {2025},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15962},
month = {September},
pages = {693 -- 703}
}
Reviews
Review #1
- Please describe the contribution of the paper
This paper proposes an unpaired multi-site MRI harmonization framework that combines latent diffusion models with style guidance from pretrained CLIP encoders. The method is a two-stage harmonization model with coarse and finer harmonization. It achieves promising results across multiple datasets and downstream tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- I appreciate the authors’ effort in clearly formulating the problem and presenting the method in a mathematically coherent way. Additionally, the harmonization challenge they tackle is highly relevant.
- The experimental evaluation is relatively comprehensive for a harmonization/image synthesis paper. It includes both voxel-level image comparisons and downstream task performance (e.g., segmentation, classification), which helps demonstrate the practical impact of the proposed method.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
I have serious concerns about the core motivation and underlying assumptions of the paper. In both the abstract and introduction, the authors claim that existing harmonization methods rely on computationally intensive content/style disentanglement. However, the paper does not actually address this computational issue; in fact, it may make it worse.
The method introduces a two-stage pipeline involving latent diffusion and CLIP-guided fine harmonization, which is arguably more complex and heavier than some of the end-to-end disentanglement-based approaches it critiques [4, 8, 29]. Yet, the paper doesn’t include any comparisons or discussion of its own computational overhead relative to those baseline methods. This contradiction between the stated motivation and the actual method design undermines the overall credibility of the work and raises serious concerns about whether the proposed approach is truly more efficient or practical than what already exists.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
- While I appreciate the motivation to move away from paired data and high computational burden, I encourage the authors to revisit how the proposed method aligns with that goal. The current two-stage setup, especially with the use of latent diffusion and CLIP-based style embeddings, may introduce comparable or even greater complexity than the methods it aims to improve upon. It would help to explicitly report training time, inference speed, and memory usage, and to compare these metrics against existing harmonization methods, particularly encoder-decoder-based disentanglement approaches. This would give a more complete picture of the trade-offs and practical value of the framework.
- If the primary critique of prior work is computational intensity, it might be worth considering how to simplify or streamline the current model (perhaps via joint training or more lightweight alternatives to the CLIP encoder). Even if full simplification isn’t feasible now, discussing potential directions for reducing overhead would strengthen the motivation and clarify the long-term vision of your approach.
- This is not a critique, but rather a comment for future work. The paper has a reasonable list of comparison methods for a conference paper, but some of the comparison methods are no longer state of the art. Some other methods to consider comparing against are [1-3].
- I appreciate the authors’ efforts in this work, and I enjoyed reading it. Some high-level thoughts for the authors to consider in future work: the current method focuses on single-contrast harmonization. While I appreciate that this is the focus of some of the existing literature, I’d encourage the authors to consider how this approach might be extended to multi-contrast MRI in the future. Many real-world neuroimaging workflows include multiple modalities (e.g., T1, T2, FLAIR), which provide complementary information. Leveraging these jointly could improve harmonization performance and robustness, especially in settings with variable image quality or missing sequences [2]. Expanding the framework to handle multi-contrast input could significantly broaden the impact and clinical relevance of the work.
[1] Wu et al., "Disentangled latent energy-based style translation: An image-level structural MRI harmonization framework," Neural Networks, 2025.
[2] Zuo et al., "HACA3: A unified approach for multi-site MR image harmonization," Computerized Medical Imaging and Graphics, 2023.
[3] Xu et al., "SiMix: A domain generalization method for cross-site brain MRI harmonization via site mixing," Medical Image Analysis, 2024.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(3) Weak Reject — could be rejected, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The paper presents a creative approach to unpaired MRI harmonization using a latent diffusion model guided by pretrained CLIP embeddings. The problem is important and the overall framework is interesting, but the motivation is somewhat inconsistent with the actual method design. Specifically, the paper critiques existing approaches for being computationally intensive, yet proposes a two-stage pipeline that may be equally or more demanding, without reporting any overhead comparisons. With clearer justification, stronger evaluation, and a more explicit framing of the trade-offs, this work could be much stronger.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
I thank the authors for their response. My main concerns were originally about the computational cost of the proposed method and how it fits with the overall claim and motivation, namely that existing methods are usually quite expensive to run. I appreciated the novelty and overall quality of the work, but these concerns led me to an initial weak reject, pending rebuttal.
During the rebuttal, the authors provided detailed comparisons of the number of parameters in each model, and the results were convincing. The paper could be a valuable contribution to MICCAI, and I therefore recommend acceptance after rebuttal. Depending on the AC’s final decision and space, I recommend the authors include the number of parameters in the final version of the paper to demonstrate its computational efficiency.
Review #2
- Please describe the contribution of the paper
This paper introduces a novel unpaired multi-site brain MRI harmonization framework that consists of a two-stage process. First, a coarse harmonizer maps MRI images from various sites to a unified intermediate domain using instance normalization and style-agnostic training. Second, a fine-tuning stage adapts the harmonized image to a specific target domain using a CLIP-based semantic loss. The proposed method is evaluated across multiple levels: image similarity, histogram alignment, and performance on downstream tasks.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
- The idea of first mapping all source domain images to a shared unified domain, followed by fine-tuning to a specific target, is compelling and introduces a clear decoupling between global and target-specific harmonization. This strategy has the potential to increase generalizability across diverse domains.
- The inclusion of a CLIP loss is novel in this context and helps to preserve semantic content, which can be particularly beneficial for maintaining anatomical and clinical relevance in harmonized images.
- The paper presents extensive evaluation, including image-level, histogram-based, and downstream task performance comparisons. The proposed method consistently achieves state-of-the-art performance, demonstrating its effectiveness.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
- While the framework design is well-motivated, the paper would benefit from a more explicit validation of the coarse harmonizer’s ability to consistently map multiple domain images to a truly unified style. Some ablation or visual analysis would strengthen this claim.
- In Figure 3, the visualization could be improved by placing the output of the proposed method directly next to the target domain images. This would enhance interpretability and make it easier to assess visual similarity.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
N/A
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The concept of decoupled harmonization through a coarse-to-fine pipeline is original and effectively implemented. The use of CLIP loss to guide the fine-tuning stage adds semantic depth to the output images, which is an innovative touch. The method is validated thoroughly through both qualitative and quantitative metrics, including downstream task performance. While some aspects could benefit from further clarity and validation, the overall contribution is strong, and the approach shows clear promise for real-world harmonization scenarios.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
Accept
- [Post rebuttal] Please justify your final decision from above.
The rebuttal addressed most of my concerns and I tend to vote for acceptance. Please include the rebuttal contents to a final version.
Review #3
- Please describe the contribution of the paper
This paper proposes a denoising diffusion model in an auto-encoder latent space (Latent Diffusion) for harmonized image generation, using image conditioning to change site variation to the desired target site.
- Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
This submission checks many of the boxes of a correct harmonization submission: 1) Traveling subject experiments in SRPBS, 2) Post-hoc site-wise classification, 3) Downstream relevance and “biological signal” preservation, both in image space (Dice similarity of tissue maps) and in a scalar statistic (BrainAge). It presents results that claim moderate improvements over previous methods, and these facts combined mean that the submission merits publication.
- Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.
Section 2 is quite dense. It is not clear what the curled brace notation indicates, and overall most of the notation is introduced without purpose or later use. It is also heavy with abbreviations, which, while perhaps necessary, do not make the section any more readable. I think in part this can be blamed on the program committee’s strange and foolish decision to perform appendectomies on the MICCAI proceedings, but nevertheless, it would be better to focus on specific innovations. Is the entire sampling section necessary, since it appears to be exactly the same as the DDIM paper, which itself is a small refinement on the original highly cited DDPM paper?
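For context, the deterministic DDIM update the reviewer alludes to (from Song et al., "Denoising Diffusion Implicit Models," ICLR 2021) has the standard form below; this is reproduced from the original literature, not copied from the paper under review:

$$
x_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_\theta(x_t, t)}{\sqrt{\bar{\alpha}_t}} \right) + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \epsilon_\theta(x_t, t) + \sigma_t \epsilon_t,
$$

where $\bar{\alpha}_t$ is the cumulative noise schedule, $\epsilon_\theta$ is the learned noise predictor, and setting $\sigma_t = 0$ yields the deterministic sampler.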
I think there is also some contradiction here: at the end of the Problem Formulation paragraph the authors state: “This two-stage process allows unseen MRIs from a new site to be harmonized by only fine-tuning the second stage.” However, they then go on to describe harmonization for the coarse-grained (first) stage, and finetuning of that stage. If this is necessary, is the claim that this harmonization is universal? If so, is there evidence for that? As far as I can tell, all experiments train both stages, so we have no validation of the claims about coarse-grained stage generalization.
- Please rate the clarity and organization of this paper
Satisfactory
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html
Figure 4 is a valiant attempt but part b is somewhat unreadable.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.
(4) Weak Accept — could be accepted, dependent on rebuttal
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The experimental procedure seems sound; the method is poorly explained and not particularly innovative, but it works and has seeming novelty (…albeit simply filling out the outer product space between guided denoising diffusion and harmonization). Thus, it is publishable, and fully acceptable with a better Section 2.
- Reviewer confidence
Very confident (4)
- [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.
N/A
- [Post rebuttal] Please justify your final decision from above.
N/A
Author Feedback
We greatly appreciate the constructive feedback from the AC and Reviewers. We are encouraged by the many positive comments on our “novel” and “creative” method design (R1,R2,R3), the “highly relevant” problem we tackle (R2,R3), our “promising” and “strong” contribution to real applications (R1), and our “thorough” and “extensive” evaluations (R1,R2,R3). We address major concerns below.
R1: Validate coarse-harmonizer (CH) unifies styles across domains; recommend figure update
- Our ablated variant UMH-S does validate the coarse harmonizer by omitting the CLIP-based style translation loss. As shown in Fig. 4(a), it achieves a WD score of 0.015—lower than Baseline (0.041), CycleGAN (0.028), and HF (0.016) in Table 1—on the same SRPBS data, indicating that our coarse harmonizer alone reduces inter-site intensity differences.
- t-SNE results (not shown due to page limit) indicate tighter same-subject feature clusters across sites. For example, cluster centers shift from (−71.0, −23.7) and (87.4, 63.6) to (−14.0, −9.4) and (11.6, −6.0) for Sites ATT and HUH, respectively. Our CH reduces the cluster-center distance from 180.9 to 25.8.
- As suggested, we’ll update Fig. 3 in final version.
R2: Computational overhead; recommendation for future work
- We respectfully disagree that our method is “more complex.” As stated at the end of the Problem Formulation and Materials, our two-stage scheme requires the 1st-stage CH to be trained only once on OpenBHB. This enables the model to be reused across new sites via lightweight 2nd-stage fine-tuning, without retraining the entire model. The CLIP encoder is pre-trained [1] and fixed.
- Our UMH uses a lightweight latent-diffusion UNet and a 3D autoencoder (trained once) with 3M and 3.3M trainable parameters, respectively, yielding a total of 6.3M, which is smaller than CycleGAN (28.3M), StyleGAN (161.3M), CycleGAN3D (22.6M), and DDPM (10.3M), and comparable to HF (5.7M). Training UMH on SRPBS takes 6.5 hours (3H for the 1st-stage on OpenBHB, 2.5H for 2nd-stage fine-tuning), faster than CycleGAN (16.2H), StyleGAN (10.5H), HF (13.4H), and CycleGAN3D (12.4H), and comparable to DDPM (5.5H). All training was performed on an H100 GPU.
- As noted in Conclusion, we are actively extending UMH to multi-contrast MRI harmonization with class-conditioned diffusion models. We’ll compare it with the recommended recent methods.
[1] DOI: 10.1056/AIoa2400640
R3: Generalization of CH; streamline Section 2 notations
- We’d like to clarify that the confusion may arise from the use of “fine-tune” in two different contexts. The 2nd-stage training fine-tunes the model from the 1st stage, not to improve the coarse harmonization itself on a new dataset, but to adapt it into a target-specific fine harmonizer via CLIP-style guidance (see the sketch after this feedback). The 1st-stage model is trained once on the large OpenBHB dataset and reused. When new sites (e.g., SRPBS and DWI-THP in Tasks 2 and 3) appear, we fix the 1st-stage model and fine-tune only the 2nd-stage model, due to their small training data. Our current results show that the 1st-stage model generalizes well without retraining. Retraining both stages could improve results if sufficient data were available.
- We appreciate the reviewer’s understanding regarding our use of abbreviations, which was indeed constrained by space. The curled-brace notation denotes how latent features encode both content and style (e.g., Z = {C, S}), a key concept of this work. It helps illustrate the overall goal of preserving content C while translating style S, from source to coarsely harmonized and finally to target style. Once established, we omitted this notation for simplicity.
- While DDIM sampling follows the original formulation, we felt the need to present the relevant equations within our context to clearly specify each component’s role (e.g., inputs, conditions and outputs). This helps improve clarity and supports reproducibility of our method.
- As suggested, we’ll trim notations and highlight key contributions in the final version.
Thank you!
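To make the CLIP-style guidance discussed in the feedback concrete: such guidance is commonly implemented as a directional loss in a frozen embedding space, aligning the change the fine harmonizer produces with the desired source-to-target style direction. The sketch below is a hedged illustration under that assumption; the function name, embedding dimension, and exact form are hypothetical and not the authors' actual objective.

```python
import torch
import torch.nn.functional as F

def directional_style_loss(e_src, e_coarse, e_out, e_tgt):
    # e_*: (batch, dim) image embeddings from a frozen CLIP encoder.
    # Desired style direction: from source style to target style.
    d_style = F.normalize(e_tgt - e_src, dim=-1)
    # Achieved direction: from coarsely harmonized output to fine output.
    d_pred = F.normalize(e_out - e_coarse, dim=-1)
    # Penalize misalignment between the achieved and desired directions.
    return 1.0 - F.cosine_similarity(d_pred, d_style, dim=-1).mean()

if __name__ == "__main__":
    embs = [torch.randn(4, 512) for _ in range(4)]  # toy 512-d embeddings
    print(directional_style_loss(*embs).item())
```

Because the loss compares embedding differences rather than raw embeddings, it needs no paired source/target images, which is consistent with the unpaired setting the paper targets.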
Meta-Review
Meta-review #1
- Your recommendation
Invite for Rebuttal
- If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.
N/A
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
Meta-review #3
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
All reviewers agreed that the article has merit and is of interest to the MICCAI community. The rebuttal successfully addressed all concerns and clarified important points. I also believe this is an interesting contribution and, as suggested by Reviewer 2, I strongly recommend that the authors include the number of parameters in the final version of the paper to demonstrate its computational efficiency. Further comparisons with the three mentioned works would also add value.