Abstract

Generative models enhance neuroimaging through data augmentation, quality improvement, and rare condition studies. Despite advances in realistic synthetic MRIs, evaluations focus on texture and perception, lacking sensitivity to crucial morphometric fidelity. This study proposes a new metric, called WASABI (Wasserstein-Based Anatomical Brain Index), to assess the morphometric plausibility of synthetic brain MRIs. WASABI leverages SynthSeg, a deep learning-based brain parcellation tool, to derive volumetric measures of brain regions in each MRI and uses the multivariate Wasserstein distance to compare distributions between real and synthetic anatomies. Based on controlled experiments on two real datasets and synthetic MRIs from five generative models, WASABI demonstrates higher sensitivity in quantifying morphometric discrepancies compared to traditional image-level metrics, even when synthetic images achieve near-perfect visual quality. Our findings advocate for shifting the evaluation paradigm beyond visual inspection and conventional metrics, emphasizing morphometric fidelity as a crucial benchmark for clinically meaningful brain MRI synthesis. Our code is available at https://github.com/BahramJafrasteh/wasabi-mri.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2025/paper/2013_paper.pdf

SharedIt Link: Not yet available

SpringerLink (DOI): Not yet available

Supplementary Material: Not Submitted

Link to the Code Repository

https://github.com/BahramJafrasteh/wasabi-mri

Link to the Dataset(s)

N/A

BibTex

@InProceedings{JafBah_WASABI_MICCAI2025,
        author = { Jafrasteh, Bahram and Peng, Wei and Wan, Cheng and Luo, Yimin and Adeli, Ehsan and Zhao, Qingyu},
        title = { { WASABI: A Metric for Evaluating Morphometric Plausibility of Synthetic Brain MRIs } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2025},
        year = {2025},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15961},
        month = {September},
        pages = {685--695}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The paper presents WASABI (Wasserstein-Based Anatomical Brain Index), a metric for evaluating the anatomical fidelity of synthetic brain MRIs generated by different models. The authors argue that metrics like FID, MS-SSIM, and MMD focus on perceptual quality but lack sensitivity to anatomical accuracy, which is critical for clinical applications. Experiments on controlled ADNI subsets and five generative models demonstrate WASABI’s superior sensitivity to anatomical discrepancies compared to existing metrics.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    ++ The proposed model addresses the gap in evaluating synthetic brain MRIs by presenting a metric focused on anatomical fidelity, which is clinically relevant.
    ++ The use of SynthSeg for parcellation and Wasserstein distance for distribution comparison is well-motivated and technically sound.
    ++ The comparison across generative models is comprehensive and highlights the limitations of traditional metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    ++ To calculate the Wasserstein distance, the proposed method assumes that the distributions are Gaussian. This simplifies the computation, but the assumption may not hold for all brain region volumes. The authors should justify this assumption with empirical evidence or theoretical arguments, discuss possible drawbacks and how non-Gaussianity might affect the findings, and test the viability of alternative strategies (such as the non-parametric Wasserstein distance).
    ++ The SynthSeg QC threshold (QC > 0.7) appears to be arbitrarily chosen. The authors should provide a rationale for this threshold (e.g., based on segmentation accuracy or literature) and show how varying the threshold affects WASABI scores and conclusions.
    ++ The exclusion of methods with poor visual quality (e.g., blurry or low-resolution) may bias the evaluation. The authors should clarify whether WASABI can handle such cases or is limited to high-quality synthetic images, and discuss how the metric performs on lower-quality synthetic images.
    ++ The paper contrasts WASABI with image-level metrics but does not compare it to other anatomical evaluation methods (e.g., univariate Cohen’s d). The authors should include such a comparison to highlight WASABI’s advantages over existing anatomical metrics and show whether combining WASABI with other metrics could provide a more comprehensive evaluation.
    ++ The proposed metric lacks validation with clinical experts. The authors should show how clinicians might interpret WASABI scores in practice and describe future work involving clinician feedback to validate the proposed metric.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The unexamined assumptions, narrow evaluation scope, and lack of clinical validation.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    Accept

  • [Post rebuttal] Please justify your final decision from above.

    The authors answered all my questions.



Review #2

  • Please describe the contribution of the paper

    This paper proposes a new metric, called Wasserstein-Based Anatomical Brain Index, short for WASABI, to assess the anatomical realism of synthetic brain MRIs obtained by the existing generative models.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    A novel formulation of a new metric. The proposed metric measures the multivariate Wasserstein distances of the segmented brain regions between real and synthetic anatomies. The soundness of this metric was validated in the real datasets. A comparison study on five generative models validated the effectiveness of this metric.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    In analogy to Table 1, I suggest adding experiments by setting real ADNI data as references and measuring the similarities/distances between other real/synthetic data and ADNI. Such validation can make the proposed metric more convincing.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed new metric has the potential of objectively justifying the quality of brain MRI generative models.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A



Review #3

  • Please describe the contribution of the paper

    Although generative models have made progress in synthesizing brain MRI images, current evaluation methods mainly focus on texture and perception, lacking sensitivity to the key anatomical authenticity. This leads to the evaluation of generative models not fully reflecting their effectiveness in clinical applications. This paper introduces a novel metric, WASABI, for assessing brain generative models. WASABI stands for Wasserstein Anatomical Similarity metric for Brain Images and is utilized to gauge the anatomical fidelity of synthetic brain MRI images. The paper’s consideration of this issue is meaningful.

  • Please list the major strengths of the paper: you should highlight a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    WASABI utilizes SynthSeg to obtain volumetric measurements of various brain regions in MRI and employs the Wasserstein distance to compare the distribution of real and synthetic anatomical data. By evaluating on two MRI datasets and five generative models, WASABI demonstrates higher sensitivity than traditional image-level metrics, capable of quantifying anatomical differences even when the visual quality of the synthetic images is nearly perfect. This is a meaningful application, as it can better reflect whether the synthetic images meet clinical requirements compared to traditional metrics.

  • Please list the major weaknesses of the paper. Please provide details: for instance, if you state that a formulation, way of using data, demonstration of clinical feasibility, or application is not novel, then you must provide specific references to prior work.

    The experimental comparison in the article only shows the results of normal anatomical structures, lacking comparison in the presence of lesions (such as tumors).

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Optional: If you have any additional comments to share with the authors, please provide them here. Please also refer to our Reviewer’s guide on what makes a good review and pay specific attention to the different assessment criteria for the different paper categories: https://conferences.miccai.org/2025/en/REVIEWER-GUIDELINES.html

    N/A

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making.

    (4) Weak Accept — could be accepted, dependent on rebuttal

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This is a meaningful application, as it can better reflect whether the synthetic images meet clinical requirements compared to traditional metrics.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the authors’ rebuttal, please state your final opinion of the paper.

    N/A

  • [Post rebuttal] Please justify your final decision from above.

    N/A




Author Feedback

We thank reviewers (R) R1, R2, R3, and meta-reviewer (MR) for their constructive feedback. Below, we address the main concerns raised.

Volume is an oversimplification of anatomy (MR): We agree with MR that the current title is not specific enough and will clarify that WASABI is mainly designed for brain mapping research, where the major goal is to examine morphometric information of brain regions, e.g., volume, thickness, curvature. It currently cannot assess synthetic MRIs with tumors. Note that while our manuscript focuses on volume, it is trivial to extend WASABI to other brain morphometric measures. Accordingly, we will change the title to “WASABI: A Metric for Evaluating Morphometric Plausibility of Synthetic Brain MRIs.”

Clinical scope (R3, MR): As the first proposed metric designed to assess brain morphometry, our evaluation focused on cognitively normal subjects for initial benchmarking. We will acknowledge the limitation in clinical relevance. The main difficulty in the field is that many generative models can synthesize normal subjects, but very few can synthesize anatomically realistic representations of abnormal cases. This presents a classic catch-22: without a reliable metric, it is nearly impossible to develop models that optimize anatomical realism for patients. We envision WASABI as a critical step toward breaking this cycle to enable clinically meaningful synthesis.

Need for expert validation (R1, MR): We will clarify that the primary motivation for developing WASABI is that expert neurologists are no longer able to reliably judge which generative models produce more realistic brain MRIs, as they are not trained to detect subtle differences in brain morphometry. As noted in the manuscript, “studies have gradually converged to producing brain MRIs with near-perfect visual quality” and “in a recent user study radiologists could only distinguish real from synthetic MRIs with approximately 70% accuracy”. This highlights the need for automated, objective, anatomy-based metrics, such as WASABI, to complement expert judgment.

WASABI only works for high-quality images (R1): Following the previous comment, we focused on high-quality synthetic images because that is where human assessment becomes unreliable, and where a metric like WASABI is most needed. In fact, applying WASABI to detect anatomical failures in low-quality images is a more trivial task. Such images typically yield highly inaccurate volume estimates from SynthSeg, which is guaranteed to yield large WASABI values. We will clarify this in the revision and emphasize that high WASABI values under poor conditions are naturally expected and desirable.

Non-Gaussianity (R1): While we acknowledge that the Gaussian assumption may be restrictive, we made this simplifying choice based on empirical testing of the multivariate normality of ROI volume distributions in the UKB data using the Henze–Zirkler test, which did not reject multivariate normality (p-value = 0.24). We will clarify this point in the manuscript and discuss it explicitly as a limitation and direction for future work.
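To make the role of the Gaussian assumption concrete, the following is a minimal sketch of the standard closed-form squared 2-Wasserstein distance between two Gaussians fitted to regional volume vectors. It is an illustration, not the released WASABI code: the function names, the use of raw (unnormalized) volumes, and the choice to return the squared distance are all assumptions made here.

```python
import numpy as np


def psd_sqrt(mat):
    """Square root of a symmetric positive semi-definite matrix (eigendecomposition)."""
    sym = (mat + mat.T) / 2.0            # remove tiny asymmetries from round-off
    vals, vecs = np.linalg.eigh(sym)
    vals = np.clip(vals, 0.0, None)      # clip small negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def gaussian_w2_squared(real, synth):
    """Squared 2-Wasserstein distance between Gaussians fitted to two samples.

    real, synth: arrays of shape (n_subjects, n_regions), one row of regional
    volumes per MRI (hypothetical layout, assumed for this sketch).
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    mu_r, mu_s = real.mean(axis=0), synth.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(real, rowvar=False))
    cov_s = np.atleast_2d(np.cov(synth, rowvar=False))

    # ||mu_r - mu_s||^2 + Tr(C_r + C_s - 2 (C_r^{1/2} C_s C_r^{1/2})^{1/2})
    root_r = psd_sqrt(cov_r)
    cross = psd_sqrt(root_r @ cov_s @ root_r)
    mean_term = float(np.sum((mu_r - mu_s) ** 2))
    cov_term = float(np.trace(cov_r + cov_s - 2.0 * cross))
    return mean_term + max(cov_term, 0.0)  # guard against negative round-off
```

Because the formula uses only the sample mean and covariance, any departure from normality in the volume distributions is invisible to it, which is the limitation R1 raises.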

SynthSeg QC threshold selection (R1): We will clarify that the selected threshold of 0.7 was not arbitrary but based on the outlier threshold (1.5 × the interquartile range (IQR)) in a boxplot of QC scores on real ADNI images.
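A boxplot-based outlier threshold of this kind can be sketched as follows. The rebuttal does not spell out the exact computation, so this assumes the conventional lower fence, Q1 minus 1.5 × IQR, applied to per-image QC scores; the function names and the example scores are hypothetical.

```python
import numpy as np


def lower_outlier_fence(qc_scores, k=1.5):
    """Lower boxplot fence: Q1 - k * IQR (assumed form of the outlier rule)."""
    q1, q3 = np.percentile(np.asarray(qc_scores, dtype=float), [25, 75])
    return q1 - k * (q3 - q1)


def passes_qc(qc_scores, threshold):
    """Boolean mask of images whose QC score exceeds the threshold."""
    return np.asarray(qc_scores, dtype=float) > threshold


# Hypothetical QC scores for a handful of real images.
example_scores = [0.80, 0.82, 0.84, 0.86, 0.88]
fence = lower_outlier_fence(example_scores)   # Q1 = 0.82, IQR = 0.04
mask = passes_qc([0.50, 0.90], threshold=fence)
```

Deriving the cutoff from the real-data QC distribution ties it to what counts as an unusually poor segmentation in the reference cohort rather than to a hand-picked constant.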

Comparison to other anatomical metrics (R1): To the best of our knowledge, WASABI is the first metric explicitly designed to assess whole-brain anatomical plausibility of synthetic brain MRIs. Existing alternatives, such as univariate Cohen’s d, can only quantify per-region distributional shifts and must be computed separately for each anatomical region. As a result, they do not offer a unified whole-brain summary and lack sensitivity to multivariate anatomical coherence. In the revision, we will explain this distinction and underscore the advantage of WASABI as a holistic and scalable metric.
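For contrast, a per-region Cohen's d can be sketched as below. This is the textbook pooled-standard-deviation form, not code from the paper, and the array layout (one row per MRI, one column per region) is an assumption. The output is one effect size per region, with no single whole-brain value and no use of the covariance between regions.

```python
import numpy as np


def cohens_d_per_region(real, synth):
    """Cohen's d (synthetic minus real) computed independently for each region.

    real, synth: arrays of shape (n_subjects, n_regions); hypothetical layout.
    Returns an array of length n_regions.
    """
    real = np.asarray(real, dtype=float)
    synth = np.asarray(synth, dtype=float)
    n_r, n_s = real.shape[0], synth.shape[0]
    var_r = real.var(axis=0, ddof=1)
    var_s = synth.var(axis=0, ddof=1)
    # Pooled standard deviation across the two groups, per region.
    pooled_sd = np.sqrt(((n_r - 1) * var_r + (n_s - 1) * var_s) / (n_r + n_s - 2))
    return (synth.mean(axis=0) - real.mean(axis=0)) / pooled_sd
```

Summarizing these values into one score (for example by averaging absolute effect sizes) would require an additional design choice that the univariate statistic itself does not provide.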




Meta-Review

Meta-review #1

  • Your recommendation

    Invite for Rebuttal

  • If your recommendation is “Provisional Reject”, then summarize the factors that went into this decision. In case you deviate from the reviewers’ recommendations, explain in detail the reasons why. You do not need to provide a justification for a recommendation of “Provisional Accept” or “Invite for Rebuttal”.

    The reviewers acknowledge the lack of good tools to estimate anatomical plausibility in synthetic MRI. The authors proposed a very simple, yet effective, measurement for this purpose based on volumetric differences. The reviewers also found important drawbacks. The main assumption of the method is that anatomy can be encoded by volumetric information, which is a huge oversimplification of this difficult problem. Studies for assessing anatomical similarity must include assessments from radiologists, which are missing. I think the method is useful, but its main problem is the claim that it targets anatomical differences. One possible solution to these criticisms could be to change the title of the paper to remove any reference to “anatomical similarity” and discuss this issue in the final version of the paper. An additional issue found by the reviewers is that Table 1 focuses on cognitively normal subjects. It is unclear how the method will perform with synthetic MCI or AD or tumors, which might have a higher clinical value.

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    N/A



Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    All reviewers agree that the article has merit and could foster valuable discussion at the conference. The rebuttal addressed all of their concerns.



Meta-review #3

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    While the topic of this paper is quite narrow, and its methodology very simple (as reflected by three “weak accept” scores), it raises an interesting point that is likely to trigger discussion at MICCAI. There are, however, several limitations to the method, such as relying on an automated tool for volume quantification, which can have its own biases.


