Abstract

Deep learning models generating structural brain MRIs have the potential to significantly accelerate discovery in neuroscience studies. However, their use has been limited in part by the way their quality is evaluated. Most evaluations of generative models focus on metrics originally designed for natural images (such as the structural similarity index and Fréchet inception distance). As we show in a comparison of 6 state-of-the-art generative models trained and tested on over 3000 MRIs, these metrics are sensitive to the experimental setup and inadequately assess how well brain MRIs capture macrostructural properties of brain regions (a.k.a. anatomical plausibility). This shortcoming of the metrics results in inconclusive findings even when qualitative differences between the outputs of models are evident. We therefore propose a framework for evaluating models generating brain MRIs, which requires uniform processing of the real MRIs, standardizing the implementation of the models, and automatically segmenting the MRIs generated by the models. The segmentations are used for quantifying the plausibility of the anatomy displayed in the MRIs. To ensure meaningful quantification, it is crucial that the segmentations are highly reliable. Our framework rigorously checks this reliability, a step often overlooked by prior work. Only 3 of the 6 generative models produced MRIs of which at least 95% had highly reliable segmentations. More importantly, the assessment of each model by our framework is in line with qualitative assessments, reinforcing the validity of our approach. The code for this framework is available at https://github.com/jiaqiw01/MRIAnatEval.git.
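
Below is a minimal sketch of the two-stage anatomical check described above, under stated assumptions: per-MRI segmentation QC scores for the generated set, and per-region volume arrays for the generated and real sets. The data layout, function names, and the QC threshold default are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def cohens_d(x, y):
        """Effect size between two volume samples (pooled standard deviation)."""
        nx, ny = len(x), len(y)
        pooled_var = ((nx - 1) * np.var(x, ddof=1) + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
        return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

    def evaluate_generator(gen_qc, gen_volumes, real_volumes, qc_threshold=0.65):
        """Stage 1: keep generated MRIs whose segmentation QC score passes the
        threshold (the 0.65 default is a placeholder, not the paper's value).
        Stage 2: compare regional volumes of the surviving MRIs to real MRIs
        via Cohen's d (small |d| = plausible regional volume distribution).

        gen_qc:       array of QC scores, one per generated MRI
        gen_volumes:  dict {region: array of volumes, one per generated MRI}
        real_volumes: dict {region: array of volumes, one per real MRI}
        """
        keep = gen_qc >= qc_threshold
        reliable_fraction = keep.mean()   # per the abstract, at least 0.95 for 3 of the 6 models
        effect_sizes = {region: cohens_d(vols[keep], real_volumes[region])
                        for region, vols in gen_volumes.items()}
        return reliable_fraction, effect_sizes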

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/0689_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: N/A

Link to the Code Repository

https://github.com/jiaqiw01/MRIAnatEval.git

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Wu_Evaluating_MICCAI2024,
        author = { Wu, Jiaqi and Peng, Wei and Li, Binxu and Zhang, Yu and Pohl, Kilian M.},
        title = { { Evaluating the Quality of Brain MRI Generators } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper
    • Evaluation of existing top-of-the-line MRI synthesis methods (GAN- and diffusion-based) with a consistent framework that is/will be made publicly accessible
    • Proposal of segmentation-based metrics that take anatomical plausibility into account, using SynthSeg’s automated QC model as well as the fit of observed regional volumes vs. expected regional volumes (based on Cohen’s d)
  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • Use of ADNI-1, NCANDA, and in-house data (with significant site/cohort variability) for evaluation
    • Both GAN- and diffusion-based synthesis methods are evaluated
    • The evaluation is based on alternatives to the existing metrics, both traditional (MS-SSIM, PSNR) and newer perceptual or embedding-based metrics
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • Relatively low number of methods evaluated; in particular, there are significantly better GAN methods available
    • The proposed metrics are rather straightforward and of limited novelty
    • The proposed segmentation QC metrics (both) only work for typically appearing data. Pathology/atypicality would not be well handled, and thus the proposed method would not really be applicable to data with, e.g., tumors or significant lesion load, or to any data outside the trained segmentation QC model (e.g., infant MRI data)
    • For the second stage, each region is independently evaluated (rather than jointly); thus it can very well be that each individual region falls well within the expected volumetric range, yet the combination of volumes could be totally unexpected
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • Sufficiently well described (and the method is rather straightforward)
    • code will be available
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • it would be good to also contrast to other perceptual metrics like LPIPS, which can easily be adapted to 3D medical data
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • limited novelty
    • the proposed metrics are not broadly applicable, but rather are specific for certain tasks/domains
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The rebuttal was fine overall, but my main critiques remain, i.e., that this framework inherently does not extend to data with pathology, and that the work is of limited novelty. Thus, my rating does not move up. The presented work is of sufficient quality that a poster presentation at MICCAI could be appropriate.



Review #2

  • Please describe the contribution of the paper

    The paper presents a comprehensive comparison of brain MRI generation models. The comparison is based on models implemented in a unified framework, and provides an extensive comparison of quantitative quality metrics and a qualitative assessment of the results. The main insight of the paper is that these metrics and the qualitative assessment of the results diverge. Thereby it makes an important point: we lack an informative basis for comparative evaluation of image generation models in the medical imaging domain, in particular in brain MRI.

    The title resonates primarily with the body of health equity work, where fairness of machine learning models is assessed with regard to uneven benefits for different parts of the patient population. This does not reflect the content of the paper. Thus I would suggest changing the title to, e.g., something along the lines of: „A comparative assessment of brain MRI generation models and evaluation metrics“. Of course, I leave this to the authors to decide.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • A critical and comprehensive evaluation of the validity of established metrics, such as metrics capturing perceptual properties. Insights in their limitations.
    • A wide range of models and metrics being part of the comparison makes it a compelling body of evidence.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The title is not ideal and slightly misleading, but this is an easy fix
    • The writing of the paper is overall good, but the results are hard to interpret and the structure of the result section should be improved to facilitate reading. To better follow points such as „traditional metrics struggle to yield conclusive findings, even when qualitative differences are evident“, it would be great to summarise the principles of the comparison at the beginning of the evaluation: do you trust the quantitative or the qualitative assessment more, and why? What are groups of metrics that behave in a similar way (e.g., in failing or succeeding to identify what works and what doesn’t)? Is there a reason for that?
    • A summary explanation of the utilised metrics is lacking, making it hard to follow the paper without reading up substantially in other papers. The authors should state the key aspects of the used metrics to help the reader appreciate the results.
  • Please rate the clarity and organization of this paper

    Satisfactory

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Providing the code to repeat these experiments would be very helpful.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please also refer to the answers to question 6 (weaknesses)
    • Please clarify in an overview where user interaction is needed (e.g., as part of QC) and where it is not.
    • You provide a standardised preprocessing pipeline, but there are many established e.g., in the HCP, AFNI or other communities. what is the advantage of the proposed approach, or does it correspond to one of those? Please explain the relationship.
    • Please rethink the title, it leads readers into the wrong area.
    • Please structure the result section better (suggestions above)
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper overall provides compelling evidence about the inconsistency of widely used metrics for assessing generative models. At the same time it needs quite some improvement on the structure of the paper, to help readers to understand the key results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    The main contribution of this article is the introduction of a comprehensive framework for standardizing the process of generating brain MRIs, covering data preprocessing, model implementation, and evaluation methods. By introducing an innovative two-stage assessment process that integrates qualitative and quantitative indicators, the article effectively enhances the anatomical accuracy and clinical relevance of the generated images, providing a fairer and more reliable method for evaluating and comparing different generative models. In the first stage, the authors use segmentation models to segment both generated and real MRIs, assessing the anatomical quality of each brain region. The introduction of Quality Control (QC) scores and the setting of thresholds to determine the acceptability of segmentation results help filter anatomically credible generated data. In the second stage, further cortical segmentation is performed on data that passed the initial assessment, generating masks that are used to compare regional volumes against those of real data.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    1. The paper proposes a unified framework that includes data preprocessing, model implementation, and evaluation metrics. This approach is novel because it provides a common benchmark for various generative models, making the comparisons between them more fair and consistent. This is crucial, as previous studies were often limited by their experimental setups, making it difficult to assess the performance of different models fairly.
    2. The research introduces a two-stage evaluation process as an innovation, which not only enhances the scientific rigor of the assessments but also their clinical relevance. In the first stage, MRI images are filtered through a quality control scoring system, and in the second stage, the ROI of the qualified samples is compared with real data. This method focuses on the anatomical authenticity of the images rather than just their visual quality, making it particularly valuable in medical imaging.
    3. This study comprehensively assesses the quality of generated images by combining qualitative and quantitative evaluation methods. This integrated approach is not commonly seen in previous research, and its advantage lies in its ability to assess both the visual effects and the medical accuracy of the images, which is especially crucial for clinical applications.
    4. The paper was tested across diverse clinical datasets, demonstrating its feasibility in real-world applications. This is a significant strength of the study, as it validates the practicality of the technology and its potential value in medical diagnostics and treatment. 
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    1. The description of the innovation and necessity of the research in the background is not sufficiently detailed; a more thorough discussion of the deficiencies of existing methods is needed.
    2. The article provides a rather general explanation of the methods’ principles and should provide more detailed information about the scoring standards proposed.
    3. The paper focuses on studying GANs and diffusion models, lacking discussion on other generative models.
    4. The article needs to include a comparison of the performance of generated data in downstream tasks, which would help to visually demonstrate the quality and performance of generated data in actual clinical applications.
    5. The results are based on a single round of experiments; multiple repeated experiments should be conducted to ensure reliability and avoid randomness.
    6. In the final results evaluation section, there needs to be more comparison with other anatomical-level evaluation strategies, and the choice of evaluation metrics needs further comparison to support their use. 
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The paper describes the data sources, preprocessing methods, model implementation, and evaluation metrics used. It utilizes a diverse range of publicly available datasets, and the experimental design is rigorous. The authors have also committed to making their code open source, enhancing their research’s reproducibility. These factors contribute to high reproducibility and are likely to advance scientific development.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Your paper provides a new method for the standardized evaluation of existing generative models. After a detailed review of your paper, I believe the topic and the methodology have significant academic value and practical prospects. However, to further improve the quality of the paper and make the research results more rigorous, I have the following suggestions:
    1. Enhance the description of the research background and necessity: The introduction does not sufficiently detail the deficiencies of current methods or the motivation behind proposing a new framework. A detailed discussion of the main techniques currently used in the field and their limitations, a clear explanation of the advantages and innovations of your method compared to these, and a comparative analysis of the relevant literature would strengthen the persuasiveness of your paper.
    2. Provide detailed explanations of the methodological and technical details: The two-stage evaluation process and quality control scores you propose are central to your paper, but the descriptions of their principles and operational details could be more specific. I recommend describing the techniques’ working principles, algorithm choices, and mathematical underpinnings in detail to enable other researchers to replicate your experiments, thus enhancing the transparency and reliability of your methodology.
    3. Expand the range of models compared: The current research focuses mainly on GAN and diffusion models. Could you consider introducing and discussing additional generative models, such as autoencoders, to show the broad applicability of your proposed methods?
    4. Demonstrate the performance of the generated data in practical applications: The paper lacks an empirical analysis of how the generated data performs in downstream tasks (e.g., disease diagnosis, prediction of treatment effects). I suggest assessing the effectiveness of the generated data in actual clinical applications, for example by comparing its performance in the same diagnostic tasks against real data, to show the practical value of the generated data.
    5. Conduct multiple rounds of experiments to verify result stability: To ensure the reliability and universality of your experimental results, I recommend conducting repeated experiments and reporting statistics (such as means and standard deviations) to minimize errors due to the randomness of single experiments.
    6. Add comparative validation of assessment metrics: The evaluation currently relies mainly on the proposed two-stage process, with little comparison to other anatomical-level assessment strategies. I suggest adding comparative experiments to show the superiority of the proposed evaluation process.
    By making these revisions, your paper will gain broader recognition and application in academic and practical fields. I look forward to seeing further improvements to your paper.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    1. The paper proposes a unified framework that includes data preprocessing, model implementation, and evaluation metrics. This approach is novel because it provides a common benchmark for various generative models, making the comparisons between them more fair and consistent. This is crucial, as previous studies were often limited by their experimental setups, making it difficult to assess the performance of different models fairly.
    2. The research introduces a two-stage evaluation process as an innovation, which not only enhances the scientific rigor of the assessments but also their clinical relevance. In the first stage, MRI images are filtered through a quality control scoring system, and in the second stage, the ROI of the qualified samples is compared with real data. This method focuses on the anatomical authenticity of the images rather than just their visual quality, making it particularly valuable in medical imaging.
  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Accept — should be accepted, independent of rebuttal (5)

  • [Post rebuttal] Please justify your decision

    Answered the reviewer’s questions.




Author Feedback

We thank all reviewers (R) for their constructive feedback and for recognizing the scientific value, clarity, and organization of our work.

R1: The method would not be applicable to pathology/atypical data, e.g., tumor or infant MRI data. A: The reviewer is correct that we primarily targeted neurocognitive MRI studies focused on identifying subtle differences, which we will acknowledge in the manuscript. One could extend our framework to the above applications by confining it to infant MRIs within a narrow age range or to tumors appearing only in certain brain regions.

R1: focusing solely on individual regions might overlook the overall brain structure A: This concern is unlikely to materialize as our evaluation complements cortex-level evaluation (important for neurocognitive studies) with global image metrics (like SSIM and FID) and an overall quality score of the segmentation. Furthermore, extending the framework to include global gray and white matter scores is straightforward.

R1: Relatively low number of methods evaluated A: We evaluated all (i.e., 6) state-of-the-art MRI synthesis methods for which implementations were available to us. We even reached out to the authors of MedGen3D for code (without getting a response) and implemented a recent model designed for CT (MedSyn).

R1: The proposed metrics are rather straightforward, of limited novelty A: The novelty of our paper lies in the development of a comprehensive framework (unified pipeline) for comparing brain MRI synthesis methods, which has the potential to significantly enhance neurocognitive studies. Additionally, the proposed metrics provide cortex-level measurements to address a core aspect of neurocognitive studies. Unlike previous metrics, they can accurately measure anatomical plausibility, offering a more detailed and relevant evaluation of synthesized brain MRIs.

R3: Recommend comparing our work with standardized pipelines like HCP and AFNI. A: As we will clarify in the manuscript, we are essentially using a scaled-down version of HCP that only requires the necessary steps to robustly perform a skull strip and align an MRI to a template.

R3: The word fairness in the title is misleading A: We will revise the title to “Towards a Consistent and Anatomically Plausible Evaluation of Brain MRI Generators.”

R3: Do you trust the quantitative or qualitative assessment, why? A: The motivation behind our proposed framework was that many methods performed equally well with respect to quantitative metrics (i.e., global semantic-level similarity), while they differed greatly qualitatively. To ensure that qualitative differences were properly captured, we added regional brain measurements that quantify anatomical plausibility. At this point, we trust those qualitative assessments more, as these are also the metrics used by neuroscience studies.

R3: summary explanation of the utilized metrics A: MS-SSIM: evaluates perceptual quality by comparing structural similarity across multiple scales. MMD: measures the distance between the distributions of generated and real data in a high-dimensional space. FID: compares the distributions of features extracted by a pre-trained network from real and generated MRIs, with lower FID scores indicating higher similarity to real MRIs. All of them provide a statistical measure of the overall similarity between synthetic and real MRIs.
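
For concreteness, the following is a minimal sketch of the embedding-based distance underlying FID, assuming `real_feats` and `gen_feats` are feature matrices (one row per MRI) produced by some pre-trained encoder; the encoder choice and feature dimensionality are assumptions here, not the paper's exact setup.

    import numpy as np
    from scipy import linalg

    def frechet_distance(real_feats, gen_feats):
        """Fréchet distance between Gaussians fitted to real and generated features.

        real_feats, gen_feats: arrays of shape (n_samples, n_features).
        Lower values indicate that the generated feature distribution is
        closer to the real one.
        """
        mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
        sigma_r = np.cov(real_feats, rowvar=False)
        sigma_g = np.cov(gen_feats, rowvar=False)

        diff = mu_r - mu_g
        covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
        if np.iscomplexobj(covmean):
            covmean = covmean.real  # drop negligible imaginary parts from numerics

        return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean))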

R4: needs more background; lacking discussion of other generative models. A: GAN and diffusion methods are the state of the art. There are also deformation-field-based methods, but they fail to increase data diversity as they are based on an existing MRI. We will add this to the introduction as space permits.

R4: compare with other anatomical-level evaluation strategies. A: To the best of our knowledge, no previous generative methods evaluated with perception-based metrics involved cortex-level evaluation.

R4: needs downstream tasks A: We agree and will include this in the discussion.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The reviewers’ concerns include the limitations of the proposed pipeline, the method’s innovation, the number of comparison methods, and the details of the proposed metrics. Additionally, there are no downstream tasks. The authors responded by clarifying their innovations, verifying some details and limitations, and conducting comparison experiments with six different methods. They also plan to revise the paper title to more accurately describe their method. Therefore, I suggest acceptance.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    Overall, I think the manuscript should be accepted. As noted, there are some existing limitations, especially the method not being applicable to images with pathology, which greatly limits the utility of the proposed methodology in practical application. However, this is an interesting idea, and the key weaknesses are ones of applicability, not methodology.

  • What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).



