Abstract

Fréchet Inception Distance (FID) is a widely used metric for assessing synthetic image quality. It relies on an ImageNet-based feature extractor, making its applicability to medical imaging unclear. A recent trend is to adapt FID to medical imaging through feature extractors trained on medical images. Our study challenges this practice by demonstrating that ImageNet-based extractors are more consistent and aligned with human judgment than their RadImageNet counterparts. We evaluated sixteen StyleGAN2 networks across four medical imaging modalities and four data augmentation techniques with Fréchet distances (FDs) computed using eleven ImageNet or RadImageNet-trained feature extractors. Comparison with human judgment via visual Turing tests revealed that ImageNet-based extractors produced rankings consistent with human judgment, with the FD derived from the ImageNet-trained SwAV extractor significantly correlating with expert evaluations. In contrast, RadImageNet-based rankings were volatile and inconsistent with human judgment. Our findings challenge prevailing assumptions, providing novel evidence that medical image-trained feature extractors do not inherently improve FDs and can even compromise their reliability. Our code is available at https://github.com/mckellwoodland/fid-med-eval.
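
For reference, the Fréchet distance underlying all of these metrics compares feature statistics of the real and generated images: with feature means mu_r, mu_g and covariances Sigma_r, Sigma_g, FD = ||mu_r - mu_g||^2 + Tr(Sigma_r + Sigma_g - 2(Sigma_r Sigma_g)^(1/2)). The choice of feature extractor only changes the space in which these statistics are computed.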

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2251_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/2251_supp.pdf

Link to the Code Repository

https://github.com/mckellwoodland/fid-med-eval

Link to the Dataset(s)

https://sliver07.grand-challenge.org/
https://nihcc.app.box.com/v/ChestXray-NIHCC
http://medicaldecathlon.com/
https://www.creatis.insa-lyon.fr/Challenge/acdc/databases.html

BibTex

@InProceedings{Woo_Feature_MICCAI2024,
        author = { Woodland, McKell and Castelo, Austin and Al Taie, Mais and Albuquerque Marques Silva, Jessica and Eltaher, Mohamed and Mohn, Frank and Shieh, Alexander and Kundu, Suprateek and Yung, Joshua P. and Patel, Ankit B. and Brock, Kristy K.},
        title = { { Feature Extraction for Generative Medical Imaging Evaluation: New Evidence Against an Evolving Trend } },
        booktitle = {proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15012},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors compared the performance of ImageNet-based feature extractors against RadImageNet-based ones for computing FID scores across sixteen StyleGAN2 networks spanning four medical imaging modalities. They used eleven different feature extractors and compared their Fréchet distances with human judgment through visual Turing tests. Results showed that ImageNet-based extractors align more consistently with human judgment, particularly the ImageNet-trained SwAV extractor, which correlated significantly with expert evaluations.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper conducts a comprehensive analysis of ImageNet- and RadImageNet-based models by calculating the Fréchet Inception Distance (FID) across multiple medical imaging datasets and aligning the findings with human judgment.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The reliance on StyleGAN2 networks and a select few feature extractors might make the findings less applicable to other generative models or newer architectures that could behave differently.

  • Please rate the clarity and organization of this paper

    Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The authors should explain their choice of StyleGAN2 networks among generative models. The normalization techniques used in RadImageNet and ImageNet may differ, potentially contributing to the observed volatility in model performance. This variation could impact the generation of pretrained weights and, subsequently, the stability of the models. The authors should consider adding a discussion section that analyzes potential reasons why ImageNet-based extractors outperform those trained on medical images in terms of FID. This analysis could provide deeper insights into the strengths and limitations of using ImageNet-trained models for medical image analysis.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The authors have conducted comprehensive analyses and experiments, and their findings challenge prevailing assumptions. They could enhance the paper by including more detailed discussions to further explore the observed disparities.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This paper utilizes multiple backbones and datasets to demonstrate that metrics trained on ImageNet are superior to those trained on medical image datasets. The authors identified that FSD is better than FID and benchmarked multiple data augmentation techniques. They also introduced a novel method for evaluating visual Turing tests.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper points out a potential problem with using FID extractors trained on medical image datasets to evaluate generative model performance. It is important for the community to be aware of this problem. The paper conducted sufficient experiments to demonstrate that metrics trained on ImageNet are superior and collected human responses to support the claim. The authors also benchmarked multiple data augmentation techniques.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The authors mention three approaches researchers have used to adapt FID to the medical imaging domain. The experiments in this paper cover the first two approaches, with the third approach less discussed. It would make the paper more complete if the authors discussed the unquantified bias mentioned in the introduction in more depth. Currently, the citations given for the unquantified bias are not sufficient to support the claim.

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In the results section, the tables are described in detail. It would be better to describe Fig. 1 in detail as well. For example, what do the multiple bars in the same cell mean (e.g., the MSD Human Rankings KS cell)? How are the Pearson coefficients computed (using only the top 4 rankings or all rankings)?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The extensive experiments cover the first two approaches mentioned in the introduction, but the discussion of the limitations of the third approach is insufficient.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper presents a contribution towards the assessment of the Fréchet Inception Distance (FID). The paper argues that FID computed with a feature extractor pretrained on ImageNet provides more efficient and significant scores for indicating the quality and diversity of synthetic images than FID computed with an extractor pretrained on RadImageNet. StyleGAN2 was trained on four benchmark X-ray datasets to generate synthetic images, and the quality of the synthetic images was computed using the FID score. FID based on ImageNet provides more efficient and consistent results than FID based on RadImageNet.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    The paper provides a comprehensive empirical evaluation of FID for X-ray images. The paper indicates supportive evidence for the proposed argument and hypothesis.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I have a few concerns regarding the computation of FID scores in this work. How was the FID score calculated? Were the image samples processed in batches or not?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    The methodology of the FID computation should be clearly presented for reproducibility.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    This paper presents a contribution towards the assessment of the Fréchet Inception Distance (FID). The paper argues that FID computed with a feature extractor pretrained on ImageNet provides more efficient and significant scores for indicating the quality and diversity of synthetic images than FID computed with an extractor pretrained on RadImageNet. StyleGAN2 was trained on four benchmark X-ray datasets to generate synthetic images, and the quality of the synthetic images was computed using the FID score. FID based on ImageNet provides more efficient and consistent results than FID based on RadImageNet. I have the following concerns: (1) How was the FID score calculated? Were the image samples processed in batches or not? (2) The methodology of the FID computation should be clearly presented for reproducibility.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The paper has a good presentation, quantitative and qualitative evaluation of results, and comprehensive experiments and assessments.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

Thank you to the reviewers for your thoughtful feedback.

Regarding other potential feature extractors, the purpose of this work is to question the recent trend of using medical image extractors with the FD without evaluating their efficacy for the task. The most commonly used of these extractors is the RadImageNet InceptionV3 network analyzed in our work. While other extractors may behave differently, we hope that our work will cause the community to empirically evaluate the effectiveness of proposed extractors for the task at hand before they are used to benchmark generative models. We think that self-supervised extractors trained on comprehensive, public medical imaging datasets could prove especially useful in future work.

Regarding private extractors, our work solely evaluates public extractors as they provide consistency and reproducibility. The unquantified bias in private extractors stems from the fact that the algorithm designer creates the metric that evaluates their algorithm. As the efficacy of these metrics is rarely evaluated, the errors present in these metrics are unquantified. Having said that, training a self-supervised model on the dataset used to train the generative model may provide the most accurate rankings (https://doi.org/10.1117/12.2680115). Future work can focus on methodology to create, evaluate, and present these metrics in a consistent and reproducible manner.

Regarding utilizing StyleGAN2 as the sole generative model, our findings are based on four medical imaging modalities and four data augmentation techniques, resulting in 16 models for medical doctors to evaluate. We chose to benchmark data augmentation techniques over model architectures as we felt that it was more novel. While including more models would make our work more robust, doing so would have caused an undue burden on our evaluators. We do not believe that changing generative models would vastly change our results because FDs are calculated on an image level and are thus model agnostic. StyleGAN2 was chosen for its proven ability to produce high-fidelity medical images (http://doi.org/10.1007/978-3-031-16980-9_14) and its readily available data augmentation implementations.

Regarding the discussion of potential reasons why RadImageNet extractors were volatile, our main hypotheses are delineated in the Introduction. First, networks trained for disease detection may focus too heavily on localized regions to effectively evaluate an entire image’s quality. Second, medical images are highly heterogeneous, including differences across modalities, acquisition protocols, patient populations, and image processing techniques. A more in-depth discussion can be found in the discussion section in the earliest version of the arXiv counterpart to this manuscript. It was removed from this manuscript due to space constraints. 

Regarding Figure 1, the description was significantly shortened to avoid pushing our manuscript over the page limit. Vertical bars represent shared rankings.

Regarding FD calculation details, FDs were calculated between the entire real dataset and 50,000 generated images (no batches). For RadImageNet FDs, image normalization mirrored the normalization presented in the official RadImageNet repository, leading us to believe that the inconsistencies of the RadImageNet extractors were not due to improper normalization. ImageNet FDs were calculated via the official StudioGAN repository with a clean resizer.
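
As a rough, minimal sketch of this computation (assuming extractor features for the full real and generated sets are already in memory; the function and variable names below are illustrative, not the exact code in the repository):

import numpy as np
from scipy import linalg

def frechet_distance(feats_real, feats_gen):
    # feats_real, feats_gen: (N, D) arrays of extractor features computed over the
    # complete image sets, so the mean and covariance are estimated without batching.
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_r = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    # Matrix square root of the covariance product; small imaginary parts can arise
    # from numerical error and are discarded.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(sigma_r) + np.trace(sigma_g) - 2.0 * np.trace(covmean))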

Regarding correlation statistics, the Pearson correlation coefficients considered (1) the relationships between raw FDs computed with different extractors and (2) the relationships between raw FDs and the average differences across mean Likert ratings for all 16 models.
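
For concreteness, each such coefficient can be computed with SciPy as below; the arrays are hypothetical placeholders, not values from the study.

from scipy import stats

# Hypothetical FDs for four models under one extractor and the corresponding
# differences in mean Likert ratings (illustrative values only).
fds = [12.4, 15.1, 9.8, 20.3]
likert_diffs = [0.4, 0.7, 0.2, 1.1]
r, p = stats.pearsonr(fds, likert_diffs)
print(f"Pearson r = {r:.2f}, p = {p:.3f}")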

Our public GitHub repository provides further methodological details.




Meta-Review

Meta-review not available, early accepted paper.


