Abstract

Generative image reconstruction algorithms such as measurement conditioned diffusion models are increasingly popular in the field of medical imaging. These powerful models can transform low signal-to-noise ratio (SNR) inputs into outputs with the appearance of high SNR. However, the outputs can have a new type of error called hallucinations. In medical imaging, these hallucinations may not be obvious to a Radiologist but could cause diagnostic errors. Generally, hallucination refers to error in estimation of object structure caused by a machine learning model, but there is no widely accepted method to evaluate hallucination magnitude. In this work, we propose a new image quality metric called the hallucination index. Our approach is to compute the Hellinger distance from the distribution of reconstructed images to a zero hallucination reference distribution. To evaluate our approach, we conducted a numerical experiment with electron microscopy images, simulated noisy measurements, and applied diffusion based reconstructions. We sampled the measurements and the generative reconstructions repeatedly to compute the sample mean and covariance. For the zero hallucination reference, we used the forward diffusion process applied to ground truth. Our results show that higher measurement SNR leads to lower hallucination index for the same apparent image quality. We also evaluated the impact of early stopping in the reverse diffusion process and found that more modest denoising strengths can reduce hallucination. We believe this metric could be useful for evaluation of generative image reconstructions or as a warning label to inform radiologists about the degree of hallucinations in medical images.
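
The abstract describes the metric as a Hellinger distance between two distributions that are summarized by a sample mean and covariance. The sketch below illustrates that idea on toy data, assuming both distributions are approximated as multivariate Gaussians so that the distance has a closed form. This Gaussian assumption, the function names `sample_stats` and `gaussian_hellinger`, the `ridge` regularizer, and the toy data are illustrative choices made here, not details confirmed by the paper.

```python
import numpy as np


def sample_stats(samples, ridge=1e-6):
    """Sample mean and covariance of an (n_samples, n_features) array.

    `ridge` is a hypothetical stabilizer added to the diagonal so the
    covariance stays invertible when few samples are available.
    """
    samples = np.asarray(samples, dtype=float)
    mean = samples.mean(axis=0)
    cov = np.atleast_2d(np.cov(samples, rowvar=False))
    cov = cov + ridge * np.eye(cov.shape[0])
    return mean, cov


def gaussian_hellinger(mean_p, cov_p, mean_q, cov_q):
    """Hellinger distance between N(mean_p, cov_p) and N(mean_q, cov_q).

    Uses H^2 = 1 - exp(-D_B), where D_B is the Bhattacharyya distance
    between two Gaussians. The result lies in [0, 1].
    """
    mean_p = np.atleast_1d(np.asarray(mean_p, dtype=float))
    mean_q = np.atleast_1d(np.asarray(mean_q, dtype=float))
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))

    cov_avg = 0.5 * (cov_p + cov_q)
    diff = mean_p - mean_q

    # Log-determinants avoid overflow/underflow in higher dimensions.
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    _, logdet_avg = np.linalg.slogdet(cov_avg)

    mahalanobis = diff @ np.linalg.solve(cov_avg, diff)
    bhattacharyya = 0.125 * mahalanobis + 0.5 * (
        logdet_avg - 0.5 * (logdet_p + logdet_q)
    )
    h_squared = 1.0 - np.exp(-bhattacharyya)
    return float(np.sqrt(max(h_squared, 0.0)))


if __name__ == "__main__":
    # Toy stand-ins for flattened image samples (not real reconstructions).
    rng = np.random.default_rng(0)
    n_samples, n_features = 2000, 4
    reference = rng.normal(0.0, 1.0, size=(n_samples, n_features))
    faithful = rng.normal(0.0, 1.0, size=(n_samples, n_features))
    shifted = rng.normal(0.5, 1.0, size=(n_samples, n_features))

    ref_stats = sample_stats(reference)
    print("same distribution:   ", gaussian_hellinger(*sample_stats(faithful), *ref_stats))
    print("shifted distribution:", gaussian_hellinger(*sample_stats(shifted), *ref_stats))
```

In this toy setting the samples drawn from the reference distribution should score close to 0 and the shifted samples should score higher. How the paper forms its distributions from image data (for example, which pixels or patches enter the mean and covariance) is not specified in the abstract and is left open here.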

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/3513_paper.pdf

SharedIt Link: https://rdcu.be/dV543

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_42

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/3513_supp.zip

Link to the Code Repository

N/A

Link to the Dataset(s)

N/A

BibTex

@InProceedings{Tiv_Hallucination_MICCAI2024,
        author = { Tivnan, Matthew and Yoon, Siyeop and Chen, Zhennong and Li, Xiang and Wu, Dufan and Li, Quanzheng},
        title = { { Hallucination Index: An Image Quality Metric for Generative Reconstruction Models } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15010},
        month = {October},
        pages = {449 -- 458}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    This paper presents the Hallucination Index, a novel image quality metric, which measures hallucinations of generative models applied to medical image reconstruction. In medical image reconstruction with deep learning, hallucinations are often overlooked and this paper makes a major contribution for raising awareness and quantifying the severity of hallucinations. The method is novel, sound, and evaluated in a reasonable setting.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The paper addresses an important problem in medical image analysis. The problem setting is well motivated in the introduction.
    • A novel and sound method is proposed.
    • The metric is evaluated on a real-world dataset and seems to work fine.
    • The paper is very well written and easy to follow.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • It is unclear how the metric is supposed to be computed when there is no access to the ground truth x. How is the zero-hallucination reference distribution computed, which is necessary for computing the index, when only low SNR measurements are available?
    • The HI seems to need multiple measurement samples. How are these obtained? What if there is only one low SNR sample per measurement?
    • I did not find a code release statement.
  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
    • Please clarify how HI would be computed if there’s no access to the ground truth image.
    • If computation of HI without ground truth is not possible, this should be discussed as major weakness of the metric.
    • While hallucinations are a highly relevant issue, I don’t think that they are new, as framed by the authors.
    • Can this metric be extended to other generative frameworks, such as VAE, GANs, etc.?
    • Typo in sentence between Eq. (1) and (2).
    • Typo in last line on p. 5.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    The proposed method is highly valuable for the community and the issue of hallucinations from generative models in general has to be addressed. I would like to see this paper discussed at MICCAI. However, for this method to be useful in practice, the authors have to clarify how HI is computed without access to ground truth.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #2

  • Please describe the contribution of the paper

    This study introduces the hallucination index, a metric designed to quantify the amount of structural uncertainty present in a generated distribution. The hallucination index is then applied to diffusion-based reconstruction of electron microscopy images. Additionally, the study highlights the tradeoff between mean squared error and the hallucination index: as the reverse diffusion process continues, the mean squared error decreases, while the hallucination index increases.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper has multiple strengths. First, the proposed metric is novel. The authors do a good job explaining previous measures of “hallucination” and justifying how theirs is different and an addition to the field. Second, the proposed metric is important. It is imperative to understand how often generative models being developed for medical imaging create structures that don’t exist. Third, the manuscript is well-written, with appropriate detail, evaluation, and explanation of the proposed methodology.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    I don’t necessarily have weaknesses to discuss, but I do have several reservations about the manuscript that I’d like the authors to respond to in their rebuttal.

    First, I am uncertain of the hallucination index’s applicability to non-Fourier diffusion models. In the rebuttal, I’d like to see initial ideas on how to create zero-hallucination reference distributions for other generative modeling architectures. If this work were to be expanded, I’d like to see evaluations on other architectures.

    Second, while I appreciate that the hallucination index was compared to the mean squared error, why was it also not compared to hallucination map? If hallucination map could be applied to the study, I would like to see it applied in an expanded version of this manuscript.

    Third, I don’t understand how a distance between two probability distributions defined by flattened images preserves structure enough to detect hallucinated higher-dimensional structures. Would your method be able to detect 3D hallucinations in other medical imaging modalities?

  • Please rate the clarity and organization of this paper

    Excellent

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?

    While there is no link to code, the manuscript utilizes a public dataset and all the methods are carefully delineated.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    In an expanded version of this manuscript, I would like to see a sort of “sanity check” where synthetic hallucinations are purposefully added to images in various quantities, sizes, and shapes, and then see how the hallucination index responds to these hallucinations. I have delineated minor writing critiques and suggestions below.

    Figure feedback:

    • Fig. 1: Between y and x_hat, it is not immediately apparent whether the relationship is defined by x_hat | y or x_hat | x. I would move the equation for p_theta(x_hat | x) somewhere else. Perhaps lower and centered between x and x_hat?
    • Fig. 1: I would keep H between 0 and 1 instead of converting it to percentages. Especially because it is between 0 and 1 elsewhere in the manuscript.
    • I think the manuscript could really benefit from an official algorithm description figure instead of the 1-5 at the end of 2.1. I understand if this can’t be incorporated into a MICCAI manuscript due to the length limitations.
    • Fig. 2.: I would add in arrows pointing to where the hallucinations are.
    • Fig. 3.: I’m sure there isn’t much room for it, but it would be helpful to add labels for each row. The graphs in the bottom row are missing x-axis labels. Again, arrows or some other way to highlight the hallucinations would be helpful.

    Writing feedback:

    • Radiologist doesn’t need to be capitalized in the abstract.
    • I would make it clear earlier than page 5 that you use a Fourier-based forward diffusion process. I was confused about how adding Gaussian noise creates a higher SNR up until page 5.
    • I would be careful about the last sentence of your abstract. The warning label isn’t mentioned elsewhere in your paper. And the label brings up questions of if a radiologist would trust an image with any degree of hallucinations in it.
    • The reference in-text citation format needs to be standardized. Sometimes it says [15,20,21] whereas other places the format would be [15][20][21]. There are also unnecessary spaces between some citations.
    • When two words work together like an adjective to describe a noun, they should be hyphenated. So “high spatial resolution” -> “high-spatial resolution” (although this should probably be just “spatial resolution”), “low dwell time” -> “low-dwell time”, “diffusion based reconstructions” -> “diffusion-based reconstructions”, “large scale numerical” -> “large-scale numerical”, “spatially correlated noise” -> “spatially-correlated noise”, etc.
    • The end of the introduction would benefit from an explicit delineation of the manuscript’s contributions.
    • Don’t forget to add in an abbreviated title!
    • In my opinion, it’s best practice to add a period at the end of the equations because you are using them in a sentence.
    • At the end of 2.1, there should be an “and” before 5. Again, they are being used in a sentence.
    • It is good practice to mention the authors of a work by name when referring to the work instead of referring to a citation. So, “We followed the methods in Tivnan et al. [17]” instead of “methods in [17]”. This idea can be incorporated elsewhere throughout the paper.
    • Choose “the” or “our” in “we seek to validate the our proposed metric”.
    • I would be more consistent when referring to the signal-to-noise ratio. Sometimes it is referred to as “signal-to-noise”, others SNR, and others “signal to noise”.
    • Add “and” between 0.4 and 0.2 when introducing the t_start values.
    • Add a comma in “1000”
    • I would refer to the video in the Supplementary Material.
    • In the conclusion, I think stating that the hallucination index can be useful for identifying generative artifacts is a stretch. Quantifying the amount of generative artifacts present in a generated distribution is realistic though. In addition, stating that the hallucination index is a quantitative metric for image quality assessment also feels slightly misleading because it evaluates generative distributions, not individual images.
    • In your future work paragraph, what do you desire to apply hallucination analysis methods to?
    • Inconsistent capitalization of “hallucination index”.
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    This paper presents a novel metric that evaluates an important problem. The manuscript is well-written and contains the essential information to understand the methodology. While initially impressed with the manuscript, I had several reservations about the work that I would like to see the authors respond to in their rebuttal. While I’d like to see more experiments in an expanded version of this manuscript (applied to more architectures, synthetic hallucinations, and the hallucination map), I do not believe these experiments are necessary for MICCAI publication due to the “Call for Papers” stating that novel methodologies can be limited to small-size validation studies.

  • Reviewer confidence

    Somewhat confident (2)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A



Review #3

  • Please describe the contribution of the paper

    This paper introduces a novel image quality metric for measuring hallucination errors in generative image reconstruction models. These hallucinations refer to the error in the estimation of object structure caused by a (generative) machine learning model. The metric is based on the Hellinger distance from the distribution of reconstructed images to a reference distribution, here chosen to be the ground truth images with noise, i.e. the forward diffusion process. Numerical experiments were performed on electron microscopy images with simulated noisy measurements.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
    • The insight that running more iterations reduces the mean squared error but at the same time increases the hallucination index is interesting and showcases the relevance of creating such a quality measure.
    • Creating the hallucination index by defining what hallucinations are not is elegant and intuitive.
    • The explanation of the method is very well-written and clear.
  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
    • The paper introduces the hallucination index, based on the Hellinger distance between a reference (zero-hallucination) distribution and the reconstructed images. The reference distribution is chosen to be the artifacts generated during the forward process, which consists of ground truth and noise. A weakness is the unclear contribution of choosing the specific distance and reference. Does only the Hellinger distance measure hallucinations, or can other probability distance measures be used? Similarly, what is the influence of the choice of the reference distribution? Does the main contribution include the selection of the distance function, reference distribution, or their specific combination? An ablation would help showcase this interplay.
    • In the abstract, the authors mention that hallucinations could cause diagnostic errors. A simple experiment showcasing the failure of models when exposed to (enough) hallucinations would help underline these statements. A reference showcasing this phenomenon would also suffice if available.
    • Although the authors perform some numerical experiments, they are only computed on a single dataset for electron microscopy images and, although compelling, raise the question of generalization to other medical domains.
    • The authors have used Fourier Diffusion models, which, due to their introduction for image-based reconstruction, make sense. However, further investigation of their methodology to more general generative models would make the work more impactful.
  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.

  • Do you have any additional comments regarding the paper’s reproducibility?
    • For reproducibility, it would be helpful if the generated samples and/or the source code would be released.
    • It would be helpful if a reference implementation of the hallucination index could be provided on a toy example.
  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    The most important suggestions to improve the paper are given together with the main weaknesses, here are some further minor ones.

    • Discussion of the interpretation of Fig. 3 could be stronger, especially for the noise power spectral density plots.
    • Motivation for the usage of image reconstruction models could be made stronger.
    • Minor comment: citations can be harmonized, e.g. either citing each work alone [1][2] or together [1,2].
  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
    • The paper is well-written, organized, and interesting to read.
    • The authors’ methodology seems novel, sound, and elegant.
    • Evaluation is sufficient, although not exhaustive, mostly due to the lack of detailed information about core components of the method and the lack of investigation across medical domains.
    • Generalization of methodology to other generative reconstruction methods is missing.
  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

N/A




Meta-Review

Meta-review not available, early accepted paper.


