Abstract
Diffusion models, originally introduced for image generation, have recently gained attention as a promising image denoising approach. In this work, we perform comprehensive experiments to investigate the challenges posed by diffusion models when applied to medical image denoising. In medical imaging, retaining the original image content, and refraining from adding or removing potentially pathologic details is of utmost importance. Through empirical analysis and discussions, we highlight the trade-off between image perception and distortion in the context of diffusion-based denoising.
In particular, we demonstrate that standard diffusion model sampling schemes yield a reduction in PSNR by up to 14 % compared to one-step denoising. Additionally, we provide visual evidence indicating that diffusion models, in combination with stochastic sampling, have a tendency to generate synthetic structures during the denoising process, consequently compromising the clinical validity of the denoised images. Our thorough investigation raises questions about the suitability of diffusion models for medical image denoising, underscoring potential limitations that warrant careful consideration for future applications.
Links to Paper and Supplementary Materials
Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/2146_paper.pdf
SharedIt Link: https://rdcu.be/dV55e
SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72117-5_53
Supplementary Material: N/A
Link to the Code Repository
N/A
Link to the Dataset(s)
N/A
BibTex
@InProceedings{Pfa_NoNewDenoiser_MICCAI2024,
author = { Pfaff, Laura and Wagner, Fabian and Vysotskaya, Nastassia and Thies, Mareike and Maul, Noah and Mei, Siyuan and Wuerfl, Tobias and Maier, Andreas},
title = { { No-New-Denoiser: A Critical Analysis of Diffusion Models for Medical Image Denoising } },
booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
year = {2024},
publisher = {Springer Nature Switzerland},
volume = {LNCS 15010},
month = {October},
pages = {568 -- 578}
}
Reviews
Review #1
- Please describe the contribution of the paper
The manuscript performs quantitative and qualitative evaluations of diffusion models used exclusively for denoising via reverse diffusion processes. The main finding is that reverse diffusion is net inferior to single-step denoising.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
The paper is well-written and timely, given the prevalence of diffusion models. The community definitely needs studies like this one that call into question some of the claimed benefits.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The methodology is lacking: a non-standard diffusion training framework was used alongside small datasets, which brings into question the generality of the claims.
- Please rate the clarity and organization of this paper
Excellent
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
Although the authors do not mention any intent to release code, reproducibility is good, given that they use the well-known and publicly available DDM2 method and public datasets.
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
- The entire study builds on the DDM2 architecture and training methodology. This is significantly different from “standard” diffusion model training (which assumes noiseless training data), to the point that there are no theoretical guarantees or justifications for the loss function used, nor are such claims made in the original DDM2 paper. This makes the title of the paper and all conclusions somewhat misleading. It would have been much clearer to state DDM2 from the beginning.
- For example, carefully studying the training loss in [DDM2, Section 3.4] reveals that this model is not trained with a regular diffusion objective, but rather uses already-noisy data (“x”) as a target. While this is clearly helpful for self-supervised denoising, it makes the learning objective different from denoising score matching, and it is unclear whether all the theory presented about DDPM and reverse diffusion applies here, since reverse sampling relies on a score function oracle.
- It is not exactly clear how “results for the model prediction after the first iteration” were obtained. It is well known via Tweedie’s formula that MMSE denoising can be achieved via a single weighted gradient step if the exact i.i.d. Gaussian noise level is known (see, e.g., Efron, Bradley. “Tweedie’s formula and selection bias.” Journal of the American Statistical Association 106.496 (2011): 1602-1614, Equation 1.5). If this is the case, it is worth clearly stating it.
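For reference, the formula the reviewer cites is a standard result (stated here in generic notation, not taken from the paper under review): for a noisy observation $y = x + n$ with i.i.d. Gaussian noise $n \sim \mathcal{N}(0, \sigma^2 I)$, the MMSE denoiser is a single weighted gradient step on the log marginal density of the observation:

```latex
% Tweedie's formula: MMSE denoising as one gradient step
% on the marginal density p(y) of the noisy observation.
\[
  \hat{x}_{\mathrm{MMSE}}
  \;=\; \mathbb{E}[x \mid y]
  \;=\; y + \sigma^2 \, \nabla_y \log p(y).
\]
```

With a learned score model $s_\theta(y) \approx \nabla_y \log p(y)$, this is exactly a one-step denoiser, which is why the reviewer asks whether the paper’s “first iteration” prediction coincides with it.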
- Given that the authors used DDM2, the datasets are small-scale in terms of the number of patients (despite being 3D), and there is no investigation of generalization error; as with any other regression model, the validation loss could simply be evaluated and presented for the diffusion model used. As it stands, it is not clear whether the conclusions are due to overfitting of the model.
- This is worth checking because the vast majority of diffusion models for medical imaging are trained on (much) larger amounts of data (e.g., the fastMRI dataset, on the order of 10k patients). It is not clear whether this analysis would hold in that regime as well.
- Beyond MRI image denoising, there is a (perhaps even larger) line of work that uses diffusion models for MRI reconstruction and is clinically relevant. It would improve the message of the manuscript if the authors also addressed this line of work, what alternatives could be explored there (e.g., supervised reconstruction?), and whether their reasoning for why reverse diffusion would be worse still holds there.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Reject — could be rejected, dependent on rebuttal (3)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Overall, this is a very well-intended and executed paper, but it would have been more accurate to lessen the generality of the claims and findings and for the title to say “DDM2” instead of “Diffusion”, indicating that the study holds only for a specific sub-type of diffusion models, which use a different learning objective and a more complicated data pipeline.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Accept — should be accepted, independent of rebuttal (5)
- [Post rebuttal] Please justify your decision
I thank the authors for addressing my central concern regarding DDM2 by clarifying that it only refers to the model architecture, not the entire pipeline. This is a valuable study with negative conclusions (rare these days), so I have raised my score from 3 to 5 (accept).
Review #2
- Please describe the contribution of the paper
The paper evaluates diffusion models for MRI image denoising on two datasets.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A state-of-the-art diffusion-model strategy for self-supervised MRI image denoising is tested.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1) It is unclear how the simulated MRI noise is generated in the paper. The authors should detail this part. It seems that only one kind of simulated noise distribution is evaluated; different simulated noise distributions could be tested for a comprehensive study.
2) As an application paper, this work probably needs to be tested on more datasets. The authors should test as diverse settings as possible for a comprehensive study, including different simulated noise distributions, different diffusion models, and different datasets (possibly also different imaging modalities).
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not provide sufficient information for reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
N/A
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
It may be helpful to also have a more comprehensive review on diffusion model methods for general inverse problems.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Weak Reject — could be rejected, dependent on rebuttal (3)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
The diversity of tested datasets and imaging modalities is quite limited. For the simulated-noise tests, the noise simulation settings should be detailed.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
Reject — should be rejected, independent of rebuttal (2)
- [Post rebuttal] Please justify your decision
The authors’ statement on the relationship between MRI’s noise distribution and the Gaussian distribution is not correct. I would recommend that the authors refer to the following paper: Aja-Fernández, S. and Vegas-Sánchez-Ferrero, G., 2016. Statistical analysis of noise in MRI. Switzerland: Springer International Publishing.
Review #3
- Please describe the contribution of the paper
This paper is the first to produce a systematic assessment of diffusion models as a denoiser of magnetic resonance images. The authors demonstrate that whilst certain noise schedules improve robustness, in general the iterative nature of diffusion models reduces PSNR and leads to the generation of synthetic structures. Importantly, they demonstrate that a single denoising step with diffusion models outperforms several steps of denoising in terms of PSNR.
- Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
A key strength of the paper is that it questions the utility of diffusion models for MR denoising systematically. This is evidenced by the variety of noising schedules that were employed (linear, constant, and uniform noise), as well as the numerous sampling regimes (stochastic, deterministic, with and without regularization) that were used. They also varied whether the noise was predicted directly or whether the image was. This is a strength, as it gives diffusion models a fair chance to perform well, given that many forms of implementing them are tested out. A second strength is the novel finding that a single step of denoising performs better than several steps. This is a practically useful result for individuals who want to use these approaches.
- Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
The main weakness of this paper is the small sample size. The real-world dataset had only 19 patients for training, which is quite small given that diffusion models are generally very data hungry.
Further, it is unclear to me exactly how many brain images appeared in the synthetic dataset. The data subsection of Section 3 (Experiments) says the BrainWeb 20 dataset comprised “20 anatomical brain models, split into twelve models for training and four for validation and testing.” But Figure 2 says that 50 test images were used.
- Please rate the clarity and organization of this paper
Good
- Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.
The submission does not mention open access to source code or data but provides a clear and detailed description of the algorithm to ensure reproducibility.
- Do you have any additional comments regarding the paper’s reproducibility?
It seems as though the models are used to generate several brains, but could the authors please clarify the exact number of synthetic brains used in the training, validation, and test sets, and include this in the paper?
- Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html
On the point of a small sample size, the authors could have opted for using a pre-trained MR model and then fine-tuning it for denoising on the data that they had. Concerning the synthetic data, it seems as though the BrainWeb20 anatomical models are used to generate several brains, but could the authors please clarify the exact number of synthetic brains used in the training, validation, and test sets, and include this in the paper?
Also, I would advise adding “(SURE)” in brackets next to ID 11 in Tables 1, 2, and 3, just to make it very clear that it is the non-diffusion regime.
Below are some comments on the clarity of specific sentences:
Page 3 penultimate paragraph, the last sentence is: “When training the model to predict the clean image x0 rather than the noise ϵt, ϵ_{θ}(t) in Equations 5 and 6 can be derived from Equation 3.” This should be rephrased to make it more clear.
Page 3 last paragraph first sentence: “In traditional diffusion models, the generative process originates from a random sample derived from a Gaussian noise distribution, progressively removing noise to synthesize new images.” The last part of the sentence should be rephrased to something along the lines of: ‘… from a Gaussian noise distribution, from which noise is progressively removed to synthesize new images’. As it currently reads like the Gaussian removes noise from itself.
- Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making
Accept — should be accepted, independent of rebuttal (5)
- Please justify your recommendation. What were the major factors that led you to your overall score for this paper?
Despite the small sample size, and a slight lack of clarity in areas, the fundamental idea is well motivated. Diffusion models are gaining a lot of traction, but it is important to know where/how they should and should not be used. In particular, the insight that a single denoising step performs better than several denoising steps is novel and practically useful for individuals using these types of models.
Further, it is possible that the small sample size reflects the sizes of data used in real-world applications, but hopefully other researchers will validate these results on bigger datasets.
- Reviewer confidence
Confident but not absolutely certain (3)
- [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed
N/A
- [Post rebuttal] Please justify your decision
N/A
Author Feedback
We thank the reviewers for their thoughtful feedback. Our research explores the constraints of diffusion models in medical image denoising, revealing degradation in denoising quality and the emergence of hallucinated image features. We are encouraged by R1’s positive remarks, noting our manuscript as “timely and well-written,” and their recognition of the community’s need for studies like ours. We also appreciate R4’s acknowledgment of our thorough evaluation, which tested various noise schedules, sampling regimes, and network outputs, ensuring that the diffusion model was given a “fair chance to perform well.”
Focus on DDM2 (R1) R1 claims that a non-standard diffusion training framework (DDM2) was used in our experiments, which makes our conclusions misleading. We would like to clarify that we exclusively used the diffusion model network architecture provided by the DDM2 framework, without employing its corresponding three-stage self-supervised training scheme. In contrast, we utilized both simulated data and previously denoised real MR images to enable training with the standard DDPM/DDIM training scheme. This highlights that the diffusion models in our experiments fail despite being trained and tested under ideal conditions. We apologize for any confusion and will clarify this point in the final version of our paper.
Concern of limited data (R1, R3, R4) R1, R3, and R4 raised concerns about the limited sample size in our experiments. It is crucial for us to use noise-free data to eliminate any influence of noise-corrupted ground truth or other confounding factors. Unfortunately, there is a lack of large noise-free medical imaging datasets. While we replicated our experiments on the ImageNet dataset and observed similar outcomes, we opted not to include these results due to their lack of relevance for medical applications. Our work offers a practical proof of concept, reflecting typical dataset sizes in real-world medical applications, as emphasized by R4. We appreciate R4’s suggestion to employ a pre-trained model and aim to expand our analysis to more relevant datasets and advanced training schemes in future studies.
Noise simulation (R3) Reviewer 3 requested clarification on the noise simulation in our simulated MRI. In our experiments, we opted for Gaussian noise for two primary reasons: Firstly, the noise in (complex-valued) MRI can be accurately modeled as a Gaussian distribution, reflecting a realistic setting. Secondly, Gaussian noise aligns well with the standard diffusion sampling scheme, representing an ideal setting for the method to perform effectively. Using different noise distributions would necessitate altering the diffusion model training scheme. For our initial proof of concept, we maintained consistency with Gaussian noise.
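The noise simulation described above can be sketched as follows. This is a minimal illustration of the standard additive complex Gaussian noise model for complex-valued MRI, not the authors’ actual code; the function name, array shapes, and the noise level sigma are illustrative assumptions:

```python
import numpy as np

def add_complex_gaussian_noise(image, sigma, rng=None):
    """Add i.i.d. complex Gaussian noise to a complex-valued MR image.

    Real and imaginary parts each receive zero-mean Gaussian noise with
    standard deviation sigma, matching the standard noise model for
    complex-valued MRI data referenced in the rebuttal.
    """
    rng = np.random.default_rng() if rng is None else rng
    noise = (rng.normal(0.0, sigma, image.shape)
             + 1j * rng.normal(0.0, sigma, image.shape))
    return image.astype(np.complex128) + noise

# Example: a toy 64x64 constant "image" corrupted at sigma = 0.05
clean = np.ones((64, 64), dtype=np.complex128)
noisy = add_complex_gaussian_noise(clean, sigma=0.05,
                                   rng=np.random.default_rng(0))
```

Because each channel of the complex noise is Gaussian, this setting matches the Gaussian forward process assumed by DDPM/DDIM training, which is the “ideal setting” the rebuttal refers to.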
Model prediction after first iteration (R1) R1 inquired about the “model prediction after the first iteration.” This refers to the initial model prediction for x_0 after the first sampling iteration during inference. Typically, diffusion models perform N sampling steps during inference to arrive at the final model prediction. We will clarify this in the manuscript.
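For clarity, the one-step estimate referred to here is presumably the standard DDPM expression for the predicted clean image given the current noisy sample (standard DDPM notation, with $\bar{\alpha}_t = \prod_{s \le t} \alpha_s$; this is a generic restatement, not a formula quoted from the paper):

```latex
% One-step prediction of the clean image x_0 from the noisy sample x_t,
% using the noise estimate eps_theta(x_t, t) of the trained network.
\[
  \hat{x}_0
  \;=\;
  \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_\theta(x_t, t)}
       {\sqrt{\bar{\alpha}_t}}.
\]
```

Full diffusion sampling instead iterates N such steps, re-noising the intermediate estimate at each step, which is the procedure the paper compares against.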
Investigation of generalization error (R1) R1 raises the important point of investigating generalization error. We have already addressed this in our experiments by closely monitoring the validation loss and implementing a corresponding stopping criterion during training to prevent overfitting.
Synthetic data details (R4) Reviewer 4 highlighted the need for clarification regarding the number of samples used for training, validation, and testing. We will include this information in the final version of the paper: we generated a total of 720 slices for training and 240 each for validation and testing. For Figure 2 we used a random subset of 50 slices from the test set.
We appreciate the reviewers’ additional suggestions and will incorporate them into the final version of the manuscript.
Meta-Review
Meta-review #1
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
N/A
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A
Meta-review #2
- After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.
Accept
- Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’
The authors well addressed the issues raised by the reviewers during rebuttal.
- What is the rank of this paper among all your rebuttal papers? Use a number between 1/n (best paper in your stack) and n/n (worst paper in your stack of n papers). If this paper is among the bottom 30% of your stack, feel free to use NR (not ranked).
N/A