Abstract

Diffusion models have advanced unsupervised anomaly detection by improving the transformation of pathological images into pseudo-healthy equivalents. Nonetheless, standard approaches may compromise critical information during pathology removal, leading to restorations that do not align with unaffected regions in the original scans. Such discrepancies can inadvertently increase false positive rates and reduce specificity, complicating radiological evaluations. This paper introduces Temporal Harmonization for Optimal Restoration (THOR), which refines the reverse diffusion process by integrating implicit guidance through intermediate masks. THOR aims to preserve the integrity of healthy tissue details in reconstructed images, ensuring fidelity to the original scan in areas unaffected by pathology. Comparative evaluations reveal that THOR surpasses existing diffusion-based methods in retaining detail and precision in image restoration and detecting and segmenting anomalies in brain MRIs and wrist X-rays. Code: https://github.com/compai-lab/2024-miccai-bercea-thor.git.

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1315_paper.pdf

SharedIt Link: pending

SpringerLink (DOI): pending

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1315_supp.pdf

Link to the Code Repository

https://github.com/compai-lab/2024-miccai-bercea-thor.git

Link to the Dataset(s)

https://brain-development.org/ixi-dataset/

https://atlas.grand-challenge.org

BibTex

@InProceedings{Ber_Diffusion_MICCAI2024,
        author = { Bercea, Cosmin I. and Wiestler, Benedikt and Rueckert, Daniel and Schnabel, Julia A.},
        title = { { Diffusion Models with Implicit Guidance for Medical Anomaly Detection } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15011},
        month = {October},
        page = {pending}
}


Reviews

Review #1

  • Please describe the contribution of the paper

    The authors introduce THOR, a diffusion model that guides the backward process in DDPMs. This is achieved by continuously substituting the content of presumably healthy anatomy with the original input image at fixed intervals. Intermediate deviations between the input and the reconstruction determine the regions of presumably healthy anatomy. The final anomaly score is computed as an ensemble over multiple intermediate reconstructions. The approach’s primary objective is to focus the reconstruction process on restoring only the abnormal regions, thereby aiming to reduce false positives caused by the lack of anatomical coherence and imperfect reconstructions.
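The mechanism described above can be sketched as follows. This is an illustrative reading of the review's description, not THOR's actual implementation: the function names, the scalar noise schedule, and the fixed deviation threshold are all assumptions, and `denoise_step` stands in for one reverse step of a trained DDPM.

```python
import numpy as np

def harmonized_restoration(x_orig, denoise_step, T=200,
                           key_intervals=(150, 100, 50), threshold=0.2):
    """Sketch of implicit guidance via intermediate anomaly masks.

    `denoise_step` is any callable (t, x) -> x_{t-1}; the schedule,
    intervals, and threshold are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(0)
    # Forward-noise the input to level T (simplified: fixed scalar noise).
    x = x_orig + 0.5 * rng.standard_normal(x_orig.shape)
    intermediates = []
    for t in range(T, 0, -1):
        x = denoise_step(t, x)
        if t in key_intervals:
            # Derive a presumably-healthy mask from the intermediate
            # deviation between reconstruction and original input.
            deviation = np.abs(x - x_orig)
            healthy = deviation < threshold
            # Substitute presumably healthy regions with original content,
            # keeping the restored content only where deviation is large.
            x = np.where(healthy, x_orig, x)
            intermediates.append(x.copy())
    # The final anomaly score would be ensembled over `intermediates`.
    return x, intermediates
```

With a toy denoiser such as `lambda t, x: 0.9 * x`, the loop produces one harmonized intermediate per key interval, which is the set the final ensemble score would be computed from.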

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    • The paper addresses a critical issue in reconstruction-based UAD, specifically enhancing coherence between input-reconstruction pairs.
    • The paper is well-written and provides a clear motivation for the proposed method.
    • The authors provide a thorough literature review, focusing on DDPM-based approaches to UAD and highlighting their processes and shortcomings.
    • The code has been made anonymously available, which promotes reproducibility.
    • The authors have included visually appealing figures that effectively explain their method and allow for a qualitative assessment.
    • The proposed approach is compared to relevant state-of-the-art models across different modalities, including MRI and X-ray.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    • The paper lacks ablation studies to underscore the impact of the proposed guidance. Specifically, elements such as the post-processing of temporal anomaly maps, the utilization of LPIPS for anomaly scoring, and the ensembling of different noise levels for anomaly scoring are operations that could considerably influence the results. It would be beneficial if these elements were individually examined and compared to understand their respective effects relative to the proposed guidance strategy in THOR.
    • There are major concerns regarding the selection, fusion, and division of the datasets. Specifically, the merging of training and test sets, coupled with the lack of validation sets, is viewed as a significant problem.
    • The paper does not provide statistics (mean/standard deviation) for the reported results and does not conduct significance tests to validate the improved performance statistically.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    While this paper addresses an important topic in the field of UAD and nicely presents an intriguing idea, several areas need clarification and exploration. The reviewer wants to clarify that when asking for additional experiments, these are suggestions meant to give potential directions for the future work of the authors and should not be provided in the rebuttal stage as this contradicts the guidelines of the MICCAI review process.

    1. Performance Improvement Analysis: The reviewer is interested in whether the demonstrated performance improvements are solely attributable to the harmonization steps proposed in THOR. From Figure 2, it is evident that the anomaly map of THOR significantly deviates from the DDPM’s anomaly map at T=N, despite identical reconstructions. This suggests that the LPIPS scoring has a significant impact on the anomaly scoring. This is also demonstrated in the paper [1] referred to by the authors, where utilizing LPIPS for anomaly scoring substantially enhances segmentation performance. Additionally, the ensembling of multiple reconstructions from different selected steps and the morphological operations and post-processing of the temporal and final anomaly score (“cd”) potentially affect the segmentation performance. However, these potential effects are not evaluated in the provided experiments, which hinders a clear conclusion regarding the source of the improved performance. The reviewer believes that the following experiments could significantly enhance the evidence for the proposed approach, enabling a more comprehensive assessment of its individual components:
       • THOR only with ensembling and no temporal harmonization
       • THOR with temporal harmonization but evaluated at individual noise levels, without aggregating the results of multiple noise levels
       • THOR without the post-processing (no “cd”)
       • THOR without LPIPS
    2. Dataset Information and Experimental Design: The experimental design for the datasets raises several concerns and questions.
       • The full ATLAS dataset of 655 samples is used for testing, while 217 samples from the same dataset have been included in the training data. This overlap could lead to train/test leakage, which is a serious flaw in the evaluation. Could the authors provide their insights on this issue?
       • Why were some healthy samples from the ATLAS dataset combined with the healthy samples from the IXI dataset for training, even though only the IXI dataset could have been used?
       • The authors use 217 healthy samples from the ATLAS dataset. However, to the best of my knowledge, the ATLAS dataset does not contain fully healthy MRI volumes. Did the authors use healthy slices of the 3D volumes to derive these healthy samples? What do the authors mean by “samples”?
       • How is the 3D data processed in general? It appears that the volumes are processed slice-wise. Could the authors explain how the slices are selected during training and which slices are considered during the evaluation?
       • Since no validation set is mentioned, how did the authors perform hyperparameter tuning?
       • For the GRAZPEDWRI-DX dataset, no information is provided regarding the partitioning into train/evaluation sets. Could the authors provide this information?
    3. Temporal Anomaly Masks: It would be helpful to include the intermediate temporal anomaly masks in Figure 2. Additionally, the effect of the applied post-processing steps on the temporal anomaly maps is unclear. More information on how sensitive the method is, e.g., to the kernel size of the morphological operations and how the method performs without the post-processing steps, would be insightful.
    4. Harmonization Steps: The manuscript does not clearly state the number of harmonization steps used and at which key intervals the harmonization is performed. It would further be helpful to understand how sensitive the method is to the chosen number of steps and key intervals.
    5. Anomaly Score Calculation: Could the authors clarify what the “selected steps” are in the calculation of the anomaly score S? Also, is the harmonization performed before or after calculating the anomaly score?
    6. Statement Regarding Simplex Noise: The reviewer believes the statement regarding simplex noise should be revised. While the mentioned “self-supervision effect” is possible, there is no clear evidence provided in the paper. In fact, the performance of THOR with simplex noise consistently surpasses that of THOR with Gaussian noise across all pathology sizes in the ATLAS dataset. However, when evaluated on X-rays, the performance of simplex noise significantly decreases. The paper does not provide evidence that this decrease in performance is due to specific pathology sizes or shapes. Another plausible explanation could be the different image characteristics of MRI and X-ray. Notably, simplex noise is specifically designed for MRI [2]. Therefore, these assertions should be revisited and revised in light of the provided evidence.
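The ensembling the reviewer asks about in the anomaly-score calculation could, for instance, aggregate per-level anomaly maps with a harmonic mean (the aggregation the meta-review mentions). The sketch below is an illustrative assumption about that step, not THOR's actual scoring code:

```python
import numpy as np

def ensemble_anomaly_score(anomaly_maps, eps=1e-8):
    """Harmonic-mean aggregation of anomaly maps from several noise levels.

    The harmonic mean is dominated by the smallest value per pixel, so a
    pixel scores as anomalous only if it deviates consistently across all
    noise levels, suppressing single-level false positives. Illustrative
    sketch only; `eps` avoids division by zero.
    """
    maps = np.stack(anomaly_maps, axis=0)  # shape: (n_levels, H, W)
    return maps.shape[0] / np.sum(1.0 / (maps + eps), axis=0)
```

Under this aggregation, a pixel flagged at one noise level but not at another receives a near-zero score, which is exactly why ablating the ensembling separately from the harmonization (as the reviewer requests) would be informative.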

    [1] Bercea, C., Wiestler, B., Rueckert, D., Schnabel, J.: Generalizing unsupervised anomaly detection: Towards unbiased pathology screening. Medical Imaging with Deep Learning (MIDL), 2023.
    [2] Wyatt, J., Leach, A., Schmon, S.M., Willcocks, C.G.: AnoDDPM: Anomaly detection with denoising diffusion probabilistic models using simplex noise. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 650–656, 2022.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    Overall, while the paper presents an interesting idea, the reviewer perceives a lack of comprehensive examination of the individual components within the proposed approach. Additionally, the reviewer has concerns regarding the experimental design related to the datasets and perceives potential shortcomings in the evaluation process.

  • Reviewer confidence

    Very confident (4)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Reject — could be rejected, dependent on rebuttal (3)

  • [Post rebuttal] Please justify your decision

    The reviewer still has concerns regarding the proposed work:

    1. The hyperparameter selection process remains unclear, especially considering that no validation set is used, for the baselines as well.
    2. While the authors assert that the improvements primarily stem from their method, they only provide qualitative evidence from individual samples. The reviewer still believes that ablating the anomaly scoring (LPIPs) and post-processing steps would be essential to fully evaluate the capabilities of the proposed method. This is particularly relevant since these steps can also be implemented in other baseline methods and are therefore not exclusive to the proposed method.
    3. The authors underscore that domain shifts affect the performance. The reviewer believes that these domain shifts represent a realistic scenario and should be taken into account and represented in the evaluation. The reviewer also believes that incorporating data from the test domain into the training distribution, as done in the proposed work, could result in a skewed evaluation.
    4. The authors’ definition of “sample” remains unanswered during the rebuttal. According to the provided code, only one fixed slice per volume is considered a sample, which raises questions about the selection process for the slice. Moreover, evaluating only one 2D slice per volume limits the evaluation, given the complexity of volumetric brain MRIs.

    Based on these concerns, the reviewer maintains the initial score of 3 - weak reject.



Review #2

  • Please describe the contribution of the paper

    This paper presents Temporal Harmonization for Optimal Restoration (THOR) to restore medical images with detail and precision and to detect/segment anomalies across two modalities and tasks. THOR implicitly applies unsupervised temporal anomaly masks at key intervals to guide the diffusion process.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    THOR adapts the diffusion model and provides implicit guidance using a specifically designed masking strategy, so that the reconstruction closely resembles the original image while remaining close to a healthy tissue profile.

    THOR has evaluated the results on completely different modalities and tasks (brain MRI and wrist X-rays).

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The description of the implicit guidance with intermediate anomaly maps is clear, but the paper lacks justification of its effectiveness.

    The performance on the downstream tasks does not seem satisfactory, especially the low Dice coefficients for small pathologies in Tables 1 and 2.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The submission has provided an anonymized link to the source code, dataset, or any other dependencies.

  • Do you have any additional comments regarding the paper’s reproducibility?

    Authors released the anonymized link to the code https://anonymous.4open.science/r/THOR-1315.

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Can the authors justify the effectiveness of the method (implicit guidance), or was it purely based on the improvement of the results?

    Can the authors comment on the low Dice coefficient in Table 1, especially for small pathologies? In addition, the scale of the Dice coefficient should be 0 to 1.

    How were the experiments in Section 4.2 (Table 2) performed? Did the authors treat it as multi-label or binary classification? The recall for soft tissue conditions seems very low; can the authors explain why? Can the authors confirm that the recall for FB is 75.00 across all methods?

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    A novel implicit guidance using intermediate anomaly maps in diffusion model training. The method lacks justification and the evaluations need to be further explained.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    Weak Accept — could be accepted, dependent on rebuttal (4)

  • [Post rebuttal] Please justify your decision

    I am satisfied with the replies from authors in rebuttal.



Review #3

  • Please describe the contribution of the paper

    This paper introduces THOR, a novel diffusion-model-based anomaly detection framework that incorporates unsupervised anatomical anomaly detection to increase the fidelity of normative samples. Their experiments demonstrate that the approach outperforms existing DDPM methods on this task.

  • Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.

    This paper incorporates very domain-specific information to directly alter the sampling trajectory of diffusion models. This is a particular strength, as many diffusion-model-based applications in medical imaging do not tailor elements of the inference process to the problem at hand. The paper also has convincing qualitative and quantitative results demonstrating that it outperforms existing DDPM approaches. The authors also demonstrate this to be the case in both brain images and X-ray images, which shows its generalisability.

  • Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.

    The paper does not include DDIM based anomaly detection methods, which I consider to be a key baseline. The background section on diffusion models skips over many important details. There is a slight lack of clarity over some of the implementation details of the model.

  • Please rate the clarity and organization of this paper

    Very Good

  • Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

    The authors claimed to release the source code and/or dataset upon acceptance of the submission.

  • Do you have any additional comments regarding the paper’s reproducibility?

    N/A

  • Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

    Overall a very good paper. As already mentioned I particularly appreciate the adaptation of the reverse process to the specific domain. I think it can inspire other researchers to do similar things for other elements of medical imaging.

    I would personally change the background section on diffusion models. I understand that the authors have limited space, but the forward process is presented slightly incorrectly. The forward process should be presented as an approximate posterior q(x_t | x_0) = N(x_t ; \sqrt{\alpha_t} x_0, (1 - \alpha_t) I); it is through reparameterisation that we may express this as x_t = \sqrt{\alpha_t} x_0 + \sqrt{1 - \alpha_t} \epsilon. Similarly, the reverse process should be written to reflect the stochastic nature of DDPM, because equation 2 is written as a deterministic process. It should be eq. 2 + \sigma z, where z ~ N(0, I) and \sigma is the predefined variance. See the original DDPM paper for more clarity: https://arxiv.org/abs/2006.11239. However, this really doesn’t take away from the paper and is just a point on the formalisms of the background.
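In standard DDPM notation (Ho et al., 2020), with \bar\alpha_t denoting the cumulative product of the noise schedule, the formulation the reviewer refers to reads:

```latex
% Forward process as an approximate posterior:
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\, x_0,\ (1-\bar\alpha_t)\, I\right)
% Equivalent reparameterisation:
x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon,
  \qquad \epsilon \sim \mathcal{N}(0, I)
% Stochastic reverse step (the reviewer's point about eq. 2):
x_{t-1} = \mu_\theta(x_t, t) + \sigma_t\, z,
  \qquad z \sim \mathcal{N}(0, I)
```

Note the two distinct coefficients in the reparameterisation: \sqrt{\bar\alpha_t} scales the clean image and \sqrt{1-\bar\alpha_t} scales the injected Gaussian noise.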

    I think the lack of a DDIM based baseline is important as this is a key and ubiquitous sampling regime so having it in the paper would bolster the claims being made.

  • Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

    Accept — should be accepted, independent of rebuttal (5)

  • Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

    I think the paper can inspire more researchers using generative modelling for medical imaging to tailor their inference processes to the specific domain they are dealing with. This is a major contribution. The paper also has very strong qualitative and quantitative results.

  • Reviewer confidence

    Confident but not absolutely certain (3)

  • [Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

    N/A

  • [Post rebuttal] Please justify your decision

    N/A




Author Feedback

We thank reviewers R3, R4, R5 for their constructive comments and positive assessment: THOR, our novel diffusion-based method for unsupervised anomaly detection (UAD) in medical imaging “addresses a critical issue” (R4) in reconstruction-based UAD by enhancing coherence between input-reconstruction pairs. R3, R4, R5 appreciated our “very strong evaluation” across “different modalities and tasks,” including brain MRI and wrist X-rays, with “effective and convincing results”, “compared to relevant state-of-the-art models”. Additionally, R5 recognized that our work “can inspire more researchers using generative modeling for medical imaging to tailor their inference processes to the specific domains”.

Main points raised:

  • R4 asked about dataset splits, i.e., potential train/test leakage. We strictly prevent train/test leakage by using healthy slices from one group of patients for training and slices containing pathology from different patients for evaluation. Details are provided in our published codebase, which we will clarify further in the revised manuscript.

  • R3 asked about the motivation of our work. As also highlighted by R4, our main motivation is to maintain fidelity in healthy tissues during the synthesis process while effectively replacing pathological tissues with pseudo-healthy alternatives, as can be seen in Figs. 1, 2, 3, 5, and Supplementary Figs. 1,2. We will emphasize this motivation more clearly in the revised manuscript.

  • R3 asked about the performance on small lesions. Measuring Dice scores (shown in %) for small lesions is challenging due to the impact of small displacements and false positives. Although our method shows substantial improvements for small lesions (103% with Gaussian noise, 44% with Simplex noise), we concur that the detection of small lesions is an important task that needs further research attention.

  • R4 asked about ablation studies: We conducted ablation studies on noise levels and noise type, demonstrating robustness to higher noise levels (Fig. 4), which is essential for effectively removing pathologies. Due to space constraints, additional details on sampling intervals and hyperparameters are available in our codebase. We did not optimize these for specific datasets to maintain generalizability across diverse setups. While LPIPS anomaly maps are also used in other SOTA methods like AutoDDPM, our improvements primarily stem from the novel integration of implicit temporal anomaly maps, which guide the diffusion process more effectively (see Figs.). Other components like the harmonization frequency, anomaly map computation, and dilation operations ensure no anomalies are restored during synthesis but have a less substantial impact. As suggested by R4, we will consider these ablation studies in future work.
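The dilation operation mentioned above is a standard morphological step; a numpy-only sketch of the kind of mask dilation involved (the 3x3 kernel and iteration count here are illustrative assumptions, not the paper's actual settings) could look like:

```python
import numpy as np

def dilate(mask, iterations=1):
    """Binary dilation with a 3x3 square structuring element.

    Expanding the anomaly mask slightly helps ensure that pathological
    regions are not partially restored during synthesis; kernel size and
    iteration count are hypothetical choices for illustration.
    """
    m = mask.astype(bool)
    h, w = m.shape
    for _ in range(iterations):
        padded = np.pad(m, 1, mode="constant", constant_values=False)
        # A pixel is set if any pixel in its 3x3 neighbourhood was set.
        shifted = [padded[i:i + h, j:j + w]
                   for i in range(3) for j in range(3)]
        m = np.logical_or.reduce(shifted)
    return m
```

As the reviewer notes, sensitivity to the kernel size of such operations is exactly the kind of parameter an ablation would quantify.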

Minor Points

  • R5: While we cannot add new experiments in the rebuttal, we recognize the importance of including a DDIM baseline and will consider it for future work. We believe DDIM would perform similarly to DDPM with faster inference.

  • R4: Domain shifts from training on different datasets are common and impact UAD performance. We included target domain images in training to mitigate these.

  • R4: We appreciate the contribution of Simplex noise and agree that the hypothesis about its self-supervision effect needs more validation. We will revisit this claim. However, our experiments highlight specific limitations, and acknowledging these will help the community to progress.

  • R3/R4: UAD supports only binary classification, without detailing anomaly types. We use normal X-rays for training/validation and pathological ones for AD evaluation. All methods detect 3/4 foreign body anomalies. Soft tissue anomalies are subtle and challenging to detect on X-rays.

We sincerely thank the reviewers for their insightful feedback which we will incorporate in our final manuscript where possible.




Meta-Review

Meta-review #1

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    R3, R4, and R5 all agreed that the main premise of the paper, i.e., examining Unsupervised Anomaly Detection with implicit guidance using DDPM, is quite interesting. However, I agree with most of the points raised by R4 and would therefore like to ask that the unaddressed weaknesses of the paper be addressed. For example, the effects of morphological post-processing “cd,” LPIPS, and the final anomaly score via harmonic mean, as well as the four key experiments suggested by R4 (i.e., THOR only …, THOR with …, etc.).

    Similarly, I would also request that the authors consider points 3 to 5 raised by R4 regarding specific comments on the Temporal Anomaly Masks, Harmonization Steps, and Anomaly Score Calculation. I believe some of these points must have been examined by the authors prior to submission, and including these in the final revision of the paper should come at no extra cost. If not, the amount of work required to include these results addressing these points is not too significant and should be feasible prior to the camera-ready version.




Meta-review #2

  • After you have reviewed the rebuttal and updated reviews, please provide your recommendation based on all reviews and the authors’ rebuttal.

    Accept

  • Please justify your recommendation. You may optionally write justifications for ‘accepts’, but are expected to write a justification for ‘rejects’

    The decision was split: Accept (A), Weak Accept (WA), and Weak Reject (WR). All reviewers highlighted the interesting method, clear motivation, and well-organized writing as strengths. Reviewer 4 expressed concerns mainly about the experimental parts: the lack of an ablation study, the lack of validation data, and the hyperparameter selection process. The rebuttal addressed some of these issues; however, concerns about the experimental parts still remain. Since this paper may inspire related researchers in MICCAI even though the experiments are not fully sufficient, the meta-reviewer recommends accepting this paper if there is some space.



