Abstract

Medical anomaly detection (AD) is crucial in pathological identification and localization. Current methods typically rely on uncertainty estimation in deep ensembles to detect anomalies, assuming that ensemble learners should agree on normal samples while exhibiting disagreement on unseen anomalies in the output space. However, these methods may suffer from inadequate disagreement on anomalies or diminished agreement on normal samples. To tackle these issues, we propose D2UE, a Diversified Dual-space Uncertainty Estimation framework for medical anomaly detection. To effectively balance agreement and disagreement for anomaly detection, we propose Redundancy-Aware Repulsion (RAR), which uses a similarity kernel that remains invariant to both isotropic scaling and orthogonal transformations, explicitly promoting diversity in learners’ feature space. Moreover, to accentuate anomalous regions, we develop Dual-Space Uncertainty (DSU), which utilizes the ensemble’s uncertainty in input and output spaces. In input space, we first calculate gradients of reconstruction error with respect to input images. The gradients are then integrated with reconstruction outputs to estimate uncertainty for inputs, enabling effective anomaly discrimination even when output space disagreement is minimal. We conduct a comprehensive evaluation on five medical benchmarks with different backbones. Experimental results demonstrate the superiority of our method over state-of-the-art methods and the effectiveness of each component in our framework.
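
To make the two invariances named in the abstract concrete, the following is a minimal sketch using linear centered kernel alignment (CKA), a well-known feature similarity that is unchanged by isotropic scaling and orthogonal transformations. Using linear CKA here is an assumption for illustration only; the abstract does not state which kernel RAR uses, and all function names below are hypothetical.

```python
import math
import random


def center_columns(feats):
    """Subtract the per-feature mean so each feature column has zero mean."""
    n, d = len(feats), len(feats[0])
    means = [sum(row[j] for row in feats) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in feats]


def gram(feats):
    """Sample-by-sample Gram matrix of inner products."""
    return [[sum(a * b for a, b in zip(ri, rj)) for rj in feats] for ri in feats]


def frob_inner(a, b):
    """Frobenius inner product of two equally sized matrices."""
    return sum(x * y for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def linear_cka(feats_a, feats_b):
    """Linear CKA between two feature sets computed on the same samples."""
    ka = gram(center_columns(feats_a))
    kb = gram(center_columns(feats_b))
    return frob_inner(ka, kb) / math.sqrt(frob_inner(ka, ka) * frob_inner(kb, kb))


def rotate_and_scale(feats, angle, scale):
    """Apply a 2-D rotation (an orthogonal map) followed by isotropic scaling."""
    c, s = math.cos(angle), math.sin(angle)
    return [[scale * (c * x - s * y), scale * (s * x + c * y)] for x, y in feats]


random.seed(0)
# Hypothetical 2-D features of one learner on 32 samples.
feats = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(32)]
# A "redundant" learner: same features up to rotation and scaling.
transformed = rotate_and_scale(feats, angle=0.7, scale=3.0)
# An unrelated learner: independent random features.
unrelated = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(32)]

sim_same = linear_cka(feats, transformed)   # stays at 1 up to rounding
sim_diff = linear_cka(feats, unrelated)     # noticeably below 1
print(sim_same, sim_diff)
```

Under this assumed kernel, a learner that merely rotates or rescales another learner's features still counts as fully similar, so a repulsion term built on it would not be satisfied by such trivial differences.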

Links to Paper and Supplementary Materials

Main Paper (Open Access Version): https://papers.miccai.org/miccai-2024/paper/1356_paper.pdf

SharedIt Link: https://rdcu.be/dV5yp

SpringerLink (DOI): https://doi.org/10.1007/978-3-031-72089-5_49

Supplementary Material: https://papers.miccai.org/miccai-2024/supp/1356_supp.pdf

Link to the Code Repository

https://github.com/Rubiscol/D2UE

Link to the Dataset(s)

https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection

https://www.kaggle.com/datasets/masoudnickparvar/brain-tumor-mri-dataset

BibTex

@InProceedings{Gu_Revisiting_MICCAI2024,
        author = { Gu, Yi and Lin, Yi and Cheng, Kwang-Ting and Chen, Hao},
        title = { { Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection } },
        booktitle = {Proceedings of Medical Image Computing and Computer Assisted Intervention -- MICCAI 2024},
        year = {2024},
        publisher = {Springer Nature Switzerland},
        volume = {LNCS 15006},
        month = {October},
        pages = {520--530}
}

Reviews

Review #1

Please describe the contribution of the paper

This paper targets unsupervised anomaly detection in medical images using an ensemble uncertainty-based method. The proposed modules, D2UE and RAR, promote diversity among the ensemble learners and reconstruct training samples from repulsed feature spaces. Experiments are conducted on several popular medical image anomaly detection datasets.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. The paper is well-organized and well-written.
2. The motivation is clear. The idea is straightforward but under-explored in the literature, which should be encouraged.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. This paper only considers anomaly classification, whereas in most scenarios localizing the abnormal regions is very important for medical diagnosis. The authors are encouraged to use datasets with segmentation masks to evaluate anomaly localization performance.
2. It is unclear how the proposed method is used to detect anomalies during testing. A complete method should include a subsection showing how the model is used for inference.
3. The subsection on page 5, named orthogonal transformation invariance, is poorly written. It is hard to understand even after reading it several times. The example shown in the draft is not straightforward.
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

See weaknesses.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

See weaknesses.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #2

Please describe the contribution of the paper
This paper proposes a novel ensemble based approach for anomaly detection. This approach is summarized below:
- A unique loss function is proposed that encourages repulsion of individual learner vector spaces by using a similarity loss in addition to the reconstruction loss. Invariance properties are also introduced to prevent learners from falling victim to network redundancy, such as scaling or reordering senior learners’ feature vectors. This suggests that the learners in the ensemble will be distinct with this loss.
- To prevent the issue illustrated in Figure 2(c) (what appears to be agreement in output space but is not exactly), dual-space uncertainty is then determined by examining the first-order derivatives of the model. This allows a distinction between different models that agree in the output space by also looking at potential disagreement in input gradient space.
The results of this method are demonstrated on several benchmark datasets and ablation studies are also performed.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
I have outlined the main strengths of this paper below:
- This paper is well written and I appreciate the detailed breakdown of the methodology. The mathematical derivations are clear and easy to follow.
- Extensive ablation is shown by bringing in alternative benchmarks, various datasets and modalities, etc. The supplementary material is also useful to see that several elements of the approach were compared and evaluated - great work.
- The introduction nicely motivates the work, appears to cover the relevant state of the art, and leads into the contributions nicely.
- The content and contributions are very interesting and I think this work has the potential for significant positive impact in medical image computing.
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
I have outlined some of the weaknesses of the paper below:
- The multiple acronyms related to the work make the contribution a bit difficult to follow. Particularly at the beginning when they’re all introduced before they’re explained in detail.
- Model labels on feature spaces are too small for Fig 1
- The placement and details in Fig 1 and 2 are a bit confusing - these are excellent figures but Fig 2 (c) feels a bit out of place given where it comes up in discussion and I had to scroll back after reading the methodology to really understand Figure 1. I appreciate the limited space may have impacted this but I think the work may benefit from rearrangement.
- There is no distinction between experiments and results and you jump back and forth when presenting these - this makes this part of the work hard to follow.
- The ablation discussion on the choice of similarity metric is a bit confusing. This table is quite difficult to follow (Table 3). These details and discussions should be clarified.
- Based on the supplemental information, a lambda of 2 produces the highest AUC with the D2UE method, but a lambda of 1 was used in all the experiments. Is there a reason for this?
- Minor: In the caption of Fig 1 (supplemental) it is stated that “Left: The horizontal axis is λ. The experiment reveals that the performance enhancement resulting from RAR is not susceptible to variations in λ.” but there is some variation in AUC so this is confusing.
- The improvement over other benchmarks appears to be significant in some cases and not so much in other columns. Without standard deviations it’s difficult to capture the full benefit of this approach. I also feel that this is important to see in order to determine whether initialization weights impact the ensemble.
- Minor: AP is never defined
Please rate the clarity and organization of this paper

Satisfactory
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

The detailed methodology and ablation descriptions indicate to me that these results are reproducible.
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

Please see the highlighted weaknesses above for comments to address to improve the manuscript. Overall, I think this is very well written and very interesting and most of my concerns relate to the readability and organization of the paper.

However, I think that including standard deviations is important for understanding the impact of each learner’s random initialization. In my opinion, this must be considered to help drive home the significance of this contribution.
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Weak Accept — could be accepted, dependent on rebuttal (4)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

My overall recommendation is a weak accept because as I’ve outlined above, most of the weaknesses that I’ve pointed out are related to the organization of the content and delivery, rather than the contributions. I think this is a very interesting approach that, with these improvements, would be a good fit for the MICCAI proceedings.
Reviewer confidence

Somewhat confident (2)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Review #3

Please describe the contribution of the paper

The authors proposed a new anomaly detection model based on deep ensembles, increasing the diversity among ensemble members by imposing repulsion in feature space and integrating dual-space uncertainty to ensure anomalies are captured even if the learners’ outputs are similar. They validated their model on multiple benchmarks, followed by experiments on different components of their model.
Please list the main strengths of the paper; you should write about a novel formulation, an original way to use data, demonstration of clinical feasibility, a novel application, a particularly strong evaluation, or anything else that is a strong aspect of this work. Please provide details, for instance, if a method is novel, explain what aspect is novel and why this is interesting.
1. Extensive experiments on multiple different modalities
2. Ablation studies on the different components of the model
3. Increasing diversity through feature space and identifying anomalies using uncertainty beyond the output space
4. Easy reading and straightforward
Please list the main weaknesses of the paper. Please provide details, for instance, if you think a method is not novel, explain why and provide a reference to prior work.
1. The compared methods are rather old
2. The proposed method is somewhat similar to two ICLR 2023 papers [1, 2], which should at least be cited
3. Many repetitions throughout the paper with almost the same wording (e.g., page 3)
4. The number of learners (N) is important in ensembles; however, the authors didn’t mention how they chose this number or any limitations of this choice
[1] Agree to Disagree: Diversity through Disagreement for Better Transferability (https://openreview.net/pdf?id=K7CbYQbyYhY)
[2] Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement (https://openreview.net/pdf?id=RVTOp3MwT3n)
Please rate the clarity and organization of this paper

Very Good
Please comment on the reproducibility of the paper. Please be aware that providing code and data is a plus, but not a requirement for acceptance.

The authors claimed to release the source code and/or dataset upon acceptance of the submission.
Do you have any additional comments regarding the paper’s reproducibility?

N/A
Please provide detailed and constructive comments for the authors. Please also refer to our Reviewer’s guide on what makes a good review. Pay specific attention to the different assessment criteria for the different paper categories (MIC, CAI, Clinical Translation of Methodology, Health Equity): https://conferences.miccai.org/2024/en/REVIEWER-GUIDELINES.html

– It would be nicer if the authors added hyperlinked references, making the references easier to find

– The paper is easy to follow, but in some places it reads like ChatGPT-written text, and I think there is no issue in presenting a work in simple English

– Some of the baselines and works need to be cited properly (ex. MemAE should be cited the first time you referred to it)

– The metrics should be defined or at least cited, even though they are known in the literature

– Fig 1 is very small and hard to read
Rate the paper on a scale of 1-6, 6 being the strongest (6-4: accept; 3-1: reject). Please use the entire range of the distribution. Spreading the score helps create a distribution for decision-making

Accept — should be accepted, independent of rebuttal (5)
Please justify your recommendation. What were the major factors that led you to your overall score for this paper?

The role of ensemble diversity as a simple yet important solution to AD is not well explored. Besides, the experiments and achieved results support the claims of the paper.
Reviewer confidence

Very confident (4)
[Post rebuttal] After reading the author’s rebuttal, state your overall opinion of the paper if it has been changed

N/A
[Post rebuttal] Please justify your decision

N/A

Author Feedback

We appreciate all the reviewers’ valuable comments and acknowledgment of the contributions of our work. We will address the reviewers’ comments below.

R1Q1: The compared methods are rather old.
Our experiment introduces an ensemble framework based on reconstruction autoencoders. Therefore, to validate the effectiveness of our framework, we chose AE, MemAE, and AEU, which are classic and widely used backbones in the field of anomaly detection. Uncertainty-based anomaly detection is a relatively rare approach. We compared against all relevant approaches (Student-Teacher methods) published at CVPR from 2021 to 2023 and the uncertainty estimation approach from MICCAI 2022.

R1Q2: Cite related papers.
In the 4th paragraph of the Introduction, we cited the paper [1]:

“To address simplicity bias, previous methods attempted to induce repulsion among learners in either the output space [20] or weight space [8].”

Both [1] and [2] constrain the similarity in the learners’ output space, which leads to underfitting of individual learners. In contrast, our method places the similarity constraint in the learners’ feature space to avoid underfitting.

Additionally, our motivations for inducing diversity differ. Paper [1] aims to improve the model’s transferability, and paper [2] seeks to enhance the model’s out-of-distribution robustness. In contrast, our objective is to amplify the ensemble’s uncertainty on anomalies.

[1] Agree to Disagree: Diversity through Disagreement for Better Transferability
[2] Diversify and Disambiguate: Out-of-Distribution Robustness via Disagreement

R1Q3: Many repetitions.
We will carefully check and reduce repetitions within the same section.

R1Q4: Number of learners.
The framework’s performance improves as the number of learners increases. However, there exists a bottleneck in performance improvement. We chose 3 to balance performance and computational efficiency.

R2Q1: Segmentation task.
Due to the lack of pixel-level labeled medical anomaly detection datasets, we have not currently included a segmentation task. However, to demonstrate our method’s anomaly localization capability, we provide examples based on bounding boxes in Fig. 3.

R2Q2: Inference process.
In the subsection “Dual-Space Uncertainty,” we explain how to detect anomalies during inference. The sample is input to all learners, and the pixel-level anomaly score map is generated according to Eq. (6).
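
As a rough illustration of the inference step described above, the sketch below feeds one sample's reconstructions from several learners into a pixel-level score map. Eq. (6) is not reproduced in this text, so the scoring rule used here (per-pixel standard deviation across learners' reconstructions) is an assumption that covers only output-space disagreement; it omits the input-gradient term of DSU, and the function names are hypothetical.

```python
import math


def pixelwise_std(reconstructions):
    """Per-pixel standard deviation across the learners' reconstructions.

    `reconstructions` is a list (one entry per learner) of 2-D lists
    holding the reconstructed image of the same input sample.
    """
    n = len(reconstructions)
    height, width = len(reconstructions[0]), len(reconstructions[0][0])
    score_map = []
    for r in range(height):
        row = []
        for c in range(width):
            values = [rec[r][c] for rec in reconstructions]
            mean = sum(values) / n
            row.append(math.sqrt(sum((v - mean) ** 2 for v in values) / n))
        score_map.append(row)
    return score_map


def image_score(score_map):
    """Image-level score as the mean of the pixel-level map (an assumption)."""
    flat = [v for row in score_map for v in row]
    return sum(flat) / len(flat)


# Three hypothetical learners reconstructing a 2x2 image: they agree on
# every pixel except the bottom-right one.
recs = [
    [[0.5, 0.5], [0.5, 0.1]],
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.9]],
]
score_map = pixelwise_std(recs)
print(score_map, image_score(score_map))
```

In this toy case only the pixel where the learners disagree receives a non-zero score, which is the behaviour an ensemble-uncertainty map is meant to show on anomalous regions.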

R2Q3: Improve writing on Page 5.
We have revised the example explaining orthogonal transformation invariance to facilitate better understanding.

R3Q1: Acronyms are difficult to follow.
Acronyms in the Abstract will be elaborated further in the Introduction.

R3Q2: Improve Fig.1 and Fig.2.
We have enlarged the text in Fig.1 and included Fig. 2(C) in Fig. 1.

R3Q3: No distinction of experiments.
We will make the Experiment section more concise.

R3Q4: Table 3 is difficult to follow.
We have added an explanation in Table 3’s caption. A checkmark indicates the presence of the given mathematical property, while a cross indicates its absence.

R3Q5: Why lambda 1?
In our approach, we set lambda to 1 only for simplicity. We do not want to tune too many hyperparameters to validate our method’s effectiveness.

R3Q6: Explain Fig. 1 in the appendix.
Our model optimizes a multi-objective task during training. If λ tends to positive infinity, each learner only focuses on behaving differently and thus fails to learn normal samples. If λ tends to zero, the framework degrades to a randomly initialized ensemble. What we want to illustrate through Fig. 1 is that our method consistently outperforms the backbone across a wide range of λ values.
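
The trade-off described above can be written schematically as a weighted sum. The symbols below are assumptions chosen for illustration (a reconstruction term and a feature-similarity penalty, as summarized in Review #2); the paper's exact loss is not reproduced here.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{rec}} + \lambda \, \mathcal{L}_{\text{sim}},
  \qquad \lambda \ge 0
```

With λ close to zero only the reconstruction term remains, so learners differ only through their random initialization; as λ grows, the similarity penalty dominates and reconstruction of normal samples is neglected.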

R3Q7: Add standard deviation.
We will add more detailed evaluations to future work, including more ablation studies and statistical analysis.

R3Q8: AUC and AP are never defined.
We have added the definition of AUC and AP in the Experiments section.

Meta-Review

Meta-review not available, early accepted paper.


Revisiting Deep Ensemble Uncertainty for Enhanced Medical Anomaly Detection

Author(s):